Nucleic Acids Res. 2013 January; 41(Database issue): D758–D763.
Published online 2012 November 7. doi:  10.1093/nar/gks1057
PMCID: PMC3531138

MonarchBase: the monarch butterfly genome database

Abstract

The monarch butterfly (Danaus plexippus) is emerging as a model organism to study the mechanisms of circadian clocks and animal navigation, and the genetic underpinnings of long-distance migration. The initial assembly of the monarch genome was released in 2011, and the biological interpretation of the genome focused on the butterfly’s migration biology. To make the extensive data associated with the genome accessible to the general biological and lepidopteran communities, we established MonarchBase (available at http://monarchbase.umassmed.edu). The database is an open-access, web-available portal that integrates all available data associated with the monarch butterfly genome. Moreover, MonarchBase provides access to an updated version of the genome assembly (v3), upon which all data integration is based. The integrated data include genes with systematic annotation, as well as other molecular resources, such as brain expressed sequence tags, migration expression profiles and microRNAs. MonarchBase offers a variety of retrieval methods for accessing data conveniently and for integrating biological interpretations.

INTRODUCTION

The eastern North American monarch butterfly (Danaus plexippus) undergoes a spectacular long-distance migration in the fall. The monarch has emerged as an excellent model for investigating the general molecular and neural basis of long-distance migration (1,2). The remarkable navigational capabilities of monarchs are part of a genetic program that is initiated in migrants; the butterflies that travel south to Mexico are at least two generations away from the previous generation of fall migrants (3). Fundamental to decoding the genetic basis of the long-distance migration has been the construction of the draft sequence of the monarch genome (4).

The monarch genome and its transcriptome were sequenced de novo using next-generation sequencing technologies (4). The difficulty of assembling the genome from wild-caught butterflies with potentially high heterozygosity was overcome, thus allowing the construction of the initial version of the monarch genome assembly (v1) which consisted of 273 Mb with 16 866 protein-coding genes (4).

Although the original assembly was quite complete for gene coverage, its quality was limited by small scaffold size (N50 of 53 kb) and high redundancy (~10%). By implementing new assembly strategies and new libraries, these difficulties have been largely overcome, resulting in a substantial improvement of the monarch butterfly assembly (named v3): 90% of the 249 Mb assembled sequence is now represented by 366 major scaffolds whose minimum length is 160 kb. The improved organization of the monarch genome should allow more precise annotation work. Furthermore, it provides a high-quality reference that will facilitate future population genetic studies. For example, researchers can now re-sequence other monarch populations or non-migratory Danaus species to help identify migratory genes.
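The scaffold statistics quoted above (an N50 of 53 kb for v1, and 366 scaffolds of at least 160 kb covering 90% of v3) are instances of the general Nx statistic. The following is a minimal sketch of how such a value can be computed from a list of scaffold lengths. The function name and the toy lengths are illustrative assumptions and are not taken from the MonarchBase pipeline.

```python
def nx_length(lengths, fraction):
    """Return (threshold_length, scaffold_count) for an Nx statistic.

    Scaffolds are sorted from longest to shortest and accumulated until
    they cover `fraction` of the total assembled length. The length of the
    last scaffold added is the Nx value (N50 for fraction=0.5, N90 for 0.9),
    and the number of scaffolds added is the matching count.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")

    ordered = sorted(lengths, reverse=True)
    target = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return length, count
    # Only reachable through floating-point rounding when fraction is 1.
    return ordered[-1], len(ordered)


# Toy scaffold lengths (hypothetical, in kb), total 300 kb.
toy_lengths = [100, 80, 60, 40, 20]
print(nx_length(toy_lengths, 0.5))  # (80, 2): two scaffolds >= 80 kb cover half
print(nx_length(toy_lengths, 0.9))  # (40, 4): four scaffolds >= 40 kb cover 90%
```

Read this way, the v3 figures correspond to a 90% threshold length of 160 kb reached with 366 scaffolds.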

MonarchBase was developed as a public database for readily accessing the monarch genome, its proteome and related biological processes. The growing amount of genomic data and its continuous qualitative improvement necessitated a centralized database to coordinate the inflow of monarch genomic resources. Compared with public data repositories, organism-specific databases provide the community with specialized data sets, powerful retrieval interfaces, a platform for extensive biological interpretation and a site for the integration of a variety of previously dispersed data types. MonarchBase serves not only researchers interested in monarch butterfly biology and the biology of the migration but also the wider lepidopteran community. We report here the development of MonarchBase, its components and the latest version of the monarch genome assembly and its corresponding geneset.

RESULTS AND DISCUSSION

Data content

The current data content in MonarchBase is summarized in Table 1.

Table 1.
Data content in current version of MonarchBase

Genome assembly

Assembling genomes with potentially high levels of polymorphism remains a challenge, because allelic variants can be assembled as separate haplotype sequences, which results in residual redundancy. The occurrence of residual redundancy in initial assemblies has been reported in several studies (8,12). To remove redundancy from the initial monarch v1 assembly (4), we used both automated and manual methods. In brief, the shorter sequence of each duplicated pair was discarded, based on sequence identity and sequencing depth. Suspicious sequences that were detected in only one sequencing library were also excluded. Paired-end sequencing libraries, from 200 bp to 20 kb (4), were aligned to the non-redundant sequences, step by step, using BOWTIE2 (13). The local alignment mode of BOWTIE2 helped us effectively map the Roche 454 libraries (8 and 20 kb), which were not as rigorously analyzed previously (4). Scaffolds were subsequently constructed based on mapped linkages using SSPACE v2.0 (14). The resulting assembly (v3) consists of 5397 scaffolds spanning ~249 Mb (Table 1). The monarch genome was previously estimated to be 0.29 pg by Feulgen image analysis (15). However, the actual assembled genome size for many species is smaller than the early estimated size (7,16,17), partly because of the presence of heterochromatin, which is nearly impossible to sequence and assemble (12). Compared with the previous version, the latest monarch assembly shows a substantial improvement in connectedness (Table 2). Gene coverage in the new geneset (OGS2.0) is also increased, although our initial version already showed good gene coverage (Table 2). The monarch whole genome shotgun project has been deposited at DDBJ/EMBL/GenBank under the accession AGBW00000000. The version described in this paper (v3) is the second version, AGBW02000000.
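To put the 0.29 pg estimate and the ~249 Mb assembly on the same scale, the short calculation below converts picograms of DNA to megabases. It is a sketch under two assumptions that are not stated in the text: that 0.29 pg is a haploid genome size, and that the commonly used conversion of roughly 978 Mb per picogram applies.

```python
# Commonly used conversion factor: 1 pg of DNA is about 978 Mb.
# This constant is an assumption brought in for illustration; it is not
# taken from the paper.
MB_PER_PG = 978.0


def c_value_to_mb(picograms):
    """Convert a genome size in picograms to megabases."""
    return picograms * MB_PER_PG


estimated_mb = c_value_to_mb(0.29)   # Feulgen estimate cited in the text
assembled_mb = 249.0                 # approximate v3 assembly span
assembled_fraction = assembled_mb / estimated_mb

print(f"Estimated genome size: {estimated_mb:.0f} Mb")
print(f"Assembled fraction:    {assembled_fraction:.0%}")
```

Under these assumptions the estimate is roughly 284 Mb, so the v3 assembly spans a little under 90% of it, which is consistent with the remark that assembled genomes are often smaller than early size estimates.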

Table 2.
Quality control of the latest monarch assembly v3 compared with v1 and the other lepidopterans

Genome annotation

We identified 25 Mb of sequence as repetitive sequences and transposable elements in the v3 assembly, using the approach described for the v1 assembly (4). We applied a variety of prediction methods to annotate repeat-masked scaffolds and provide accurate gene models (Table 1). Five ab initio prediction sets, from AUGUSTUS (23), GeneMark (24), Genscan (25), GlimmerHMM (26) and SNAP (27), were independently generated as described earlier (4). Importantly, we added data from the recently released geneset of the passion-vine butterfly Heliconius melpomene (8) to help identify butterfly-specific genes. All of these predicted genesets, together with the evidence from monarch cDNAs and insect homology, were combined by GLEAN (28) to generate a consensus geneset. In addition, we used the MAKER annotation pipeline (29) to build another consensus geneset from the same inputs as used for GLEAN. As a result, GLEAN and MAKER identified 16 216 and 13 969 genes, respectively. Based on an evaluation against 389 manually curated gene models and 20 cloned monarch genes, we chose the non-redundant GLEAN set as our new reference geneset, although both the GLEAN and MAKER sets, as well as all other independent prediction genesets, remain available in MonarchBase for browsing (Table 1).

A total of 15 130 of the 16 216 GLEAN genes, those whose existence was supported by either monarch cDNAs or insect homologs, were selected as the new official geneset (OGS2.0) for comprehensive annotation (Table 1). We performed BLASTP against both the RefSeq (5) and UniRef50 (6) databases to report annotation information. We also performed both BLASTP and BLASTX against the non-redundant NCBI database to help annotate uncommon genes and pseudogenes.

We used several methods to annotate genes into families and pathways. A local InterProScan (30) was run against the InterPro domain database (31) to map domains and Gene Ontology (GO) terms (32) to monarch genes. KEGG is well known for its collection of manually delineated pathway maps representing the current state of knowledge on molecular interactions and reactions (33). We queried monarch proteins against KEGG orthology (KO) using BLASTP (E-value cutoff of 1e-5) and assigned them to biological pathways. In addition, we used the OrthoMCL algorithm (34) to analyze gene orthology among 15 species, as described (4), and clustered genes into ortholog groups representing monarch-specific genes, butterfly-specific genes (monarch and Heliconius) and lepidopteran-specific genes (monarch, Heliconius and Bombyx), as well as universal genes. For comparative analysis, we performed multiple sequence alignment for each ortholog group using MUSCLE (35) and selected well-aligned blocks using Gblocks (36).
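The KO assignment step can be illustrated with a small sketch that reads BLAST tabular output (the standard 12-column format, where column 11 is the E-value and column 12 the bit score), keeps hits at or below the 1e-5 cutoff, and retains the best-scoring KO per query. The best-hit rule, the function name, the gene and subject identifiers and the KO labels are all hypothetical; the paper does not describe how multiple hits were resolved.

```python
import csv
import io

E_VALUE_CUTOFF = 1e-5  # cutoff quoted in the text


def best_ko_hits(tabular_text, subject_to_ko, cutoff=E_VALUE_CUTOFF):
    """Map each query to the KO of its highest-scoring hit under the cutoff.

    `tabular_text` is BLAST tabular output with the 12 standard columns.
    `subject_to_ko` maps subject sequence IDs to KO labels (hypothetical
    here; in practice it would come from the KEGG annotation files).
    """
    best = {}  # query -> (ko_label, bit_score)
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        evalue, bit_score = float(row[10]), float(row[11])
        if evalue > cutoff or subject not in subject_to_ko:
            continue
        if query not in best or bit_score > best[query][1]:
            best[query] = (subject_to_ko[subject], bit_score)
    return {query: ko for query, (ko, _) in best.items()}


# Hypothetical example rows: geneA has two acceptable hits, geneB has none.
rows = [
    ["geneA", "subj1", "62.5", "200", "75", "0", "1", "200", "1", "200", "1e-50", "180"],
    ["geneA", "subj2", "40.0", "150", "90", "0", "1", "150", "1", "150", "1e-10", "95"],
    ["geneB", "subj3", "30.0", "80", "56", "0", "1", "80", "1", "80", "0.01", "30"],
]
demo_text = "\n".join("\t".join(r) for r in rows)
demo_ko = {"subj1": "KO_1", "subj2": "KO_2", "subj3": "KO_3"}

print(best_ko_hits(demo_text, demo_ko))  # {'geneA': 'KO_1'}
```

Genes assigned a KO in this way can then be placed on pathway maps through the KO-to-pathway links that KEGG provides.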

Functional resources

By mapping monarch brain-derived expressed sequence tags (ESTs) (37) to the geneset, previously identified transcripts associated with the oriented flight behavior of migratory butterflies (38) have all been annotated (4). In addition, more than 7000 monarch genes have expression data for comparison between summer and migratory monarchs (38). Using an integration approach, we also found an unexpected sexually dimorphic pattern within the monarch juvenile hormone biosynthesis regulatory pathway (4). RNA-seq reads, representing multiple monarch tissues and developmental stages (4), were aligned back to the new assembly using Cufflinks (39) to present alternative splicing patterns. A universal expression value for each gene was calculated based on the normalized transcriptome coverage, as described (4). Small non-coding RNA sequencing data for both summer and migratory butterflies (4) were also integrated with the new assembly.
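The text does not give the formula behind the normalized transcriptome coverage; it refers to the earlier genome paper (4). As an illustration only, the sketch below uses RPKM (reads per kilobase of transcript per million mapped reads), a widely used normalization of this kind. It should not be read as the authors' method, and the numbers are invented.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    Normalizes a raw read count by gene length (in kb) and by sequencing
    depth (in millions of mapped reads) so that values are comparable
    across genes and libraries.
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on a 2 kb transcript in a 10-million-read library.
print(rpkm(500, 2000, 10_000_000))  # 25.0
```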

Database organization

We store and manage data for MonarchBase using MySQL (http://www.mysql.com). Several Common Gateway Interface scripts were developed to process users’ input to search the database, connect to third-party applications, parse the results and generate pages for retrieved data. A schematic diagram of the database organization is shown in Figure 1.
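The lookup pattern described above (take user input, query the relational store, return rows for page generation) can be sketched as follows. To keep the example self-contained, Python's built-in sqlite3 module stands in for MySQL, and the table layout, column names and gene records are hypothetical; the actual MonarchBase schema and scripts are not described in the text.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical gene table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE gene ("
        "gene_id TEXT PRIMARY KEY, symbol TEXT, scaffold TEXT, description TEXT)"
    )
    conn.executemany(
        "INSERT INTO gene VALUES (?, ?, ?, ?)",
        [
            ("demo_gene_1", "per", "scaffold_1", "period circadian protein"),
            ("demo_gene_2", "cry1", "scaffold_2", "cryptochrome 1"),
            ("demo_gene_3", "cry2", "scaffold_2", "cryptochrome 2"),
        ],
    )
    return conn


def search_genes(conn, user_input):
    """Return (gene_id, symbol) rows for an exact ID or a keyword.

    An exact gene ID match is tried first; otherwise the input is treated
    as a keyword and matched against symbols and descriptions. Parameterized
    queries keep user input out of the SQL text itself.
    """
    term = user_input.strip()
    exact = conn.execute(
        "SELECT gene_id, symbol FROM gene WHERE gene_id = ?", (term,)
    ).fetchall()
    if exact:
        return exact
    pattern = f"%{term}%"
    return conn.execute(
        "SELECT gene_id, symbol FROM gene "
        "WHERE symbol LIKE ? OR description LIKE ? ORDER BY gene_id",
        (pattern, pattern),
    ).fetchall()


conn = build_demo_db()
print(search_genes(conn, "demo_gene_1"))   # exact ID lookup
print(search_genes(conn, "cryptochrome"))  # keyword lookup
```

A script of this kind would pass the returned rows to a page template; sequence input would instead be routed to the BLAST server.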

Figure 1.
Schematic view of the components of MonarchBase and their connections. The green arrows represent the clickable connections between the components. Thin arrows represent the major entrances of MonarchBase accepting users’ input to retrieve data: ...

Genome browser

MonarchBase uses a genome browser, implemented with GBrowse 2.0 (40), to navigate annotations along the genome assembly. GBrowse is a well-known browser that combines a database with interactive web pages for displaying genome annotations, and it has been applied to a variety of databases (18,22,41). Through the MonarchBase GBrowse, researchers can access data representing consensus genesets, independent genesets, alternative splicing patterns, homolog and cDNA alignments, repeat content, non-coding RNAs and other genomic features.

Accurate prediction of gene models is the most important task of genome annotation work. For consistency among users, we provide, as already indicated, an official reference geneset, OGS2.0, which is superior in overall quality to each of the independent genesets. Because each gene prediction program currently in use has both strengths and weaknesses, displaying all prediction sets is useful to optimize gene models when there are conflicting overlaps between sets.

Retrieved data

MonarchBase has been designed with several entry points and accepts an entry ID, key words or a sequence as input to retrieve data for either a single gene or a group of genes (Figure 1). The gene page is the core of MonarchBase; there, researchers can access all related information for each OGS2.0 gene, including gene symbol, genomic position, evidence of monarch cDNA or insect homology, gene family, biological pathway, ortholog group and nucleotide and deduced protein sequences (Figure 1). Each entry on the gene page links to an informative web page. MonarchBase can also return a list of monarch genes, coupled with biological interpretation, when queried with GO, InterPro, KO, ortholog group or pathway entries. In addition, users can browse lists of differentially expressed ESTs and expanded/contracted gene families.

BLAST server

A local Basic Local Alignment Search Tool (BLAST) server is one of the most useful entry points for a genomic database. At MonarchBase, users can search against a variety of monarch genome-wide data, including scaffolds, contigs, genes and ESTs. We also packed 332 930 proteins from the genesets of 20 insect species into a single database, which facilitates searches for homologs across most insect orders. We used html4blast, a Bioperl module (42), to customize BLAST output. Through extended links, users can click on identifiers to retrieve relevant information conveniently.

Broad application

As monarchs are famous for their long-distance migration, the biological interpretation of the genome has focused on genes potentially involved in the migration. We have manually annotated more than 1000 genes of biological interest for monarch migration biology and curated more than 100 chemoreception genes (4). With the new assembly, we have updated these gene inventories with OGS2.0 gene models; these are available for browsing in MonarchBase. MonarchBase also includes data from other insect species, which are integrated with appropriate links to other databases. We also provide lepidopteran-specific genes, microRNAs and contracted or expanded gene families based on our analysis. Users from other fields can also download multiple datasets for use in their local comparative analyses. Detailed instructions on how to use each component can be found in the MonarchBase help file.

FUTURE DIRECTIONS

Population genomic studies of monarchs and other Danaus species should be forthcoming. Identifying variations will be useful for analyzing population substructure and distribution rates, dating the migration of the eastern North American population and eventually uncovering candidate migratory genes.

The completeness and contiguity of the monarch genome assembly will be continuously improved as more genomic sequences become available. In addition, the manual curation of additional genes is ongoing, and the results will be added to MonarchBase. We encourage other research groups to contribute annotations, curations and related datasets via email (steven.reppert/at/umassmed.edu). Suggestions and requests for additional functions are also welcome.

FUNDING

Funding for open access charge: National Institutes of Health [GM086794-02S1].

Conflict of interest statement. None declared.

ACKNOWLEDGEMENTS

We thank Jeffrey L. Boore for help with initial aspects of the monarch v1 assembly; Alan Ritacco and David Lapointe for assistance with security issues and public access; the Heliconius Genome Consortium for early access to the Heliconius geneset; and Christine Merlin for discussions and comments.

REFERENCES

1. Reppert SM, Gegear RJ, Merlin C. Navigational mechanisms of migrating monarch butterflies. Trends Neurosci. 2010;33:399–406. [PMC free article] [PubMed]
2. Reppert SM. A colorful model of the circadian clock. Cell. 2006;124:233–236. [PubMed]
3. Brower LP. Monarch butterfly orientation: missing pieces of a magnificent puzzle. J. Exp. Biol. 1996;199:93–103. [PubMed]
4. Zhan S, Merlin C, Boore JL, Reppert SM. The monarch butterfly genome yields insights into long-distance migration. Cell. 2011;147:1171–1185. [PMC free article] [PubMed]
5. Pruitt KD, Tatusova T, Brown GR, Maglott DR. NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy. Nucleic Acids Res. 2012;40:D130–D135. [PMC free article] [PubMed]
6. Suzek BE, Huang H, McGarvey P, Mazumder R, Wu CH. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics. 2007;23:1282–1288. [PubMed]
7. International Silkworm Genome Consortium. The genome of a lepidopteran model insect, the silkworm Bombyx mori. Insect Biochem. Molec. 2008;38:1036–1045. [PubMed]
8. Dasmahapatra KK, Walters JR, Briscoe AD, Davey JW, Whibley A, Nadeau NJ, Zimin AV, Hughes DS, Ferguson LC, Martin SH, et al. Butterfly genome reveals promiscuous exchange of mimicry adaptations among species. Nature. 2012;487:94–98. [PMC free article] [PubMed]
9. Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997;25:955–964. [PMC free article] [PubMed]
10. Lagesen K, Hallin P, Rodland EA, Staerfeldt HH, Rognes T, Ussery DW. RNAmmer: consistent and rapid annotation of ribosomal RNA genes. Nucleic Acids Res. 2007;35:3100–3108. [PMC free article] [PubMed]
11. Gardner PP, Daub J, Tate J, Moore BL, Osuch IH, Griffiths-Jones S, Finn RD, Nawrocki EP, Kolbe DL, Eddy SR, et al. Rfam: wikipedia, clans and the “decimal” release. Nucleic Acids Res. 2011;39:D141–D145. [PMC free article] [PubMed]
12. Chapman JA, Kirkness EF, Simakov O, Hampson SE, Mitros T, Weinmaier T, Rattei T, Balasubramanian PG, Borman J, Busam D, et al. The dynamic genome of Hydra. Nature. 2010;464:592–596. [PubMed]
13. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat. Methods. 2012;9:357–359. [PMC free article] [PubMed]
14. Boetzer M, Henkel CV, Jansen HJ, Butler D, Pirovano W. Scaffolding pre-assembled contigs using SSPACE. Bioinformatics. 2010;27:578–579. [PubMed]
15. Hebert PDN, Gregory TR. Genome size variation in lepidopteran insects. Can. J. Zool. 2003;81:1399–1405.
16. Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, Amanatides PG, Scherer SE, Li PW, Hoskins RA, Galle RF, et al. The genome sequence of Drosophila melanogaster. Science. 2000;287:2185–2195. [PubMed]
17. International Chicken Genome Sequencing Consortium. Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature. 2004;432:695–716. [PubMed]
18. Duan J, Li R, Cheng D, Fan W, Zha X, Cheng T, Wu Y, Wang J, Mita K, Xiang Z, et al. SilkDB v2.0: a platform for silkworm (Bombyx mori) genome biology. Nucleic Acids Res. 2010;38:D453–D456. [PMC free article] [PubMed]
19. Parra G, Bradnam K, Korf I. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics. 2007;23:1061–1067. [PubMed]
20. McQuilton P, St Pierre SE, Thurmond J. FlyBase 101—the basics of navigating FlyBase. Nucleic Acids Res. 2012;40:D706–D714. [PMC free article] [PubMed]
21. She R, Chu JS, Wang K, Pei J, Chen N. GenBlastA: enabling BLAST to identify homologous gene sequences. Genome Res. 2009;19:143–149. [PubMed]
22. Kim HS, Murphy T, Xia J, Caragea D, Park Y, Beeman RW, Lorenzen MD, Butcher S, Manak JR, Brown SJ. BeetleBase in 2010: revisions to provide comprehensive genomic information for Tribolium castaneum. Nucleic Acids Res. 2010;38:D437–D442. [PMC free article] [PubMed]
23. Stanke M, Keller O, Gunduz I, Hayes A, Waack S, Morgenstern B. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res. 2006;34:W435–W439. [PMC free article] [PubMed]
24. Borodovsky M, Lomsadze A. Eukaryotic gene prediction using GeneMark.hmm-E and GeneMark-ES. Curr. Protoc. Bioinform. 2011;35:4.6.1–4.6.10. [PMC free article] [PubMed]
25. Burge C, Karlin S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 1997;268:78–94. [PubMed]
26. Majoros WH, Pertea M, Salzberg SL. TigrScan and GlimmerHMM: two open source ab initio eukaryotic gene-finders. Bioinformatics. 2004;20:2878–2879. [PubMed]
27. Korf I. Gene finding in novel genomes. BMC Bioinformatics. 2004;5:59. [PMC free article] [PubMed]
28. Elsik CG, Mackey AJ, Reese JT, Milshina NV, Roos DS, Weinstock GM. Creating a honey bee consensus gene set. Genome Biol. 2007;8:R13. [PMC free article] [PubMed]
29. Cantarel BL, Korf I, Robb SM, Parra G, Ross E, Moore B, Holt C, Sanchez Alvarado A, Yandell M. MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 2008;18:188–196. [PubMed]
30. Quevillon E, Silventoinen V, Pillai S, Harte N, Mulder N, Apweiler R, Lopez R. InterProScan: protein domains identifier. Nucleic Acids Res. 2005;33:W116–W120. [PMC free article] [PubMed]
31. Hunter S, Jones P, Mitchell A, Apweiler R, Attwood TK, Bateman A, Bernard T, Binns D, Bork P, Burge S, et al. InterPro in 2011: new developments in the family and domain prediction database. Nucleic Acids Res. 2012;40:D306–D312. [PMC free article] [PubMed]
32. The Gene Ontology Consortium. The Gene Ontology: enhancements for 2011. Nucleic Acids Res. 2012;40:D559–D564. [PMC free article] [PubMed]
33. Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res. 2012;40:D109–D114. [PMC free article] [PubMed]
34. Li L, Stoeckert CJ, Jr, Roos DS. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 2003;13:2178–2189. [PubMed]
35. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32:1792–1797. [PMC free article] [PubMed]
36. Talavera G, Castresana J. Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Syst. Biol. 2007;56:564–577. [PubMed]
37. Zhu H, Casselman A, Reppert SM. Chasing migration genes: a brain expressed sequence tag resource for summer and migratory monarch butterflies (Danaus plexippus). PLoS One. 2008;3:e1345. [PMC free article] [PubMed]
38. Zhu H, Gegear RJ, Casselman A, Kanginakudru S, Reppert SM. Defining behavioral and molecular differences between summer and migratory monarch butterflies. BMC Biol. 2009;7:14. [PMC free article] [PubMed]
39. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 2010;28:511–515. [PMC free article] [PubMed]
40. Donlin MJ. Using the Generic Genome Browser (GBrowse). Curr. Protoc. Bioinform. 2009;28:9.9.1–9.9.25. [PubMed]
41. Cameron RA, Samanta M, Yuan A, He D, Davidson E. SpBase: the sea urchin genome database and web site. Nucleic Acids Res. 2009;37:D750–D754. [PMC free article] [PubMed]
42. Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JG, Korf I, Lapp H, et al. The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002;12:1611–1618. [PubMed]

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press