The monarch butterfly (Danaus plexippus) is emerging as a model organism to study the mechanisms of circadian clocks and animal navigation, and the genetic underpinnings of long-distance migration. The initial assembly of the monarch genome was released in 2011, and the biological interpretation of the genome focused on the butterfly’s migration biology. To make the extensive data associated with the genome accessible to the general biological and lepidopteran communities, we established MonarchBase (available at http://monarchbase.umassmed.edu). The database is an open-access, web-available portal that integrates all available data associated with the monarch butterfly genome. Moreover, MonarchBase provides access to an updated version of the genome assembly (v3) upon which all data integration is based. The integrated data include genes with systematic annotation, as well as other molecular resources, such as brain expressed sequence tags, migration expression profiles and microRNAs. MonarchBase offers a variety of retrieval methods for accessing data conveniently and for integrating biological interpretations.
The eastern North American monarch butterfly (Danaus plexippus) undergoes a spectacular long-distance migration in the fall. The monarch has emerged as an excellent model for investigating the general molecular and neural basis of long-distance migration (1,2). The remarkable navigational capabilities of monarchs are part of a genetic program that is initiated in migrants; the butterflies that travel south to Mexico are at least two generations away from the previous generation of fall migrants (3). Fundamental to decoding the genetic basis of the long-distance migration has been the construction of the draft sequence of the monarch genome (4).
The monarch genome and its transcriptome were sequenced de novo using next-generation sequencing technologies (4). The difficulty of assembling the genome from wild-caught butterflies with potentially high heterozygosity was overcome, thus allowing the construction of the initial version of the monarch genome assembly (v1) which consisted of 273 Mb with 16 866 protein-coding genes (4).
Although the original assembly was quite complete for gene coverage, its quality was limited by small scaffold size (N50 of 53 kb) and high redundancy (~10%). By implementing new assembly strategies and new libraries, these difficulties have been largely overcome, resulting in a substantial improvement of the monarch butterfly assembly (named v3): 90% of the 249 Mb assembled sequence is now represented by 366 major scaffolds whose minimum length is 160 kb. The improved organization of the monarch genome should allow more precise annotation work. Furthermore, it provides a high-quality reference that will facilitate future population genetic studies. For example, researchers can now re-sequence other monarch populations or non-migratory Danaus species to help identify migratory genes.
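The contiguity figures quoted above (an N50 of 53 kb for v1, and 90% of the v3 sequence lying in 366 scaffolds of at least 160 kb) follow the usual Nx/Lx definitions. A minimal sketch of how such statistics can be computed from a list of scaffold lengths is given here; the function name and the toy lengths are illustrative assumptions and are not taken from the MonarchBase pipeline.

```python
def nx_stats(lengths, fraction):
    """Return (Nx, Lx) for a collection of scaffold lengths.

    Nx is the length of the shortest scaffold in the smallest set of
    longest scaffolds that together cover `fraction` of the assembly;
    Lx is the number of scaffolds in that set.
    """
    if not lengths:
        raise ValueError("no scaffold lengths given")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ordered = sorted(lengths, reverse=True)
    target = fraction * sum(ordered)
    covered = 0
    for count, length in enumerate(ordered, start=1):
        covered += length
        if covered >= target:
            return length, count
    # Only reachable through floating-point rounding of the target.
    return ordered[-1], len(ordered)


# Toy scaffold lengths (hypothetical values, not monarch data).
toy = [10, 8, 6, 4, 2]
n50, l50 = nx_stats(toy, 0.5)
n90, l90 = nx_stats(toy, 0.9)
```

In these terms, the v3 description corresponds to an N90 of 160 kb with an L90 of 366 scaffolds.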
MonarchBase was developed as a public database for readily accessing the monarch genome, its proteome and related biological processes. The growing amount of genomic data and its continuous qualitative improvement necessitated a centralized database to coordinate the inflow of monarch genomic resources. Compared with general public data repositories, organism-specific databases provide the community with specialized data sets, powerful retrieval interfaces, a platform for extensive biological interpretations and a site for the integration of a variety of previously dispersed data types. MonarchBase serves not only researchers interested in monarch butterfly biology and the biology of the migration but also the wider lepidopteran community. We report here the development of MonarchBase, its components and the latest version of the monarch genome assembly and its corresponding geneset.
The current data content in MonarchBase is summarized in Table 1.
Assembling genomes with potentially high levels of polymorphism has remained a challenge, as haplotypes are assigned to allelic variants, which results in residual redundancy. The occurrence of residual redundancy in the initial assembly has been reported in several studies (8, 12). To remove redundancy from the initial monarch v1 assembly (4), we used both automated and manual methods. In brief, the shorter one of a duplicated pair of sequences was discarded; this was done by considering sequence identity and sequencing depth. Suspicious sequences that were only detected in one sequencing library were also excluded. Paired-end sequencing libraries, from 200 bp to 20 kb (4), were aligned to the non-redundant sequences, step by step, using BOWTIE2 (13). The local alignment mode of BOWTIE2 helped us effectively map Roche 454 libraries (8 and 20 kb), which were not as rigorously analyzed previously (4). Scaffolds were subsequently constructed based on mapped linkages using SSPACE v2.0 (14). The resulting assembly (v3) consists of 5397 scaffolds spanning ~249 Mb (Table 1). The monarch genome was previously estimated to be 0.29 pg by Feulgen image analysis (15). However, the actual assembled genome size for many species is smaller than their early estimated size (7,16,17), partly because of the presence of heterochromatin, which is nearly impossible to sequence and assemble (12). Compared with the previous version, the latest monarch assembly shows a substantial improvement in connectedness (Table 2). Gene coverage in the new geneset (OGS2.0) is also increased, although our initial version already showed good gene coverage (Table 2). The monarch whole genome shotgun project has been deposited at DDBJ/EMBL/GenBank under the accession AGBW00000000. The version described in this paper (v3) is the second version, AGBW02000000.
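The redundancy-removal rule described above (discard the shorter member of a duplicated pair, taking sequence identity and sequencing depth into account) can be sketched as follows. The thresholds, the use of the median depth as a reference and the input structures are all assumptions made for illustration; the text does not give the actual criteria, and the manual curation step is not modelled.

```python
from statistics import median


def find_redundant(lengths, depths, pairs,
                   min_identity=0.95, max_rel_depth=0.75):
    """Return names of sequences flagged as redundant.

    lengths, depths: dicts mapping sequence name -> length / mean read depth.
    pairs: iterable of (name_a, name_b, identity) for aligned sequence pairs.
    min_identity and max_rel_depth are hypothetical thresholds.
    """
    typical_depth = median(depths.values())
    flagged = set()
    for a, b, identity in pairs:
        if identity < min_identity:
            continue  # not similar enough to be treated as a duplicate
        # The shorter sequence of the pair is the candidate for removal.
        shorter = min((a, b), key=lambda name: (lengths[name], name))
        # Assumption: a haplotype-split copy shows reduced read depth.
        if depths[shorter] <= max_rel_depth * typical_depth:
            flagged.add(shorter)
    return flagged


# Toy input (hypothetical names and values).
lengths = {"s1": 1000, "s2": 400, "s3": 900, "s4": 300}
depths = {"s1": 40.0, "s2": 20.0, "s3": 42.0, "s4": 41.0}
pairs = [("s1", "s2", 0.98), ("s3", "s4", 0.99), ("s1", "s3", 0.80)]
redundant = find_redundant(lengths, depths, pairs)
```

Here only `s2` is flagged: it is the shorter member of a high-identity pair and has roughly half the typical depth, whereas `s4` has normal depth and is kept.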
We identified 25 Mb of sequence as repetitive sequences and transposable elements for the v3 assembly, as described for the v1 assembly (4). We applied a variety of prediction methods to annotate repeat-masked scaffolds and provide accurate gene models (Table 1). Five ab initio prediction sets, from AUGUSTUS (23), GeneMark (24), Genscan (25), GlimmerHMM (26) and SNAP (27), were independently generated as described earlier (4). Importantly, we added data from the recently released geneset of the passion-vine butterfly Heliconius melpomene (8) to help identify butterfly-specific genes. All these predicted genesets, together with the evidence from monarch cDNAs and insect homology, were combined by GLEAN (28) to generate a consensus geneset. In addition, we used the MAKER annotation pipeline (29) to build another consensus geneset using the same inputs as used for GLEAN. As a result, GLEAN and MAKER identified 16 216 and 13 969 genes, respectively. Based on an evaluation against 389 manually curated gene models and 20 cloned monarch genes, we chose the non-redundant GLEAN set as our new reference geneset, though both the GLEAN and MAKER sets, as well as all other independent prediction genesets, remain available in MonarchBase for browsing (Table 1).
A total of 15 130 of 16 216 GLEAN genes whose existence was supported from either monarch cDNAs or insect homologs were selected as the new official geneset (OGS2.0) for comprehensive annotation (Table 1). We performed BLASTP against both RefSeq (5) and UniRef50 (6) databases to report annotation information. We also performed both BLASTP and BLASTX against the non-redundant NCBI database to help annotate those uncommon genes and pseudogenes.
We used several methods to annotate genes into families and pathways. A local InterProScan (30) was run against the InterPro domain database (31) to map domains and GeneOntology (GO) terms (32) to monarch genes. KEGG is well known for its collection of manually delineated pathway maps representing the current state of knowledge on molecular interactions and reactions (33). We queried monarch proteins against KEGG orthology (KO) using BLASTP (E-value cutoff of 1e-5) and assigned them to biological pathways. In addition, we used the OrthoMCL algorithm (34) to analyze gene orthology among 15 species, as described (4), and clustered genes into ortholog groups representing monarch-specific genes, butterfly-specific genes (monarch and Heliconius) and lepidopteran-specific genes (monarch, Heliconius and Bombyx), as well as universal genes. For comparative analysis, we performed multiple sequence alignment for each ortholog group using MUSCLE (35) and selected well-aligned blocks using Gblocks (36).
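The KO assignment step can be illustrated with a short sketch that filters BLASTP hits at the 1e-5 cutoff and keeps the best-scoring hit per query. It assumes BLAST tabular output with the 12 default columns (E-value in column 11, bit score in column 12); the best-hit rule, the identifiers and the subject-to-KO lookup table are hypothetical and may differ from the procedure actually used.

```python
def best_hits(tabular_lines, max_evalue=1e-5):
    """Keep the highest bit-score hit per query among hits passing the cutoff.

    Expects BLAST tabular lines with the 12 default columns.
    """
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def assign_ko(hits, subject_to_ko):
    """Map each query gene to the KO entry of its best hit, if one is known."""
    return {gene: subject_to_ko[subj]
            for gene, subj in hits.items() if subj in subject_to_ko}


# Toy tabular rows (hypothetical identifiers and scores).
rows = [
    "geneA\tsubj1\t85.0\t200\t30\t0\t1\t200\t1\t200\t2e-50\t310",
    "geneA\tsubj2\t60.0\t180\t72\t0\t1\t180\t5\t184\t1e-20\t150",
    "geneB\tsubj3\t30.0\t90\t63\t0\t1\t90\t1\t90\t0.002\t35",
]
hits = best_hits(rows)
ko = assign_ko(hits, {"subj1": "KO_example_1", "subj2": "KO_example_2"})
```

In this toy input, `geneB` receives no assignment because its only hit has an E-value above the cutoff.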
By mapping monarch brain-derived expressed sequence tags (ESTs) (37) to the geneset, previously identified transcripts associated with the oriented flight behavior of migratory butterflies (38) have all been annotated (4). In addition, more than 7000 monarch genes have expression data for comparison between summer and migratory monarchs (38). Using an integration approach, we also found an unexpected sexually dimorphic pattern within the monarch juvenile hormone biosynthesis regulatory pathway (4). RNAseq reads, representing multiple monarch tissues and developmental stages (4), were aligned back to the new assembly using Cufflinks (39) to present alternative splicing patterns. Universal expression value for each gene was calculated based on the normalized transcriptome coverage, as described (4). Small non-coding RNA sequencing data for both summer and migratory butterflies (4) were also integrated with the new assembly.
We store and manage data for MonarchBase using MySQL (http://www.mysql.com). Several Common Gateway Interface scripts were developed to process users’ input to search the database, connect to third-party applications, parse the results and generate pages for the retrieved data. A schematic diagram of the database organization is shown in Figure 1.
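The kind of lookup such a script performs can be sketched as a parameterized query that accepts either an entry ID or a key word. This is only an illustration: the table name, columns and rows are hypothetical, the real schema is not described in the text, and Python's built-in sqlite3 module stands in for MySQL so that the example is self-contained.

```python
import sqlite3


def open_demo_db():
    """Build an in-memory database with a hypothetical gene table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE gene ("
        "gene_id TEXT PRIMARY KEY, scaffold TEXT, description TEXT)"
    )
    conn.executemany(
        "INSERT INTO gene VALUES (?, ?, ?)",
        [
            ("gene_0001", "scaffold_1", "circadian clock protein"),
            ("gene_0002", "scaffold_1", "odorant receptor"),
            ("gene_0003", "scaffold_7", "clock-controlled gene"),
        ],
    )
    return conn


def search_genes(conn, term):
    """Return rows whose ID equals `term` or whose description contains it.

    Placeholders keep user input out of the SQL string itself.
    """
    cursor = conn.execute(
        "SELECT gene_id, scaffold, description FROM gene "
        "WHERE gene_id = ? OR description LIKE ? ORDER BY gene_id",
        (term, f"%{term}%"),
    )
    return cursor.fetchall()


conn = open_demo_db()
by_keyword = search_genes(conn, "clock")
by_id = search_genes(conn, "gene_0002")
```

Passing user input as bound parameters rather than formatting it into the query text is the design choice worth noting for any web-facing search script.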
MonarchBase utilizes a genome browser, implemented with GBrowse 2.0 (40), to navigate annotation along with the genome assembly. GBrowse is a well-known browser that integrates database and interactive web pages for displaying annotations of genomes, and has been applied to a variety of databases (18,22,41). Through GBrowse of MonarchBase, researchers can access data representing consensus genesets, independent genesets, alternative splicing patterns, homolog and cDNA alignments, repeat content, non-coding RNAs and other genomic features.
Accurate prediction of gene models is the most important task of genome annotation work. For consistency among users, we provide, as already indicated, an official reference geneset, OGS2.0, which is superior in overall quality to each of the independent genesets. Because each gene prediction program currently in use has both strengths and weaknesses, displaying all prediction sets is useful to optimize gene models when there are conflicting overlaps between sets.
MonarchBase has been designed with several entry points and accepts an entry ID, key words or a sequence as input to retrieve data for either a single gene or a group of genes (Figure 1). The gene page is the core of MonarchBase, at which researchers can access all related information for each OGS2.0 gene, including gene symbol, genomic position, evidence of monarch cDNA or insect homology, gene family, biological pathway, ortholog group and nucleotide and deduced protein sequence (Figure 1). Each entry in the gene page links to an informative web page. MonarchBase can also return a list of monarch genes, coupled with biological interpretation, when queried with GO, InterPro, KO, ortholog group or pathway entries. In addition, users can browse a list of differentially expressed ESTs and expanded/contracted gene families.
Local Basic Local Alignment Search Tool (BLAST) is one of the most useful entrance sites for a genomic database. At MonarchBase, users can search against a variety of monarch genome-wide data, including scaffolds, contigs, genes and ESTs. We also packed 332 930 proteins from genesets of 20 insect species as a single database, which facilitates search for homologs of most insect orders. We used html4blast, a Bioperl module (42), to customize BLAST output. Through extended links, users can click on identifiers to retrieve relevant information conveniently.
As monarchs are famous for their long-distance migration, the biological interpretation of the genome has focused on genes potentially involved in the migration. We have manually annotated more than 1000 genes of biological interest for monarch migration biology and curated more than 100 chemoreception genes (4). With the new assembly, we have updated these gene inventories with OGS2.0 gene models; these are available for browsing in MonarchBase. MonarchBase also includes data from other insect species, which are integrated with appropriate links to other databases. We also provide lepidopteran-specific genes, microRNAs and contracted or expanded gene families based on our analysis. Users from other fields can also download multiple datasets for use in their local comparative analyses. Detailed instructions on how to use each component can be found in the help file of MonarchBase.
Population genomic studies for monarchs and other Danaus species should be forthcoming. Identifying variations will be useful for analyzing population substructure and distribution rates, dating the migration of the eastern North American population and eventually uncovering candidate migratory genes.
The completeness and contiguity of the monarch genome assembly will be continuously improved as more genomic sequences become available. In addition, the manual curation of additional genes is ongoing and will be updated in MonarchBase. We encourage other research groups to contribute annotations, curations and related datasets via Email (firstname.lastname@example.org). Suggestions and requests for additional functions are also welcome.
Funding for open access charge: National Institutes of Health [GM086794-02S1].
Conflict of interest statement. None declared.
We thank Jeffrey L. Boore for help with initial aspects of the monarch v1 assembly; Alan Ritacco and David Lapointe for assistance with security issues and public access; the Heliconius Genome Consortium for early access to the Heliconius geneset; and Christine Merlin for discussions and comments.