|Home | About | Journals | Submit | Contact Us | Français|
We present an annotation pipeline that accurately predicts exon–intron structures and protein-coding sequences (CDSs) on the basis of full-length cDNAs (FLcDNAs). This annotation pipeline was used to identify genes in 10 plant genomes. In particular, we show that interspecies mapping of FLcDNAs to genomes is of great value in fully utilizing FLcDNA resources whose availability is limited to several species. Because low sequence conservation at 5′- and 3′-ends of FLcDNAs between different species tends to result in truncated CDSs, we developed an improved algorithm to identify complete CDSs by the extension of both ends of truncated CDSs. Interspecies mapping of 71 801 monocot FLcDNAs to the Oryza sativa genome led to the detection of 22 142 protein-coding regions. Moreover, in comparing two mapping programs and three ab initio prediction programs, we found that our pipeline was more capable of identifying complete CDSs. As demonstrated by monocot interspecies mapping, in which nucleotide identity between FLcDNAs and the genome was ~80%, the resultant inferred CDSs were sufficiently accurate. Finally, we applied both inter- and intraspecies mapping to 10 monocot and dicot genomes and identified genes in 210 551 loci. Interspecies mapping of FLcDNAs is expected to effectively predict genes and CDSs in newly sequenced genomes.
Following the sequencing of the Arabidopsis thaliana genome in 20001 and the Oryza sativa genome in 2005,2 several complete or nearly complete flowering plant genomes have been released.3–10 In addition, large-scale sequencing projects of important cereal crops are currently being undertaken by international consortia, including the International Barley Sequencing Consortium (http://www.barleygenome.org) and the International Wheat Genome Sequencing Consortium (http://www.wheatgenome.org).11,12 It is expected that accelerating genome sequencing technologies, such as next-generation sequencing, will lead to a dramatic increase in the genomic DNA data to be annotated.13,14 Therefore, to cope with the deluge of emerging sequence information, the development of an efficient annotation system is needed.
Exon–intron structures and the protein-coding sequences (CDSs) in genome sequences can be predicted either by ab initio predictions or by sequence similarity methods. While ab initio gene prediction programs may produce erroneous exon–intron structures,15–17 sequence similarity approaches generally show better results, although the number of available sequences in the current databases limits the number of genes that can be predicted. Tens of thousands of full-length cDNAs (FLcDNAs), which can be used to accurately determine exon–intron structures,4,18–21 are available for O. sativa, Zea mays, and A. thaliana, and the quality of their genome annotations is generally thought to be high. In other species, however, sufficient numbers of FLcDNAs are not necessarily available for the corresponding genomes. To fully utilize FLcDNA resources, algorithms for interspecies mapping, which show better gene structure annotation than ab initio predictions, have been developed.22,23
Identification of CDS regions is a crucial step in genome annotation. However, low sequence conservation at the 5′- and 3′-ends of FLcDNAs between different species frequently results in truncated CDSs. To solve this problem, we present a pipeline that is based on interspecies FLcDNA mapping to genomes, which predicts exon–intron structures and the CDSs. In this pipeline, both ends of truncated CDSs are extended so that the complete CDSs are obtained. Our FLcDNA mapping pipeline was validated and compared with two other cDNA mapping programs—sim4cc23 and GeneSeqer24—as well as three ab initio gene prediction methods. In addition, we estimated how many sequences would be needed for exhaustive gene identification by interspecies mapping. Finally, for comprehensive gene identification, we applied our pipeline to the 10 genomes of the following flowering plants: O. sativa cv. Nipponbare, Z. mays, Sorghum bicolor, Brachypodium distachyon, A. thaliana, Lotus japonicus, Populus trichocarpa, Glycine max, Vitis vinifera, and Carica papaya.
We retrieved 179 991 FLcDNAs from A. thaliana,18,20,21 O. sativa,25,26 O. rufipogon,27 Hordeum vulgare,28 Z. mays,29–31 G. max,32 P. trichocarpa,33 Triticum aestivum,34 and Solanum lycopersicum35 from the sources listed in Supplementary Table S1. The genome sequences and annotations of O. sativa,2 S. bicolor,3 Z. mays,4 B. distachyon,5 A. thaliana,36 P. trichocarpa,6 V. vinifera,7 C. papaya,8 L. japonicus,9 and G. max10 were downloaded from the web sites listed in Supplementary Table S2. For validation of predicted exon–intron structures and the CDSs, reference data sets, based on the intraspecies mapping method of the Rice Annotation Project,19,37 were used (Supplementary Table S3). To enhance the accuracy of the validations, we used only the loci that had the same CDS structures as the corresponding CDSs from other sources: the MSU6.138 data set for O. sativa, the TAIR 9.0 representative CDSs36 for A. thaliana, and the B73 RefGen_v1 Filtered Gene Set4 for Z. mays. If more than one CDS was predicted in a locus of Z. mays, the longest one was selected.
Figure 1 shows an overview of our new interspecies mapping pipeline that simultaneously identifies CDSs on predicted loci. Repetitive sequences on FLcDNAs were masked by RepeatMasker (http://www.repeatmasker.org) with the MIPS Repeat Element Database (mips-REdat 4.3)39 and the ‘complete set’ of the Triticeae Repeat Sequence Database release 10 (http://wheat.pw.usda.gov/ITMI/Repeats/). Vector sequences were removed by the cross_match program. PolyA stretches were discarded using a custom-made program. Sequences with total unmasked lengths of 30 bp or more were used.
Before precise alignment between FLcDNAs and a genome, we first used blastn to approximately define genomic regions that correspond to FLcDNAs. We used est2genome40 to align FLcDNAs to the corresponding genomic regions selected by blastn. Next, we invented methods to solve three problems with interspecies alignment (Fig. 2). Because tandemly duplicated genes tended to be accidentally combined into one gene during alignments, we divided these regions such that each region contained only one candidate gene and aligned the FLcDNAs to each region (Supplementary Fig. S1). To improve the detection of exon–intron boundaries by est2genome, multiple lines of transcript evidence were employed as follows. We scored introns using linear discriminant analyses, based on a splice site model and alignments around the splice sites, such that the most probable intron is selected (for details, see Supplementary Methods). Finally, we also discarded short introns whose length was <50 bp because the percentage of such short introns identified by intraspecies mapping was 10 times smaller than that by interspecies mapping (Supplementary Fig. S2).
CDSs in predicted transcripts were determined on the basis of homology searches by blastx41 against the Uniprot42 and RefSeq43 databases. If no homologs were found in the protein databases, GeneMark44 was used, or the longest CDSs were employed. Either the 5′- or the 3′-end was frequently truncated after interspecies mapping because of the relatively low sequence conservation observed for CDS termini. Therefore, both ends of the mapped transcripts were extended so that predicted CDSs contained the start and the stop codons. Finally, from a set of overlapping transcripts in a given locus, a single representative transcript that had the longest CDS was selected.
To evaluate our methods, we compared the exon–intron structures and the CDSs derived from the interspecies mapping with the aforementioned reference sets of O. sativa, Z. mays, and A. thaliana. We evaluated only the predictions that overlapped with the reference set. Exon–intron boundaries, introns, all introns within a reference CDS (referred to as ‘all introns’ hereafter), and the entire CDSs were examined. We evaluated mapping results for each FLcDNA. Specificity (SP) was defined as TP/(TP + FP), and sensitivity (SN) was defined as TP/(TP + FN), where TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives. Single-exon genes were considered only in the CDS evaluation because they have no introns. To evaluate the CDS extension process, we examined whether the start and the stop codons of the reference set were included in the predictions. Reference CDSs that did not overlap with predictions were not used.
To investigate relationships between the nucleotide identity and mapping ratio, the nucleotide identity between a given FLcDNA and the genomic locus to which it was assigned by est2genome was calculated. The mapping ratio represents the percentage of FLcDNAs that were mapped by est2genome divided by all FLcDNAs used. To investigate correlations between nucleotide identity and SP, we selected representative CDSs for each species and calculated the SP of gene predictions between species for genomes and FLcDNAs.
The newly developed annotation pipeline described in the present work was compared with three ab initio gene prediction tools, including GeneZilla,45 GlimmerHMM,46 and GeneMark.hmm.47 The options for O. sativa gene predictions were employed in GlimmerHMM and GeneMark.hmm predictions. For GeneZilla, we randomly selected 5000 genes from the rice annotation37 and used 4124, which contained both the start and the stop codons, to create a training model. Our annotation pipeline also was compared with other cDNA mapping programs, including sim4cc23 and GeneSeqer.24,48 We mapped 71 801 monocot FLcDNAs to the O. sativa genome with the default parameters of sim4cc and GeneSeqer. Alignments that covered 40% or more of FLcDNAs were selected for comparison of prediction efficiency. For the GeneSeqer predictions, the longest representative CDS was selected if multiple transcripts were aligned to one locus.
If FLcDNAs are collected randomly, the number of FLcDNAs per locus obeys a Poisson distribution. We estimated the number of all loci (N) from the number of loci found (L) and the number of FLcDNAs mapped (M) using the following formula49:
Given L and M, N can be estimated. To cover 99% of the genomic loci, L/N = 0.99. Hence, the number of FLcDNAs needed to cover 99% of the loci is:
To evaluate our pipeline, we aligned 71 801 FLcDNAs obtained from three monocots (H. vulgare, Z. mays, and T. aestivum) to the O. sativa genome. From this, a total of 54 004 (75.2%) non-rice monocot FLcDNAs were mapped and clustered into 22 500 loci, from which 22 142 CDSs were predicted in the O. sativa genome. If these CDSs overlapped with the reference CDSs, we examined SP and SN for exon–intron boundaries, introns, all introns, and the entire CDSs (Table 1). As expected,22 mapping-based methods, including our pipeline, generally showed higher SP and SN at the intron and all-introns levels than the ab initio methods; however, both the SP and the SN drastically dropped at the CDS level when using GeneSeqer. This is partly because the 5′- and 3′-ends of CDSs are poorly conserved and cannot be aligned in many cases. In fact, a significant number of O. sativa start or stop codons were not found in all the three mapping-based methods without extension of 5′- and 3′-ends (Table 2). In contrast, the extension of the start and the stop codons in our CDS identification led to the inclusion of more than 88 and 91% of these codons, respectively (Table 2), which were better than the other two methods. After the CDS extension process, the SP and SN of our pipeline increased by 12.2 and 9.8%, respectively, at the CDS level, which suggests that this extension step seems to be of great benefit for accurate CDS prediction.
Low sequence similarity in untranslated regions (UTRs) caused more serious problems. The SP of introns in UTRs was <33%, which was much lower than the above 90% SP of non-UTR-based CDSs (Supplementary Table S4). This result indicates that interspecies mapping should focus on CDSs rather than the entire transcripts.
Improvement of gene structure identification in three steps (detection of tandem duplications, removal of short introns, and exon–intron boundary corrections) of our pipeline was validated (Supplementary Table S5). The pipeline detected candidate regions of tandem duplication in 11 287 alignments (see Supplementary Methods), and they were consequently reflected in 3510 loci. In addition, 10 097 short introns (<50 bp) to be removed were found in 4480 loci. As a result, 7.5% SP and 2.5% SN at the all-introns level were increased in the three steps (Supplementary Table S5). In this way, without sacrificing SN, SP was largely improved.
Although interspecies mapping is useful for detecting loci in related species, FLcDNAs from distantly related species may lead to incorrect predictions. To test the applicability of the interspecies mapping approach, we examined the relationship between species classification and mapping ratios. For this purpose, we used the O. sativa, Z. mays, and A. thaliana genomes because they had large numbers of FLcDNAs that could be used to create high-quality reference sets. We applied monocot FLcDNAs to the O. sativa and Z. mays genomes and dicot FLcDNAs to the A. thaliana genome. We also applied our algorithm to a monocot–dicot comparison. Figure 3 shows the relationship between the mapping ratio and the species classification. In general, mapping within both monocots and dicots yielded a higher mapping ratio than mapping between a monocot and a dicot. Therefore, comparisons between distantly related species, such as a monocot and a dicot, do not seem to be appropriate. Within-monocot comparisons displayed a higher ratio than within-dicot comparisons. In addition, monocot-to-Arabidopsis mapping showed a lower mapping ratio compared with dicot-to-rice/maize mapping, although evolutionary distances of these two comparisons were same. This difference of mapping ratios is possible because A. thaliana might have undergone genome reduction that led to a lineage-specific deletion of genes.
To further ascertain the degree to which nucleotide differences affect the accuracy of the exon–intron structure predictions generated by interspecies mapping, we examined the relationship between the nucleotide sequence identity and the SP of all introns in our representative CDSs (Fig. 4). The correlation coefficient for this relationship is 0.88 (P = 8.6 × 10−9). This suggests that interspecies mapping within closely related species should detect a significant number of true exon–intron structures. If the nucleotide identity is more than 80%, the SP of all introns in a given set of CDSs is expected to be ~60% or more.
Figure 4 also shows that, for the monocot genomes examined, monocot FLcDNAs provided accurate results, whereas the dicot exon–intron structures that were inferred were relatively inaccurate because the dicot genomes used were derived from diverse species. Because all monocots employed in the present study belong to Poaceae (the grass family), their divergence time is relatively short (50–70 million years or less).50 In contrast, the available dicot FLcDNAs belong to multiple families that are more divergent: A. thaliana is in Brassicaceae, S. lycopersicum in Solanaceae, G. max in Fabaceae, and P. trichocarpa in Salicaceae. For example, A. thaliana and P. trichocarpa probably split around 100–120 million years ago.6 In fact, nucleotide identities among the dicots A. thaliana, P. trichocarpa, S. lycopersicum, and G. max ranged from 75% to 78%, which is lower than the ~80% identity detected among monocots. To accurately determine the exon–intron structures within genomes, FLcDNAs from the same family would be preferable.
A genome is composed of regions that are conserved among species and regions that are specific to a lineage, but interspecies mapping can identify only conserved genes. Here, on the basis of the interspecies mapping results, we estimated the numbers of conserved genes in monocots. If random sampling of FLcDNAs is assumed, the number of loci inferred from the number of mapped FLcDNAs obeys a Poisson distribution (see the ‘Materials and methods’ section). We estimated the number of loci shared between O. sativa and the other monocots to be ~18 000 because 17 402 loci were identified among 54 004 transcripts. Zea mays and S. bicolor seem to contain ~21 000 and ~24 000 conserved loci, respectively. Note that, for this estimation, a single genomic region was selected if an FLcDNA was homologous to multiple regions. We also should note that, because the sampling of FLcDNAs was not completely random, the estimated numbers should be the lower limits. In addition, to more accurately estimate the total number of loci, we must consider lineage-specific genes that are undetectable by sequence similarity.
Next, we estimated the number of FLcDNAs needed to annotate homologous CDSs among monocot species. Assuming that ~83 000 FLcDNAs would need to be used to cover 99% of the ~18 000 loci in O. sativa, with a mapping ratio of 0.75, a total of ~110 000 FLcDNAs obtained from closely related species would be necessary for prediction of CDSs in O. sativa. This number may be an underestimate if FLcDNAs are not randomly cloned, but most of the genes should be included if more than 100 000 clones are collected from closely related species. Therefore, because there are more than 120 000 FLcDNAs in monocots, new genomes of monocots, such as wheat and barley, would be effectively annotated by using interspecies mapping.
We applied the interspecies mapping procedure to predict exon–intron structures and CDSs in 10 plant genomes: O. sativa, Z. mays, A. thaliana, S. bicolor, B. distachyon, P. trichocarpa, G. max, V. vinifera, L. japonicus, and C. papaya. Table 3 shows the number of the FLcDNAs used, the mapped FLcDNAs, and the CDS loci. Three examples of our interspecies mapping results for O. sativa were compared with the exon–intron structures of annotation release 2 of the rice genome (RAP2), which was based on an intraspecies mapping procedure37 (Fig. 5). First, three FLcDNAs derived from Z. mays were mapped to the Os11g0635500 locus of RAP2, and their 5′- and 3′-ends were extended. These predicted CDSs displayed exon–intron structures that were identical to those determined by an O. sativa FLcDNA with a CDS and UTRs (Fig. 5A). Second, the representative structure at the Os01g0120300 locus determined by an FLcDNA (Fig. 5B) was apparently truncated at the 5′-end. In contrast, the interspecies mapping results predicted the complete exon–intron structures including the start and the stop codons. Finally, a H. vulgare FLcDNA (AK248420) predicted a novel transcribed locus candidate between Os08g0206900 and Os08g0207000 that had not yet been detected (Fig. 5C). In total, we identified 492 new loci that were missing in release 2 of the rice annotation. The numbers of new loci found in the 10 plant species examined are shown in Table 3. In addition, we provide exon–intron structure and CDS data that were created by a combination of interspecies and intraspecies mapping methods (Supplementary Table S6). This comprehensive data set, which contains 210 551 genes from 10 species, will be useful for a large-scale sequence analysis of these flowering plants.
Because more than 120 000 monocot FLcDNAs are currently available, this FLcDNA data set seems to be large enough to annotate conserved genes in monocot genomes. We could provide relatively a high-quality CDS annotation using intraspecies and interspecies mapping. However, for dicot genomes, the 58 534 dicot FLcDNAs currently available do not seem to be sufficient to predict most of the gene structures. We expect that 50 000 more FLcDNAs from any dicots might improve the coverage of gene prediction in dicot genomes. To complement the shortage of FLcDNAs and explore lineage-specific genes, ab initio prediction programs can be used in combination with interspecies and intraspecies mapping. Interspecies FLcDNA mapping, partly supported by ab initio predictions, will be of great use for cost-effective annotation of newly released agronomically important plant genomes, such as those of wheat and barley.
A web service for the mapping of FLcDNAs to genomic DNA sequences is available (http://fpgp.dna.affrc.go.jp/index.html). Users can submit a sequence of up to 1 Mb and can specify dicot or monocot FLcDNAs to be mapped. After completion of the requested prediction, an URL indicating the prediction results is sent by e-mail. The CDS identification results of 10 plant genomes are available at our web site (http://fpgp.dna.affrc.go.jp/download.html).
This work was supported by a grant from the Ministry of Agriculture, Forestry, and Fisheries of Japan (Genomics for Agricultural Innovation, GIR-1001).
Edited by Katsumi Isono