Motivation: A large part of the maize B73 genome sequence is now available, and emerging sequencing technologies will offer cheap and easy ways to sequence regions of interest from many other maize genotypes. One of the steps required to turn these sequences into valuable information is gene content prediction. To date, there is no publicly available gene predictor specifically trained for maize sequences. To fill this gap, we trained EuGène, a program that combines several sources of evidence into a consolidated gene model prediction.
Supplementary information: Supplementary data are available at Bioinformatics online.
The B73 maize genome sequence is now available (Schnable et al., 2009). We can anticipate that next generation sequencing technologies will soon supply a deluge of genomic data from other maize genotypes. Therefore, in addition to the annotation provided by the maize sequence consortium for the B73 genotype, the community will also need a tool for the annotation of maize genomic sequences produced from other genotypes.
To date, the www.maizesequence.org web site provides the Filtered Gene Set, which includes annotation of 32 540 gene models (RefGen_v1) based on biological evidence. However, it does not yet provide a tool for annotating user-supplied sequences.
Fgenesh (Salamov and Solovyev, 2000) was among the first ab initio gene prediction programs available for maize, although it was trained for monocot species rather than specifically for maize. Combiner programs such as EuGène (Foissac and Schiex, 2005) improve on their own ab initio predictions by integrating information from sequence alignment software, from splice site and translation start site predictors, and from other gene finders. EuGène uses probabilistic Markov models to discriminate coding sequences from non-coding ones, and genuine splice sites from false ones. Each gene model generated by EuGène is associated with a score based on all the available information. To calculate the weight of each information source and to calibrate its ab initio prediction module, EuGène was trained on a curated set of maize gene sequences built in this study.
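The idea of discriminating coding from non-coding sequence with Markov models can be illustrated with a minimal sketch. It is not EuGène's implementation: it assumes simple fixed-order models with add-one smoothing, and the function names `train_markov` and `log_odds` are hypothetical. A positive log-odds score means the coding model explains the sequence better than the non-coding model.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(seqs, order=2, pseudocount=1.0):
    """Return a function giving P(base | previous `order` bases).

    Probabilities are estimated from `seqs` with additive smoothing so
    that unseen contexts fall back to a uniform distribution.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1

    def prob(context, base):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + pseudocount * len(ALPHABET)
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob


def log_odds(seq, coding, noncoding, order=2):
    """Sum of log(P_coding / P_noncoding) over every position of `seq`."""
    score = 0.0
    for i in range(order, len(seq)):
        context, base = seq[i - order:i], seq[i]
        score += math.log(coding(context, base) / noncoding(context, base))
    return score


# Toy training data, for illustration only.
coding_model = train_markov(["ATGGCCGCCGCCGCC"])
noncoding_model = train_markov(["ATATATATATATATAT"])
gc_score = log_odds("GCCGCCGCC", coding_model, noncoding_model)
at_score = log_odds("ATATATAT", coding_model, noncoding_model)
```

In a combiner, a score of this kind would be only one of several weighted evidence sources contributing to the final gene model.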
To build cognate gDNA/cDNA pairs, 6700 maize genomic sequences (including 4000 BACs) and 5500 full-length cDNAs (FLcDNAs) were extracted from the NCBI databases and filtered using an automatic pipeline developed in-house. First, cDNA sequences were trimmed using Seqclean (http://www.tigr.org/) and the Univec database (http://www.ncbi.nlm.nih.gov/VecScreen/UniVec.html). Then cDNA redundancy was removed based on an ‘all against all’ cDNA BLAST analysis (E-value threshold 1e-100). Non-redundant cDNAs were then aligned onto the BAC sequences using BLAT (Kent, 2002). To resolve ambiguous alignments, such as a cDNA mapping to several gDNAs, a spliced alignment was computed using GenomeThreader (Gremme et al., 2005), and the best alignment was retained as the correct gDNA/cDNA pair. Next, for each gDNA/cDNA pair, the coding sequence was determined from an alignment with the corresponding protein from maize or rice. Each of the 247 validated pairs was manually checked before inclusion in the training set.
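The redundancy-removal step can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in the study: it assumes the ‘all against all’ BLAST hits are in the standard 12-column tabular format (E-value in column 11), that cDNAs linked by any hit at or below the threshold form one cluster, and that the longest member of each cluster is kept. The article does not state how the representative was chosen, and the function name `nonredundant` is hypothetical.

```python
def nonredundant(lengths, blast_lines, max_evalue=1e-100):
    """Keep one cDNA per cluster of mutually similar sequences.

    lengths     -- dict mapping cDNA name to sequence length
    blast_lines -- iterable of tab-separated BLAST tabular hits
    max_evalue  -- hits with an E-value at or below this link two cDNAs
    """
    parent = {name: name for name in lengths}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if query != subject and evalue <= max_evalue:
            parent[find(query)] = find(subject)

    # Assumption: the longest sequence represents its cluster.
    best = {}
    for name, length in lengths.items():
        root = find(name)
        if root not in best or length > lengths[best[root]]:
            best[root] = name
    return sorted(best.values())


# Toy example: "a" and "b" are near-identical, "c" is only weakly similar.
hits = [
    "a\tb\t99.0\t1000\t5\t0\t1\t1000\t1\t1000\t1e-150\t1800",
    "a\tc\t85.0\t200\t30\t2\t1\t200\t1\t200\t1e-20\t90",
]
kept = nonredundant({"a": 1000, "b": 1200, "c": 800}, hits)
```

Single-linkage clustering is the simplest choice here; a stricter criterion, such as a minimum alignment coverage, could be added in the same loop.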
The third-party programs used by EuGène-maize are SpliceMachine (Degroeve et al., 2005; donor and acceptor splice site prediction), GenomeThreader (spliced alignment), BlastX (Altschul et al., 1997; protein alignment) and, optionally, Fgenesh (ab initio gene prediction). The sequence database used for prediction contains 69 306 maize FLcDNAs from GenBank, 20 508 rice mRNAs from RAPdb, 342 491 maize PlantGDB-assembled Unique Transcripts (PUTs) from PlantGDB, 593 maize proteins from Uniprot/SwissProt, 19 836 rice proteins from RAPdb and 26 751 Arabidopsis proteins from TAIR (see the EuGène-maize web site for an updated listing of sequence resources and corresponding full references).
The training gene set was compared with 330 curated genes (Haberer et al., 2005) and was found to be representative of maize genes (Table 1). To assess EuGène-maize, we performed gene prediction on eight BACs (AC211245, AC190915, AC204601, AC186187, AC211225, AC193983, AC200414 and AC194325) for which manually curated annotations of 42 genes are available (Liu et al., 2007). We compared these results (Table 2) with the GeneBuilder (Liang et al., 2009) predictions for B73 RefGen_v1 (Schnable et al., 2009). Nearly all loci are detected by both predictors; however, GeneBuilder missed several mono-exonic genes. Exon-level assessment shows that GeneBuilder is more sensitive but less specific than EuGène. A gene containing 18 exons was incorrectly split by both tools (GRMZM2G119544_E01 and GRMZM2G119496_E01). Another gene (GRMZMM2G086779), containing four exons, was split by EuGène only. In two other instances (GRMZM2G520535 and GRMZM2G177098), GeneBuilder incorrectly merged two adjacent genes, whereas EuGène made this error only once.
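For readers unfamiliar with exon-level assessment, the following sketch shows how such figures are commonly computed. It assumes the usual convention in gene prediction benchmarks: a predicted exon counts as correct only if both boundaries match an annotated exon exactly, sensitivity is the fraction of annotated exons recovered, and specificity is the fraction of predicted exons that are correct. The article does not spell out its exact definitions, and the function name `exon_accuracy` is hypothetical.

```python
def exon_accuracy(reference, predicted):
    """Return (sensitivity, specificity) at the exon level.

    Each exon is a (seqid, start, end, strand) tuple; a prediction is
    counted as correct only when it matches a reference exon exactly.
    """
    ref, pred = set(reference), set(predicted)
    correct = len(ref & pred)
    sensitivity = correct / len(ref) if ref else 0.0
    specificity = correct / len(pred) if pred else 0.0
    return sensitivity, specificity


# Toy coordinates, for illustration only.
annotated = [
    ("bac1", 100, 250, "+"),
    ("bac1", 400, 520, "+"),
    ("bac1", 700, 910, "+"),
    ("bac1", 1200, 1350, "+"),
]
predictions = [
    ("bac1", 100, 250, "+"),
    ("bac1", 400, 520, "+"),
    ("bac1", 700, 910, "+"),
    ("bac1", 1190, 1350, "+"),  # wrong acceptor boundary
    ("bac1", 2000, 2100, "+"),  # spurious exon
]
sn, sp = exon_accuracy(annotated, predictions)
```

Under these definitions, a predictor that proposes more exons tends to gain sensitivity and lose specificity, which is the kind of trade-off described above for GeneBuilder and EuGène.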
EuGène-maize is available online (see Availability section). Genomic sequences can be masked prior to the prediction step. Masking is computed by RepeatMasker (A.F.A. Smit et al., unpublished data) using the mips Repeat Element Database (Redat 4.3) (Spannagl et al., 2007). The RepeatMasker low-complexity masking option is disabled. The user may also submit, if available, the output file from the Fgenesh software (version 2.4). Gene prediction takes <5 min for a 200 kb genomic sequence without masking, and >1 h if the RepeatMasker option is enabled.
The results are compressed into an archive file and e-mailed to the user. The archive contains the parameters and options used for prediction, the submitted sequence, the masked sequence (if relevant), the annotation files (in gff, gff3 and fasta formats) and an HTML file that allows the results to be displayed in a web browser.
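As a starting point for working with the returned gff3 annotation, the sketch below tallies feature types. It relies only on the general GFF3 layout (nine tab-separated columns, feature type in the third, comment and directive lines starting with `#`). The feature types and attribute values in the toy input are assumptions and may differ from what EuGène-maize actually writes; the function name `count_features` is hypothetical.

```python
from collections import Counter


def count_features(gff3_lines):
    """Count GFF3 records per feature type (third column)."""
    counts = Counter()
    for line in gff3_lines:
        # Skip blank lines, comments and directives such as ##gff-version.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9:  # ignore anything that is not a feature record
            counts[cols[2]] += 1
    return counts


# Toy input, for illustration only.
example = [
    "##gff-version 3",
    "seq1\tEuGene\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "seq1\tEuGene\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1",
    "seq1\tEuGene\texon\t100\t300\t.\t+\t.\tParent=mrna1",
    "seq1\tEuGene\texon\t500\t900\t.\t+\t.\tParent=mrna1",
]
summary = count_features(example)
```

With a real result file, the same function can be applied to an open file handle, for example `count_features(open(path))`, where `path` points to the gff3 file extracted from the archive.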
The authors greatly appreciate the advice of Thomas Schiex and Jérôme Gouzy, technical support from Christophe Caron and Veronique Martin from MIGALE Plateforme and careful reading by Delphine Vincent. The curated annotation of the 42 maize gene set was kindly provided by Dr Clémentine Vitte.
Funding: French ‘Agence nationale de la recherche’ (ANR); Génoplante BIEP program.
Conflict of Interest: none declared.