Most of the EST data used in this study were obtained from primary clones of the TH1 and TL1 libraries. In comparisons among the sequences from the TH1, TL1, and EX libraries, the EX library provided the lowest redundancy and the highest gene discovery rate (54%), defined as the percentage of unique clones among the total set of clones sequenced. The TE library also had a relatively high gene discovery rate with low redundancy. However, many sequences in this library did not match the Tribolium genome assembly, suggesting the presence of non-Tribolium sequences or low-quality sequences. More extensive sequencing of the TF1 and TB1 libraries is currently underway because the former is enriched in full-length transcripts and the latter provides a high gene discovery rate.
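As an illustration of the gene discovery rate defined above, the following minimal sketch computes the percentage of unique clones among all clones sequenced. It assumes that "unique" means the number of distinct clusters to which clones are assigned; the function name, clone identifiers, and cluster labels are hypothetical and are not taken from the study.

```python
def gene_discovery_rate(clone_to_cluster):
    """Return the percentage of unique clones among all clones sequenced.

    clone_to_cluster maps each sequenced clone to the cluster it falls into.
    Counting distinct clusters as "unique clones" is an assumption here.
    """
    total = len(clone_to_cluster)
    if total == 0:
        raise ValueError("no clones sequenced")
    unique = len(set(clone_to_cluster.values()))
    return 100.0 * unique / total


# Hypothetical toy library: five clones that collapse into three clusters.
toy_library = {"c1": "g1", "c2": "g1", "c3": "g2", "c4": "g3", "c5": "g3"}
rate = gene_discovery_rate(toy_library)
print(f"gene discovery rate: {rate:.0f}%")
```

A library with many redundant clones yields a low rate, whereas a library in which most clones fall into distinct clusters yields a high rate.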
Summary of cDNA libraries and the results of expressed sequence tag (EST) analysis.
Fig. 1 Diversity of the clones within each Tribolium cDNA library. The function of exponential association was used for the regression with the equation: y = y0 + A1*(1 − exp(−x/t1)) + A2*(1 − exp(−x/t2)), y0, A1, t1, A2, and ...
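The regression function in the Fig. 1 legend can be evaluated directly. The sketch below implements the stated equation and the plateau it approaches for large x, which is y0 + A1 + A2. The parameter values are hypothetical placeholders chosen only for illustration; the fitted values from the study are not reproduced here.

```python
import math


def exponential_association(x, y0, a1, t1, a2, t2):
    """y = y0 + A1*(1 - exp(-x/t1)) + A2*(1 - exp(-x/t2))."""
    return y0 + a1 * (1.0 - math.exp(-x / t1)) + a2 * (1.0 - math.exp(-x / t2))


def plateau(y0, a1, a2):
    """Value approached as x grows without bound (both exponentials vanish)."""
    return y0 + a1 + a2


# Hypothetical parameters, not the fitted values from the study.
params = dict(y0=0.0, a1=400.0, t1=300.0, a2=1600.0, t2=5000.0)

at_zero = exponential_association(0, **params)
at_large = exponential_association(1e9, **params)
limit = plateau(params["y0"], params["a1"], params["a2"])
print(at_zero, at_large, limit)
```

Under this model, x would be the number of clones sequenced and y the number of unique clones found, so the plateau can be read as an estimate of the diversity a library can yield.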
The current version of the Tribolium EST database contains a total of 61,228 EST sequences derived from 32,544 cDNA clones, in addition to 10,704 sequences obtained from NCBI. These sequences collapse into 12,351 clusters (uniESTs) after assembly of 5’ and 3’ reads and elimination of redundancies. TBLASTN against the UniProt database (UniProt, http://www.pir.uniprot.org/) identified matches for 6,546 uniESTs (53% of the total) having high-scoring segment pairs (HSPs) with highly significant E-values (E<1e-10). Of these, the majority of HSPs were matches to other insect proteins. A portion of the HSPs, with moderate E-values, matched mammalian sequences, possibly indicating either a bias towards mammalian sequences in the database or the presence of ancestral genes that are retained in Tribolium but not in other insects. Another portion of the HSPs matched genes from plants, yeast, bacteria, and viruses, but these generally had higher E-values and are of questionable significance.
Histogram showing cumulative frequency distribution of BLAST results (E-value) for Tribolium uniESTs. Results are categorized by the taxon producing the highest-scoring segment pair (HSP).
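The filtering and categorization described above can be sketched as follows: keep only best hits below the E-value cutoff quoted in the text (1e-10) and tally them by the taxon of the best-scoring hit. The hit records, taxon labels, and function names are hypothetical; this is not the pipeline used in the study.

```python
from collections import Counter

E_VALUE_CUTOFF = 1e-10  # cutoff quoted in the text


def significant_hits(best_hits, cutoff=E_VALUE_CUTOFF):
    """Keep queries whose best hit has an E-value below the cutoff."""
    return {query: hit for query, hit in best_hits.items() if hit[1] < cutoff}


def taxon_counts(hits):
    """Count retained queries by the taxon of their best hit."""
    return Counter(taxon for taxon, _ in hits.values())


# Hypothetical best hits: query -> (taxon of best hit, E-value).
best_hits = {
    "uniEST_001": ("insect", 1e-80),
    "uniEST_002": ("mammal", 1e-15),
    "uniEST_003": ("bacteria", 1e-3),
    "uniEST_004": ("insect", 1e-42),
}

kept = significant_hits(best_hits)
counts = taxon_counts(kept)
fraction = len(kept) / len(best_hits)
print(counts, f"{100 * fraction:.0f}% of queries retained")
```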
Slightly less than half of the 61,228 EST sequences analyzed in this study were used to support the various gene prediction programs that were merged to form the GLEAN consensus set, while more than half of the EST sequences were entered into GenBank after the GLEAN predictions were made. A comparison of the uniEST data to the GLEAN set revealed that 9,919 uniESTs (87% of the total) map onto 6,463 GLEAN genes (39% of 16,422 GLEAN genes), indicating that multiple uniESTs redundantly predict the same gene. The inverse case, however, is not reflected in those numbers: when an EST clone spanned multiple GLEAN predictions, only the GLEAN gene with the highest match was counted. EST analysis has also revealed several examples of GLEAN predictions that incorrectly merged separate genes into a single computed gene. Therefore, the 39% coverage of the GLEAN genes by the uniESTs calculated by this method is probably an underestimate.
Fig. 3 Plot showing similarities of uniESTs to the genome and to gene/protein predictions. (A) shows the match of each individual uniEST to GLEAN models, Genome sequence, and the UniProt database, in all three paired combinations. The plot for similarities to ...
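The counting rule described above, in which only the best-matching GLEAN gene is credited to each uniEST, can be illustrated with a small sketch. It shows why the rule tends to understate coverage when an EST spans more than one prediction. The identifiers, scores, and the `glean_coverage` helper are hypothetical; only the final check uses figures quoted in the text.

```python
def glean_coverage(est_hits, total_genes, best_only=True):
    """Count GLEAN genes covered by uniESTs and return (count, percent).

    est_hits maps a uniEST to a list of (gene_id, match_score) pairs.
    With best_only=True, only the highest-scoring gene per uniEST is counted.
    """
    covered = set()
    for hits in est_hits.values():
        if not hits:
            continue  # uniEST without any GLEAN match
        if best_only:
            best_gene, _ = max(hits, key=lambda hit: hit[1])
            covered.add(best_gene)
        else:
            covered.update(gene for gene, _ in hits)
    return len(covered), 100.0 * len(covered) / total_genes


# Hypothetical matches; uniEST_C spans two predicted genes.
est_hits = {
    "uniEST_A": [("GLEAN_1", 950.0)],
    "uniEST_B": [("GLEAN_1", 700.0)],
    "uniEST_C": [("GLEAN_2", 800.0), ("GLEAN_3", 650.0)],
    "uniEST_D": [],
}

best_count, best_pct = glean_coverage(est_hits, total_genes=10)
all_count, all_pct = glean_coverage(est_hits, total_genes=10, best_only=False)

# Coverage figure quoted in the text: 6,463 of 16,422 GLEAN genes.
reported_pct = 100.0 * 6463 / 16422
print(best_count, all_count, round(reported_pct))
```

In the toy data, best-match counting credits two genes, whereas counting every spanned gene credits three.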
We found that ~1,600 uniESTs lacked corresponding GLEAN predictions. These included 470 uniESTs with significant matches in UniProt. It is possible that some of these are novel transcripts in the Tribolium genome, while others could reflect contamination from foreign DNA. An additional 1,129 uniESTs were missed by GLEAN and lack significant matches to UniProt. These may be rapidly evolving genes, or they may represent untranslated regions of the transcription units. Rapidly evolving genes may represent those specific to Tribolium or to the Coleoptera. A group of 658 uniESTs failed to give high matches to the genome, to GLEAN, or to UniProt. Most of these probably represent low-quality sequence reads. The TE library contributed the majority of these sequences (424 out of 658), which often consisted of simple repeats and/or short read lengths. Combining this information and accounting for the redundancy of uniESTs in the GLEAN consensus set, we conservatively estimate that the current uniEST set covers more than 7,500 genes (47% of the estimated total of ~16,000 genes).
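The figures in this paragraph can be checked with simple arithmetic. The sketch below uses only numbers quoted in the text; the variable names are illustrative.

```python
# uniESTs lacking GLEAN predictions, split by UniProt match status.
with_uniprot_match = 470
without_uniprot_match = 1129
missed_by_glean = with_uniprot_match + without_uniprot_match  # about 1,600

# Share of the unmatched (likely low-quality) uniESTs that came from the TE library.
unmatched_total = 658
unmatched_from_te = 424
te_share_pct = 100.0 * unmatched_from_te / unmatched_total

# Conservative coverage estimate: 7,500 genes out of roughly 16,000.
coverage_pct = 100.0 * 7500 / 16000

print(missed_by_glean, round(te_share_pct, 1), round(coverage_pct))
```

The two groups missed by GLEAN sum to 1,599, consistent with the stated ~1,600, and 7,500 of ~16,000 genes rounds to the stated 47%.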
Fig. 4 shows the uniEST set classified using the Gene Ontology (GO) terms for cellular component, molecular function, and biological process. A broad range of components, functions, and processes is represented in the EST data, indicating the wide diversity of genes that have been captured in this EST project. Of particular note is the large portion of sequences encoding transporter activity (8% of the classification by molecular function). This is possibly due to sequences derived from the TH library, which represents transcripts from the hindgut and Malpighian tubules, the tissues involved in epithelial transport of solutes.
Fig. 4 Gene Ontology (GO) terms of the uniESTs for cellular component, molecular function, and biological process. The levels were arbitrarily chosen for the best visual presentations. For the same reason, the GO terms containing fewer than 50 uniESTs (in cellular ...
The genome assembly has been integrated with the linkage maps, resulting in 10 linkage group sequences representing the X chromosome and 9 autosomes, and an 11th artificial “unknown” linkage group (Tribolium genome consortium, 2007). The latter was created by connecting all unmapped sequence scaffolds in arbitrary linear order and does not represent a real chromosome. Mapping the uniESTs onto these 11 linkage groups indicates that, with one exception, the number of uniESTs on each linkage group was roughly proportional to chromosome sequence length. The one notable exception was linkage group 3, the longest linkage group, which was relatively sparsely endowed with uniESTs. Linkage groups 4, 5, 7 and 8 were slightly overrepresented by uniESTs.
Fig. 5 UniESTs mapped onto linkage groups. The unmapped sequence scaffolds are arbitrarily joined and named as chrUn (linkage group unknown). “Rep.” signifies reptigs, i.e. contigs that are highly repetitive and not included in the chromosomal ...
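One way to express the proportionality between uniEST number and linkage group length is the ratio of the observed count to the count expected if uniESTs were spread evenly by sequence length. The sketch below computes that ratio; values below 1 indicate sparse representation and values above 1 indicate overrepresentation. The linkage group names, lengths, and counts are hypothetical and are not the values plotted in Fig. 5.

```python
def representation_ratios(counts, lengths_mb):
    """Observed / expected uniEST count per linkage group.

    The expected count assumes uniESTs are distributed in proportion to
    sequence length, which is the null expectation discussed in the text.
    """
    total_count = sum(counts.values())
    total_length = sum(lengths_mb.values())
    ratios = {}
    for group, observed in counts.items():
        expected = total_count * lengths_mb[group] / total_length
        ratios[group] = observed / expected
    return ratios


# Hypothetical lengths (Mb) and uniEST counts for three linkage groups.
lengths_mb = {"LG_a": 10.0, "LG_b": 30.0, "LG_c": 20.0}
counts = {"LG_a": 200, "LG_b": 300, "LG_c": 400}

ratios = representation_ratios(counts, lengths_mb)
for group, ratio in sorted(ratios.items()):
    print(group, round(ratio, 2))
```

In this toy example the longest group, LG_b, has a ratio below 1, the pattern the text describes for linkage group 3.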
These EST data provide useful information for studies in Tribolium. Almost 90% of uniESTs map onto predicted genes, attesting to the overall accuracy and usefulness of the GLEAN gene set. The current EST data will be further expanded and used to determine intron/exon structure with even greater accuracy, and to identify splicing variants as well as 5’- and 3’-untranslated sequences, all of which are difficult to predict from automated annotations of genome sequence. Furthermore, a large number (~1,600) of uniESTs lack corresponding GLEAN models, indicating a continued need for additional EST projects. Additional surveys of the Tribolium genome by EST analyses will further improve the automated annotation. Further sequencing of these libraries is being conducted and will be reported in a future publication.