The whole genome sequence of Tribolium castaneum, a worldwide coleopteran pest of stored products, has recently been determined. In order to facilitate accurate annotation and detailed functional analysis of this genome, we have compiled and analyzed all available expressed sequence tag (EST) data. The raw data consist of 61,228 ESTs, including 10,704 obtained from NCBI and an additional 50,524 derived from 32,544 clones generated in our laboratories. These sequences were amassed from cDNA libraries representing six different tissues or stages, namely: whole embryos; whole larvae; larval hindguts and Malpighian tubules; larval fat bodies and carcasses; adult ovaries; and adult heads. Assembly of the 61,228 sequences collapsed into 12,269 clusters (groups of overlapping ESTs representing single genes), of which 10,134 mapped onto 6,463 (39%) of the 16,422 GLEAN gene models (i.e. official Tribolium gene list). Approximately 1,600 clusters (13% of the total) lack corresponding GLEAN models, despite high matches to the genome, suggesting that a considerable number of transcribed sequences were missed by the gene prediction programs or were removed by GLEAN. We conservatively estimate that the current EST set represents more than 7,500 transcription units.
The red flour beetle, Tribolium castaneum, is an important coleopteran pest of stored grain and cereal products. Coleoptera is the most diverse order, by some estimates contributing more than one-third of all eukaryotic species. This remarkably adaptable evolutionary group includes a profusion of devastating agricultural pests, such as the corn rootworm, Colorado potato beetle, elm bark beetle, southern pine beetle, and many others. Tribolium is the most sophisticated and flexible genetic model for the beetles. This insect has a number of physiological adaptations not found in other insect species with fully-sequenced genomes (e.g., Drosophila melanogaster, Anopheles gambiae, and Apis mellifera). For example, Tribolium has a short-germ mode of embryonic development more characteristic of the primitive condition. In addition, Tribolium belongs to a unique group of desiccation-tolerant insects with specialized cryptonephridial organs for active rectal absorption of atmospheric water (Koefoed 1975). Finally, as the only omnivorous arthropod with a fully-sequenced genome, Tribolium has revealed novel innovations involving digestive physiology as well as pest adaptations to plant defense chemistry (Tribolium genome consortium, 2007). Tools for functional genomic evaluation in this model organism are available, including piggyBac transposable element-mediated transgenesis (Berghammer et al. 1999; Lorenzen et al. 2002; 2003; 2007), and RNA interference (RNAi), a technique shown to be particularly effective in Tribolium (Bucher et al. 2002; Tomoyasu and Denell 2004).
The genome sequence of Tribolium was completed in 2005 along with the first version of the genome assembly (Tcas1.0; http://www.hgsc.bcm.tmc.edu/projects/tribolium/). Almost 90% of the ~152-Mb assembly has been aligned with the 10 linkage groups using a genetic recombination map (Lorenzen et al. 2005). After further refinement of the assembly (Tcas2.0), automated annotation was performed utilizing two annotation pipelines and four ab initio gene prediction programs (Tribolium genome consortium, 2007). A final gene set was generated using the GLEAN algorithm (Elsik et al. 2007) to combine the results from diverse gene prediction programs into one consensus set. Based on the number of GLEAN gene models, the genome assembly encodes ~16,500 genes (Tribolium genome consortium, 2007). In order to improve the accuracy of the genome annotation, we undertook a large-scale expressed sequence tag (EST) project.
An efficient approach to genome-scale identification of transcribed sequences is the automated generation of large numbers of EST reads by sequencing one or both ends of randomly selected clones from one or more cDNA libraries. In order to increase the efficiency of new-transcript discovery, the library may be enriched in full-length transcripts and normalized to increase diversity. When aligned with the assembled genome, these sequences improve the accuracy of de novo or preliminary evidence-based gene prediction and annotation by providing more complete data on intron/exon structure and 5’- and 3’-untranslated regions. Specific cDNA clones from genes of interest can be used in various downstream applications, such as functional expression, transgenesis, or RNAi. In addition, EST data obtained from tissue- or stage-specific libraries can provide hints about gene expression patterns. We report here the results of analyses of over 60,000 EST sequences, including >50,000 sequences from five different tissue- or stage-specific cDNA libraries, in addition to >10,000 sequences previously available at NCBI.
Five cDNA libraries were derived from the highly inbred strain Georgia-2 (GA-2; Lorenzen et al. 2002). The tissue- or stage-enriched libraries were: TH (larval hindguts and Malpighian tubules); TL (mixed-stage, whole larvae); TF (larval fatbody and epidermal layer from immune-challenged insects); TO (adult ovaries); and TB (adult heads). Insects used for the TL and TF libraries consisted mostly of feeding-stage, late-instar larvae, but in the case of the TL library, small numbers of pre-pupae and young pupae were also included. Approximately 1000 insects were harvested for each library. Dissections were performed in phosphate buffered saline. For the TL and TH libraries, tissues were preserved in RNAlater® (Ambion, Austin, TX) at −80°C until the mRNA was extracted. Other tissues were flash-frozen in liquid nitrogen and sent to a commercial source (Invitrogen, Carlsbad, CA) for library construction. Tissues were obtained from approximately equal numbers of males and females. Adult tissues were harvested from insects less than one month after eclosion.
Tissues for the TF library were harvested from larvae that had been immune-challenged as follows: Escherichia coli, Micrococcus luteus, and blastospores of Beauveria bassiana were cultured to log phase, killed with 2% formaldehyde, washed three times in deionized water, and pelleted. Equal volumes of the pellets were combined, and larvae were pricked in the thorax with a fine needle (minuten pin, BioQuip, Inc., Rancho Dominguez, CA) dipped in the combined pellet. After incubating the larvae for 18 h at 30°C, heads, terminal abdominal segments, and alimentary canals were removed, and the remaining carcasses were flash-frozen in liquid nitrogen.
The TL and TH libraries were prepared from total RNA isolated using TRIZOL® Reagent (Invitrogen) followed by messenger RNA (mRNA) isolation using Dynabeads® mRNA Purification Kit (Invitrogen). TH and TL libraries were constructed with the SuperScript™ Plasmid System (Invitrogen) using the pSPORT.CMV6 plasmid vector according to the manufacturer’s protocol. TF, TO, and TB libraries were constructed as described above, but by a commercial source (Invitrogen) from flash-frozen tissue shipped on dry ice. The latter three libraries were normalized using DNA collected from the TL library as bait. In addition, the TF library was constructed from full-length enriched cDNAs using CAP site selection followed by recombination-based Gateway® cloning (Invitrogen).
A total of 32,544 clones were sequenced, including ~24,400 from both the 5’ and 3’ directions and the remainder from only the 5’ direction. Most of the EST sequences obtained from the TH and TL libraries were provided by The Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX. Other libraries were sequenced by SeqWright (Houston, TX) or by the Institute for Integrative Genome Biology (University of California, Riverside). EST sequences were deposited in dbEST at GenBank (http://www.ncbi.nlm.nih.gov/dbEST/index.html) with the accession numbers TO1, ES552901 to ES554556; TB1, ES554600 to ES546047; TL2, ES550987 to ES552900; TH1, ES548430 to ES550986; TF1, ES546048 to ES548429.
Most of the 10,704 EST sequences downloaded from NCBI were derived from one of two libraries. The first set (TE) comprises 2,519 sequences, including 2,466 entries by Savard and Tautz (e.g., DR753993) from an embryonic stage library constructed by Reinhard Schröder, and 53 entries from RACE products (e.g., DR953993) submitted by the same group. The second set is from the EX library, made by Exelixis Inc. from mixed larvae and adults. These sequences were submitted by Schmitt (Open Biosystems, Inc, e.g., EC011169), by Lorenzen et al. (2005), or by Brown (e.g., DN644292). We found a number of redundancies in this set, and duplicates were removed to prepare the nonredundant data set.
Raw sequence data were trimmed to remove poor-quality and vector sequence using the default parameters in Sequencher™ (Gene Codes Corporation, Ann Arbor, MI). Paired, overlapping 5’ and 3’ sequences of each clone were assembled into contigs. In the absence of overlap, paired reads were force-joined with insertion of an arbitrary 20N linker. UniEST clusters were formed by assembling sequences that were greater than 90% identical in a 30 bp window in Sequencher™.
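The force-joining step can be illustrated with a minimal Python sketch. It is not the Sequencher™ procedure itself: the function names are hypothetical, and it assumes the 3’ read is reported on the opposite strand and therefore must be reverse-complemented before joining.

```python
LINKER = "N" * 20  # the arbitrary 20N spacer described in the text

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (N stays N)."""
    return seq.translate(_COMPLEMENT)[::-1]


def force_join(read_5p: str, read_3p: str) -> str:
    """Join two non-overlapping reads of one clone into a single sequence.

    Assumption: read_3p was sequenced from the 3' end, so it is
    reverse-complemented to put both reads on the same strand.
    """
    return read_5p + LINKER + reverse_complement(read_3p)


# Toy reads, for illustration only.
joined = force_join("ATGGCC", "TTTAGC")
print(joined)
```

The run of N characters marks a gap of unknown length, so downstream alignment tools treat the two halves as separate matches on the same clone.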
BLAST searches (Altschul et al. 1997) were performed against the GLEAN consensus gene set (05-19-2006 version) (Elsik et al. 2007), the genome assembly (Tcas_2), and the UniProt set obtained from EBI (http://www.ebi.ac.uk/uniprot/database/download.html, 07-21-2006) using BlastStation v. 2.4 (TM software, CA), which has adapted NCBI BLAST 2.2.14. Gene Ontology (GO) annotation was derived using Blast2GO (http://www.blast2go.de/) (Conesa et al. 2005).
Most of the EST data used in this study were obtained from primary clones of the TH1 and TL1 libraries (Table 1). In comparisons among the sequences from the libraries TH1, TL1, and EX, the EX library provided the lowest redundancy and the highest gene discovery rate (54%), defined as the percentage of unique clones among the total set of clones sequenced (Table 1 and Fig. 1). The TE library also had a relatively high gene discovery rate with low redundancy. However, many sequences in this library did not match the Tribolium genome assembly, suggesting the presence of non-Tribolium sequences or low-quality sequences (Table 1). More extensive sequencing of the TF1 and TB1 libraries is currently underway because the former is enriched in full-length transcripts and the latter provides a high gene discovery rate.
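The gene discovery rate defined above is a simple ratio. The sketch below shows the calculation; the function name is hypothetical and the clone counts are placeholders chosen to reproduce a 54% rate, not the values in Table 1.

```python
def gene_discovery_rate(unique_clones: int, total_clones: int) -> float:
    """Percentage of unique clones among all clones sequenced from a library."""
    if total_clones <= 0:
        raise ValueError("total_clones must be positive")
    if not 0 <= unique_clones <= total_clones:
        raise ValueError("unique_clones must lie between 0 and total_clones")
    return 100.0 * unique_clones / total_clones


# Placeholder counts, for illustration only (not taken from Table 1).
rate = gene_discovery_rate(unique_clones=540, total_clones=1000)
print(f"gene discovery rate: {rate:.0f}%")
```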
The current version of the Tribolium EST database contains a total of 61,228 EST sequences: 50,524 derived from 32,544 cDNA clones generated in our laboratories, and 10,704 obtained from NCBI (Table 1). These sequences collapse into 12,351 clusters (uniESTs) after assembly of 5’ and 3’ reads and elimination of redundancies. BLASTX against the UniProt database (UniProt, http://www.pir.uniprot.org/) identified matches for 6,546 uniESTs (53% of the total) having high-scoring segment pairs (HSPs) with highly significant E-values (E<1e-10). Of these, the majority of HSPs were matches to other insect proteins. A portion of HSPs, with moderate E-values, were matches to mammalian sequences, possibly indicating either a bias towards mammalian sequences in the database, or the presence of ancestral genes that are retained in Tribolium but not in other insects. A portion of the HSPs included genes from plants, yeast, bacteria, and viruses, but these generally had higher E-values and are of questionable significance (Fig. 2).
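Counting uniESTs with significant matches amounts to filtering BLAST hits by E-value. The following sketch assumes the hits were exported in the standard 12-column tab-separated BLAST tabular layout, with the query ID in column 1 and the E-value in column 11; the export format, function name, and identifiers are assumptions, since the study used BlastStation and does not describe its output files.

```python
import csv
import io

EVALUE_CUTOFF = 1e-10  # the significance threshold quoted in the text


def queries_with_significant_hits(tabular_text: str, cutoff: float = EVALUE_CUTOFF) -> set:
    """Return the query IDs that have at least one hit with E-value below cutoff."""
    significant = set()
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query_id, evalue = row[0], float(row[10])
        if evalue < cutoff:
            significant.add(query_id)
    return significant


# Hypothetical hits, for illustration only.
example = (
    "uniEST_1\tP00001\t85.0\t200\t30\t0\t1\t600\t1\t200\t2e-35\t150\n"
    "uniEST_2\tP00002\t40.0\t80\t48\t0\t1\t240\t5\t84\t0.002\t38\n"
    "uniEST_3\tP00003\t60.0\t150\t60\t0\t1\t450\t1\t150\t5e-11\t70\n"
)
matched = queries_with_significant_hits(example)
print(sorted(matched))
```

Dividing the size of the returned set by the total number of uniESTs gives the matched fraction reported in the text.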
Slightly less than half of the 61,228 EST sequences analyzed in this study were used to support the various gene prediction programs that were merged to form the GLEAN consensus set, while more than half of the EST sequences were entered into GenBank after the GLEAN predictions were made. A comparison of the uniEST data to the GLEAN set revealed that 9,919 uniESTs (87% of the total) map onto 6,463 GLEAN genes (39% of 16,422 GLEAN genes, Fig. 3A, categories c and d), indicating that multiple uniESTs redundantly predict the same gene. The inverse, however, has not been included in those numbers: when an EST clone spanned multiple GLEAN predictions, only one GLEAN gene having the highest match was counted. EST analysis has revealed several examples of GLEAN predictions that incorrectly merged separate genes into a single computed gene. Therefore, the 39% coverage of the GLEAN genes by the uniESTs calculated by this method is probably an underestimate.
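The counting rule described here, in which each uniEST is assigned only to its highest-scoring GLEAN model and several uniESTs may share one model, can be sketched as follows. The function name, identifiers, and scores are hypothetical.

```python
def best_glean_per_uniest(hits):
    """Keep only the highest-scoring GLEAN model for each uniEST.

    `hits` is an iterable of (uniest_id, glean_id, score) tuples.
    """
    best = {}
    for uniest_id, glean_id, score in hits:
        if uniest_id not in best or score > best[uniest_id][1]:
            best[uniest_id] = (glean_id, score)
    return {uniest_id: glean_id for uniest_id, (glean_id, _) in best.items()}


# Hypothetical identifiers and scores, for illustration only.
example_hits = [
    ("uniEST_1", "GLEAN_A", 250.0),
    ("uniEST_1", "GLEAN_B", 90.0),   # weaker second match; not counted
    ("uniEST_2", "GLEAN_A", 310.0),  # second uniEST hitting the same model
    ("uniEST_3", "GLEAN_C", 120.0),
]
assignment = best_glean_per_uniest(example_hits)
covered_models = set(assignment.values())
print(len(assignment), "uniESTs map onto", len(covered_models), "GLEAN models")
```

In this toy case GLEAN_B is never counted even though an EST touches it, which is why a coverage figure computed this way tends to understate the true number of supported models.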
We found that ~1,600 uniESTs lacked corresponding GLEAN predictions (Fig. 3A, category a). These included 470 uniESTs with significant matches in UniProt (Fig. 3B, category a1). It is possible that some of these are novel transcripts in the Tribolium genome, while others could reflect contamination from foreign DNA. An additional 1,129 uniESTs were missed by GLEAN and lack significant matches to UniProt (Fig. 3B, category a2). These may be rapidly evolving genes, or they may represent untranslated regions of the transcription units. Rapidly evolving genes may represent those specific to Tribolium or to the Coleoptera. A group of 658 uniESTs failed to give high matches either to the genome, to GLEAN, or to UniProt (Fig. 3B, category b2). Most of these probably represent low quality sequence reads. The TE library contributed the majority of these sequences (424 out of 658), which often consisted of simple repeats and/or short read-lengths. Combining this information and accounting for the redundancy of uniESTs in the GLEAN consensus set, we conservatively estimate that the current uniEST set covers more than 7,500 genes (47% of the estimated total of ~16,000 genes).
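The percentages quoted in this part of the analysis can be rechecked with a few lines of arithmetic. The sketch below uses only figures stated in the text; the variable names are illustrative.

```python
# Figures as quoted in the text.
glean_total = 16_422            # GLEAN gene models
glean_with_est = 6_463          # GLEAN models matched by at least one uniEST
missed_with_uniprot = 470       # no GLEAN model, significant UniProt match
missed_without_uniprot = 1_129  # no GLEAN model, no significant UniProt match
estimated_genes_covered = 7_500
estimated_genes_total = 16_000

glean_coverage_pct = 100 * glean_with_est / glean_total
missed_total = missed_with_uniprot + missed_without_uniprot
gene_coverage_pct = 100 * estimated_genes_covered / estimated_genes_total

print(f"GLEAN models with EST support: {glean_coverage_pct:.1f}%")
print(f"uniESTs without a GLEAN model: {missed_total}")
print(f"Estimated gene coverage: {gene_coverage_pct:.1f}%")
```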
Fig. 4 shows the uniEST set classified using the Gene Ontology (GO) terms for cellular component, molecular function, and biological process. A broad range of components, functions, and processes is represented in the EST data, indicating the wide diversity of genes that have been captured in this EST project. Of particular note is the large portion of sequences encoding transporter activity (8% of the classification by molecular function). This is possibly because sequences derived from the TH library represent transcripts from the hindgut and Malpighian tubules, tissues involved in epithelial transport of solutes.
The genome assembly has been integrated with the linkage maps, resulting in 10 linkage group sequences representing the X chromosome and 9 autosomes, and an 11th artificial “unknown” linkage group (Tribolium genome consortium, 2007). The latter was created by connecting all unmapped sequence scaffolds in arbitrary linear order, and does not represent a real chromosome. Mapping the uniESTs onto these 11 linkage groups indicates that, with one exception, the number of uniESTs on each linkage group was roughly proportional to chromosome sequence length (Fig. 5). The one notable exception was linkage group 3, the longest linkage group, which was relatively sparsely endowed with uniESTs. Linkage groups 4, 5, 7 and 8 were slightly overrepresented by uniESTs.
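One way to express "roughly proportional to sequence length" is the ratio of the observed uniEST count on a linkage group to the count expected if uniESTs were spread evenly by length. The sketch below is one such calculation, not the analysis behind Fig. 5; the function name, group labels, lengths, and counts are all hypothetical.

```python
def representation_ratios(lengths_mb: dict, uniest_counts: dict) -> dict:
    """Observed / expected uniEST count per linkage group.

    The expected count assumes uniESTs are distributed in proportion to
    sequence length. A ratio below 1 means the group is underrepresented.
    """
    total_length = sum(lengths_mb.values())
    total_count = sum(uniest_counts.values())
    ratios = {}
    for group, length in lengths_mb.items():
        expected = total_count * length / total_length
        ratios[group] = uniest_counts[group] / expected
    return ratios


# Hypothetical lengths (Mb) and counts, for illustration only.
lengths = {"LG_A": 30.0, "LG_B": 10.0}
counts = {"LG_A": 600, "LG_B": 400}
ratios = representation_ratios(lengths, counts)
for group in sorted(ratios):
    print(group, round(ratios[group], 2))
```

A sparsely populated group such as the one described for linkage group 3 would show a ratio well below 1 under this measure.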
These EST data provide useful information for studies in Tribolium. Almost 90% of uniESTs map onto predicted genes, attesting to the overall accuracy and usefulness of the GLEAN gene set. Current EST data will be further expanded and utilized to determine intron/exon structure with even greater accuracy, and to identify splicing variants as well as 5’- and 3’-untranslated sequences, all of which are difficult to predict from automated annotations of genome sequence. Furthermore, a large portion (~1,600) of uniESTs lacks corresponding GLEAN models, indicating a continued need for additional EST projects. Additional survey of the Tribolium genome by EST analyses will further improve the automated annotation. Further sequencing of these libraries is being conducted and will be reported in a future publication.
This study was funded by the KSU-Plant Biotechnology Center, by NIH Grant Number P20 RR016475 from the INBRE Program of the National Center for Research Resources, and by USDA-NRI-CSREES 2007-35604-17759. We thank Shiela Prabhakar and Mukta Pahwa for insect dissections. This is Contribution Number 07-218-J from the Kansas Agricultural Experiment Station. Mention of trade names or commercial products in this publication is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the U.S. Department of Agriculture.