1.  Annotation and analysis of 10,000 expressed sequence tags from developing mouse eye and adult retina 
Genome Biology  2003;4(10):R65.
The generation and analysis of 10,000 expressed sequence tags (ESTs) from three mouse eye tissue cDNA libraries is reported that identifies a large number of potentially interesting genes for biological investigation.
As a biomarker of cellular activities, the transcriptome of a specific tissue or cell type during development and disease is of great biomedical interest. We have generated and analyzed 10,000 expressed sequence tags (ESTs) from three mouse eye tissue cDNA libraries: embryonic day 15.5 (M15E) eye, postnatal day 2 (M2PN) eye and adult retina (MRA).
Annotation of 8,633 non-mitochondrial and non-ribosomal high-quality ESTs revealed that 57% of the sequences represent known genes and 43% are unknown or novel ESTs, with M15E having the highest percentage of novel ESTs. Of these, 2,361 ESTs correspond to 747 unique genes and the remaining 6,272 are represented only once. Phototransduction genes are preferentially identified in MRA, whereas transcripts for cell structure and regulatory proteins are highly expressed in the developing eye. Map locations of human orthologs of known genes uncovered a high density of ocular genes on chromosome 17, and identified 277 genes in the critical regions of 37 retinal disease loci. In silico expression profiling identified 210 genes and/or ESTs over-expressed in the eye; of these, more than 26 are known to have vital retinal function. Comparisons between libraries provided a list of temporally regulated genes and/or ESTs. A few of these were validated by qRT-PCR analysis.
Our studies present a large number of potentially interesting genes for biological investigation, and the annotated EST set provides a useful resource for microarray and functional genomic studies.
PMCID: PMC328454  PMID: 14519200
2.  Peanut gene expression profiling in developing seeds at different reproduction stages during Aspergillus parasiticus infection 
Peanut (Arachis hypogaea L.) is an important crop economically and nutritionally, and is one of the most susceptible host crops to colonization of Aspergillus parasiticus and subsequent aflatoxin contamination. Knowledge from molecular genetic studies could help to devise strategies in alleviating this problem; however, few peanut DNA sequences are available in the public database. In order to understand the molecular basis of host resistance to aflatoxin contamination, a large-scale project was conducted to generate expressed sequence tags (ESTs) from developing seeds to identify resistance-related genes involved in defense response against Aspergillus infection and subsequent aflatoxin contamination.
We constructed six cDNA libraries from developing peanut seeds at three reproduction stages (R5, R6 and R7) from two cultivated peanut genotypes, 'Tifrunner' (susceptible to Aspergillus infection with higher aflatoxin contamination and resistant to TSWV) and 'GT-C20' (resistant to Aspergillus with reduced aflatoxin contamination and susceptible to TSWV). The developing peanut seed tissues were challenged by A. parasiticus and drought stress in the field. A total of 24,192 randomly selected cDNA clones from six libraries were sequenced. After removing vector sequences and quality trimming, 21,777 high-quality EST sequences were generated. Sequence clustering and assembly resulted in 8,689 unique EST sequences, comprising 1,741 tentative consensus EST sequences (TCs) and 6,948 singleton ESTs. Functional classification was performed according to MIPS functional catalogue criteria. The unique EST sequences were divided into twenty-two categories. A similarity search against the non-redundant protein database available from NCBI indicated that 84.78% of total ESTs showed significant similarity to known proteins, of which 165 genes had been previously reported in peanuts. Overall expression patterns differed among libraries and genotypes. A number of sequences were expressed throughout all of the libraries, representing constitutively expressed sequences. To identify resistance-related genes with significantly differential expression, a statistical measure of relative abundance (R) was used to compare the transcript abundance of each gene across the cDNA libraries. Thirty-six and forty-seven unique EST sequences with a threshold of R > 4 from the 'GT-C20' and 'Tifrunner' libraries, respectively, were selected for examination of temporal gene expression patterns according to EST frequencies.
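The relative-abundance comparison described above can be illustrated with a short sketch. The abstract does not name the statistic it used; the code below assumes the log-likelihood ratio R of Stekel, Git and Falciani (2000), a common choice for comparing EST counts across cDNA libraries, computed with natural logarithms. The function name and the example counts are hypothetical and are not taken from the study.

```python
import math


def r_statistic(counts, library_sizes):
    """Log-likelihood ratio R for one gene across several cDNA libraries.

    counts[i]        -- number of ESTs assigned to the gene in library i
    library_sizes[i] -- total number of ESTs sequenced from library i

    Assumed form (Stekel et al. 2000):
        R = sum_i x_i * ln(x_i / (N_i * f)),  f = sum_i x_i / sum_i N_i
    Libraries with x_i = 0 contribute nothing to the sum.
    """
    if len(counts) != len(library_sizes):
        raise ValueError("counts and library_sizes must have the same length")
    total = sum(counts)
    if total == 0:
        return 0.0
    pooled_freq = total / sum(library_sizes)  # expected frequency under the null
    r = 0.0
    for x, n in zip(counts, library_sizes):
        if x > 0:
            r += x * math.log(x / (n * pooled_freq))
    return r


# Hypothetical gene: 10 ESTs in one library, none in a second of equal size.
example_r = r_statistic([10, 0], [1000, 1000])
passes_threshold = example_r > 4  # the R > 4 cut-off quoted in the abstract
```

A gene with identical frequency in every library gives R = 0, and R grows as the counts become more uneven, which is why a fixed cut-off such as R > 4 can serve as a filter for candidate differentially expressed sequences.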
Nine and eight resistance-related genes with significant up-regulation were obtained in the 'GT-C20' and 'Tifrunner' libraries, respectively. Among them, three genes were common to both genotypes. Furthermore, a comparison of our EST sequences with other plant sequences in the TIGR Gene Indices libraries showed that the percentage of peanut ESTs matching Arabidopsis thaliana, maize (Zea mays), Medicago truncatula, rapeseed (Brassica napus), rice (Oryza sativa), soybean (Glycine max) and wheat (Triticum aestivum) ESTs ranged from 33.84% to 79.46% at a sequence identity of ≥ 80%. These results revealed that peanut ESTs are more closely related to legume species than to cereal crops, and more homologous to dicot than to monocot plant species.
The developed ESTs can be used to discover novel sequences or genes, to identify resistance-related genes and to detect the differences among alleles or markers between these resistant and susceptible peanut genotypes. Additionally, this large collection of cultivated peanut EST sequences will make it possible to construct microarrays for gene expression studies and for further characterization of host resistance mechanisms. It will be a valuable genomic resource for the peanut community. The 21,777 ESTs have been deposited to the NCBI GenBank database with accession numbers ES702769 to ES724546.
PMCID: PMC2257936  PMID: 18248674
3.  Comprehensive in silico functional specification of mouse retina transcripts 
BMC Genomics  2005;6:40.
The retina is a well-defined portion of the central nervous system (CNS) that has been used as a model for CNS development and function studies. The full specification of transcripts in an individual tissue or cell type, like retina, can greatly aid the understanding of the control of cell differentiation and cell function. In this study, we have integrated computational bioinformatics and microarray experimental approaches to classify the tissue specificity and developmental distribution of mouse retina transcripts.
We have classified a set of retina-specific genes using sequence-based screening integrated with computational and retina tissue-specific microarray approaches. 33,737 non-redundant sequences were identified as retina transcript clusters (RTCs) from more than 81,000 mouse retina ESTs. We estimate that about 19,000 to 20,000 genes might be expressed in mouse retina from embryonic to adult stages. 39.1% of the RTCs are not covered by 60,770 RIKEN full-length cDNAs. Through comparison with 2 million mouse ESTs, spectra of neural, retinal, late-generated retinal, and photoreceptor-enriched RTCs have been generated. More than 70% of these RTCs have data from biological experiments confirming their tissue-specific expression pattern. The highest-grade retina-enriched pool covered almost all the known genes encoding proteins involved in phototransduction.
This study provides a comprehensive mouse retina transcript profile for further gene discovery in retina and suggests that tissue-specific transcripts contribute substantially to the whole transcriptome.
PMCID: PMC1083414  PMID: 15777472
4.  Identifying and mapping novel retinal-expressed ESTs from humans 
Molecular vision  1999;5:5.
The goal of this study was to develop efficient methods to identify tissue-specific expressed sequence tags (ESTs) and to map their locations in the human genome. Through a combination of database analysis and laboratory investigation, unique retina-specific ESTs were identified and mapped as candidate genes for inherited retinal diseases.
DNA sequences from retina-specific EST clusters were obtained from the TIGR Human Gene Index Database. Further processing of the EST sequence data was necessary to ensure that each EST cluster represented a novel, non-redundant mapping candidate. Processing involved screening for homologies to known genes and proteins using BLAST, excluding known human gene sequences and repeat sequences, and developing primers for PCR amplification of the gene encoding each cDNA cluster from genomic DNA. The EST clusters were mapped using the GeneBridge 4.0 Radiation Hybrid Mapping Panel with standard PCR conditions.
A total of 83 retinal-expressed EST clusters were examined as potential novel, non-redundant mapping candidates. Fifty-five clusters were mapped successfully and their locations compared to the locations of known retinal disease genes. Fourteen EST clusters localize to candidate regions for inherited retinal diseases.
This pilot study developed methodology for mapping uniquely expressed retinal ESTs and for identifying potential candidate genes for inherited retinal disorders. Despite the overall success, several complicating factors contributed to the high failure rate (33%) for mapping EST-clustered sequences. These include redundancy in the sequence data, widely dispersed sequences, ambiguous nucleotides within the sequences, the possibility of amplifying through introns and the presence of repetitive elements within the sequence. However, the combination of database analysis and laboratory mapping is a powerful method for identification of candidate genes for inherited diseases.
PMCID: PMC2583080  PMID: 10228186
5.  Gene Discovery in the Auditory System: Characterization of Additional Cochlear-Expressed Sequences  
To identify genes involved in hearing, 8494 expressed sequence tags (ESTs) were generated from a human fetal cochlear cDNA library in two distinct sequencing projects. Analysis of the first set of 4304 ESTs revealed clones representing 517 known human genes, 41 mammalian genes not previously detected in human tissues, 487 ESTs from other human tissues, and 541 cochlear-specific ESTs. We now report results of a DNA sequence similarity (BLAST) analysis of an additional 4190 cochlear ESTs and a comparison to the first set. Among the 4190 new cochlear ESTs, 959 known human genes were identified; 594 were found only among the new ESTs and 365 were found among ESTs from both sequencing projects. COL1A2 was the most abundant transcript among both sets of ESTs, followed in order by COL3A1, SPARC, EEF1A1, and TPT1. An additional 22 human homologs of known nonhuman mammalian genes and 1595 clusters of ESTs, of which 333 are cochlear-specific, were identified among the new cochlear ESTs. Map positions were determined for 373 of the new cochlear ESTs and revealed 318 additional loci. Forty-nine of the mapped ESTs are located within the genetic interval of 23 deafness loci. Reanalysis of unassigned ESTs from the prior study revealed 338 additional known human genes. The total number of known human genes identified from 8494 cochlear ESTs is 1449 and is represented by 4040 ESTs. Among the known human genes are 14 deafness-associated genes, including GJB2 (connexin 26) and KVLQT1. The total number of nonhuman mammalian genes identified is 43 and is represented by 58 ESTs. The total number of ESTs without sequence similarity to known genes is 4055. Of these, 778 also do not have sequence similarity to any other ESTs, are categorized into 700 clusters, and may represent genes uniquely or preferentially expressed in the cochlea.
Identification of additional known genes, ESTs, and cochlear-specific ESTs provides new candidate genes for both syndromic and nonsyndromic deafness disorders.
PMCID: PMC3202364  PMID: 12083723
ESTs; genes; cochlea; cochlear-expressed genes
6.  Generation and analysis of expressed sequence tags from the ciliate protozoan parasite Ichthyophthirius multifiliis 
BMC Genomics  2007;8:176.
The ciliate protozoan Ichthyophthirius multifiliis (Ich) is an important parasite of freshwater fish that causes 'white spot disease' leading to significant losses. A genomic resource for large-scale studies of this parasite has been lacking. To study gene expression involved in Ich pathogenesis and virulence, our goal was to generate expressed sequence tags (ESTs) for the development of a powerful microarray platform for the analysis of global gene expression in this species. Here, we initiated a project to sequence and analyze over 10,000 ESTs.
We sequenced 10,368 EST clones using a normalized cDNA library made from pooled samples of the trophont, tomont, and theront life-cycle stages, and generated 9,769 sequences (94.2% success rate). Post-sequencing processing led to 8,432 high-quality sequences. Clustering analysis of these ESTs allowed identification of 4,706 unique sequences comprising 976 contigs and 3,730 singletons. These unique sequences represent over two million base pairs (~10% of the genome of Plasmodium falciparum, a phylogenetically related protozoan). BLASTX searches produced 2,518 significant (E-value < 10^-5) hits and further Gene Ontology (GO) analysis annotated 1,008 of these genes. The ESTs were analyzed comparatively against the genomes of the related protozoa Tetrahymena thermophila and P. falciparum, allowing putative identification of additional genes. All the EST sequences were deposited in the dbEST division of GenBank (GenBank: EG957858–EG966289). Gene discovery and annotations are presented and discussed.
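The run statistics quoted above follow from simple arithmetic on the reported counts. The sketch below recomputes them as a consistency check; the function name is hypothetical, and only the numbers given in the abstract are used.

```python
def summarize_est_run(clones, reads, contigs, singletons):
    """Recompute summary figures for an EST sequencing run.

    clones     -- number of cDNA clones submitted for sequencing
    reads      -- number of sequences actually obtained
    contigs    -- clusters assembled from two or more ESTs
    singletons -- ESTs that did not cluster with any other EST
    """
    return {
        # percentage of clones that produced a sequence
        "success_rate_pct": round(100 * reads / clones, 1),
        # every contig and every singleton counts as one unique sequence
        "unique_sequences": contigs + singletons,
    }


# Figures reported in the abstract above.
ich_summary = summarize_est_run(clones=10368, reads=9769,
                                contigs=976, singletons=3730)
```

With these inputs the success rate comes to 94.2% and the unique-sequence count to 4,706, matching the values stated in the abstract.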
This set of ESTs represents a significant proportion of the Ich transcriptome, and provides a material basis for the development of microarrays useful for gene expression studies concerning Ich development, pathogenesis, and virulence.
PMCID: PMC1906770  PMID: 17577414
7.  Generation and analysis of ESTs from the eastern oyster, Crassostrea virginica Gmelin and identification of microsatellite and SNP markers 
BMC Genomics  2007;8:157.
The eastern oyster, Crassostrea virginica (Gmelin 1791), is an economically important species cultured in many areas in North America. It is also ecologically important because of the impact of its filter feeding behaviour on water quality. Populations of C. virginica have been threatened by overfishing, habitat degradation, and diseases. Through genome research, strategies are being developed to reverse its population decline. However, large-scale expressed sequence tag (EST) resources have been lacking for this species. Efficient generation of EST resources from this species has been hindered by a high redundancy of transcripts. The objectives of this study were to construct a normalized cDNA library for efficient EST analysis, to generate thousands of ESTs, and to analyze the ESTs for microsatellites and potential single nucleotide polymorphisms (SNPs).
A normalized and subtracted C. virginica cDNA library was constructed from pooled RNA isolated from hemocytes, mantle, gill, gonad and digestive tract, muscle, and a whole juvenile oyster. A total of 6,528 clones were sequenced from this library generating 5,542 high-quality EST sequences. Cluster analysis indicated the presence of 635 contigs and 4,053 singletons, generating a total of 4,688 unique sequences. About 46% (2,174) of the unique ESTs had significant hits (E-value ≤ 1e-05) to the non-redundant protein database; 1,104 of which were annotated using Gene Ontology (GO) terms. A total of 35 microsatellites were identified from the ESTs, with 18 having sufficient flanking sequences for primer design. A total of 6,533 putative SNPs were also identified using all existing and the newly generated EST resources of the eastern oysters.
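Microsatellite mining of the kind described above can be sketched with a regular-expression scan. The abstract does not state the search criteria, so the motif lengths, minimum repeat counts, and flank length below are hypothetical placeholders, and the function names are illustrative only.

```python
import re

# Hypothetical thresholds: motif length -> minimum number of tandem repeats.
MIN_REPEATS = {2: 6, 3: 5, 4: 4}


def _is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit."""
    n = len(motif)
    return not any(
        n % d == 0 and motif == motif[:d] * (n // d) for d in range(1, n)
    )


def find_microsatellites(seq, flank=20):
    """Return perfect di-, tri- and tetranucleotide repeats found in seq.

    Each hit records its start position, motif, repeat count, and whether
    at least `flank` bases remain on both sides (a rough stand-in for
    "sufficient flanking sequence for primer design").
    """
    seq = seq.upper()
    hits = []
    for k, min_rep in MIN_REPEATS.items():
        # A k-mer followed by at least (min_rep - 1) further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if not _is_primitive(motif):
                continue  # skip homopolymers and motifs such as ACAC
            hits.append({
                "start": m.start(),
                "motif": motif,
                "repeats": len(m.group(0)) // k,
                "primer_ready": m.start() >= flank and len(seq) - m.end() >= flank,
            })
    return sorted(hits, key=lambda h: h["start"])


# Hypothetical EST fragment: an (AC)8 repeat with 20-base flanks on each side.
example_est = "ATGCGTACGTTAGCCTAGGA" + "AC" * 8 + "TTGACCGATCGGATACCGTA"
example_hits = find_microsatellites(example_est)
```

Filtering on the `primer_ready` flag mirrors the distinction the abstract draws between all microsatellites found and those with enough flanking sequence for primer design.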
A high quality normalized cDNA library was constructed. A total of 5,542 ESTs were generated representing 4,688 unique sequences. Putative microsatellite and SNP markers were identified. These genome resources provide the material basis for future microarray development, marker validation, and genetic linkage and QTL analysis.
PMCID: PMC1919373  PMID: 17559679
8.  A biphasic pattern of gene expression during mouse retina development 
Between embryonic day 12 and postnatal day 21, six major neuronal and one glial cell type are generated from multipotential progenitors in a characteristic sequence during mouse retina development. We investigated expression patterns of retina transcripts during the major embryonic and postnatal developmental stages to provide a systematic view of normal mouse retina development.
A tissue-specific cDNA microarray was generated using a set of sequence non-redundant EST clones collected from mouse retina. Eleven stages of mouse retina, from embryonic day 12.5 (E12.5) to postnatal day 21 (PN21), were collected for RNA isolation. Non-amplified RNAs were labeled for microarray experiments and three sets of data were analyzed for significance, hierarchical relationships, and functional clustering. Six individual gene expression clusters were identified based on expression patterns of transcripts through retina development. Two developmental phases were clearly divided with postnatal day 5 (PN5) as a separate cluster. Among 4,180 transcripts that changed significantly during development, approximately 2/3 of the genes were expressed at high levels up until PN5 and then declined whereas the other 1/3 of the genes increased expression from PN5 and remained at the higher levels until at least PN21. Less than 1% of the genes observed showed a peak of expression between the two phases. Among the later increased population, only about 40% of the genes are correlated with rod photoreceptors, indicating that multiple cell types contributed to gene expression in this phase. Within the same functional classes, however, different gene populations were expressed in distinct developmental phases. A correlation coefficient analysis of gene expression during retina development between previous SAGE studies and this study was also carried out.
This study provides a complementary genome-wide view of common gene dynamics and a broad molecular classification of mouse retina development. Different genes in the same functional clusters are expressed in the different developmental stages, suggesting that cells might change gene expression profiles from differentiation to maturation stages. We propose that large-scale changes in gene regulation during development are necessary for the final maturation and function of the retina.
PMCID: PMC1633734  PMID: 17044933
9.  Transcriptomic analysis of the entomopathogenic nematode Heterorhabditis bacteriophora TTO1 
BMC Genomics  2009;10:205.
The entomopathogenic nematode Heterorhabditis bacteriophora and its symbiotic bacterium, Photorhabdus luminescens, are important biological control agents of insect pests. This nematode-bacterium-insect association represents an emerging tripartite model for research on mutualistic and parasitic symbioses. Elucidation of mechanisms underlying these biological processes may serve as a foundation for improving the biological control potential of the nematode-bacterium complex. This large-scale expressed sequence tag (EST) analysis effort enables gene discovery and development of microsatellite markers. These ESTs will also aid in the annotation of the upcoming complete genome sequence of H. bacteriophora.
A total of 31,485 high quality ESTs were generated from cDNA libraries of the adult H. bacteriophora TTO1 strain. Cluster analysis revealed the presence of 3,051 contigs and 7,835 singletons, representing 10,886 distinct EST sequences. About 72% of the distinct EST sequences had significant matches (E value < 1e-5) to proteins in GenBank's non-redundant (nr) and Wormpep190 databases. We have identified 12 ESTs corresponding to 8 genes potentially involved in RNA interference, 22 ESTs corresponding to 14 genes potentially involved in dauer-related processes, and 51 ESTs corresponding to 27 genes potentially involved in defense and stress responses. Comparison to ESTs and proteins of free-living nematodes led to the identification of 554 parasitic nematode-specific ESTs in H. bacteriophora, among which are those encoding F-box-like/WD-repeat protein theromacin, Bax inhibitor-1-like protein, and PAZ domain containing protein. Gene Ontology terms were assigned to 6,685 of the 10,886 ESTs. A total of 168 microsatellite loci were identified with primers designable for 141 loci.
A total of 10,886 distinct EST sequences were identified from adult H. bacteriophora cDNA libraries. BLAST searches revealed ESTs potentially involved in parasitism, RNA interference, defense responses, stress responses, and dauer-related processes. The putative microsatellite markers identified in H. bacteriophora ESTs will enable genetic mapping and population genetic studies. These genomic resources provide the material base necessary for genome annotation, microarray development, and in-depth gene functional analysis.
PMCID: PMC2686736  PMID: 19405965
10.  Towards the ictalurid catfish transcriptome: generation and analysis of 31,215 catfish ESTs 
BMC Genomics  2007;8:177.
EST sequencing is one of the most efficient means for gene discovery and molecular marker development, and can be additionally utilized in both comparative genome analysis and evaluation of gene duplications. While much progress has been made in catfish genomics, large-scale EST resources have been lacking. The objectives of this project were to construct primary cDNA libraries, to conduct initial EST sequencing to generate catfish EST resources, and to obtain baseline information about highly expressed genes in various catfish organs to provide a guide for the production of normalized and subtracted cDNA libraries for large-scale transcriptome analysis in catfish.
A total of 17 cDNA libraries were constructed including 12 from channel catfish (Ictalurus punctatus) and 5 from blue catfish (I. furcatus). A total of 31,215 ESTs, with average length of 778 bp, were generated including 20,451 from the channel catfish and 10,764 from blue catfish. Cluster analysis indicated that 73% of channel catfish and 67% of blue catfish ESTs were unique within the project. Over 53% and 50% of the channel catfish and blue catfish ESTs, respectively, had significant similarities to known genes. All ESTs have been deposited in GenBank. Evaluation of the catfish EST resources demonstrated their potential for molecular marker development, comparative genome analysis, and evaluation of ancient and recent gene duplications. Subtraction of abundantly expressed genes in a variety of catfish tissues, identified here, will allow the production of low-redundancy libraries for in-depth sequencing.
The sequencing of 31,215 ESTs from channel catfish and blue catfish has significantly increased the EST resources in catfish. The EST resources should provide the potential for microarray development, polymorphic marker identification, mapping, and comparative genome analysis.
PMCID: PMC1906771  PMID: 17577415
11.  Identical Mutation in a Novel Retinal Gene Causes Progressive Rod-Cone Degeneration (prcd) in Dogs and Retinitis Pigmentosa in Man 
Genomics  2006;88(5):551-563.
Progressive rod-cone degeneration (prcd) is a late-onset, autosomal recessive photoreceptor degeneration of dogs, and a homolog for some forms of human retinitis pigmentosa (RP). Previously, the disease relevant interval was reduced to a 106 Kb region on CFA9, and a common phenotype-specific haplotype was identified in all affected dogs from several different breeds, and breed varieties. Screening of a canine retinal EST library identified partial cDNAs for novel candidate genes in the disease relevant interval. The complete cDNA of one of these, PRCD, was cloned in dog, human and mouse. The gene codes for a 54 amino acid (aa) protein in dog and human, and 53 aa protein in the mouse; the first 24 aa, coded for by exon 1, are highly conserved in 14 vertebrate species. A homozygous mutation (TGC → TAC) in the second codon shows complete concordance with the disorder in 18 different dog breeds/breed varieties tested. The same homozygous mutation was identified in a human patient from Bangladesh with autosomal recessive (ar) RP. Expression studies support the predominant expression of this gene in the retina, with equal expression in the retinal pigment epithelium (RPE), photoreceptors and ganglion cell layers. This study provides strong evidence that a mutation in the novel gene, PRCD, is the cause of autosomal recessive retinal degeneration in both dogs and man.
PMCID: PMC3989879  PMID: 16938425
Dogs; Disease Models, Animal; Genetic diversity; Genetic linkage; Genetic markers; Genetic predisposition to disease; Genetic variation; Mutation; Retinal Degeneration; Retinitis Pigmentosa
12.  Gene2EST: a BLAST2 server for searching expressed sequence tag (EST) databases with eukaryotic gene-sized queries 
Nucleic Acids Research  2001;29(6):1272-1277.
Expressed sequence tags (ESTs) are randomly sequenced cDNA clones. Currently, nearly 3 million human and 2 million mouse ESTs provide valuable resources that enable researchers to investigate the products of gene expression. The EST databases have proven to be useful tools for detecting homologous genes, for exon mapping, revealing differential splicing, etc. With the increasing availability of large amounts of poorly characterised eukaryotic (notably human) genomic sequence, ESTs have now become a vital tool for gene identification, sometimes yielding the only unambiguous evidence for the existence of a gene expression product. However, BLAST-based Web servers available to the general user have not kept pace with these developments and do not provide appropriate tools for querying EST databases with large highly spliced genes, often spanning 50 000–100 000 bases or more. Here we describe Gene2EST, a server that brings together a set of tools enabling efficient retrieval of ESTs matching large DNA queries and their subsequent analysis. RepeatMasker is used to mask dispersed repetitive sequences (such as Alu elements) in the query, BLAST2 for searching EST databases and Artemis for graphical display of the findings. Gene2EST combines these components into a Web resource targeted at the researcher who wishes to study one or a few genes to a high level of detail.
PMCID: PMC29756  PMID: 11238992
13.  Pattern analysis approach reveals restriction enzyme cutting abnormalities and other cDNA library construction artifacts using raw EST data 
BMC Biotechnology  2012;12:16.
Expressed Sequence Tag (EST) sequences are widely used in applications such as genome annotation, gene discovery and gene expression studies. However, some GenBank dbEST sequences have proven to be “unclean”. Identification of cDNA termini/ends and their structures in raw ESTs not only facilitates data quality control and accurate delineation of transcription ends, but also furthers our understanding of the potential sources of data abnormalities/errors present in the wet-lab procedures for cDNA library construction.
After analyzing a total of 309,976 raw Pinus taeda ESTs, we uncovered many distinct variations of cDNA termini, some of which prove to be good indicators of wet-lab artifacts, and characterized each raw EST by its cDNA terminus structure patterns. In contrast to the expected patterns, many ESTs displayed complex and/or abnormal patterns that represent potential wet-lab errors such as: a failure of one or both of the restriction enzymes to cut the plasmid vector; a failure of the restriction enzymes to cut the vector at the correct positions; the insertion of two cDNA inserts into a single vector; the insertion of multiple and/or concatenated adapters/linkers; the presence of 3′-end terminal structures in designated 5′-end sequences or vice versa; and so on. With a close examination of these artifacts, many problematic ESTs that have been deposited into public databases by conventional bioinformatics pipelines or tools could be cleaned or filtered by our methodology. We developed a software tool for Abnormality Filtering and Sequence Trimming for ESTs (AFST) using a pattern analysis approach. To compare AFST with other pipelines that submitted ESTs into dbEST, we reprocessed 230,783 Pinus taeda and 38,709 Arachis hypogaea GenBank ESTs. We found 7.4% of Pinus taeda and 29.2% of Arachis hypogaea GenBank ESTs are “unclean” or abnormal, all of which could be cleaned or filtered by AFST.
cDNA terminal pattern analysis, as implemented in the AFST software tool, can be utilized to reveal wet-lab errors such as restriction enzyme cutting abnormalities and chimeric EST sequences, detect various data abnormalities embedded in existing Sanger EST datasets, improve the accuracy of identifying and extracting bona fide cDNA inserts from raw ESTs, and therefore greatly benefit downstream EST-based applications.
PMCID: PMC3424822  PMID: 22554190
cDNA terminus; cDNA library construction; Pattern analysis; Restriction enzyme cutting abnormality; Chimeric EST sequences
14.  A comparison of expressed sequence tags (ESTs) to human genomic sequences. 
Nucleic Acids Research  1997;25(8):1626-1632.
The Expressed Sequence Tag (EST) division of GenBank, dbEST, is a large repository of the data being generated by human genome sequencing centers. ESTs are short, single pass cDNA sequences generated from randomly selected library clones. The approximately 415 000 human ESTs represent a valuable, low priced, and easily accessible biological reagent. As many ESTs are derived from yet uncharacterized genes, dbEST is a prime starting point for the identification of novel mRNAs. Conversely, other genes are represented by hundreds of ESTs, a redundancy which may provide data about rare mRNA isoforms. Here we present an analysis of >1000 ESTs generated by the WashU-Merck EST project. These ESTs were collected by querying dbEST with the genomic sequences of 15 human genes. When we aligned the matching ESTs to the genomic sequences, we found that in one gene, 73% of the ESTs which derive from spliced or partially spliced transcripts either contain intron sequences or are spliced at previously unreported sites; other genes have lower percentages of such ESTs, and some have none. This finding suggests that ESTs could provide researchers with novel information about alternative splicing in certain genes. In a related analysis of pairs of ESTs which are reported to derive from a single gene, we found that as many as 26% of the pairs do not BOTH align with the sequence of the same gene. We suspect that some of these unusual ESTs result from artifacts in EST generation, and caution researchers that they may find such clones while analyzing sequences in dbEST.
PMCID: PMC146621  PMID: 9092672
15.  Generation and analysis of expressed sequence tags (ESTs) for marker development in yam (Dioscorea alata L.) 
BMC Genomics  2011;12:100.
Anthracnose (Colletotrichum gloeosporioides) is a major limiting factor in the production of yam (Dioscorea spp.) worldwide. Availability of high quality sequence information is necessary for designing molecular markers associated with resistance. However, very limited sequence information pertaining to yam is available at public genome databases. Therefore, this collaborative project was developed for genetic improvement and germplasm characterization of yams using molecular markers. The current investigation is focused on studying gene expression, by large scale generation of ESTs, from one susceptible (TDa 95-0310) and two resistant yam genotypes (TDa 87-01091, TDa 95-0328) challenged with the fungus. Total RNA was isolated from young leaves of resistant and susceptible genotypes and cDNA libraries were sequenced using Roche 454 technology.
A total of 44,757 EST sequences were generated from the cDNA libraries of the resistant and susceptible genotypes. More than 56% of ESTs were annotated using the MapMan Mercator and Blast2GO tools. Gene annotations were used to characterize the transcriptome in yam and to perform a differential gene expression analysis between the resistant and susceptible EST datasets. Mining for SSRs in the ESTs revealed 1702 unique sequences containing SSRs, and 1705 SSR markers were designed using those sequences.
We have developed a comprehensive annotated transcriptome data set in yam to enrich the EST information in public databases. cDNA libraries were constructed from anthracnose fungus challenged leaf tissues for transcriptome characterization, and differential gene expression analysis. Thus, it helped in identifying unique transcripts in each library for disease resistance. These EST resources provide the basis for future microarray development, marker validation, genetic linkage mapping and QTL analysis in Dioscorea species.
PMCID: PMC3047301  PMID: 21303556
16.  Exploring nervous system transcriptomes during embryogenesis and metamorphosis in Xenopus tropicalis using EST analysis 
BMC Genomics  2007;8:118.
The western African clawed frog Xenopus tropicalis is an anuran amphibian species now used as model in vertebrate comparative genomics. It provides the same advantages as Xenopus laevis but is diploid and has a smaller genome of 1.7 Gbp. Therefore X. tropicalis is more amenable to systematic transcriptome surveys. We initiated a large-scale partial cDNA sequencing project to provide a functional genomics resource on genes expressed in the nervous system during early embryogenesis and metamorphosis in X. tropicalis.
A gene index was defined and analysed after the collection of over 48,785 high quality sequences. These partial cDNA sequences were obtained from an embryonic head and retina library (30,272 sequences) and from a metamorphic brain and spinal cord library (27,602 sequences). These ESTs are estimated to represent 9,693 transcripts derived from an estimated 6,000 genes. Comparison of these cDNA sequences with protein databases indicates that 46% contain their start codon. Further annotation included Gene Ontology functional classification, InterPro domain analysis, alternative splicing and non-coding RNA identification. Gene expression profiles were derived from EST counts and used to define transcripts specific to metamorphic stages of development. Moreover, these ESTs allowed identification of a set of 225 polymorphic microsatellites that can be used as genetic markers.
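Deriving expression profiles from EST counts, as described above, amounts to comparing how often a transcript is sampled in each library after correcting for library size. The following is a minimal sketch of that idea, not the authors' procedure: the function names, the pseudocount and the example counts are hypothetical, and only the two library sizes are taken from the abstract.

```python
def per_10k(count, library_size):
    """EST count scaled to a library of 10,000 sequences."""
    return 10_000 * count / library_size

def fold_change(count_a, size_a, count_b, size_b, pseudocount=0.5):
    """Ratio of normalized EST frequencies (library A over library B).

    The pseudocount (an assumption of this sketch) avoids division by
    zero when a transcript is absent from one library.
    """
    a = per_10k(count_a + pseudocount, size_a)
    b = per_10k(count_b + pseudocount, size_b)
    return a / b

# Library sizes as given in the abstract; the counts (20 vs 2) are hypothetical.
EMBRYONIC_SIZE = 30_272    # embryonic head and retina library
METAMORPHIC_SIZE = 27_602  # metamorphic brain and spinal cord library
ratio = fold_change(20, EMBRYONIC_SIZE, 2, METAMORPHIC_SIZE)
```

A ratio well above 1 suggests enrichment in the first library, but with counts this small a statistical test on the raw counts would be needed before calling a transcript stage-specific.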
These cDNA sequences permit in silico cloning of numerous genes and will facilitate studies aimed at deciphering the roles of cognate genes expressed in the nervous system during neural development and metamorphosis. The genomic resources developed to study X. tropicalis biology will accelerate exploration of amphibian physiology and genetics. In particular, the model will facilitate analysis of key questions related to anuran embryogenesis and metamorphosis and its associated regulatory processes.
PMCID: PMC1890556  PMID: 17506875
17.  Construction and EST sequencing of full-length, drought stress cDNA libraries for common beans (Phaseolus vulgaris L.) 
BMC Plant Biology  2011;11:171.
Common bean is an important legume crop with only a moderate number of short expressed sequence tags (ESTs) made with traditional methods. The goal of this research was to use full-length cDNA technology to develop ESTs that would overlap with the beginning of open reading frames and therefore be useful for gene annotation of genomic sequences. The library was also constructed to represent genes expressed under drought, low soil phosphorus and high soil aluminum toxicity. We also undertook comparisons of the full-length cDNA library to two previous non-full clone EST sets for common bean.
Two full-length cDNA libraries were constructed: one for the drought tolerant Mesoamerican genotype BAT477 and the other one for the acid-soil tolerant Andean genotype G19833 which has been selected for genome sequencing. Plants were grown in three soil types using deep rooting cylinders subjected to drought and non-drought stress and tissues were collected from both roots and above ground parts. A total of 20,000 clones were selected robotically, half from each library. Then, nearly 10,000 clones from the G19833 library were sequenced with an average read length of 850 nucleotides. A total of 4,219 unigenes were identified consisting of 2,981 contigs and 1,238 singletons. These were functionally annotated with gene ontology terms and placed into KEGG pathways. Compared to other EST sequencing efforts in common bean, about half of the sequences were novel or represented the 5' ends of known genes.
The present full-length cDNA libraries add to the technological toolbox available for common bean and our sequencing of these clones substantially increases the number of unique EST sequences available for the common bean genome. All of this should be useful for functional gene annotation, analysis of splice site variants and intron/exon boundary determination by comparison to soybean genes or with common bean whole-genome sequences. In addition, the library has a large number of transcription factors and will be interesting for discovery and validation of drought or abiotic stress related genes in common bean.
PMCID: PMC3240127  PMID: 22118559
18.  Generation, analysis and functional annotation of expressed sequence tags from the sheepshead minnow (Cyprinodon variegatus) 
BMC Genomics  2010;11(Suppl 2):S4.
The sheepshead minnow (Cyprinodon variegatus) is a small fish capable of withstanding exposure to very low levels of dissolved oxygen, as well as extreme temperatures and salinities. It is an important model in understanding the impacts of, and biological response to, hypoxia and co-occurring compounding stressors such as polycyclic aromatic hydrocarbons, endocrine disrupting chemicals, metals and herbicides. Here, we initiated a project to sequence and analyze over 10,000 ESTs generated from the sheepshead minnow as a resource for investigating stressor responses.
We sequenced 10,858 EST clones using a normalized cDNA library made from larval, embryonic and adult suppression subtractive hybridization-PCR (SSH) libraries. Post-sequencing processing led to 8,099 high quality sequences. Clustering analysis of these ESTs identified 4,223 unique sequences comprising 1,053 contigs and 3,170 singletons. BLASTX searches produced 1,394 significant (E-value < 10^-5) hits and further Gene Ontology (GO) analysis annotated 388 of these genes. All the EST sequences were deposited in the Expressed Sequence Tags database (dbEST) of GenBank (GenBank: GE329585 to GE337683). Gene discovery and annotations are presented and discussed. This set of ESTs represents a significant proportion of the sheepshead minnow (Cyprinodon variegatus) transcriptome, and provides a material basis for the development of microarrays useful for further gene expression studies in association with stressors such as hypoxia, cadmium, chromium and pyrene.
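The E-value cutoff applied to the BLASTX results above can be sketched as a simple filter. This is an illustration under stated assumptions rather than the authors' pipeline: it assumes BLAST tabular output (`-outfmt 6`) with the default column order, in which the E-value is the 11th column, and the function name `significant_queries` and the two sample rows are hypothetical.

```python
import csv
import io

def significant_queries(blast_tsv, max_evalue=1e-5):
    """Return the IDs of queries with at least one hit below max_evalue.

    blast_tsv is any iterable of lines in BLAST tabular format
    (default -outfmt 6 columns: qseqid, sseqid, pident, length, mismatch,
    gapopen, qstart, qend, sstart, send, evalue, bitscore).
    """
    keep = set()
    for row in csv.reader(blast_tsv, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        if float(row[10]) < max_evalue:  # 11th column holds the E-value
            keep.add(row[0])
    return keep

# Two made-up hits: one strong, one too weak to pass the cutoff.
sample = (
    "est1\tprotA\t85.0\t120\t18\t0\t1\t360\t5\t124\t3e-20\t95.1\n"
    "est2\tprotB\t40.0\t50\t30\t0\t1\t150\t10\t59\t0.5\t28.0\n"
)
hits = significant_queries(io.StringIO(sample))
```

Counting the returned set gives the number of sequences with a significant hit, which is the kind of figure reported in the abstract.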
PMCID: PMC2975421  PMID: 21047385
19.  Aphid biology: Expressed genes from alate Toxoptera citricida, the brown citrus aphid 
The brown citrus aphid, Toxoptera citricida (Kirkaldy), is considered the primary vector of citrus tristeza virus, a severe pathogen which causes losses to citrus industries worldwide. The alate (winged) form of this aphid can readily fly long distances with the wind, thus spreading citrus tristeza virus in citrus growing regions. To better understand the biology of the brown citrus aphid and the emergence of genes expressed during wing development, we undertook a large-scale 5′ end sequencing project of cDNA clones from alate aphids. Similar large-scale expressed sequence tag (EST) sequencing projects from other insects have provided a vehicle for answering biological questions relating to development and physiology. Although there is a growing database in GenBank of ESTs from insects, most are from Drosophila melanogaster and Anopheles gambiae, with relatively few specifically derived from aphids. However, important morphogenetic processes are exclusively associated with piercing-sucking insect development and sap feeding insect metabolism. In this paper, we describe the first public data set of ESTs from the brown citrus aphid, T. citricida. The cDNA library was derived from alate adults due to their significance in spreading viruses (e.g., citrus tristeza virus). Over 5180 cDNA clones were sequenced, resulting in 4263 high-quality ESTs. Contig alignment of these ESTs resulted in 2124 total assembled sequences, including both contiguous sequences and singlets. Approximately 33% of the ESTs currently have no significant match in either the non-redundant protein or nucleic acid databases. Sequences returning matches with an E-value of ≤ −10 using BLASTX, BLASTN, or TBLASTX were annotated based on their putative molecular function and biological process using the Gene Ontology classification system. 
These data will aid research efforts in the identification of important genes within insects, specifically aphids and other sap feeding insects within the Order Hemiptera.
The sequence data described in this paper have been submitted to GenBank's dbEST under the following accession numbers: CB814527-CB814982, CB832665-CB833296, CB854878-CB855147, CB909714-CB910020, CB936196-CB936346, CD449954-CD450759.
expressed sequence tag
PMCID: PMC524662  PMID: 15841239
Aphididae; cDNA; EST; Gene expression; Hemiptera; Development; Toxoptera
20.  Expressed sequence tags from larval gut of the European corn borer (Ostrinia nubilalis): Exploring candidate genes potentially involved in Bacillus thuringiensis toxicity and resistance 
BMC Genomics  2009;10:286.
Lepidoptera represents more than 160,000 insect species which include some of the most devastating pests of crops, forests, and stored products. However, the genomic information on lepidopteran insects is very limited. Only a few studies have focused on developing expressed sequence tag (EST) libraries from the guts of lepidopteran larvae. Knowledge of the genes that are expressed in the insect gut is crucial for understanding the basic physiology of food digestion, their interactions with Bacillus thuringiensis (Bt) toxins, and for discovering new targets for novel toxins for use in pest management. This study analyzed the ESTs generated from the larval gut of the European corn borer (ECB, Ostrinia nubilalis), one of the most destructive pests of corn in North America and the western world. Our goals were to establish an ECB larval gut-specific EST database as a genomic resource for future research and to explore candidate genes potentially involved in insect-Bt interactions and Bt resistance in ECB.
We constructed two cDNA libraries from the guts of the fifth-instar larvae of ECB and sequenced a total of 15,000 ESTs from these libraries. A total of 12,519 ESTs (83.4%) appeared to be high quality with an average length of 656 bp. These ESTs represented 2,895 unique sequences, including 1,738 singletons and 1,157 contigs. Among the unique sequences, 62.7% encoded putative proteins that shared significant sequence similarities (E-value ≤ 10^-3) with the sequences available in GenBank. Our EST analysis revealed 52 candidate genes that potentially have roles in Bt toxicity and resistance. These genes encode 18 trypsin-like proteases, 18 chymotrypsin-like proteases, 13 aminopeptidases, 2 alkaline phosphatases and 1 cadherin-like protein. Comparisons of expression profiles of 41 selected candidate genes between Cry1Ab-susceptible and resistant strains of ECB by RT-PCR showed apparently decreased expression of 2 trypsin-like and 2 chymotrypsin-like protease genes, and 1 aminopeptidase gene in the resistant strain as compared with the susceptible strain. In contrast, the expression of 3 trypsin-like and 3 chymotrypsin-like protease genes, 2 aminopeptidase genes, and 2 alkaline phosphatase genes was increased in the resistant strain. Such differential expression of the candidate genes may suggest their involvement in Cry1Ab resistance. Indeed, certain trypsin-like and chymotrypsin-like proteases have previously been found to activate or degrade Bt protoxins and toxins, whereas several aminopeptidases, cadherin-like proteins and alkaline phosphatases have been demonstrated to serve as Bt receptor proteins in other insect species.
We developed a relatively large EST database consisting of 12,519 high-quality sequences from a total of 15,000 cDNAs from the larval gut of ECB. To our knowledge, this database represents the largest gut-specific EST database from a lepidopteran pest. Our work provides a foundation for future research to develop an ECB gut-specific DNA microarray which can be used to analyze the global changes of gene expression in response to Bt protoxins/toxins and the genetic difference(s) between Bt-resistant and susceptible strains. Furthermore, we identified 52 candidate genes that may be involved in Bt toxicity and resistance. Differential expression of 15 out of the 41 selected candidate genes examined by RT-PCR, including 5 genes with apparently decreased expression and 10 with increased expression in the Cry1Ab-resistant strain, may help us conclusively identify the candidate genes involved in Bt resistance and provide us with new insights into the mechanism of Cry1Ab resistance in ECB.
PMCID: PMC2717985  PMID: 19558725
21.  Negative Subtraction Hybridization: An efficient method to isolate large numbers of condition-specific cDNAs 
BMC Genomics  2004;5:22.
The construction of cDNA libraries is a useful tool to understand gene expression in organisms under different conditions, but random sequencing of unbiased cDNA collections is laborious and can give rise to redundant EST collections.
We aimed to isolate cDNAs of messages induced by switching Aspergillus nidulans from growth on glucose to growth on selected polysaccharides. Approximately 4,700 contigs from 12,320 ESTs were already available from a cDNA library representing transcripts isolated from glucose-grown A. nidulans during asexual development. Our goals were to expand the cDNA collection without repeated sequencing of previously identified ESTs and to find as many transcripts as possible that are specifically induced in complex polysaccharide metabolism.
We have devised a Negative Subtraction Hybridization (NSH) method and tested it in A. nidulans. NSH entails screening a plasmid library made from cDNAs prepared from cells grown under a selected physiological condition with labeled cDNA probes prepared from another physiological condition. Plasmids with inserts that failed to hybridize to cDNA probes through two rounds of screening (i.e. negatives) indicate that they are transcripts present at low concentration in the labeled probe pool. Thus, these transcripts will be predominantly condition-specific, along with some rare transcripts.
In a screen for transcripts induced by switching the carbon source from glucose to 12 selected polysaccharides, 3,532 negatives were isolated from approximately 100,000 surveyed colonies using this method. Negative clones were end-sequenced and assembled into 2,039 contigs, of which 1,722 were not present in the previously characterized glucose-grown cDNA library. Single-channel microarray hybridization experiments confirmed that the majority of the negatives represented genes that were differentially induced by a switch from growth in glucose to one or more of the polysaccharides.
The Negative Subtraction Hybridization method described here has several practical benefits. This method can be used to screen any existing cDNA library, including full-length and pooled libraries, and does not rely on PCR or sequence information. In addition, NSH is a cost-effective method for the isolation of novel, full-length cDNAs for differentially expressed transcripts or enrichment of rare transcripts.
PMCID: PMC400731  PMID: 15050035
22.  Construction and Application of an Electronic Spatiotemporal Expression Profile and Gene Ontology Analysis Platform Based on the EST Database of the Silkworm, Bombyx mori  
An Expressed Sequence Tag (EST) is a short sub-sequence of a transcribed cDNA sequence. ESTs represent gene expression and give good clues for gene expression analysis. Based on EST data obtained from NCBI, an EST analysis package was developed (apEST). This tool was programmed for electronic expression, protein annotation and Gene Ontology (GO) category analysis in Bombyx mori (L.) (Lepidoptera: Bombycidae). A total of 245,761 ESTs (as of 01 July 2009) were searched and downloaded in FASTA format, from which information for tissue type, development stage, sex and strain was extracted, classified and summed by running apEST. Then, corresponding distribution profiles were formed after redundant parts had been removed. Gene expression profiles for one tissue at different developmental stages and for one developmental stage across different tissues were obtained. A housekeeping gene and tissue-and-stage-specific genes were selected by running apEST, contrasting with two other online analysis approaches, microarray-based gene expression profile on SilkDB (BmMDB) and EST profile on NCBI. A spatio-temporal expression profile of catalase run by apEST was then presented as a three-dimensional graph for the intuitive visualization of patterns. A total of 37 query genes confirmed from microarray data and RT-PCR experiments were selected as queries to test apEST. The results had great conformity among three approaches. Nevertheless, there were minor differences between apEST and BmMDB because of the unique items investigated. Therefore, complementary analysis was proposed. Application of apEST also led to the acquisition of corresponding protein annotations for EST datasets and eventually for their functions. The results were presented according to statistical information on protein annotation and Gene Ontology (GO) category. These all verified the reliability of apEST and the operability of this platform. 
The apEST can also be applied in other species by modifying some parameters and serves as a model for gene expression study for Lepidoptera.
PMCID: PMC3016962  PMID: 20874595
EST analysis package; UniGene; Lepidoptera
23.  Comparative genomic mapping of uncharacterized canine retinal ESTs to identify novel candidate genes for hereditary retinal disorders 
Molecular Vision  2009;15:927-936.
To identify the genomic location of previously uncharacterized canine retina-expressed expressed sequence tags (ESTs), and thus identify potential candidate genes for heritable retinal disorders.
A set of over 500 retinal canine ESTs were mapped onto the canine genome using the RHDF5000–2 radiation hybrid (RH) panel, and the resulting map positions were compared to their respective localization in the CanFam2 assembly of the canine genome sequence.
Unique map positions could be assigned for 99% of the mapped clones, of which only 29% showed significant homology to known RefSeq sequences. A comparison between RH map and sequence assembly indicated some areas of discrepancy. Retinal expressed genes were not concentrated in particular areas of the canine genome, and also were located on the canine Y chromosome (CFAY). Several of the EST clones were located within areas of conserved synteny to human retinal disease loci.
RH mapping of canine retinal ESTs provides insight into the location of potential candidate genes for hereditary retinal disorders, and, by comparison with the assembled canine genome sequence, highlights inconsistencies with the current assembly. Regions of conserved synteny between the canine and the human genomes allow this information to be extrapolated to identify potential positional candidate genes for mapped human retinal disorders. Furthermore, these ESTs can help identify novel or uncharacterized genes of significance for better understanding of retinal morphology, physiology, and pathology.
PMCID: PMC2683029  PMID: 19452016
24.  The first set of EST resource for gene discovery and marker development in pigeonpea (Cajanus cajan L.) 
BMC Plant Biology  2010;10:45.
Pigeonpea (Cajanus cajan (L.) Millsp) is one of the major grain legume crops of the tropics and subtropics, but biotic stresses [Fusarium wilt (FW), sterility mosaic disease (SMD), etc.] are serious challenges for sustainable crop production. Modern genomic tools such as molecular markers and candidate genes associated with resistance to these stresses offer the possibility of facilitating pigeonpea breeding for improving biotic stress resistance. The limited availability of genomic resources, however, is a serious bottleneck to undertaking molecular breeding in pigeonpea to develop superior genotypes with enhanced resistance to the above-mentioned biotic stresses. With the objective of enhancing genomic resources in pigeonpea, this study reports the generation and analysis of a comprehensive resource of FW- and SMD-responsive expressed sequence tags (ESTs).
A total of 16 cDNA libraries were constructed from four pigeonpea genotypes that are resistant and susceptible to FW ('ICPL 20102' and 'ICP 2376') and SMD ('ICP 7035' and 'TTB 7') and a total of 9,888 (9,468 high quality) ESTs were generated and deposited in dbEST of GenBank under accession numbers GR463974 to GR473857 and GR958228 to GR958231. Clustering and assembly analyses of these ESTs resulted in 4,557 unique sequences (unigenes) including 697 contigs and 3,860 singletons. BLASTN analysis of 4,557 unigenes showed a significant identity with ESTs of different legumes (23.2-60.3%), rice (28.3%), Arabidopsis (33.7%) and poplar (35.4%). As expected, pigeonpea ESTs are more closely related to soybean (60.3%) and cowpea ESTs (43.6%) than other plant ESTs. Similarly, BLASTX similarity results showed that only 1,603 (35.1%) out of 4,557 total unigenes correspond to known proteins in the UniProt database (≤ 1E-08). Functional categorization of the annotated unigene sequences showed that 153 (3.3%) genes were assigned to the cellular component category, 132 (2.8%) to biological process, and 132 (2.8%) to molecular function. Further, 19 genes were identified as differentially expressed between FW-responsive genotypes and 20 between SMD-responsive genotypes. Generated ESTs were compiled together with 908 ESTs available in the public domain at the time of analysis, and a set of 5,085 unigenes was defined and used for identification of molecular markers in pigeonpea. For instance, 3,583 simple sequence repeat (SSR) motifs were identified in 1,365 unigenes and 383 primer pairs were designed. Assessment of a set of 84 primer pairs on 40 elite pigeonpea lines showed polymorphism with 15 (28.8%) markers with an average of four alleles per marker and an average polymorphic information content (PIC) value of 0.40. Similarly, in silico mining of 133 contigs with ≥ 5 sequences detected 102 single nucleotide polymorphisms (SNPs) in 37 contigs. 
As an example, a set of 10 contigs was used for confirming in silico predicted SNPs in a set of four genotypes using wet lab experiments. The occurrence of SNPs was confirmed for all 6 contigs for which scorable and sequenceable amplicons were generated. PCR amplicons were not obtained for 4 contigs. Recognition sites for restriction enzymes were identified for 102 SNPs in 37 contigs, which indicates the possibility of assaying SNPs in 37 genes using the cleaved amplified polymorphic sequences (CAPS) assay.
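The polymorphic information content (PIC) reported for the SSR markers above is a function of allele frequencies at a marker. The abstract does not say which formula was used; the sketch below assumes the widely used Botstein et al. (1980) definition, PIC = 1 - Σ p_i² - Σ_{i<j} 2 p_i² p_j², and the function name `pic` and the example frequencies are hypothetical.

```python
def pic(allele_freqs):
    """Polymorphic information content for one marker.

    allele_freqs: allele frequencies at the marker, summing to 1.
    Uses PIC = 1 - sum(p_i^2) - sum_{i<j} 2 * p_i^2 * p_j^2
    (assumed formula; see the lead-in).
    """
    p = list(allele_freqs)
    if abs(sum(p) - 1.0) > 1e-6:
        raise ValueError("allele frequencies must sum to 1")
    homozygosity = sum(x * x for x in p)
    cross_term = sum(
        2 * p[i] ** 2 * p[j] ** 2
        for i in range(len(p))
        for j in range(i + 1, len(p))
    )
    return 1.0 - homozygosity - cross_term

# Hypothetical markers: two equally frequent alleles, and four equally frequent alleles.
two_alleles = pic([0.5, 0.5])
four_alleles = pic([0.25, 0.25, 0.25, 0.25])
```

A monomorphic marker (a single allele at frequency 1) gives a PIC of 0, and the value rises as alleles become more numerous and more evenly distributed.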
The pigeonpea EST dataset generated here provides a transcriptomic resource for gene discovery and development of functional markers associated with biotic stress resistance. Sequence analyses of this dataset have shown conservation of a considerable number of pigeonpea transcripts across the legume and model plant species analysed, as well as some putative pigeonpea-specific genes. Validation of identified biotic stress responsive genes should provide candidate genes for allele mining as well as candidate markers for molecular breeding.
PMCID: PMC2923520  PMID: 20222972
25.  Microarrays for global expression constructed with a low redundancy set of 27,500 sequenced cDNAs representing an array of developmental stages and physiological conditions of the soybean plant 
BMC Genomics  2004;5:73.
Microarrays are an important tool with which to examine coordinated gene expression. Soybean (Glycine max) is one of the most economically valuable crop species in the world food supply. In order to accelerate both gene discovery as well as hypothesis-driven research in soybean, global expression resources needed to be developed. The applications of microarray for determining patterns of expression in different tissues or during conditional treatments by dual labeling of the mRNAs are unlimited. In addition, discovery of the molecular basis of traits through examination of naturally occurring variation in hundreds of mutant lines could be enhanced by the construction and use of soybean cDNA microarrays.
We report the construction and analysis of a low redundancy 'unigene' set of 27,513 clones that represent a variety of soybean cDNA libraries made from a wide array of source tissue and organ systems, developmental stages, and stress or pathogen-challenged plants.
The set was assembled from the 5' sequence data of the cDNA clones using cluster analysis programs. The selected clones were then physically reracked and sequenced at the 3' end. In order to increase gene discovery from immature cotyledon libraries that contain abundant mRNAs representing storage protein gene families, we utilized a high density filter normalization approach to preferentially select more weakly expressed cDNAs. All 27,513 cDNA inserts were amplified by polymerase chain reaction. The amplified products, along with some repetitively spotted control or 'choice' clones, were used to produce three 9,728-element microarrays that have been used to examine tissue specific gene expression and global expression in mutant isolines.
Global expression studies will be greatly aided by the availability of the sequence-validated and low redundancy cDNA sets described in this report. These cDNAs and ESTs represent a wide array of developmental stages and physiological conditions of the soybean plant. We also demonstrate that the quality of the data from the soybean cDNA microarrays is sufficiently reliable to examine isogenic lines that differ with respect to a mutant phenotype and thereby to define a small list of candidate genes potentially encoding or modulated by the mutant phenotype.
PMCID: PMC526184  PMID: 15453914

