Expressed Sequence Tags (ESTs) are a source of simple sequence repeats (SSRs) that can be used to develop molecular markers for genetic studies. The availability of ESTs for Quercus robur and Quercus petraea provided a unique opportunity to develop microsatellite markers to accelerate research aimed at studying adaptation of these long-lived species to their environment. As a first step toward the construction of a SSR-based linkage map of oak for quantitative trait locus (QTL) mapping, we describe the mining and survey of EST-SSRs as well as a fast and cost-effective approach (bin mapping) to assign these markers to an approximate map position. We also compared the level of polymorphism between genomic and EST-derived SSRs and address the transferability of EST-SSRs in Castanea sativa (chestnut).
A catalogue of 103,000 Sanger ESTs was assembled into 28,024 unigenes from which 18.6% presented one or more SSR motifs. More than 42% of these SSRs corresponded to trinucleotides. Primer pairs were designed for 748 putative unigenes. Overall 37.7% (283) were found to amplify a single polymorphic locus in a reference full-sib pedigree of Quercus robur. The usefulness of these loci for establishing a genetic map was assessed using a bin mapping approach. Bin maps were constructed for the male and female parental tree for which framework linkage maps based on AFLP markers were available. The bin set consisting of 14 highly informative offspring selected based on the number and position of crossover sites. The female and male maps comprised 44 and 37 bins, with an average bin length of 16.5 cM and 20.99 cM, respectively. A total of 256 EST-SSRs were assigned to bins and their map position was further validated by linkage mapping. EST-SSRs were found to be less polymorphic than genomic SSRs, but their transferability rate to chestnut, a phylogenetically related species to oak, was higher.
We have generated a bin map for oak comprising 256 EST-SSRs. This resource constitutes a first step toward the establishment of a gene-based map for this genus that will facilitate the dissection of QTLs affecting complex traits of ecological importance.
Genetic markers and linkage mapping are basic prerequisites for comparative genetic analyses, QTL detection and map-based cloning. A large number of mapping populations have been developed for oak, but few gene-based markers are available for constructing integrated genetic linkage maps and comparing gene order and QTL location across related species.
We developed a set of 573 expressed sequence tag-derived simple sequence repeats (EST-SSRs) and located 397 markers (EST-SSRs and genomic SSRs) on the 12 oak chromosomes (2n = 2x = 24) on the basis of Mendelian segregation patterns in 5 full-sib mapping pedigrees of two species: Quercus robur (pedunculate oak) and Quercus petraea (sessile oak). Consensus maps for the two species were constructed and aligned. They showed a high degree of macrosynteny between these two sympatric European oaks. We assessed the transferability of EST-SSRs to other Fagaceae genera and a subset of these markers was mapped in Castanea sativa, the European chestnut. Reasonably high levels of macrosynteny were observed between oak and chestnut. We also obtained diversity statistics for a subset of EST-SSRs, to support further population genetic analyses with gene-based markers. Finally, based on the orthologous relationships between the oak, Arabidopsis, grape, poplar, Medicago, and soybean genomes and the paralogous relationships between the 12 oak chromosomes, we propose an evolutionary scenario of the 12 oak chromosomes from the eudicot ancestral karyotype.
This study provides map locations for a large set of EST-SSRs in two oak species of recognized biological importance in natural ecosystems. This first step toward the construction of a gene-based linkage map will facilitate the assignment of future genome scaffolds to pseudo-chromosomes. This study also provides an indication of the potential utility of new gene-based markers for population genetics and comparative mapping within and beyond the Fagaceae.
Microsatellites or simple sequence repeats (SSRs) in expressed sequence tags (ESTs) are useful resources for genome analysis because of their abundance, functionality and polymorphism. The advent of commercial second generation sequencing machines has lead to new strategies for developing EST-SSR markers, necessitating the development of bioinformatic framework that can keep pace with the increasing quality and quantity of sequence data produced. We describe an open scheme for analyzing ESTs and developing EST-SSR markers from reads collected by Sanger sequencing and pyrosequencing of sugi (Cryptomeria japonica).
We collected 141,097 sequence reads by Sanger sequencing and 1,333,444 by pyrosequencing. After trimming contaminant and low quality sequences, 118,319 Sanger and 1,201,150 pyrosequencing reads were passed to the MIRA assembler, generating 81,284 contigs that were analysed for SSRs. 4,059 SSRs were found in 3,694 (4.54%) contigs, giving an SSR frequency lower than that in seven other plant species with gene indices (5.4–21.9%). The average GC content of the SSR-containing contigs was 41.55%, compared to 40.23% for all contigs. Tri-SSRs were the most common SSRs; the most common motif was AT, which was found in 655 (46.3%) di-SSRs, followed by the AAG motif, found in 342 (25.9%) tri-SSRs. Most (72.8%) tri-SSRs were in coding regions, but 55.6% of the di-SSRs were in non-coding regions; the AT motif was most abundant in 3′ untranslated regions. Gene ontology (GO) annotations showed that six GO terms were significantly overrepresented within SSR-containing contigs. Forty–four EST-SSR markers were developed from 192 primer pairs using two pipelines: read2Marker and the newly-developed CMiB, which combines several open tools. Markers resulting from both pipelines showed no differences in PCR success rate and polymorphisms, but PCR success and polymorphism were significantly affected by the expected PCR product size and number of SSR repeats, respectively. EST-SSR markers exhibited less polymorphism than genomic SSRs.
We have created a new open pipeline for developing EST-SSR markers and applied it in a comprehensive analysis of EST-SSRs and EST-SSR markers in C. japonica. The results will be useful in genomic analyses of conifers and other non-model species.
Cork oak (Quercus suber) is one of the rare trees with the ability to produce cork, a material widely used to make wine bottle stoppers, flooring and insulation materials, among many other uses. The molecular mechanisms of cork formation are still poorly understood, in great part due to the difficulty in studying a species with a long life-cycle and for which there is scarce molecular/genomic information. Cork oak forests are of great ecological importance and represent a major economic and social resource in Southern Europe and Northern Africa. However, global warming is threatening the cork oak forests by imposing thermal, hydric and many types of novel biotic stresses. Despite the economic and social value of the Q. suber species, few genomic resources have been developed, useful for biotechnological applications and improved forest management.
We generated in excess of 7 million sequence reads, by pyrosequencing 21 normalized cDNA libraries derived from multiple Q. suber tissues and organs, developmental stages and physiological conditions. We deployed a stringent sequence processing and assembly pipeline that resulted in the identification of ~159,000 unigenes. These were annotated according to their similarity to known plant genes, to known Interpro domains, GO classes and E.C. numbers. The phylogenetic extent of this ESTs set was investigated, and we found that cork oak revealed a significant new gene space that is not covered by other model species or EST sequencing projects. The raw data, as well as the full annotated assembly, are now available to the community in a dedicated web portal at http://www.corkoakdb.org.
This genomic resource represents the first trancriptome study in a cork producing species. It can be explored to develop new tools and approaches to understand stress responses and developmental processes in forest trees, as well as the molecular cascades underlying cork differentiation and disease response.
Pigeonpea (Cajanus cajan (L.) Millsp) is one of the major grain legume crops of the tropics and subtropics, but biotic stresses [Fusarium wilt (FW), sterility mosaic disease (SMD), etc.] are serious challenges for sustainable crop production. Modern genomic tools such as molecular markers and candidate genes associated with resistance to these stresses offer the possibility of facilitating pigeonpea breeding for improving biotic stress resistance. Availability of limited genomic resources, however, is a serious bottleneck to undertake molecular breeding in pigeonpea to develop superior genotypes with enhanced resistance to above mentioned biotic stresses. With an objective of enhancing genomic resources in pigeonpea, this study reports generation and analysis of comprehensive resource of FW- and SMD- responsive expressed sequence tags (ESTs).
A total of 16 cDNA libraries were constructed from four pigeonpea genotypes that are resistant and susceptible to FW ('ICPL 20102' and 'ICP 2376') and SMD ('ICP 7035' and 'TTB 7') and a total of 9,888 (9,468 high quality) ESTs were generated and deposited in dbEST of GenBank under accession numbers GR463974 to GR473857 and GR958228 to GR958231. Clustering and assembly analyses of these ESTs resulted into 4,557 unique sequences (unigenes) including 697 contigs and 3,860 singletons. BLASTN analysis of 4,557 unigenes showed a significant identity with ESTs of different legumes (23.2-60.3%), rice (28.3%), Arabidopsis (33.7%) and poplar (35.4%). As expected, pigeonpea ESTs are more closely related to soybean (60.3%) and cowpea ESTs (43.6%) than other plant ESTs. Similarly, BLASTX similarity results showed that only 1,603 (35.1%) out of 4,557 total unigenes correspond to known proteins in the UniProt database (≤ 1E-08). Functional categorization of the annotated unigenes sequences showed that 153 (3.3%) genes were assigned to cellular component category, 132 (2.8%) to biological process, and 132 (2.8%) in molecular function. Further, 19 genes were identified differentially expressed between FW- responsive genotypes and 20 between SMD- responsive genotypes. Generated ESTs were compiled together with 908 ESTs available in public domain, at the time of analysis, and a set of 5,085 unigenes were defined that were used for identification of molecular markers in pigeonpea. For instance, 3,583 simple sequence repeat (SSR) motifs were identified in 1,365 unigenes and 383 primer pairs were designed. Assessment of a set of 84 primer pairs on 40 elite pigeonpea lines showed polymorphism with 15 (28.8%) markers with an average of four alleles per marker and an average polymorphic information content (PIC) value of 0.40. Similarly, in silico mining of 133 contigs with ≥ 5 sequences detected 102 single nucleotide polymorphisms (SNPs) in 37 contigs. As an example, a set of 10 contigs were used for confirming in silico predicted SNPs in a set of four genotypes using wet lab experiments. Occurrence of SNPs were confirmed for all the 6 contigs for which scorable and sequenceable amplicons were generated. PCR amplicons were not obtained in case of 4 contigs. Recognition sites for restriction enzymes were identified for 102 SNPs in 37 contigs that indicates possibility of assaying SNPs in 37 genes using cleaved amplified polymorphic sequences (CAPS) assay.
The pigeonpea EST dataset generated here provides a transcriptomic resource for gene discovery and development of functional markers associated with biotic stress resistance. Sequence analyses of this dataset have showed conservation of a considerable number of pigeonpea transcripts across legume and model plant species analysed as well as some putative pigeonpea specific genes. Validation of identified biotic stress responsive genes should provide candidate genes for allele mining as well as candidate markers for molecular breeding.
Cucumber, Cucumis sativus L., is an economically and nutritionally important crop of the Cucurbitaceae family and has long served as a primary model system for sex determination studies. Recently, the sequencing of its whole genome has been completed. However, transcriptome information of this species is still scarce, with a total of around 8,000 Expressed Sequence Tag (EST) and mRNA sequences currently available in GenBank. In order to gain more insights into molecular mechanisms of plant sex determination and provide the community a functional genomics resource that will facilitate cucurbit research and breeding, we performed transcriptome sequencing of cucumber flower buds of two near-isogenic lines, WI1983G, a gynoecious plant which bears only pistillate flowers, and WI1983H, a hermaphroditic plant which bears only bisexual flowers.
Using Roche-454 massive parallel pyrosequencing technology, we generated a total of 353,941 high quality EST sequences with an average length of 175bp, among which 188,255 were from gynoecious flowers and 165,686 from hermaphroditic flowers. These EST sequences, together with ~5,600 high quality cucumber EST and mRNA sequences available in GenBank, were clustered and assembled into 81,401 unigenes, of which 28,452 were contigs and 52,949 were singletons. The unigenes and ESTs were further mapped to the cucumber genome and more than 500 alternative splicing events were identified in 443 cucumber genes. The unigenes were further functionally annotated by comparing their sequences to different protein and functional domain databases and assigned with Gene Ontology (GO) terms. A biochemical pathway database containing 343 predicted pathways was also created based on the annotations of the unigenes. Digital expression analysis identified ~200 differentially expressed genes between flowers of WI1983G and WI1983H and provided novel insights into molecular mechanisms of plant sex determination process. Furthermore, a set of SSR motifs and high confidence SNPs between WI1983G and WI1983H were identified from the ESTs, which provided the material basis for future genetic linkage and QTL analysis.
A large set of EST sequences were generated from cucumber flower buds of two different sex types. Differentially expressed genes between these two different sex-type flowers, as well as putative SSR and SNP markers, were identified. These EST sequences provide valuable information to further understand molecular mechanisms of plant sex determination process and forms a rich resource for future functional genomics analysis, marker development and cucumber breeding.
Sheepgrass [Leymus chinensis (Trin.) Tzvel.] is an important perennial forage grass across the Eurasian Steppe and is known for its adaptability to various environmental conditions. However, insufficient data resources in public databases for sheepgrass limited our understanding of the mechanism of environmental adaptations, gene discovery and molecular marker development.
The transcriptome of sheepgrass was sequenced using Roche 454 pyrosequencing technology. We assembled 952,328 high-quality reads into 87,214 unigenes, including 32,416 contigs and 54,798 singletons. There were 15,450 contigs over 500 bp in length. BLAST searches of our database against Swiss-Prot and NCBI non-redundant protein sequences (nr) databases resulted in the annotation of 54,584 (62.6%) of the unigenes. Gene Ontology (GO) analysis assigned 89,129 GO term annotations for 17,463 unigenes. We identified 11,675 core Poaceae-specific and 12,811 putative sheepgrass-specific unigenes by BLAST searches against all plant genome and transcriptome databases. A total of 2,979 specific freezing-responsive unigenes were found from this RNAseq dataset. We identified 3,818 EST-SSRs in 3,597 unigenes, and some SSRs contained unigenes that were also candidates for freezing-response genes. Characterizations of nucleotide repeats and dominant motifs of SSRs in sheepgrass were also performed. Similarity and phylogenetic analysis indicated that sheepgrass is closely related to barley and wheat.
This research has greatly enriched sheepgrass transcriptome resources. The identified stress-related genes will help us to decipher the genetic basis of the environmental and ecological adaptations of this species and will be used to improve wheat and barley crops through hybridization or genetic transformation. The EST-SSRs reported here will be a valuable resource for future gene-phenotype studies and for the molecular breeding of sheepgrass and other Poaceae species.
Rust fungi are biotrophic basidiomycete plant pathogens that cause major diseases on plants and trees world-wide, affecting agriculture and forestry. Their biotrophic nature precludes many established molecular genetic manipulations and lines of research. The generation of genomic resources for these microbes is leading to novel insights into biology such as interactions with the hosts and guiding directions for breakthrough research in plant pathology.
To support gene discovery and gene model verification in the genome of the wheat leaf rust fungus, Puccinia triticina (Pt), we have generated Expressed Sequence Tags (ESTs) by sampling several life cycle stages. We focused on several spore stages and isolated haustorial structures from infected wheat, generating 17,684 ESTs. We produced sequences from both the sexual (pycniospores, aeciospores and teliospores) and asexual (germinated urediniospores) stages of the life cycle. From pycniospores and aeciospores, produced by infecting the alternate host, meadow rue (Thalictrum speciosissimum), 4,869 and 1,292 reads were generated, respectively. We generated 3,703 ESTs from teliospores produced on the senescent primary wheat host. Finally, we generated 6,817 reads from haustoria isolated from infected wheat as well as 1,003 sequences from germinated urediniospores. Along with 25,558 previously generated ESTs, we compiled a database of 13,328 non-redundant sequences (4,506 singlets and 8,822 contigs). Fungal genes were predicted using the EST version of the self-training GeneMarkS algorithm. To refine the EST database, we compared EST sequences by BLASTN to a set of 454 pyrosequencing-generated contigs and Sanger BAC-end sequences derived both from the Pt genome, and to ESTs and genome reads from wheat. A collection of 6,308 fungal genes was identified and compared to sequences of the cereal rusts, Puccinia graminis f. sp. tritici (Pgt) and stripe rust, P. striiformis f. sp. tritici (Pst), and poplar leaf rust Melampsora species, and the corn smut fungus, Ustilago maydis (Um). While extensive homologies were found, many genes appeared novel and species-specific; over 40% of genes did not match any known sequence in existing databases. Focusing on spore stages, direct comparison to Um identified potential functional homologs, possibly allowing heterologous functional analysis in that model fungus. Many potentially secreted protein genes were identified by similarity searches against genes and proteins of Pgt and Melampsora spp., revealing apparent orthologs.
The current set of Pt unigenes contributes to gene discovery in this major cereal pathogen and will be invaluable for gene model verification in the genome sequence.
Chickpea (Cicer arietinum L.), an important grain legume crop of the world is seriously challenged by terminal drought and salinity stresses. However, very limited number of molecular markers and candidate genes are available for undertaking molecular breeding in chickpea to tackle these stresses. This study reports generation and analysis of comprehensive resource of drought- and salinity-responsive expressed sequence tags (ESTs) and gene-based markers.
A total of 20,162 (18,435 high quality) drought- and salinity- responsive ESTs were generated from ten different root tissue cDNA libraries of chickpea. Sequence editing, clustering and assembly analysis resulted in 6,404 unigenes (1,590 contigs and 4,814 singletons). Functional annotation of unigenes based on BLASTX analysis showed that 46.3% (2,965) had significant similarity (≤1E-05) to sequences in the non-redundant UniProt database. BLASTN analysis of unique sequences with ESTs of four legume species (Medicago, Lotus, soybean and groundnut) and three model plant species (rice, Arabidopsis and poplar) provided insights on conserved genes across legumes as well as novel transcripts for chickpea. Of 2,965 (46.3%) significant unigenes, only 2,071 (32.3%) unigenes could be functionally categorised according to Gene Ontology (GO) descriptions. A total of 2,029 sequences containing 3,728 simple sequence repeats (SSRs) were identified and 177 new EST-SSR markers were developed. Experimental validation of a set of 77 SSR markers on 24 genotypes revealed 230 alleles with an average of 4.6 alleles per marker and average polymorphism information content (PIC) value of 0.43. Besides SSR markers, 21,405 high confidence single nucleotide polymorphisms (SNPs) in 742 contigs (with ≥ 5 ESTs) were also identified. Recognition sites for restriction enzymes were identified for 7,884 SNPs in 240 contigs. Hierarchical clustering of 105 selected contigs provided clues about stress- responsive candidate genes and their expression profile showed predominance in specific stress-challenged libraries.
Generated set of chickpea ESTs serves as a resource of high quality transcripts for gene discovery and development of functional markers associated with abiotic stress tolerance that will be helpful to facilitate chickpea breeding. Mapping of gene-based markers in chickpea will also add more anchoring points to align genomes of chickpea and other legume species.
Molecular breeding of pepper (Capsicum spp.) can be accelerated by developing DNA markers associated with transcriptomes in breeding germplasm. Before the advent of next generation sequencing (NGS) technologies, the majority of sequencing data were generated by the Sanger sequencing method. By leveraging Sanger EST data, we have generated a wealth of genetic information for pepper including thousands of SNPs and Single Position Polymorphic (SPP) markers. To complement and enhance these resources, we applied NGS to three pepper genotypes: Maor, Early Jalapeño and Criollo de Morelos-334 (CM334) to identify SNPs and SSRs in the assembly of these three genotypes.
Two pepper transcriptome assemblies were developed with different purposes. The first reference sequence, assembled by CAP3 software, comprises 31,196 contigs from >125,000 Sanger-EST sequences that were mainly derived from a Korean F1-hybrid line, Bukang. Overlapping probes were designed for 30,815 unigenes to construct a pepper Affymetrix GeneChip® microarray for whole genome analyses. In addition, custom Python scripts were used to identify 4,236 SNPs in contigs of the assembly. A total of 2,489 simple sequence repeats (SSRs) were identified from the assembly, and primers were designed for the SSRs. Annotation of contigs using Blast2GO software resulted in information for 60% of the unigenes in the assembly. The second transcriptome assembly was constructed from more than 200 million Illumina Genome Analyzer II reads (80–120 nt) using a combination of Velvet, CLC workbench and CAP3 software packages. BWA, SAMtools and in-house Perl scripts were used to identify SNPs among three pepper genotypes. The SNPs were filtered to be at least 50 bp from any intron-exon junctions as well as flanking SNPs. More than 22,000 high-quality putative SNPs were identified. Using the MISA software, 10,398 SSR markers were also identified within the Illumina transcriptome assembly and primers were designed for the identified markers. The assembly was annotated by Blast2GO and 14,740 (12%) of annotated contigs were associated with functional proteins.
Before availability of pepper genome sequence, assembling transcriptomes of this economically important crop was required to generate thousands of high-quality molecular markers that could be used in breeding programs. In order to have a better understanding of the assembled sequences and to identify candidate genes underlying QTLs, we annotated the contigs of Sanger-EST and Illumina transcriptome assemblies. These and other information have been curated in a database that we have dedicated for pepper project.
Pepper; Capsicum spp; Molecular Markers; EST; Transcriptome; RNAseq; Annotation; SNP; SSR; SPP
In temperate regions, the time lag between vegetative bud burst and bud set determines the duration of the growing season of trees (i.e. the duration of wood biomass production). Dormancy, the period during which the plant is not growing, allows trees to avoid cold injury resulting from exposure to low temperatures. An understanding of the molecular machinery controlling the shift between these two phenological states is of key importance in the context of climatic change. The objective of this study was to identify genes upregulated during endo- and ecodormancy, the two main stages of bud dormancy. Sessile oak is a widely distributed European white oak species. A forcing test on young trees was first carried out to identify the period most likely to correspond to these two stages. Total RNA was then extracted from apical buds displaying endo- and ecodormancy. This RNA was used for the generation of cDNA libraries, and in-depth transcriptome characterization was performed with 454 FLX pyrosequencing technology.
Pyrosequencing produced a total of 495,915 reads. The data were cleaned, duplicated reads removed, and sequences were mapped onto the oak UniGene data. Digital gene expression analysis was performed, with both R statistics and the R-Bioconductor packages (edgeR and DESeq), on 6,471 contigs with read numbers ≥ 5 within any contigs. The number of sequences displaying significant differences in expression level (read abundance) between endo- and ecodormancy conditions ranged from 75 to 161, depending on the algorithm used. 13 genes displaying significant differences between conditions were selected for further analysis, and 11 of these genes, including those for glutathione-S-transferase (GST) and dehydrin xero2 (XERO2) were validated by quantitative PCR.
The identification and functional annotation of differentially expressed genes involved in the “response to abscisic acid”, “response to cold stress” and “response to oxidative stress” categories constitutes a major step towards characterization of the molecular network underlying vegetative bud dormancy, an important life history trait of long-lived organisms.
Epimedium sagittatum (Sieb. Et Zucc.) Maxim, a traditional Chinese medicinal plant species, has been used extensively as genuine medicinal materials. Certain Epimedium species are endangered due to commercial overexploition, while sustainable application studies, conservation genetics, systematics, and marker-assisted selection (MAS) of Epimedium is less-studied due to the lack of molecular markers. Here, we report a set of expressed sequence tags (ESTs) and simple sequence repeats (SSRs) identified in these ESTs for E. sagittatum.
cDNAs of E. sagittatum are sequenced using 454 GS-FLX pyrosequencing technology. The raw reads are cleaned and assembled into a total of 76,459 consensus sequences comprising of 17,231 contigs and 59,228 singlets. About 38.5% (29,466) of the consensus sequences significantly match to the non-redundant protein database (E-value < 1e-10), 22,295 of which are further annotated using Gene Ontology (GO) terms. A total of 2,810 EST-SSRs is identified from the Epimedium EST dataset. Trinucleotide SSR is the dominant repeat type (55.2%) followed by dinucleotide (30.4%), tetranuleotide (7.3%), hexanucleotide (4.9%), and pentanucleotide (2.2%) SSR. The dominant repeat motif is AAG/CTT (23.6%) followed by AG/CT (19.3%), ACC/GGT (11.1%), AT/AT (7.5%), and AAC/GTT (5.9%). Thirty-two SSR-ESTs are randomly selected and primer pairs are synthesized for testing the transferability across 52 Epimedium species. Eighteen primer pairs (85.7%) could be successfully transferred to Epimedium species and sixteen of those show high genetic diversity with 0.35 of observed heterozygosity (Ho) and 0.65 of expected heterozygosity (He) and high number of alleles per locus (11.9).
A large EST dataset with a total of 76,459 consensus sequences is generated, aiming to provide sequence information for deciphering secondary metabolism, especially for flavonoid pathway in Epimedium. A total of 2,810 EST-SSRs is identified from EST dataset and ~1580 EST-SSR markers are transferable. E. sagittatum EST-SSR transferability to the major Epimedium germplasm is up to 85.7%. Therefore, this EST dataset and EST-SSRs will be a powerful resource for further studies such as taxonomy, molecular breeding, genetics, genomics, and secondary metabolism in Epimedium species.
Yellow lupin (Lupinus luteus L.) is a minor legume crop characterized by its high seed protein content. Although grown in several temperate countries, its orphan condition has limited the generation of genomic tools to aid breeding efforts to improve yield and nutritional quality. In this study, we report the construction of 454-expresed sequence tag (EST) libraries, carried out comparative studies between L. luteus and model legume species, developed a comprehensive set of EST-simple sequence repeat (SSR) markers, and validated their utility on diversity studies and transferability to related species.
Two runs of 454 pyrosequencing yielded 205 Mb and 530 Mb of sequence data for L1 (young leaves, buds and flowers) and L2 (immature seeds) EST- libraries. A combined assembly (L1L2) yielded 71,655 contigs with an average contig length of 632 nucleotides. L1L2 contigs were clustered into 55,309 isotigs. 38,200 isotigs translated into proteins and 8,741 of them were full length. Around 57% of L. luteus sequences had significant similarity with at least one sequence of Medicago, Lotus, Arabidopsis, or Glycine, and 40.17% showed positive matches with all of these species. L. luteus isotigs were also screened for the presence of SSR sequences. A total of 2,572 isotigs contained at least one EST-SSR, with a frequency of one SSR per 17.75 kbp. Empirical evaluation of the EST-SSR candidate markers resulted in 222 polymorphic EST-SSRs. Two hundred and fifty four (65.7%) and 113 (30%) SSR primer pairs were able to amplify fragments from L. hispanicus and L. mutabilis DNA, respectively. Fifty polymorphic EST-SSRs were used to genotype a sample of 64 L. luteus accessions. Neighbor-joining distance analysis detected the existence of several clusters among L. luteus accessions, strongly suggesting the existence of population subdivisions. However, no clear clustering patterns followed the accession’s origin.
L. luteus deep transcriptome sequencing will facilitate the further development of genomic tools and lupin germplasm. Massive sequencing of cDNA libraries will continue to produce raw materials for gene discovery, identification of polymorphisms (SNPs, EST-SSRs, INDELs, etc.) for marker development, anchoring sequences for genome comparisons and putative gene candidates for QTL detection.
Lupinus luteus; EST-SSR; Orphan crop; Microsynteny
Cultivated peanut or groundnut (Arachis hypogaea L.) is an important oilseed crop with an allotetraploid genome (AABB, 2n = 4x = 40). Both the low level of genetic variation within the cultivated gene pool and its polyploid nature limit the utilization of molecular markers to explore genome structure and facilitate genetic improvement. Nevertheless, a wealth of genetic diversity exists in diploid Arachis species (2n = 2x = 20), which represent a valuable gene pool for cultivated peanut improvement. Interspecific populations have been used widely for genetic mapping in diploid species of Arachis. However, an intraspecific mapping strategy was essential to detect chromosomal rearrangements among species that could be obscured by mapping in interspecific populations. To develop intraspecific reference linkage maps and gain insights into karyotypic evolution within the genus, we comparatively mapped the A- and B-genome diploid species using intraspecific F2 populations. Exploring genome organization among diploid peanut species by comparative mapping will enhance our understanding of the cultivated tetraploid peanut genome. Moreover, new sources of molecular markers that are highly transferable between species and developed from expressed genes will be required to construct saturated genetic maps for peanut.
A total of 2,138 EST-SSR (expressed sequence tag-simple sequence repeat) markers were developed by mining a tetraploid peanut EST assembly including 101,132 unigenes (37,916 contigs and 63,216 singletons) derived from 70,771 long-read (Sanger) and 270,957 short-read (454) sequences. A set of 97 SSR markers were also developed by mining 9,517 genomic survey sequences of Arachis. An SSR-based intraspecific linkage map was constructed using an F2 population derived from a cross between K 9484 (PI 298639) and GKBSPSc 30081 (PI 468327) in the B-genome species A. batizocoi. A high degree of macrosynteny was observed when comparing the homoeologous linkage groups between A (A. duranensis) and B (A. batizocoi) genomes. Comparison of the A- and B-genome genetic linkage maps also showed a total of five inversions and one major reciprocal translocation between two pairs of chromosomes under our current mapping resolution.
Our findings will contribute to understanding tetraploid peanut genome origin and evolution and eventually promote its genetic improvement. The newly developed EST-SSR markers will enrich current molecular marker resources in peanut.
Peanut (Arachis hypogaea); SSR; Genetic linkage map; Intraspecific cross; EST
The oomycete pathogen Phytophthora ramorum is responsible for sudden oak death (SOD) in California coastal forests. P. ramorum is a generalist pathogen with over 100 known host species. Three or four closely related genotypes of P. ramorum (from a single lineage) were originally introduced in California forests and the pathogen reproduces clonally. Because of this the genetic diversity of P. ramorum is extremely low in Californian forests. However, P. ramorum shows diverse phenotypic variation in colony morphology, colony senescence, and virulence. In this study, we show that phenotypic variation among isolates is associated with the host species from which the microbe was originally cultured. Microarray global mRNA profiling detected derepression of transposable elements (TEs) and down-regulation of crinkler effector homologs (CRNs) in the majority of isolates originating from coast live oak (Quercus agrifolia), but this expression pattern was not observed in isolates from California bay laurel (Umbellularia californica). In some instances, oak and bay laurel isolates originating from the same geographic location had identical genotypes based on multilocus simples sequence repeat (SSR) marker analysis but had different phenotypes. Expression levels of the two marker genes analyzed by quantitative reverse transcription PCR were correlated with originating host species, but not with multilocus genotypes. Because oak is a nontransmissive dead-end host for P. ramorum, our observations are congruent with an epi-transposon hypothesis; that is, physiological stress is triggered on P. ramorum while colonizing oak stems and disrupts epigenetic silencing of TEs. This then results in TE reactivation and possibly genome diversification without significant epidemiological consequences. We propose the P. ramorum-oak host system in California forests as an ad hoc model for epi-transposon mediated diversification.
Common bean (Phaseolus vulgaris) is the most important food legume in the world. Although this crop is very important to both the developed and developing world as a means of dietary protein supply, resources available in common bean are limited. Global transcriptome analysis is important to better understand gene expression, genetic variation, and gene structure annotation in addition to other important features. However, the number and description of common bean sequences are very limited, which greatly inhibits genome and transcriptome research. Here we used 454 pyrosequencing to obtain a substantial transcriptome dataset for common bean.
We obtained 1,692,972 reads with an average read length of 207 nucleotides (nt). These reads were assembled into 59,295 unigenes including 39,572 contigs and 19,723 singletons, in addition to 35,328 singletons less than 100 bp. Comparing the unigenes to common bean ESTs deposited in GenBank, we found that 53.40% or 31,664 of these unigenes had no matches to this dataset and can be considered as new common bean transcripts. Functional annotation of the unigenes carried out by Gene Ontology assignments from hits to Arabidopsis and soybean indicated coverage of a broad range of GO categories. The common bean unigenes were also compared to the bean bacterial artificial chromosome (BAC) end sequences, and a total of 21% of the unigenes (12,724) including 9,199 contigs and 3,256 singletons match to the 8,823 BAC-end sequences. In addition, a large number of simple sequence repeats (SSRs) and transcription factors were also identified in this study.
This work provides the first large scale identification of the common bean transcriptome derived by 454 pyrosequencing. This research has resulted in a 150% increase in the number of Phaseolus vulgaris ESTs. The dataset obtained through this analysis will provide a platform for functional genomics in common bean and related legumes and will aid in the development of molecular markers that can be used for tagging genes of interest. Additionally, these sequences will provide a means for better annotation of the on-going common bean whole genome sequencing.
Functional genomics has particular promise in silkworm biology for identifying genes involved in a variety of biological functions that include: synthesis and secretion of silk, sex determination pathways, insect-pathogen interactions, chorionogenesis, molecular clocks. Wild silkmoths have hardly been the subject of detailed scientific investigations, owing largely to non-availability of molecular and genetic data on these species. As a first step, in the present study we generated large scale expressed sequence tags (EST) in three economically important species of wild silkmoths. In order to make these resources available for the use of global scientific community, an EST database called 'WildSilkbase' was developed.
WildSilkbase is a catalogue of ESTs generated from several tissues at different developmental stages of 3 economically important saturniid silkmoths, an Indian golden silkmoth, Antheraea assama, an Indian tropical tasar silkmoth, A. mylitta and eri silkmoth, Samia cynthia ricini. Currently the database is provided with 57,113 ESTs which are clustered and assembled into 4,019 contigs and 10,019 singletons. Data can be browsed and downloaded using a standard web browser. Users can search the database either by BLAST query, keywords or Gene Ontology query. There are options to carry out searches for species, tissue and developmental stage specific ESTs in BLAST page. Other features of the WildSilkbase include cSNP discovery, GO viewer, homologue finder, SSR finder and links to all other related databases. The WildSilkbase is freely available from .
A total of 14,038 putative unigenes was identified in 3 species of wild silkmoths. These genes provide important resources to gain insight into the functional and evolutionary study of wild silkmoths. We believe that WildSilkbase will be extremely useful for all those researchers working in the areas of comparative genomics, functional genomics and molecular evolution in general, and gene discovery, gene organization, transposable elements and genome variability of insect species in particular.
Melon (Cucumis melo L.) is one of the most important fleshy fruits for fresh consumption. Despite this, few genomic resources exist for this species. To facilitate the discovery of genes involved in essential traits, such as fruit development, fruit maturation and disease resistance, and to speed up the process of breeding new and better adapted melon varieties, we have produced a large collection of expressed sequence tags (ESTs) from eight normalized cDNA libraries from different tissues in different physiological conditions.
We determined over 30,000 ESTs that were clustered into 16,637 non-redundant sequences or unigenes, comprising 6,023 tentative consensus sequences (contigs) and 10,614 unclustered sequences (singletons). Many potential molecular markers were identified in the melon dataset: 1,052 potential simple sequence repeats (SSRs) and 356 single nucleotide polymorphisms (SNPs) were found. Sixty-nine percent of the melon unigenes showed a significant similarity with proteins in databases. Functional classification of the unigenes was carried out following the Gene Ontology scheme. In total, 9,402 unigenes were mapped to one or more ontology. Remarkably, the distributions of melon and Arabidopsis unigenes followed similar tendencies, suggesting that the melon dataset is representative of the whole melon transcriptome. Bioinformatic analyses primarily focused on potential precursors of melon micro RNAs (miRNAs) in the melon dataset, but many other genes potentially controlling disease resistance and fruit quality traits were also identified. Patterns of transcript accumulation were characterised by Real-Time-qPCR for 20 of these genes.
The collection of ESTs characterised here represents a substantial increase on the genetic information available for melon. A database (MELOGEN) which contains all EST sequences, contig images and several tools for analysis and data mining has been created. This set of sequences constitutes also the basis for an oligo-based microarray for melon that is being used in experiments to further analyse the melon transcriptome.
Limited DNA sequence and DNA marker resources have been developed for Iris (Iridaceae), a monocot genus of 200–300 species in the Asparagales, several of which are horticulturally important. We mined an I. brevicaulis-I. fulva EST database for simple sequence repeats (SSRs) and developed ortholog-specific EST-SSR markers for genetic mapping and other genotyping applications in Iris. Here, we describe the abundance and other characteristics of SSRs identified in the transcript assembly (EST database) and the cross-species utility and polymorphisms of I. brevicaulis-I. fulva EST-SSR markers among wild collected ecotypes and horticulturally important cultivars.
Collectively, 6,530 ESTs were produced from normalized leaf and root cDNA libraries of I. brevicaulis (IB72) and I. fulva (IF174), and assembled into 4,917 unigenes (1,066 contigs and 3,851 singletons). We identified 1,447 SSRs in 1,162 unigenes and developed 526 EST-SSR markers, each tracing a different unigene. Three-fourths of the EST-SSR markers (399/526) amplified alleles from IB72 and IF174 and 84% (335/399) were polymorphic between IB25 and IF174, the parents of I. brevicaulis × I. fulva mapping populations. Forty EST-SSR markers were screened for polymorphisms among 39 ecotypes or cultivars of seven species – 100% amplified alleles from wild collected ecotypes of Louisiana Iris (I.brevicaulis, I.fulva, I. nelsonii, and I. hexagona), whereas 42–52% amplified alleles from cultivars of three horticulturally important species (I. pseudacorus, I. germanica, and I. sibirica). Ecotypes and cultivars were genetically diverse – the number of alleles/locus ranged from two to 18 and mean heterozygosity was 0.76.
Nearly 400 ortholog-specific EST-SSR markers were developed for comparative genetic mapping and other genotyping applications in Iris, were highly polymorphic among ecotypes and cultivars, and have broad utility for genotyping applications within the genus.
The obligate parasitic plant witchweed (Striga hermonthica) infects major cereal crops such as sorghum, maize, and millet, and is the most devastating weed pest in Africa. An understanding of the nature of its parasitism would contribute to the development of more sophisticated management methods. However, the molecular and genomic resources currently available for the study of S. hermonthica are limited.
We constructed a full-length enriched cDNA library of S. hermonthica, sequenced 37,710 clones from the library, and obtained 67,814 expressed sequence tag (EST) sequences. The ESTs were assembled into 17,317 unigenes that included 10,319 contigs and 6,818 singletons. The S. hermonthica unigene dataset was subjected to a comparative analysis with other plant genomes or ESTs. Approximately 80% of the unigenes have homologs in other dicotyledonous plants including Arabidopsis, poplar, and grape. We found that 589 unigenes are conserved in the hemiparasitic Triphysaria species but not in other plant species. These are good candidates for genes specifically involved in plant parasitism. Furthermore, we found 1,445 putative simple sequence repeats (SSRs) in the S. hermonthica unigene dataset. We tested 64 pairs of PCR primers flanking the SSRs to develop genetic markers for the detection of polymorphisms. Most primer sets amplified polymorphicbands from individual plants collected at a single location, indicating high genetic diversity in S. hermonthica. We selected 10 primer pairs to analyze S. hermonthica harvested in the field from different host species and geographic locations. A clustering analysis suggests that genetic distances are not correlated with host specificity.
Our data provide the first extensive set of molecular resources for studying S. hermonthica, and include EST sequences, a comparative analysis with other plant genomes, and useful genetic markers. All the data are stored in a web-based database and freely available. These resources will be useful for genome annotation, gene discovery, functional analysis, molecular breeding, epidemiological studies, and studies of plant evolution.
Expressed Sequence Tag (EST) has been a cost-effective tool in molecular biology and represents an abundant valuable resource for genome annotation, gene expression, and comparative genomics in plants.
In this study, we constructed a cDNA library of Prunus mume flower and fruit, sequenced 10,123 clones of the library, and obtained 8,656 expressed sequence tag (EST) sequences with high quality. The ESTs were assembled into 4,473 unigenes composed of 1,492 contigs and 2,981 singletons and that have been deposited in NCBI (accession IDs: GW868575 - GW873047), among which 1,294 unique ESTs were with known or putative functions. Furthermore, we found 1,233 putative simple sequence repeats (SSRs) in the P. mume unigene dataset. We randomly tested 42 pairs of PCR primers flanking potential SSRs, and 14 pairs were identified as true-to-type SSR loci and could amplify polymorphic bands from 20 individual plants of P. mume. We further used the 14 EST-SSR primer pairs to test the transferability on peach and plum. The result showed that nearly 89% of the primer pairs produced target PCR bands in the two species. A high level of marker polymorphism was observed in the plum species (65%) and low in the peach (46%), and the clustering analysis of the three species indicated that these SSR markers were useful in the evaluation of genetic relationships and diversity between and within the Prunus species.
We have constructed the first cDNA library of P. mume flower and fruit, and our data provide sets of molecular biology resources for P. mume and other Prunus species. These resources will be useful for further study such as genome annotation, new gene discovery, gene functional analysis, molecular breeding, evolution and comparative genomics between Prunus species.
Lentil (Lens culinaris Medik.) is a cool-season grain legume which provides a rich source of protein for human consumption. In terms of genomic resources, lentil is relatively underdeveloped, in comparison to other Fabaceae species, with limited available data. There is hence a significant need to enhance such resources in order to identify novel genes and alleles for molecular breeding to increase crop productivity and quality.
Tissue-specific cDNA samples from six distinct lentil genotypes were sequenced using Roche 454 GS-FLX Titanium technology, generating c. 1.38 × 106 expressed sequence tags (ESTs). De novo assembly generated a total of 15,354 contigs and 68,715 singletons. The complete unigene set was sequence-analysed against genome drafts of the model legume species Medicago truncatula and Arabidopsis thaliana to identify 12,639, and 7,476 unique matches, respectively. When compared to the genome of Glycine max, a total of 20,419 unique hits were observed corresponding to c. 31% of the known gene space. A total of 25,592 lentil unigenes were subsequently annoated from GenBank. Simple sequence repeat (SSR)-containing ESTs were identified from consensus sequences and a total of 2,393 primer pairs were designed. A subset of 192 EST-SSR markers was screened for validation across a panel 12 cultivated lentil genotypes and one wild relative species. A total of 166 primer pairs obtained successful amplification, of which 47.5% detected genetic polymorphism.
A substantial collection of ESTs has been developed from sequence analysis of lentil genotypes using second-generation technology, permitting unigene definition across a broad range of functional categories. As well as providing resources for functional genomics studies, the unigene set has permitted significant enhancement of the number of publicly-available molecular genetic markers as tools for improvement of this species.
Tea is one of the most popular beverages in the world and the tea plant, Camellia sinensis (L.) O. Kuntze, is an important crop in many countries. To increase the amount of genomic information available for C. sinensis, we constructed seven cDNA libraries from various organs and used these to generate expressed sequence tags (ESTs). A total of 17,458 ESTs were generated and assembled into 5,262 unigenes. About 50% of the unigenes were assigned annotations by Gene Ontology. Some were homologous to genes involved in important biological processes, such as nitrogen assimilation, aluminum response, and biosynthesis of caffeine and catechins. Digital northern analysis showed that 67 unigenes were expressed differentially among the seven organs. Simple sequence repeat (SSR) motif searches among the unigenes identified 1,835 unigenes (34.9%) harboring SSR motifs of more than six repeat units. A subset of 100 EST-SSR primer sets was tested for amplification and polymorphism in 16 tea accessions. Seventy-one primer sets successfully amplified EST-SSRs and 70 EST-SSR loci were polymorphic. Furthermore, these 70 EST-SSR markers were transferable to 14 other Camellia species. The ESTs and EST-SSR markers will enhance the study of important traits and the molecular genetics of tea plants and other Camellia species.
Camellia sinensis; tea plants; expressed sequence tags; EST-SSR
The striped bass and its relatives (genus Morone) are important fisheries and aquaculture species native to estuaries and rivers of the Atlantic coast and Gulf of Mexico in North America. To open avenues of gene expression research on reproduction and breeding of striped bass, we generated a collection of expressed sequence tags (ESTs) from a complementary DNA (cDNA) library representative of their ovarian transcriptome.
Sequences of a total of 230,151 ESTs (51,259,448 bp) were acquired by Roche 454 pyrosequencing of cDNA pooled from ovarian tissues obtained at all stages of oocyte growth, at ovulation (eggs), and during preovulatory atresia. Quality filtering of ESTs allowed assembly of 11,208 high-quality contigs ≥ 100 bp, including 2,984 contigs 500 bp or longer (average length 895 bp). Blastx comparisons revealed 5,482 gene orthologues (E-value < 10-3), of which 4,120 (36.7% of total contigs) were annotated with Gene Ontology terms (E-value < 10-6). There were 5,726 remaining unknown unique sequences (51.1% of total contigs). All of the high-quality EST sequences are available in the National Center for Biotechnology Information (NCBI) Short Read Archive (GenBank: SRX007394). Informative contigs were considered to be abundant if they were assembled from groups of ESTs comprising ≥ 0.15% of the total short read sequences (≥ 345 reads/contig). Approximately 52.5% of these abundant contigs were predicted to have predominant ovary expression through digital differential display in silico comparisons to zebrafish (Danio rerio) UniGene orthologues. Over 1,300 Gene Ontology terms from Biological Process classes of Reproduction, Reproductive process, and Developmental process were assigned to this collection of annotated contigs.
This first large reference sequence database available for the ecologically and economically important temperate basses (genus Morone) provides a foundation for gene expression studies in these species. The predicted predominance of ovary gene expression and assignment of directly relevant Gene Ontology classes suggests a powerful utility of this dataset for analysis of ovarian gene expression related to fundamental questions of oogenesis. Additionally, a high definition Agilent 60-mer oligo ovary 'UniClone' microarray with 8 × 15,000 probe format has been designed based on this striped bass transcriptome (eArray Group: Striper Group, Design ID: 029004).
Previous loblolly pine (Pinus taeda L.) genetic linkage maps have been based on a variety of DNA polymorphisms, such as AFLPs, RAPDs, RFLPs, and ESTPs, but only a few SSRs (simple sequence repeats), also known as simple tandem repeats or microsatellites, have been mapped in P. taeda. The objective of this study was to integrate a large set of SSR markers from a variety of sources and published cDNA markers into a composite P. taeda genetic map constructed from two reference mapping pedigrees. A dense genetic map that incorporates SSR loci will benefit complete pine genome sequencing, pine population genetics studies, and pine breeding programs. Careful marker annotation using a variety of references further enhances the utility of the integrated SSR map.
The updated P. taeda genetic map, with an estimated genome coverage of 1,515 cM(Kosambi) across 12 linkage groups, incorporated 170 new SSR markers and 290 previously reported SSR, RFLP, and ESTP markers. The average marker interval was 3.1 cM. Of 233 mapped SSR loci, 84 were from cDNA-derived sequences (EST-SSRs) and 149 were from non-transcribed genomic sequences (genomic-SSRs). Of all 311 mapped cDNA-derived markers, 77% were associated with NCBI Pta UniGene clusters, 67% with RefSeq proteins, and 62% with functional Gene Ontology (GO) terms. Duplicate (i.e., redundant accessory) and paralogous markers were tentatively identified by evaluating marker sequences by their UniGene cluster IDs, clone IDs, and relative map positions. The average gene diversity, He, among polymorphic SSR loci, including those that were not mapped, was 0.43 for 94 EST-SSRs and 0.72 for 83 genomic-SSRs. The genetic map can be viewed and queried at http://www.conifergdb.org/pinemap.
Many polymorphic and genetically mapped SSR markers are now available for use in P. taeda population genetics, studies of adaptive traits, and various germplasm management applications. Annotating mapped genes with UniGene clusters and GO terms allowed assessment of redundant and paralogous EST markers and further improved the quality and utility of the genetic map for P. taeda.