Tea is one of the most popular beverages in the world and the tea plant, Camellia sinensis (L.) O. Kuntze, is an important crop in many countries. To increase the amount of genomic information available for C. sinensis, we constructed seven cDNA libraries from various organs and used these to generate expressed sequence tags (ESTs). A total of 17,458 ESTs were generated and assembled into 5,262 unigenes. About 50% of the unigenes were assigned annotations by Gene Ontology. Some were homologous to genes involved in important biological processes, such as nitrogen assimilation, aluminum response, and biosynthesis of caffeine and catechins. Digital northern analysis showed that 67 unigenes were expressed differentially among the seven organs. Simple sequence repeat (SSR) motif searches among the unigenes identified 1,835 unigenes (34.9%) harboring SSR motifs of six or more repeat units. A subset of 100 EST-SSR primer sets was tested for amplification and polymorphism in 16 tea accessions. Seventy-one primer sets successfully amplified EST-SSRs and 70 EST-SSR loci were polymorphic. Furthermore, these 70 EST-SSR markers were transferable to 14 other Camellia species. The ESTs and EST-SSR markers will enhance the study of important traits and the molecular genetics of tea plants and other Camellia species.
Tea is one of the most popular non-alcoholic beverages and is drunk all around the world. The tea plant, Camellia sinensis (L.) O. Kuntze, is a woody evergreen plant of the genus Camellia in the family Theaceae and is native to southern China. In 2007, the harvested area totaled about 2.8 million hectares, with China, India, Sri Lanka, Kenya and Indonesia being major producers (http://faostat.fao.org/).
Tea has a genome of 4.0 Gb (Tanaka et al. 2006) with a basic chromosome number of n = 15. The genome is larger than that of human (3.1 Gb; International Human Genome Sequencing Consortium 2004), 33 times that of Arabidopsis thaliana (120 Mb; Arabidopsis Genome Initiative 2000), 10 times that of rice (389 Mb; International Rice Genome Sequencing Project 2005) and one-quarter that of wheat (16 Gb; Arumuganathan and Earle 1991). An efficient first step in analyzing a large-genome species such as tea is to survey the expressed genes. Expressed sequence tag (EST) analysis, in which partial sequences of a large number of cDNA clones are isolated, is a useful approach for revealing expressed sequences in the genome, and it enables the identification of many genes responsible for important traits. In addition, ESTs can be used as a resource for functional genomics experiments, such as gene expression analysis using microarrays.
Several EST analyses of tea plants have been reported. Chen et al. (2005) reported 1,684 ESTs generated from tender shoots. Park et al. (2004) reported 588 ESTs isolated by suppression subtractive hybridization. Sharma and Kumar (2005) reported three drought-responsive ESTs obtained by differential display. Shi et al. (2011) reported details of the transcriptome of C. sinensis that were generated by RNA-seq analysis using a high-throughput Illumina GA IIx sequencer. The ESTs reported in the first three studies were derived from green tissues, such as young shoots and mature leaves, but not roots. The RNA-seq data reported by Shi et al. (2011) were generated from seven different organs, including young roots, flower buds, and immature seeds, but the RNAs were mixed before analysis, and thus the origin of each transcript could not be identified.
DNA markers such as microsatellites (Becher 2007, Gupta and Prasad 2009, Hanai et al. 2007, Heesacker et al. 2008, Laurent et al. 2007) and single-nucleotide polymorphisms (SNPs) (Chagne et al. 2008, Choi et al. 2007, Deleu et al. 2009, Lijavetzky et al. 2007, Sato et al. 2009) can be developed by using sequence information from ESTs. Those ESTs that harbor simple sequence repeat (SSR) motifs, referred to as EST-SSRs, show a high level of transferability to closely related species because they originate from transcribed regions, which are often conserved. Therefore EST-SSRs of C. sinensis should be useful for genome analysis in many other Camellia species as well.
Sharma et al. (2009) developed 61 EST-SSRs of C. sinensis and demonstrated the polymorphism of these marker loci. However, to construct linkage maps of C. sinensis, several hundred markers are necessary.
In this paper, we report 17,458 ESTs derived from seven cDNA libraries of young shoots, mature leaves and roots of tea plants. To facilitate gene identification and functional studies, we performed Gene Ontology (Ashburner et al. 2000) annotation of tea unigenes. Furthermore, we developed EST-SSR markers from the EST data and show them to be highly polymorphic and transferable to many Camellia species.
Organs for RNA isolation were collected from tea plants growing at Makurazaki Tea Research Station, NARO Institute of Vegetable and Tea Science, Kagoshima, Japan. Young roots (RT) came from 15-d-old seedlings derived from natural crosses of C. sinensis cv. ‘Sayamakaori’. Tap roots (TR) and lateral roots (LR) were harvested from 30-d-old seedlings. Young leaves (YL), terminal buds (TB) and young stems (YS) of growing shoots with two leaves and a bud were harvested from field-grown ‘Sayamakaori’ in April of the first flush (first harvest) season. Mature leaves (ML) that developed the previous year were harvested from field-grown ‘Sayamakaori’ during the first flush season. The 16 accessions of C. sinensis and the 14 other Camellia species used for EST-SSR analysis are listed in Tables 1 and 2, respectively.
Total RNAs from above-ground tissues (YL, ML, YS and TB) were extracted using TRIzol reagent (Life Technologies, USA). Total RNAs from young root tissues (RT, TR and LR) were extracted using an RNeasy Plant mini kit (Qiagen, Germany).
For cDNA library construction from the RT RNA sample, total RNA was dephosphorylated and decapped with a GeneRacer kit (Life Technologies). The decapped RNA was ligated with GeneRacer RNA Oligo and reverse-transcribed with SuperScript II reverse transcriptase (Life Technologies). After first-strand cDNA synthesis, the RNA was degraded with RNase H. cDNA was amplified by PCR with 5′ (5′-CGACTGGAGCACGAGGACACTGA-3′) and 3′ (5′-GCTGTCAACGATACGCTACGTAACG-3′) primers for 2 min at 94°C; followed by 20 cycles at 94°C for 20 s, 56°C for 30 s and 72°C for 10 min; followed by 10-min extension at 72°C. To enrich the content of long cDNAs, the PCR products were separated by agarose gel electrophoresis and products longer than 1,000 bp were isolated and cloned into the pGEM-T Easy vector (Life Technologies) and then transformed into Escherichia coli strain DH5α cells.
To construct cDNA libraries from the other organs, double-stranded cDNA was synthesized with a SMART cDNA Library Construction Kit (Clontech, USA), digested with restriction enzyme SfiI and size-fractionated in a CHROMA-SPIN 400 column (Clontech). The cDNA fragments were directionally ligated into an SfiI-digested pTriplEx2 vector. The ligation mixture was electroporated into E. coli DH5α competent cells.
Both ends of cDNAs from the RT library were sequenced using T7 (5′-TAATACGACTCACTATAGGG-3′) or SP6 (5′-ATTTAGGTGACACTATAGAA-3′) primers and the 5′ ends of cDNAs from the YL, TB, YS, ML, LR and TR libraries were sequenced using the 5′ λTriplEx2 sequencing primer (5′-TCCGAGATCTGGACGAGC-3′). Cycle sequencing reactions were performed using a BigDye Terminator Cycle Sequencing Kit (Life Technologies) and capillary electrophoresis was done using an ABI 3730xl or ABI 3130xl sequencer (Life Technologies).
Base-calling of sequence reads was performed using the KB basecaller program (Life Technologies). Ambiguous sequences were removed using the Sequencing Analysis program (Life Technologies) and vector sequences were trimmed using the cross_match program (http://www.phrap.org/). Sequences of less than 100 bp were then eliminated from the analysis. A total of 17,458 ESTs were generated and submitted to the DDBJ database (accession numbers AB361047 to AB361052, AB461364 to AB461372, AB485966 to AB505873 and FS943336 to FS960759). The 17,458 ESTs were assembled using the phrap program (http://www.phrap.org/). If the 5′ read and 3′ read derived from the same clone in the RT library belonged to different contigs, or both reads were singletons, or one read was a member of a contig and the other was a singleton, then the contigs or singletons were treated as a single scaffold.
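The scaffold rule described above can be sketched as a small union-find grouping. This is only an illustration under stated assumptions, not the authors' pipeline: the function name `build_scaffolds`, the input dictionaries and the read and contig names are hypothetical, and the sketch assumes that links are merged transitively (if clone A links units 1 and 2 and clone B links units 2 and 3, all three form one scaffold), which the text does not state explicitly.

```python
from collections import defaultdict


def build_scaffolds(read_to_unit, clone_to_reads):
    """Group assembly units (contigs or singletons) linked by paired reads.

    read_to_unit: maps a read name to the contig or singleton it belongs to.
    clone_to_reads: maps a clone name to its (5' read, 3' read) pair.
    Returns a list of sets; a set with more than one unit is a scaffold.
    """
    # Every unit starts as its own group.
    parent = {unit: unit for unit in read_to_unit.values()}

    def find(unit):
        # Follow parent links to the representative, shortening the path.
        while parent[unit] != unit:
            parent[unit] = parent[parent[unit]]
            unit = parent[unit]
        return unit

    # Merge the two units that hold the 5' and 3' reads of the same clone.
    for five_prime, three_prime in clone_to_reads.values():
        a = find(read_to_unit[five_prime])
        b = find(read_to_unit[three_prime])
        if a != b:
            parent[b] = a

    groups = defaultdict(set)
    for unit in parent:
        groups[find(unit)].add(unit)
    return list(groups.values())


# Hypothetical toy data: clone c1 spans two contigs, clone c2 links a contig
# to a singleton, clone c3 has both reads in one contig.
example_units = {
    "c1_5": "contig1", "c1_3": "contig2",
    "c2_5": "contig2", "c2_3": "single_c2_3",
    "c3_5": "contig3", "c3_3": "contig3",
    "other": "contig4",
}
example_clones = {
    "c1": ("c1_5", "c1_3"),
    "c2": ("c2_5", "c2_3"),
    "c3": ("c3_5", "c3_3"),
}
example_groups = build_scaffolds(example_units, example_clones)
```

In this toy input, contig1, contig2 and the singleton end up in one scaffold, while contig3 and contig4 remain separate unigenes.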
The nucleotide sequences of the unigenes were searched using the BLASTX program (Gish and States 1993) against the non-redundant protein sequences in GenBank (http://www.ncbi.nlm.nih.gov/genbank/), the UniProt database (http://www.uniprot.org/), the Arabidopsis proteome database (TAIR8; http://www.arabidopsis.org/) and amino acid sequences deduced from the rice genome sequence (IRGSP/RAP build 5; http://rapdb.dna.affrc.go.jp/download/). Functional annotation of unigenes was performed using the Blast2GO program (Conesa et al. 2005). GO Slim annotations of unigenes were also generated with Blast2GO using the plant GO Slim mapping program provided by TAIR (http://www.arabidopsis.org/).
We selected 144 unigenes that were generated by assembly of 10 or more independent ESTs and used them for expression profiling based on the number of ESTs within each library. Differential expression levels were tested with the Audic and Claverie statistical test in IDEG6 software (Romualdi et al. 2003). To eliminate false positives, we used Bonferroni’s correction for the adjustment of multiple comparisons. Sixty-seven unigenes that were differentially expressed among the seven libraries were clustered by using Hierarchical Clustering Explorer v. 3.0 software (http://www.cs.umd.edu/hcil/hce/).
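To make the digital northern test more concrete, the following is a minimal sketch of the Audic and Claverie probability for a pair of libraries, combined with a Bonferroni threshold. It is not the IDEG6 implementation: the function names, the choice of a one-tailed cumulative probability, and the example counts and library sizes are assumptions made for illustration. The probability of observing y ESTs in a library of size n2, given x ESTs in a library of size n1, is taken as p(y|x) = (n2/n1)^y (x+y)! / [x! y! (1 + n2/n1)^(x+y+1)].

```python
import math


def ac_prob(y, x, n1, n2):
    """Audic and Claverie probability p(y | x) for library sizes n1 and n2."""
    r = n2 / n1
    # Work in log space so that large counts do not overflow the factorials.
    log_p = (
        y * math.log(r)
        + math.lgamma(x + y + 1)
        - math.lgamma(x + 1)
        - math.lgamma(y + 1)
        - (x + y + 1) * math.log(1 + r)
    )
    return math.exp(log_p)


def ac_tail(x, y, n1, n2):
    """Smaller of the two cumulative tails P(Y <= y | x) and P(Y >= y | x)."""
    lower = sum(ac_prob(k, x, n1, n2) for k in range(y + 1))
    upper = max(0.0, 1.0 - lower + ac_prob(y, x, n1, n2))
    return min(1.0, lower, upper)


def bonferroni_significant(p_value, n_tests, alpha=0.05):
    """Bonferroni correction: compare p with alpha divided by the test count."""
    return p_value < alpha / n_tests


# Hypothetical counts: 30 ESTs of a unigene in one library, 2 in another,
# with 2,500 ESTs per library and 144 unigenes tested.
p_example = ac_tail(30, 2, 2500, 2500)
is_differential = bonferroni_significant(p_example, n_tests=144)
```

The number of tests passed to the Bonferroni step depends on how many unigenes and library pairs are compared; the value 144 used here only mirrors the number of unigenes selected and is not taken from the analysis settings.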
Using the tea unigene set as a target, we used the Read2Marker program (Fukuoka et al. 2005) to identify microsatellites in which the total number of repeats was six or more, with each repeat run consisting of at least three repeats of a dinucleotide or trinucleotide unit. We also designed PCR primers for amplification of EST-SSRs using Read2Marker.
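A simplified version of such a motif search can be written with a regular expression. The sketch below is not Read2Marker and does not reproduce its exact criteria; it assumes a simpler rule, namely an uninterrupted run of at least six tandem copies of a dinucleotide or trinucleotide unit, and the function name `find_ssrs` and the test sequence are hypothetical.

```python
import re


def find_ssrs(seq, min_repeats=6):
    """Return (start, unit, copies) for perfect di- or trinucleotide repeats.

    A hit is an uninterrupted run of at least `min_repeats` tandem copies of
    a 2- or 3-base unit. Runs of a single base (e.g. poly-A) are skipped.
    """
    # Group 1 is the repeat unit; it must be followed by min_repeats - 1 or
    # more further copies. The lazy quantifier prefers the shorter unit.
    pattern = re.compile(r"([ACGT]{2,3}?)\1{%d,}" % (min_repeats - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        if len(set(unit)) == 1:
            continue  # homopolymer run, not a di- or trinucleotide SSR
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


# Hypothetical sequence: (AG)7, then (GAA)6, then a poly-A tail.
example_seq = "TTGC" + "AG" * 7 + "CCT" + "GAA" * 6 + "A" * 14
example_hits = find_ssrs(example_seq)
```

For this sequence the sketch reports the AG repeat at position 4 with seven copies and the GAA repeat at position 21 with six copies, and ignores the poly-A tail.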
PCR was performed in a 10-μl reaction mix including 20 ng of genomic DNA, 10× PCR Gold buffer (Life Technologies), 0.8 μl of 8 mM dNTP, 0.1 U of AmpliTaq Gold polymerase, 0.8 μl of 25 mM MgCl2 and 1 μM forward and reverse primers. The PCR reactions were carried out in a GeneAmp 9600 thermal cycler (Life Technologies) according to the following “touchdown PCR” cycling program: 95°C for 5 min; 95°C for 1 min, 62°C for 30 s and 72°C for 1 min; 13 cycles at decreasing annealing temperatures in decrements of 0.5°C per cycle; 25 cycles of 95°C for 1 min, 62°C for 30 s and 72°C for 1 min and a final 72°C for 10 min. PCR products were directly labeled with fluorescence-labeled R110-ddUTP by the single-tube method (Inazuka et al. 1996). The labeled PCR products were analyzed with an ABI Prism 3130xl Genetic Analyzer, and the resulting allele data were analyzed with GeneMapper v. 3.7 software (Life Technologies). Polymorphism information content and heterozygosity information were calculated in PowerMarker software (Liu and Muse 2005).
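The marker statistics named above can be illustrated with their textbook definitions. The sketch below assumes diploid genotypes scored as pairs of allele sizes and uses the uncorrected formulas He = 1 - Σ p_i² and PIC = 1 - Σ p_i² - Σ_{i<j} 2 p_i² p_j²; PowerMarker may apply sample-size corrections that are not reproduced here, and the function name and genotype data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def marker_stats(genotypes):
    """Summarize one single-locus marker.

    genotypes: list of (allele_a, allele_b) tuples, one per diploid accession.
    Returns the number of alleles, observed heterozygosity (Ho), expected
    heterozygosity (He) and polymorphism information content (PIC).
    """
    alleles = [allele for genotype in genotypes for allele in genotype]
    counts = Counter(alleles)
    freqs = [count / len(alleles) for count in counts.values()]

    # Ho: fraction of accessions carrying two different alleles.
    ho = sum(1 for a, b in genotypes if a != b) / len(genotypes)
    # He: probability that two randomly drawn alleles differ.
    he = 1.0 - sum(p * p for p in freqs)
    # PIC: He minus the term for uninformative heterozygote matings.
    pic = he - sum(2 * p * p * q * q for p, q in combinations(freqs, 2))
    return {"alleles": len(counts), "Ho": ho, "He": he, "PIC": pic}


# Hypothetical fragment sizes (bp) for four accessions at one locus.
example_stats = marker_stats([(150, 150), (150, 154), (154, 158), (150, 158)])
```

With allele frequencies of 0.5, 0.25 and 0.25, this toy locus gives Ho = 0.75, He = 0.625 and PIC of about 0.555, which shows why PIC is always slightly lower than He for the same marker.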
Seven cDNA libraries were constructed from tea plant organs (Table 3). From the young roots (RT) cDNA library, 3,072 clones were randomly selected and single-pass-sequenced from both ends. From each of the other six libraries, 2,880 clones were sequenced from their 5′ ends. After removal of low-quality sequences and vector trimming, the resulting data set contained 17,458 sequences with an average length of 481 bp (Table 4). The GC content of the 17,458 ESTs (8,391,523 bases) was 44.0%. These 17,458 high-quality ESTs were assembled into contigs in phrap, which resulted in 2,227 contigs and 3,477 singletons. Some 5′ and 3′ reads from the same clones from the RT library were not assembled into the same contigs; in such cases, the contigs and singletons that contained such 5′ and 3′ pair reads were treated as scaffolds. As a result, 442 scaffolds, 1,851 contigs and 2,969 singletons were generated. Together, the 5,262 sequences were used for further analysis as a 5.3-k tea unigene set. Among these sequences, 3,372 unigenes (64.1%) were longer than 500 bp. The assembly of ESTs contained in each cDNA library generated 846 to 1,587 unique transcripts per library (Table 3).
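The summary figures in this paragraph are related by simple arithmetic, which the short check below reproduces from the reported numbers. The variable names are illustrative only.

```python
# Reported counts after scaffolding.
scaffolds, contigs, singletons = 442, 1851, 2969
unigenes = scaffolds + contigs + singletons  # size of the unigene set

# Mean EST length from total bases and number of ESTs.
total_bases, total_ests = 8_391_523, 17_458
mean_length_bp = total_bases / total_ests

# Percentage of unigenes longer than 500 bp.
long_unigenes = 3372
long_percent = 100 * long_unigenes / unigenes

# Reads attempted: both ends of 3,072 RT clones plus 2,880 reads from each
# of the other six libraries.
reads_attempted = 2 * 3072 + 6 * 2880
```

The sum of scaffolds, contigs and singletons gives the 5,262 unigenes, and the mean length rounds to the reported 481 bp.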
On 5 August 2011, the NCBI GenBank database contained 14,246 ESTs and 34.5 million RNA-seqs from tea. Similarity searches of the 5.3-k unigene set were performed against the 14,246 ESTs and the 76,159 assembled sequences from the RNA-seq analysis by Shi et al. (2011), which had been deposited in the Transcriptome Shotgun Assembly Sequence Database (TSA) at NCBI with accession numbers HP701085–HP777243. The searches were performed by using BLASTN with a cutoff value of 1e-10. Within the 5.3-k unigene set, 3,340 unigenes (63.5%) had no matches among the 14,246 tea plant ESTs in GenBank and 1,118 unigenes (21.2%) had no matches among the 76,159 assembled sequences of RNA-seqs; 732 unigenes (13.9%) had no significant matches within either data set.
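The novelty screen described above amounts to collecting the unigenes that have at least one hit at or below the E-value cutoff in each search and then taking the unigenes absent from both sets. The sketch below assumes the searches were saved in the 12-column tab-delimited BLAST format (query ID in column 1, E-value in column 11); the output format actually used is not stated in the text, and the function names and example lines are hypothetical.

```python
def unigenes_with_hits(blast_tab_lines, max_evalue=1e-10):
    """Return query IDs with at least one hit at or below max_evalue.

    Expects 12-column tabular BLAST output: query ID in field 1 and
    E-value in field 11. Blank lines and comment lines are ignored.
    """
    hits = set()
    for line in blast_tab_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if float(fields[10]) <= max_evalue:
            hits.add(fields[0])
    return hits


def novel_unigenes(all_ids, *hit_sets):
    """Return the IDs that appear in none of the hit sets."""
    return set(all_ids) - set().union(*hit_sets)


# Hypothetical results for three unigenes searched against two databases.
est_lines = [
    "U0001\tEST_A\t98.5\t400\t6\t0\t1\t400\t10\t409\t1e-150\t700",
    "U0002\tEST_B\t85.0\t60\t9\t0\t1\t60\t5\t64\t1e-05\t50.1",
]
tsa_lines = [
    "U0001\tHP_X\t99.0\t450\t4\t0\t1\t450\t1\t450\t0.0\t820",
]
example_novel = novel_unigenes(
    ["U0001", "U0002", "U0003"],
    unigenes_with_hits(est_lines),
    unigenes_with_hits(tsa_lines),
)
```

In this toy case U0002 has only a weak hit above the cutoff and U0003 has no hit at all, so both are reported as lacking significant matches in either data set.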
Before annotation of the tea unigenes, sequence homology searches were performed. By a BLASTX search against the GenBank non-redundant database (cutoff of ≤1e–6), 3,055 in the 5.3-k unigene set (58.1%) returned significant hits. Out of the 3,055 unigenes, 762 (24.9%) were annotated as hypothetical, predicted, putative, unknown, or unnamed proteins. A BLASTX search was also performed against the UniProt database and the complete protein sets of Arabidopsis thaliana and Oryza sativa. We found that 2,484 (47.2%) of the 5.3-k unigene set encoded peptides that were significantly similar to those in the UniProt database, 3,417 (64.9%) were similar to Arabidopsis proteins and 3,673 (64.4%) were similar to rice proteins, all with a cutoff value of 1e–6.
Subsequently, Gene Ontology annotation was performed with Blast2GO. A total of 2,639 unigenes were annotated with 11,260 annotations, distributed among the main Gene Ontology categories: Biological Process (4,582), Molecular Function (3,509) and Cellular Component (3,169) (Fig. 1 and Supplemental Table 1). There were 1,191 unigenes annotated for all three Gene Ontology categories.
To evaluate the usefulness of the 5.3-k unigene set as a gene resource for tea, we searched for unigenes involved in important agricultural and biological processes of tea, such as nitrogen assimilation and amino acid metabolism, catechin and caffeine biosynthesis, photoresponse and aluminum response (Table 5). For nitrogen assimilation, we found unigenes involved in primary assimilation of inorganic nitrogen, such as nitrate transporter, ammonium transporter and glutamate synthetase, and in amino acid metabolism. For catechin biosynthesis, we found unigenes encoding 10 enzymes, including phenylalanine ammonia-lyase and leucoanthocyanidin reductase. In addition, we identified unigenes encoding caffeine synthetase and several involved in aluminum response and photoresponse.
To reveal patterns of gene expression and correlations of expression patterns between organs, we analyzed the EST data using R statistics of the IDEG6 web tool to identify unigenes that were differentially expressed. From the 5.3-k unigene set, 144 unigenes that consisted of more than 10 EST sequences were selected for analysis; of these, 67 showed significant differences in expression profile among the libraries (Supplemental Table 2). Cluster analysis using Hierarchical Clustering Explorer 3.0 divided the 67 unigenes into three major clusters (Fig. 2 and Supplemental Table 2). Cluster I was divided into four subclusters, Ia–Id, which contained unigenes highly expressed in the YL, YS, ML and TB libraries, respectively. Clusters II and III showed high expression in root: specifically, cluster II in the LR and TR libraries and cluster III in the RT library.
Clusters Ia and Ic contained a number of photosynthesis-related genes, including chlorophyll-a/b-binding protein and photosystem I reaction center subunit, respectively (Supplemental Table 2). Cluster II contained a unigene that encodes dihydroflavonol 4-reductase; this enzyme synthesizes leucoanthocyanidin, which is the direct precursor to (+)-catechin and (+)-gallocatechin. In cluster III, 10 out of 28 unigenes encoded stress-response proteins, including manganese superoxide dismutase and glutathione S-transferase.
An SSR motif search within the 5.3-k unigene set identified 1,835 unigenes (34.9%) that harbored SSR motifs of six or more repeat units. Among these 1,835 SSR-containing unigenes, the most frequent repeat motif was AT/GC repeat, which was found in 24.4% of all unigenes, followed by AC/GT repeat (6.5%) (Table 6).
We selected the 100 EST-SSRs with the highest numbers of repeat units and designed primer sets to amplify them (Supplemental Table 3). Out of the 100, three (MSE0049, MSE0066 and MSE0089) had high homology to EST-SSRs reported by Sharma et al. (2009), but the other 97 EST-SSRs were novel.
The 100 EST-SSRs were tested for their ability to amplify fragments within 16 tea (C. sinensis) accessions (Table 1). Of these, 71 produced well-amplified fragments, and 70 revealed polymorphism among the 16 accessions (Table 7). For 61 markers, only one or two fragments were amplified in each accession and these were considered single-locus markers. For the other 10 markers, more than three amplified fragments were observed in some accessions and these were considered multi-locus markers. For the single-locus markers, the number of alleles per locus ranged from 1 to 15, with an average of 8.2 alleles. Observed heterozygosities (Ho) ranged from 0 to 1.0, with an average of 0.64. Expected heterozygosities (He) ranged from 0 to 0.91, with an average of 0.72. Polymorphism information content ranged from 0 to 0.90, with an average of 0.69.
Using 14 Camellia species (Table 2), we investigated transferability of the EST-SSRs developed in this study. Seventy of the 71 markers usable in C. sinensis were amplified in more than one species (Table 7). The average proportion of C. sinensis markers amplified in each of the 14 species was 87.1%. In C. irrawadiensis, a member of the same subgenus (Thea) as C. sinensis, 68 markers (95.8%) showed amplification (Table 7 and Supplemental Table 4).
Before this study, the NCBI GenBank database held 14,246 ESTs and 34.5 million RNA-seqs from C. sinensis. In this study, we report the identification of 17,458 ESTs from seven cDNA libraries. Within the 5.3-k unigene set developed here, 732 unigenes had no significant matches by BLASTN homology searches against the tea ESTs and assembled sequences from RNA-seqs previously deposited in GenBank, indicating that these unigenes are novel mRNA sequences from tea. The lengths of 64.1% of the sequences in the 5.3-k unigene set were more than 500 bp, whereas in the unigenes generated by RNA-seq analysis, only 17.9% were longer than 500 bp. In general, EST analysis using Sanger sequencing generates longer sequence reads than RNA-seqs using a high-throughput Illumina GA IIx sequencer, so the difference in unigene length distribution is attributed to the difference in sequencing technique.
The data presented here are expected to become a useful gene resource for research aimed at understanding physiological processes important for tea cultivation and quality, such as nitrogen assimilation and amino acid metabolism. In Japan, large amounts of nitrogen fertilizers are used in tea plantations, causing pollution of groundwater, rivers and lakes. To improve this situation, it is important to develop tea cultivars with high nitrogen use efficiency (Tanaka and Taniguchi 2007). Therefore, we searched for unigenes related to nitrogen assimilation within the unigene set and found several that were homologous to genes related to nitrogen assimilation, such as glutamine synthetase, glutamate dehydrogenase, ammonium transporter and nitrate transporter. In addition, the unigene set contains theanine synthase and several unigenes related to the metabolism of 2-oxoglutarate, a key component of the interaction of nitrogen and carbon metabolism.
In addition to nitrogen compounds such as amino acids, secondary metabolites such as catechins and caffeine are important for tea quality. Among our ESTs, we found several unigenes related to synthesis of these secondary metabolites. The metabolisms of nitrogen compounds and secondary metabolites are regulated by environmental status. For example, in young tea leaves, catechins increase under high light intensity (Saijo 1980). In contrast, shading of young tea leaves leads to an increase in total nitrogen content, as well as enhancement of theanine (Anan and Nakagawa 1974, Karasuyama and Matsumoto 1988). In the future, it will be important to decipher the mechanism of photoresponsive regulation of genes related to the metabolism of nitrogen compounds and secondary metabolites to enable improvement of these traits. Two unigenes related to photoresponse were found in our ESTs, providing us with tools to analyze the associated regulatory mechanisms.
Tea is well known as an aluminum-accumulating plant that grows well in very acidic soils containing high levels of Al3+; this is of interest because aluminum toxicity limits the growth of many other species in acidic soils (Morita et al. 2004, 2008). The aluminum in the xylem sap of tea is complexed with citrate (Morita et al. 2004). Three unigenes potentially related to aluminum response were found in this study: one citrate synthetase and two aluminum-response proteins. Further analyses, such as expression analysis of the response of tea to aluminum, might reveal whether these genes have roles in aluminum resistance or response.
Using the EST data derived from seven different organs of the tea plant, digital northern analysis was performed to identify unigenes with different expression levels among different organs; 67 such unigenes were identified out of a sample of 144. Cluster analysis showed that the groups of unigenes highly expressed in each organ were related to different physiological functions. For example, several photosynthesis-related genes were highly expressed in the YL and ML libraries. Cluster III, which showed high expression in the RT library, was the largest cluster (25 unigenes), indicating that the physiological and developmental status of young root is considerably different from that of other organs. Interestingly, dihydroflavonol 4-reductase (DFR) was highly expressed in tap roots and lateral roots. Although catechins are not contained in tea roots (Forrest and Bendall 1969), leucoanthocyanidin, which is the product of DFR and the precursor of (+)-catechin, is contained in roots. Thus, we assume this DFR in roots to be involved not in catechin biosynthesis, but in other metabolic processes such as lignin or anthocyanin biosynthesis. One more unigene encoding DFR was found in the 5.3-k unigene set. This unigene was expressed in young stem, and the sequence similarity between the two DFRs was 52%. We think that the DFR from young stem is involved in catechin biosynthesis.
Ellis and Burke (2007) surveyed EST data from 33 species and showed that the proportion of unigenes containing SSRs was 2.5% to 21.1% (9.0% ± 0.1%, mean ± SEM). Based on this survey, the percentage of SSR-containing unigenes in this study (34.9%) is relatively high compared to that in other plant species.
The proportion of multi-locus markers in this study was higher than that reported by Sharma et al. (2009). We used a capillary sequencer for fragment analysis, whereas Sharma et al. (2009) used autoradiography of PAGE gels, which has lower resolution. Thus, the difference in the proportion of multi-locus markers might have been caused by the difference in the analysis method. Because of the paleopolyploidy of C. sinensis (Shi et al. 2010), it is not surprising that many multi-locus markers are contained in the set of EST-SSRs reported here.
The 16 accessions used in this study include major tea cultivars in Japan, parental cultivars and several foreign germplasms. These materials are representative of the genetic diversity of breeding materials in Japan. The EST-SSRs developed in this study were highly polymorphic among the 16 accessions. They should be very useful for many genetic studies in tea, such as construction of linkage maps, analysis of genetic diversity and cultivar identification. For example, using 377 EST-SSRs and other co-dominant markers, we recently constructed a reference linkage map of tea (Taniguchi et al. 2012).
Most of the EST-SSR markers developed here were applicable to other Camellia species. Species of Camellia other than C. sinensis contain useful traits that have been utilized in tea breeding; for instance, a parental line containing a high level of anthocyanin (Ogino et al. 2005) and a caffeineless tea plant (Ogino et al. 2009) were developed from interspecific crosses. EST-SSR markers will enable genetic analysis of important agronomic traits of various Camellia species, thus expanding the usefulness of these species in tea breeding.
In conclusion, the tea ESTs obtained in this study are valuable resources for analysis of gene function and for development of SSR markers. The 5.3-k tea unigene set contains novel transcripts from tea, and 67 out of 144 unigenes tested showed specific expression patterns among a set of seven organs. The SSR markers developed in this study are highly polymorphic in C. sinensis and many other Camellia species. Further studies using the tea EST dataset are expected to accelerate functional genomics and genetic breeding research in tea.
We would like to thank N. Rokutan and M. Iwata for their technical assistance. This work was supported by NARO Research Project No. 211, ‘Establishment of Integrated Basis for Development and Application of Advanced Tools for DNA Marker-Assisted Selection in Horticultural Crops.’