We have assembled 36,493 expressed gene sequences from an actively growing cultivated Taxus tree using Illumina paired-end sequencing (mRNA-seq) technology and de novo short-read assembly. Quality-control comparisons to the Sanger-derived transcript sequences from Taxus, as well as multiple lines of evidence such as protein coding sequence (CDS) prediction (Fig. S4) and DGE tag mapping, showed that the transcript assemblies are robust and that thousands of coding sequences and their respective 5′ and/or 3′ untranslated regions were successfully assembled (Table S2). For example, 23,145 and 1,860 CDSs were predicted by Blastx and ESTScan, respectively. Among the 23,145 CDSs predicted by Blastx, 14,793 (63.9%) and 14,442 (62.4%) could be aligned with clone sequences and Unigene sequences of P. glauca, respectively (data available upon request). Comparison of the assembled gene models to gene catalogs of other plant species by Blast analysis and functional annotation (e.g., GO, Swissprot and KEGG) indicates that we have sampled an extensive and diverse expressed gene catalog representing a large proportion of the genes expressed in Taxus. Comparison with the few publicly available Taxus mairei DNA sequences suggests that we have sampled the most comprehensive set of genes from a single Taxus species, and one that is more complete in length and diversity than what has been available for Taxus cuspidata, another species of the genus. Additionally, by using DGE to quantify the mRNA-seq data, we have produced an informative database of transcript abundance across three Taxus tissues which, owing to the depth of sequencing, offers much higher sensitivity and a wider dynamic range than the Sanger- or 454-derived EST counts usually associated with this type of analysis.
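The percentages quoted above follow directly from the reported counts. As a quick consistency check, the short Python sketch below recomputes them; the variable and function names are illustrative only and are not part of the study's pipeline.

```python
# Counts quoted in the text above.
blastx_cds = 23_145        # CDSs predicted by Blastx
estscan_cds = 1_860        # CDSs predicted by ESTScan
aligned_clone = 14_793     # Blastx CDSs aligned to P. glauca clone sequences
aligned_unigene = 14_442   # Blastx CDSs aligned to P. glauca Unigene sequences


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Total number of predicted CDSs across both predictors.
total_cds = blastx_cds + estscan_cds

print("total predicted CDSs:", total_cds)
print("aligned to clones (%):", percent(aligned_clone, blastx_cds))
print("aligned to Unigenes (%):", percent(aligned_unigene, blastx_cds))
```

Both percentages are taken relative to the Blastx-predicted set, which is the denominator that reproduces the figures given in the text.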
A concern associated with de novo assembly of transcript sequences is the contiguity of the assembled sequences. This concern naturally increases as read length decreases, and may be one of the reasons why most de novo transcriptome assembly approaches to date have used technologies with longer read lengths. We provide two lines of evidence that jointly support the contiguity of the transcript sequences assembled in our study from Illumina short-read data. First, a high proportion of the Unigenes exhibited high-confidence Blastx similarity to protein sequences from annotated gene catalogs of plant species such as Arabidopsis, although one may argue that good Blastx hits are not conclusive evidence of correct assembly. Second, a large proportion of the Unigenes contained long predicted CDSs (Fig. S4, Table S2). For example, 4,131 of the 25,005 CDSs (16.5%) predicted by Blastx and ESTScan are at least 1,000 bp long. The assembly quality and annotation of these sequences could be improved in future by even deeper sequencing and the addition of data from new tissue types. De novo assembled transcriptome datasets lack the ability to discriminate and classify the lower-confidence annotations, a challenge that is beyond the scope of this study.
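The long-CDS proportion cited above amounts to a length filter over the predicted coding sequences. The following is a minimal sketch of that calculation, assuming the CDSs are available in FASTA format. The toy records, the `read_fasta` and `fraction_at_least` helpers, and the in-memory file are hypothetical stand-ins, not the study's data or code.

```python
from io import StringIO
from typing import Dict, Iterable


def read_fasta(handle: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into a {record_id: sequence} dictionary."""
    records: Dict[str, str] = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record id is the first whitespace-delimited token.
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def fraction_at_least(lengths: Iterable[int], min_len: int) -> float:
    """Fraction of lengths that are >= min_len (0.0 for empty input)."""
    lengths = list(lengths)
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n >= min_len) / len(lengths)


# Hypothetical toy CDSs of 1,200 bp, 300 bp and 1,002 bp.
toy_fasta = StringIO(
    ">cds_a\n" + "ATG" * 400 + "\n"
    ">cds_b\n" + "ATG" * 100 + "\n"
    ">cds_c\n" + "ATG" * 334 + "\n"
)
cds = read_fasta(toy_fasta)
toy_share = fraction_at_least((len(s) for s in cds.values()), 1000)
print(f"toy CDSs >= 1,000 bp: {toy_share:.1%}")

# Same ratio using the counts reported in the text.
reported_share = round(100 * 4131 / 25005, 1)
print("reported long-CDS share (%):", reported_share)
```

The threshold is inclusive, matching the "at least 1,000 bp" wording of the text.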
The results of the clustering analysis of differential gene expression patterns, the GO functional enrichment analysis, and the KEGG pathway enrichment analysis lend support to the biological significance of the DGE profiles derived from short-read sequencing technology. These profiles will assist in the discovery and annotation of novel Taxus genes that play key roles in growth and physiology, particularly in taxane production. The Taxus genomic resource produced in this study, together with future comparative analyses with other gymnosperm species such as Pinus and Picea, will be valuable for studying the unique biology of this evolutionarily ancient lineage.
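Enrichment analyses of the kind mentioned above are commonly framed as a hypergeometric test: given a background of N genes of which K carry a given GO term or KEGG pathway annotation, how surprising is it to find at least k annotated genes among n differentially expressed ones? This section does not state which test was used, so the sketch below is an assumption for illustration only, and the counts in it are made-up toy values.

```python
from math import comb


def hypergeom_sf(k: int, N: int, K: int, n: int) -> float:
    """P(X >= k) for X ~ Hypergeometric(N genes, K annotated, n drawn)."""
    upper = min(K, n)
    total = comb(N, n)
    # Sum the probability mass for k, k+1, ..., min(K, n) annotated genes.
    # math.comb returns 0 when the lower argument exceeds the upper one,
    # so impossible outcomes contribute nothing to the sum.
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


# Hypothetical toy example: 20 background genes, 5 annotated with a term,
# 6 differentially expressed genes, 4 of which carry the annotation.
p_value = hypergeom_sf(k=4, N=20, K=5, n=6)
print(f"enrichment p-value (toy data): {p_value:.4f}")
```

In practice such p-values would be computed for every term and then corrected for multiple testing; that step is omitted here to keep the sketch short.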
Root samples from different Taxus species were found to have similar profiles and nearly identical chemical distributions, which suggests that Taxus roots share a similar metabolic framework. The most significant chemical characteristic of this tissue is its abundance of 7-xylosyltaxanes, demonstrating that the biosynthesis of 7-xylosyltaxanes is a major metabolic pathway in this plant part. Besides 7-xylosyltaxanes, many essential and valuable taxanes, e.g., 10-DAB, baccatin III, 10-DAT, 10-deacetylcephalomannine (10-DAC), cephalomannine and paclitaxel, are also present at relatively high levels, demonstrating the high capacity for taxane synthesis in the Taxus root. Studies at the molecular and metabolic levels are thus complementary and cross-validating. Notably, 33 taxoids with the 10-DAB skeleton exist in the Taxus root and can be used as semi-synthetic precursors for paclitaxel. These taxoids are also recognized as intermediates or products of side routes of paclitaxel formation, i.e., downstream metabolites derived via the important intermediate 10-DAB during taxane biosynthesis. Furthermore, these pharmaceutically important taxanes account for over 55% of the peak areas in the entire chromatogram (0–19 min). This is quite different from the needles, which usually have major divergent pathways apart from paclitaxel biosynthesis, such as the formation of abundant taxine B and taxinine M. The presence and connectivity of all detected taxanes in the Taxus root can be used to work out the downstream biosynthetic framework of paclitaxel and its analogues such as 10-DAT, cephalomannine, and Taxol C, as well as 7-xylosyltaxanes. In addition, these detected taxanes and their connectivity facilitate further studies on cell culture and metabolomic analysis of Taxus.
Proposed metabolic framework for taxane biosynthesis in the Taxus root.
From an application perspective, the consistent chemical profile suggests that the Taxus root resource can be processed using a single, common development scheme regardless of species of origin. Moreover, in contrast to the needles, Taxus roots have relatively simple chemical constituents and can supply large quantities of various valuable taxanes such as paclitaxel, cephalomannine and 7-xylosyltaxanes. The concentrations of paclitaxel and cephalomannine in the roots are two to eight times higher than in the corresponding needles. Hence the prospects for developing the Taxus root as a resource are very favorable.
In conclusion, this sequence collection represents the first major genomic resource for Taxus mairei, and the large number of genes characterized in the different Taxus tissues by DGE technology should contribute to further research in this and other gymnosperm species. The de novo transcriptome and DGE analyses also provided a genome-wide view of the transcriptional and post-transcriptional mechanisms that generate an increased number of transcript isoforms in the respective Taxus tissues. The consistency of the data from multiple approaches, including transcriptome and metabolome analyses, indicates that the mRNA-seq and DGE data produced in this study are reliable. Our results illustrate the utility of Illumina second-generation sequencing as a basis for defining metabolic pathways and tissue-specific functional genomics in non-model plant species.