Sequence Analysis and Assembly
To obtain a global overview of the D. latiflorus Munro transcriptome and gene activity at nucleotide resolution, a mixed cDNA sample representing diverse developmental stages and tissues of D. latiflorus Munro was prepared and sequenced using the Illumina Genome Analyzer. Each sequenced sample yielded 2×72-bp independent reads from either end of a cDNA fragment. After stringent quality assessment and data filtering, 15,138,726 reads (~2.2 G) with 94.67% Q20 bases (those with a base quality greater than 20) were selected as high quality reads for further analysis. An overview of the sequencing is presented in . The high quality reads produced in this study have been deposited in the NCBI SRA database (accession number: SRA055083).
Summary of Illumina transcriptome sequencing for D. latiflorus Munro.
Using the Trinity de novo assembly program, next-generation short-read sequences were assembled into 103,354 scaffolds, with N50 length of 1,132 bp and with mean length of 736 bp. The distribution of scaffolds is shown in . In total, there were 19,236 scaffolds coding for transcripts longer than 1 kb and 5,897 scaffolds coding for transcripts longer than 2 kb. The scaffolds were subjected to cluster and assembly analyses. A total of 68,229 unigenes were obtained, among which 6,375 genes (9.34%) were greater than 1kb. The length distributions of unigenes are shown in , revealing that more than 20,000 unigenes (~29.3%) are greater than 500 bp. An overview of the assembled scaffolds and unigenes is presented in . These results demonstrated the effectiveness of Illumina pyrosequencing in rapidly capturing a large portion of the transcriptome. As expected for a randomly fragmented transcriptome, there was a positive relationship between the length of a given unigene and the number of reads assembled into it (). To facilitate the access and utilization of the bamboo transcriptome sequencing data, we have uploaded all the data including the unigene sequences, annotations and relatively highly expressed genes to the ftp site (ftp.biomarker.com.cn) and the category is/zhuory/Munro_Transcriptome/Moso_Bamboo_cDNA (Please contact R. Zhuo for ftp access).
Overview of the Dendrocalamus latiflorus Munro transcriptome sequencing and assembly.
Summary of Illumina transcriptome assembly for D. latiflorus Munro.
We utilized several complementary approaches to annotate the assembled sequences. The unigenes were annotated by aligning with the deposited ones in diverse protein databases including National Center for Biotechnology Information (NCBI) non-redundant protein (Nr) database, NCBI non-redundant nucleotide sequence (Nt) database, UniProt/Swiss-Prot, Kyoto Encyclopedia of Genes and Genomes (KEGG), Cluster of Orthologous Groups of proteins (COG) and UniProt/TrEMBL. The best one was selected from the matches with an E-value of less than 10−5
. The overall functional annotation was depicted in . First, a sequence similarity search was conducted against the NCBI Nr and Nt database, and Swiss-Prot protein database using the Basic Local Alignment Search Tool (BLAST) algorithm specifying E-values of less than 10−5
. The analysis indicated that of the 68,229 unigenes, 46,087 (67.5%) had significant matches in the Nr database and 52,660 (77%) had significant matches in the Nt database while 28,165 (41.2%) unigenes had similarity to proteins in the Swiss-Prot database. Altogether, 54,884 (78.9%) unigenes were successfully annotated in the Nr, Nt, Swiss-Prot, KEGG, COG and TrEMBL databases listed in Table S1
. Gui et al. (2010) produced 1.2 Mb of tetraploid Moso bamboo sequences from 13 bacterial artificial chromosome (BAC) clones, with 46% of 112 non-TE-related protein-coding genes predicted to be protein-encoding genes and displaying high similarity to genes of other plants deposited in the NCBI Genebank 
. The significance of the BLAST comparison depends in part on the length of the query sequence. Short reads obtained from sequencing would rarely be matched to known genes 
. The low percentage (21.1%) of unmapped unigenes that can be assigned a putative function might be mainly due to the short sequence reads generated by the sequencing technology and the relatively short sequences of the resulting unigenes, most of which probably lack the conserved functional domains 
. Another possible reason is that some of these unigenes might be non-coding RNAs. The insufficient sequences of bamboo in public databases also influence the annotation efficiency 
. Gui et al. (2010) showed that although both rice and sorghum exhibit high genomic synteny with bamboo, the comparison of the two bamboo-rice-sorghum syntenic regions demonstrated that some Moso bamboo genes seemed to have been lost or moved to other genomic regions after the divergence of bamboo from other members 
. Meanwhile, some Moso bamboo genes have no hits to the syntenic region or even other regions of rice and sorghum, suggesting they might be bamboo-specific genes 
. Therefore, generation of large collection of bamboo unigenes and ESTs is of great necessity for the bamboo research. Above all, these results demonstrated the reliability of Illumina paired-end sequencing and de novo
assembly. According to the reads per kilo bases per million reads (RPKM) values, 5,000 genes were chosen to be high-expressed unigenes, among which 4,918 unigenes were annotated by NCBI Nr protein database (Table S2
Functional annotation of the D. latiflorus Munro transcriptome.
Based on Nr annotation, Gene Ontology (GO) 
analysis was carried out, which provides a dynamic, controlled vocabulary and hierarchical relationships for the representation of information on molecular function, cellular component and biological process, allowing a coherent annotation of gene products. There were 46,087 unigenes annotated in Nr, among which 11,921 unigenes were assigned with one or more GO terms, with 34.8% for biological processes, 50.6% for molecular functions, and 14.6% for cellular components (). For biological processes, genes involved in physiological processes (GO: 0008152) and cellular processes (GO: 0009987) were highly represented. For molecular functions, binding activity (GO: 0005488) were the most represented GO term, followed by enzyme activity (GO: 0003824). Regarding cellular components, the most represented category was cells (GO: 0005623) ().
Functional annotation of assembled sequences based on gene ontology (GO) categorization.
In addition, all unigenes were subjected to a search against the COG database for functional prediction and classification. Overall, 10,147 of the 68,229 sequences showing a hit with the Nr database could be assigned to COG classifications (). COG-annotated putative proteins were functionally classified into at least 25 protein families involved in cellular structure, biochemistry metabolism, molecular processing, signal transduction and so on (). The cluster for general function prediction (2,673; 26.34%) represented the largest group, followed by replication, recombination and repair (1,359; 13.39%), transcription (1,319; 13%), signal transduction mechanisms (1,096, 10.8%), translation, ribosomal structure and biogenesis (1,004; 9.89%), posttranslational modification, protein turnover and chaperones (964; 9.5%), carbohydrate transport and metabolism (831, 8.19%), amino acid transport and metabolism (673; 6.63%), energy production and conversion (538; 5.3%), and whereas only a few unigenes were assigned to nuclear structure and extracellular structure (18 and 5 unigenes, respectively). In addition, 368 unigenes were assigned to cell wall/membrane/envelope biogenesis and 248 unigenes were assigned to intracellular trafficking, secretion and vesicular transport ().
Clusters of orthologous groups (COG) classification.
To further demonstrate the usefulness of Ma bamboo unigenes generated in the present study, we identified biochemical pathways represented by the unigene collection. Annotations of Ma bamboo unigenes were fed into the KEGG Pathway Tools, which is an alternative approach to categorize genes functions with the emphasis on biochemical pathways. This process predicted a total of 292 pathways represented by a total of 45,649 unigenes. Summary of the sequences involved in these pathways was included in Table S3
. These predicted pathways represented the majority of plant biochemical pathways for compound biosynthesis, degradation, utilization, and assimilation, and pathways involved in the processes of detoxification and generation of precursor metabolites and energy. Enzymes catalyzing almost all steps in several major plant metabolic pathways including the Calvin cycle, glycolysis, gluconeogenesis, the pentose phosphate pathway, and several important secondary metabolite biosynthesis pathways including carotenoid biosynthesis and flavonoid and anthocyanin biosynthesis, could be represented by unigenes derived from the Ma bamboo dataset. Moreover, genes involved in several signaling pathway including the p53, mammalian target of rapamycin (mTOR), vascular endothelial growth factor (VEGF) and mitogen-activated protein kinase (MAPK) signaling pathway, were also found in the unigene collection.
Comparative Analysis with Moso Bamboo and Other Grasses
To take a snapshot on the relationship between Moso bamboo and Ma bamboo in terms of orthology, or to identify proteins or pathways that might be unique to one of the two species, we did a new BLAST operation between the two datasets. A search for nucleotide sequence similarity with a relatively high stringency (E-value <1e-10 in BLASTn) showed that 1.6% or 870 of 54,884 unigenes had a significant match to Moso bamboo 10,608 putative FL-cDNAs. These unigenes were subjected to a search against the COG database for functional prediction and classification and there were 268 unigenes which could be assigned to COG classifications. The largest group was the cluster for general function prediction (59; 22%), followed by translation, ribosomal structure and biogenesis (55; 20.5%), and posttranslational modification, protein turnover and chaperones (40; 15%). The rest of these highly matched unigenes were predicted to play roles in energy production and conversion, and the transport and metabolism of carbohydrates, amino acids, nucleotides and lipids. Also, we subjected these highly matched unigenes to other databases including the NCBI Nr, NCBI Nt, SwissProt, and GO seqdb databases. The detailed results are listed in Table S4
. It is noteworthy that a large number of unigenes had no hits to the Moso FL-cDNAs. This may be explained by the fact that the Moso cDNA database is incomplete in the sense that it was constructed only from shoots, leaves, and roots of germinating seeds. Nonetheless many of the unmatched Ma bamboo unigenes may be indeed unique. The functions of these unigenes remain to be further characterized.
Given the wealth of rice and millet genome data, we examined at first the proportion of Ma bamboo unigenes that matched rice and millet databases at the nucleotide sequence level by a sequence similarity search. The search for nucleotide sequence similarity with a relatively high stringency (E-value <1e-10 in BLASTn) showed that 42,154 (61.78%) Ma bamboo unigenes had similarity hits to rice transcripts and 40,883 (59.92%) unigenes had similarity hits to millet transcripts. Among these aligned sequences, 37,379 unigenes had similarities with both rice and millet, while 4,775 unigenes only had similarity hits to rice and 3,504 unigenes only had similarity hits to millet (). A total of 22,571 (30.08%) bamboo unigenes did not match any of rice and millet sequences which were presumably Ma bamboo unique (). The detailed results are listed in Table S5
Ma bamboo unigene similarity comparison with rice and millet and functional classification by GO analysis.
Based on the above similarity search, we then conducted GO analysis to compare the functional classification between two groups of Ma bamboo unigenes, one including shared homologs with rice and millet and the other presumably being unique to Ma bamboo (). The detailed results were listed in Table S5
. In all, among 37,379 shared homologs, there were 10,740 unigenes which were assigned with one or more GO terms. The GO analysis showed that for biological processes, genes involved in physiological processes (GO: 0008152) and cellular processes (GO: 0009987) were highly represented. For molecular functions, binding activity (GO: 0005488) were the most represented GO term, followed by enzyme activity (GO: 0003824). Regarding cellular components, the most represented category was cells (GO: 0005623) (). Only 423 of 22,571 unigenes predicted to be unique to Ma bamboo were annotated by GO analysis displaying a similar trend to the annotated shared homologs. The low annotation percent is probably due to the relatively small fraction of high-quality sequence-finished bamboo genes deposited and annotated in public databases, especially compared with rice and Arabidopsis. Bamboo is famous for its fast growth rate and high woodiness. Cui et al (2012) identified 213 spots differentially expressed in culm development by MALDI-TOF/TOF MS which were involved in many physiological and metabolic processes including carbohydrate metabolism, cell division, cell expansion, protein synthesis, amino acid metabolism and redox homeostasis 
. Therefore we postulate that differences in the expression profile and the function allocation of the Ma bamboo unigenes with sequence similarity hits to rice and millet concurrently contribute to the divergence of bamboo from other grasses. Also, many of the predicted genes that are unique to Ma bamboo, and which encode proteins that are predominantly associated with binding activities and catalytic functions or that are involved in physiological processes, are likely to have defined bamboo as the plant species it is today. Although we cannot yet be certain about which genes precisely define bamboo as a species, we are convinced that the large bulk of predicted unigenes unique to Ma bamboo represent a valuable resource to explore Ma bamboo gene diversity and to allow for comparative genomic studies among grasses.
Functional Genes Involved in Lignin Biosynthesis
As is well known, the lignin content of bamboo is higher than most herbaceous plants 
, which may be the result of differences in the number or level of expression of key enzymes involved in lignin biosynthesis. We identified 105 unigenes from the 54,884 unigenes encoding eight key enzymes involved in lignin biosynthesis (KEGG PATH: ko00940, http://www.genome.jp/kegg/
) (Table S6
). For each enzyme, we compared the number of putatively Ma bamboo unigenes in this study with the number of Moso bamboo FL-cDNAs identified from the Moso cDNA database and rice genes identified from the genome sequences (). Peng et al (2010) predicted 26 transcripts encoding 4-coumarate-CoA ligase (4CL) and 23 transcripts encoding laccase from the rice genome database 
. Our results show that these two genes are also the most abundant ones involved in Ma bamboo lignin synthesis. However, the numbers of transcripts encoding for these two genes in Ma bamboo were higher than those in rice. There were 34 transcripts found for laccase and 35 transcripts coding for 4-coumarate-CoA ligase (4CL), which would partially contribute to the increased bamboo lignin content in comparison to rice. Among the 10,608 FL-cDNAs in Moso bamboo, the numbers of FL-cDNAs encoding 4CL and laccase were two and five respectively 
. The significant difference between Ma bamboo and Moso bamboo is not surprising because the 10,608 Moso FL-cDNAs actually represent only one third to one fourth of the estimated total of Moso bamboo genes. Also, the Moso FL-cDNA libraries were constructed from shoots, leaves and roots from germinating seeds which were not the most representative tissues for high lignin content, whereas our materials used for the transcriptome sequencing covered as many tissues as possible including culms of different developing periods. As previously reported, laccase had been found to display high expression in lignifying tissues 
. Therefore, the spatial distributions of genes related to lignin biosynthesis also influence the results. Of course, the differences among different species also count in the analysis. It is interesting that we could not detect unigenes coding for 5-hydroxyconiferyl aldehyde O-methyltransferase (AldOMT), which means that other genes encoding alternative methyltransferases, substituting for AldOMT activity, may exist in Ma bamboo. This needs to be further characterized. The above results indicate that lignin biosynthesis in Ma bamboo may follow yet unknown routes or pathways and that lignin synthesis in Ma bamboo displays unique features. However, as lignin biosynthesis is a complex process involving numerous factors, our analysis by transcriptome sequencing only elucidates part of the picture and hence it is difficult to draw precise conclusions. Clearly, additional studies deploying accurate molecular and proteomic analysis procedures are required to validate and further build on our predictions.
Number of bamboo FL-cDNAs and number of genes found in the rice genome that encode nine key enzymes in the lignin biosynthesis pathway.
Functional Genes Involved in Growth and Development
For many agricultural plants like bamboo, economic traits like growth and development are of particular interest to researchers. The sequence and annotation information from BLAST, GO and KEGG annotations all provided valuable gene sources for the study of molecular basis that underline these economic traits of Ma bamboo. Among them, genes encoding different groups of growth factors and their receptors involved in cell growth were identified, such as epidermal growth factor domains and receptors, transforming growth factors and receptors, hepatocyte growth factors and receptors and fibroblast growth factor and receptors.
It has been reported that many transcription factor families play vital roles in plant growth, development and immunity 
. From BLASTn analysis, we identified sets of unigenes that have putative functions as transcription factors including those belonging to the zinc finger protein family, F-box family, WD repeat-containing protein family, Myb family, WRKY family, MADS-box family, GATA family, etc. These putative transcription factors likely play specific and diverse roles in regulating gene expression levels endowing bamboo with unique features.
Unlike other plants, bamboo flowering is an elusive physiological phenomena, because it is unpredictable, long-periodic and uncontrollable. We have identified genes belonging to zinc finger protein family, WD repeat-containing protein family and MADS-box family which are thought to be correlated with bamboo flowering 
. Also, some other genes involved in flowering such as those coding for polycomb group protein 
, YABBY protein 
, phytochrome 
and histone deacetylase 
, were identified in our Illumina dataset.
As a plant which has gained reputation as a major resource of non-wood fiber, genes related to molecular mechanism of its fiber development are yet to be explored. Some fiber related genes of bamboo 
, such as those encoding a kinase-like protein involved in fiber initiation, a heat shock protein HSP82 involved in fiber elongation, or an eukaryotic initiation factor 4A involved in fiber maturation, were identified in our dataset. Further studies are required to identify the genes associated with fiber development in particular those that contribute to the outstanding fiber characteristics of bamboo.
Moreover, genes encoding plant hormones could be identified in our dataset. As is well known, plant hormones determine the formation of flowers, stems, leaves, the shedding of leaves, and the development and ripening of fruit. There are 425 unigenes involved in the auxin pathway such as auxin-responsive proteins IAA and auxin response factors. We also identified 263 unigenes for the ethylene pathway and 92 unigenes related to the gibberellin pathway.
Overall, functional analysis of our Illumina dataset identified candidate genes potentially involved in growth, development, plant signal pathways and regulatory networks.
SSRs are highly informative and widely used for genetics, evolution and breeding studies. It has been reported that approximately 3–7% of expressed genes contain putative SSR motifs, mainly within the un-translated regions of the mRNA 
. SSRs within gene sequences may have different putative functions, for example, SSR variations in 5′-untranslated regions (UTRs) could regulate gene expression by affecting transcription and translation; SSR expansions in the 3′-UTRs cause transcription slippage and produce expanded mRNA; intronic SSRs can affect gene transcription, mRNA splicing, or export to cytoplasm; SSRs within genes should be subjected to stronger selective pressure than other genomic regions 
. To explore SSR profiles in the unigenes of Ma bamboo, the 6,375 unigene sequences were submitted to an online service to search for SSRs 
. In total, 621 SSRs were obtained from 824 unigenes (12.8%) with 121 unigene sequences containing more than one SSR, among which tri-nucleotide repeat motif was the most abundant, accounting for 56.03%, followed by di-nucleotide repeat motif (33.33%), tetra-nucleotide (2.41%), penta-nucleotide (1.12%), and hexa-nucleotide (0.48%) repeat units (). The relative percentage of unigenes containing SSRs was much smaller than that of Moso bamboo (24%), which might be attributed to sequencing sampling or species difference 
Summary of simple sequence repeat (SSR) types in the D. latiflorus Munro transcriptome.
The AG/GA/CT/TC motifs accounted for approximately half of the total number of di-nucleotide SSRs, similar to that of Huperzia serrata
. Among the di-nucleotide repeat motifs, CT repeats were the most common, which is different from that of H. serrate
in which AG repeats were the most frequent. This may be due to the introduction of additional repeats during chromosome replication 
. It was reported that (CT)n
may function as an enhancer due to that fact that the same motif (TCTCTCTCT) was found in a 60-nt region downstream of the transcription start site of CaMV 35S RNA, which can enhance gene translation in plant protoplasts 
. Furthermore, as complementary sequences to (CT)n, (GA)n serves as regulatory elements that contain a series of overlapped GAG motifs (AGAGAGa) involved in light regulation 
. When compared with the frequency of di- or tri-nucleotide motif of SSRs among the unigenes of Ma bamboo, the results were coincidence with those of Arabidopsis
, rice and Moso bamboo, in which the type and distribution of tri-nucleotide SSRs were also the most abundant 
. The most common motif for tri-nucleotide repeats of SSRs were CTC/CCT/TCC/GCT/GCA/GGA (38% of tri-nucleotides) and CGG/CGC/GGC (10.6% of tri-nucleotides), which are similar to those of Moso bamboo and rice. This phenomenon is perhaps correlated with the higher G+C content of grasses and may have allowed more frequent insertion/deletion of certain nucleotides, without causing frame shift mutations 
SSRs were developed as powerful molecular markers for comparative genetic mapping and genotyping since they are ubiquitous in transcriptomes, typically locus-specific and co-dominant, multi-allelic, highly polymorphic, and transportable among species within genera 
. EST databases have been a rich source of SSRs for genotyping in numerous species of flowering plants 
. The unigenes obtained from Ma bamboo have provided a good resource for SSR mining and applications in research and molecular marker-assistant breeding.
This work presents the first de novo transcriptome sequencing analysis of mixed RNA from Ma bamboo flowers, seeds and different tissues (root, leaf, shoot, stem) using the Illumina platform. 2.2 Gbp of data were generated and assembled into 68,229 unigenes. A large number of candidate genes potentially involved in growth, development, flowering and plant hormone pathways were identified, and are worthy of further investigation. Ma bamboo unigenes related to lignin biosynthesis were characterized and their sequences compared to the sequence databases of rice, millet, and Moso bamboo. Orthologous sequences and unigenes unique to Ma bamboo were preliminary classified. A large number of SSRs were also identified and ready for marker development. To our knowledge, this is the first application of Illumina paired-end sequencing technology to investigate the whole transcriptome of Ma bamboo and moreover the assembly of the reads was conducted without a reference genome. The dataset will improve our understanding of the molecular mechanisms of fiber development, lignin biosynthesis, flowering, and other biochemical processes in Ma bamboo. This resource should lay an important foundation for future genetic or genomic studies on bamboo species and will help to close a critical gap existing in grass comparative genomics and consequently allow the more efficient development of the grass system for evolutionary and functional studies of plant genes and genomes.