Transcriptome sequencing and assembly
Pyrosequencing of cDNA on a 454 GS FLX Titanium (Roche) generated a total of 146,267 raw reads, with an average length of 408 bp. After filtering for adaptors, primer and low-quality sequences, 5,588 reads were removed resulting in 140,679 high quality reads corresponding to 96% of the first raw sequences, representing approximately 60 Mbp. Raw data (>200 bp) were deposited in NCBI Sequence Read Archive (SRA) under the accession number SRA049632.2.
By using Newbler Software v. 2.5 (Roche, IN, USA); a total of 111,814 sequences were
de novo assembled into 3,394 contiguous sequences (contigs). Overlapping contigs were assembled into 3,005 isotigs (equivalent to unique RNA transcripts). In addition, isotigs originating from the same contig-graph were grouped into 2,722 isogroups (equivalent to genomic locus) by Newbler, potentially reflecting multiple splice variants. About 28,861 reads not assembled into isotigs were clustered using CD-HIT-454 algorithm to eliminate artificial duplicates leaving 21,881 singletons, summing up a total of 24,886 non-redundant sequences or unigenes (Table
). All unigene sequences (isotigs and singletons >200 bp) were deposited to the Transcriptome Shotgun Assembly (TSA) database, accession numbers JT763459-JT784547. Isotig length ranged from 66 bp to 7,093 bp, with an overall average length of 765

±

537 bp (Figure
A). More than 83% of the isotigs were 66 to 1,000 bp long and 50% of the assembled bases were incorporated into isotigs greater than 589 bp. The average length of
N. nervosa isotigs (765 bp) was larger than those assembled in other non model organisms (e.g.197 bp
[
22], 440 bp
[
23], 500 bp
[
24]; 535 bp,
[
25]), and similar to the average isotig length described in
Bituminaria bituminosa (707 bp
[
26]).
| Table 1N. nervosa transcriptome annotation summary |
The coverage depth for isotigs ranged from 2 to 19, with an average of 9 contigs assembled into each isotig, which is larger than the averages obtained in other 454 transcriptome analyses (mean

=

2.1,
[
24,
25]).
The length distribution of the 21,881 singletons ranged from 50 to 711 bp with an overall average length of 369.6 bp (Figure
B). The length of 86% of the singletons was shorter than 500 bp.
Functional annotation
All unique sequences were subjected to BLASTX similarity search against the NR protein database (National Center for Biotechnology Information, NCBI), with a Viridiplantae filter, to assign a putative function
[
27].
Under an E-value threshold of <10
-10, a total of 2,762 isotigs (92% of total isotigs) and 12,735 singletons sequences (58% of total singletons) had significant BLASTX matches (Table
). The frequency of annotated isotigs was significantly higher than the values previously reported for
de novo transcriptome assemblies of eukaryotes that range from 20 to 40%
[
22-
25].
In total, 15,497 unique sequences had at least one hit, while the remaining sequences 9,389 (38%) exhibited less significant matches (e-value

>

10
−10) but still informative for identifying putative biological functions in future studies in this species. We also performed a BLASTX against the NCBI - NR protein database to retrieve sequences that did not show BLAST hits against Viridiplantae NCBI, which summed up some few new hits (81), but not adding any other valuable annotations.
The majority of matched sequences exhibited high similarity to Vitis vinífera (41%), and Populus trichocarpa (38%) sequences. The top-hit species distribution of BLAST matches is shown in Figure
.
Annotation and mapping routines were run with BLAST2GO platform
[
28]. Sequences with a positive BLAST match were annotated using Gene Ontology terms (GO) and Enzyme Commission categories (i.e. EC numbers). Thus, GO terms were assigned to 2,238 isotigs (74%) and 9,596 singletons (44%) totalizing 11,834 GO terms (Table
).
Of the 11,834 GO annotated isotigs and singletons sequences, most were assigned to “Biological Processes” (7,926 terms), “Molecular functions” (8,229 terms) and “Cellular Components” (9,206 terms), (Figure
).
BLAST2GO analysis at process level 2, showed that among 21 different biological processes most of the transcripts belonged either to “Metabolic Processes” (5,823), to “Cellular Processes” (5,090) and to “Response to Stimuli” (1,493), of which 756 were putative stress-response genes (Figure
A).
Likewise, the molecular function category subdivided annotated sequences into binding (6,985), catalytic activity (5,658) and transporters (689) as the most represented (Figure
B).
A detailed BLAST2GO analysis (level 2) at the cellular component category, sorted all transcripts from N. nervosa into 5 groups being the most representative: cell (7,304), organelle (4,822) and macromolecular complex component (1,136) (Figure
C).
In order to more precisely compare the similarity of
N. nervosa genes with those of the Fagaceae family (from Fagaceae Genomics Web [
http://www.fagaceae.org/]),
N. nervosa unigenes were subjected to BLAT (dnax) search against 2,407,823 contigs and singletons from American Beech (
Fagus grandiflora), American Chestnut (
Castanea dentate), Chinese Chestnut (
Castanea mollisima) and oak species (
Quercus rubra and
Q. alba). Eighty-two percent of the
N. nervosa expressed sequences exhibited high similarity to Fagaceae genes. A total of 4,447 (18%) sequences did not show matches against Fagaceae sequences, from which there were 82 isotigs and 4,365 singletons. Among them, 12 isotigs and 490 singletons had distinctive GO annotation, which could be considered as novel genes for this large group of tree species (Table
). Most interestingly, from these transcripts 21 were found to be potentially new genes for stress response (data not shown).
Of the 11,834 sequences annotated with GO terms, 2,355 were assigned with EC numbers (931 isotigs and 1,424 singletons) (Table
).
The most represented enzymes in all sequences are shown in Figure
: transferase activity (37%), hydrolase activity (35%) and oxidoreductase activity (13%) were the most abundant.
To further enhance the annotation of
N. nervosa transcriptome dataset, the 11,834 genes with GO terms were mapped to KEGG using KEGG automatic annotation server (KAAS)
[
29]. The identified 58 metabolic pathways include: purine metabolism (411), thiamine metabolism (405), T cell receptor signalling pathway (115), biosynthesis of secondary metabolites (58), and microbial metabolism in diverse environments (37) (see Additional file
1).
We detected as much as 861 chloroplast (cp) sequences (150 in isotigs and 711 in singletons), corresponding to a quite high rate (7%), but this value was within the 2 to 10% found in cDNA libraries from all tissue types, as reported in a study conducted in oak
[
30].
The number of annotated isotigs in this study was comparatively larger than that obtained in other similar studies
[
22-
25]. These results could be associated with the high quality and small number of assembled isotigs, which potentially corresponds to highly expressed genes. Also the use of specific plant protein sequences and close related Fagaceae database possibly increased the BLAST hits. The first assumption comprises technical issues such as a high percentage of isotigs that was greater than ~600 bp length and with good coverage depth. Moreover, the small number of isotigs would be detecting the most represented and known expressed genes, as it was also shown in the analyses of
B. bituminosa leaf transcriptome (89.1% annotated contigs)
[
26]. Proportions of best hits in major GO category were generally similar to those found in this species, for example, binding 48% and catalytic activity 37% in the
N. nervosa transcriptome survey versus 37% and 37% respectively for the same categories in
B. bituminosa.The second statement relies on the annotation approach based on the search against the Viridiplantae protein database. This strategy allows to more likely finding BLAST hits above the cut off value. In addition, a higher percentage of reliable annotated isotigs was found when the searched was carried out against the Fagaceae protein sequence dataset (Table
). The favorable effect of using specific databases for annotation was also reported for other authors
[
31-
33].
Besides, the lower percentage of singletons that were annotated was likely due to the high frequency of short length sequences, also reported in recent studies
[
24,
34]
. Fifty percent of non-annotated singletons were shorter than 370 bp (data not shown), whereas the 50% in annotated singletons were longer than 454 bp. Similar results were obtained in
Pinus contorta where only 5% of contigs and singletons had BLAST matches when the length of the sequences was less than 250 bp
[
24]. Nonetheless, many singletons were good quality reads and matched to proteins in BLAST searches representing together with the isotigs, a great source of information.
Summarizing, the frequency of annotated isotigs and singletons was significantly higher than previously reported for new generation sequencing
de novo transcriptome assemblies of trees like
Pinus contorta[
24], or two oaks species,
Quercus petraea and
Q. robur[
30], even though the high stringency of BLASTX analysis.
If we assume that the average number of genes encoded in a plant nuclear genome is about 30 thousands (as estimated from seven completely sequenced genomes)
[
34], our annotated dataset likely represents a half of the
N. nervosa genes catalogue.
In order to test the presence of expressed repetitive sequences, BLASTN (e-value cut off

≤

10e
-50) searches were performed against all Viridiplantae Repbase (reference database of eukaryotic repetitive DNA). A total of 374 repetitive DNA sequences were found (57 in isotigs and 317 in singletons). From all the rRNA sequences, 255 corresponded to small subunit rRNA (SSUrRNA), 102 to large subunit rRNA (LSUrRNA) and 17 to transposable elements. Similar numbers of retrotransposon were observed in other plant species (e.g. 15 in
Populus tremula and
Pinus pinaster)
[
24]. However, in
Fagopyrum esculentum and
Pinus contorta much more transcribed retrotransposable elements were found in the different tissues sampled
[
24,
34].
Characterization of microsatellite motives
As expected, the most frequent type of microsatellite corresponded to trimeric (37.4%) and dimeric motives (32.3%), being tetra-, penta- and hexanucleotide repeats present at much lower frequencies (16.3%, 5.2% and 8.8% respectively, Figure
). Similar results were found in oak
[
30] (36.6% for trimeric and 36.2%, for dimeric motives) with the minimum repeat number of five and four for di- and tri-microsatellites, respectively.
SSR motif combinations can be grouped into unique classes based on DNA base complementarities. For example, dinucleotides were grouped into the following four unique classes: AT/TA; AG/GA/CT/TC; AC/CA/TG/GT and GC/CG. Thus, the numbers of unique classes possible for di-, tri- and tetra-nucleotide repeats are 4, 10, and 33, respectively
[
36,
37]. The AG/CT group was the predominant class (56.2%) of the dinucleotide repeats, whereas AT (29.2%), AC (14.5%) and CG (0.1%) groups were less represented. The frequency of AG was similar to the highest value reported by Kumpatla
[
38] (14.6%–54.5% of the total SSRs observed in 55 dicotyledonous species) but lower than that found in Oak (70.5%)
[
30] and eucalypts (91%)
[
39].
The most frequent trimeric SSR motives were the AAG/CTT (27.8%), ATG/CAT (15.2), AGC/GCT (12.6%) and AGG/CCT (11.6%), similar to the first category found in oak (26.8%)
[
30]. Within tetrameric motives, AAAT repeat was found to be the most abundant (32.9%), followed by AAAG (22.7%) and AACA (11.6%).
The topography of SSR distribution was analyzed for SSR presence within UTRs and coding sequence regions. About 45% of the SSR sequences were inside ORF sequences. Most trinucleotide repeats were found in ORFs (52%), while dinucleotides were more frequent in the UTRs (40%), similar to those reported in oak
[
30] and pines
[
40]. It is expected that tri- and hexanucleotide repeats would occur more frequently than other motifs in coding sequences. Such dominance of triplets over other repeats in coding regions may be explained on the basis of the selective disadvantage of non-trimeric SSR variants in coding regions, possibly causing frame-shift mutations
[
41].
Validation of the predicted microsatellite markers
Seventy three microsatellites were selected according to their sequence length, GC content and functional annotation related to abiotic stress category.
From these, 57% were located in coding regions. The 73 loci were tested for successful PCR amplification in six individuals
. All of them were effectively amplified validating the quality of the assembly and the utility of the SSRs produced. A similar research carried using Illumina sequencing technology in sesame showed that about 90% primer pairs successfully amplified DNA fragments
[
42]. On the other side, the rate of SSR validation was lower (64.9%) when the marker mining was done using EST produced by Sanger technology
[
39] possibly because of low-quality EST sequences, and/or primer sequences derived from chimerical cDNA clones.
About 20% (14 SSR) of the tested Nothofagus SSRs were polymorphic and showed at least one individual that differed in allelic composition.
This relative low percentage of polymorphic loci could be explained because of the small sample size tested (six seedlings), in contrast to the 46% found in
E. globulus[
39] evaluated in 8 samples, and the 80% found in sesame
[
42] essayed in 24 samples.
Nine of the polymorphic SSR found in this work were located within predicted ORF and seven had repeat motives multiple of three (Table
), according to their presence in coding regions
[
41].
| Table 2Polymorphic SSRs primer pairs derived from N. nervosa unigenes |