Our analysis of 148.4 Mbp of E. grandis
expressed sequences generated with three 454 sequencing runs demonstrates that short reads produced with pyrosequencing technology can be assembled de novo
into reasonably long contigs; an advantage for species with limited public genomic resources. The 25.4 Mbp comprised in our unigene set represents an enrichment of 37× in the amount of publicly available E. grandis
expressed sequences, and will provide substantial support for the genome sequencing and annotation that are currently under way [11
]. Longer reads from the GS-FLX were essential to assemble a reasonable proportion of biologically relevant contig sequences. Although our sequencing effort produced three times more sequence (≈ 148 Mbp) than the control dideoxy-based EST library (Sanger, ≈ 45 Mbp), the 454 short read-based contigs average only one third of the size. Despite this limitation, a substantial gain occurs at the level of gene discovery – the large number of reads in our 454 dataset leads to sampling of sequences from a much broader variety of genes. Therefore, 454-based gene discovery projects represent a viable and, perhaps favorable alternative to Sanger-based sequencing of EST libraries when a diverse sampling of genes is more important than obtaining transcript-length contigs. As GS-FLX becomes the standard 454 pyrosequencing platform, large-scale EST sequencing and gene discovery projects will be more successful in assembly of transcript-length contigs.
We also demonstrated that the detection of valid SNPs is possible by sequencing a pooled sample of highly heterozygous genotypes. By aligning the reads derived from cDNA of 21 E. grandis
genotypes, we were able to detect 23,742 SNPs and validate 83% of a sample of 337. Therefore, approximately 4,000 of the detected SNPs may be false positives, possibly arising from sequencing errors or alignment of paralogs. Paralogs that share high levels of sequence similarity may have been assembled in the same contig because they cannot be distinguished due to the short read length of 454 pyrosequencing. This would lead to the detection of false SNPs. On the other hand, a higher stringency of assembly raises the possibility of the opposite problem: the separate assembly of haplotypes from highly polymorphic genes. However, the assessment of the EST assembly quality in this study is difficult due to the lack of a reference genome sequence. Nonetheless, the validation rate observed here was similar to the 85% reported for maize, where SNPs were detected by comparing sequences from two separate 454 runs, each interrogating a different inbred homozygous line [7
A significant number of polymorphisms in our sample may have been overlooked because of the stringent methodology used to declare SNPs. For contigs with reasonable sequence coverage (length > 200 bp) and depth (> 10 reads on average per nucleotide) we detected one SNP for every 192 bp on average. This rate of SNP discovery appears low compared to previous reports of one SNP every ≈ 100 bp in coding sequences of Eucalyptus
] and other forest species [20
]. There are at least three foreseeable reasons for missing true SNPs that are intrinsic to our experimental methodology and SNP detection approach. First, because 454 was used to randomly sequence cDNAs, not all genotypes were sequenced for every SNP locus – in fact, the average sequencing depth (6.7 reads/bp) in our experiment is far lower than the number of possible haplotypes in our sample. Secondly, the requirement that a SNP be observed in at least 10% of the reads aligned to a polymorphic site compromised the sensitivity required to detect rare alleles. Studies have demonstrated an excess of rare alleles in natural populations of forest tree species [22
] which are probably largely discarded by our approach. Additionally, there is a relatively high level of relatedness among the 21 individuals utilized in our study. The sampled genotypes come from seven open-pollinated families – i.e. seven groups of three individuals share one common maternal parent, limiting the genetic diversity sampled relative to what might be present in a similarly sized sample of unrelated trees. Finally, the detection of polymorphisms was likely hindered by the requirement that at least two reads containing the variant alleles have at least 20 nucleotides of conserved sequence upstream and downstream of the locus. This requirement is intended to minimize the discovery of false polymorphisms due to the alignment of paralogs – a potentially significant problem when aligning short sequence reads. Therefore, only nucleotide variants in relatively conserved or recently derived paralogs may have been incorrectly identified as SNPs. The drawback is that true SNPs in hotspots of genetic diversity or genes under high diversifying selection (e.g. disease-resistance genes, [26
]) may be discarded. Considering the high diversity found in forest trees [20
], the requirement for sequence conservation in regions surrounding a SNP may be too conservative.
High throughput DNA sequencing and SNP discovery from a pool of multiple genotypes may be a powerful approach for rapid assessment of genetic diversity and selection in a genomic scale. Genome-wide surveys of genetic diversity have only recently been reported in the model plant Arabidopsis
]. Here we attempted to generate an approximate estimate of genetic diversity for a broad sample of genes by adapting existing parameters of genetic diversity to our experimental methodology. The most commonly used measure of genetic diversity, the nucleotide diversity parameter theta (θ), is an estimate of the number of polymorphic sites in a random haplotype sample from a population and is independent of the allele frequencies [17
]. In our study, the number of independent haplotypes sampled is unknown, as the same haplotype may be sampled repeatedly. Therefore we developed a modified nucleotide diversity estimator beta (β) based on SNPs detected by randomly sequencing a multi-genotype cDNA pool. β is always underestimated relative to θ, because the number of reads at a given SNP position is always equal or smaller than the effective (true) number of genotypes. The lower sensitivity to detect SNPs discussed previously also contributes to the underestimation of β. Therefore, the average β reported here (0.00186) is, as expected, much lower than the average θ estimated for genes of Populus
by 9× [23
], Douglas fir by 3.8× [24
], loblolly pine by 2.2–2.7× [20
] and Norway Spruce by 1.7× [22
]. Although it is not possible to compare estimates of β to the commonly used θ, they are useful for a comparison of nucleotide diversity among genes in this project.
In addition to the nucleotide diversity (β) we also estimated the proportion of non-synonymous to synonymous substitutions (Ka/Ks). As observed in other plant species [13
] most genes of E. grandis
appear to be under purifying selection and, accordingly, Ka/Ks distribution averages 0.30 and is heavily right-skewed. Among genes predicted to be under strong purifying selection based on the Ka/Ks ratio, there is enrichment for GO categories involving essential biological processes conserved across kingdoms, such as translation, ubiquitin-dependent protein degradation and nucleosome assembly by histones. rRNA and ribosomal proteins have been shown to be highly conserved between species and to evolve under strong purifying selection [31
]. Several studies also confirm that histone genes evolve constrained by negative selection [34
Among genes classified as diverse and/or under positive selection based on βN
and Ka/Ks distributions, there is enrichment for genes classified as unknown biological process, defense response and response to biotic stress, and multicellular development. Other researchers have already reported an excess of unknown function among genes under positive selection [16
]. A possible explanation for this overrepresentation is that the category of uncharacterized genes may be enriched for duplicated genes where relaxed selection constraints are leading their diversification and/or eventual silencing [38
]. In fact, the category may include pseudogenes and transcribed but untranslated loci. It is also possible that duplicated genes might have higher nucleotide diversity due to the assembly of paralogs. However we do not anticipate false SNPs to bias Ka/Ks estimates because they are not expected to occur more or less frequently in synonymous versus non-synonymous sites by chance.
Genes acting in defense response/response to biotic stimulus are frequently positively selected for diversification to compete with rapidly evolving avirulence genes of pathogens [26
]. We did find extensive diversity in non-synonymous sites (βN
) of most defense response genes, but the diversity was also high among synonymous sites and, as a result, these genes were not enriched among those under diversifying selection (measured by the Ka/Ks). Similar results were reported in another study [16
]. One possible explanation is that positive selection generally only operates in certain domains (i.e., leucine-rich repeat (LRR)) of resistance genes [26
] and we estimated Ka/Ks over the entire coding sequence. The multicellular development GO class which is enriched among genes under diversifying selection is mainly due to the presence of NAC transcription factor genes. Transcription factors have been demonstrated to have an excess of positively selected genes in humans [40
], and specifically one member of the NAC family was shown to be evolving rapidly in Arabidopsis
]. Finally, the literature survey supports our results and demonstrates that the proposed nucleotide diversity estimator (β) accurately depicts relative differences of variability within gene sequences.