A schematic representation of data processing is shown in Figure .
We produced 145,827 Sanger reads from 20 cDNA libraries (Tables and ), including ten libraries from Q. petraea
and ten from Q. robur
. There were nine Suppression Subtractive Hybridization (SSH) and 11 cDNA libraries. We used five different experimental systems (kit combinations) for library construction. Two systems were used for SSH libraries, while three systems were used for cDNA libraries. The maximum and minimum number of genotypes in a library was 60 for library D and 2 for libraries B and I, respectively. There were seven bud, seven root, four leaf and two differentiating secondary xylem libraries. All sequences were subjected to pre-processing (SURF and qualityTrimmer software, see Methods section) to remove library specific cloning-vector and adaptor sequences, to mask low complexity sequences, to eliminate contaminants and poor-quality sequences (e.g. very short reads). The resulting Sanger catalogue contained 125,886 high quality sequences that are available at the EST section of EMBL. Furthermore, all of the chromatograms can be downloaded from the SURF web site http://surf.toulouse.inra.fr/
with user name (oak) and password (oak1).
Oak (Q. petraea and Q. robur) cDNA libraries for Sanger sequencing
Sequencing statistics for libraries sequenced by the Sanger method
The sequencing success rate (defined as the number of high quality reads divided by the total number of sequences) as well as the average length of high quality sequences varied depending on libraries (Table ). Excluding the three Suppression Subtractive Hybridization (SSH) libraries (C, D, E) for which these two parameters were among the lowest, the former parameter ranged from 69% to 97.9% and the latter from 500 bp to 712 bp. Overall, the average read length of high quality sequences was 575 bp.
SURF detected 402 chloroplastic (cp) reads, corresponding to a global rate of 0.28% (Table ). cp sequences were detected in all tissue types, including leaves and buds at quite high rate, ranging from 2 to 10% (libraries C, D, Q, R), but also in root (M) and xylem (J) at a much lower rate. There were three reads that matched with E. coli
sequence. SURF tagged a total of 22,431 sequences as 'not valid' (short and/or contaminant sequences) including 1,555 and 229 sequences flagged as 'doubtful' (containing library specific short vector/adaptor sequences), and 'pcrkitful' (containing short sequences from the UniVec database), respectively. Interestingly, most of these low quality sequences were found in libraries constructed with the lambdaZap kit (A, M, P, S) as well as in three SSH libraries (C, D, E). It should also be noted that about 1/3 (544) of the 'doubtful' sequences could be attributed without ambiguity to chimeras, because they presented library specific sequences in the [30%-70%] interval of the sequence. Trace2dbEST pipeline [11
] also detected 63 reads with putative chimeras. Out of 63 chimeras judged by this pipeline, only one read was common with those detected by SURF. The difference may result from the different phred [12
] and cross_match [14
] parameters. Trace2dbEST uses phred error probability cut-off of 0.05 (which corresponds to quality value (QV) of 13), while that in SURF was 0.01 (QV = 20). The sequence regions with the specified QV were expected to be longer in sequences processed by trace2dbEST than by SURF. Theoretically, the probability to detect a chimera is higher for larger sequences. Unfortunately, trace2dbEST had less stringent parameters in cross_match compared to SURF, which decreased the rate of chimera detection. This comparison clearly shows that parameter optimization to detect possible chimera is necessary for each study objectives. For example, if the goal is to provide a global view of the transcriptome, chimeric clones do not cause large problems. However, if the goal is to bioinformatically infer full-length cDNAs, chimeric clones must be strictly eliminated.
We produced three kinds of assemblies using three different approaches (Table ). First, we processed the 134,500 Sanger reads (Table ) resulting from the trace2dbEST analysis using the PartiGene pipeline[15
] and produced 40,944 unigene elements, containing 17,499 contigs and 23,445 singletons (Table ). Contigs were defined in sequence assembly, resulting from multiple reads, while singletons were unique sequence that were not clustered with any other reads or that were not assembled with any other reads in a cluster. The distribution of sequence length (bp) of contigs and singletons is indicated in additional file 3
: Figure S1 (A). While the mode of the distribution resided in the (600-700] class for both singletons and contigs, the average and maximum length of singletons and contigs were quite different: 919 bp and 4,412 bp, respectively for the contigs, and 485 bp and 1,305 bp for the singletons. Of the 17,499 contigs, 6,271 (35.8%) contained two ESTs, 3,104 (17.7%) contained three ESTs, 1,842 (10.5%) contained four ESTs, 1,257 (7.2%) contained five ESTs and 5,025 (28.7%) contained more than five ESTs (additional file 4
: Figure S2, red bars). The average and maximum number of ESTs in a single contig was 6.3 and 510, respectively. The average GC content of this unigene sets was 41.6%. When we compared these results with other similar studies (Table ), the statistics of the oak assembly were within the range of what has been reported so far in plants. Positive relationships between the different quantities (numbers of ESTs, unigene elements, contigs, and singletons) were evident. Even though the number of ESTs collected was different for each study, the percentages of contigs within each unigene set was nearly identical (mean = 41.1%), except for the cocoa EST assembly.
Statistics for assembly by PartiGene (Sanger ESTs only), MIRA and TGICL (Sanger and 454- ESTs)
PartiGene assembly summary for libraries sequenced by the Sanger method
Comparison of EST sequencing statistics for Sanger sequencing
Second, we used the SIGENAE system (which relies on the TGICL software [16
], see Methods section) to bring together in the same analysis 125,925 Sanger and 1,578,192 454-reads. Overall, 222,671 elements (69,154 (31%) tentative consensus sequences (TCs) and 153,517 (69%) singletons; OakContigV1) were obtained (Table ). The average and maximum length of contigs was 705 bp and 7,898 bp, respectively. The distribution of sequence length (bp) of contigs and singletons is indicated in additional file 3
: Figure S1 (B). The distribution was bimodal for both contigs and singletons. The first peak for singletons resided in (200-300] class, where 56,249 GS-FLX reads (92.5% of total reads within the class) resided. The second peak for singletons laid in the (400-500] class, where 14,336 Titanium reads (95.7% of total reads within the class) resided. For contigs, the mode was located at the (200-300] class. Within this class, there were 11,269 contigs (92.0% of total contigs within the class) that were made up from GS-FLX reads only. The average and maximum depth of contigs was 22.4 and 4,927, respectively. The deepest contig was 1,336 bp and presented similarity with a chloroplast membrane protein from Mercurialis perennis
at e-value of 6e-10. Of the 69,154 TCs, 23,281 (33.7%) contained two ESTs, 8,860 (12.8%) contained three ESTs, 5,069 (7.3%) contained four ESTs and 31,944 (46.2%) contained more than four ESTs (additional file 4
: Figure S2, green bars). Overall the 69,154 TCs contained 1,550,824 sequences. Among the 69,154 TCs, 40,542 (58.6%) consisted of 454-reads only, while 1,230 (1.78%) were made up of Sanger reads only (Figure ). In total, 356,893 (22.6%) of the 454-reads did not cluster to Sanger reads (139,443 singletons plus 217,450 454-reads in 40,542 TCs supported only by 454-reads). This also means that 77.4% of the 454-reads clustered with Sanger reads. The average GC content of OakContigV1 was 39.8%.
Figure 4 Composition of contigs constructed by (A) TGICL (OakContigV1) and (B) MIRA software. When the number of Sanger reads is zero in a contig, it means that the contig is made up of only 454-reads (the blue bar at zero on the horizontal axis). On the other (more ...)
Graphical interface to browse OakContigV1 was constructed using the Ensembl tool (oak contig browser; http://genotoul-contigbrowser.toulouse.inra.fr:9092/Quercus_robur/index.html
(user: oak, pass word: quercus33)). Browsing similarity annotation, SNP alignments and data mining by BioMart are also available as described in detail in Fleury et al. [17
]. All of the data can be downloaded from the web site.
Third, we used MIRA software [18
] to produce direct 454-Sanger hybrid assembly. This analysis resulted into 116,826 unigene elements including 113,625 contigs and 3,201 singletons (Table ). There were also 189,268 so called "debris" reads, including 12,532 Sanger and 176,736 454-reads. About 54.6% (103,428 reads out of 189,268) of the sequences in the 'debris' corresponded to 67.4% of the OakContigV1 singletons. The number of Sanger and 454-reads included in assembly was 125,925 and 1,578,013, respectively. The distribution of sequence length (bp) of contigs and singletons is indicated in additional file 3
: Figure S1 (C). For contigs, the mode was located on the (200-300] class. Within this class, there were 16,601 contigs (86.6% of total contigs within the class) that were made up of GS-FLX reads only. Among the 113,625 contigs, 72,896 (64.2%) consisted of 454-reads only, while 3,582 (3.2%) were made up of Sanger reads only (Figure ). In total, 421,391 (26.7%) of the 454-reads did not cluster to Sanger reads (2,992 singletons plus 418,399 454-reads in 72,896 TCs supported by 454-reads only). This also means that 74.3% of the 454-reads clustered together with Sanger reads, a similar (although larger) value to that obtained using TGICL. The average and maximum length of contigs was 671 bp and 15,177 bp, respectively (additional file 3
: Figure S1 (C)). The average and maximum depth of contigs was 13.3 and 3,253, respectively, that is almost twice smaller than that obtained with TGICL. Of the 113,625 TCs, 38,730 (34.1%) contained two ESTs, 15,189 (13.4%) contained three ESTs, 8,686 (7.6%) contained four ESTs and 51,020 (44.9%) contained more than four ESTs. Overall the 113,625 TCs contained 1,511,639 sequences. The average GC content of this unigene sets was 41.9%. A total of 33.5% of the reads that did not clustered with Sanger reads using either MIRA or TGICL assembler were identical.
Reciprocal best Blast hits (RBHs) were searched between unigene elements constructed by MIRA and TGICL, PartiGene and TGICL, as well as between PartiGene and MIRA. In total 32,459 sequences were identified as RBH between MIRA and TGICL (OakContigV1) unigene elements, which accounted for 27.8% and 14.6% of MIRA and OakContigV1 unigene elements, respectively. In terms of contigs, 27.9% of MIRA and 38.1% OakContigV1 contigs had RBHs, while in terms of singletons, 24.0% of MIRA and 4.0% of OakContigV1 singletons presented RBHs. This low percentage is due to the fact that MIRA classified most of the singletons as 'debris'. There were 17,933 RBHs between PartiGene and OakContigV1 unigene elements, which accounted for 43.8% and 8.05% of PartiGene and OakContigV1 unigene elements, respectively. In terms of contigs, 59.8% of PartiGene and 19.6% OakContigV1 contigs had RBHs, while in terms of singletons, 32.2% of PartiGene and 2.83% of OakContigV1 singletons had RBHs. There were 13,037 RBHs between PartiGene and MIRA unigene elements, which accounted for 31.8% and 11.2% of PartiGene and MIRA unigene elements, respectively. In terms of contigs, 51.4% of PartiGene and 11.4% of MIRA contigs had RBH, while in terms of singletons, 17.2% of PartiGene and 1.1% of MIRA singletons presented RBHs.
By the addition of 454-reads, the number of unigene elements was greatly increased from 40,944 based on the PartiGene assembly to 222,671 in OakContigV1. This is due to 139,443 454-singletons and 40,542 contigs that contain only 454-reads. In total, these 179,985 454-unigene elements accounted for 80.8% of the OakContigV1 sequences, comprising 22.6% of the 454-reads. It should also be pointed out that 46.8% (i.e. 10,073 Sanger reads) of the 21,504 PartiGene singletons also present in OakContigV1 were present as contig member of the TGICL assembly. In addition, mapping 454-reads onto the PartiGene assembly (using MIRA) showed that 852,986 (54.0%) reads were mapped, including 683,768 (43.3%) reads on contigs. The rest of the 454-reads did not find corresponding sequences within the PartiGene Sanger assembly. Because 77.4% of 454-reads were assembled with at least one Sanger read in OakContigV1, this simple mapping procedure resulted into a much lower rate of integration of 454-reads. All together, these results indicate the value of the combined assembly approach based on TGICL and the added value of 454-reads to assemble Sanger reads into contigs. When the assembly was carried out based on 454-reads only using the MIRA assembler, we found that 2698 (2.3%) decrease in the number of unigene elements, 60.7 bp (9.0%) decrease in the length of contigs. Sanger reads contributed more to the length of contigs than to the number of unigene elements.
Detection of unique peptide elements
Starting from 222,671 unigene elements in OakContigV1, FrameDP [19
] predicted peptides for 117,311 (52.7%) of them (additional file 5
: Figure S3), resulting in 132,406 predicted peptides. A single peptide was predicted for 104,172 (46.8%) elements of OakContigV1, while the rest produced multiple peptides. The maximum number of predicted peptides from one sequence of OakContigV1 was seven. When 116,826 unigene elements plus the 189,268 'debris' produced by MIRA were used for peptide prediction ('debris' were included here for comparative purpose with TGICL analysis), FrameDP predicted peptides for 176,324 (57.6%) elements. For 164,468 (53.7%) of them, there was only one peptide predicted by FrameDP. When peptide prediction was performed for the unigene elements produced by PartiGene, 31,798 (77.7%) presented at least one peptide. Only one peptide was predicted for 27,273 (66.6%) unigene elements. Therefore, unigene elements of OakContigV1 presented the largest portion of non-translated sequences (additional file 5
: Figure S3). Unigene elements from MIRA analysis also presented a large portion of non-translated sequence, due to 'debris' reads. When the 'debris' were excluded for peptide prediction, both MIRA and PartiGene displayed similar patterns of distribution of predicted peptides (data not shown). Only 41.3% singletons of OakContigV1 and 28.2% of 'debris' in MIRA, respectively, had at least one predicted peptide, while 67.6% of the singletons in PartiGene presented at least one predicted protein. Focusing on the contigs, 91.2%, 77.4% and 77.7% of PartiGene, OakContigV1 and MIRA elements, respectively, had at least one predicted peptide. Of 132,406 predicted peptides from OakContigV1, 91,148 (68.8%) had N-terminal or C-terminal peptide, while the rest (31.2%, i.e. 41,310 elements) was assumed to be full-length peptide with both start and stop codons identified.
BLASTClust, a part of BLAST package [20
], found 114,977 peptide clusters at 70% coverage and 75% similarity for the 132,406 OakContigV1 FrameDP-predicted peptides, which corresponded to 14.2% reduction in the total number of predicted peptides (Figure ). Even with the 100% coverage and 100% similarity, 1,651 peptides clustered into 719 clusters. Those peptide sequences in the same cluster showed complete identity. When we performed BLASTClust analysis for 189,171 FrameDP-predicted peptides from MIRA assembly plus 'debris', there were 2,188 clusters (6,339 elements), in which all of the cluster members showed the same peptide sequence. At the 70% coverage and 75% similarity, the rate of unique peptide was 67.1% (corresponding to 32.9% reduction) (Figure ), which was smaller than that found in OakContigV1 (85.8%). Because reduction rate was higher using MIRA, this analysis suggests that MIRA is more efficient to distinguish not only polymorphisms and substitutions but also splice variants in the assembly step. This partly explains the difference in the depth of contigs. Contigs by MIRA had an average depth of 13.3, while that of OakContigV1 was 22.4. The BLASTClust result for PartiGene unigene elements (Figure ) showed similar trend to that of OakContigV1.
BLASTClust clustering of peptides predicted from (A) PartiGene, (B) OakContigV1 and (C) MIRA unigene elements. Sixteen combinations of percentage of similarity (horizontal axis) and coverage (four lines) between two sequences were plotted.
All together, the comparison of the procedures that were tested to assemble Sanger and 454-reads resulted into the following conclusions: first, there was an added value (in terms of integration of 454-reads) to perform a combined analysis of 454 and Sanger reads compared to a simple mapping procedure of the 454 data onto a Sanger unigene set, second, a seeded assembly using TGICL was found to be more efficient than a direct assembly using MIRA because i/ MIRA excluded a great number of ESTs from the unigene set (so called "debris"), most (67.4%) corresponding to singletons in the TGICL assembly and ii/ TGICL produced less contigs (69.2 k vs. 113.6 k) with higher depth (22.4 vs. 13.3 reads on average) and longer length (705 vs. 671 bp on average).
Similarity searches were carried out using the hybrid assembly resulting from the TGICL pipeline (OakContigV1) that provides an approximate estimate of unique transcripts, because it discriminates alternative spliced transcripts. Out of 222,671 elements of OakContigV1, homology search against protein databases resulted into 32,810 (14.7%), 52,959 (23.8%) and 37,262 (16.7%) elements with at least one hit against SWISS-PROT [21
], RefSeq_protein [22
] and Pfam [23
] database, respectively at the e-value cut-off of 1e-5, while that against nucleotide databases resulted in 93,658 (42.1%) and 143,830 (64.6%) unigene elements with at least one hit against Refseq_RNA and eight TIGR gene indices [24
], respectively. The result of BlastN against eight gene indices showed that both the number of hits and aligned length of the high-scoring segment pair (HSP) were the greatest for sequences in VVGI (Vitis vinifera
) and least in SGI (Picea
sp.) (Figure ). This may partly reflects the phylogenetic position of Quercus
within the eurosids I. In total, 150,063 (67.4%) of OakContigV1 sequences had at least one hit in this homology search process, while the remaining sequences (72,608, i.e. 32.6%) were orphans, which may be considered as oak specific. However, caution should be made to consider orphan sequences as oak specific without experimental validation of such sequences in cDNAs. Gene ontology (GO) [25
] annotation assigned at least one GO term for 29,303 (13.2%) of OakContigV1 sequences. The average number of GO annotations per sequence was 5.08, while the maximum number of annotation per sequence was 46. The total number of GO terms was 4,960. When these terms were mapped onto plant specific GO slim terms, the number of term converged to 69 terms (Figure ). The most abundant GO slim terms were Transport, Nucleotide binding, Plastid, in terms of Biological process, Molecular function and Cellular component, respectively. Candidate genes of ecological or economic importance were found in OakContigV1 as illustrated for bud phenology (additional file 6
: Table S3), drought stress resistance with emphasis on cuticle formation (additional file 7
: Table S4) and phenylpropanoid biosynthesis (additional file 8
: Figure S4). Genes relating to cell wall formation were detected based on tBlastX searches against MAIZEWALL database [26
] (Additional file 9
: Table S5). These results demonstrate the value of the EST catalogue that was produced for future functional genomics studies in oaks.
Figure 6 Number of hits and high-scoring segment pair aligned length of BlastN (OakContigV1) against gene indices. The e-value cut-off was set at 1e-3. The gene indices abbreviations are as follows: AGI; Arabidopsis thaliana, HAGI; Helianthus annuus, NTGI; Nicotiana (more ...)
Figure 7 Gene ontology classification of OakContigV1 using GO slim terms of plants. GO terms were assigned by BlastX against SWISS_PROT database with e-value cut-off of 1e-5. GO slim terms are as follows for Biological process (A): A, Transport; B, Response to (more ...)
To further analyse the added value of Sanger reads in terms of annotation, we compared the annotation rate of 454 and Sanger unigene elements. From the 40,542 contigs and 139,444 singletons containing 454-reads only, 5,404 (13.3%) contigs and 33,047 (23.7%) singletons did not show a single Blast hit, whereas, from the 28,612 contigs and 14,073 singletons containing at least one Sanger read, these numbers drop down to 391 (1.37%) contigs and 1,351 (9.60%) singletons. This result clearly indicates the value of Sanger reads for functional annotation. Therefore it can be concluded that Sanger reads improve not only the assembly but also the annotation of large dataset produced by next generation technology. The fact that the lower annotation (no blast hit) rate of unigene elements containing 454-reads only may also suggest they contain higher rate of novel or artifactual transcripts. Tedersoo et al. [27
] indicated that singletons from 454-reads contained higher rate of artifactual reads. Further laboratory work and/or bioinformatic characterization may be needed for the validation of singletons in OakContigV1.
In silico mining of Simple Sequence repeats (SSRs) and Single Nucleotide Polymorphisms (SNPs)
Using mreps [28
], we found 52,834 SSRs (microsatellites) with minimum repeat of five, four, three, three and three for di-, tri-, tetra-, penta- and hexa-SSRs, respectively, in 38,653 unigene elements of OakcontigV1. Specific information for each SSR included the unigene element ID and the annotation, the repeat motif, its length and position (Additional file 10
: Table S6, also available through the Quercus portal https://w3.pierroton.inra.fr:8443/QuercusPortal/Home.jsf
). Dinucleotide as well as trinucleotide motifs were frequent, summing up 72.9% of the total number of microsatellites (Table ). Among dinucleotide and trinucleotide repeats, AG and AAG motifs, respectively, were the most frequent. Only 40 CG repeats were found. Tetranucleotide (10.5%), pentanucleotide (6.8%) and Hexanucleotide (9.9%) repeats were of low abundance. The frequency of microsatellites was 23.7% considering multiple occurrence in a same unigene element, which was close to that calculated by Durand et al. [29
] for 28,024 Sanger unigene elements in oak (18.6%). When we screened microsatellites within eight TIGR gene indices [24
] used in the similarity search (see Method section) with the same method (ie
. mreps), the most frequent motif was tri-SSRs (Additional file 11
: Figure S5), which confirmed the general trend in SSR frequency for plant ESTs [30
]. It should be noted, however, that definition of microsatellite and detection algorithm have great impact on number of detected microsatellite in silico [31
]. When the distribution of SSR motif was visualized by SOM (Self Organizing Map), OakContigV1 located near PPLGI (Populus
) (Figure ), which may again reflect the phylogenetic position of oaks in the eurosid I. When SSR locations (coding or non-coding) were estimated by combining results from ESTScan [32
] and mreps as in Durand et al. [29
], the location for 38,649 (73.2%) SSRs was estimated and the same trends were found (Additional file 12
: Figure S6). In brief, tri-SSRs were the most frequently found in coding regions (33.4% of the total SSRs with location estimation), while di-SSRs were frequent (27.2%) in non-coding regions. Because of functional constraints of peptides, tri-SSRs with no frame shift mutations are preferable for coding regions [33
]. As discussed in [27
], di-SSRs in non-coding regions were more frequently found in 5' UTRs of plant transcripts [34
], suggesting that they may be involved in gene expression regulation.
In silico mining of microsatellites within OakContigV1
Figure 8 Self organizing map for microsatellite motif distribution between eight gene indices and OakContigV1. The gene indices abbreviations are as follows: AGI; Arabidopsis thaliana, HAGI; Helianthus annuus, NTGI; Nicotiana tabacum, MTGI; Medicago truncatula (more ...)
SNP detection was carried out on a subset of the 69,154 contigs. We first took into account the presence of duplicated reads in order to avoid false SNP detection [35
], i.e. a single representative was kept for the analysis. Then, putative SNPs were screened for contigs with a coverage depth of more than six sequences. If the less frequent allele count was more than two and 100% identical for four bases before and after the polymorphic site, we considered this site as a putative SNP. The putative SNPs were identified among 13,334 (19.2%) contigs, resulting in 36,411 sites and an average of 1 SNP every 471 bp in contigs with putative SNPs. Average and maximum number of SNPs detected in a contig was 2.7 and 19, respectively. Transition type SNPs (A/G and T/C) were relatively frequent and amounted to 67.18% (Table ). Within FrameDP-predicted peptides, there were 48,247 SNPs, which resulted in 27,762 (57.54%) SNPs in coding regions, including 17,620 (36.52%) and 10,140 (21.02%), synonymous and non-synonymous SNPs, respectively. SNP density in oak was lower compared to that found in Eucalyptus grandis
transcriptomes (1 SNP every 192 bp) based on 21 individuals using GS Reference Mapper (454 Life Science, Branford, CT, USA) [37
]. If we apply the same criteria to calculate SNP frequency to that of the Eucalyptus study (SNP called within contig length >200 bp and contig depth >10), the frequency remained identical (1 SNP every 471 bp), though we used more than 200 individuals for the sequencing step. In a de novo
assembly of a coral larval transcriptome with 454 GS-FLX pyrosequencing [38
], SNPs were screened by QualitySNP program and 33,433 SNPs were identified resulting in 1 SNP per 207 bp. The oak SNP frequency was still lower, probably due to more stringent criteria used for SNP detection. Using information from the predicted peptides by FrameDP (only one peptide predicted for each unigene element to avoid chimeric elements) and clustering by BLASTClust (at 70% coverage and 75% similarity), a set of 20,826 SNPs, including 16,196 and 4,630 potential coding and non-coding SNPs, respectively, was selected. We also found 59 SNPs relating to chloroplast (45) and mitochondrial (14) sequences. After these SNPs were eliminated, 20,810 genomic SNPs were retained for future genetic study. This SNP data set has also been made available for downloading at the Quercus portal https://w3.pierroton.inra.fr:8443/QuercusPortal/Home.jsf
In silico mining of SNPs within OakContigV1
Gene diversity was calculated for 308 and 1,770 SNP sites within Q. robur
and Q. petraea
, respectively (for criteria to select SNP sites for gene diversity calculation, see Methods section). The averages were 0.3336 and 0.3340 for Q. robur
and Q. petraea
, respectively. These values were comparable to those calculated from marker-based analysis for Cryptomeria japonica
= 0.322) [39
] and Eucalyptus grandis
and E. smithii
= 0.357) [40
Detection of orthologous and paralogous gene pairs between oak and the eudicotyledons sequenced reference genomes
Recently, Salse et al. [41
] published an original and robust method for the identification of orthologous regions between plant genomes as well as for the detection of duplications within genomes based on integrative sequence alignment criteria combined with a statistical validation. This approach was applied to identify 7 paleo-duplications in monocots and eudicots and to propose a common ancestor with 5 and 7 chromosomes for the monocots and eudicots respectively [42
]. In the current study, we used the 31,798 unigene set resulting from the PartiGene assembly and FrameDP analysis to integrate the oak transcriptome information into previous paleo-genomics analysis in order to unravel the oak evolutionary paleo-history.
Using the alignment parameters and statistic tests previously described by Salse et al. [42
], we analysed the orthologous relationships between oak, Arabidopsis (33,198 gene models), poplar (30,260 gene models), grape (21,189 gene models) and soybean (46,194 gene models) genomes. Based on the 31,798 oak unigene elements, we identified 4,574 orthologous gene pairs between oak and Arabidopsis
(477 orthologs), poplar (658 orthologs), grape (1,825 orthologs) and soybean (1,614 orthologs) genomes. The Ks distribution analysis (Figure ) performed between the 4,574 orthologous gene pairs establishes that oak is most closely related at the sequence level to grape (brown curve and arrow) than any other eudicot genome included in the analysis. We then produced a heterologous oak gene map based on the precise identification of oak orthologs on the 19 grape chromosomes (Figure ). This 1,825 robust orthologs identified between oak and grape can be considered as a valuable source of COS (Comparative Orthologous Sequences) markers for further comparative genomics and genetics analysis [43
Figure 9 Oak genome orthologous and paralogous relationships. A. The distribution of Ks distance (scaled in MYA) values observed for the orthologous gene pairs identified between oak and Arabidopsis (light blue curve), poplar (red curve), grape (brown curve) and (more ...)
We applied the most robust and direct approach allowing the characterization of genome duplications that consists of aligning the available unigene set (31,798 elements) on itself using stringent alignment criteria and statistical validation described in Salse et al. [43
]. The Ks distribution analysis (Figure ) obtained for 1,526 (43%) of the 3,520 paralogous gene pairs available (black bars) when compared to that obtained for Arabidopsis
(1,646 paralogs, light blue curve), poplar (4,164 paralogs, red curve), grape (542 paralogs, brown curve), and soybean (9,532 paralogs, blue curve) genomes clearly established that the actual oak genome went through at least two rounds or series of whole genome duplications (grey boxes on Figure ), such as ancestral (referenced as γ in the literature based on grape distribution peak in brown) shared by the eudicots and more recent (referenced as α/β based on the Arabidopsis
distribution peak in light blue or p based on soybean and poplar distribution peaks respectively in dark blue and red).