In total, we found 136 866 nuclear-encoded sequences, five pseudo-genes (FJ854546, FJ854545, D14632, AF310844, AJ404858, not included in PR2) and 34 sequences we could only assign as putative rRNA sequences (HM538255, GU385678, AB275106, AJ628837, AY180011, CP000499, CP000499, AY256215, EU402432, AB017015, GQ330639, GU820811, JF488788, AF239231, DQ423737, DQ104596, AY835700, DQ423728, EU545797, GU072272, GU072526, GQ247249, HM174255, DQ104594, EU174762, FN598473, EU726200, EF695080, GQ483783, GQ462590, EU173354, EF567390, EF695215, HQ871039, not included in PR2). Manual analysis of some of these revealed artefactual sequence, either internal or at the 5′ or 3′ end. Among nuclear-encoded sequences, we detected 1756 putative chimeric sequences, using KeyDNAtools and/or manual inspection (listed on the website). For example, sequence EF023694.1.1975_U is a chimera between parent sequences of Opisthokonta, Amoebozoa and Rhizaria at positions 179-471, 623-1264 and 1536-1925, respectively. Other ‘18S’ sequences are nucleomorphs (262 sequences). In all, 9657 sequences have a chloroplastic origin, 33 051 are from mitochondria, six from hydrogenosomes (AJ237907, AJ237908, AJ871215, AJ871217, AJ871267, Y16670) and 26 from apicoplasts (U87145, AB471801, AB471802, AB471803, AB471804, AB471805, AB471806, AB471807, AB471808, AB471809, AB471810, AB471811, AB471812, AB649417, AB649418, AB649419, AB649420, AB649421, AB649422, AB649423, AB649424, HQ110105, JQ437257, JQ437258, JQ437259, U28056).
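As an illustration of how the chimera example above could be recorded, the following minimal sketch stores each parent segment of EF023694.1.1975_U and checks that the segments are in range and do not overlap. It is not part of KeyDNAtools; the `Segment` class and `check_segments` function are hypothetical, coordinates are assumed to be 1-based and inclusive, and the sequence length of 1975 is assumed from the ".1.1975" suffix of the identifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One region of a chimeric sequence attributed to a parent lineage."""
    start: int   # assumed 1-based, inclusive
    end: int     # assumed inclusive
    parent: str

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def check_segments(segments, seq_length):
    """Return segments sorted by start; raise if out of range or overlapping."""
    ordered = sorted(segments, key=lambda s: s.start)
    previous_end = 0
    for seg in ordered:
        if not (1 <= seg.start <= seg.end <= seq_length):
            raise ValueError(f"segment out of range: {seg}")
        if seg.start <= previous_end:
            raise ValueError(f"segment overlaps the previous one: {seg}")
        previous_end = seg.end
    return ordered


# Positions reported in the text for EF023694.1.1975_U.
SEQ_LENGTH = 1975  # assumption: taken from the identifier suffix
segments = [
    Segment(179, 471, "Opisthokonta"),
    Segment(623, 1264, "Amoebozoa"),
    Segment(1536, 1925, "Rhizaria"),
]

for seg in check_segments(segments, SEQ_LENGTH):
    print(seg.parent, seg.start, seg.end, seg.length)
```

Under these assumptions, the three segments span 293, 642 and 390 positions, and the regions between them are not attributed to any of the three parents.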
Within nuclear-encoded sequences, 54 data entries remained unassigned at the Super-Group level, meaning that they could not be assigned to any specific taxon group within the domain Eukaryota (Eukaryota_X). The Super-Group ‘Eukaryota_Mikro’ was created for sequences HM563060, AF477623 and HM563061, for which no consensus has been reached on their affiliation, although Haplosporidiidae has been suggested (13). BLAST analyses conducted at NCBI against the non-redundant database, or at the DNA Data Bank of Japan (DDBJ) against all sequences, showed extremely weak sequence similarity with sequences of fungi. Our global similarity tool (Crunch_Assign) found no other sequence with ≥80% similarity along the entire sequence. These results led to the creation of this new Super-Group (rank 2). For unassigned nuclear-encoded sequences (Eukaryota_X), either no other similar sequence was found, or similar sequences were detected but had also been annotated by us as Eukaryota_X. A BLAST search on the NCBI non-redundant database (excluding environmental sequences) and at DDBJ (all) revealed that a large number of them probably contained undescribed introns. These sequences therefore probably require manual curation, and they again highlight the importance of intron identification in eukaryotic sequences.
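To make the ≥80% criterion concrete, here is a minimal sketch of a global identity check between two sequences that have already been globally aligned. It is not the Crunch_Assign implementation, whose internals are not described here; the function name `global_identity`, the gap character and the choice to count gap columns as mismatches are all assumptions made for illustration.

```python
def global_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over the full length of a pairwise global alignment.

    Both inputs must be rows of the same alignment (equal length).
    Columns where both rows are gaps are ignored; a gap facing a
    residue counts as a mismatch (an assumption of this sketch).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [
        (a, b) for a, b in zip(aligned_a, aligned_b)
        if not (a == gap and b == gap)
    ]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


THRESHOLD = 80.0  # similarity threshold quoted in the text

# Toy aligned pair: 8 identical columns out of 10.
identity = global_identity("ACGTACGTAC", "ACGTACGTTT")
print(identity, identity >= THRESHOLD)
```

Measuring identity over the whole alignment, rather than over the best local hit as BLAST does, is what distinguishes a global criterion of this kind from the BLAST searches mentioned above.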
Number of nuclear-encoded sequences in PR2 as annotated at the Super-Group taxonomic level
For lower taxonomic ranks, there were primarily two types of cases resulting in a failure to assign a taxonomic identity:
- No agreement between experts to resolve at a given rank. For example, the genus (rank 7) is assigned, the order (rank 5) is assigned, but a family (rank 6) has not yet been described, or this rank is in fact polyphyletic, with no proper descriptions of the different families.
- A given sequence is similar at the family level with several sequences from different families; however, they agree at the order level.
In such cases, this sequence was assigned as … |Order|Order_X|Genus|Genus + species. If a genus was not described (i.e. uncultured), the taxonomy becomes: … |Order|Order_X|Order_XX|Order_XX + sp.
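The placeholder convention described above can be sketched as a small function that fills undetermined ranks from the last assigned rank, adding one `X` per consecutive missing rank. This is an illustrative reconstruction rather than the pipeline used to build PR2: the function name `fill_ranks` is hypothetical, `None` is assumed to mark an undetermined rank, and the `_sp.` suffix is one possible rendering of "Order_XX + sp.".

```python
def fill_ranks(higher_ranks, species=None):
    """Fill undetermined ranks with _X placeholders.

    higher_ranks: names from a higher rank down to genus, with None
                  for each rank that could not be assigned.
    species:      species name, or None if the genus is undescribed.
    """
    filled = []
    anchor = None  # last explicitly assigned name
    depth = 0      # consecutive undetermined ranks below the anchor
    for name in higher_ranks:
        if name is None:
            if anchor is None:
                raise ValueError("lineage must start with an assigned rank")
            depth += 1
            filled.append(f"{anchor}_{'X' * depth}")
        else:
            anchor, depth = name, 0
            filled.append(name)
    if species is None:
        # Assumed rendering of "Order_XX + sp."
        species = f"{filled[-1]}_sp."
    filled.append(species)
    return filled


# Family undetermined, genus and species known.
print("|".join(fill_ranks(["Order", None, "Genus"], "Genus_species")))

# Family and genus both undetermined (e.g. uncultured).
print("|".join(fill_ranks(["Order", None, None])))
```

Here `Order`, `Genus` and `Genus_species` stand in for real taxon names, as they do in the text.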
More than 74 000 sequences (54% of the total number of sequences in the PR2 database) belong to Opisthokonta. Alveolata and Archaeplastida are the next most abundant (15 and 12%, respectively). Stramenopiles and Rhizaria represent 7.2 and 5.6%, respectively. Other Super-Groups represent less than 2.2%. Only 29.4% of sequences are complete or nearly complete. In total, 63.7% of sequences include the V4 region, and only 12.1% and 11.7% include the V9 region as recognized by the Biomarks and Wamps primers (see the legend of ), respectively. In Apusozoa, Hacrobia, Excavata and Opisthokonta, <10% of sequences include the V9 region. The V9 region is better represented in Amoebozoa and Archaeplastida (34% and 25%, respectively, using the Biomarks primers).
Figure 1. Total number of SSU rDNA gene sequences in the PR2 database for each main eukaryotic lineage (all sequences = grey + black, complete or nearly complete sequences in light-grey). Note that nucleomorphs were extracted from Archaeplastida.