In the four datasets, sequences that were assigned to known cellular organisms varied from 9% to 18% (Table ). The vast majority of the assigned sequences were to Eukaryota and to Bacteria. Few sequences were assigned to the NCBI Taxonomy categories: Archaea, Viroids, Other and Unclassified. Only among the D. pulicaria sequences were hits (a total of 4) found to viruses. However, the low bit scores suggest that these may have other origins. As the scaffolds of D. pulex included in this study had been presorted to include only bacteria, there might have been more hits to taxa other than Bacteria and Eukaryota.
Number of sequences assigned and unassigned in the MEGAN analysis.
The numbers of bacterial genera (excluding the Firmicutes) with at least two reads assigned were 90, 123, 37 and 51 for the D. pulex, D. pulicaria, D. magna GS 20 and D. magna GS FLX datasets, respectively. The lower number of genera revealed by the D. magna datasets corresponds with the smaller size of these datasets (Table , Table ). This large number of genera indicates a rich community of bacteria in and on Daphnia. In all datasets the majority of the sequences were assigned to the Gamma- and Betaproteobacteria (Fig. ), which together accounted for more than 87% of the sequences assigned to bacteria. Outside the Proteobacteria, the Bacteroidetes and to a lesser degree to the Actinobacteria were found, the later however, mainly in the D. pulicaria dataset. Except the Actinobacteria, all taxa with substantial number of sequences assigned to were found in datasets from all three Daphnia species.
Summary of the four datasets included in this analysis.
Figure 2 The comparative taxonomic tree of the bacterial orders found in the three Daphnia datasets. The data of the two D. magna datasets were combined for this figure. Only bacterial orders, with at least 2 sequences assigned are included. The Firmicutes were (more ...)
Assignment of sequences to the bacteria, without Firmicutes and Cyanobacteria
The majority of the assigned sequences fall on two phyla, the Bacteroidetes and the Proteobacteria. Among the Bacteroidetes, most sequences were assigned to the Flavobateriales (between 187 to 463 sequences per sets, or 1.3 – 7.7% of the sequences) and a very large proportion of those to the genus Flavobacterium
(Fig. ). Within this genus, no single species stuck out as giving a better match than other species. Flavobacteria
are a group of opportunistic pathogens (e.g. salmon), commensals (e.g. in infusoria, cnidaria) [27
] and intracellular symbionts of insects [28
]. They are widely distributed in freshwater habitats, but also occur in association with terrestrial hosts. Some members of Flavobacteria
are known to play a significant role in the degradation of proteins, polysaccharides, and diatom debris in natural environments [31
]. Cultured representatives of Flavobacteria with ability to degrade various biopolymers such as cellulose, chitin and pectin were described [33
]. The commonness in all datasets here indicates that they may indeed be symbionts of Daphnia
. One may speculate that Flavobacterium
may play a role in food digestion in Daphnia
, which mainly feed on unicellular planktonic algae [34
]. This hypothesis has to be tested with a targeted approach.
Taxonomic diversity of the three Daphnia datasets within the Bacteroidetes/Chlorobi group. For more explanation see legend to Fig. 2.
Another genus of the Bacteroidetes, which was consistently found in all datasets is Cytophaga (Fig. ) These are gliding bacteria found in freshwater and marine habitats, in soil and in decomposing organic matter. However, hits to this genus were never frequent (between 10 and 25 hits).
The phylum Proteobacteria attracted 98, 94, 84 and 88% of the sequences assigned to bacteria in the D. pulex, D. pulicaria, D. magna GS 20 and D. magna GS FLX datasets, respectively. Table shows the distribution of all Proteobacteria genera for which at least one dataset attracted more than 1% of the sequences assigned to Bacteria.
Taxa within the Proteobacteria, which attracted at least 1% of the sequences within at least one of the four datasets.
The Alphaproteobacteria attracted a lager number of hits (3.9 to 8% of sequences), with the genus Rhodobacter being the most common in all three Daphnia species (0.4 to 2.8% of reads) (Fig. ). Other genera of the Alphaproteobacteria were only found in the D. pulex or the D. pulicaria datasets (Fig. ). Alphaproteobacteria are commonly found in freshwater environments, including sewage. They are known for a wide range of metabolic capabilities. Rhodobacter were isolated from sea and freshwater.
Taxonomic diversity of the three Daphnia datasets within Alphaproteobacteria. For more explanation see legend to Fig. 2.
The majority of the sequences assigned to the Proteobacteria (overall about 50% of sequences) where assigned to the Burkholderiales within the Betaproteobacteria (Fig. , Table ). Within the Burkholderiales, one family, the Comamonadaceae accounted for most of these hits (Fig. ). The Comamonadaceae is a family of gram-negative aerobic bacteria, encompassing the acidovorans rRNA complex. Some species are pathogenic for plants. Within this family four genera (Acidovorax, Rhodoverax, Polaromonas and Verminephrobacter) showed up repeatedly and in high numbers in all datasets (Table , Fig. ). The genera Acidovorax and Polaromonas were particularly common. A further genus, Delftia was only common in the D. pulex and D. pulicaria sequences (Fig. ).
Taxonomic diversity of the three Daphnia datasets within Betaproteobacteria. For more explanation see legend to Fig. 2.
A few other genera within the Betaproteobacteria attracted relatively high numbers of sequences across all or most of the datasets: Chromobacterium, Methylibium, Bordetella, Burkholderia and Cupriavidus (Table , Fig. ). Of those Methylibium petroleiphilum was highly represented. However, a closer inspection of the sequence alignments indicates that the species in our datasets is not exactly this, but a related species.
Four genera within the Gammaproteobacteria attracted larger numbers of sequences, but in contrast to the genera in the other classes of the Proteobacteria, here the distribution was not even across the datasets (Table , Fig. ).
Taxonomic diversity of the three Daphnia datasets within Gammaproteobacteria. For more explanation see legend to Fig. 2.
Hits to species of the genus Aeromonas were found in large number in the D. pulicaria dataset, but hardly in the other sets (Table , Fig. ). Hits were mainly to A. hydrophila and A. salmonicida, but similarities were below 100%. Both can live under aerobic or anaerobic conditions and are found in water. A. hydrophila is an opportunistic pathogen of humans, A. salmonicida causes the fish disease, furunculosis.
The single most often assigned genus in the entire analysis was Pseudomonas
in the D. pulex
dataset (10,994 assigned reads, 43.3%). These hits were mainly to the species P. fluorescens
(7,067 reads), and in particularly to the strain PfO-1. Similar, but not as extreme was the presence of the same bacterium in the D. pulicaria
sequences (Table , Fig. ). The P. fluorescens
PfO-1 genome project was run in the same genome center (The DOE Joint Genome Institute (JGI, http://www.jgi.doe.gov/
) where the D. pulex
and the D. pulicaria
sequences were obtained and it seemed possible, that these hits reflect a contamination in the D. pulex
scaffolds, rather than a symbiont of D. pulex
. However, inspection of bit scores and sequence identity values in the BLASTN outputs indicated that the Daphnia
symbiont is clearly not P. fluorescens
PfO-1. The P. fluorescens
group includes diverse bacteria that are found in soil, but also in aquatic environments.
A further contamination candidate is the Gammmaproteobacterium Serratia, to which we found 2,184 matched sequences in the D. pulex genome. However, it is hardly seen among the D. pulicaria sequences, and not seen at all among the D. magna sequences (Table , Fig. ). The species to which most sequences were assigned is Serratia proteamaculans 568, whose genome was sequenced as well by the DOE Joint Genome Institute. Also here, the inspection of the BLASTN results indicated high similarity, but few perfect matches, excluding contamination at the JGI. Serratia are often associated with the human gut, but are not pathogenic.
Another genus with many hits to the D. pulex and the D. pulicaria sequences, but not to the D. magna sequences (Table ), is the already mentioned Betaproteobacterium Delftia (Fig. ). The DOE Joint Genome Institute sequenced Delftia acidovorans strain SPH-1, which is the strain most of the sequences were assigned to. However, inspection of the BLASTN results again showed that the Daphnia symbiont is clearly not D. acidovorans strain SPH-1.
About 200 sequences matched Deltaproteobacteria (Fig. ). Within this order various taxa matched sequences from the datasets. However, there was no consistent picture across the three Daphnia species (Fig. ).
Taxonomic diversity of the three Daphnia datasets within Delta- and Epsilonproteobacteria. For more explanation see legend to Fig. 2.
Searching for Cyanobacteria and plastid sequences
Following the suggestion of Chang and Jenkins [18
] that Daphnia
may carry symbiontic plastids or cyanobacteria with them, we looked more closely into these two groups. The D. magna
sequences revealed no hit to any Cyanobacteria taxon. Of the D. pulex
sequences 44 (= 0.17% of the assigned sequences) were assigned to the Nostocales, a taxon of the Cyanobacteria. 19 (= 0.074%) of these hits were to the genus Nostoc
. In the D. pulicaria
we found 22 sequences assigned to the Cyanobacteria, half of which were to the Nostocales (Fig. ).
Taxonomic diversity of the three Daphnia datasets within Cyanobacteria and Actinobacteria. For more explanation see legend to Fig. 2.
The D. pulicaria dataset revealed 23 sequences assigned to plastids. One of them was a short sequence (100 bps) to the chloroplasts of the green algae Chlamydomonas, the other to the chloroplasts of flowering plants. Hits to the later came mostly from one scaffold and had high bit scores (> 500) and similarities of more than 90%. The D. pulex sequences revealed no hits to plastids, but this is not surprising, as the dataset had been sorted out to contain predominately prokaryote sequences. The D. magna GS 20 dataset did not reveal any hits to plastids. The D. magna GS FLX sequences contained a short sequence (104 bps) matched to a plastid, the chloroplast of the green algae Stigeoclonium helveticum.
The presence of plastid sequences in Daphnia
shotgun datasets has however, to be looked at with care, as unicellular green algae are the main food of Daphnia
, both in the field and in the laboratory [34
]. However, the few sequences assigned to plastids here seem not to correspond closely with the algae, which were used to feed the Daphnia
in the cultures, before they were used for DNA extraction. The D. magna
and the D. pulex
clone had been kept on an exclusive diet of the green algae Scenedesmus
sp. and the D. pulicaria
clone on a diet of the green algae Ankistrodesmus falcatus
All in all we consider this as rather weak evidence for plastid symbionts in these Daphnia
samples. The original finding was done in D. obtusa
], which was not included in our study. The authors had observed variation in the type and frequency of plastid occurrence in this species, so it may not be surprising that things are different in other species. Furthermore, the long maintenance of the Daphnia
clones in laboratory cultures may have contributed to a loss of plastids. Therefore, the absence of evidence from our metagenomics analysis is certainly not evidence for the absence of possible plastid symbionts in Daphnia
Searching for 16S rDNA sequences
All four datasets were also analyzed with a more conventional approach, which was to identify contigs/scaffolds similar to known 16S rDNA sequences. We compared our data with a collection of 471,792 16S rDNA sequences collected by the Ribosomal Database Project (RDP release 9 update 57) [36
]. In total, 27 16S rDNA fragments were identified in the D. pulicaria
dataset, 13 in the D. pulex
, 14 in the D. magna
GS 20, and 11 in the D. magna
GS FLX. Of those, 17, 11, 9, and 10 bacterial species could be inferred in the D. pulicaria
, D. pulex
, D. magna
GS 20, and D. magna
GS FLX dataset, respectively. Other partial 16S rDNA sequences were identical or almost identical to regions conserved across species, thus could not be used to infer the species. In Table we listed close to full length 16S rDNA sequences found in the four datasets. The nucleotide sequence identity between these sequences and their corresponding best matches ranged from 91% to 100%. Most best matched 16S rDNAs to our sequences were from uncultured bacteria. Bacterial species that could be inferred using 97% sequence identity as the cutoff value included Pseudomonas
sp., E. coli
and the already discussed (see above) Flavobacterium
sp. (Table ). In both D. pulex
and D. pulicaria
datasets, sequences highly similar to 16S rDNA of unclassified aquatic bacterium R1-B19 were found, an undescribed beta proteobacterium (Table ).
16S rDNA sequences close to full length identified in the four datasets.
The 16S rDNA sequences identified only a small subset of the species/genus found in our main analysis based on comparison to NCBI-nt database. One likely explanation of this discrepancy is the low sequencing coverage within the 16S rDNA regions in the shotgun datasets. Another explanation could be that some of the earlier predictions were false positives. However, MEGAN associates a sequence to the lowest common ancestor of the set of taxa defined by all matches above defined thresholds. The amount of false predictions is predicted to be low since the algorithm makes higher amount of unspecific assignments to higher taxonomy levels [20
]. Certainly when taxa were inferred regardless if the matched sequence was a suitable phylogenetic marker or not, it could not be excluded that some of the predictions were results of horizontal gene transfer events. However, if this were the case, MEGAN would assign the hit to the least common ancestor of the species, which were involved in horizontal gene transfer, unless neither these species nor related species are in the NCBI database. It was predicted that computing taxonomic content based on sequence comparison to NCBI-nt database will show better resolution at all levels of the taxonomy than an analysis based on a small set of phylogenetic markers or on 16S rDNA sequences alone [20
]. Our results are consistent with this prediction.
Despite the under-prediction and the differences between the NCBI-nt and the 16S rDNA databases, quantitatively, the two approaches correlated fairly well at higher taxonomic level (Fig. ).
Figure 9 Correlation of taxonomic content computed by comparison to NCBI-nt and comparison to 16S rDNA database. The number of sequences assigned to the following taxonomic nodes were plotted: Bacteria, Proteobacteria, Bacteroidetes, Gammaproteobacteria, Deltaproteobacteria, (more ...)
Searching for identical and similar sequences common in four datasets
Although sequences in all datasets were assigned to similar bacterial taxa, it is not clear how similar the sequences are across datasets. To identify common sequences, we compared the D. magna GS 20 sequences with sequences from D. magna GS FLX, D. pulex, and D. pulicaria using BLASTN. Identical or nearly identical sequences were identified when a stretch longer than 80% of a query sequence can be aligned with over 98% nucleotide sequence identity to a hit sequence. With this criterion five D. magna GS 20 contigs (corresponding to six D. pulex scaffolds and 12 D. pulicaria reads) were identified. Hits identical to these sequences were all found in complete genome sequences of Escherichia coli W3110 (AP009048.1) and E. coli K12 MG1655 (U00096.2), which suggests that commensal E. coli strains carried by the three Daphnia species are highly similar.
With a less stringent criterion (a stretch longer than 50% of a query sequence can be aligned with over 90% nucleotide sequence identity to a hit sequence), similar sequences to about 80 GS 20 contig sequences were also identified across the datasets. These sequences mainly fall into taxa within the Proteobacteria, with a few sequences assigned to Flavobacterium.
The small number of similar sequences shared across the datatsets suggested the bacterial community carried by the three Daphnia clones from which our datasets originated might be diverse at species and strain level, despite very high homogeneousness observed at higher taxonomy nodes. It should be noted however, that our datasets do not originate directly from field samples, but from three clones, which had been kept in three different laboratories for several generations before the DNA was isolated. This may possibly influence our results in two ways. First, we cannot truly make statements about three Daphnia species, but only about three clones, each coming from a different Daphnia species. Including more clones, might reveal more bacterial symbionts. Second, while culturing these clones in the laboratory, the symbiont community may have changed both qualitatively and quantitatively. New bacterial species may have arrived with food or culture conditions, while other bacteria may have been lost due to the inappropriateness of the laboratory conditions for their culture. For the current analysis, no attempts have been undertaken to vary the culture conditions for any of the three clones and the bacteria associated with the food alga have not been analyzed.
Repeatability of the metagenomics approach
For D. magna we obtained two shotgun datasets, with sequences produced with two different sequencing platforms, the pyrosequencers GS 20 and GS FLX. Figure shows the number of sequences assigned to all prokaryote genera (excluding the Firmicutes) in the two datasets. The two datasets gave very congruent results, with a correlation coefficient of r = 0.98 (P < 0.001, n = 55). The plot shows clearly that stochastic differences occur for genera with very few hits. Expectedly, below 10 sequences assigned to a genus, the datasets lead to quite divergent result.
Comparison of the number of assigned sequences (log10(x+1)) to prokaryote genera (excluding Firmicutes) of the combined two D. magna datasets.
Using contigs instead of reads
For the D. pulicaria dataset, both contigs and singleton raw reads were included in our analysis. For the other three datasets, we used only sequences, which had previously been assembled to contigs or scaffolds. This reduced the number of sequences and thus the number of BLASTN searches considerably. Using large numbers of raw reads would have been beyond our computing power and the abilities of the MEGAN software within a reasonable time period. Using contigs and scaffolds influences the results in various ways. First, it strongly reduces redundancy in the dataset and therefore makes the analysis much quicker. Second, it compromises somewhat the usefulness of the number of assigned sequences as a measure for the abundance for the different taxa. The number of assigned sequences is still a relative measure for the frequency of a given taxa, but the larger the real number of hits would have been, the more strongly the value is reduced. Third, rare members of the symbiont community are likely to remain undetected, because the few reads sequenced for rare species, were unlikely to be assembled in contigs. Thus, our estimates of the number of taxa detected are likely to underestimate the true number of taxa in the community. This conclusion is also supported by the observation that the D. pulicaria dataset contained the highest number of taxa identified.