“Community” is commonly defined in several ways, including “the species that occur together in space and time” [30] and “an association of interacting populations” [31]. “Assemblage” is probably the most appropriate term for viral groups, and most uses of “community” in the literature, by ourselves and others, are not correct. See [32] for a disambiguation of some important ecological terms.
General Characteristics of the Marine Viral Metagenomes
On average, >91% of the sequences were not significantly similar to those in the extant databases (A). The high percentage of unknowns is almost certainly explained in part by the shorter sequences (~100 bp on average) generated by pyrosequencing at 454 Life Sciences. Previous viral metagenomic studies that used Sanger sequencing (~650 bp fragments) found that >60% of the sequences were unknowns [33]. The Arctic Ocean sample had the highest percentage of known similarities (11%) to the SEED database, mostly because of the large number of prophage-like sequences (). Comparison of the marine viral sequences to the environmental database did not yield a significant number of new similarities compared to the SEED database (~2% to the environmental database), with the notable exception of the Sargasso Sea sample, where >9% of the similarities were to the environmental database, presumably because the major sources of sequences for the environmental database were the Sargasso Sea microbial metagenomes, originally collected in 2003 [17]. The overlap between the viral metagenome and the microbial metagenomes raises several important points. First, a significant number of viral sequences are retained on the larger-pore filters, as free viruses, as proviruses, or in cells undergoing a burst. The latter explanation was hypothesized by DeLong et al. [19], who observed a large number of viral similarities at one depth at the Hawaii Oceanic Time-series (HOT) station. Second, the microbial assemblages in the Sargasso Sea appear to be relatively stable over prolonged periods (~2 y). Finally, the small amount of sampling and sequencing represented by these two studies (~10¹² bp) is already constricting the unknown sequence space of the Sargasso Sea. With the continual decline in Sanger sequencing costs and the introduction of large-scale pyrosequencing, metagenomic approaches should be able to characterize global sequence diversity in a relatively short period of time.
Number of Similarities to Phage Genomes and Groups of Interest in the Four Metagenomes
Among the fraction of sequences with similarity to the SEED database, most of the “knowns” were similarities to bacterial sequences in the Arctic, British Columbia, and Gulf of Mexico samples (B). This can be accounted for by the following: (i) the larger number of microbial rather than viral genomes in the database, (ii) unidentified prophages within microbial genomes, (iii) the large amount of horizontal gene transfer between phages and their hosts, (iv) phages carrying full genes from their host, as observed in sequenced phage genomes [34], and (v) the overall larger size of bacterial genes relative to viral genes, statistically increasing the probability of sequencing and hitting them.
The sample from the Sargasso Sea was exceptional in that the majority of “known” sequences were most similar to three Prochlorococcus phage genomes () originally isolated from the same area of the ocean [34]. This finding suggests that just a few phage genomes from novel environments will greatly increase our understanding of viral diversity in these environments. The distribution of BLASTN similarities along the Prochlorococcus marinus P-SSP7 genome [34] is shown in A. There is almost complete coverage of the genome within the Sargasso Sea sample. In contrast, the similarly sized Roseobacteria SIO1 genome [36], which was isolated from near-shore waters in California, is only sparsely covered in the Sargasso Sea sample, but has higher coverage in the Gulf of Mexico and British Columbia samples. This supports the idea that certain phage groups are more prevalent in certain biogeographic regions. This general pattern was reinforced by the observation of a number of phage genomes and groups prevalent in different oceanic regions ().
Distribution of Similarities and Assembly Controls
The five most abundant putative viral-encoded enzymes () appear to be involved in scavenging host nucleotides (e.g., riboreductases) and supporting host metabolism through the infection cycle (e.g., carboxylyases and transferases). The viral fraction also contained psbA genes, which encode the D1 protein of photosystem II in the cyanobacteria. The majority of sequenced cyanophages carry this gene, and evidence is mounting that the cyanophages need the D1 protein for successful infection and replication [34]. The occurrence of psbA was lowest in the Arctic sample, probably reflecting a decrease in the host and cyanophage numbers in the colder environments.
The Most Abundant Enzyme-Coding Genes in the Four Oceanic Viral Metagenomes
Discovery of an Abundant Marine ssDNA Phage Group
The Sargasso Sea sample had a large number of sequences (6% of the total; ) with significant similarities to chp1-like Chlamydiamicrovirus (Microviridae family). These viruses are small ssDNA phages. Assemblies from these sequences resulted in the near-complete genomes of several marine Microviridae phages from the Sargasso Sea sequences (B). To our knowledge, this is the first report describing the presence of this phage group in the marine environment, which was previously overlooked because the amplification and cloning methods excluded ssDNA viruses. The only other report of ssDNA viruses in the marine environment was a Circovirus that infected diatoms [39]. However, the marine sequences in this study did not show any similarity to that virus. Sequences with significant similarity to the chp1-like phages were observed less frequently in the British Columbia (~10-fold less common than in SAR) and Gulf of Mexico samples (~100-fold less common than in SAR). No sequences from this group were found in the Arctic sample ( and ). Primers were designed against these genomes and appropriately sized DNA fragments were amplified from the Sargasso Sea sample (C). No amplicons were detected in the Gulf of Mexico or British Columbia samples, suggesting that they were present at numbers below the level of detection in this PCR or had a divergent sequence. A geographical constraint that limits the distribution of these viruses would be most consistent with these results. However, concerns about sample amplification and storage bias make it impossible to accurately assess the relative abundances of these viruses at this point.
Types of Phages in the Four Metagenomes
Every Phage Everywhere?
The distribution of similarities to the chp1-like Microphage, P. marinus SIO1, and others in the viral fraction suggests that viruses have restricted geographical distributions similar to those observed in micro- and macro-organisms [40]. This is in contrast to studies that have shown that identical phage genes are distributed throughout the biosphere and that phages from soils and sediments can replicate in marine microbial populations [3]. To determine whether all marine phages are spread everywhere or if there is a strong regionalization, three different approaches were used.
A new version of the Phage Proteomic Tree was constructed, and similarities from the samples were mapped onto this tree (). Eighty-four phage species were specific to one marine region, whereas 45 were common to all four. Of the remaining phage species, 102 were found in two or three oceanic regions. Because viruses do not have a single genetic locus conserved across all genomes, the phylogenetic parsimony of phages from each sample was compared to the Phage Proteomic Tree using PTP tests. The PTP tests showed that the distribution of phages in the marine samples is not random. First, marine phages are phylogenetically distinct from the available genomes, suggesting a “marine-ness” to the group as a whole (p < 0.0001; 10,000 randomizations). Second, there was a significant difference between phages from the different oceanic regions (p < 0.0001; 10,000 randomizations), supporting a geographical specificity for viruses despite the wide prevalence of some phage species.
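The tally of region-specific versus widespread phage species described above can be illustrated with a minimal sketch. It assumes presence/absence data keyed by species; the species names, the region labels used as keys, and the function name `classify_by_range` are placeholders for illustration and are not taken from the study.

```python
from collections import Counter

# Region labels are shorthand for the four sampled regions (an assumption
# about naming, not the study's data format).
REGIONS = ("Arctic", "BBC", "GOM", "SAR")


def classify_by_range(occurrences):
    """Count species found in one region, in all regions, or in between.

    `occurrences` maps a species name to the set of regions in which at
    least one similarity to that species was observed.
    """
    tally = Counter()
    for regions in occurrences.values():
        n = len(set(regions) & set(REGIONS))
        if n == 0:
            continue  # not observed in any sampled region
        if n == 1:
            tally["single_region"] += 1
        elif n == len(REGIONS):
            tally["all_regions"] += 1
        else:
            tally["several_regions"] += 1
    return dict(tally)


# Hypothetical input: four made-up species, not results from the paper.
example = {
    "phage_a": {"SAR"},
    "phage_b": {"Arctic", "BBC", "GOM", "SAR"},
    "phage_c": {"BBC", "GOM"},
    "phage_d": {"GOM", "BBC", "SAR"},
}
counts = classify_by_range(example)
print(counts)
```

Applied to the mapped similarities, the three counters would correspond to the 84, 45, and 102 species reported in the text.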
An Isolation By Distance (IBD) approach demonstrated a significant positive correlation between geographic distance (km) and genetic distance (as measured by ΦST) (Mantel test; Z = −78.9; r = 0.585; p < 0.017) (), indicating that the farther apart two sites are, the more the viral assemblages differ. The slope was very small, only 3.28 × 10⁻⁵ ΦST/km.
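For readers unfamiliar with the Mantel test, the following is a minimal sketch of a one-tailed permutation version based on the Pearson correlation between two distance matrices. It is not the authors' implementation, it does not compute the Z statistic quoted above, and both matrices are invented numbers chosen only to make the example run.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def upper_triangle(matrix, order):
    """Upper-triangle entries after relabelling rows and columns by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def mantel(geo, gen, permutations=999, seed=0):
    """Return (r, one-tailed p) for a positive association between matrices."""
    rng = random.Random(seed)
    identity = list(range(len(geo)))
    geo_vec = upper_triangle(geo, identity)
    r_obs = pearson(geo_vec, upper_triangle(gen, identity))
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)  # permute site labels of one matrix only
        if pearson(geo_vec, upper_triangle(gen, order)) >= r_obs:
            hits += 1
    return r_obs, hits / (permutations + 1)


# Hypothetical distances between four sites (km) and made-up PhiST values.
geo_km = [
    [0, 4000, 6000, 7000],
    [4000, 0, 3000, 5000],
    [6000, 3000, 0, 2500],
    [7000, 5000, 2500, 0],
]
phi_st = [
    [0.00, 0.15, 0.20, 0.26],
    [0.15, 0.00, 0.10, 0.17],
    [0.20, 0.10, 0.00, 0.09],
    [0.26, 0.17, 0.09, 0.00],
]
r_obs, p_value = mantel(geo_km, phi_st, permutations=999)
print(f"r = {r_obs:.3f}, one-tailed p = {p_value:.3f}")
```

Site labels are permuted in one matrix only, which preserves the dependence among distances that share a site. With four sites there are only 24 distinct relabellings, so a toy matrix of this size cannot yield a very small p-value.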
Relationship between Geographic and Genetic Distances of Marine Viral Assemblages
Considering that any two locations on Earth can be separated by a maximum of 20,000 km (half the circumference of the globe), by extrapolation, any two viral assemblages could have a phylogenetic diversity of at most 0.656 ΦST. Although these data suggest a limit to the distribution of viruses among marine environments (e.g., due to limited viral movement or geographical selective pressure) (ΦST >> 0), they also indicate that no two marine viral assemblages could be totally different (ΦST << 1). Rather, they would exhibit a relatively large phylogenetic overlap.
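The extrapolation is a single multiplication, shown here as a short check. The sketch assumes the regression line passes through the origin, which is what the figure of 0.656 in the text implies; the constant names are illustrative.

```python
# Slope of genetic distance against geographic distance, as reported in the text.
SLOPE_PHI_ST_PER_KM = 3.28e-5
# Half the circumference of the globe, as stated in the text.
MAX_DISTANCE_KM = 20_000

# Assumes a zero intercept (an assumption made for this illustration).
max_phi_st = SLOPE_PHI_ST_PER_KM * MAX_DISTANCE_KM
print(f"Maximum extrapolated PhiST: {max_phi_st:.3f}")
```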
Together, the PTP and IBD tests support the view that the marine virome is composed of specific viral groups. These viral assemblages are regionalized, although a large fraction is very widespread. It is possible that some viruses are distributed ubiquitously, but their relative contribution to overall assemblage structure differs between oceanic regions. If this were true, then cross-contigs (i.e., contigs made of sequences from different metagenomes) would reflect this composition.
In the computer model of cross-contig analysis, all four viral assemblages were considered at the same time. Assemblies were performed and cross-contigs were identified. A Monte Carlo simulation was used to explain the average cross-contig spectrum. A full description of the assemblies and Monte Carlo simulations is in Protocol S1. A number of genotypes (varied between 0% and 100%) were arbitrarily and randomly defined as shared between samples; at the same time, the occurrence of individuals in the viral assemblage was also varied (). As an illustration, imagine two assemblages sharing 100 viruses, but with the relative rank on a rank-abundance curve being shuffled for the top viruses in the assemblage (see Protocol S1). The best explanation of the observed cross-contigs is shown in and estimates that 35% of the most abundant genomes in any sample would have to be permuted in their relative abundance rank and that 100% of the viruses would have to be shared between samples. The intrasample controls showed that 85%–95% of the most abundant genomes were shared and 0%–0.5% were permuted (although 100% and 0% were expected, respectively). This discrepancy is probably due to limitations in the methodology used.
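The idea of two assemblages that share genotypes but differ in the ranks of their most abundant members can be illustrated with a toy simulation. This is not the model described in Protocol S1: the power-law rank-abundance curve, the reading of "permuted" as shuffling the top fraction of ranks, the use of shared sampled genotypes as a stand-in for cross-contigs, and every function name below are assumptions made for illustration.

```python
import random


def rank_abundance(n_genotypes, exponent=1.0):
    """Relative abundances by rank, following an assumed power law."""
    weights = [(rank + 1) ** -exponent for rank in range(n_genotypes)]
    total = sum(weights)
    return [w / total for w in weights]


def derive_assemblage(n_genotypes, shared_fraction, permuted_fraction, rng):
    """Genotype IDs of a second assemblage, listed in rank order.

    The first assemblage is genotypes 0..n-1 in rank order. A fraction of
    them is kept as shared and the rest are replaced by new IDs. The top
    `permuted_fraction` of ranks is then shuffled among itself.
    """
    ids = list(range(n_genotypes))
    n_unshared = n_genotypes - round(shared_fraction * n_genotypes)
    for pos in rng.sample(range(n_genotypes), n_unshared):
        ids[pos] = n_genotypes + pos  # an ID absent from the first assemblage
    n_top = round(permuted_fraction * n_genotypes)
    top = ids[:n_top]
    rng.shuffle(top)
    return top + ids[n_top:]


def shared_hits(n_genotypes, shared_fraction, permuted_fraction, n_reads, seed=0):
    """Number of genotypes sampled from both assemblages (cross-contig proxy)."""
    rng = random.Random(seed)
    abundances = rank_abundance(n_genotypes)
    first = list(range(n_genotypes))
    second = derive_assemblage(n_genotypes, shared_fraction, permuted_fraction, rng)
    reads_1 = set(rng.choices(first, weights=abundances, k=n_reads))
    reads_2 = set(rng.choices(second, weights=abundances, k=n_reads))
    return len(reads_1 & reads_2)


# Compare the scenario favoured in the text with a no-sharing scenario.
print(shared_hits(100, shared_fraction=1.0, permuted_fraction=0.35, n_reads=500))
print(shared_hits(100, shared_fraction=0.0, permuted_fraction=0.0, n_reads=500))
```

A fitting procedure in this spirit would scan both fractions and keep the pair whose simulated overlap best matches the observed cross-contig spectrum.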
Monte Carlo Simulation of Cross-Contigs between Metagenomic Samples
This cross-contig analysis suggests that any two viral assemblages could have a vast majority of species in common and the order of the ranks in the rank-abundance curve could be determined by shuffling about a third of the most abundant species. These results confirm that geographical and changing environmental conditions allow different viral genotypes to become more or less prevalent within different assemblages while sharing essentially the same types of viruses. The less abundant viruses are not lost altogether, merely reduced in occurrence.
Local Versus Global Diversity
Using the PHACCS analysis system [29], the genotype richness, diversity, and evenness of the different metagenomes were estimated (). The British Columbia viral metagenome was the most genotype-rich (129,000 predicted genotypes) and diverse (H′ of 10.8 nats), whereas the Arctic metagenome was the least genotype-rich (532 predicted genotypes) and diverse (H′ of 6.05 nats).
Viral Assemblage Structure Predicted from Assembly of Metagenomic Sequences
Being located on the west coast of the North American continent, the coast of British Columbia is in an upwelling area. It is also enclosed and fed by many rivers. These conditions might substantially increase the diversity of microbial communities and thus provide an explanation for the very high viral assemblage diversity estimated in this oceanic region. Omitting the BBC sample, the viral diversity of the other regions (the Gulf of Mexico, Sargasso Sea, and Arctic Ocean) correlates with the well-established North-South latitudinal diversity gradient [44], with a larger diversity at lower latitudes. Planktonic diversity patterns of near-shore versus off-shore (more diverse plankton assemblages off-shore) [45] were not observed here; the large spatial scale of the sampling probably masked this effect if present.
Assemblies of the mixed sample were used to predict global viral diversity using PHACCS. A total of 57,600 different viral genotypes in all four regions (H′ of 9.8 nats) was estimated. This number is smaller than the number of genotypes predicted in the BBC sample, which may indicate an undersampling for the mixed metagenome or be due to some of the assumptions of the model. Taken together, these data indicate that the global marine viral richness could be as high as a few hundred thousand species, with a regional richness sometimes almost as high, likely because of migration processes.
Integrative Versus Single Samples
It was expected that the integrated samples would be more even, because the viruses that were most abundant at one point in space and time were assumed to be rarer at another (“kill-the-winner” hypothesis). As summarized in , the evenness of the single time point sample (SAR 0.905) fell between those of the three integrated samples (Arctic 0.964; BBC 0.918; GOM 0.851). Similarly, the predicted richness (5140 genotypes) and diversity (H′ 7.74 nats) at the single point represented by the Sargasso Sea sample fell between those of the integrated samples (richness 532–129,000; H′ 6.05–10.8 nats). Because other factors, such as latitude, presumably have a greater impact, it is not clear that integrating individual samples gave a greater depth of coverage.
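The relationship between the reported richness, Shannon diversity, and evenness values can be checked with a few lines. The sketch assumes that evenness is Pielou's J, that is, H′ divided by the natural logarithm of richness; the text does not state the formula, but under this assumption the three samples for which all numbers are given agree with the reported evenness to within about 0.001.

```python
import math


def pielou_evenness(shannon_nats, richness):
    """Pielou's J: Shannon diversity divided by its maximum, ln(richness)."""
    return shannon_nats / math.log(richness)


# (H' in nats, predicted genotypes, reported evenness), all taken from the text.
# GOM is omitted because its richness and H' are not given in this section.
reported = {
    "Arctic": (6.05, 532, 0.964),
    "BBC": (10.8, 129_000, 0.918),
    "SAR": (7.74, 5140, 0.905),
}

computed = {}
for name, (h_prime, richness, evenness) in reported.items():
    computed[name] = pielou_evenness(h_prime, richness)
    print(f"{name}: J = {computed[name]:.3f} (reported {evenness})")
```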
Without a doubt, many interesting trends based on depth and a wide variety of other spatial, biological, and temporal parameters were missed by the integrative sampling used here. However, this sampling does provide a useful overview of the marine virome on a global and regional scale. Currently, there are no real criteria as to what constitutes a useful size or time scale for sampling natural viral assemblages, so there is no particular advantage or disadvantage to keeping samples separate or analyzing them as a metadataset. Rather, the sampling scheme should be driven by the question being addressed. Viral assemblages are interesting in their own right, not just in the context of their host communities. However, future studies should also start cross-correlating the viruses with their hosts. Of particular interest will be determining if the “islands” and ORFans observed in microbial genomes are represented in the virome [6].
Potential Sampling and Processing Biases
Sampling bias in the current datasets was primarily due to loss of large viruses during filtering. Currently, there is no experimental method to avoid this problem. The cesium chloride gradients used here recover all known phage groups, and essentially all the viral-like particles in the starting samples migrate to the proper density in these preparations (as observed by epifluorescence microscopy; unpublished data). Unfortunately, the cloning methods used here will not recover RNA viruses. Suttle et al. [47] have shown that RNA viruses are present in the marine environment. Whereas most electron microscopy [49] and nucleic acid–based studies [51] have not found RNA viruses in large numbers, RNA viruses are still believed to be important components of the marine virome that need additional study.
Another potential source of bias is the different times that the samples were stored before processing. Phage particles are very stable and are often stored for decades at 4 °C. This is a commonly known lab phenomenon and is supported by the observation that the oldest viral concentrates (~12 y old) in this study had very high concentrations of viruses (>10⁹ viral-like particles per ml). Different phages, however, may have different decay rates under these conditions. This does not seem to be especially problematic, because there is no correlation between the types of viruses observed and the storage time. For example, the Arctic and SAR samples are the most recently harvested samples, yet they have the biggest differences in terms of types of phages (). Nonetheless, there may be effects of storage on the composition of the viral assemblages. For this reason, analyses based on the abundance of one specific virus relative to another were avoided in this study. Instead, the presence of a sequence in the metagenome was simply assumed to mean that the virus was in the original sample (i.e., an occurrence).
Whole-genome amplification techniques introduce biases in the relative concentrations of different genomes. Tests of Genomiphi by the manufacturer and others [52] have not reported a significant bias in the amplification of circular double-stranded DNA (dsDNA), with the exception of very small dsDNA targets (<1 kb), which are much smaller than the vast majority of marine viruses, and of ssDNA, which will probably be a preferred target for the DNA polymerase. Although not bias-free, Genomiphi is the most accurate amplification method available [54]. Interesting trends associated with viral assemblage structure may have been missed because we chose to use presence/absence data for the analyses presented here, but this conservative choice should keep storage, amplification, and sampling biases from affecting our interpretations.