Sampling and the Metagenomic Dataset
Microbial samples were collected as part of the Sorcerer II expedition between August 8, 2003, and May 22, 2004, by the S/V Sorcerer II, a 32-m sailing sloop modified for marine research. Most specimens were collected from surface water marine environments at approximately 320-km (200-mile) intervals. In all, 44 samples were obtained from 41 sites (), covering a wide range of distinct surface marine environments as well as a few nonmarine aquatic samples for contrast ().
Sampling Locations and Environmental Data
Several size fractions were isolated for every site (see Materials and Methods
). Total DNA was extracted from one or more fractions, mostly from the 0.1–0.8-μm size range. This fraction is dominated by bacteria, whose compact genomes are particularly suitable for shotgun sequencing. Random-insert clone libraries were constructed. Depending on the uniqueness of each sampling site and initial estimates of the genetic diversity, between 44,000 and 420,000 clones per sample were end-sequenced to generate mated sequencing reads. In all, the combined dataset includes 6.25 Gbp of sequence data from 41 different locations. Many of the clone libraries were constructed with a small insert size (<2 kbp) to maximize cloning efficiency. As this often resulted in mated sequencing reads that overlapped one another, overlapping mated reads were combined, yielding a total of ~6.4 M contiguous sequences, totaling ~5.9 Gbp of nonredundant sequence. Taken together, this is the largest collection of metagenomic sequences to date, providing more than a 5-fold increase over the dataset produced from the Sargasso Sea pilot study [19
] and more than a 90-fold increase over the other large marine metagenomic dataset [20
Assembling genomic data into larger contigs and scaffolds, especially metagenomic data, can be extremely valuable, as it places individual sequencing reads into a greater genomic context. A largely contiguous sequence links genes into operons, but also permits the investigation of larger biochemical and/or physiological pathways, and also connects otherwise-anonymous sequences with highly studied “taxonomic markers” such as 16S or recA,
thus clearly identifying the taxonomic group with which they are associated. The primary assembly of the combined GOS dataset was performed using the Celera Assembler [21
] with modifications as previously described [19
] and as given in Materials and Methods. The assembly was performed with quite stringent criteria, beginning with an overlap cutoff of 98% identity to reduce the potential for artifacts (e.g., chimeric assemblies or consensus sequences diverging substantially from the genome of any given cell). This assembly was the substrate for annotation (see the accompanying paper by Yooseph et al. [18
The degree of assembly of a metagenomic sample provides an indication of the diversity of the sample. A few substantial assemblies notwithstanding, the primary assembly was strikingly fragmented (). Only 9% of sequencing reads went into scaffolds longer than 10 kbp. A majority (53%) of the sequencing reads remained unassembled singletons. Scaffolds containing more than 50 kb of consensus sequence totaled 20.7 Mbp; of these, >75% were produced from a single Sargasso Sea sample and correspond to the Burkholderia
assemblies described previously [19
]. These results highlight the unusual abundance of these two organisms in a single sample, which significantly affected our expectations regarding the current dataset. Given the large size of the combined dataset and the substantial amount of sequencing performed on individual filters, the overall lack of assembly provides evidence of a high degree of diversity in surface planktonic communities. To put this in context, suppose there were a clonal organism that made up 1% of our data, or ~60 Mbp. Even a genome of 10 Mbp—enormous by bacterial standards—would be covered ~6-fold. Such data might theoretically assemble with an average contig approaching 50 kb [22
]. While real assemblies generally fall short of theory for various reasons, Shewanella
data make up <1% of the total GOS dataset, and yet most of the relevant reads assemble into scaffolds >50 kb. Thus, with few scaffolds of significant length, we could conclude that there are very few clonal organisms present at even 1% in the GOS dataset.
To investigate the nature of the implied diversity and to see whether greater assembly could be achieved, we explored several alternative approaches. Breaks in the primary assembly resulted from two factors: incomplete sequence coverage and conflicts in the data. Conflicts can break assemblies when there is no consistent way to chain together all overlapping sequencing reads. As it was possible that there would be fewer conflicts within a single sample (i.e., that diversity within a single sample would be lower), assemblies were attempted with individual samples. However, the results did not show any systematic improvements even in those samples with greater coverage (unpublished data). Upon manual inspection, most assembly-breaking conflicts were found to be local in nature. These observations suggested that reducing the degree of sequence identity required for assembly could ameliorate both factors limiting assembly: effective coverage would increase and many minor conflicts would be resolved.
Accordingly, we produced a series of assemblies based on 98%, 94%, 90%, 85%, and 80% identity overlaps for two subsets of the GOS dataset, again using the Celera Assembler. Assembly lengths increased as the overlap cutoff decreased from 98% to 94% to 90%, and then leveled off or even dropped as stringency was reduced below 90% (). Although larger assemblies could be generated using lower identity overlaps, significant numbers of overlaps satisfying the chosen percent identity cutoff still went unused in each assembly. This is consistent with a high rate of conflicting overlaps and in turn diagnostic of significant polymorphism.
Evaluation of Alternative Assembly Methods
In mammalian sequencing projects the use of larger insert libraries is critical to producing larger assemblies because of their ability to span repeats or local polymorphic regions [23
]. The shotgun sequencing libraries from the GOS filters were typically constructed from inserts shorter than 2 kb. Longer plasmid libraries were attempted but were much less stable. We obtained paired-end sequences from 21,419 fosmid clones (average insert size, 36 kb; [24
]) from the 0.1-micron fraction of GS-33. The effect of these long mate pairs on the GS-33 assembly was quite dramatic, particularly at high stringency (e.g., improving the largest scaffold from 70 kb to 1,247 kb and the largest contig from 70 kb to 427 kb). At least for GS-33 this suggests that many of the polymorphisms affect small, localized regions of the genome that can be spanned using larger inserts. This degree of improvement may be greater than what could be expected in general, as the diversity of GS-33 is by far the lowest of any of the currently sequenced GOS samples, yet it clearly indicates the utility of including larger insert libraries for assembly.
In the absence of substantial assembly, direct comparison of the GOS sequencing data to the genomes of sequenced microbes is an alternative way of providing context, and also allows for exploration of genetic variation and diversity. A large and growing set of microbial genomes are available from the National Center for Biotechnology Information (NCBI; http://www.ncbi.nlm.nih.gov
). At the time of this study, we used 334 finished and 250 draft microbial genomes as references for comparison with the GOS sequencing reads. Comparisons were carried out in nucleotide-space using the sequence alignment tool BLAST [26
]. BLAST parameters were designed to be extremely lenient so as to detect even distant similarities (as low as 55% identity). A large proportion of the GOS reads, 70% in all, aligned to one or more genomes under these conditions. However, many of the alignments were of low identity and used only a portion of the entire read. Such low-quality hits may reflect distant evolutionary relationships, and therefore less information is gained based on the context of the alignment. More stringent criteria could be imposed requiring that the reads be aligned over nearly their entire length without any large gaps. Using this stringent criterion only about 30% of the reads aligned to any of the 584 reference genomes. We refer to these fully aligned reads as “recruited reads.” Recruited reads are far more likely to be from microbes closely related to the reference sequence (same species) than are partial alignments. Despite the large number of microbial genomes currently available, including a large number of marine microbes, these results indicate that a substantial majority of GOS reads cannot be specifically related to available microbial genomes.
The amount and distribution of reads recruited to any given genome provides an indication of the abundance of closely related organisms. Only genomes from the five bacterial genera Prochlorococcus, Synechococcus, Pelagibacter, Shewanella, and Burkholderia yielded substantial and uniform recruitment of GOS fragments over most of a reference genome (). These genera include multiple reference genomes, and we observed significant differences in recruitment patterns even between organisms belonging to the same species (A–I). Three genera, Pelagibacter (A), Prochlorococcus (B–F), and Synechococcus (G–I), were found abundantly in a wide range of samples and together accounted for roughly 50% of all the recruited reads (though only ~15% of all GOS sequencing reads). By contrast, although every genome tested recruited some GOS reads, most recruited only a small number, and these reads clustered at lower identity to locations corresponding to large highly conserved genes (for typical examples see E–F). We refer to this pattern as nonspecific recruitment as it reflects taxonomically nonspecific signals, with the reads in question often recruiting to distantly related sets of genomes. Most microbial genomes, including many of the marine microbes (e.g., the ubiquitous genus Vibrio), demonstrated this nonspecific pattern of recruitment.
Microbial Genera that Recruited the Bulk of the GOS Reads
The relationship between the similarity of an individual sequencing read to a given genome and the sample from which the read was isolated can provide insight into the structure, evolution, and geographic distribution of microbial populations. These relationships were assessed by constructing a “percent identity plot” [27
] in which the alignment of a read to a reference sequence is shown as a bar whose horizontal position indicates location on the reference and whose vertical position indicates the percent identity of the alignment. We colored the plotted reads according to the samples to which they belonged, thus indirectly representing various forms of metadata (geographic, environmental, and laboratory variables). We refer to these plots that incorporate metadata as fragment recruitment plots. Fragment recruitment plots of GOS sequences recruited to the entire genomes of Pelagibacter ubique
HTCC1062, Prochlorococcus marinus
MIT9312, and Synechococcus
WH8102 are presented in Poster S1
Within-Ribotype Population Structure and Variation
Characteristic patterns of recruitment emerged from each of these abundant marine microbes consisting of horizontal bands made up of large numbers of GOS reads. These bands seem constrained to a relatively narrow range of identities that tile continuously (or at least uniformly, in the case when abundance/coverage is lower) along ~90% of the reference sequence. The uninterrupted tiling indicates that environmental genomes are largely syntenic with the reference genomes. Multiple bands, distinguished by degree of similarity to the reference and by sample makeup, may arise on a single reference (Poster S1
D and S1
F). Each of these bands appears to represent a distinct, closely related population we refer to as a subtype. In some cases, an abundant subtype is highly similar to the reference genome, as is the case for P. marinus
MIT9312 (Poster S1
) and Synechococcus
RS9917 (unpublished data). P. ubique
HTCC1062 and other Synechococcus
strains like WH8102 show more complicated banding patterns (Poster S1
D and S1
F) because of the presence of multiple subtypes that produce complex often overlapping bands in the plots. Though the recruitment patterns can be quite complex they are also remarkably consistent over much of the reference genome. In these more complicated recruitment plots, such as the one for P. ubique
HTCC1062, individual bands can show sudden shifts in identity or disappear altogether, producing a gap in recruitment that appears to be specific to that band (see P. ubique
recruitment plots on Poster S1
B and S1
E, and specifically between 130–140 kb). Finally, phylogenetic analysis indicates that separate bands are indeed evolutionarily distinct at randomly selected locations along the genome.
The amount of sequence variation within a given band cannot be reliably determined from the fragment recruitment plots themselves. To examine this variation, we produced multiple sequence alignments and phylogenies of reads that recruited to several randomly chosen intervals along given reference genomes to show that there can be considerable within-subtype variation (A–B). For example, within the primary band found in recruitment plots to P. marinus MIT9312, individual pairs of overlapping reads typically differ on average between 3%–5% at the nucleotide level (depending on exact location in the genome). Very few reads that recruited to MIT9312 have perfect (mismatch-free) overlaps with any other read or to MIT9312, despite ~100-fold coverage. While many of these differences are silent (i.e., do not change amino acid sequences), there is still considerable variation at the protein level (unpublished data). The amount of variation within subtypes is so great that it is likely that no two sequenced cells contained identical genomes.
Population Structure and Variation as Revealed by Phylogeny
Identifying Genomic Structural Variation with Metagenomic Data
Variation in genome structure in the form of rearrangements, duplications, insertions, or deletions of stretches of DNA can also be explored via fragment recruitment. The use of mated sequencing reads (pairs of reads from opposite ends of a clone insert) provides a powerful tool for assessing structural differences between the reference and the environmental sequences. The cloning and sequencing process determines the orientation and approximate distance between two mated sequencing reads. Genomic structural variation can be inferred when these are at odds with the way in which the reads are recruited to a reference sequence. Relative location and orientation of mated sequences provide a form of metadata that can be used to color-code a fragment recruitment plot (). This makes it possible to visually identify and classify structural differences and similarities between the reference and the environmental sequences (). For the abundant marine microbes, a high proportion of mated reads in the “good” category (i.e., in the proper orientation and at the correct distance) show that synteny is conserved for a large portion of the microbial population. The strongest signals of structural differences typically reflect a variant specific to the reference genome and not found in the environmental data. In conjunction with the requirement that reads be recruited over their entire length without interruption, recruitment plots result in pronounced recruitment gaps at locations where there is a break in synteny. Other rearrangements can be partially present or penetrant in the environmental data and thus may not generate obvious recruitment gaps. However, given sufficient coverage, breaks in synteny should be clearly identifiable using the recruitment metadata based on the presence of “missing” mates (i.e., the mated sequencing read that was recruited but whose mate failed to recruit; ). The ratio of missing mates to “good” mates determines how penetrant the rearrangement is in the environmental population.
Categories of Recruitment Metadata
Fragment Recruitment at Sites of Rearrangements
In theory, all genome structure variations that are large enough to prevent recruitment can be detected, and all such rearrangements will be associated with missing mates. Depending on the type of rearrangement present other recruitment metadata categories will be present near the rearrangements' endpoints. This makes it possible to distinguish among insertions, deletions, translocations, inversions, and inverted translocations directly from the recruitment plots. Examples of the patterns associated with different rearrangements are presented in . This provides a rapid and easy visual method for exploring structural variation between natural populations and sequenced representatives (Poster S1
A and S1
Genomic Structural Variation in Abundant Marine Microbes
Variation in genome structure potentially results in functional differences. Of particular interest are those differences between sequenced (reference) microbes and environmental populations. These differences can indicate how representative a cultivated microbe might be and shed light on the evolutionary forces driving change in microbial populations. Fragment recruitment in conjunction with the mate metadata helped us to identify both the consistent and the rare structural differences between the genomes of microbial populations in the GOS data and their closest sequenced relatives. Our analysis has thus far been confined to the three microbial genera that were widespread in the GOS dataset as represented by the finished genomes of P. marinus MIT9312, P. ubique HTCC1062, and to a lesser extent Synechococcus WH8102. Each of these genomes is characterized by large and small segments where little or no fragment recruitment took place. We refer to these segments as “gaps.” These gaps represent reference-specific differences that are not found in the environmental populations rather than a cloning bias that identifies genes or gene segments that are toxic or unclonable in E. coli. The presence of missing mates flanking these gaps indicates that the associated clones do exist, and therefore that cloning issues are not a viable explanation for the absence of recruited reads. Although the reference-specific differences are quite apparent due to the recruitment gaps they generate, there are also sporadic rearrangements associated with single clones, mostly resulting from small insertions or deletions.
Careful examination of the unrecruited mates of the reads flanking the gaps allowed us to identify, characterize, and quantify specific differences between the reference genome and their environmental relatives. The results of this analysis for P. ubique
and P. marinus
have been summarized in . With few exceptions, small gaps resulted from the insertion or deletion of only a few genes. Many of the genes associated with these small insertions and deletions have no annotated function. In some cases the insertions display a degree of variability such that different sets of genes are found at these locations within a portion of the population. In contrast, many of the larger gaps are extremely variable to the extent that every clone contains a completely unrelated or highly divergent sequence when compared to the reference or to other clones associated with that gap. These segments are hypervariable and change much more rapidly than would be expected given the variation in the rest of the genome. Sites containing a hypervariable segment nearly always contained some insert. We identified two exceptions both associated with P. ubique.
The first is approximately located at the 166-kb position in the P. ubique
HTCC1062 genome. Though no large gap is present, the mated reads indicate that under many circumstances a highly variable insert is often present. The second is a gap on HTCC1062 that appears between 50 and 90 kb. This gap appears to be less variable than other hypervariable segments and is occasionally absent based on the large numbers of flanking long mated reads (Poster S1
A). Interestingly, the long mated reads around this gap seem to be disproportionately from the Sargasso Sea samples, suggesting that this segment may be linked to geographic and/or environmental factors. Thus, hypervariable segments are highly variable even within the same sample, can on occasion be unoccupied, and the variation, or lack thereof, can be sample dependent.
Atypical Segments in P. marinus MIT9312 and P. ubique HTCC1062 (SAR11)
Hypervariable segments have been seen previously in a wide range of microbes, including P. marinus
], but their precise source and functional role, especially in an environmental context, remains a matter of ongoing research. For clues to these issues we examined the genes associated with the missing mates flanking these segments and the nucleotide composition of the gapped sequences in the reference genomes. In some rare cases the genes identified on reads that should have recruited within a hypervariable gap were highly similar to known viral genes. For example, a viral integrase was associated with the P. ubique
HTCC1062 hypervariable gap between 516 and 561 kb. However, in the majority of cases the genes associated with these gaps were uncharacterized, either bearing no similarity to known genes or resembling genes of unknown function. If these genes were indeed acquired through horizontal transfer then we might expect that they would have obvious compositional biases. Oligonucleotide frequencies along the P. ubique
HTCC1062 and Synechococcus
WH8102 genomes are quite different in the large recruitment gaps in comparison to the well-represented portions of the genome (Poster S1
). Surprisingly, this was less true for P. marinus
MIT9312, where the gaps have been linked to phage activity [28
]. These results suggest that these hypervariable segments of the genome are widespread among marine microbial populations, and that they are the product of horizontal transfer events perhaps mediated by phage or transposable elements. These results are consistent with and expand upon the hypothesis put forward by Coleman et al. [28
] suggesting that these segments are phage mediated, and conflicts with initial claims that the HTCC1062 genome was devoid of genes acquired by horizontal transfer [29
Though insertions and deletions accounted for many of the obvious regions of structural variation, we also looked for rearrangements. The high levels of local synteny associated with P. ubique and P. marinus suggested that large-scale rearrangements were rare in these populations. To investigate this hypothesis we used the recruitment data to examine how frequently rearrangements besides insertions and deletions could be identified. We looked for rearrangements consisting of large (greater than 50 kb) inversions and translocations associated with P. marinus; however, we did not identify any such rearrangements that consistently distinguished environmental populations from sequenced cultivars. Rare inversions and translocations were identified in the dominant subtype associated with MIT9312 (). Based on the amount of sequence that contributed to the analysis, we estimate that one inversion or translocation will be observed for every 2.6 Mbp of sequence examined (less than once per P. marinus genome).
Six Large-Scale Translocations and Inversions Were Identified in the Abundant P. marinus Subtype
A further observation concerns the uniformity along a genome of the evolutionary history among and within subtypes. For instance, the similarity between GOS reads and P. marinus
MIT9312 is typically 85%–95%, while the similarity between MIT9312 and P. marinus
MED4 is generally ~10% lower. However, there are several instances where the divergence of MIT9312 and MED4 abruptly decreases to no more than that between the GOS sequences and MIT9312 (Poster S1
G). These results are consistent either with horizontal transfer (recombination) or with inhomogeneous selectional pressures. Similar patterns are present in the two high-identity subtypes seen on the P. ubique
HTCC1062 genome (Poster S1
D). Other regions show local increases in similarity between MIT9312 and the dominant subtype that are not reflected in the MIT9312/MED4 divergence (e.g., near positions 50 kb, 288 kb, 730 kb, 850 kb, and 954 kb on MIT9312; also see Poster S1
G). These latter regions might reflect either regions of homogenizing recombination or regions of higher levels of purifying selection. However, the lengths of the intervals (several are 10 kb or more) are longer than any single gene and correspond to genes that are not extremely conserved over greater taxonomic distances (in contrast to the ribosomal RNA operon). Equally, if widespread horizontal transfer of an advantageous segment explains these intervals, the transfers occurred long enough ago for appreciable variation to accumulate (unpublished data).
Extreme Assembly of Uncultivated Populations
The analyses described above have been confined to those organisms with representatives in culture and for which genomes were readily available. Producing assemblies for other abundant but uncultivated microbial genera would provide valuable physiological and biochemical information that could eventually lead to the cultivation of these organisms, help elucidate their role in the marine community, and allow similar analyses of their evolution and variation such as those performed on sequenced organisms. Previous assembly efforts and the fragment recruitments plots showed that there is considerable and in many cases conflicting variation among related organisms. Such variation is known to disrupt whole-genome assemblers. This led us to try an assembly approach that aggressively resolves conflicts. We call this approach “extreme assembly” (see Materials and Methods
). This approach currently does not make use of mate-pairing data and, therefore produces only contigs, not scaffolded sequences. Using this approach, contigs as large as 900 kb could be aligned almost in their entirety to the P. marinus
MIT9312 and P. ubique
HTCC1062 genomes (J–L). Consistent patterns of fragment recruitment (see below) generally provided evidence of the correctness of contigs belonging to otherwise-unsequenced organisms. Accordingly, large contigs from these alternate assemblies were used to investigate genetic and geographic population structure, as described below. However, the more aggressive assemblies demonstrably suffered from higher rates of assembly artifacts, including chimerism and false consensus sequences (). Thus, the more stringent primary assembly was employed for most assembly-based analyses, as manual curation was not practical.
Examples of Chimeric Extreme Assemblies
As just noted, many of the large contigs produced by the more aggressive assembly methods described above did not align to any great degree with known genomes. Some could be tentatively classified based on contained 16S sequences, but the potential for computationally generated chimerism within the rRNA operon is sufficiently high that inspection of the assembly or other means of confirming such classifications is essential. An alternative to an unguided assembly that facilitates the association of assemblies with known organisms is to start from seed fragments that can be identified as belonging to a particular taxonomic group. We employed fragments outside the ribosomal RNA operon that were mated to a 16S-containing read, limiting extension to the direction away from the 16S operon. This produced contigs of 100 kb or more for several of the ribotypes that were abundant in the GOS dataset. When evaluated via fragment recruitment (M–O), these assemblies revealed patterns analogous to those seen for the sequenced genomes described above: multiple subtypes could be distinguished along the assembly, differing in similarity to the reference sequence and sample distribution, with occasional gaps. Hypervariable segments by definition were not represented in these assemblies, but they may help explain the termination of the extreme assemblies for P. marinus and SAR11 and provide a plausible explanation for termination of assemblies of the other deeply sampled populations as well.
This directed approach to assembly can also be used to investigate variation within a group of related organisms (e.g., a 16S ribotype). We explored the potential to assemble distinct subtypes of SAR11 by repeatedly seeding extreme assembly with fragments mated to a SAR11-like 16S sequence. compares the first 20 kb from each of 24 independent assemblies. Eighteen of these segments could be aligned full-length to a portion of the HTCC1062 genome just upstream of 16S, while six appeared to reflect rearrangements relative to HTCC1062. The rearranged segments were associated with more divergent 16S sequences (8%–14% diverged from the 16S of HTCC1062), while those without rearrangements corresponded to less divergent 16S (averaging less than 3% different from HTCC1062). In each segment, many reads were recruited above 90% identity, but different samples dominated different assemblies. Phylogenetic trees support the inference of evolutionarily distinct subtypes with distinctive sample distributions ().
Fragment Recruitment Plots to 20-kb Segments of SAR11-Like Contigs Show That Many SAR11 Subtypes, with Distinct Distributions, Can Be Separated by Extreme Assembly
Phylogeny of GOS Reads Aligning to P. ubique HTCC1062 Upstream of 16S Gene Indicates That the Extreme Assemblies in Correspond to Monophyletic Subtypes
Environmental surveys provide a cultivation-independent means to examine the diversity and complexity of an environmental sample and serve as a basis to compare the populations between different samples. Typically, these surveys use PCR to amplify ubiquitous but slowly evolving genes such as the 16S rRNA or recA
genes. These in turn can be used to distinguish microbial populations. Since PCR can introduce various biases, we identified 16S genes directly from the primary GOS assembly. In total, 4,125 distinct full-length or partial 16S were identified. Clustering of these sequences at 97% identity gave a total of 811 distinct ribotypes. Nearly half (48%) of the GOS ribotypes and 88% of the GOS 16S sequences were assigned to ribotypes previously deposited in public databases. That is, more than half the ribotypes in the GOS dataset were found to be novel at what is typically considered the species level [30
]. The overall taxonomic distribution of the GOS ribotypes sampled by shotgun sequencing is consistent with previously published PCR based studies of marine environments () [31
]. A smaller amount (16%) of GOS ribotypes and 3.4% of the GOS 16S sequences diverged by more than 10% from any publicly available 16S sequence, thus being novel to at least the family level.
Taxonomic Makeup of GOS Samples Based on 16S Data from Shotgun Sequencing
A census of microbial ribotypes allows us to identify the abundant microbial lineages and estimate their contribution to the GOS dataset. Of the 811 ribotypes, 60 contain more than 8-fold coverage of the 16S gene (); jointly, these 60 ribotypes accounted for 73% of all the 16S sequence data. All but one of the 60 have been detected previously, yet only a few are represented by close relatives with complete or nearly complete genome sequencing projects (see Fragment Recruitment for further details). Several other abundant 16S sequences belong to well-known environmental ribotypes that do not have cultivated representatives (e.g., SAR86, Roseobacter NAC-1–2, and branches of SAR11 other than those containing P. ubique). Interestingly, archaea are nearly absent from the list of dominant organisms in these near-surface samples.
Most Abundant Ribotypes (97% Identity Clusters)
The distribution of these ribotypes reveals distinct microbial communities ( and ). Only a handful of the ribotypes appear to be ubiquitously abundant; these are dominated by relatives of SAR11 and SAR86. Many of the ribotypes that are dominant in one or more samples appear to reside in one of three separable marine surface habitats. For example, several SAR11, SAR86, and alpha Proteobacteria, as well as an Acidimicrobidae group, are widespread in the surface waters, while a second niche delineated by tropical samples contains several different SAR86, Synechococcus and Prochlorococcus (both cyanobacterial groups), and a Rhodospirillaceae group. Other ribotypes related to Roseobacter RCA, SAR11, and gamma Proteobacteria are abundant in the temperate samples but were not observed in the tropical or Sargasso samples. Not surprisingly, samples taken from nonmarine environments (GS33, GS20, GS32), estuaries (GS11, GS12), and larger-sized fraction filters (GS01a, GS01b, GS25) have distinguishing ribotypes. Furthermore, as the complete genomes of these dominant members are obtained, the capabilities responsible for their abundances may well lend insight into the community metabolism in various oceanic niches.
Presence and Abundance of Dominant Ribotypes
The most common approach for comparing the microbial community composition across samples has been to examine the ribotypes present as indicated by 16S rRNA genes or by analyzing the less-conserved ITS located between the 16S and 23S gene sequences [7
]. However, a clear observation emerging from the fragment recruitment views was that the reference ribotypes recruit multiple subtypes, and that these subtypes were distributed unequally among samples (, , ; Poster S1
F, and S1
We developed a method to assess the genetic similarity between two samples that potentially makes use of all portions of a genome, not just the 16S rRNA region. This similarity measure is assembly independent; under certain circumstances, it is equivalent to an estimate of the fraction of sequence from one sample that could be considered to be in the other sample. Whole-metagenomic similarities were computed for all pairs of samples. Results are presented for comparisons at ≥98% and 90% identity. No universal cutoff consistently divides sequences into natural subsets, but the 98% identity cutoff provides a relatively high degree of resolution, while the 90% cutoff appears to be a reasonable heuristic for defining subtypes. For instance, a 90% cutoff treats most of the reads specifically recruited to P. marinus MIT9312 as similar (those more similar to MED4 notably excepted), while reasonably separating clades of SAR11 ( and ). Reads with no qualifying overlap alignment to any other read in a pair of samples are uninformative for this analysis, as they correspond to lineages that were so lightly sequenced that their presence in one sample and absence in another may be a matter of chance. For the 90% cutoff, 38% of the sequence reads contributed to the analysis. The resulting similarities reveal clear and consistent groupings of samples, as well as the outlier status of certain samples ( and ).
Similarity between Samples in Terms of Shared Genomic Content
The broadest contrast was between samples that could be loosely labeled “tropical” (including samples from the Sargasso Sea [GS00b, GS00c, GS00d] and samples that are temperate by the formal definition but under the influence of the Gulf Stream [GS14, GS15]) and “temperate.” Further subgroups can be identified within each of these categories, as indicated in and . In some cases, these groupings were composed of samples taken from different ocean basins during different legs of the expedition. A few pairs of samples with strikingly high similarity were observed, including GS17 and GS18, GS23 and GS26, GS27 and GS28, and GS00b and GS00d. In each case, these pairs of samples were collected from consecutive or nearly consecutive samples. However, the same could be said of many other pairs of samples that do not show this same degree of similarity. Indeed, geographically and temporally separated samples taken in the Atlantic (GS17, GS18) and Pacific (GS23, GS26) during separate legs of the expedition are more similar to one another than were most pairs of consecutive samples. The samples with least similarity to any other sample were from unique habitats. Thus, similarity cannot be attributed to geographic separation alone.
The groupings described above can be reconstructed from taxonomically distinct subsets of the data. Specifically, the major groups of samples visible in were reproduced when sample similarities were determined based only on fragments recruiting to P. ubique HTCC1062 (unpublished data). Likewise, the same groupings were observed when the fragments recruiting to either HTCC1062 or P. marinus MIT9312, or both, were excluded from the calculations (unpublished data). Thus, the factors influencing sample similarities do not appear to rely solely on the most abundant organisms but rather are reflected in multiple microbial lineages.
It is tempting to view the groups of similar samples as constituting community types. Sample similarities based on genomic sequences correlated significantly with differences in the environmental parameters (), particularly water temperature and salinity (unpublished data). Samples that are very similar to each other had relatively small differences in temperature and salinity. However, not all samples that had similar temperature and salinity had high community similarities. Water depth, primary productivity, fresh water input, proximity to land, and filter size appeared consistent with the observed groupings. Other factors such as nutrients and light for phototrophs and fixed carbon/energy for chemotrophs may ultimately prove better predictors, but these results demonstrate the potential of using metagenomic data to tease out such relationships.
Examining the groupings in in light of habitat and physical characteristics, the following may be observed. The first two samples, a hypersaline pond in the Galapagos Islands (GS33) and the freshwater Lake Gatun in the Panama Canal (GS20) are quite distinct from the rest. Salinity—both higher and lower than the remaining coastal and ocean samples—is the simplest explanation.
Twelve samples form a strong temperate cluster as seen in the similarity matrix of as a darker square bounded by GS06 and GS12. Embedded within the temperate cluster are three subclusters. The first subcluster includes five samples from Nova Scotia through the Gulf of Maine. This is followed by a subcluster of four samples between Rhode Island and North Carolina. The northern subcluster was sampled in August, the southern subcluster in November and December. Though all samples were collected in the top few meters, the southern samples were in shallower waters, 10 to 30 m deep, whereas most of the northern samples were in waters greater than 100 m deep. Monthly average estimates of chlorophyll a concentrations were typically higher in the southern samples as well (). All of these factors—temperature, system primary production, and depth of the sampled water body—likely contribute to the differences in microbial community composition that result in the two well-defined clusters. The final temperate subgroup includes two estuaries, Chesapeake Bay (GS12) and Delaware Bay (GS11), distinguished by their lower salinity and higher productivity. However, GS11 is markedly similar not only to GS12 but also to coastal samples, whereas the latter appears much more unique. Interestingly, the Bay of Fundy estuary sample (GS06) clearly did not group with the two other estuaries, but rather with the northern subgroup, perhaps reflecting differences in the rate or degree of mixing at the sampling site.
Continuing to the right and downward in , one can see a large cluster of 25 samples from the tropics and Sargasso Sea, bounded by GS47 and GS00b. This can be further subdivided into several subclusters. The first subcluster (a square bounded by GS47 and GS14) includes 14 samples, about half of which were from the Galapagos. The second distinct subcluster (a square bounded by GS16 and GS26) includes seven samples from Key West, Florida, in the Atlantic Ocean to a sample close to the Galapagos Islands in the Pacific Ocean. Loosely associated with this subcluster is a sample from a larger filter size taken en route to the Galapagos (GS25). The remaining samples group weakly with the tropical cluster. GS32 was taken in a coastal mangrove in the Galapagos. The thick organic sediment at a depth of less than a meter is the likely cause for it being unlike the other samples. Sample 00a was from the Sargasso Sea and contained a large fraction of sequence reads from apparently clonal Burkholderia and Shewanella species that are atypical. When this sample is reanalyzed to exclude reads identified as belonging to these two groups, sample GS00a groups loosely with GS00b, GS00c, and GS00d (unpublished data). Finally, three subsamples from a single Sargasso sample (GS01a, GS01b, GS01c) group together, despite representing three distinct size fractions (3.0–20, 0.8–3.0, and 0.1–0.8 μm, respectively; ).
The complete set of sample similarities is more complex than described above, and indeed is more complex than can be captured by a hierarchical clustering. For instance, the southern temperate samples are appreciably more similar to the tropical cluster than are the northern temperate samples. GS22 appears to constitute a mix of tropical types, showing strong similarity not only to the GS47–GS14 subcluster to which it was assigned, but also to the other tropical samples.
These results may be compared to the more traditional view of community structure afforded by 16S sequences (). Some of the same groupings of samples are visible using both analyses. Several ribotypes recapitulated the temperate/tropical clustering described above. Others were restricted to the single instances of nonmarine habitats. Several of the most abundant organisms from the coastal mangrove, hypersaline lagoon, and freshwater lake were found exclusively in these respective samples. However, while several ribotypes recapitulated the temperate/tropical distinction revealed by the genomic sequence, others crosscut it. A few dominant 16S ribotypes, related to SAR11, SAR86, and SAR116, were found in every marine sample. The brackish waters from two mid-Atlantic estuaries (GS11 and GS12) contained a mixture of otherwise exclusively marine and freshwater ribotypes; similarity of these sites to the freshwater sample (GS20) was minimal at the metagenomic level, while the greater similarity of GS11 to coastal samples visible at the metagenomic level was not readily visible here. A fuller comparison of metagenome-based measurements of diversity based on a large dataset of PCR-derived 16S sequences will be presented in another paper (in preparation).
Variation in Gene Abundance
Differences in gene content between samples can identify functions that reflect the lifestyles of the community in the context of its local environment [20
]. We examined the relative abundance of genes belonging to specific functional categories in the distinct GOS samples. Genes were binned into functional categories using TIGRFAM hidden Markov models [18
], which are well annotated and manually curated [33
The results can be filtered in various ways to highlight genes associated with specific environments. One catalog of possible interest is genes that were predominantly found in a single sample. We identified 95 TIGRFAMs that annotated large sets of genes (100 or more) that were significantly more frequent (greater than 2-fold) in one sample than in any other sample (). Not surprisingly, this approach disproportionately singles out genes from the samples collected on larger filters (GS01a, GS01b, and GS25) and from the nonmarine environments, particularly the hypersaline pond (sample GS33). Another contrast might be between the temperate and tropical clusters ( and ). We identified 32 proteins that were more than 2-fold more frequent in one or the other group (). The presence of various Prochlorococcus-associated genes in this list highlights some of the potential challenges with this sort of approach. Overrepresentation may reflect: a direct response to particular environmental pressures (as the excess of salt transporters plausibly do in the hypersaline pond); a lineage-restricted difference in functional repertoire (as exemplified by the excess of photosynthesis genes in samples containing Prochlorococcus); or a more incidental “hitchhiking” of a protein found in a single organism that happens to be present.
Relative Abundance of TIGRFAMs Associated with a Specific Sample
Relative Abundance of TIGRFAM Matches in Temperate and Tropical Waters
We explored whether clearer and more informative differences could be discovered between communities by focusing on groups of samples that are highly similar in overall taxonomic/genetic content. Two pairs of samples provide a particularly nice illustration of this approach. Samples GS17 and GS18 from the western Caribbean Sea and samples GS23 and GS26 from the eastern Pacific Ocean were all very similar based on the presence of abundant ribotypes and overall similarity in genetic content (–). Despite these similarities, several genes are found to be up to seven times more common in the pair of Caribbean samples than the Pacific pair (). No genes are more than 2-fold higher in the Pacific than the Caribbean pair of samples. Several of the most differentially abundant genes are related to phosphate transport and utilization. It is very plausible that this is a reflection of a functional adaptation: these differences correlate well with measured differences in phosphate abundance between the Atlantic and eastern Pacific samples [34
], and phosphate abundance plays a critical role in microbial growth [36
]. Indeed, the ability to acquire phosphate, especially under conditions where it is limited, is thought to determine the relative fitness of Prochlorococcus
Relative Abundance of TIGRFAM Matches in Atlantic and Pacific Open Ocean Waters
The single greatest difference between GS17 and GS18 on the one hand and GS23 and GS26 on the other was attributed to a set of genes annotated by the hidden Markov model TIGR02136 as a phosphate-binding protein (PstS).
This TIGRFAM identified a single gene in both P. marinus
MIT9312 and P. ubique
HTCC1062. In P. marinus
MIT9312, this gene is located at 672 kb lying roughly in the middle of a 15-kb segment of the genome that recruits almost no GOS sequences from the Pacific sampling sites (Poster S1
H). In P. ubique
HTCC1062, the PstS
gene is found at 1,133 kb in a 5-kb segment that also recruited far fewer GOS sequences from all the Pacific samples except for GS51 (Poster S1
E). These genomic segments differ structurally among isolates but they are no more variable than the flanking regions, and thus are not hypervariable in the sense used previously (unpublished data). Nor are they particularly conserved when present, indicating that they are not the result of a recent lateral transfer. Phylogenetic analyses outside these segments did not produce any evidence of a Pacific versus Caribbean clade of either Prochlorococcus
or SAR11 (A–B). The presence or absence of phosphate transporters is not limited to these two types of organisms. The number of phosphate transporters that were found in the Caribbean far exceeds the number that can be attributed to HTCC1062- and MIT9312-like organisms. However, these results indicate that within individual strains or subtypes the ability to acquire phosphate (in one or more of its forms) can vary without detectable differences in the surrounding genomic sequences.
Biogeographic Distribution of Proteorhodopsin Variants
Variation in gene content is only one aspect of the tremendous diversity in the GOS data. The functional significance of all the polymorphic differences between homologous proteins remains largely unknown. To look for functional differences, we analyzed members of proteorhodopsin gene family. Proteorhodopsins are fast, light-driven proton pumps for which considerable functional information is available though their biological role remains unknown. Proteorhodopsins were highly abundant in the Sargasso Sea samples [19
] and continue to be highly abundant and evenly distributed (relative to recA
abundance) in all the GOS samples. A total of 2,674 putative proteorhodopsin genes were identified in the GOS dataset. Although many of the sequences are fragmentary, 1,874 of these genes contain the residue that is primarily responsible for tuning the light-absorbing properties of the protein [39
], and these properties have been shown to be selected for under different environmental conditions [42
]. Variation at this residue is strongly correlated with sample of origin (). The leucine (L) or green-tuned variant was highly abundant in the North Atlantic samples and in the nonmarine environments like the fresh water sample from Lake Gatun (GS20). The glutamine (Q) or blue-tuned variant dominated in the remaining mostly open ocean samples.
Distribution of Common Proteorhodopsin Variants across GOS Samples
Given our limited understanding of the biological role for proteorhopsin, the reason for this differential distribution is not immediately clear. In coastal waters where nutrients are more abundant, phytoplankton is dominant. Phytoplankton absorbs primarily in the blue and red spectra; consequently, the water appears green [43
]. Conversely, in the open ocean nutrients are rare and phytoplanktonic biomass is low, so waters appear blue because in the absence of impurities the red wavelengths are absorbed preferentially [44
]. It may be that proteorhodopsin-carrying microbes have simply adapted to take advantage of the most abundant wavelengths of light in these systems.
Proteorhodopsins encoded on reads that were recruited to P. ubique
HTCC1062 account for a fraction (~25%) of all the proteorhodopsin-associated reads, suggesting that the remainder must be associated with a variety of marine microbial taxa (see also [45
]). Phylogenetic analysis of the SAR11-associated proteins revealed that each variant has arisen independently at least two times in the SAR11 lineage (C). Consistent with other findings that proteorhodopsins are widely distributed throughout the microbial world [48
], we conclude that multiple microbial lineages are responsible for proteorhodopsin spectral variation and that the abundance of a given variant reflects selective pressures rather than taxonomic effects. Similar mechanisms seem to be involved in the evolution and diversification of opsins that mediate color vision in vertebrates [49