Overview of genome assemblies
Metagenomic assembly and single-cell sequencing provide a cultivation-independent approach to generating genome sequences (
Woyke et al., 2009;
Rusch et al., 2010;
Hess et al., 2011), and the two techniques were used to produce four SAR86 assemblies. With GOS samples as a metagenomic dataset, a combination of aggressive assembly followed by binning based on sample distribution, similarity to SAR86 BACs, oligonucleotide frequencies and manual curation resulted in the assembly of two nearly complete genomes (SAR86A and B). The SAR86A consensus genome consists of two scaffolds in 41 contigs with a total length of 1.25 Mbp. The SAR86B consensus genome consists of 31 scaffolds and is substantially larger at 1.70 Mbp. The two genomes contain 1316 and 1712 open reading frames, respectively (). These consensus genomes are not equivalent to assemblies from a clonal isolate, thus single cell techniques (
Raghunathan et al., 2005;
Lasken, 2007) were used to acquire further data. Sequences from each amplified single cell should represent an independent assessment of the gene content, and furthermore should not suffer from the biases and limitations that might be introduced in a metagenomic assembly. Single cells were isolated from coastal waters (San Diego, CA, USA) using flow cytometry, followed by amplification of genomic DNA with MDA (
Dean et al., 2001,
2002). Two of the amplified genomes identified as SAR86 by 16S rRNA sequencing were further sequenced using Titanium 454 pyrosequencing. This resulted in two partial genomes (SAR86C and D) with total lengths of 750 and 925 kbp split among 142 and 194 contigs, respectively (). All four genomes have a low proportion of guanine-cytosine (%GC).
| Table 1. Genomic characteristics of SAR11 (Pelagibacteraceae) and SAR86 |
Several analyses show that the metagenomic assemblies represent naturally occurring SAR86 populations in terms of gene content and genome structure. Pairwise alignments reveal substantial similarity in genome content and organization (synteny) between the metagenomic assemblies and large (20 kbp+) SAR86 genome fragments acquired using single cell methods or molecular cloning (
Supplementary Figure S1). Although they offer little insight into genome structure, the smaller single cell contigs contain genes found on the metagenomic assemblies, establishing a consistent gene content. Recruitment of metagenomic reads to the assemblies is uniform over the length of the SAR86 genomes in terms of depth of coverage and the frequency of recruited reads among the different samples (). Greater than 90% of the high identity Sanger mate pairs were recruited in the proper orientation and distance (). Finally, scatter plots of the first three components of a principal component analysis of tetranucleotide usage form tight clusters that are consistent with a single genomic source (
Teeling et al., 2004;
Supplementary Figure S2).
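The tetranucleotide-usage comparison can be illustrated with a minimal sketch. It assumes each contig is reduced to a 256-element vector of overlapping 4-mer frequencies, which is the input such an ordination needs; the principal component step itself is omitted. The function names and the toy sequences are hypothetical and are not taken from the study.

```python
from collections import Counter
from itertools import product
import math

BASES = "ACGT"
# All 256 possible tetranucleotides in a fixed order
TETRAMERS = ["".join(p) for p in product(BASES, repeat=4)]

def tetranucleotide_profile(seq):
    """Return relative frequencies of overlapping 4-mers (256 values)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Windows containing ambiguous bases (e.g. N) are ignored
    total = sum(counts[t] for t in TETRAMERS)
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [counts[t] / total for t in TETRAMERS]

def profile_distance(a, b):
    """Euclidean distance between two frequency profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy sequences (hypothetical): two AT-rich fragments and one GC-rich fragment
at_rich_1 = "ATATTTAAATATTTAA" * 4
at_rich_2 = "TTTAAATATTTAAATA" * 4
gc_rich = "GCGGCCGCGGGCCGCC" * 4

p1 = tetranucleotide_profile(at_rich_1)
p2 = tetranucleotide_profile(at_rich_2)
p3 = tetranucleotide_profile(gc_rich)

print("AT-rich vs AT-rich:", round(profile_distance(p1, p2), 4))
print("AT-rich vs GC-rich:", round(profile_distance(p1, p3), 4))
```

Fragments from one genomic source are expected to give nearby profiles, which is what appears as a tight cluster after ordination. Real contigs would need to be several kbp long for the frequencies to be stable.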
To estimate actual genome size from our partial genomes, a catalog of 107 single copy genes found in nearly all free-living bacteria, including nearly all ribosomal proteins and tRNA synthetases, was compiled using the Comprehensive Microbial Resource (see methods). A total of 100 and 99 of these were found spread across the SAR86A and B genomes, respectively, suggesting they are greater than 90% complete. Seven of the missing proteins are the same: the signal recognition protein, signal recognition docking protein, initiation factor 3, ribosomal proteins L20 and L35, and both subunits of phenylalanyl tRNA synthetase. SAR86B appears to lack dimethyladenosine transferase, though a manual search detected a putative protein that just missed the HMM cutoff for ksgA in
Supplementary Table S1. The same seven proteins were not found in SAR86C and D, which are estimated to be 54% and 48% complete, respectively (). Each genome contains one complete rRNA locus. While tentative, it is possible that SAR86 has dispensed with each of these seven proteins; for example, although the signal recognition proteins were thought to be essential, this has recently been challenged (
Hasona et al., 2005).
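The completeness figures follow from simple arithmetic on the marker counts, sketched below. The extrapolation to a full genome size assumes that the single copy genes are spread evenly along the chromosome; that assumption, and the function names, are illustrative and not a description of the method used in the study.

```python
CATALOG_SIZE = 107  # single copy genes in the catalog

def completeness(markers_found, catalog_size=CATALOG_SIZE):
    """Fraction of the single copy gene catalog recovered in an assembly."""
    return markers_found / catalog_size

def estimated_genome_size(assembly_mbp, markers_found, catalog_size=CATALOG_SIZE):
    """Scale the assembly length by the completeness (assumes even marker spacing)."""
    return assembly_mbp / completeness(markers_found, catalog_size)

# Assembly length (Mbp) and number of catalog genes found, as reported in the text
assemblies = {"SAR86A": (1.25, 100), "SAR86B": (1.70, 99)}

for name, (size_mbp, found) in assemblies.items():
    frac = completeness(found)
    full = estimated_genome_size(size_mbp, found)
    print(f"{name}: {frac:.1%} complete, extrapolated size {full:.2f} Mbp")
```

If the seven shared missing proteins are truly absent from SAR86 rather than unassembled, the effective catalog is smaller and these completeness values are underestimates.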
In abundant marine bacteria, such as SAR11,
Prochlorococcus and
Synechococcus, high variability of gene content at several locations comprising ~10% of the genomes is expected to prevent complete assembly from metagenomic datasets (
Rusch et al., 2007,
2010). Based on an examination of mated reads recruited to ends of the SAR86A scaffolds, there is at least one hypervariable region in natural SAR86A genomes. The high variability results in low coverage, thus preventing both assembly of the hypervariable region and circularization of the two scaffolds. Unfortunately, the single cell contigs do not extend into or clarify this hypervariable region.
Phylogenetic and biogeographic characterization of the genome assemblies
The 16S rRNA sequences from SAR86 A–D group within SAR86 clusters I and IIa and are >98% similar to each other in nucleotide sequence (). The 16S rRNA sequence cannot resolve global phylogenies of γ-proteobacteria (22) or even SAR86 confidently (); thus, conserved proteins were used to construct maximum likelihood phylogenies (). A 12 protein phylogeny recapitulated the topology of a recent multi-protein phylogeny of γ-proteobacteria (
Williams et al., 2010) while loosely pairing the SAR86A and B assemblies with
Francisella tularensis as basal nodes (
Supplementary Figure S3). A reduced seven-protein phylogeny allowed inclusion of the SAR86C and D genomes and retains a consistent global γ-proteobacterial topology (). The large phylogenetic distance from SAR86 to any cultivated γ-proteobacteria suggests a relatively ancient divergence or an artifact caused by rapid rates of evolution. For example, both SAR86 and
F. tularensis exhibit metabolic streamlining (
Larsson et al., 2005), which is accompanied by rapid protein sequence evolution (
Dufresne et al., 2005). The loose association of these genomes in our phylogenies thus might be an artifact caused by long-branch attraction.
Fragment recruitment (
Rusch et al., 2007) is akin to
in silico whole-genome DNA hybridization with a known nucleotide identity. The presence of genomes highly similar to the SAR86A assembly within multiple metagenomic datasets is confirmed by the abundance of reads at greater than 95% nucleotide identity (). By visualizing only metagenomic reads that are a best match to SAR86A, a separate genome that could not be assembled can be observed at 80% nucleotide identity (SAR86A-like, ). Similar trends are observed for SAR86B (). SAR86A and B are phylogenetically distinct, at least at the protein level, but we still have not captured the full diversity of even these individual SAR86 lineages.
The most abundant genomes in a GOS dataset of 10 million Sanger reads were determined using fragment recruitment at 90% and 50% nucleotide identity (). The SAR86 assemblies recruit more of the GOS dataset than any other non-photosynthetic bacteria except organisms in the SAR11 clade (
Pelagibacteraceae) (), including the flavobacteria genomes sequenced by
Woyke et al. (2009). At 50% nucleotide identity, the SAR86 and
Pelagibacter genomes recruit ~50 × more metagenomic data than at 90% nucleotide identity. This is in sharp contrast to
Prochlorococcus and
Synechococcus genomes where a relaxation of the identity cutoff only increased recruitment 10–50% (). Our interpretation is that there are numerous and abundant subclades of SAR86 and
Pelagibacteraceae for which genomic representation is lacking.
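The effect of relaxing the identity cutoff can be sketched as a best-hit count at two thresholds. The alignment records below are toy values and the function names are hypothetical; a real analysis would parse the output of an aligner and would also apply coverage and mate-pair criteria that are not modeled here.

```python
from collections import defaultdict

def best_hits(alignments):
    """Keep, for each read, the reference genome with the highest identity."""
    best = {}
    for read_id, genome, identity in alignments:
        if read_id not in best or identity > best[read_id][1]:
            best[read_id] = (genome, identity)
    return best

def recruitment_counts(alignments, min_identity):
    """Count reads recruited to each genome at or above an identity cutoff."""
    counts = defaultdict(int)
    for genome, identity in best_hits(alignments).values():
        if identity >= min_identity:
            counts[genome] += 1
    return dict(counts)

# Toy alignments (hypothetical): (read id, reference genome, % nucleotide identity)
alignments = [
    ("r1", "SAR86A", 97.0), ("r1", "SAR86B", 82.0),
    ("r2", "SAR86A", 80.5),
    ("r3", "SAR86A", 62.0),
    ("r4", "SAR86B", 96.0),
    ("r5", "SAR86B", 55.0),
    ("r6", "SAR86A", 45.0),
]

strict = recruitment_counts(alignments, 90.0)
relaxed = recruitment_counts(alignments, 50.0)
for genome in sorted(relaxed):
    fold = relaxed[genome] / strict[genome]
    print(f"{genome}: {strict[genome]} reads at 90%, {relaxed[genome]} at 50% ({fold:.1f}x)")
```

A large fold increase between the two cutoffs indicates many related but divergent genomes in the sample, which is the pattern described for SAR86 and Pelagibacteraceae.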
| Table 2. The most abundant genomes in the GOS data set |
Across the GOS dataset, the four genomes exhibit a biogeography that may be related to physiology (,
Supplementary Table S2). For example, the two most related genomes, SAR86 C and D, are found at colder coastal sites, consistent with their isolation in coastal California, whereas SAR86A is found at all open ocean locations (). SAR86B appears to have a very specific geographic distribution as it only recruits metagenomes collected from a small subset of warm coastal sites, specifically Zanzibar and the Gulf of Panama.
Metabolic streamlining in SAR86
Metabolic analyses of genomic information can provide information on the physiological capabilities of an organism, though conclusions are speculative even when a strain is cultivated and a completely finished genome is available. However, the hypotheses presented should be useful for future physiological experiments conducted with a cultivated strain or natural microbial communities. For the sake of a comprehensive analysis, we will note when genes or pathways were not found, though only when the trends are consistent across all four genomes or genome alignments provide ancillary evidence of absence. The combined datasets catalog a portion of the SAR86 core genome and some of the accessory proteins associated with specific lineages. Because the two lineages cohabit the planktonic fraction of the surface ocean (), many of the putative features of SAR86 are directly compared with those of Pelagibacteraceae ().
A loss of biosynthetic pathways was observed in the
Pelagibacteraceae Candidatus Pelagibacter ubique, presumably resulting from metabolic streamlining as a way to reduce nutrient requirements in the oligotrophic ocean (
Giovannoni et al., 2005). Consistent with the modest size and low %GC of SAR86 genomes, all the genes in several vitamin biosynthesis pathways, including B12, B6, biotin, pantothenate, thiamine and retinol, are absent from SAR86A, C and D. These pathways are also missing from SAR86B, with the exception of thiamine biosynthesis (). Failure to find any of the genes for the vitamin synthesis pathways in four different genomes (excepting thiamine) provides strong, but not conclusive, evidence for auxotrophy of these vitamins in SAR86. SAR86B contains a putative B12 transporter, consistent with auxotrophy, though transporters for the other vitamins could not be identified. It is also possible that alternative biosynthetic routes are used (
Webb et al., 2007). The potential for vitamin auxotrophy should be considered in future cultivation efforts.
SAR86A lacks the proteins required for methionine (Met), histidine (His), and arginine (Arg) synthesis. SAR86B contains the genes for the synthesis of His and Arg within contigs that are otherwise syntenic to the SAR86A genome (
Supplementary Figures S4a and b), suggesting the absence in SAR86A is not an artifact. It is possible that these proteins are found in different genomic locations of SAR86A that we did not recover. SAR86D also contains the His synthesis operon. The SAR86B and D amino-acid synthesis genes were used to estimate the presence of these pathways in natural populations. The proteins for the Met, His and Arg synthesis pathways are always less abundant than a set of 107 core genes across 73 metagenomes after normalization (
Supplementary Figure S4c). If most natural SAR86 populations retained the synthesis pathways, normalized recruitment of the amino-acid synthesis genes should be roughly equivalent to that of the core genes, which was not observed. Instead, it appears that substantial portions of natural SAR86 populations are auxotrophs for Met, His and Arg.
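The comparison of pathway genes with core genes can be sketched as a ratio of length-normalized read densities within one metagenome. The text does not specify the normalization, so reads per kbp of gene length is used here as an assumption; the function names and the read counts are hypothetical.

```python
def reads_per_kbp(read_count, gene_length_bp):
    """Recruited reads normalized to gene length."""
    return read_count / (gene_length_bp / 1000.0)

def mean_density(genes):
    """Mean read density over a list of (recruited_reads, gene_length_bp)."""
    return sum(reads_per_kbp(r, length) for r, length in genes) / len(genes)

def pathway_to_core_ratio(pathway_genes, core_genes):
    """A ratio near 1 suggests the pathway is as common as the core genome."""
    return mean_density(pathway_genes) / mean_density(core_genes)

# Toy counts for one metagenome (hypothetical values)
core_genes = [(40, 1000), (60, 1500), (20, 500)]  # stand-ins for the 107 core genes
his_genes = [(12, 1200), (9, 900)]                # stand-ins for His synthesis genes

ratio = pathway_to_core_ratio(his_genes, core_genes)
print(f"His pathway / core recruitment ratio: {ratio:.2f}")
```

Under this sketch, a ratio well below 1 across many metagenomes would be read as a pathway carried by only part of the natural population.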
Each genome contains the demethylase
dmdA that produces methyl-mercaptopropionate from the algal osmolyte dimethyl-sulfoniopropionate (DMSP), providing reduced sulfur (
Howard et al., 2006). Each genome also contains putative transporters for glycine-betaine that may facilitate DMSP uptake. In contrast to SAR11, all SAR86 genomes contain genes for putative transporters of glutathione (γ-glutamate-cysteine-glycine) and γ-glutamyl transferases, which are required to break the otherwise recalcitrant γ-bond in glutathione, expanding the diversity of organic sulfur available to SAR86 relative to SAR11 (). The concentrations of dissolved glutathione and γ-Glu-Cys in the oligotrophic north Pacific reach 10–15 nM, with turnover times on the order of days to weeks (
Dupont et al., 2006). These known metal-binding compounds likely influence Cu, Fe and Hg speciation; thus, consumption of this pool by SAR86 adds an additional dimension to models of trace metal biogeochemistry. All of the various SAR86 assemblies lack the enzymes required for sulfate uptake or assimilatory reduction, as was also observed in SAR11 (
Tripp et al., 2008).
SAR86B has a much larger genome than SAR86A (1.70 versus 1.25 Mbp). Many of the SAR86B-specific genes could not be functionally annotated, though there are expansions in the numbers of TonB receptors, ABC transporters and beta-lactamases (). Several glycosyl hydrolases found in the SAR86A genome are also duplicated in the SAR86B genome.
Proteorhodopsin phototrophy
Proteorhodopsin was originally found on a BAC of SAR86 origin and, when heterologously expressed in
E. coli and provided with exogenous retinal, functioned as a light-driven proton pump (
Beja et al., 2000). In some marine bacteria, proteorhodopsin can facilitate energy generation under nutrient-limiting conditions (
Steindler et al., 2011). SAR86A and C each contain one putative green-light tuned proteorhodopsin, whereas SAR86B and D each contain two. Autotrophic carbon-fixation pathways are lacking in the SAR86 genomes, so the proteorhodopsin-generated pH gradient across the cytoplasmic membrane may be used for phosphorylation or transport. Proteorhodopsin requires retinal for functionality, and in many proteorhodopsin-containing marine BACs and genomes, genes for pigment synthesis are colocalized with proteorhodopsin and exhibit parallel phylogenetic differentiation (
McCarren and DeLong, 2007). Thus, it is surprising that the five proteins for the retinal biosynthesis pathway are lacking from all four SAR86 genome assemblies and the numerous SAR86 BACs and fosmids. Like some SAR86 BACs (
Sabehi et al., 2004,
2005), the proteorhodopsins in SAR86 A-D are flanked by a short-chain dehydrogenase gene that might be used to convert retinol or β-carotene to retinal. This could only catalyze the conversion of an already synthesized hydrophobic pigment, with the initial five steps of carotenoid synthesis missing. Either retinal biosynthesis pathways are part of hypervariable genomic regions, making them rare, or SAR86 must scavenge retinal or a structurally related pigment. Retinal uptake pathways are unknown, as are the concentrations in seawater.
Carbon metabolism in SAR86
All four genomes contain the core components for aerobic respiration and lack the proteins required for carbon fixation via the reverse tricarboxylic acid (TCA) cycle, the reductive acetyl-CoA pathway and the 3-hydroxypropionate cycle (). The genomes do not contain nitrate reductase, nitrite reductase, sulfite reductase or the cytochromes typically involved in anaerobic metabolism like c3 and b1. Thus, SAR86 appears to be an aerobic heterotroph with the potential for phototrophic ATP production via proteorhodopsin. SAR86A, B and D contain a full complement of the TCA cycle proteins except for citrate synthase. Instead, genes coding for 2-methylcitrate synthase, methylcitrate lyase and methylcitrate dehydrogenase are present. 2-Methylcitrate synthase can catalyze citrate synthesis from acetyl-CoA and methylcitrate synthesis from propionyl-CoA, which are derived from even and odd length fatty acids, respectively. Thus, SAR86 likely uses a dual TCA/methylCA cycle (). An elegant parallel is the overabundance of lipases and the enzymes required to catalyze the beta-oxidation of fatty acids (), providing NADH, acetyl-CoA and propionyl-CoA.
The SAR86A, B and D genomes contain a complete Embden–Meyerhof–Parnas glycolysis pathway. SAR86B contains a complete pentose phosphate pathway, but SAR86A lacks the oxidative arm of the pathway (). As with the amino-acid synthesis pathways, this does not appear to be an artifact, as the corresponding proteins in SAR86B occur in a genomic locale syntenic to that of SAR86A (). In addition to the pentose phosphate genes, this genomic region codes for critical steps in glycolysis and glucose uptake. Across 73 oceanic metagenomes, the shared genes involved in glycolysis and glucose uptake are always more abundant than those coding for the oxidative arm of the pentose phosphate pathway, implying that substantial portions of natural SAR86 populations lack the oxidative pentose phosphate pathway (). This would result in one fewer metabolic source of NADPH production. An analogous scenario is observed in the marine cyanobacteria, where gain and loss of proteins within one genomic location results in dramatically different nitrogen assimilation capabilities among different lineages (
Scanlan et al., 2009). This genomic region contains an abundant non-coding RNA identified in marine metatranscriptomic libraries that previously lacked genomic context (
Shi et al., 2009) (,
Supplementary Figure S5). This suggests that carbon assimilation in SAR86 may be controlled by rapidly responsive RNA-based regulation. The expanded sugar utilization metabolism in SAR86 contrasts sharply with SAR11, where some strains lack glycolysis altogether and others contain a modified Entner–Doudoroff pathway (
Schwalbach et al., 2010).
In addition to sugars and lipids, all SAR86 may be able to depolymerize polysaccharides with glycoside and glycosyl hydrolases and degrade peptidoglycan into D-amino acids and D-sugars using a set of murein lytic hydrolases, D,D-carboxypeptidases and D,L amidases (). The conversion to L-amino acids could be catalyzed by the two D-amino-acid racemases. Relative to SAR86A, SAR86B also has an additional D-sugar racemase, a β-agarase and several extra glycosyl/glycoside hydrolases, consistent with its genomic expansion in sugar utilization. SAR86 appears to have an incomplete gluconeogenesis pathway; all genomes lack fructose-1,6-bisphosphatase but contain 6-phosphofructokinase, which would prevent de-phosphorylation of 1,6-P-fructose into 6-P-fructose but allow the reverse reaction.
A focus on pmf-dependent transport across the outer membrane
Whereas the SAR11 genomes contain numerous substrate-binding protein ABC-type transporters for nutrient uptake across the cytoplasmic membrane, SAR86 contains only two (), specifically for oligo-peptides, which is consistent with amino acid auxotrophy, and ferric iron. All of the SAR86 genomes contain multiple major facilitator superfamily type transporters for simple metabolites, ammonium and phosphate. When considered with respect to genome size, the SAR86 genomes contain a highly disproportionate number of putative tonB-dependent outer membrane receptors (TBDR) relative to other bacteria (,
Supplementary Figure S6). TBDRs are outer membrane receptors that catalyze high affinity transport of compounds larger than 600 Da across the outer membrane, including Fe-, Cu- and Ni-chelates, vitamin B12, N-acetyl-glucosamine, and carbohydrates (
Blanvillain et al., 2008;
Schauer et al., 2008).
TBDR-dependent transport requires a pmf across the cytoplasmic membrane. As noted by
Morris et al. (2010), proteorhodopsin may provide a pmf for TBDR uptake. SAR86 has other mechanisms for generating a pmf, including respiration or NADH dehydrogenation (). All four genomes contain multiple V-type pyrophosphatases, which generate a pmf through the breakdown of cytoplasmic pyrophosphates (), and are important during shifts between starvation-to-nutrient-replete and dark-to-light conditions in the purple non-sulfur bacteria (
Garcia-Contreras et al., 2004). Respiration is likely the dominant pathway for generating a pmf, with proteorhodopsin, NADH dehydrogenation and pyrophosphate breakdown providing ancillary support.
Although the vast majority of TBDRs are uncharacterized and phylogenetic analyses are uninformative about substrate specificity due to poor bootstrap support (
Schauer et al., 2008), genome neighborhood analysis can provide insight. Many of the SAR86 TBDR genomic regions contain lipid-degrading enzymes, a phenomenon not previously observed in any organism, much less marine bacteria (
Supplementary Figure S7). For example, one genomic locus includes a patatin (a phospholipase), a protein in the rhodanese/beta-lactamase family, and a flavin-dependent oxidoreductase. While the patatin might break a bond found in phospholipids, the other two enzymes do not. Instead, this enzymatic combination is potentially capable of breaking key bonds in sulfoquinovosyldiacylglycerol, a sulfolipid used by marine cyanobacteria (
Van Mooy et al., 2009). After cytoplasmic uptake, the sulfoquinovose polar group can be degraded by the TCA cycle, recovering more ATP and NADH (
Roy et al., 2003). A combination of esterase family proteins and a choline/carnitine/betaine transporter could potentially catalyze fatty acid removal and subsequent cytoplasmic uptake of polar head groups from betaine polar lipids. In the hyper-oligotrophic South Pacific, combined sulfoquinovosyldiacylglycerol and betaine polar lipid concentrations are typically >1.5 nM, each molecule containing >30 acyl bonds, making them highly energy rich (
Van Mooy and Fredricks, 2010). Notably, the tonB loci in each genome contain different combinations of catabolic enzymes; diversification of carbon compound transport may drive the high phylogenetic diversity of SAR86.
Antibiotics gain access to the periplasm of gram-negative bacteria via non-specific TBDR uptake (
Schauer et al., 2008), thus the presence of a disproportionate number of β-lactamases and a macrolide efflux system is not surprising (). All of the genomes also contain at least one cytochrome P450 that may be used for xenobiotic degradation. In contrast to all other abundant cultivated marine microbes, each genome encodes putative nitroreductases and nitropropane dioxygenases (), proteins generally associated with bacteria found in soils contaminated with industrial chemicals like trinitrotoluene (TNT). Nitropropane dioxygenases degrade nitro-aromatic compounds, producing nitrate and nitrite in the process (
Nishino et al., 2010). Nitroreductases degrade the same compounds but generate nitrite and ammonium. There is no evidence that SAR86 can subsequently assimilate the nitrate or nitrite. The presence of these enzymes within four genomes of a highly abundant bacterial clade suggests that nitro-aromatics are a biologically relevant and heretofore overlooked portion of the dissolved organic matter in the surface ocean. Potentially, the addition of nitroaromatics would also select for SAR86 over the more abundant SAR11 during cultivation efforts.
Carbon source specialization in metabolically streamlined planktonic bacteria
The ultimate source of dissolved organic carbon in the open ocean is planktonic biomass, which, in terms of carbon, is 65±9% protein, 19±4% carbohydrate and 16±6% lipid (
Hedges et al., 2002). This material is released upon death, either by viral lysis, apoptosis or predation, creating dissolved and particulate organic carbon pools. Several abundant marine genera with TBDR-rich genomes have been implicated in the colonization and degradation of marine particles (
Bauer et al., 2006;
Thomas et al., 2008), fueling further production of dissolved organic carbon (
Azam and Long, 2001). However, the SAR86 genomes lack the genes required for pili or flagellin formation, chemotaxis and motility, EPS production and other pathways known to mediate particle adhesion. This, along with its prevalence in metagenomes from the 0.1–0.8 μm size fraction, implies that SAR86 is predominantly free living (planktonic).
The two most abundant identifiable genomes of heterotrophic planktonic bacteria within the GOS dataset are SAR11 and SAR86 (). Despite a cosmopolitan distribution, neither of these organisms is a generalist; each exhibits genome and metabolic streamlining that precludes using all carbon compounds. Consistent with genomic predictions (
Giovannoni et al., 2005;
Schwalbach et al., 2010), cultivated strains of SAR11 can grow on dicarboxylic acids and simple peptides, a pool of organic carbon derived from upwards of 85% of cellular carbon biomass. SAR86 appears to specialize in the transport and degradation of lipids and polysaccharides. By focusing on different compounds, SAR11 and SAR86 might compete only sparingly for dissolved organic carbon. This also provides a tangible link between the crude biochemical composition of the dominant phytoplankton and the associated bacterial community, specifically that the relative abundance of SAR11 and SAR86 is controlled by the stoichiometry of protein, carbohydrate and lipid in plankton.