Detection of marker genes
We used a set of 35 OGs that are widely conserved (present in most species), rarely occur as duplicate genes in known genomes, are not subject to horizontal gene transfer, and do not scale with genome size as marker genes (set largely overlapping with the one used by Ciccarelli and coworkers [
25]; Figure and Additional data file 7). Marker gene counts were carried out using an approach similar to the one described previously [
15], based on comparisons with known proteins. In brief, DNA sequence reads (or, alternatively, randomly generated genome fragments) were searched against the extended database of proteins assigned to OGs in the latest STRING release (6.3) [
30], using BLASTX [
44], and an OG was called present when a hit matching one of its proteins occurred (with a BLAST score of at least 60 bits). Note that this procedure is largely independent on varying annotation qualities across genomes, and avoids biases owing to lower gene prediction quality on short sequences [
17], such as the reads used in this study.
In order to avoid potential biases introduced by any uneven phylogenetic representation within the reference set of known proteins, all BLASTX matches exceeding an overall protein identity of 50% were discarded. This latter step is needed to avoid artefacts introduced by the occasional organism in the sample that happens to be closely related to a known organism in the BLAST database. For such an organism, even marker genes contained only partially on a read can be detectable by BLAST because of their high sequence identity to known genes. In contrast, most environmental genes have low sequence identity to known genes, and so fragmented marker genes often escape detection. Thus, without the above threshold, marker gene counts would be higher for well known organisms, biasing the results. (Note that the threshold does not select against well known organisms either; their genes will still generate hits with other organisms in the database, with identities below 50%, and will thus be counted like any other environmental marker gene.)
Query sequences were allowed to map to several OGs, provided these overlapped by no more than 50% of the shortest assignment. BLASTX was run using the BLOSUM62 matrix, and low-complexity filtering was enabled. Marker gene density x was then defined as the number of matches to reference genes, divided by the total number of megabases surveyed.
Calibrating marker gene density with genome size, and genome size prediction
To determine the relationship between the occurrence of marker genes and genome size, we used fully sequenced genomes for calibration. We simulated the widely used whole genome shotgun (WGS) sequencing process by randomly extracting 'reads' of variable read length from 154 previously completely sequenced bacterial and archaeal genomes (EBI Genome Reviews release 17). In total, 50 genomes were randomly chosen per read length bin (300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200 base pairs [bp]), and reads were sampled until 3× coverage was achieved (see Additional data file 8 for a list of genomes). We did not distinguish between plasmid and chromosomal DNA (each DNA fragment of a completely sequenced genome was equally likely to be considered).
We determined the occurrences of marker genes among these 'reads' as described above. On these counts, we based a three-dimensional calibration, relating known genome size to the parameters read length and marker gene density. Because the total number of marker genes per genome does not vary with genome size, we expect that genome size increases proportionally to the inverse marker gene density 1/x at any given read length L: EGS = c(L)/x, where c(L) is a read-length-dependent calibration factor. The exact form of c(L) is determined not only by the read-length dependence of the probability of sequencing a portion of a marker gene, but also by the probability of identifying a read as a marker gene. Because the latter depends on sequence features of individual marker genes, it is not easily possible to specify the analytical form of c(L) a priori. Based on manual comparison of a variety of possible functional forms, we found that c(L) is well approximated by a power law, c(L) = a + b × L-c. This is indeed the best (R2 = 0.97) 'simple' three-parameter formula relating genome size to marker gene density and read length, as confirmed using the TableCurve3D v.4.0 package; a 'simple' equation is defined here as a three parameter equation consisting of a constant and two coefficients that multiply a function of x or L. This resulted in the prediction formula:

We estimated the parameters of this formula with a nonlinear least-squares fit, as implemented with the
nls function in the R programming environment [
45]. First, we randomly selected half of the species in the simulated data. Their complete sets of simulation results (marker gene densities
x at specified read lengths
L), together with the known genome sizes
z, were used as calibration data (the remaining data were later used for detailed error estimation; see Additional data file 1 and Additional data file 2). Parameters
a,
b, and
c were chosen such as to minimize the weighted sum of squares ([
a +
b ×
L-c]/
x -
z)
2/
z. This led to the parameter estimates
a = 21.2,
b = 4230, and
c = 0.733.
Because Eqn 1 is linear in the inverse marker gene density x-1, it can be directly applied to mixtures of genomes, which was supported by our simulations. In the case of species mixtures, the estimated mean EGS is the number of megabases per genome present in the sample (effective genome sizes of different species are weighted by their genome count, not by their contribution to the number of sequenced base pairs). Because Eqn 1 is further approximately linear in the inverse read length L-1, it can also be applied to sequence datasets with mixed read lengths. (For a full discussion and simulation of EGS prediction in mixtures, see Additional data file 1 and Additional data file 3).
So far, calibration was based on simulated reads from fully sequenced genomes, ignoring potential biases introduced, for instance, by cloning. To test our predictions on actual sequencing data, we downloaded and analyzed shotgun data of completed genomes from the NCBI and Ensembl trace repositories [
46,
47], excluding those projects for which sequencing coverage was low (<1.5×; Additional data file 9). In order to ensure consistency we applied a uniform quality clipping method to all 32 datasets, rather than using the provided coordinates of each sequencing centre (phred quality cut-off score of 15, using a perl script kindly provided by Jarrod Chapman [JGI]; clipped reads with fewer then 100 nucleotides remaining were discarded). We found that Eqn 1 overestimates genome size on average by 15.9%. This reflects a strong bias against the marker genes, probably resulting from the fact that these 35 OGs were chosen for their strongly conserved single-copy distribution across genomes, and hence introduction of additional genes into the cloning vector is often lethal. Removal of this bias by an additional scaling factor (excluding the outliers
Wolbachia and
Dehalococcoides ethenogenes 195, as discussed in the main text) results in the new parameter values
a = 18.26 and
b = 3650 for Eqn 1.
Species mixture simulations
In order to generate species mixtures, we first randomly picked 60 out of the entire list of completely sequenced cellular genomes, and simulated WGS sequencing for all 60 genomes using read-length bins from 600 to 900, as described above. We then generated 1,000 simulated metagenomes, by repeating the following procedure 1,000 times.
First, a random number i of species were picked (with 1 <i ≤ 60). Second, for each of the i species, its contributing nucleotides ni was randomized, using the following condition regarding the total number of nucleotides in the metagenome:
Third, a readlength was randomly picked from the available 600, 700, 800, or 900 bp. Fourth, a 'large' metagenome was generated (total sequence about 40 Escherichia coli K12 sized genomes) by randomly extracting reads from each of the I contributing species, using the read length randomly picked in lead three. Fifth, the theoretical genome size T of the simulated metagenome is calculated from the actual contributions ci and from the genome sizes si of the given species, using the following equations:
Sixth, the effective genome size for the simulated metagenome is predicted from the randomly extracted reads, as described in the main text (Eqn1). Seventh (and finally), errors e are calculated using the following equation:
Results are given in Additional data file 3.
Mixed read lengths simulations
We further generated datasets with mixed read lengths. Specifically, we applied the following procedure 1,000 times.
First, a species S was randomly picked from the pool of completely sequenced genomes. Second, a random number j of read lengths (with 1 <j ≤ 4) were picked. Third, a 'large' metagenome was randomly generated (total sequence about 40 Escherichia coli K12 sized genomes), consisting only of species S (with genome size s). Fourth, genome size is predicted, as described in the main text (Eqn 1). Fifth, errors e are calculated using the following equation:
Results are given in Additional data file 3.
Estimation of effective genome size restricted to bacteria
For EGS estimation restricted to only the bacteria in the sample, we calculate a domain-specific marker gene density xbacteria, by dividing the number of hits to marker gene OGs (n) by the number of hits to any OG (ntotal), with the limitation that an OG mapping is only counted if the best BLAST hit of that read region to STRING is a bacterial protein. In this way, only reads of bacterial origin are considered. Because marker gene density is now estimated per read rather than per base pair, this measure requires a new calibration analogous to the one described above. Again, we found Eqn 1 to be the best fitting simple formula for simulated data (R2 = 0.93), with parameter estimates a = 0.0389, b = 0.81, and c = 0.78 from a weighted fit to a training dataset consisting of data for half of the included genomes.
Performance was tested on the remaining data as for the general measure (see Additional data file 1 and Additional data files 5 and 6). Comparison with real reads, as above, revealed an average bias of 5.2%, which lead to adjusted parameter values of a = 0.0370 and b = 0.770. After correction, we found an unsigned median error of 8.0% (SD 14.6%). An analogous procedure was also performed to estimate EGS restricted to archaeal genomes in the sample. However, currently only very few archaea are fully sequenced, and hence there was insufficient data for a full fit and error estimation. To allow at least an approximate analysis of the archaeal fraction in the acid mine drainage sequences (see below), we obtained a rough measure by scaling Eqn 1 with the bacterial parameter set to fit simulated data from the available fully sequenced archaea. This resulted in the parameter estimates a = 0.045, b = 2.91, and c = 0.78 for archaea (R2 = 0.87). Median unsigned error on the simulated data was 14.7% and the SD was 15.8%.
Before comparing bacterial EGS estimates across environments, we first confirmed that there were no significant differences among the three whalefall samples and among the six Sargasso Sea samples (excluding sample 1), by calculating a
z score and
P value (Additional data file 1; all pairwise raw
P > 0.05). We then pooled all whalefall samples, and separately all Sargasso Sea samples, to reduce the total number of comparisons to be made. Statistical significance of differences in EGS was then estimated by calculating
z score and
P value in all remaining pairwise comparisons, and correcting the resulting raw
P values for multiple comparisons [
48].
Environmental sequencing data
The same data were used as in the report by Foerstner and coworkers [
32], with the exception of the Sargasso Sea data, where now all samples were used. Reads were trimmed as described above. Additional data file 10 gives an overview of sequence data after trimming.
Scripting, statistical analyses, and parameter estimation was performed using the R environment for statistical computing [
45] and perl [
49].