The first truly large-scale random shotgun sequencing data from an environment were published only recently (Tyson et al. 2004), characterizing an underground biofilm growing under extremely acidic conditions (below pH 1) in the drainage path of an iron mine. Just a month later, a much more complex environmental sample from surface water of the Sargasso Sea was reported (Venter et al. 2004), containing an order of magnitude more data (see ). This latter dataset alone comprises more predicted open reading frames (ORFs) than were contained in all the completely sequenced genomes available at the time (although metagenomic ORFs are sometimes fragmented). Early in 2005, two more shotgun datasets were released from yet other, very different habitats: 116 Mbp from whalebone samples at more than 500 m water depth in two different oceans (hereafter whalefall) and 208 Mbp from surface soil on a Minnesota farm (Tringe et al. 2005; see for a summary). Several more datasets of up to 200 Mb are underway, as is a more data-rich and systematic sampling of ocean water.
Large-scale environmental sequencing projects: properties and scope.
Although the resulting sequences are hard data, the experimental sampling protocols can differ substantially, leading to considerable biases. For example, the size filters used in the Sargasso Sea are likely to select against small viruses as well as against larger eukaryotic cells. This simplifies the analysis of prokaryotic diversity, but has to be taken into account when re-analysing the data and comparing them to other samples. Furthermore, as the data come from different laboratories, the protocols for read quality filtering, assembly and gene prediction can vary considerably, making it difficult to compare basic properties between habitats, such as the number of annotated ORFs or the degree of assembly. This will also affect downstream analyses, such as determining the phylogenetic or functional composition.
Unfortunately (for details see ), it is not only the habitats, sampling procedures and data treatments that vary considerably, but also the nature of the data itself. In some environments, certain species dominate, as exemplified by the acid mine drainage sample, where five prokaryotes contribute more than 80% of all the sequences obtained (notably, one of them, Leptospirillum, was the first sequenced member of an entire phylum, Nitrospira, illustrating the bias in classical genome sequencing).
In contrast, the low assembly rate of the much more complex soil data (less than 1%) indicates that no single species is likely to be abundant in this sample. It has been estimated that at least 1 Gbp (Tringe et al. 2005) would have to be sequenced before the most abundant species could be reasonably covered by assembling the reads. Thus, while the amount sequenced might have been sufficient to capture the major trends and functional repertoires in the acid mine drainage data, the coverage of the soil might still not be fully representative despite consisting of more than 200 Mbp of raw sequence.
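The relationship between sequencing depth and the coverage of a single community member can be illustrated with a minimal sketch in Python. It assumes an idealized Lander-Waterman-style model in which reads are sampled uniformly at random, so that the mean coverage of a genome equals the total sequence multiplied by the relative abundance of the species and divided by its genome size. The 1% abundance and the 8-fold target coverage are hypothetical illustration values, not figures from the cited studies; the 6 Mbp genome size follows the soil estimate quoted in this review. The function names are likewise illustrative, and the sketch is not the method used by Tringe et al. (2005).

```python
import math


def expected_coverage(total_bp, abundance, genome_bp):
    """Mean per-base coverage of one genome in a mixed sample.

    Assumes reads are drawn uniformly at random, so a species receives a
    share of the sequenced bases equal to its relative abundance.
    """
    return total_bp * abundance / genome_bp


def fraction_covered(coverage):
    """Expected fraction of a genome hit by at least one read (Poisson model)."""
    return 1.0 - math.exp(-coverage)


def required_bp(target_coverage, abundance, genome_bp):
    """Total sequence needed to reach a target mean coverage for one species."""
    return target_coverage * genome_bp / abundance


if __name__ == "__main__":
    soil_bp = 208e6     # raw soil sequence reported in the text
    genome_bp = 6e6     # soil genome size estimate quoted in the text
    abundance = 0.01    # ASSUMPTION: most abundant species makes up 1% of the sample
    target = 8.0        # ASSUMPTION: coverage regarded as sufficient for assembly

    cov = expected_coverage(soil_bp, abundance, genome_bp)
    print(f"mean coverage: {cov:.2f}x")
    print(f"fraction of genome sampled: {fraction_covered(cov):.2f}")
    print(f"sequence needed for {target:.0f}x: {required_bp(target, abundance, genome_bp) / 1e9:.1f} Gbp")
```

Under these assumed values, the mean coverage of the most abundant genome stays well below 1-fold, which is consistent with the qualitative argument above that hundreds of Mbp are insufficient for assembly in a highly diverse sample.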
Another factor to consider is the diversity of species within an environment, which is presumably much higher in 0.5 g of soil than even in hundreds of litres of ocean water (e.g. Torsvik et al. 2002). This is also reflected in higher estimates of species numbers: more than 3000 in the soil sample versus 1800 in the Sargasso Sea samples (Venter et al. 2004; Tringe et al. 2005). In addition, the heterogeneity of a sample (0.5 g of soil harbours various differently populated subhabitats) and the number of individuals can only be estimated, yet both will affect the data. The different constraints imposed by the environments are also reflected in the genome sizes (estimates range from 2 Mbp in water to 6 Mbp in soil; Venter et al. 2004; Tringe et al. 2005). All this makes it difficult to extrapolate from individual ORFs to entire species in a sample and leaves considerable uncertainty in ORF-based estimates. Nevertheless, elucidating the phylogenetic composition of the communities in each sample remains one of the big scientific challenges in metagenomics. Is the current overrepresentation of proteobacteria among the completely sequenced genomes a result of their general abundance, or of a sampling bias? They certainly seem to dominate in the more complex samples of soil and surface water, but this might be a chicken-and-egg problem: we may simply identify them better than other phyla because we already know more about them.
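How strongly such extrapolations depend on the assumed genome size can be shown with a short Python sketch. It converts an amount of raw sequence into "genome equivalents" (total base pairs divided by an assumed mean genome size) for the 2 Mbp and 6 Mbp estimates quoted above. The function name and the use of a single mean genome size per sample are simplifying assumptions made for illustration only; real communities contain a mixture of genome sizes, and the sketch ignores read overlap and sampling bias.

```python
def genome_equivalents(total_bp, mean_genome_bp):
    """Number of average-sized genomes that the raw sequence would span once."""
    return total_bp / mean_genome_bp


if __name__ == "__main__":
    # Dataset sizes as reported in the text (raw sequence, in base pairs).
    datasets = {"whalefall": 116e6, "soil": 208e6}

    # Genome size estimates quoted in the text; applying both to every
    # dataset is an ASSUMPTION made here to show the sensitivity.
    genome_sizes = {"2 Mbp": 2e6, "6 Mbp": 6e6}

    for name, total_bp in datasets.items():
        for label, size in genome_sizes.items():
            n = genome_equivalents(total_bp, size)
            print(f"{name}: {n:.1f} genome equivalents at {label}")
```

The same amount of sequence corresponds to a threefold different number of genome equivalents depending on which genome size is assumed, which is one source of the uncertainty in ORF-based estimates noted above.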