Soil is one of the most diverse environments on Earth, and the depth of its microbial diversity is still poorly understood. High-throughput sequencing technologies, coupled with appropriate DNA extraction methods, provide a means to explore the soil ecosystem in unprecedented detail (Vogel et al., 2009). In this study, pyrosequencing of 13 samples generated nearly 5 × 10⁹ bp of sequence data with an average read length of 386 bp. Three key parameters were varied: soil depth, sample collection season and DNA extraction method. Sequences were annotated with the MG-RAST online server, revealing broad functional (835 of 878 possible functional subsystems) and taxonomic (1214 putative taxa detected) diversity in the Rothamsted Park Grass soil metagenome.
The most abundant functional subsystems in the Rothamsted soil seemed to be related to microbial cAMP signaling and Ton and Tol transport (Supplementary Table S2). The same subsystems were prevalent in soil metagenomes from the Waseca farm, Puerto Rico and Italy, which suggests that these trends in soil functional content are robust enough to be observed on a global scale. cAMP is an important second messenger in Eukarya and Bacteria. It is a universal regulator of cell energy and metabolism and is also involved in cell–cell signaling. Soil bacteria might have to deal with frequently fluctuating substrate levels, so the abundance of cAMP signaling may reflect a need for extra regulation rather than interaction with plants. Interestingly, because cAMP signaling can also be subverted, some bacterial pathogens might manipulate plant cAMP production for their own benefit by injecting adenylate cyclase and/or various toxins that alter adenylate cyclase levels (adenylate cyclase is essential for the production of cAMP) (Akhter et al., 2008; Agarwal and Bishai, 2009). Iron is an essential element for most organisms (Weinberg, 1984), but can be a limiting nutrient for life (often in oceans; Boyd et al., 2007) owing to its insolubility in aerobic environments at neutral pH. In response to this stress, some bacteria possess high-affinity transport systems (Crosa et al., 1997) and produce high-affinity siderophores that complex extracellular iron to optimize its acquisition. The presence of Ton-related proteins in the soil is likely due to TonB, an energy-dependent cell envelope protein that assists iron uptake by moving ferric siderophores, which are too large to pass through porins, across the outer membrane (Klebba et al., 2003).
MG-RAST annotation also revealed the presence of several highly abundant CBSS. These are groups of functionally coupled genes (genes found proximal to each other in the genomes of diverse taxa) whose functional attributes are not well understood. The relatively high abundance of these subsystems across all Park Grass samples, as well as in the other sequenced soils, suggests that they have key roles in soil ecosystems across the globe and should be explored in future research on the composition of soil ecosystems. CBSS-258594.1.peg.3339, CBSS-269799.3.peg.2220, CBSS-83332.1.peg.3803 and CBSS-249196.1.peg.364 (Supplementary Table S2) are thought to be galactoglycan biosynthesis, molybdenum oxidoreductase, PKS-related and fatty acid metabolism subsystems, respectively.
Comparing the two runs from the same DNA sample (F2a/F2b) provided important information about the reproducibility of pyrosequencing in highly biodiverse environments. Fisher's exact test, as run in the STAMP software, identified some functions (about 7%) and taxa that varied significantly (at the 95% confidence level) between replicates. The lowest P-value was on the order of 10⁻⁷ when comparing F2a and F2b at the functional level, so some comparisons between seasons and depths were possible. On the basis of these observations, functional comparisons with a P-value of at most 10⁻⁸ (a cutoff based on the observed technical reproducibility) were considered to have distributions that varied significantly. Unfortunately, technical reproducibility is not the only limit on robust metagenomic comparisons. Even with a stringent P-value cutoff, the DNA extraction approach influenced the experimental conclusions. When the seasonal effect was compared using two different extraction approaches (direct: F1/J1; indirect: F4/J4), some differences in the relative predominance of subsystems were found. In the comparison of F1 and J1, sequences related to the type 4 secretion and conjugative transfer subsystem and to the cellulosome subsystem were more represented in February (P-value of 10⁻⁸ in both cases). In the comparison of F4 and J4, the cellulosome subsystem was still detected more in February (P) but type 4 secretion and conjugative transfer was not. In contrast, sequences related to bacterial cAMP signaling were more abundant in July (P-value of 10⁻¹²), but only in the comparison of F4 and J4. Thus, only cellulosome-related sequences dominated one season's metagenome independently of the extraction method applied. The major environmental difference between the two seasons studied was temperature (from 6 °C in February to 16.6 °C in July). In addition, snow lay on the ground for weeks in February of the same year, limiting active grass growth. As a consequence, soluble root exudates were possibly in short supply during this relatively cold period, and cellulose from root residues would have been the main source of carbon and energy supporting soil microbial communities.
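The replicate-based cutoff described above can be illustrated with a two-sided Fisher's exact test on a 2 × 2 table of read counts (reads assigned to a subsystem versus all other reads, in each of two metagenomes). The following is a minimal pure-Python sketch and not the STAMP implementation; the function names, the dictionary layout of the counts and the example numbers are hypothetical, and only the 10⁻⁸ cutoff is taken from the text.

```python
from math import exp, lgamma


def _log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a, c: reads assigned to the subsystem in metagenomes 1 and 2.
    b, d: all remaining reads in metagenomes 1 and 2.
    The P-value sums the hypergeometric probabilities of every table
    (with the same margins) that is no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    log_denominator = _log_comb(row1 + row2, col1)

    def log_prob(x):
        return _log_comb(row1, x) + _log_comb(row2, col1 - x) - log_denominator

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    log_observed = log_prob(a)
    tolerance = 1e-7  # guards against floating-point ties
    total = 0.0
    for x in range(lowest, highest + 1):
        lp = log_prob(x)
        if lp <= log_observed + tolerance:
            total += exp(lp)
    return min(1.0, total)


def significant_features(counts1, counts2, total1, total2, cutoff=1e-8):
    """Return the features whose P-value is at most `cutoff`.

    counts1/counts2 map a feature name to its read count (hypothetical
    layout); total1/total2 are the numbers of annotated reads per sample.
    """
    selected = []
    for feature in sorted(set(counts1) | set(counts2)):
        a = counts1.get(feature, 0)
        c = counts2.get(feature, 0)
        p = fisher_exact_two_sided(a, total1 - a, c, total2 - c)
        if p <= cutoff:
            selected.append(feature)
    return selected
```

Working in log space keeps the binomial coefficients finite for read totals in the hundreds of thousands, and the cutoff is left as a parameter so that it can be set from whatever P-value the technical replicates produce.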
On the other hand, depth had a greater effect: sequences related to genes involved in bacterial chemotaxis, Ton and Tol transport systems, flagellum mechanism, and D-ribose and L-arabinose utilization were more represented in the surface sample (0–10 cm), whereas sequences related to selenocysteine metabolism and tRNA aminoacylation were more represented at depth (11–20 cm). However, these results were generated using only one DNA extraction method. In comparison with the depth and season variables, the extraction method was able to influence functional distributions, especially when the methods differed strikingly in cell lysis (for example, Gram positive kit versus in-agarose-plug lysis or DNA tissue). Thus, the stringency of lysis appears to be a crucial parameter for soil metagenomic analysis, confirming previous results obtained with RISA and phylogenetic microarray analyses (Delmont et al., 2011b).
In addition, the distribution of sequences by G+C content varied clearly among the runs. Direct versus indirect lysis had more impact on the G+C% profile than any other variable. Indirect lysis yielded more sequences with a high G+C content (from 60% to 72%), whereas direct lysis gave a more even distribution, with more sequences in the 50–58% G+C range (Supplementary Figure S1). The fluctuations observed in both the metagenomic s.d. and the G+C% profile are limited by the experiments and variables used. However, this effort provides both a significant set of soil metagenomic sequences and data useful for appreciating how methodological differences affect access to microbial community diversity.
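A G+C% profile of the kind compared above can be computed by binning reads according to their G+C content. The sketch below is only illustrative: the function names, the 2% default bin width and the handling of ambiguous bases are assumptions and are not taken from the study.

```python
from collections import Counter


def gc_percent(read):
    """G+C content of one read as a percentage of its A/C/G/T bases.

    Ambiguous bases (for example N) are ignored; returns None when the
    read contains no unambiguous base.
    """
    read = read.upper()
    gc = read.count("G") + read.count("C")
    at = read.count("A") + read.count("T")
    if gc + at == 0:
        return None
    return 100.0 * gc / (gc + at)


def gc_profile(reads, bin_width=2):
    """Fraction of reads falling in each G+C% bin.

    Keys are the lower bound of each bin; a read at exactly 100% G+C is
    placed in the last bin. `bin_width` is an assumed parameter.
    """
    bins = Counter()
    counted = 0
    for read in reads:
        value = gc_percent(read)
        if value is None:
            continue
        start = min(int(value // bin_width) * bin_width, 100 - bin_width)
        bins[start] += 1
        counted += 1
    if counted == 0:
        return {}
    return {start: bins[start] / counted for start in sorted(bins)}
```

Because the profile is expressed as fractions of reads rather than raw counts, runs of different sizes (for example, a direct-lysis run and an indirect-lysis run) can be overlaid directly.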
Given the relatively low variation in functional subsystems between different soils, soil microbial community metagenomes from Rothamsted, Puerto Rico, Italy and the Waseca farm soil (Tringe et al., 2005) could be compared with metagenomes from oceans and human feces. This comparison might help identify some of the functional attributes unique to the soil ecosystem. To make the comparison, a principal component analysis was performed on the distribution of general functional subsystem classes in the publicly available metagenomes from these ecosystems. Some general functional classes appear to be relatively more represented in one ecosystem than in the others. Sequences related to RNA and protein metabolism, photosynthesis, fatty acids and lipids, and macromolecular synthesis are more highly represented in ocean metagenomes. In contrast, phosphorus metabolism and virulence are less represented in ocean metagenomes than in those sequenced from soil and human microbiomes. In soil metagenomes, sulfur and potassium metabolism, membrane transport, stress response and regulation, and cell signaling are more represented, whereas nucleosides and nucleotides, and RNA and protein metabolism are less represented. In human microbiomes, cell division and cell cycle, DNA and phosphorus metabolism, cell wall and capsule, dormancy and sporulation, and carbohydrates are more represented than in those of oceans and soils. When comparing the taxonomic structure of these metagenomes, Cyanobacteria and Bacteroidetes appear to be more represented in the oceans. In addition, eukaryotic sequences were also detected and represent additional specificities of these metagenomes (Supplementary Figure S2). Actinobacteria, Chloroflexi, the Fibrobacteres/Acidobacteria group, Planctomycetes and Synergistetes are more numerous in soils. Chlorobi, Firmicutes, Spirochaetes, Fusobacteria and the Bacteroidetes/Chlorobi group are clearly relatively dominant in human digestive tracts. In contrast, more Proteobacteria are present in oceans and soils. The metagenomes clearly group by environment on the basis of both general functional and taxonomic distributions. So, in spite of important DNA extraction biases and differences in sequencing technology (Illumina, pyrosequencing and Sanger), global metagenomic comparisons are possible and provide unique information about the functional and taxonomic differences of each environment (Delmont et al., 2011a). As an example, sequences related to the metabolism of aromatic compounds are more abundant in soils, possibly because of the presence of these compounds in this environment. However, additional analyses, such as qPCR and metatranscriptomics, need to be performed to confirm which taxa and functions are unusually active in soil and so gain a better understanding of soil microbial community function.
Figure 6. The principal component analysis of three ecosystems using the relative distribution of reads in the different metabolic subsystems for the metagenomic sequences available in the public database in addition to those produced here. The large metabolic ...
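The principal component analysis described above operates on relative read distributions, so each metagenome is first converted from counts to proportions and the functional classes are then centered. The sketch below extracts only the first principal component, using power iteration in pure Python. This is a simplification chosen for brevity and is not the software used in the study (which is not named here); the function names and the toy counts in the usage note are hypothetical.

```python
from math import sqrt


def relative_abundance(counts):
    """Convert raw read counts per functional class to proportions."""
    total = sum(counts)
    return [c / total for c in counts]


def first_principal_component(rows, iterations=100):
    """Return (loadings, scores) of the first principal component.

    rows: one list of relative abundances per metagenome, all of the
    same length. Columns are mean-centered, then power iteration is
    applied to X^T X without forming the matrix explicitly.
    """
    n, m = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in rows]

    def project(vector):
        return [sum(row[j] * vector[j] for j in range(m)) for row in centered]

    # Start from the centered row with the largest norm. A constant
    # vector would fail here: proportions sum to 1, so every centered
    # row is orthogonal to it.
    v = list(max(centered, key=lambda row: sum(x * x for x in row)))
    norm = sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("all metagenomes have identical profiles")
    v = [x / norm for x in v]

    for _ in range(iterations):
        scores = project(v)
        w = [sum(scores[i] * centered[i][j] for i in range(n)) for j in range(m)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v, project(v)
```

For instance, with toy counts such as `[10, 80, 10]` and `[12, 78, 10]` for two metagenomes of one environment and `[70, 20, 10]` and `[72, 18, 10]` for two of another, the first-component scores of the two environments take opposite signs, which is the kind of grouping by environment reported in the text.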
The relative percentage of orphan reads decreased continually as pyrosequences accumulated. Therefore, an estimate of the number of reads needed to eliminate orphan reads would possibly provide the absolute minimum number of reads needed to sequence the entire soil metagenome. Rarefaction analysis of this sequencing effort indicated that the equivalent of about 450 Titanium runs would be required to create contigs from all of the soil pyrosequence reads generated. Of course, chimeras might be generated owing to the complexity of the communities, and a much larger effort would be needed to assemble the soil metagenome, but as new efficient high-throughput sequencing technologies and better assembly tools are developed, this goal will become less utopian. Genomes from Proteobacteria might be assembled more rapidly than those from the Firmicutes or Verrucomicrobia phyla. The presence of regions that limit assembly (for example, insertion sequence regions) and the complexity of diversity among taxa might explain in part the differences in efficiency observed between these phyla (4.5× and 30×), but additional experiments are needed to understand the two trends observed.
In this study, >12 million reads were generated from the soil of the Rothamsted Research Park Grass experiment. These sequences were generated in 13 separate sequencing runs producing over 4 × 10⁹ bp. The results revealed both DNA extraction biases and relatively low seasonal (February versus July) and vertical fluctuations in soil metagenomic functional classes. In addition, this approach provided a statistical view of functional distributions in this soil. This study increased our knowledge of soil microbial communities at the metagenomic level by integrating both natural and methodological fluctuations. The metagenomic variance so generated represents a global picture of the Rothamsted soil metagenome that can be used to address specific questions and for future inter-environmental metagenomic comparisons. However, only 34.5% of the reads were assigned to functions and <1% of annotated sequences correspond to already sequenced genomes (at 96% similarity); therefore, many soil microorganisms remain elusive and genome reconstructions are needed.