Targeted sequencing is a powerful tool for assessing which organisms are present in microbial communities, but it yields limited functional and genetic information. Organisms for which the genome sequences are known (currently there are several thousand sequenced bacterial genomes) can be used to infer the genes and functional capabilities of the community. However, many organisms have no reference sequence. Furthermore, a reference sequence does not completely describe the genes that are contributed by an organism, because there is considerable variation between the genomes of strains of the same species. For example, two strains of Escherichia coli, O157:H7 and K-12, both have 16S rRNA gene sequences of E. coli but differ in hundreds of genes. There are therefore limits to what can be learned about the genetic content of communities from 16S rRNA gene sequences alone.
Moving beyond this level of functional inference requires a gene-based census. This catalogue of genes can be provided by shotgun sequencing of DNA that has been extracted from the community as a whole, which samples the mixture of genomes that make up the metagenome. In a community containing hundreds of species or more, with varying abundance, deep sequencing is needed to sample minor constituents, which are not necessarily unimportant. The bacterial concentration in the gut can be 10¹¹ (refs. 38, 39), so for an organism that is present at a concentration of 1 per 10⁶ there are 10⁵, which is sufficient for the organism’s products, such as metabolites and toxins, to have an effect on the community and the host.
Illumina sequencing of faecal samples produced 4 gigabases (Gb) per sample and 10 Gb per sample in the Metagenomics of the Human Intestinal Tract (MetaHIT) (ref. 6) and HMP projects, respectively, which corresponded to tens of millions of reads per sample. At this depth of sequencing, the genomes of minor constituents such as E. coli (with an abundance of about 1% or lower) are sampled almost completely, and organisms with an even lower abundance have some of their genome represented. This extraordinary sampling of complex microbial communities is made possible by the large amounts of data that NGS methods produce at low cost.
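The relationship between sequencing depth and the sampling of a minor constituent can be illustrated with a rough calculation. The sketch below is not taken from either project; it assumes that reads are drawn in proportion to an organism's share of the community DNA, that sampling is uniform and random (a Poisson approximation), and that the genome is about 5 megabases, an illustrative value in the range of E. coli genomes. The function names are hypothetical.

```python
import math


def expected_coverage(total_bases, relative_abundance, genome_size):
    """Mean fold-coverage of one genome, assuming reads are drawn in
    proportion to the organism's share of the community DNA."""
    return total_bases * relative_abundance / genome_size


def fraction_sampled(coverage):
    """Expected fraction of the genome covered by at least one read,
    assuming uniform random sampling (Poisson approximation)."""
    return 1.0 - math.exp(-coverage)


GENOME_SIZE = 5e6  # assumed genome size in bases (illustrative, E. coli-like)

for depth in (4e9, 10e9):            # 4 Gb and 10 Gb per sample
    for abundance in (0.01, 0.001):  # 1% and 0.1% of community DNA
        c = expected_coverage(depth, abundance, GENOME_SIZE)
        print(f"{depth / 1e9:.0f} Gb, abundance {abundance:.1%}: "
              f"{c:.1f}x coverage, {fraction_sampled(c):.1%} of genome sampled")
```

Under these assumptions, a constituent at 1% abundance receives about eightfold coverage from 4 Gb of data and is sampled almost completely, whereas one at 0.1% receives less than onefold coverage and is only partly represented. Real data deviate from this idealization because of uneven DNA extraction, variation in genome size and sequencing biases.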
Shotgun sequence data, in addition to 16S rRNA gene analysis, provide information on the organisms that make up communities. Extracting 16S rRNA gene sequences from shotgun reads to determine the organisms present is possible. Targeted 16S rRNA gene sequencing tends to introduce biases (owing to the broad-range PCR used to amplify 16S rRNA gene sequences or the choice of region within the 16S rRNA gene), whereas shotgun sequencing does not. However, shotgun sequencing is less sensitive than targeted rRNA sequencing because only a small fraction of the sequences are from 16S rRNA genes. Another approach is to align shotgun sequences to bacterial reference genomes (refs. 33, 40, 41), allowing the relative abundance of species to be determined on the basis of the number of reads that align to each reference genome (also useful for the comparative studies already described). The MetaHIT project has used this approach to classify individuals into different groups, called enterotypes, on the basis of the community structure in their faecal samples (ref. 40). The same enterotypes have been found in 16S rRNA gene-based analysis (ref. 42). The vaginal microbiome has also been classified into five groups (ref. 43). These observations suggest that the human microbiome may exist in distinct states in different people, although correlation with environmental, genetic or health status is not yet clear. Stratifying future studies according to the community class to which an individual belongs may be important for identifying correlations with phenotypic data.
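As an illustration of how read alignments might be turned into relative abundances, the following minimal sketch divides the number of reads aligned to each reference genome by the length of that genome and then rescales the values to sum to one. The length normalization is an assumption made here so that large genomes are not over-counted; it is not necessarily the procedure used in the studies cited. The function name, species labels and counts are hypothetical.

```python
def relative_abundance(read_counts, genome_lengths):
    """Estimate relative abundance from reads aligned to reference genomes.

    read_counts    -- dict: reference name -> number of aligned reads
    genome_lengths -- dict: reference name -> genome length in bases
    Counts are divided by genome length (an assumption of this sketch)
    and then rescaled so that the values sum to 1.
    """
    density = {name: read_counts[name] / genome_lengths[name]
               for name in read_counts}
    total = sum(density.values())
    if total == 0:
        raise ValueError("no reads aligned to any reference genome")
    return {name: value / total for name, value in density.items()}


# Hypothetical counts: species_A attracts three times as many reads,
# but its genome is also three times as long.
counts = {"species_A": 6000, "species_B": 2000}
lengths = {"species_A": 6_000_000, "species_B": 2_000_000}
print(relative_abundance(counts, lengths))
```

In this example the two species come out equally abundant once genome length is taken into account, although their raw read counts differ threefold. Reads that align equally well to several references, and organisms with no reference genome, are not handled by this sketch.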
Reference genome sequences are needed both to infer the genetic content of organisms identified by 16S rRNA genes and to identify the sources of shotgun reads by alignment, and so to determine the organismal content of communities from shotgun data. NGS techniques have reduced the cost of bacterial sequences to less than US$1,000 per genome and led to an increase in the production of ‘complete’ genome sequences. Current methodology relies mainly on Illumina shotgun sequencing and a variety of methods to assemble the reads into a genome. The product is not a true complete genome, but a high-quality draft that covers almost all of the genome and results in a high-quality base sequence (ref. 27). Programmes such as the HMP (refs. 32, 44) and the Genomic Encyclopedia of Bacteria and Archaea (GEBA) (ref. 45) are producing reference genomes by the thousands.
Although bacteria are the main components of the human microbiome, eukaryotic microbes and viruses (both human viruses and bacteriophages) are also present. The study of eukaryotic microbes is not as advanced as that of bacteria (ref. 46), but the organisms are identified by signature sequences (such as fingerprinting and 18S rRNA) and by shotgun sequencing, in a manner analogous to that used for bacteria. The number of reference genomes for eukaryotic microbes is smaller than that for bacteria, and progress will depend on addressing this shortfall.
By contrast, considerable effort is being given to characterizing the genomes of human viruses (ref. 47), known as the virome (Box 1). This work is based on shotgun sequencing, although oligonucleotide microarrays for virus detection are also used (refs. 49, 50). Viral sequences can be detected in shotgun data from different body sites, and viruses can also be enriched by processing samples before DNA extraction (ref. 51). Virome analysis by shotgun sequencing of microbial communities (discussed later) has led to the identification of human viruses (refs. 52–54), as well as the detection of known viruses in healthy subjects and in diseases of unknown aetiology (ref. 55). Likewise, bacteriophages are found to be highly diverse at different body sites (refs. 56–58), with differences between individuals as a result of diet (ref. 59) or disease states (refs. 60, 61).
Sequencing for gene catalogues and functional inference
Metagenomic shotgun data also sample community gene content, which is useful for defining community capabilities and identifying particular members. Deep sequencing, such as that used in the MetaHIT and the HMP, broadly samples the genomes of even minor constituents, facilitating the identification of genes present within a given community. By using the sequence reads themselves, or by first assembling them into contigs (Box 1), sequence data can be compared with databases such as the National Institutes of Health’s GenBank to identify which genes are present. De novo prediction of genes from metagenomic data is also possible (ref. 33), which provides motifs for functional inference even if the sequence does not find a match in a database. Finally, alignment of reads or contigs to reference genomes identifies which organisms are present, along with their known gene content. These methods convert metagenomic sequence data into catalogues of genes that can be further analysed.
Gene catalogues can be compared with databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) (ref. 62), which sorts gene products into pathways and processes. Such analyses provide lists of pathways, identify which pathway genes are in the community and quantify the abundances of genes and pathways (ref. 63). Comparing gene catalogues to specialized metabolic databases, such as the Carbohydrate-Active Enzymes database (ref. 64), is also useful. Carbohydrate-degrading capabilities of communities differ between body sites, suggesting that the carbohydrate spectrum of each body site has determined which organisms and pathways are present (ref. 65).
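The step from a gene catalogue to pathway abundances can be sketched as a simple aggregation. The example below assumes that each gene in the catalogue already has an abundance estimate and an annotation assigning it to zero or more pathways. The gene and pathway labels are invented placeholders, not real KEGG identifiers, and summing gene abundances per pathway is only one of several possible conventions.

```python
from collections import defaultdict


def pathway_abundance(gene_abundance, gene_to_pathways):
    """Sum gene abundances over the pathways to which each gene is assigned.

    gene_abundance   -- dict: gene -> abundance in the community
    gene_to_pathways -- dict: gene -> list of pathway labels
    Genes without an annotation are skipped; a gene assigned to several
    pathways contributes its full abundance to each of them.
    """
    totals = defaultdict(float)
    for gene, abundance in gene_abundance.items():
        for pathway in gene_to_pathways.get(gene, ()):
            totals[pathway] += abundance
    return dict(totals)


# Hypothetical catalogue: gene_C has no pathway annotation.
abundances = {"gene_A": 10.0, "gene_B": 5.0, "gene_C": 2.0}
annotation = {"gene_A": ["pathway_1"],
              "gene_B": ["pathway_1", "pathway_2"]}
print(pathway_abundance(abundances, annotation))
```

Because unannotated genes drop out of the totals, the result describes only the portion of the catalogue that matches the chosen database.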
In addition to pathway analysis, the presence and abundance of particular genes in a community, such as antibiotic-resistance genes or virulence factors, can be determined using methods similar to those already described. Such analyses can shed light on the pathogen burden in an individual and on the consequences of antibiotic treatment. The importance of functional analyses cannot be overemphasized, and functional properties of communities are thought to be more important than their taxonomic composition (ref. 66).