We live in a microbial world, with microscopic organisms filling discrete ecosystems in such environments as soil, lakes and oceans, the human gut or skin, and even computer keyboards. Though microbiota include bacteria, archaea, viruses and microscopic eukarya, we will consider only bacterial examples in this article. Bacteria comprise most of the Earth's biomass and richness [1]. They dominate ecological functions such as carbon cycling, greenhouse gas emission and oxygen production. Ninety per cent of the cells in a human body are bacterial, as are 99% of the gene transcripts [2]. However, most of the microbial world has been inaccessible to us, a kind of biological ‘dark matter’, since we do not know how to culture over 97% of all bacteria, and since older cultivation-independent microbial survey techniques such as TRFLP (Terminal Restriction Fragment Length Polymorphism), ARISA (Automated Ribosomal Intergenic Spacer Analysis) and gradient gel electrophoresis have significant limitations. ‘Next Generation’ sequencing technologies have enabled, for the first time, high-throughput microbial sampling [3].
Current microbiome studies extract DNA from a microbiome sample, quantify how many representatives of distinct populations (species, ecological functions or other properties of interest) were observed in the sample, and then estimate a model of the original community. Ambitious projects are underway to catalog microbial life for the entire Earth, the ocean and the human body [4–6]. Surveys of transcriptomes and entire genomes have revealed more than half of all known protein sequences. Existing methods for estimating richness and community structure from observed samples are becoming more refined, improving model estimation, confidence quantification and comparative methods [7–9]. Finally, interactive, visual techniques are emerging with which to explore these complicated data sets prior to formal analysis.
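The quantify-then-estimate workflow described above can be illustrated with a minimal sketch. It assumes that reads have already been assigned to population labels (the labels and the `reads` list below are hypothetical), and it uses the bias-corrected Chao1 estimator as one common nonparametric richness estimator. The choice of Chao1 is an assumption made for illustration; the methods in the cited references may differ.

```python
from collections import Counter


def observed_richness(counts):
    """Number of distinct populations seen at least once in the sample."""
    return sum(1 for c in counts if c > 0)


def chao1(counts):
    """Bias-corrected Chao1 estimate of total community richness.

    Uses the number of populations seen exactly once (f1) and exactly
    twice (f2) to estimate how many populations were missed entirely.
    """
    s_obs = observed_richness(counts)
    f1 = sum(1 for c in counts if c == 1)  # singletons
    f2 = sum(1 for c in counts if c == 2)  # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical population labels assigned to ten reads from one sample.
reads = ["A", "A", "A", "B", "B", "C", "D", "E", "E", "F"]
counts = list(Counter(reads).values())

print("observed richness:", observed_richness(counts))
print("Chao1 estimate:", chao1(counts))
```

Here the estimate exceeds the observed richness because several populations were seen only once, which suggests that further populations went unobserved. A sample with no singletons would return the observed richness unchanged.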
The new sequencing technologies have idiosyncratic strengths and weaknesses, which are not fully understood and are beyond the scope of this review [10]. Currently, most researchers use the Roche 454 GS-FLX or Illumina GAIIx/HiSeq2000 sequencing platforms. The Roche 454 GS-FLX Titanium can now generate in excess of 1 million reads per run, which takes 23 h, with read lengths up to 1000 bp (average ~500 bp); the average run generates 750 Mbp of sequencing data. The Illumina HiSeq2000 platform can now generate ~4 billion paired-end reads per run (with two flow cells of 1 billion fragments each), which takes 10 days, with (usually) 150-bp paired-end reads to create an ~250-bp product; the average run generates 1 Tbp of sequencing data. Of course, these statistics vary widely between individual labs. Emerging technologies, such as single-molecule sequencing and smaller single-lab devices, are not yet widely used, and Sanger sequencing of large-insert libraries is still significant [11].
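As a rough cross-check of the platform figures above, nominal yield can be approximated as read count multiplied by read length. The sketch below uses only the numbers quoted in the text. The function names are hypothetical, and the simple product is an assumed back-of-envelope model that ignores read-length distributions and quality filtering. Its results match the quoted average yields only to within the same order of magnitude, which is consistent with the wide variation noted above.

```python
def nominal_yield_bp(reads_per_run, read_length_bp):
    """Back-of-envelope yield: number of reads times bases per read."""
    return reads_per_run * read_length_bp


def yield_per_hour_bp(total_bp, run_hours):
    """Average bases produced per hour of instrument time."""
    return total_bp / run_hours


# Figures quoted in the text (read counts are approximate).
roche_bp = nominal_yield_bp(1_000_000, 500)          # ~1 million reads, ~500 bp average
illumina_bp = nominal_yield_bp(4_000_000_000, 150)   # ~4 billion reads, 150 bp each

print(f"Roche 454 nominal yield: {roche_bp / 1e6:.0f} Mbp per run")
print(f"Illumina HiSeq2000 nominal yield: {illumina_bp / 1e9:.0f} Gbp per run")
print(f"Roche 454: {yield_per_hour_bp(roche_bp, 23) / 1e6:.1f} Mbp per hour")
print(f"Illumina: {yield_per_hour_bp(illumina_bp, 10 * 24) / 1e9:.1f} Gbp per hour")
```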
Recent bioinformatics advances have significantly improved the detection and correction of sequencing and assembly errors. Several packages provide pipelines to bring these new algorithms into the lab [12]. Bioinformaticists continue to improve algorithms for detecting specific types of error, such as chimeric sequences [14] and precise but inaccurate reads [15].
In this review, we survey recent advances in genome-based analytical techniques to measure the diversity of complete microbial communities. There are, of course, many other ways for analytical scientists to advance microbiome studies, which we do not review here, such as new quality control methods, large-scale data curation, knowledge mining and novel data-analytic techniques such as metaproteomics and advanced mass spectrometry. For working purposes, we consider a ‘microbiome’ to be a well-defined patch of an ecosystem, such as all bacteria in a prescribed sector of the ocean or all bacteria from a specific body part of several humans. We use microbial ecology terminology rather than statistical conventions, so that a ‘population’ is a collection of all organisms of a given species, a ‘community’ is a collection of ‘populations’ that share a specific ecosystem, and a ‘sample’ or ‘specimen’ is a physical extract from a given microbiome. Finally, we limit references for the most part to recent publications that serve as jumping-off points for further exploration, rather than attempting a complete literature survey.
In this article, we first discuss studies based on 16S rRNA amplicons. Next, we review analyses of metagenomic and metatranscriptomic data from shotgun sequencing of multiple genomes or genome transcripts. We then consider advances and limitations in statistical techniques for diversity estimation, followed by visual analytics: hypothesis generation by visually exploring these very large sequence data sets. Finally, we speculate on how microbiome studies may change in the next 2 years.