In the past decade, remarkable relationships have been documented between dysbiosis of the human microbiota and adverse health outcomes. This review seeks to highlight some of the challenges and pitfalls that may be encountered during all stages of microbiota research, from study design and sample collection, to nucleic acid extraction and sequencing, and bioinformatic and statistical analysis.
Literature focused on human microbiota research was reviewed and summarized.
While most studies have focused on surveying the composition of the microbiota, fewer have explored the causal roles of these bacteria, archaea, viruses, and fungi in affecting disease states. Microbiome research is in its relatively early years and many aspects remain challenging, including the complexity and personalized aspects of microbial communities, the influence of exogenous and often confounding factors, the need to apply fundamental principles of ecology and epidemiology, the necessity for new software tools, and the rapidly evolving genomic, technological, and analytical landscapes.
Incorporating human microbiome research in large epidemiological studies will soon help us unravel the intricate relationships that we have with our microbial partners and provide interventional opportunities to improve human health.
It is believed that in the human body, microorganisms are more numerous than human somatic and germ cells. Together, the genomes of these microbial mutualists (collectively defined as the metagenome) provide traits and services to humans, and in some cases, are associated with disease pathogenesis. Over the past 10 years, with the advent of high-throughput sequencing technologies, there has been an exponential increase in molecular studies of the human microbiome. Rather than relying solely on bacterial cultivation for identification, partial sequencing of the bacterial 16S rRNA gene has become the standard in cataloguing organisms in biological samples.
If humans are thought of as a composite of microbial and human cells, and the human genetic landscape as an aggregate of the human genome, the microbiota (bacteria, archaea, and lower eukaryotes) and the virome (the collective set of bacteriophages and viruses), then the picture that emerges is one of a human 'supra-organism'. It therefore becomes necessary to consider human health and disease outcomes in the context of our microbial partners. Microbiology is now entering a new era where the focus moves from the properties of single organisms in isolation to the operations of whole communities. The new field of metagenomics involves the genomic characterization of entire microbial communities, not just the cultivation of single organisms.
Molecular methods for interrogating microbial communities have led to a better understanding of the organisms present at specific sites on the human body and their potential roles in human health. The respective microbiota in each body niche can influence a wide variety of health outcomes including obesity, brain chemistry, ulcerative colitis, gynecologic and obstetric health, and periodontal disease. Efforts to describe a “core” human microbiome, in the hopes of providing a baseline for comparisons, have proven to be challenging because bacterial communities show high inter-subject variability in species composition, while functional gene expression is more conserved.
With large datasets capturing many dimensions of the microbiota, including diversity, relative abundance and absolute abundance of bacterial taxa, as well as functional measurements of the microenvironment, there are tremendous opportunities for epidemiological studies to describe the microbiota’s role in transitions between healthy and disease states. To date, most studies have focused on quantifying the statistical associations between the composition of the human microbiota and health outcomes; fewer have been able to document how microbial changes are part of the causal chain leading to disease [8, 11].
Early studies on microbes were constrained to culture-based methods that were limited by the large numbers of species that resisted cultivation. While cultivation of microbes has improved and the proportion of organisms not yet cultivated is rapidly decreasing, the development of molecular methods for characterizing the microbiota, including marker gene amplicon, metagenomic, and metatranscriptomic sequencing, brought about rapid access to the identification and genomic information of previously uncultivated organisms. Marker gene amplicon (mainly 16S rRNA gene) sequencing involves interrogating a single gene to identify which species are present. Combined with broad and species-specific quantitative PCR, this approach allows species and their abundance in biological samples to be catalogued. For an overview of the human microbiome and 16S rRNA gene-based analyses for characterizations of the human microbiota, as well as terminology in this rapidly evolving field, we refer the reader to an excellent review by Tyler et al. as well as an editorial by Marchesi and Ravel.
To gain insight into the functional makeup of microbial communities, metagenomic sequencing is applied by sequencing all of the DNA recovered from a sample. Analyzing these reads can identify what organisms are present and the community’s genomic content and functional potential. Metatranscriptomic sequencing, which surveys expressed genes in a sample, defines the function of the community at the time of sampling. These approaches could be further expanded by looking at the metaproteome [14-16] or the community metabolic outcomes, the metabolome [17-19]. Recent technological advances in high-throughput sequencing have enabled the parallel processing of large numbers of samples at affordable costs. As a consequence, these methodologies can now be integral to large-scale epidemiological studies.
In this review, we seek to detail what is involved with analyses of the human microbiota from an epidemiological perspective, with specific attention to the associated difficulties in designing, executing, and interpreting studies of the human microbiome. Figure 1 presents a sample workflow for conducting a 16S rRNA sequencing study, and while the details would differ when conducting a metagenomic or metatranscriptomic study, this flow chart highlights the issues to consider at each step of the process.
One of the first issues that arises when planning epidemiological studies of the human microbiome is determining collection methods for the samples. Collection should recover samples that are representative of the true microbiota present at the site, while limiting sampling biases and contamination. Less invasive sampling methods encourage recruitment and retention of study participants, and a pilot study can help inform and validate sampling methods. For example, recent studies on the methods for sampling the sinonasal microbiota and intestinal mucosa found that the less invasive methods provided samples with microbiota profiles consistent with those of samples obtained using classical sampling methods. In contrast, fecal transport swabs recovered less DNA and showed altered microbiota profiles compared to fecal material samples, stressing the importance of validating collection methods.
An important aspect of sampling strategy also includes sampling frequencies, which if performed in a clinical setting is often limited by the willingness of participants to return to the study site frequently as well as staffing requirements. However, participants are capable and willing to perform self-sampling at home and with high compliance rates [7, 23-29], thus enabling large field-based longitudinal epidemiological studies. Numerous groups have validated the use of self-collected samples compared to clinician-collected samples for microbiome studies and pathogen detection, as well as confirmed uniformity from repeated sampling at the same sitting [30-33]. The number of samples to be collected at each time point should also be considered. Excessive sampling can be difficult from a human subject perspective, and may in itself disturb the microenvironment thus introducing compounding biases over time, making it potentially difficult to interpret longitudinal patterns of change.
Following sample collection, it is then important to take into consideration methods for sample transport and both short-term and long-term storage. Delays often occur between sampling and final storage because of logistical issues, and it is not always possible to process samples immediately after collection. Numerous studies have evaluated the effect of temperature and duration of storage on fecal samples and have found conflicting results in terms of the effect on microbiota composition based on 16S rRNA gene profiling, with some samples showing little change [22, 34-37] and others showing significant differences [38, 39]. Amies transport medium has been a successful choice for preserving fecal [40, 41], vaginal [7, 31, 42], and nasal samples for DNA extraction and sequencing. Samples taken for transcriptomic analysis need to be stored appropriately to minimize RNA degradation, so preservation with guanidine thiocyanate is usually used to prevent nucleases from degrading RNA molecules. RNAlater has been used successfully for recovery of DNA and RNA from fecal samples [38, 44, 45] and saliva.
A critical step in microbiome analyses is DNA extraction, as in principle this is where most biases could be introduced, mostly from uneven cell lysis across the microbial community. Cell lysis, typically achieved through enzymatic and/or mechanical manipulations, would ideally work on all cell types equally, resulting in DNA being representative of the composition of the starting material. However, cells can vary in their susceptibility to lysing methods, with some lysing under fairly gentle conditions, and others, particularly Gram-positive organisms or spores, needing much harsher conditions that may result in shearing of DNA from easily-lysed organisms. Several studies have shown that mechanical lysis gives the highest bacterial diversity in 16S rRNA gene surveys [47, 48], and performs particularly well in the recovery of Gram-positive organisms in fecal communities. Oral samples extracted using either mechanical or enzymatic lysis steps have shown overall similar microbiota profiles based on 16S rRNA gene amplicon sequencing, but with higher recovery of certain taxa with either method. It is therefore important to consider what types of organisms are expected in a specific sample when choosing an extraction method, and to note that no method is inherently free of biases. Similar considerations apply to RNA extraction methods.
Of vital importance at every stage of sample manipulation is minimizing the introduction of non-indigenous microbes or DNA. Any contamination from, for example, the lab environment, DNA/RNA extraction kits or PCR reagents [52, 53], can be difficult to distinguish from the microbial content of the samples themselves. The effects of contamination (often present in very low abundance) are generally minimal when dealing with high biomass samples. However, samples with low biomass can have so little template DNA that they produce 16S rRNA gene amplicons and metagenomic results that represent the contaminating DNA and not the sample’s true composition. Preparing metagenomic sequencing libraries from low levels of input DNA (a situation encountered with low biomass samples) could result in enrichment of AT-rich DNA during amplification. Dealing with this background amplification becomes a critical matter when working with these kinds of samples [56, 57]. In addition to including proper negative and positive controls to monitor for contamination at each step of the process and maximizing the genomic material used in experiments, Weiss et al. also suggest randomizing the order of extractions to control for batch effects that arise from contamination that is unique to different batches or lot numbers of reagents.
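The randomization of extraction order mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample identifiers, the batch size of four, and the function name `randomized_batches` are hypothetical and do not come from any cited protocol.

```python
import random


def randomized_batches(sample_ids, batch_size, seed):
    """Shuffle samples reproducibly, then split them into extraction batches."""
    rng = random.Random(seed)      # fixed seed so the assignment can be reported and repeated
    order = list(sample_ids)       # copy so the caller's list is left untouched
    rng.shuffle(order)
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


# Hypothetical study with six cases and six controls, extracted four at a time.
samples = [f"case_{i}" for i in range(6)] + [f"control_{i}" for i in range(6)]
batches = randomized_batches(samples, batch_size=4, seed=2016)

for number, batch in enumerate(batches, start=1):
    print(f"batch {number}: {batch}")
```

Shuffling before batching keeps cases and controls from being systematically processed with different reagent lots, so a lot-specific contaminant is less likely to be mistaken for a biological difference.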
Extracted DNA can be used for the phylogenetic molecular assessment of the composition of the microbiota, either through marker gene amplicon or metagenomic sequencing. Marker gene amplicon sequencing involves the enrichment of a targeted gene that is phylogenetically informative. The sequence of this gene is used to identify what taxa are present in the sample. The most commonly used marker gene is the 16S rRNA gene, which is found in all bacteria and archaea. This gene contains nine hypervariable regions, the combined sequence of which is unique to bacterial or archaeal taxa and thus can be used for taxonomic classification by comparison to databases. Interspersed between these variable regions are conserved regions that can function as priming sites for “universal” amplification. There are some drawbacks to using the 16S rRNA gene for microbiota analysis studies, including 1) some “universal” primer combinations give poor amplification of certain taxonomic groups, leading to underrepresentation of these organisms; 2) some variable regions lack specificity and may not be able to discriminate between taxa at the species level; and 3) microorganisms contain varying copy numbers of the 16S rRNA gene, making quantification of relative abundance somewhat inaccurate. Nevertheless, the ubiquity of the 16S rRNA gene, and the large number of reference sequences deposited in databases, make it the most common target for phylogenetic analyses.
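To make the third drawback concrete, the following Python sketch shows how differences in 16S rRNA gene copy number can distort relative abundances, and how dividing read counts by copy number changes the picture. The taxa, read counts, and copy numbers are hypothetical values chosen for illustration, and the simple division shown here is an assumption rather than a method prescribed by the studies cited above.

```python
def relative_abundance(counts):
    """Convert counts per taxon into proportions that sum to 1."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


def correct_for_copy_number(read_counts, copies_per_genome):
    """Divide each taxon's reads by its 16S copy number, then renormalize."""
    per_genome = {t: read_counts[t] / copies_per_genome[t] for t in read_counts}
    return relative_abundance(per_genome)


# Hypothetical amplicon read counts and 16S rRNA gene copies per genome.
read_counts = {"taxon_A": 700, "taxon_B": 200, "taxon_C": 100}
copies_per_genome = {"taxon_A": 7, "taxon_B": 2, "taxon_C": 1}

naive = relative_abundance(read_counts)
corrected = correct_for_copy_number(read_counts, copies_per_genome)

for taxon in read_counts:
    print(f"{taxon}: reads {naive[taxon]:.2f}, copy-number adjusted {corrected[taxon]:.2f}")
```

In this constructed case taxon_A appears to make up 70% of the community by reads, yet all three taxa contribute the same number of genomes once copy number is taken into account. In practice copy numbers are unknown for many taxa, which limits how far such a correction can be trusted.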
While the full 16S rRNA gene is over 1,500 base pairs long, high-throughput sequencing methods produce reads that are significantly shorter. One to three specific variable regions are targeted for amplification and subsequent taxonomic assignment and analysis. It is important to consider the ability of the targeted variable region(s) to discriminate amongst taxa known to be predominant in the types of samples to be analyzed and minimize the effects of known primer biases. Although primers are designed based on a consensus sequence, some taxa do have mismatches in these consensus regions, potentially leading to their underrepresentation in terms of relative abundance. In addition, the information contained within each variable region can be more or less informative when it comes to taxonomic assignment. For example, the V6 region of the 16S rRNA gene performs poorly compared to others in discriminating taxa in human gut samples. Primers for amplification of the V3-V4 or V4 regions are able to detect both bacteria and archaea, so those would be optimal when interested in analyzing the archaeal portion of the microbiota in samples such as stool. An assessment of commonly used primer pairs found that longer 16S rRNA gene amplicons did not necessarily confer better classification; rather, it was the specific target region, depending upon sample origin, that had the biggest impact on classification [64, 65]. Thus it is recommended to select the PCR primers and the targeted hypervariable regions based on sample types.
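A simple way to screen for the primer mismatches described above is to compare a degenerate primer against the priming site of each reference sequence of interest. The Python sketch below assumes the priming site has already been located and aligned to the primer without gaps; the primer and the three template sites are illustrative sequences, not a recommendation of a specific primer pair.

```python
# IUPAC nucleotide codes: each primer symbol maps to the bases it can pair with.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def count_mismatches(primer, site):
    """Count positions where the template base is not allowed by the primer symbol."""
    if len(primer) != len(site):
        raise ValueError("primer and priming site must have the same length")
    return sum(1 for p, s in zip(primer, site) if s not in IUPAC[p])


# Illustrative degenerate primer (M = A or C) and hypothetical template sites.
primer = "GTGCCAGCMGCCGCGGTAA"
sites = {
    "taxon_A": "GTGCCAGCAGCCGCGGTAA",  # covered by the degenerate position
    "taxon_B": "GTGCCAGCCGCCGCGGTAA",  # covered by the degenerate position
    "taxon_C": "GTGTCAGCAGCCGCGGTAA",  # one substitution outside the degenerate position
}

for taxon, site in sites.items():
    print(taxon, count_mismatches(primer, site))
```

Taxa with one or more mismatches, especially near the 3' end of the primer, are candidates for underrepresentation and may justify choosing a different primer or variable region for that sample type.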
The next steps in describing the molecular profile of microbial communities include sequencing of the amplified region or library and quality control of resulting sequence reads. These reads are then used for picking OTUs and/or taxonomic classifications which will be used for downstream analyses. The technical details of this are perhaps outside the scope of this review. Therefore, we have presented a more detailed description of sequencing technologies, sequence read pre-processing, and taxonomic classification in Appendix A.
Statistical analysis of 16S rRNA gene sequence data could include the application of ecological concepts such as alpha diversity, an estimate of the mean diversity within a sample, and beta diversity, the comparison of diversity between samples. Alpha diversity describes the richness and evenness of the microbiota in a given sample. Common workflows of 16S rRNA gene sequence analysis include using relative abundances obtained with and without normalizing read counts between samples through subsampling [66, 67]. Work by McMurdie and Holmes has shown that such normalization procedures can lead to overestimates of differentially abundant species across samples and a loss of statistical power; they instead recommend using unrarefied data and statistical models that account for differences in total read counts between samples. Alpha diversity measures could be used to explore differences between healthy and disease states in epidemiological studies. For example, an increase in Shannon diversity was found in vaginal samples from women diagnosed with bacterial vaginosis as well as Caucasian women who delivered prematurely, while a decrease in diversity was observed in fecal samples from individuals with inflammatory bowel disease compared to those without. Interpretation of these differences in ecological metrics should be done with care as it is still unclear how to translate these measures in clinical settings.
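The alpha diversity measures and the subsampling step discussed above can be illustrated with a short Python sketch. The Shannon index is computed as H = -Σ p_i ln p_i over the observed taxa, and Pielou's evenness as H / ln S, where S is the richness. The count vectors are hypothetical, and the `rarefy` helper is a minimal stand-in for subsampling, not the procedure of any particular software package.

```python
import math
import random


def richness(counts):
    """Number of taxa observed at least once."""
    return sum(1 for c in counts if c > 0)


def shannon(counts):
    """Shannon diversity H = -sum(p * ln p) over observed taxa."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def pielou_evenness(counts):
    """Shannon diversity divided by its maximum, ln(richness)."""
    s = richness(counts)
    return shannon(counts) / math.log(s) if s > 1 else 0.0


def rarefy(counts, depth, seed=0):
    """Subsample `depth` reads without replacement and return counts per taxon."""
    pool = [taxon for taxon, c in enumerate(counts) for _ in range(c)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of reads in the sample")
    subsampled = [0] * len(counts)
    for taxon in random.Random(seed).sample(pool, depth):
        subsampled[taxon] += 1
    return subsampled


# Hypothetical samples with the same richness but very different evenness.
even = [25, 25, 25, 25]
dominated = [970, 10, 10, 10]

print(round(shannon(even), 3), round(pielou_evenness(even), 3))
print(round(shannon(dominated), 3), round(pielou_evenness(dominated), 3))
print(rarefy(dominated, depth=100))
```

Both samples contain four taxa, but the dominated sample has a much lower Shannon index. Subsampling it to 100 reads may drop rare taxa entirely, which is one way information can be lost through this kind of normalization.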
Comparisons between samples can be done with multivariate analysis, which can take into account the presence or absence of species and their abundance or phylogeny [65, 71]. Distance between samples can be calculated in a number of ways; Sorensen or Jaccard indexes consider the presence or absence of OTUs, while Bray-Curtis also takes into account OTU abundance. Jensen-Shannon divergence can also be calculated to show the levels of similarity between communities, and has been used successfully in epidemiological studies [72-75]. UniFrac distances consider the phylogeny of member OTUs, and can be calculated either weighted or unweighted for OTU abundances. Visualization of these distances can be accomplished with a clustering approach that produces a dendrogram demonstrating the similarity between samples. Principal coordinates analysis (PCoA) plots each sample as a single point in a multidimensional space whose axes are the principal coordinates. Color coding of the samples according to metadata can reveal information about the factors driving similarities between samples, such as geographic location of subjects, or recovery time after infection.
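The non-phylogenetic measures named above can be written out directly. The following Python sketch computes Jaccard and Sorensen dissimilarities from presence or absence, Bray-Curtis dissimilarity from counts, and the base-2 Jensen-Shannon divergence from relative abundances. The two OTU count vectors are hypothetical, and UniFrac is omitted because it additionally requires a phylogenetic tree.

```python
import math


def _present(counts):
    """Indices of OTUs observed at least once."""
    return {i for i, c in enumerate(counts) if c > 0}


def jaccard(a, b):
    """1 - shared OTUs / OTUs found in either sample."""
    pa, pb = _present(a), _present(b)
    return 1 - len(pa & pb) / len(pa | pb)


def sorensen(a, b):
    """1 - 2 * shared OTUs / (OTUs in a + OTUs in b)."""
    pa, pb = _present(a), _present(b)
    return 1 - 2 * len(pa & pb) / (len(pa) + len(pb))


def bray_curtis(a, b):
    """Sum of absolute count differences divided by the total count."""
    return sum(abs(x - y) for x, y in zip(a, b)) / (sum(a) + sum(b))


def jensen_shannon_divergence(a, b):
    """Base-2 Jensen-Shannon divergence between relative abundance profiles."""
    p = [x / sum(a) for x in a]
    q = [y / sum(b) for y in b]
    m = [(x + y) / 2 for x, y in zip(p, q)]

    def kl(u, v):
        return sum(ui * math.log2(ui / vi) for ui, vi in zip(u, v) if ui > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical OTU counts for two samples (same OTU order in both).
sample_1 = [10, 0, 5, 5]
sample_2 = [10, 10, 0, 0]

print(jaccard(sample_1, sample_2))                    # 0.75
print(sorensen(sample_1, sample_2))                   # 0.6
print(bray_curtis(sample_1, sample_2))                # 0.5
print(jensen_shannon_divergence(sample_1, sample_2))  # 0.5
```

The same pair of samples receives a different value from each measure, which is why the choice of distance should be stated and, ideally, justified by the question being asked. The square root of the Jensen-Shannon divergence is the quantity usually referred to as the Jensen-Shannon distance.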
Clearly there are a number of steps during the analysis process at which biases can affect results, and it can be challenging to determine if these are severe enough to significantly alter the observed microbiota from its native state. One option for evaluating whether this is the case is processing a “mock microbial community” that comprises known organisms in known quantities and proportions. An investigation of a mock community prepared by the Human Microbiome Project (HMP) compared a range of sample storage temperatures and two DNA extraction methods, and found that the different extractions resulted in microbial community composition and abundance that were statistically different, but provided consistent conclusions. The HMP initially employed four sequencing centers to generate the data, providing an excellent opportunity for the evaluation of center-specific (technical) biases introduced when using the same methods. When Schloss et al. analyzed the results from HMP mock communities processed and sequenced at different centers, they found that based on non-metric multidimensional scaling, communities were clustering primarily based on the sequencing center, and secondarily by processing batches. Further, the HMP provided the opportunity to evaluate the effects of different 16S rRNA gene variable regions and bioinformatic analytical approaches on community composition. This work led to standard protocols that could be applied by others and a better understanding of the biases of each platform. One difficulty in this rapidly developing field is that protocols can become quickly outdated and need to be reworked with equivalent rigor when adapting to new sampling devices, storage systems, DNA extraction procedures, and more importantly, sequencing platforms. This is particularly challenging when trying to compare data with previously published studies.
When possible, it is recommended to use data that was sequenced on the same platform, using the same variable region, and reprocess the data with current and approved bioinformatics tools, in order to detect any potential biases. The development of standardized protocols, which would likely be body-site specific, but could be utilized by all researchers, would make comparison more reliable. However, the development of such protocols is a challenge in itself.
Microbial communities can be grouped according to community composition. These groups are generated using clustering algorithms that also take into account the relative abundances of all taxa detected in a sample. This was first described in the human gut, where the term enterotypes was coined. In the genital tract, the more ecologically correct and more generally applicable term community state types (CSTs) was used for the characterization of the vaginal microbiota [42, 73, 84]. However, the term cervicotype has also been used when specifically analyzing cervical samples. These classifications have recently been challenged because the line between groups is often blurry: communities appear to be distributed along a gradient, and most samples fall somewhere between the extremes. Moreover, the bioinformatic approach used to analyze the data can greatly influence the grouping outcome. Koren et al. reported that enterotype identification depended not only on the structure of the data but also on the methods used to assess clustering strength. As a result, the decision of how to group communities into different enterotypes (how many? are there outliers? where to break a continuum?) remains open for discussion. It may be that the best clustering categories are those that have biological relevance or best distinguish risk factors in relation to the health outcome of interest. Future epidemiologic studies will help us to better detail which taxa or communities of bacteria (as in the case of CSTs) are associated with various disease outcomes.
To identify clusters of bacterial communities based on relative abundances of different phylotypes (to generate a CST), Gajer et al. computed Jensen-Shannon distances between all pairs of community states (samples) and then generated hierarchical clustering using the Jensen-Shannon distance data and Ward linkage. However, should a large dataset be available, one could train a machine learning classifier, such as a support vector machine (SVM), to make these CST assignments so that they are consistent and independent of the samples that are clustered. Such a classifier would allow for CST assignments to be comparable across studies using the same 16S rRNA gene primers and sequencing platform. It is important to note that CSTs are not forced, pre-determined or assigned by eye, all common confusions observed in the literature. Figure 2 shows the dendrogram of the resulting hierarchical clustering of over 3,938 samples processed in our research studies and illustrates how bacterial communities cluster to form CSTs. CSTs are most useful in reducing the complexity of the dataset and allowing epidemiological investigations of disease outcomes with a large number of samples. Data reduction methods such as these may also prove useful in identifying biomarkers associated with disease states, or even susceptibility to diseases before they occur. A recent study incorporated 16S rRNA relative abundance data of gut samples with clinical data to improve the accuracy of predictive models in discriminating between healthy and colorectal cancer patients. Longitudinal study designs are also valuable in facilitating the identification of biomarkers that appear before the onset of disease or detecting changes in the microbiota with fluctuations in symptoms.
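The clustering idea described above (pairwise Jensen-Shannon distances followed by agglomerative clustering with Ward linkage) can be sketched in plain Python. This is a minimal illustration and not the code used by the cited authors: the six relative abundance profiles are hypothetical, the number of clusters is fixed by hand, and Ward linkage is implemented with the Lance-Williams update on squared distances, which formally assumes Euclidean-like distances.

```python
import math
from itertools import combinations


def js_distance(p, q):
    """Jensen-Shannon distance: square root of the base-2 divergence."""
    m = [(x + y) / 2 for x, y in zip(p, q)]

    def kl(u, v):
        return sum(a * math.log2(a / b) for a, b in zip(u, v) if a > 0)

    return math.sqrt(max(0.0, 0.5 * kl(p, m) + 0.5 * kl(q, m)))


def ward_clusters(profiles, n_clusters):
    """Merge samples with Ward linkage until n_clusters groups remain."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = {i: [i] for i in range(len(profiles))}
    # Squared distances between every pair of current clusters.
    d2 = {frozenset((i, j)): js_distance(profiles[i], profiles[j]) ** 2
          for i, j in combinations(clusters, 2)}
    next_id = len(profiles)
    while len(clusters) > n_clusters:
        # Pick the closest pair of clusters and merge them.
        i, j = min(combinations(clusters, 2), key=lambda pair: d2[frozenset(pair)])
        ni, nj = len(clusters[i]), len(clusters[j])
        for k in clusters:
            if k in (i, j):
                continue
            nk = len(clusters[k])
            # Lance-Williams update for Ward linkage.
            d2[frozenset((k, next_id))] = (
                (ni + nk) * d2[frozenset((k, i))]
                + (nj + nk) * d2[frozenset((k, j))]
                - nk * d2[frozenset((i, j))]
            ) / (ni + nj + nk)
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical relative abundances of three phylotypes in six samples.
profiles = [
    [0.90, 0.05, 0.05], [0.85, 0.10, 0.05],   # dominated by phylotype 1
    [0.05, 0.90, 0.05], [0.10, 0.85, 0.05],   # dominated by phylotype 2
    [0.30, 0.30, 0.40], [0.25, 0.35, 0.40],   # mixed community
]
print(ward_clusters(profiles, n_clusters=3))  # [[0, 1], [2, 3], [4, 5]]
```

Here the groups follow from the distances alone rather than being assigned by eye, but the number of clusters is still a choice made by the analyst, which is exactly the point of contention raised for enterotypes.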
We have only begun to understand the importance of our human microbiome and how changes in its composition and function can affect our health. Our behaviors, such as smoking, diet, and hygiene do not just affect us, they also affect our microbial partners. Surprisingly little is known about the composition of the human microbiome across the lifespan, how common human activities affect the structure of the microbiota, the correlation with the immunologic microenvironment, or the associations with disease susceptibilities and symptoms for example. Of course, this also presents one of the most challenging aspects of human microbiome research in modeling of microbial communities in the presence of co-factors associated with human behaviors and characteristics.
While the bacterial portion of the microbiota has been a popular area of research, newer work has also focused on characterizing the mycobiome, the fungal portion of the microbiota, and the virome, comprising viruses and bacterial phages. Much as bacterial and archaeal microbiota research has focused on sequencing of the 16S rRNA gene, mycobiome research relies on sequencing part of the 18S rRNA gene and/or ITS (internal transcribed spacer) regions for taxonomic classification. Some of the limitations in analyzing these fungal communities involve disagreements over the optimal 18S or ITS regions for analysis, as well as a lack of robust databases for classification. Virome research is particularly handicapped by the lack of conserved genetic material among all viruses, so there is no equivalent to the 16S rRNA gene for targeted sequencing and identification of all viruses. Research can be targeted to specific viral groups using target hybridization and enrichment prior to sequencing [90, 91]. Metagenomic sequencing of all DNA from the sample can aid in identifying novel DNA viruses, or known viruses that are unexpected at that body site, but will miss RNA viruses, which would need to be identified through metatranscriptomic analyses. Metagenomic sequencing, for example, has been successfully used to interrogate viral communities such as those inhabiting the respiratory tract in cystic fibrosis and non-cystic fibrosis individuals and the human gut. Virome research could also elucidate the effect of bacteriophages on the human microbiota, with possible applications as a tool to manipulate or modulate the microbiota.
Much of the published work on the human microbiome is highly descriptive, demonstrating differences in characteristics of the microbiota between sampling sites of the body, over time [73, 96], and between diseased and healthy states [97, 98]. Moving forward, being able to differentiate between correlations of the microbiota with a specific disease state and demonstrating causation is an exciting prospect, one in which an epidemiological approach will play a major role. One example of clear causation is the transplant of fecal materials from healthy donors (i.e., healthy microbiota) to treat patients with recurrent Clostridium difficile infection [99, 100]. The fecal microbiota before and after the transplant has been monitored using 16S rRNA gene sequencing for up to a year after treatment, and showed that post-transplant, treated patients experience a lasting increase in species diversity and changes in relative abundance of certain organisms similar to the healthy donor. This experiment provides well-defined proof of concept that altering the microbiota can impact the disease state of a patient. Observational studies that follow cohorts over long periods of time may also be useful in surveying the microbiota before, during, and after disease or dysbiosis, thus identifying whether disruption or alteration of the microbiota is the result of disease or potentially its cause. While such large-scale studies may be difficult and expensive to run, an alternative that is often underutilized is capitalizing on existing sample repositories that include extensive and frequent sampling. More work needs to be done in centralizing information on existing repositories of both data and clinical samples.
Molecular epidemiology is a discipline that combines molecular approaches with classical epidemiology, and all the tenets of epidemiology fall within its reach and can be applied to human microbiome research. Often overlooked in human microbiome study design are development of strong questionnaires capable of collecting validated information on confounding factors, selection of appropriate controls, adjusting for those intricate human factors as needed, and controlling for correlation between samples collected longitudinally. The field would also benefit from collaborations with behavioral researchers and network epidemiologists to collect better measurements of activities of the core groups studied. Measurement of interactions can be complicated in human microbiome analyses and resorting to basic approaches of stratification of samples in analysis may be necessary. Finally, journals frequently request release of de-identified data and the NIH is also now requiring release of genomic data (https://gds.nih.gov/03policy2.html). NIH’s database of Genotypes and Phenotypes (dbGaP), which is a controlled-access database governed by a Data Use Certification (DUC) agreement, is one preferred option for human microbiome studies, but other options exist. Public deposit of data has not been routine in epidemiological research and with a push toward greater collaboration and sharing of data, new discoveries will certainly result. A caveat of requiring data release is that older human subject consent forms may not contain specific language related to the release of information to a public or controlled-access database, and therefore appropriate arrangements need to be made with the local institutional review board. The NIH recently launched the Big Data to Knowledge (BD2K) initiative to support development of new digital tools for analysis of biomedical research.
This encourages wider sharing of data between researchers and can lead to new insights into the functional role of the microbiota in human health.
Research on the human microbiome is still in its infancy. Sequencing and other omics technologies continue to evolve rapidly. Methods to extract DNA vary, collection media and devices continue to be re-invented, and there is certainly no template for statistically interrogating the complex communities of the microbiome. As we move forward into a relatively new field of inquiry, open access to data, as well as free exchange and comparisons of protocols, will help to solidify the field. We have the expectation that in the future, we will be able to harness the microbiome to improve human health. Therapy in the form of probiotics, prebiotics, or small molecules that control specific microbial biochemical reactions could be used to manage, modulate or restore the microbiome and maintain homeostasis.
This work was funded by the National Institute of Allergy and Infectious Diseases K01-AI080974, U19-AI084044 and R01-AI116799.
Early sequencing studies were performed using Sanger sequencing, which yielded long read lengths (>800 bp), but with a low throughput (~100 reads per sample) and a higher cost (~$100-$200 per sample). It was quickly replaced by the more economical (<$10 per sample) and higher throughput (>1,000 reads per sample) second-generation sequencing techniques. While Roche/454 pyrosequencing technology has been commonly used, it has recently been replaced by Illumina sequencing technologies, which offer higher-throughput paired-end, but shorter (~150-300 bp), reads, and to a lesser extent by Ion Torrent’s technology, which generates single ~400 bp reads. The latter relies on chemistry similar to that of the Roche/454 platform, but nucleotide incorporation is detected with an electronic pH detector that measures the pH change caused by the release of a hydrogen ion each time a nucleotide is incorporated. A study comparing 16S rRNA gene sequencing on the Illumina and Ion Torrent platforms found slight differences in over/underrepresentation of species between them, and premature truncation of reads appears to be an issue with the Ion Torrent reads, leading to additional challenges for the analysis. PacBio’s Single Molecule Real-Time (SMRT) sequencing is increasingly being applied to human microbiome research because of its long (>1.5 kb) reads, which make it possible to sequence the entire 16S rRNA gene. However, its adoption is slow because of its higher cost, poor sequence accuracy, and lower throughput [105, 106]. This single molecule long read technology platform might be well-suited to improve metagenome sequence quality by affording better quality genome assemblies, both on its own and in a hybrid approach that incorporates Illumina sequencing. However, its low throughput, hence low sequence sampling, still limits its use to specialized applications. All of these technologies rely on sequencing a mixture of DNA amplicons tagged with unique indexes indicating their origin.
After sequencing, reads are demultiplexed: the unique index is used to assign each read back to its sample of origin, allowing hundreds, or even thousands, of samples to be processed in a single sequencing run [109-111].
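The demultiplexing step can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the index is an inline barcode of fixed length at the start of each read and that only exact matches are accepted; the function name `demultiplex` and the sample names are hypothetical, and actual platforms and pipelines may read the index separately and tolerate index errors.

```python
from collections import defaultdict


def demultiplex(reads, index_to_sample, index_len):
    """Assign each read to its sample of origin using a leading index.

    Assumption (illustrative only): the index occupies the first
    `index_len` bases of the read and must match a known index exactly.
    """
    by_sample = defaultdict(list)
    unassigned = []
    for read in reads:
        index, insert = read[:index_len], read[index_len:]
        sample = index_to_sample.get(index)
        if sample is None:
            # Unknown index: keep the read aside instead of guessing.
            unassigned.append(read)
        else:
            by_sample[sample].append(insert)
    return dict(by_sample), unassigned


# Hypothetical indexes and reads.
index_map = {"ACGT": "sample_1", "TTAG": "sample_2"}
reads = ["ACGTTTGGA", "TTAGCCCAA", "GGGGAAAA"]
assigned, unassigned = demultiplex(reads, index_map, index_len=4)
```

In this toy run, one read is assigned to each sample and the read carrying an unknown index is set aside rather than forced into a sample.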
Illumina’s HiSeq and MiSeq instruments are both capable of paired-end sequencing (that is, they can generate sequences from both ends of the target molecule). Therefore, although the read lengths of the HiSeq and MiSeq are only about 150 and 300 base pairs, respectively, overlaps between the forward and reverse reads can be used to assemble the two reads into a single sequence of roughly 300 to 500 base pairs. The most common sequencing errors on the Illumina platform are mismatches rather than insertions or deletions. The HiSeq is capable of outputting 600 Gb per run and can process thousands of samples per run. While the MiSeq is not capable of such high throughput, it does offer longer reads (up to 300 bp) and a shorter run time (less than 2 days) than the HiSeq (currently about 4 days on the HiSeq 4000). Illumina sequencing has been used to survey the composition of the human microbiota at numerous body sites [113-119].
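The assembly of overlapping forward and reverse reads can be sketched as follows. This is a simplified illustration, not the method of any particular tool: it assumes the reverse read is given as sequenced (so it is reverse-complemented first), searches for the longest suffix-prefix overlap, and ignores quality scores. The function names and the `min_overlap` and `max_mismatches` parameters are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def merge_pair(forward, reverse, min_overlap=10, max_mismatches=0):
    """Merge a read pair on the longest acceptable overlap.

    Returns the merged sequence, or None if no overlap of at least
    `min_overlap` bases with at most `max_mismatches` mismatches exists.
    """
    rc = reverse_complement(reverse)
    longest = min(len(forward), len(rc))
    for overlap in range(longest, min_overlap - 1, -1):
        tail, head = forward[-overlap:], rc[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatches:
            # Keep the forward read and append the non-overlapping part.
            return forward + rc[overlap:]
    return None


# Toy 30-bp amplicon sequenced as two 20-bp reads overlapping by 10 bp.
amplicon = "ACGTACGGTTCAGGCTAAGCTTGACCATGG"
fwd = amplicon[:20]
rev = reverse_complement(amplicon[10:])
merged = merge_pair(fwd, rev, min_overlap=8)
```

Here the two 20-bp reads are reassembled into the original 30-bp amplicon; pairs without a sufficient overlap return `None` and would need to be handled separately.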
For a more thorough review of other sequencing technologies, including future technologies in development, and their applications, see Buermans and den Dunnen. Regardless of the sequencing methodology chosen, it should be noted that proprietary technologies can result in a lack of transparency. When sequencing problems occur, troubleshooting them independently can be very challenging. Microbiome research is a rapidly developing field with major leaps forward in sequence read lengths and decreases in cost over short periods of time. For long-term studies, sequencing methods that were standard at the onset of the study could become outdated or even unavailable by the end of the study. The rapid changes in technology and their disparate availability can make developing standards for the field and comparisons with previous datasets challenging, as each technology has its own set of biases and shortcomings.
Several open-source pipelines have been developed to make analysis of 16S rRNA gene amplicon sequences more streamlined and user-friendly. Two of the more popular ones are Quantitative Insights into Microbial Ecology (QIIME) and mothur. Other algorithms or software are available independently, and may require more specialized knowledge to employ. A basic workflow of 16S rRNA gene amplicon sequence analysis is described below.
Because 16S rRNA gene sequences are clustered based upon similarity, sequencing errors could result in reads being misclassified into separate OTUs, leading to an overestimate of bacterial diversity [120, 121]. Thus, a common strategy is to remove reads that have low quality scores (below an average quality score cut-off), are short (below a given length cut-off), or contain mismatches to the primers [120, 122, 123]. Others have applied more conservative strategies in which each read is scanned for low-quality regions, trimmed, and then reassessed for read length. This approach retains as much information as possible. The amplification and sequencing steps can produce chimeric sequences [125, 126], which, if not removed, could bias the taxonomic assignments by artificially inflating OTU richness estimates. A number of programs are available to detect and remove chimeric sequences, including ChimeraSlayer and UCHIME.
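The two quality-control strategies described above can be contrasted in a short Python sketch. All thresholds (mean quality of 25, minimum length of 200, window size of 10) are illustrative assumptions rather than recommended values, quality scores are assumed to be already decoded into integer Phred scores, and the function names are hypothetical.

```python
def mean_quality(quals):
    """Average Phred quality score of a read (0.0 for an empty read)."""
    return sum(quals) / len(quals) if quals else 0.0


def count_primer_mismatches(seq, primer):
    """Mismatches between the primer and the start of the read."""
    if len(seq) < len(primer):
        return len(primer)
    return sum(a != b for a, b in zip(seq, primer))


def passes_filter(seq, quals, primer, min_mean_q=25, min_len=200,
                  max_primer_mismatches=0):
    """Whole-read filter: length, average quality, and primer match."""
    return (len(seq) >= min_len
            and mean_quality(quals) >= min_mean_q
            and count_primer_mismatches(seq, primer) <= max_primer_mismatches)


def trim_then_check(seq, quals, window=10, min_window_q=20, min_len=200):
    """Trim at the first low-quality window, then reassess read length.

    Returns the (possibly trimmed) read and its scores, or None if the
    read is too short after trimming.
    """
    for start in range(len(seq) - window + 1):
        if sum(quals[start:start + window]) / window < min_window_q:
            seq, quals = seq[:start], quals[:start]
            break
    return (seq, quals) if len(seq) >= min_len else None


# Toy read whose quality drops toward the 3' end.
read = "ACGTACGTAC"
scores = [30] * 6 + [10] * 4
trimmed = trim_then_check(read, scores, window=3, min_window_q=20, min_len=4)
```

The whole-read filter discards a read outright, whereas the trimming variant keeps the high-quality prefix when it is still long enough, which is how it retains more information.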
After stringent quality control of the raw sequence reads, and because of the large number of sequences obtained with newer technologies, sequences are clustered into operational taxonomic units (OTUs) based on sequence similarity. There are several options for this step: closed- or open-reference clustering and de novo clustering [130-132]. In reference-based clustering, reads are compared to a database of known sequences (usually full-length 16S rRNA gene sequences) and classified into taxonomic groups based on identity. This gives taxonomic classification up front and is a good option when working with sample types that are thoroughly covered by the chosen reference database, but it can be problematic when reads have no close match in the database and therefore remain unclassified. In closed-reference clustering, reads with no match in the reference database are excluded, while in open-reference clustering they are retained, representing novel diversity not captured by the reference database. De novo clustering involves grouping 16S rRNA gene sequences based on similarity, without references. The threshold for similarity is adjustable, with 97% being the most common choice, as it is conventionally taken as an approximate threshold for species-level similarity. Many clustering algorithms have been implemented for this purpose [133-136]. A representative or consensus sequence of each OTU is then compared with a database for taxonomic assignment, and that assignment is transferred to all sequence reads forming the OTU. The size of the OTU represents its relative abundance in the sample. This approach is especially useful when researching previously unstudied environments, as it can identify OTUs that are not closely related to known organisms. However, the process is not perfect: the reads forming an OTU may not all be taxonomically identical, and the selection of OTU representative sequences could impact study outcomes.
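De novo clustering at a 97% similarity threshold can be sketched with a simple greedy, centroid-based procedure. This is one illustrative strategy among the many published algorithms, not the method of any specific tool. As a strong simplifying assumption, sequences are taken to be pre-aligned and of equal length, so identity is computed position by position; the function names are hypothetical.

```python
def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sketch assumes pre-aligned, equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold=0.97):
    """Greedy de novo clustering.

    Each sequence joins the first existing centroid it matches at or
    above `threshold`; otherwise it founds a new OTU and becomes that
    OTU's representative sequence.
    """
    centroids, clusters = [], []
    for seq in seqs:
        for i, centroid in enumerate(centroids):
            if identity(seq, centroid) >= threshold:
                clusters[i].append(seq)
                break
        else:
            centroids.append(seq)
            clusters.append([seq])
    return centroids, clusters


def relative_abundance(clusters):
    """OTU size divided by the total number of clustered reads."""
    total = sum(len(c) for c in clusters)
    return [len(c) / total for c in clusters]


# Toy 100-bp reads: one exact copy, one 1-mismatch variant, one divergent read.
base = "ACGT" * 25
near = "T" + base[1:]          # 99% identical to base
far = "TTTTTTTT" + base[8:]    # 94% identical to base
centroids, clusters = greedy_cluster([base, near, far, base])
abundances = relative_abundance(clusters)
```

Because the outcome depends on input order and on which sequence becomes the centroid, this sketch also illustrates why the choice of representative sequence can influence results.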
Common reference databases used for taxonomic classification of 16S rRNA gene sequences include the Ribosomal Database Project (RDP) [137, 138], SILVA, and Greengenes. Tables of the relative abundance of each OTU per sample allow comparisons of microbiota composition between subjects, between groups of subjects, and possibly over time. Relative abundance is often used in these analyses; however, it can give an imperfect picture of how samples relate to one another. Information on absolute abundance is important, because communities with dramatic differences in absolute abundance, which might reflect major functional differences, could have similar relative abundance compositions (i.e., a sample in which a given OTU has 50% relative abundance could have different implications if its overall bacterial load is 10^3 or 10^7). One option to complement 16S rRNA gene relative abundance is to use 16S rRNA gene quantitative PCR to measure the total number of 16S rRNA gene copies (an estimate of bacterial load) [141-143] and to combine it with the relative abundance of an OTU to calculate the absolute abundance of that OTU. Alternatively, in known samples with limited diversity, targeted qPCR assays could be used to evaluate the absolute abundance of specific taxa.
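The calculation of absolute abundance from relative abundance and total 16S rRNA gene copies amounts to a simple product, sketched below. The function name, OTU labels, and copy numbers are hypothetical, and the result is expressed in 16S rRNA gene copies rather than in cell counts.

```python
def absolute_abundance(relative_abundance, total_16s_copies):
    """Scale per-OTU relative abundances by the qPCR-estimated total.

    `relative_abundance` maps OTU names to fractions of the sample;
    `total_16s_copies` is the total 16S rRNA gene copy number measured
    by quantitative PCR for the same sample.
    """
    return {otu: fraction * total_16s_copies
            for otu, fraction in relative_abundance.items()}


# Two samples with identical composition but different bacterial loads.
composition = {"OTU_1": 0.5, "OTU_2": 0.5}
low_load = absolute_abundance(composition, 1e3)
high_load = absolute_abundance(composition, 1e7)
```

Both samples show OTU_1 at 50% relative abundance, yet its estimated absolute abundance differs by four orders of magnitude between them.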