Diarrheal diseases continue to contribute significantly to morbidity and mortality in infants and young children in developing countries. There is an urgent need to better understand the contributions of novel, potentially uncultured, diarrheal pathogens to severe diarrheal disease, as well as distortions in normal gut microbiota composition that might facilitate severe disease.
We use high-throughput 16S rRNA gene sequencing to compare fecal microbiota composition in children under five years of age who have been diagnosed with moderate to severe diarrhea (MSD) with the microbiota from diarrhea-free controls. Our study includes 992 children from four low-income countries in West and East Africa, and Southeast Asia. Known pathogens, as well as bacteria not currently considered important diarrhea-causing pathogens, are positively associated with MSD; these include Escherichia/Shigella and Granulicatella species and the Streptococcus mitis/pneumoniae groups. In both cases and controls, there tend to be distinct negative correlations between facultative anaerobic lineages and obligate anaerobic lineages. Overall genus-level microbiota composition exhibits a shift in controls from low to high levels of Prevotella and in MSD cases from high to low levels of Escherichia/Shigella in younger versus older children; however, there is significant variation among many genera by both site and age.
Our findings expand the current understanding of microbiota-associated diarrhea pathogenicity in young children from developing countries. Our findings are necessarily based on correlative analyses and must be further validated through epidemiological and molecular techniques.
We introduce a novel methodology for differential abundance analysis in sparse high-throughput marker gene survey data. Our approach, implemented in the metagenomeSeq Bioconductor package, relies on a novel normalization technique and a statistical model that accounts for under-sampling: a common feature of large-scale marker gene studies. We show, using simulated data and several published microbiota datasets, that metagenomeSeq outperforms the tools currently used in this field.
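The abstract above does not spell out the normalization, so the following is only a minimal sketch of one scaling idea of this kind, cumulative sum scaling: each sample's counts are divided by the sum of its counts up to a chosen quantile of the nonzero values, rather than by the total, so that a few very abundant features do not dominate the scaling factor. The function name `css_normalize`, the fixed median quantile, and the scale of 1000 are illustrative assumptions, not the package's API or defaults.

```python
from statistics import median


def css_normalize(sample_counts, scale=1000.0):
    """Scale one sample's counts by the sum of its counts at or below
    the median of its nonzero counts (a simplified, fixed-quantile
    variant of cumulative sum scaling; all names are hypothetical)."""
    positive = [c for c in sample_counts if c > 0]
    if not positive:
        # An empty sample has nothing to scale.
        return [0.0 for _ in sample_counts]
    threshold = median(positive)
    # Only counts up to the quantile contribute to the scaling factor.
    factor = sum(c for c in positive if c <= threshold)
    return [c / factor * scale for c in sample_counts]


# Two hypothetical samples: the first has one dominant feature, the
# second was sequenced ten times deeper but is missing that feature.
sample_a = css_normalize([1, 2, 3, 100])
sample_b = css_normalize([10, 20, 30, 0])
```

In this toy example the first three features end up with the same normalized values in both samples, whereas dividing by the total count would have let the dominant feature in the first sample shrink all the others.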
The continued democratization of DNA sequencing has sparked a new wave of development of genome assembly and assembly validation methods. As individual research labs, rather than centralized centers, begin to sequence the majority of new genomes, it is important to establish best practices for genome assembly. However, recent evaluations such as GAGE and the Assemblathon have concluded that there is no single best approach to genome assembly. Instead, it is preferable to generate multiple assemblies and validate them to determine which is most useful for the desired analysis; this is a labor-intensive process that is often impossible or unfeasible.
To encourage best practices supported by the community, we present iMetAMOS, an automated ensemble assembly pipeline; iMetAMOS encapsulates the process of running, validating, and selecting a single assembly from multiple assemblies. iMetAMOS packages several leading open-source tools into a single binary that automates parameter selection and execution of multiple assemblers, scores the resulting assemblies based on multiple validation metrics, and annotates the assemblies for genes and contaminants. We demonstrate the utility of the ensemble process on 225 previously unassembled Mycobacterium tuberculosis genomes as well as a Rhodobacter sphaeroides benchmark dataset. On these real data, iMetAMOS reliably produces validated assemblies and identifies potential contamination without user intervention. In addition, intelligent parameter selection produces assemblies of R. sphaeroides comparable to or exceeding the quality of those from the GAGE-B evaluation, affecting the relative ranking of some assemblers.
Ensemble assembly with iMetAMOS provides users with multiple, validated assemblies for each genome. Although computationally limited to small or mid-sized genomes, this approach is the most effective and reproducible means for generating high-quality assemblies and enables users to select an assembly best tailored to their specific needs.
Cultivation-based assays combined with PCR or enzyme-linked immunosorbent assay (ELISA)-based methods for finding virulence factors are standard methods for detecting bacterial pathogens in stools; however, with emerging molecular technologies, new methods have become available. The aim of this study was to compare four distinct detection technologies for the identification of pathogens in stools from children under 5 years of age in The Gambia, Mali, Kenya, and Bangladesh. The children were identified, using currently accepted clinical protocols, as either controls or cases with moderate to severe diarrhea. A total of 3,610 stool samples were tested by established clinical culture techniques, 3,179 DNA samples by the Universal Biosensor assay (Ibis Biosciences, Inc.), 1,466 DNA samples by the GoldenGate assay (Illumina), and 1,006 DNA samples by sequencing of 16S rRNA genes. Each method detected different proportions of samples testing positive for each of seven enteric pathogens: enteroaggregative Escherichia coli (EAEC), enterotoxigenic E. coli (ETEC), enteropathogenic E. coli (EPEC), Shigella spp., Campylobacter jejuni, Salmonella enterica, and Aeromonas spp. The comparisons among detection methods included the frequency of positive stool samples and kappa values for pairwise comparisons. Overall, the standard culture methods detected Shigella spp., EPEC, ETEC, and EAEC in smaller proportions of the samples than either of the methods based on detection of the virulence genes from DNA in whole stools. The GoldenGate method showed the greatest agreement with the other methods. The agreement among methods was higher in cases than in controls. The new molecular technologies have high potential for sensitive identification of bacterial diarrheal pathogens.
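To make the pairwise kappa comparison concrete, here is a minimal sketch assuming Cohen's kappa on binary positive/negative calls; the abstract does not state which kappa variant was used. The function name and the example calls are hypothetical and are not study data.

```python
def cohens_kappa(calls_a, calls_b):
    """Cohen's kappa for two methods' binary calls on the same samples.

    calls_a, calls_b: equal-length sequences of 0/1 (or False/True),
    one entry per stool sample.
    """
    if len(calls_a) != len(calls_b) or len(calls_a) == 0:
        raise ValueError("need two non-empty call lists of equal length")
    n = len(calls_a)
    # Fraction of samples on which the two methods agree.
    observed = sum(bool(a) == bool(b) for a, b in zip(calls_a, calls_b)) / n
    # Agreement expected by chance from each method's positive rate.
    rate_a = sum(bool(a) for a in calls_a) / n
    rate_b = sum(bool(b) for b in calls_b) / n
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if expected == 1.0:
        # Both methods gave one identical constant call: kappa is undefined.
        return float("nan")
    return (observed - expected) / (1 - expected)


# Hypothetical calls from two detection methods on eight samples.
culture = [1, 1, 0, 0, 1, 0, 1, 0]
molecular = [1, 0, 0, 0, 1, 0, 1, 1]
kappa = cohens_kappa(culture, molecular)  # 0.5 for these made-up calls
```

The two hypothetical methods agree on 6 of 8 samples (0.75), chance agreement is 0.5, and kappa is therefore (0.75 - 0.5) / (1 - 0.5) = 0.5.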
Since its launch in 2004, the open-source AMOS project has released several innovative DNA sequence analysis applications including: Hawkeye, a visual analytics tool for inspecting the structure of genome assemblies; the Assembly Forensics and FRCurve pipelines for systematically evaluating the quality of a genome assembly; and AMOScmp, the first comparative genome assembler. These applications have been used to assemble and analyze dozens of genomes ranging in complexity from simple microbial species through mammalian genomes. Recent efforts have been focused on enhancing support for new data characteristics brought on by second- and now third-generation sequencing. This review describes the major components of AMOS in light of these challenges, with an emphasis on methods for assessing assembly quality and the visual analytics capabilities of Hawkeye. These interactive graphical aspects are essential for navigating and understanding the complexities of a genome assembly, from the overall genome structure down to individual bases. Hawkeye and AMOS are available open source at http://amos.sourceforge.net.
DNA Sequencing; genome assembly; assembly forensics; visual analytics
Estimates of the prevalence of Shigella spp. are limited by the suboptimal sensitivity of current diagnostic and surveillance methods. We used a quantitative PCR (qPCR) assay to detect Shigella in the stool samples of 3,533 children aged <59 months from the Gambia, Mali, Kenya, and Bangladesh, with or without moderate-to-severe diarrhea (MSD). We compared the results from conventional culture to those from qPCR for the Shigella ipaH gene. Using MSD as the reference standard, we determined the optimal cutpoint to be 2.9 × 10^4 ipaH copies per 100 ng of stool DNA for set 1 (n = 877). One hundred fifty-eight (18%) specimens yielded >2.9 × 10^4 ipaH copies. Ninety (10%) specimens were positive by traditional culture for Shigella. Individuals with ≥2.9 × 10^4 ipaH copies have 5.6-times-higher odds of having diarrhea than those with <2.9 × 10^4 ipaH copies (95% confidence interval, 3.7 to 8.5; P < 0.0001). Nearly identical results were found using an independent set of samples. qPCR detected 155 additional MSD cases with high copy numbers of ipaH, a 90% increase from the 172 cases detected by culture in both samples. Among a subset (n = 2,874) comprising MSD cases and their age-, gender-, and location-matched controls, the fraction of MSD cases that were attributable to Shigella infection increased from 9.6% (n = 129) for culture to 17.6% (n = 262) for qPCR when employing our cutpoint. We suggest that qPCR with a cutpoint of approximately 1.4 × 10^4 ipaH copies be the new reference standard for the detection and diagnosis of shigellosis in children in low-income countries. The acceptance of this new standard would substantially increase the fraction of MSD cases that are attributable to Shigella.
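As an illustration of how an odds ratio and its 95% confidence interval can be derived from a 2×2 table split at a copy-number cutpoint, the sketch below uses the standard unadjusted (Wald) formula. The counts are hypothetical and are not the study's data, and the study's own analysis may have used a different, possibly matched or adjusted, method.

```python
import math


def odds_ratio_wald_ci(cases_above, controls_above,
                       cases_below, controls_below, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval.

    'above' and 'below' refer to a hypothetical ipaH copy-number
    cutpoint; z=1.96 gives an approximate 95% interval.
    """
    cells = (cases_above, controls_above, cases_below, controls_below)
    if min(cells) <= 0:
        raise ValueError("all four cells must be positive for this formula")
    odds_ratio = (cases_above * controls_below) / (controls_above * cases_below)
    # Standard error of log(OR) is the root of the summed reciprocal counts.
    se_log_or = math.sqrt(sum(1.0 / c for c in cells))
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical table: 40 cases and 10 controls above the cutpoint,
# 60 cases and 90 controls below it.
or_value, ci_low, ci_high = odds_ratio_wald_ci(40, 10, 60, 90)
```

With these made-up counts the odds ratio is (40 × 90) / (10 × 60) = 6.0, and the interval is symmetric around it on the log scale.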
The current revolution in genomics has been made possible by software tools called genome assemblers, which stitch together DNA fragments “read” by sequencing machines into complete or nearly complete genome sequences. Despite decades of research in this field and the development of dozens of genome assemblers, assessing and comparing the quality of assembled genome sequences still relies on the availability of independently determined standards, such as manually curated genome sequences, or independently produced mapping data. These “gold standards” can be expensive to produce and may only cover a small fraction of the genome, which limits their applicability to newly generated genome sequences. Here we introduce a de novo probabilistic measure of assembly quality which allows for an objective comparison of multiple assemblies generated from the same set of reads. We define the quality of a sequence produced by an assembler as the conditional probability of observing the sequenced reads from the assembled sequence. A key property of our metric is that the true genome sequence maximizes the score, unlike other commonly used metrics.
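The definition above (the conditional probability of the reads given the assembled sequence) can be illustrated with a deliberately simplified sketch. It assumes reads are sampled uniformly from one strand of a linear sequence with an independent per-base substitution error rate. These modeling choices, the function names, and the toy sequences are assumptions for illustration and are not the model used in the work described.

```python
import math


def read_probability(read, assembly, error_rate=0.01):
    """P(read | assembly) under uniform sampling of start positions and
    independent substitution errors (forward strand only)."""
    n_starts = len(assembly) - len(read) + 1
    if n_starts <= 0:
        return 0.0
    total = 0.0
    for start in range(n_starts):
        p = 1.0
        for r, a in zip(read, assembly[start:start + len(read)]):
            # A mismatch is explained as one of three possible substitutions.
            p *= (1.0 - error_rate) if r == a else error_rate / 3.0
        total += p
    return total / n_starts


def assembly_log_likelihood(reads, assembly, error_rate=0.01):
    """Sum of log P(read | assembly) over all reads."""
    log_likelihood = 0.0
    for read in reads:
        p = read_probability(read, assembly, error_rate)
        if p == 0.0:
            return float("-inf")
        log_likelihood += math.log(p)
    return log_likelihood


# Toy example: reads drawn from a "true" sequence, scored against it and
# against a candidate assembly containing one wrong base.
reads = ["ACGTA", "GTACG", "CGGTC", "GGTCA"]
score_true = assembly_log_likelihood(reads, "ACGTACGGTCA")
score_wrong = assembly_log_likelihood(reads, "ACGTACGATCA")
```

In this toy case the error-free sequence receives the higher score, because two of the reads can only be placed on the erroneous assembly by invoking a sequencing error.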
We demonstrate that our de novo score can be computed quickly and accurately in a practical setting even for large datasets, by estimating the score from a relatively small sample of the reads. To demonstrate the benefits of our score, we measure the quality of the assemblies generated in the GAGE and Assemblathon 1 assembly “bake-offs” with our metric. Even without knowledge of the true reference sequence, our de novo metric closely matches the reference-based evaluation metrics used in the studies and outperforms other de novo metrics traditionally used to measure assembly quality (such as N50). Finally, we highlight the application of our score to optimize assembly parameters used in genome assemblers, which enables better assemblies to be produced, even without prior knowledge of the genome being assembled.
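For reference, N50, the traditional contiguity metric mentioned above, can be computed from contig lengths alone, as in this short sketch (the function name and example lengths are illustrative). It is the length of the contig at which the cumulative length, counted from the longest contig down, first reaches half of the total assembly length.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Hypothetical assemblies of the same total length.
fragmented = n50([2, 3, 4, 5, 6, 7, 8, 9, 10])  # 8
joined = n50([54])                              # 54
```

Because N50 depends only on lengths, joining contigs always raises it whether or not the joins are correct, which is one reason a contiguity statistic alone cannot measure correctness.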
Likelihood-based measures, such as the one proposed here, will become the new standard for de novo assembly evaluation.
Rickettsia prowazekii is a notable intracellular pathogen, the agent of epidemic typhus, and a potential biothreat agent. We present here whole-genome sequence data for four strains of R. prowazekii, including one from a flying squirrel.
Simultaneous analysis of the gut microbiome and host gene expression in infants reveals the impact of diet (breastfeeding versus formula) on host-microbiome interactions.
See research article http://www.genomebiology.com/2012/13/4/r32
infant development; gut microbiome; host-microbiome interactions; breastfeeding
Enterotoxigenic Escherichia coli (ETEC) is an important cause of diarrhea, mainly in developing countries. Although there are 25 different ETEC adhesins described in strains affecting humans, between 15% and 50% of the clinical isolates from different geographical regions are negative for these adhesins, suggesting that additional unidentified adhesion determinants might be present. Here, we report the discovery of Coli Surface Antigen 23 (CS23), a novel adhesin expressed by an ETEC serogroup O4 strain (ETEC 1766a), which was negative for the previously known ETEC adhesins although it is able to adhere to Caco-2 cells. CS23 is encoded by an 8.8-kb locus which contains 9 open reading frames (ORFs), 7 of them sharing significant identity with genes required for assembly of K88-related fimbriae. This gene locus, named aal (adhesion-associated locus), is required for the adhesion ability of ETEC 1766a and was able to confer this adhesive phenotype to a nonadherent E. coli HB101 strain. The CS23 major structural subunit, AalE, shares limited identity with known pilin proteins, and it is more closely related to the CS13 pilin protein CshE, carried by human ETEC strains. Our data indicate that CS23 is a new member of the diverse adhesin repertoire used by ETEC strains.
We describe MetAMOS, an open source and modular metagenomic assembly and analysis pipeline. MetAMOS represents an important step towards fully automated metagenomic analysis, starting with next-generation sequencing reads and producing genomic scaffolds, open-reading frames and taxonomic or functional annotations. MetAMOS can aid in reducing assembly errors, commonly encountered when assembling metagenomic samples, and improves taxonomic assignment accuracy while also reducing computational cost. MetAMOS can be downloaded from: https://github.com/treangen/MetAMOS.
A variety of microbial communities and their genes (microbiome) exist throughout the human body, playing fundamental roles in human health and disease. The NIH-funded Human Microbiome Project (HMP) Consortium has established a population-scale framework which catalyzed significant development of metagenomic protocols, resulting in a broad range of quality-controlled resources and data, including standardized methods for creating, processing and interpreting distinct types of high-throughput metagenomic data available to the scientific community. Here we present resources from a population of 242 healthy adults sampled at 15 to 18 body sites up to three times, which, to date, have generated 5,177 microbial taxonomic profiles from 16S rRNA genes and over 3.5 Tb of metagenomic sequence. In parallel, approximately 800 human-associated reference genomes have been sequenced. Collectively, these data represent the largest resource to date describing the abundance and variety of the human microbiome, while providing a platform for current and future studies.
Motivation: Sequencing projects increasingly target samples from non-clonal sources. In particular, metagenomics has enabled scientists to begin to characterize the structure of microbial communities. The software tools developed for assembling and analyzing sequencing data for clonal organisms are, however, unable to adequately process data derived from non-clonal sources.
Results: We present a new scaffolder, Bambus 2, to address some of the challenges encountered when analyzing metagenomes. Our approach relies on a combination of a novel method for detecting genomic repeats and algorithms that analyze assembly graphs to identify biologically meaningful genomic variants. We compare our software to current assemblers using simulated and real data. We demonstrate that the repeat detection algorithms have higher sensitivity than current approaches without sacrificing specificity. In metagenomic datasets, the scaffolder avoids false joins between distantly related organisms while obtaining long-range contiguity. Bambus 2 represents a first step toward automated metagenomic assembly.
Availability: Bambus 2 is open source and available from http://amos.sf.net.
Supplementary Information: Supplementary data are available at Bioinformatics online.
Genome assembly is difficult due to repeated sequences within the genome, which create ambiguities and cause the final assembly to be broken up into many separate sequences (contigs). Long range linking information, such as mate-pairs or mapping data, is necessary to help assembly software resolve repeats, thereby leading to a more complete reconstruction of genomes. Prior work has used optical maps for validating assemblies and scaffolding contigs, after an initial assembly has been produced. However, optical maps have not previously been used within the genome assembly process. Here, we use optical map information within the popular de Bruijn graph assembly paradigm to eliminate paths in the de Bruijn graph which are not consistent with the optical map and help determine the correct reconstruction of the genome.
We developed a new algorithm called AGORA: Assembly Guided by Optical Restriction Alignment. AGORA is the first algorithm to use optical map information directly within the de Bruijn graph framework to help produce an accurate assembly of a genome that is consistent with the optical map information provided. Our simulations on bacterial genomes show that AGORA is effective at producing assemblies closely matching the reference sequences.
Additionally, we show that noise in the optical map can have a strong impact on the final assembly quality for some complex genomes, and we also measure how various characteristics of the starting de Bruijn graph may impact the quality of the final assembly. Lastly, we show that a proper choice of restriction enzyme for the optical map may substantially improve the quality of the final assembly.
Our work shows that optical maps can be used effectively to assemble genomes within the de Bruijn graph assembly framework. Our experiments also provide insights into the characteristics of the mapping data that most affect the performance of our algorithm, indicating the potential benefit of more accurate optical mapping technologies, such as nano-coding.
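The consistency check underlying this idea can be sketched as follows: a candidate sequence (for example, one spelled out by a path through the graph) is digested in silico at the restriction enzyme's recognition site, and the resulting ordered fragment lengths are compared with the optical map. This is only an illustration under strong assumptions (a linear sequence, no missed or spurious cut sites, and a simple relative sizing tolerance). The function names are hypothetical and this is not AGORA's algorithm.

```python
def in_silico_digest(sequence, site, cut_offset=0):
    """Ordered fragment lengths from cutting a linear sequence at every
    occurrence of a recognition site.

    cut_offset: position of the cut within the site (e.g. 1 for G^AATTC).
    """
    cuts = []
    pos = sequence.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = sequence.find(site, pos + 1)
    boundaries = [0] + cuts + [len(sequence)]
    return [end - start for start, end in zip(boundaries, boundaries[1:])]


def consistent_with_map(fragments, optical_map, tolerance=0.1):
    """True if both maps have the same number of fragments and each
    predicted length is within a relative tolerance of the observed one."""
    if len(fragments) != len(optical_map):
        return False
    return all(abs(f - m) <= tolerance * m
               for f, m in zip(fragments, optical_map))


# Toy candidate sequence containing two EcoRI-like sites (GAATTC).
candidate = "AAGAATTCAAAAGAATTCAA"
fragments = in_silico_digest(candidate, "GAATTC", cut_offset=1)  # [3, 10, 7]
keep_path = consistent_with_map(fragments, [3, 10, 7])
drop_path = consistent_with_map(fragments, [3, 14, 7])
```

A path whose predicted fragments disagree with the map, like the second comparison here, would be a candidate for elimination. Real optical maps are noisy, so a practical comparison would need an alignment that tolerates missing cuts and sizing error.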
The oral microbiome, the complex ecosystem of microbes inhabiting the human mouth, harbors several thousands of bacterial types. The proliferation of pathogenic bacteria within the mouth gives rise to periodontitis, an inflammatory disease known to also constitute a risk factor for cardiovascular disease. While much is known about individual species associated with pathogenesis, the system-level mechanisms underlying the transition from health to disease are still poorly understood. Through the sequencing of the 16S rRNA gene and of whole community DNA we provide a glimpse at the global genetic, metabolic, and ecological changes associated with periodontitis in 15 subgingival plaque samples, four from each of two periodontitis patients, and the remaining samples from three healthy individuals. We also demonstrate the power of whole-metagenome sequencing approaches in characterizing the genomes of key players in the oral microbiome, including an unculturable TM7 organism. We reveal the disease microbiome to be enriched in virulence factors, and adapted to a parasitic lifestyle that takes advantage of the disrupted host homeostasis. Furthermore, diseased samples share a common structure that was not found in completely healthy samples, suggesting that the disease state may occupy a narrow region within the space of possible configurations of the oral microbiome. Our pilot study demonstrates the power of high-throughput sequencing as a tool for understanding the role of the oral microbiome in periodontal disease. Despite a modest level of sequencing (∼2 lanes Illumina 76 bp PE) and high human DNA contamination (up to ∼90%) we were able to partially reconstruct several oral microbes and to preliminarily characterize some systems-level differences between the healthy and diseased oral microbiomes.
The very large memory requirements for the construction of assembly graphs for de novo genome assembly limit current algorithms to super-computing environments.
In this paper, we demonstrate that constructing a sparse assembly graph which stores only a small fraction of the observed k-mers as nodes and the links between these nodes allows the de novo assembly of even moderately-sized genomes (~500 M) on a typical laptop computer.
We implement this sparse graph concept in a proof-of-principle software package, SparseAssembler, utilizing a new sparse k-mer graph structure evolved from the de Bruijn graph. We test our SparseAssembler with both simulated and real data, achieving ~90% memory savings and retaining high assembly accuracy, without sacrificing speed in comparison to existing de novo assemblers.
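The memory saving from storing only a subset of k-mers can be illustrated with a naive sketch that keeps every `gap`-th k-mer of each read as a node and records the bases linking consecutive stored k-mers. The fixed-stride selection rule, the function name, and the toy read are assumptions for illustration. SparseAssembler's actual node selection and data structure are not described in the abstract and may differ.

```python
from collections import defaultdict


def build_sparse_kmer_graph(reads, k, gap):
    """Store every gap-th k-mer of each read as a node, plus links that
    carry the bases needed to walk from one stored k-mer to the next."""
    nodes = set()
    edges = defaultdict(set)
    for read in reads:
        previous_start = None
        previous_kmer = None
        for start in range(0, len(read) - k + 1, gap):
            kmer = read[start:start + k]
            nodes.add(kmer)
            if previous_kmer is not None:
                # Bases that extend the previous k-mer up to this one.
                link = read[previous_start + k:start + k]
                edges[previous_kmer].add((kmer, link))
            previous_start, previous_kmer = start, kmer
    return nodes, edges


def count_dense_kmers(reads, k):
    """Number of distinct k-mers a full (dense) k-mer graph would store."""
    return len({r[i:i + k] for r in reads for i in range(len(r) - k + 1)})


# Toy read: 9 distinct 4-mers densely, 3 when keeping every third one.
reads = ["ACGTTGCAAGCT"]
nodes, edges = build_sparse_kmer_graph(reads, k=4, gap=3)
dense_count = count_dense_kmers(reads, k=4)
```

Here the sparse graph holds 3 nodes instead of 9. Unlike this sketch, a practical implementation must also ensure that overlapping reads select the same k-mers, otherwise reads that start at different offsets would never share nodes.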
A Modular Open-Source Assembler (AMOS) was designed to offer a modular approach to genome assembly. AMOS includes a wide range of tools for assembly, including lightweight de novo assemblers Minimus and Minimo, and Bambus 2, a robust scaffolder able to handle metagenomic and polymorphic data. This protocol describes how to configure and use AMOS for the assembly of Next Generation sequence data. Additionally, we provide three tutorial examples that include bacterial, viral, and metagenomic datasets with specific tips for improving assembly quality.
Next-generation sequencing; genome assembly; Open-Source
Environmental shotgun sequencing (or metagenomics) is widely used to survey the communities of microbial organisms that live in many diverse ecosystems, such as the human body. Finding the protein-coding genes within the sequences is an important step for assessing the functional capacity of a metagenome. In this work, we developed a metagenomics gene prediction system Glimmer-MG that achieves significantly greater accuracy than previous systems via novel approaches to a number of important prediction subtasks. First, we introduce the use of phylogenetic classifications of the sequences to model parameterization. We also cluster the sequences, grouping together those that likely originated from the same organism. Analogous to iterative schemes that are useful for whole genomes, we retrain our models within each cluster on the initial gene predictions before making final predictions. Finally, we model both insertion/deletion and substitution sequencing errors using a different approach than previous software, allowing Glimmer-MG to change coding frame or pass through stop codons by predicting an error. In a comparison among multiple gene finding methods, Glimmer-MG makes the most sensitive and precise predictions on simulated and real metagenomes for all read lengths and error rates tested.