The human microbiome has become a recognized factor in promoting and maintaining health. We outline opportunities in interdisciplinary research, analytical rigor, standardization, and policy development for this relatively new and rapidly developing field. Advances in these aspects of the research community may in turn advance our understanding of human microbiome biology.
The complete genome sequence of the radiation resistant bacterium Deinococcus radiodurans R1 is composed of two chromosomes (2,648,615 and 412,340 basepairs), a megaplasmid (177,466 basepairs), and a small plasmid (45,702 basepairs) yielding a total genome of 3,284,123 basepairs. Multiple components distributed on the chromosomes and megaplasmid that contribute to the ability of D. radiodurans to survive under conditions of starvation, oxidative stress, and high levels of DNA-damage have been identified. D. radiodurans represents an organism in which all systems for DNA repair, DNA damage export, desiccation and starvation recovery, and genetic redundancy are present in one cell.
This manuscript calls for an international effort to generate a comprehensive catalog from genome sequences of all the archaeal and bacterial type strains.
Microbes hold the key to life. They hold the secrets to our past (as the descendants of the earliest forms of life) and the prospects for our future (as we mine their genes for solutions to some of the planet's most pressing problems, from global warming to antibiotic resistance). However, the piecemeal approach that has defined efforts to study microbial genetic diversity for over 20 years and in over 30,000 genome projects risks squandering that promise. These efforts have covered less than 20% of the diversity of the cultured archaeal and bacterial species, which represent just 15% of the overall known prokaryotic diversity. Here we call for the funding of a systematic effort to produce a comprehensive genomic catalog of all cultured Bacteria and Archaea by sequencing, where available, the type strain of each species with a validly published name (currently∼11,000). This effort will provide an unprecedented level of coverage of our planet's genetic diversity, allow for the large-scale discovery of novel genes and functions, and lead to an improved understanding of microbial evolution and function in the environment.
Summary: Circleator is a Perl application that generates circular figures of genome-associated data. It leverages BioPerl to support standard annotation and sequence file formats and produces publication-quality SVG output. It is designed to be both flexible and easy to use. It includes a library of circular track types and predefined configuration files for common use-cases, including. (i) visualizing gene annotation and DNA sequence data from a GenBank flat file, (ii) displaying patterns of gene conservation in related microbial strains, (iii) showing Single Nucleotide Polymorphisms (SNPs) and indels relative to a reference genome and gene set and (iv) viewing RNA-Seq plots.
Availability and implementation: Circleator is freely available under the Artistic License 2.0 from http://jonathancrabtree.github.io/Circleator/ and is integrated with the CloVR cloud-based sequence analysis Virtual Machine (VM), which can be downloaded from http://clovr.org or run on Amazon EC2.
Supplementary data are available at Bioinformatics online.
The Evidence Ontology (ECO) is a structured, controlled vocabulary for capturing evidence in biological research. ECO includes diverse terms for categorizing evidence that supports annotation assertions including experimental types, computational methods, author statements and curator inferences. Using ECO, annotation assertions can be distinguished according to the evidence they are based on such as those made by curators versus those automatically computed or those made via high-throughput data review versus single test experiments. Originally created for capturing evidence associated with Gene Ontology annotations, ECO is now used in other capacities by many additional annotation resources including UniProt, Mouse Genome Informatics, Saccharomyces Genome Database, PomBase, the Protein Information Resource and others. Information on the development and use of ECO can be found at http://evidenceontology.org. The ontology is freely available under Creative Commons license (CC BY-SA 3.0), and can be downloaded in both Open Biological Ontologies and Web Ontology Language formats at http://code.google.com/p/evidenceontology. Also at this site is a tracker for user submission of term requests and questions. ECO remains under active development in response to user-requested terms and in collaborations with other ontologies and database resources.
Database URL: Evidence Ontology Web site: http://evidenceontology.org
A conference entitled ‘Human microbiome science: Vision for the future’ was organized in Bethesda, MD from July 24 to 26, 2013. The event brought together experts in the field of human microbiome research and aimed at providing a comprehensive overview of the state of microbiome research, but more importantly to identify and discuss gaps, challenges and opportunities in this nascent field. This report summarizes the presentations but also describes what is needed for human microbiome research to move forward and deliver medical translational applications.
High throughput sequencing has accelerated the determination of genome sequences for thousands of human infectious disease pathogens and dozens of their vectors. The scale and scope of these data are enabling genotype-phenotype association studies to identify genetic determinants of pathogen virulence and drug/insecticide resistance, and phylogenetic studies to track the origin and spread of disease outbreaks. To maximize the utility of genomic sequences for these purposes, it is essential that metadata about the pathogen/vector isolate characteristics be collected and made available in organized, clear, and consistent formats. Here we report the development of the GSCID/BRC Project and Sample Application Standard, developed by representatives of the Genome Sequencing Centers for Infectious Diseases (GSCIDs), the Bioinformatics Resource Centers (BRCs) for Infectious Diseases, and the U.S. National Institute of Allergy and Infectious Diseases (NIAID), part of the National Institutes of Health (NIH), informed by interactions with numerous collaborating scientists. It includes mapping to terms from other data standards initiatives, including the Genomic Standards Consortium’s minimal information (MIxS) and NCBI’s BioSample/BioProjects checklists and the Ontology for Biomedical Investigations (OBI). The standard includes data fields about characteristics of the organism or environmental source of the specimen, spatial-temporal information about the specimen isolation event, phenotypic characteristics of the pathogen/vector isolated, and project leadership and support. By modeling metadata fields into an ontology-based semantic framework and reusing existing ontologies and minimum information checklists, the application standard can be extended to support additional project-specific data fields and integrated with other data represented with comparable standards. The use of this metadata standard by all ongoing and future GSCID sequencing projects will provide a consistent representation of these data in the BRC resources and other repositories that leverage these data, allowing investigators to identify relevant genomic sequences and perform comparative genomics analyses that are both statistically meaningful and biologically relevant.
The Genomic Standards Consortium (GSC) is an open-membership community that was founded in 2005 to work towards the development, implementation and harmonization of standards in the field of genomics. Starting with the defined task of establishing a minimal set of descriptions the GSC has evolved into an active standards-setting body that currently has 18 ongoing projects, with additional projects regularly proposed from within and outside the GSC. Here we describe our recently enacted policy for proposing new activities that are intended to be taken on by the GSC, along with the template for proposing such new activities.
The parasite Plasmodium falciparum is responsible for hundreds of millions of cases of malaria, and kills more than one million African children annually. Here we report an analysis of the genome sequence of P. falciparum clone 3D7. The 23-megabase nuclear genome consists of 14 chromosomes, encodes about 5,300 genes, and is the most (A + T)-rich genome sequenced to date. Genes involved in antigenic variation are concentrated in the subtelomeric regions of the chromosomes. Compared to the genomes of free-living eukaryotic microbes, the genome of this intracellular parasite encodes fewer enzymes and transporters, but a large proportion of genes are devoted to immune evasion and host–parasite interactions. Many nuclear-encoded proteins are targeted to the apicoplast, an organelle involved in fatty-acid and isoprenoid metabolism. The genome sequence provides the foundation for future studies of this organism, and is being exploited in the search for new drugs and vaccines to fight malaria.
Microbial genome sequencing is one of the longest-standing areas of biological database development, but high-throughput, low-cost technologies have increased its throughput to an unprecedented number of new genomes per year. Several thousand microbial genomes are now available, necessitating new approaches to organizing information on gene function, phylogeny and microbial taxonomy to facilitate downstream biological interpretation. MetaRef, available at http://metaref.org, is a novel online resource systematically cataloguing a comprehensive pan-genome of all microbial clades with sequenced isolates. It organizes currently available draft and finished bacterial and archaeal genomes into quality-controlled clades, reports all core and pan gene families at multiple levels in the resulting taxonomy, and it annotates families’ conservation, phylogeny and consensus functional information. MetaRef also provides a comprehensive non-redundant reference gene catalogue for metagenomic studies, including the abundance and prevalence of all gene families in the >700 shotgun metagenomic samples of the Human Microbiome Project. This constitutes a systematic mapping of clade-specific microbial functions within the healthy human microbiome across multiple body sites and can be used as reference for identifying potential functional biomarkers in disease-associate microbiomes. MetaRef provides all information both as an online browsable resource and as downloadable sequences and tabular data files that can be used for subsequent offline studies.
Culex quinquefasciatus (the Southern house mosquito) is an important mosquito vector of viruses such as West Nile virus and St. Louis encephalitis virus as well of nematodes that cause lymphatic filariasis. It is one species within the Culex pipiens species complex and enjoys a distribution throughout tropical and temperate climates of the world. The ability of C. quinquefasciatus to take blood meals from birds, livestock and humans contributes to its ability to vector pathogens between species. We describe the genomic sequence of C. quinquefasciatus, its repertoire of 18,883 protein-coding genes is 22% larger than Ae. aegypti and 52% larger than An. gambiae with multiple gene family expansions including olfactory and gustatory receptors, salivary gland genes, and genes associated with xenobiotic detoxification.
Besides the development of comprehensive tools for high-throughput 16S ribosomal RNA amplicon sequence analysis, there exists a growing need for protocols emphasizing alternative phylogenetic markers such as those representing eukaryotic organisms.
Here we introduce CloVR-ITS, an automated pipeline for comparative analysis of internal transcribed spacer (ITS) pyrosequences amplified from metagenomic DNA isolates and representing fungal species. This pipeline performs a variety of steps similar to those commonly used for 16S rRNA amplicon sequence analysis, including preprocessing for quality, chimera detection, clustering of sequences into operational taxonomic units (OTUs), taxonomic assignment (at class, order, family, genus, and species levels) and statistical analysis of sample groups of interest based on user-provided information. Using ITS amplicon pyrosequencing data from a previous human gastric fluid study, we demonstrate the utility of CloVR-ITS for fungal microbiota analysis and provide runtime and cost examples, including analysis of extremely large datasets on the cloud. We show that the largest fractions of reads from the stomach fluid samples were assigned to Dothideomycetes, Saccharomycetes, Agaricomycetes and Sordariomycetes but that all samples were dominated by sequences that could not be taxonomically classified. Representatives of the Candida genus were identified in all samples, most notably C. quercitrusa, while sequence reads assigned to the Aspergillus genus were only identified in a subset of samples. CloVR-ITS is made available as a pre-installed, automated, and portable software pipeline for cloud-friendly execution as part of the CloVR virtual machine package (http://clovr.org).
The CloVR-ITS pipeline provides fungal microbiota analysis that can be complementary to bacterial 16S rRNA and total metagenome sequence analysis allowing for more comprehensive studies of environmental and host-associated microbial communities.
Internal transcribed spacer (ITS); Fungal microbiota; Automated sequence analysis pipeline; Cloud computing
A variety of microbial communities and their genes (microbiome) exist throughout the human body, playing fundamental roles in human health and disease. The NIH funded Human Microbiome Project (HMP) Consortium has established a population-scale framework which catalyzed significant development of metagenomic protocols resulting in a broad range of quality-controlled resources and data including standardized methods for creating, processing and interpreting distinct types of high-throughput metagenomic data available to the scientific community. Here we present resources from a population of 242 healthy adults sampled at 15 to 18 body sites up to three times, which to date, have generated 5,177 microbial taxonomic profiles from 16S rRNA genes and over 3.5 Tb of metagenomic sequence. In parallel, approximately 800 human-associated reference genomes have been sequenced. Collectively, these data represent the largest resource to date describing the abundance and variety of the human microbiome, while providing a platform for current and future studies.
This manuscript describes the NIH Human Microbiome Project, including a brief review of human microbiome research, a history of the project, and a comprehensive overview of the consortium's recent collection of publications analyzing the human microbiome.
Variability in the extent of the descriptions of data (‘metadata’) held in public repositories forces users to assess the quality of records individually, which rapidly becomes impractical. The scoring of records on the richness of their description provides a simple, objective proxy measure for quality that enables filtering that supports downstream analysis. Pivotally, such descriptions should spur on improvements. Here, we introduce such a measure - the ‘Metadata Coverage Index’ (MCI): the percentage of available fields actually filled in a record or description. MCI scores can be calculated across a database, for individual records or for their component parts (e.g., fields of interest). There are many potential uses for this simple metric: for example; to filter, rank or search for records; to assess the metadata availability of an ad hoc collection; to determine the frequency with which fields in a particular record type are filled, especially with respect to standards compliance; to assess the utility of specific tools and resources, and of data capture practice more generally; to prioritize records for further curation; to serve as performance metrics of funded projects; or to quantify the value added by curation. Here we demonstrate the utility of MCI scores using metadata from the Genomes Online Database (GOLD), including records compliant with the ‘Minimum Information about a Genome Sequence’ (MIGS) standard developed by the Genomic Standards Consortium. We discuss challenges and address the further application of MCI scores; to show improvements in annotation quality over time, to inform the work of standards bodies and repository providers on the usability and popularity of their products, and to assess and credit the work of curators. Such an index provides a step towards putting metadata capture practices and in the future, standards compliance, into a quantitative and objective framework.
Microbial communities carry out the majority of the biochemical activity on the planet, and they play integral roles in processes including metabolism and immune homeostasis in the human microbiome. Shotgun sequencing of such communities' metagenomes provides information complementary to organismal abundances from taxonomic markers, but the resulting data typically comprise short reads from hundreds of different organisms and are at best challenging to assemble comparably to single-organism genomes. Here, we describe an alternative approach to infer the functional and metabolic potential of a microbial community metagenome. We determined the gene families and pathways present or absent within a community, as well as their relative abundances, directly from short sequence reads. We validated this methodology using a collection of synthetic metagenomes, recovering the presence and abundance both of large pathways and of small functional modules with high accuracy. We subsequently applied this method, HUMAnN, to the microbial communities of 649 metagenomes drawn from seven primary body sites on 102 individuals as part of the Human Microbiome Project (HMP). This provided a means to compare functional diversity and organismal ecology in the human microbiome, and we determined a core of 24 ubiquitously present modules. Core pathways were often implemented by different enzyme families within different body sites, and 168 functional modules and 196 metabolic pathways varied in metagenomic abundance specifically to one or more niches within the microbiome. These included glycosaminoglycan degradation in the gut, as well as phosphate and amino acid transport linked to host phenotype (vaginal pH) in the posterior fornix. An implementation of our methodology is available at http://huttenhower.sph.harvard.edu/humann. This provides a means to accurately and efficiently characterize microbial metabolic pathways and functional modules directly from high-throughput sequencing reads, enabling the determination of community roles in the HMP cohort and in future metagenomic studies.
The human body is inhabited by trillions of bacteria and other microbes, which have recently been studied in many different habitats (including gut, mouth, skin, and urogenital) by the Human Microbiome Project (HMP). These microbial communities were assayed using high-throughput DNA sequencing, but it can be challenging to determine their biological functions based solely on the resulting short sequences. To reconstruct the metabolic activities of such communities, we have developed HUMAnN, a method to accurately infer community function directly from short DNA reads. The method's accuracy was validated using a collection of synthetic microbial communities. Applying HUMAnN to data from the HMP, we showed that, unlike individual microbial species, many metabolic processes were present among all body habitats. However, the frequencies of these processes varied dramatically, and some were highly enriched within individual habitats to provide niche specialization (e.g. in the gut, which is abundant in food matter but low in oxygen). Other community functions were linked specifically to properties of the human host, such as biochemical processes only present in vaginal habitats with particularly high or low pH. Studying additional environmental or disease-associated communities using HUMAnN will further improve our understanding of how the microbial organisms in a community are linked to the biological processes they carry out.
Here we present a standard developed by the Genomic Standards Consortium (GSC) for reporting marker gene sequences—the minimum information about a marker gene sequence (MIMARKS). We also introduce a system for describing the environment from which a biological sample originates. The ‘environmental packages’ apply to any genome sequence of known origin and can be used in combination with MIMARKS and other GSC checklists. Finally, to establish a unified standard for describing sequence data and to provide a single point of entry for the scientific community to access and learn about GSC checklists, we present the minimum information about any (x) sequence (MIxS). Adoption of MIxS will enhance our ability to analyze natural genetic diversity documented by massive DNA sequencing efforts from myriad ecosystems in our ever-changing biosphere.
Microbial ecology has been enhanced greatly by the ongoing ‘omics revolution, bringing half the world's biomass and most of its biodiversity into analytical view for the first time; indeed, it feels almost like the invention of the microscope and the discovery of the new world at the same time. With major microbial ecology research efforts accumulating prodigious quantities of sequence, protein, and metabolite data, we are now poised to address environmental microbial research at macro scales, and to begin to characterize and understand the dimensions of microbial biodiversity on the planet. What is currently impeding progress is the need for a framework within which the research community can develop, exchange and discuss predictive ecosystem models that describe the biodiversity and functional interactions. Such a framework must encompass data and metadata transparency and interoperation; data and results validation, curation, and search; application programming interfaces for modeling and analysis tools; and human and technical processes and services necessary to ensure broad adoption. Here we discuss the need for focused community interaction to augment and deepen established community efforts, beginning with the Genomic Standards Consortium (GSC), to create a science-driven strategic plan for a Genomic Software Institute (GSI).
We have previously demonstrated that assessment of antisaccades (AS) provides not only measures of motor function in multiple sclerosis (MS), but measures of cognitive control processes in particular, attention and working memory. This study sought to demonstrate the potential for AS measures to sensitively reflect change in functional status in MS. Twenty-four patients with relapsing-remitting MS and 12 age-matched controls were evaluated longitudinally using an AS saccade task. Compared to control subjects, a number of saccade parameters changed significantly over a two year period for MS patients. These included saccade error rates, latencies, and accuracy measures. Further, for MS patients, correlations were retained between OM measures and scores on the PASAT, which is considered the reference task for the cognitive evaluation of MS patients. Notably, EDSS scores for these patients did not change significantly over this period. These results demonstrate that OM measures may reflect disease evolution in MS, in the absence of clinically evident changes as measured using conventional techniques. With replication, these markers could ultimately be developed into a cost-effective, non-invasive, and well tolerated assessment tool to assist in confirming progression early in the disease process, and in measuring and predicting response to therapy.
Motor impairments have been found to be a significant clinical feature associated with autism and Asperger’s disorder (AD) in addition to core symptoms of communication and social cognition deficits. Motor deficits in high-functioning autism (HFA) and AD may differentiate these disorders, particularly with respect to the role of the cerebellum in motor functioning. Current neuroimaging and behavioral evidence suggests greater disruption of the cerebellum in HFA than AD. Investigations of ocular motor functioning have previously been used in clinical populations to assess the integrity of the cerebellar networks, through examination of saccade accuracy and the integrity of saccade dynamics. Previous investigations of visually guided saccades in HFA and AD have only assessed basic saccade metrics, such as latency, amplitude, and gain, as well as peak velocity. We used a simple visually guided saccade paradigm to further characterize the profile of visually guided saccade metrics and dynamics in HFA and AD. It was found that children with HFA, but not AD, were more inaccurate across both small (5°) and large (10°) target amplitudes, and final eye position was hypometric at 10°. These findings suggest greater functional disturbance of the cerebellum in HFA than AD, and suggest fundamental difficulties with visual error monitoring in HFA.
autism; Asperger’s disorder; saccades; eye movements; Verbal Comprehension Index
The widespread popularity of genomic applications is threatened by the “bioinformatics bottleneck” resulting from uncertainty about the cost and infrastructure needed to meet increasing demands for next-generation sequence analysis. Cloud computing services have been discussed as potential new bioinformatics support systems but have not been evaluated thoroughly.
We present benchmark costs and runtimes for common microbial genomics applications, including 16S rRNA analysis, microbial whole-genome shotgun (WGS) sequence assembly and annotation, WGS metagenomics and large-scale BLAST. Sequence dataset types and sizes were selected to correspond to outputs typically generated by small- to midsize facilities equipped with 454 and Illumina platforms, except for WGS metagenomics where sampling of Illumina data was used. Automated analysis pipelines, as implemented in the CloVR virtual machine, were used in order to guarantee transparency, reproducibility and portability across different operating systems, including the commercial Amazon Elastic Compute Cloud (EC2), which was used to attach real dollar costs to each analysis type. We found considerable differences in computational requirements, runtimes and costs associated with different microbial genomics applications. While all 16S analyses completed on a single-CPU desktop in under three hours, microbial genome and metagenome analyses utilized multi-CPU support of up to 120 CPUs on Amazon EC2, where each analysis completed in under 24 hours for less than $60. Representative datasets were used to estimate maximum data throughput on different cluster sizes and to compare costs between EC2 and comparable local grid servers.
Although bioinformatics requirements for microbial genomics depend on dataset characteristics and the analysis protocols applied, our results suggests that smaller sequencing facilities (up to three Roche/454 or one Illumina GAIIx sequencer) invested in 16S rRNA amplicon sequencing, microbial single-genome and metagenomics WGS projects can achieve cost-efficient bioinformatics support using CloVR in combination with Amazon EC2 as an alternative to local computing centers.
The promise of genome sequencing was that the vast undiscovered country would be mapped out by comparison of the multitude of sequences available and would aid researchers in deciphering the role of each gene in every organism. Researchers recognize that there is a need for high quality data. However, different annotation procedures, numerous databases, and a diminishing percentage of experimentally determined gene functions have resulted in a spectrum of annotation quality. NCBI in collaboration with sequencing centers, archival databases, and researchers, has developed the first international annotation standards, a fundamental step in ensuring that high quality complete prokaryotic genomes are available as gold standard references. Highlights include the development of annotation assessment tools, community acceptance of protein naming standards, comparison of annotation resources to provide consistent annotation, and improved tracking of the evidence used to generate a particular annotation. The development of a set of minimal standards, including the requirement for annotated complete prokaryotic genomes to contain a full set of ribosomal RNAs, transfer RNAs, and proteins encoding core conserved functions, is an historic milestone. The use of these standards in existing genomes and future submissions will increase the quality of databases, enabling researchers to make accurate biological discoveries.
Next-generation sequencing technologies have decentralized sequence acquisition, increasing the demand for new bioinformatics tools that are easy to use, portable across multiple platforms, and scalable for high-throughput applications. Cloud computing platforms provide on-demand access to computing infrastructure over the Internet and can be used in combination with custom built virtual machines to distribute pre-packaged with pre-configured software.
We describe the Cloud Virtual Resource, CloVR, a new desktop application for push-button automated sequence analysis that can utilize cloud computing resources. CloVR is implemented as a single portable virtual machine (VM) that provides several automated analysis pipelines for microbial genomics, including 16S, whole genome and metagenome sequence analysis. The CloVR VM runs on a personal computer, utilizes local computer resources and requires minimal installation, addressing key challenges in deploying bioinformatics workflows. In addition CloVR supports use of remote cloud computing resources to improve performance for large-scale sequence processing. In a case study, we demonstrate the use of CloVR to automatically process next-generation sequencing data on multiple cloud computing platforms.
The CloVR VM and associated architecture lowers the barrier of entry for utilizing complex analysis protocols on both local single- and multi-core computers and cloud systems for high throughput data processing.
A vast and rich body of information has grown up as a result of the world's enthusiasm for 'omics technologies. Finding ways to describe and make available this information that maximise its usefulness has become a major effort across the 'omics world. At the heart of this effort is the Genomic Standards Consortium (GSC), an open-membership organization that drives community-based standardization activities, Here we provide a short history of the GSC, provide an overview of its range of current activities, and make a call for the scientific community to join forces to improve the quality and quantity of contextual information about our public collections of genomes, metagenomes, and marker gene sequences.