The parasite Plasmodium falciparum is responsible for hundreds of millions of cases of malaria, and kills more than one million African children annually. Here we report an analysis of the genome sequence of P. falciparum clone 3D7. The 23-megabase nuclear genome consists of 14 chromosomes, encodes about 5,300 genes, and is the most (A + T)-rich genome sequenced to date. Genes involved in antigenic variation are concentrated in the subtelomeric regions of the chromosomes. Compared to the genomes of free-living eukaryotic microbes, the genome of this intracellular parasite encodes fewer enzymes and transporters, but a large proportion of genes are devoted to immune evasion and host–parasite interactions. Many nuclear-encoded proteins are targeted to the apicoplast, an organelle involved in fatty-acid and isoprenoid metabolism. The genome sequence provides the foundation for future studies of this organism, and is being exploited in the search for new drugs and vaccines to fight malaria.
This manuscript describes the NIH Human Microbiome Project, including a brief review of human microbiome research, a history of the project, and a comprehensive overview of the consortium's recent collection of publications analyzing the human microbiome.
Variability in the extent of the descriptions of data (‘metadata’) held in public repositories forces users to assess the quality of records individually, which rapidly becomes impractical. The scoring of records on the richness of their description provides a simple, objective proxy measure for quality that enables filtering that supports downstream analysis. Pivotally, such scoring should also spur improvements in the descriptions themselves. Here we introduce such a measure, the ‘Metadata Coverage Index’ (MCI): the percentage of available fields actually filled in a record or description. MCI scores can be calculated across a database, for individual records or for their component parts (e.g., fields of interest). There are many potential uses for this simple metric: for example, to filter, rank or search for records; to assess the metadata availability of an ad hoc collection; to determine the frequency with which fields in a particular record type are filled, especially with respect to standards compliance; to assess the utility of specific tools and resources, and of data capture practice more generally; to prioritize records for further curation; to serve as performance metrics of funded projects; or to quantify the value added by curation. Here we demonstrate the utility of MCI scores using metadata from the Genomes Online Database (GOLD), including records compliant with the ‘Minimum Information about a Genome Sequence’ (MIGS) standard developed by the Genomic Standards Consortium. We discuss challenges and address further applications of MCI scores: to show improvements in annotation quality over time, to inform the work of standards bodies and repository providers on the usability and popularity of their products, and to assess and credit the work of curators. Such an index provides a step towards putting metadata capture practices and, in the future, standards compliance into a quantitative and objective framework.
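As a minimal illustration of the metric defined above, the sketch below computes an MCI score for a single record as the percentage of available fields that are filled. The field names and the notion of a "filled" value are hypothetical simplifications; this is not the implementation used for GOLD.

```python
# Minimal sketch of a Metadata Coverage Index (MCI) calculation, assuming a
# record is a simple field -> value mapping; field names and the definition
# of "filled" (any non-empty value) are hypothetical simplifications.

def mci(record, available_fields):
    """Return the percentage of available fields that are filled in a record."""
    filled = sum(1 for f in available_fields if record.get(f) not in (None, ""))
    return 100.0 * filled / len(available_fields)

# Example: a toy genome record scored against five hypothetical fields.
fields = ["habitat", "collection_date", "lat_lon", "sequencing_method", "isolation_source"]
record = {"habitat": "marine", "collection_date": "2009-06-01", "lat_lon": ""}
print(f"MCI = {mci(record, fields):.1f}%")  # 2 of 5 fields filled -> MCI = 40.0%
```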
Microbial communities carry out the majority of the biochemical activity on the planet, and they play integral roles in processes including metabolism and immune homeostasis in the human microbiome. Shotgun sequencing of such communities' metagenomes provides information complementary to organismal abundances from taxonomic markers, but the resulting data typically comprise short reads from hundreds of different organisms and are at best challenging to assemble comparably to single-organism genomes. Here, we describe an alternative approach to infer the functional and metabolic potential of a microbial community metagenome. We determined the gene families and pathways present or absent within a community, as well as their relative abundances, directly from short sequence reads. We validated this methodology using a collection of synthetic metagenomes, recovering the presence and abundance both of large pathways and of small functional modules with high accuracy. We subsequently applied this method, HUMAnN, to the microbial communities of 649 metagenomes drawn from seven primary body sites on 102 individuals as part of the Human Microbiome Project (HMP). This provided a means to compare functional diversity and organismal ecology in the human microbiome, and we determined a core of 24 ubiquitously present modules. Core pathways were often implemented by different enzyme families within different body sites, and 168 functional modules and 196 metabolic pathways varied in metagenomic abundance specifically to one or more niches within the microbiome. These included glycosaminoglycan degradation in the gut, as well as phosphate and amino acid transport linked to host phenotype (vaginal pH) in the posterior fornix. An implementation of our methodology is available at http://huttenhower.sph.harvard.edu/humann. This provides a means to accurately and efficiently characterize microbial metabolic pathways and functional modules directly from high-throughput sequencing reads, enabling the determination of community roles in the HMP cohort and in future metagenomic studies.
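To illustrate the general idea of deriving pathway-level abundances from read-level assignments, the sketch below sums hypothetical read-to-gene-family hits over hypothetical pathway definitions. It is only a toy illustration and does not reproduce HUMAnN's actual algorithm.

```python
# Toy sketch (not HUMAnN itself) of inferring pathway abundance from read hits:
# counts of reads assigned to gene families are summed over hypothetical
# pathway definitions. All identifiers below are placeholders.
from collections import Counter

# Hypothetical read -> gene family assignments (e.g., from a translated search).
read_hits = ["K00001", "K00001", "K00002", "K00927", "K00927", "K00927"]

# Hypothetical pathway definitions mapping pathways to their member gene families.
pathways = {
    "glycolysis_fragment": {"K00927", "K00002"},
    "alcohol_metabolism_fragment": {"K00001"},
}

family_abundance = Counter(read_hits)

# Pathway abundance here is simply the sum of member family abundances;
# real methods apply more careful coverage and gap-filling logic.
pathway_abundance = {
    name: sum(family_abundance[f] for f in members)
    for name, members in pathways.items()
}
print(pathway_abundance)  # {'glycolysis_fragment': 4, 'alcohol_metabolism_fragment': 2}
```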
The human body is inhabited by trillions of bacteria and other microbes, which have recently been studied in many different habitats (including gut, mouth, skin, and urogenital) by the Human Microbiome Project (HMP). These microbial communities were assayed using high-throughput DNA sequencing, but it can be challenging to determine their biological functions based solely on the resulting short sequences. To reconstruct the metabolic activities of such communities, we have developed HUMAnN, a method to accurately infer community function directly from short DNA reads. The method's accuracy was validated using a collection of synthetic microbial communities. Applying HUMAnN to data from the HMP, we showed that, unlike individual microbial species, many metabolic processes were present among all body habitats. However, the frequencies of these processes varied dramatically, and some were highly enriched within individual habitats to provide niche specialization (e.g. in the gut, which is abundant in food matter but low in oxygen). Other community functions were linked specifically to properties of the human host, such as biochemical processes only present in vaginal habitats with particularly high or low pH. Studying additional environmental or disease-associated communities using HUMAnN will further improve our understanding of how the microbial organisms in a community are linked to the biological processes they carry out.
Microbial ecology has been enhanced greatly by the ongoing ‘omics revolution, bringing half the world's biomass and most of its biodiversity into analytical view for the first time; indeed, it feels almost like the invention of the microscope and the discovery of the new world at the same time. With major microbial ecology research efforts accumulating prodigious quantities of sequence, protein, and metabolite data, we are now poised to address environmental microbial research at macro scales, and to begin to characterize and understand the dimensions of microbial biodiversity on the planet. What is currently impeding progress is the need for a framework within which the research community can develop, exchange and discuss predictive ecosystem models that describe the biodiversity and functional interactions. Such a framework must encompass data and metadata transparency and interoperation; data and results validation, curation, and search; application programming interfaces for modeling and analysis tools; and human and technical processes and services necessary to ensure broad adoption. Here we discuss the need for focused community interaction to augment and deepen established community efforts, beginning with the Genomic Standards Consortium (GSC), to create a science-driven strategic plan for a Genomic Software Institute (GSI).
We have previously demonstrated that assessment of antisaccades (AS) provides not only measures of motor function in multiple sclerosis (MS), but also measures of cognitive control processes, in particular attention and working memory. This study sought to demonstrate the potential for AS measures to sensitively reflect change in functional status in MS. Twenty-four patients with relapsing-remitting MS and 12 age-matched controls were evaluated longitudinally using an AS task. Compared to control subjects, a number of saccade parameters changed significantly over a two-year period for MS patients. These included saccade error rates, latencies, and accuracy measures. Further, for MS patients, correlations were retained between ocular motor (OM) measures and scores on the PASAT, which is considered the reference task for the cognitive evaluation of MS patients. Notably, EDSS scores for these patients did not change significantly over this period. These results demonstrate that OM measures may reflect disease evolution in MS, in the absence of clinically evident changes as measured using conventional techniques. With replication, these markers could ultimately be developed into a cost-effective, non-invasive, and well-tolerated assessment tool to assist in confirming progression early in the disease process, and in measuring and predicting response to therapy.
Motor impairments have been found to be a significant clinical feature associated with autism and Asperger’s disorder (AD) in addition to core symptoms of communication and social cognition deficits. Motor deficits in high-functioning autism (HFA) and AD may differentiate these disorders, particularly with respect to the role of the cerebellum in motor functioning. Current neuroimaging and behavioral evidence suggests greater disruption of the cerebellum in HFA than AD. Investigations of ocular motor functioning have previously been used in clinical populations to assess the integrity of the cerebellar networks, through examination of saccade accuracy and the integrity of saccade dynamics. Previous investigations of visually guided saccades in HFA and AD have only assessed basic saccade metrics, such as latency, amplitude, and gain, as well as peak velocity. We used a simple visually guided saccade paradigm to further characterize the profile of visually guided saccade metrics and dynamics in HFA and AD. It was found that children with HFA, but not AD, were more inaccurate across both small (5°) and large (10°) target amplitudes, and final eye position was hypometric at 10°. These findings suggest greater functional disturbance of the cerebellum in HFA than AD, and suggest fundamental difficulties with visual error monitoring in HFA.
autism; Asperger’s disorder; saccades; eye movements; Verbal Comprehension Index
The widespread popularity of genomic applications is threatened by the “bioinformatics bottleneck” resulting from uncertainty about the cost and infrastructure needed to meet increasing demands for next-generation sequence analysis. Cloud computing services have been discussed as potential new bioinformatics support systems but have not been evaluated thoroughly.
We present benchmark costs and runtimes for common microbial genomics applications, including 16S rRNA analysis, microbial whole-genome shotgun (WGS) sequence assembly and annotation, WGS metagenomics and large-scale BLAST. Sequence dataset types and sizes were selected to correspond to outputs typically generated by small- to midsize facilities equipped with 454 and Illumina platforms, except for WGS metagenomics where sampling of Illumina data was used. Automated analysis pipelines, as implemented in the CloVR virtual machine, were used in order to guarantee transparency, reproducibility and portability across different operating systems, including the commercial Amazon Elastic Compute Cloud (EC2), which was used to attach real dollar costs to each analysis type. We found considerable differences in computational requirements, runtimes and costs associated with different microbial genomics applications. While all 16S analyses completed on a single-CPU desktop in under three hours, microbial genome and metagenome analyses utilized multi-CPU support of up to 120 CPUs on Amazon EC2, where each analysis completed in under 24 hours for less than $60. Representative datasets were used to estimate maximum data throughput on different cluster sizes and to compare costs between EC2 and comparable local grid servers.
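For context on how a per-run dollar figure like the one above can be derived, the sketch below shows the basic instance-hours times hourly-rate arithmetic; the instance count and hourly rate are placeholders, not actual Amazon EC2 prices or the costs reported here.

```python
# Back-of-the-envelope sketch of a cloud cost estimate:
# instances x wall-clock hours x hourly rate. The instance count and
# hourly rate below are placeholders, not actual Amazon EC2 pricing.

def run_cost(num_instances, hours, hourly_rate_usd):
    """Cost of a cluster run billed per instance-hour."""
    return num_instances * hours * hourly_rate_usd

# Example: a hypothetical 15-instance cluster (8 CPUs each, 120 CPUs total)
# running for 20 hours at a placeholder rate of $0.17 per instance-hour.
print(f"${run_cost(15, 20, 0.17):.2f}")  # $51.00
```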
Although bioinformatics requirements for microbial genomics depend on dataset characteristics and the analysis protocols applied, our results suggest that smaller sequencing facilities (up to three Roche/454 or one Illumina GAIIx sequencer) invested in 16S rRNA amplicon sequencing, microbial single-genome and metagenomic WGS projects can achieve cost-efficient bioinformatics support using CloVR in combination with Amazon EC2 as an alternative to local computing centers.
The promise of genome sequencing was that the vast undiscovered country would be mapped out by comparison of the multitude of sequences available and would aid researchers in deciphering the role of each gene in every organism. Researchers recognize that there is a need for high-quality data. However, different annotation procedures, numerous databases, and a diminishing percentage of experimentally determined gene functions have resulted in a spectrum of annotation quality. NCBI, in collaboration with sequencing centers, archival databases, and researchers, has developed the first international annotation standards, a fundamental step in ensuring that high-quality complete prokaryotic genomes are available as gold standard references. Highlights include the development of annotation assessment tools, community acceptance of protein naming standards, comparison of annotation resources to provide consistent annotation, and improved tracking of the evidence used to generate a particular annotation. The development of a set of minimal standards, including the requirement for annotated complete prokaryotic genomes to contain a full set of ribosomal RNAs, transfer RNAs, and genes encoding proteins with core conserved functions, is a historic milestone. The use of these standards in existing genomes and future submissions will increase the quality of databases, enabling researchers to make accurate biological discoveries.
Next-generation sequencing technologies have decentralized sequence acquisition, increasing the demand for new bioinformatics tools that are easy to use, portable across multiple platforms, and scalable for high-throughput applications. Cloud computing platforms provide on-demand access to computing infrastructure over the Internet and can be used in combination with custom-built virtual machines to distribute pre-packaged, pre-configured software.
We describe the Cloud Virtual Resource, CloVR, a new desktop application for push-button automated sequence analysis that can utilize cloud computing resources. CloVR is implemented as a single portable virtual machine (VM) that provides several automated analysis pipelines for microbial genomics, including 16S, whole genome and metagenome sequence analysis. The CloVR VM runs on a personal computer, utilizes local computer resources and requires minimal installation, addressing key challenges in deploying bioinformatics workflows. In addition CloVR supports use of remote cloud computing resources to improve performance for large-scale sequence processing. In a case study, we demonstrate the use of CloVR to automatically process next-generation sequencing data on multiple cloud computing platforms.
The CloVR VM and associated architecture lowers the barrier of entry for utilizing complex analysis protocols on both local single- and multi-core computers and cloud systems for high throughput data processing.
A vast and rich body of information has grown up as a result of the world's enthusiasm for 'omics technologies. Finding ways to describe and make available this information that maximise its usefulness has become a major effort across the 'omics world. At the heart of this effort is the Genomic Standards Consortium (GSC), an open-membership organization that drives community-based standardization activities. Here we provide a short history of the GSC, provide an overview of its range of current activities, and make a call for the scientific community to join forces to improve the quality and quantity of contextual information about our public collections of genomes, metagenomes, and marker gene sequences.
The Institute for Genome Sciences (IGS) has developed a prokaryotic annotation pipeline that is used for coding gene/RNA prediction and functional annotation of Bacteria and Archaea. The fully automated pipeline accepts one or many genomic sequences as input and produces output in a variety of standard formats. Functional annotation is primarily based on similarity searches and motif finding combined with a hierarchical rule based annotation system. The output annotations can also be loaded into a relational database and accessed through visualization tools.
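As a rough illustration of a hierarchical, rule-based choice among evidence types, the sketch below picks a product name from the highest-ranked available evidence. The evidence categories, their ordering and the fallback name are hypothetical and are not the pipeline's actual pFunc rules.

```python
# Sketch of a hierarchical, precedence-ordered annotation rule: the
# highest-ranked evidence type present supplies the product name.
# Evidence type names and their ordering here are hypothetical.

EVIDENCE_PRECEDENCE = ["characterized_match", "hmm_equivalog", "pairwise_blast", "domain_only"]

def assign_product(evidence):
    """Pick a product name from the best available evidence; fall back to a generic name."""
    for ev_type in EVIDENCE_PRECEDENCE:
        if ev_type in evidence:
            return evidence[ev_type]
    return "hypothetical protein"

# Example: a gene with only BLAST- and domain-level evidence.
gene_evidence = {"pairwise_blast": "putative ABC transporter permease",
                 "domain_only": "membrane protein"}
print(assign_product(gene_evidence))  # putative ABC transporter permease
```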
Institute for Genome Sciences; functional annotation; structural annotation; microbial genomics; prokaryotic genomics; annotation pipeline; pFunc; Glimmer; HMM; BER; Ergatis; Manatee; IGS Annotation Engine
The present article proposes the adoption of a community-defined, uniform, generic description of the core attributes of biological databases, BioDBCore. The goals of these attributes are to provide a general overview of the database landscape, to encourage consistency and interoperability between resources, and to promote the use of semantic and syntactic standards. BioDBCore will make it easier for users to evaluate the scope and relevance of available resources. This new resource will increase the collective impact of the information present in biological databases.
This report summarizes the proceedings of the 10th workshop of the Genomic Standards Consortium (GSC), held at Argonne National Laboratory, IL, USA. It was the second GSC workshop to have open registration and attracted over 60 participants who worked together to progress the full range of projects ongoing within the GSC. Overall, the primary focus of the workshop was on advancing the M5 platform for next-generation collaborative computational infrastructures. Other key outcomes included the formation of a GSC working group focused on MIGS/MIMS/MIENS compliance using the ISA software suite and the formal launch of the GSC Developer Working Group. Further information about the GSC and its range of activities can be found at http://gensc.org/.
This report summarizes the proceedings of the first day of the Metagenomics, Metadata and MetaAnalysis (M3) workshop held at the Intelligent Systems for Molecular Biology 2010 conference. The second day, which was dedicated to the inaugural meeting of the BioSharing initiative, is presented in a separate report. The Genomic Standards Consortium (GSC) hosted the first day of this Special Interest Group (SIG) at ISMB to continue exploring the bottlenecks and emerging solutions for obtaining biological insights through large-scale comparative analysis of metagenomic datasets. The M3 SIG included invited and selected talks and a panel discussion at the end of the day involving the plenary speakers. Further information about the GSC and its range of activities can be found at http://gensc.org. Information about the newly established BioSharing effort can be found at http://biosharing.org/.
This report summarizes the proceedings of the one-day BioSharing meeting held at the Intelligent Systems for Molecular Biology (ISMB) 2010 conference in Boston, MA, USA. This inaugural BioSharing event was hosted by the Genomic Standards Consortium as part of its M3 & BioSharing special interest group (SIG) workshop. The BioSharing event included invited talks from a range of community leaders and a panel discussion at the end of the day. The panel session led to the formal agreement among community leaders to join together to promote cross-community knowledge exchange and collaborations. A key focus of the newly formed BioSharing community will be linking up resources to promote real-world data sharing (a virtuous cycle of data) and supporting compliance with data policies through the creation of a one-stop portal of information. Further information about the newly established BioSharing effort can be found at http://biosharing.org.
It is widely recognized that, with the advent of very high-throughput, short-read, and highly parallelized sequencing technologies, the generation of new DNA sequences from microbes, plants, and metagenomes is outpacing the ability to assign functions to (“annotate”) all this data. To begin to address this, on May 18 and 19, 2010, a team of roughly fifty people met in Crystal City, Virginia, to define and scope the possibility of a first Critical Assessment of Functional Annotation Experiment (CAFAE) for bacterial genome annotation. Due to the fundamental importance of genomic data to its mission, the Department of Energy (DOE) BER program hosted this workshop, funding the attendance of all invitees. The workshop was co-organized by Dan Drell and Susan Gregurick (DOE), and by Owen White and Nikos Kyrpides.
The present article proposes the adoption of a community-defined, uniform, generic description of the core attributes of biological databases, BioDBCore. The goals of these attributes are to provide a general overview of the database landscape, to encourage consistency and interoperability between resources and to promote the use of semantic and syntactic standards. BioDBCore will make it easier for users to evaluate the scope and relevance of available resources. This new resource will increase the collective impact of the information present in biological databases.
This report summarizes the M3 Workshop held at the January 2010 Pacific Symposium on Biocomputing. The workshop, organized by Genomic Standards Consortium members, included five contributed talks, a series of short presentations from stakeholders in the genomics standards community, a poster session, and, in the evening, an open discussion session to review current projects and examine future directions for the GSC and its stakeholders.
Motivation: The growth of sequence data has been accompanied by an increasing need to analyze data on distributed computer clusters. The use of these systems for routine analysis requires scalable and robust software for data management of large datasets. Software is also needed to simplify data management and make large-scale bioinformatics analysis accessible to, and reproducible by, a wide class of target users.
Results: We have developed a workflow management system named Ergatis that enables users to build, execute and monitor pipelines for computational analysis of genomics data. Ergatis contains preconfigured components and template pipelines for a number of common bioinformatics tasks such as prokaryotic genome annotation and genome comparisons. Outputs from many of these components can be loaded into a Chado relational database. Ergatis was designed to be accessible to a broad class of users and provides a user-friendly, web-based interface. Ergatis supports high-throughput batch processing on distributed compute clusters and has been used for data management in a number of genome annotation and comparative genomics projects.
Availability: Ergatis is an open-source project and is freely available at http://ergatis.sourceforge.net
The Comprehensive Microbial Resource or CMR (http://cmr.jcvi.org) provides a web-based central resource for the display, search and analysis of the sequence and annotation for complete and publicly available bacterial and archaeal genomes. In addition to displaying the original annotation from GenBank, the CMR makes available secondary automated structural and functional annotation across all genomes to provide consistent data types necessary for effective mining of genomic data. Precomputed homology searches are stored to allow meaningful genome comparisons. The CMR supplies users with over 50 different tools to utilize the sequence and annotation data across one or more of the 571 currently available genomes. At the gene level users can view the gene annotation and underlying evidence. Genome level information includes whole genome graphical displays, biochemical pathway maps and genome summary data. Comparative tools display analysis between genomes with homology and genome alignment tools, and searches across the accessions, annotation, and evidence assigned to all genes/genomes are available. The data and tools on the CMR aid genomic research and analysis, and the CMR is included in over 200 scientific publications. The code underlying the CMR website and the CMR database are freely available for download with no license restrictions.
The Gemina system (http://gemina.igs.umaryland.edu) identifies, standardizes and integrates the outbreak metadata for the breadth of NIAID category A–C viral and bacterial pathogens, thereby providing an investigative and surveillance tool describing the Who [Host], What [Disease, Symptom], When [Date], Where [Location] and How [Pathogen, Environmental Source, Reservoir, Transmission Method] for each pathogen. The Gemina database will provide a greater understanding of the interactions of viral and bacterial pathogens with their hosts and infectious diseases through in-depth literature text-mining, integrated outbreak metadata, outbreak surveillance tools, extensive ontology development, metadata curation, representative genomic sequence identification and standards development. The Gemina web interface provides metadata selection and retrieval of a pathogen's Infection Systems (Pathogen, Host, Disease, Transmission Method and Anatomy) and Incidents (Location and Date), along with a host's Age and Gender. The Gemina system provides an integrated investigative and geospatial surveillance system connecting pathogens, pathogen products and disease, anchored on the taxonomic ID of the pathogen and host, to identify the breadth of hosts and diseases known for these pathogens, to identify the extent of outbreak locations, and to identify unique genomic regions with the DNA Signature Insignia Detection Tool.
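As an illustration of how the Who/What/When/Where/How fields described above might be organized in a single outbreak metadata record, the sketch below uses a simple flat structure; the field layout and example values are assumptions for illustration and do not represent Gemina's actual schema.

```python
# Illustrative outbreak metadata record built around the Who/What/When/Where/How
# fields described in the abstract; this flat dataclass and its example values
# are assumptions for illustration, not Gemina's actual schema.
from dataclasses import dataclass

@dataclass
class OutbreakRecord:
    pathogen: str             # How (agent)
    transmission_method: str  # How
    host: str                 # Who
    disease: str              # What
    symptom: str              # What
    date: str                 # When
    location: str             # Where

record = OutbreakRecord(
    pathogen="Bacillus anthracis",
    transmission_method="inhalation",
    host="Homo sapiens",
    disease="anthrax",
    symptom="fever",
    date="2001-10",
    location="United States",
)
print(record.disease, record.location)
```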