Ocean Sampling Day was initiated by the EU-funded Micro B3 (Marine Microbial Biodiversity, Bioinformatics, Biotechnology) project to obtain a snapshot of the marine microbial biodiversity and function of the world’s oceans. It is a simultaneous global mega-sequencing campaign aiming to generate the largest standardized microbial data set in a single day. This will be achievable only through the coordinated efforts of an Ocean Sampling Day Consortium, supportive partnerships and networks between sites. This commentary outlines the establishment, function and aims of the Consortium and describes our vision for a sustainable study of marine microbial communities and their embedded functional traits.
Ocean sampling day; OSD; Biodiversity; Genomics; Health Index; Bacteria; Microorganism; Metagenomics; Marine; Micro B3; Standards
Lotic ecosystems such as rivers and streams are unique in that they represent a continuum of both space and time during the transition from headwaters to the river mouth. As microbes have very different controls over their ecology, distribution and dispersion compared with macrobiota, we wished to explore biogeographical patterns within a river catchment and uncover the major drivers structuring bacterioplankton communities. Water samples collected across the River Thames Basin, UK, covering the transition from headwater tributaries to the lower reaches of the main river channel were characterised using 16S rRNA gene pyrosequencing. This approach revealed an ecological succession in the bacterial community composition along the river continuum, moving from a community dominated by Bacteroidetes in the headwaters to one dominated by Actinobacteria downstream. Location of the sampling point in the river network (measured as the cumulative water channel distance upstream) was found to be the most predictive spatial feature, implying that ecological processes pertaining to temporal community succession are of prime importance in driving the assemblages of riverine bacterioplankton communities. A decrease in bacterial activity rates and an increase in the abundance of low nucleic acid bacteria relative to high nucleic acid bacteria were found to correspond with these downstream changes in community structure, suggesting corresponding functional changes. Our findings show that bacterial communities across the Thames basin exhibit an ecological succession along the river continuum, and that this is primarily driven by water residence time rather than the physico-chemical status of the river.
Sampling ecosystems, even at a local scale, at the temporal and spatial resolution necessary to capture natural variability in microbial communities is prohibitively expensive. We extrapolated marine surface microbial community structure and metabolic potential from 72 16S rRNA amplicon and 8 metagenomic observations using remotely sensed environmental parameters to create a system-scale model of marine microbial metabolism for 5904 grid cells (49 km²) in the Western English Channel, across 3 years of weekly averages. Thirteen environmental variables predicted the relative abundance of 24 bacterial Orders and 1715 unique enzyme-encoding genes that encode turnover of 2893 metabolites. The genes' predicted relative abundance was highly correlated (Pearson correlation 0.72, P-value <10⁻⁶) with their observed relative abundance in sequenced metagenomes. Predictions of the relative turnover (synthesis or consumption) of CO2 were significantly correlated with observed surface CO2 fugacity. The spatial and temporal variation in the predicted relative abundances of genes coding for cyanase, carbon monoxide and malate dehydrogenase were investigated along with the predicted inter-annual variation in relative consumption or production of ∼3000 metabolites forming six significant temporal clusters. These spatiotemporal distributions could possibly be explained by the co-occurrence of anaerobic and aerobic metabolisms associated with localized plankton blooms or sediment resuspension, which facilitate the presence of anaerobic micro-niches. This predictive model provides a general framework for focusing future sampling and experimental design to relate biogeochemical turnover to microbial ecology.
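The validation step reported above, comparing predicted with observed relative abundances, rests on a Pearson correlation. The following is a minimal sketch of that statistic only, not the authors' modelling pipeline; the function name `pearson_r` and the toy abundance values are assumptions made for illustration.

```python
import math


def pearson_r(predicted, observed):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    if len(predicted) < 2:
        raise ValueError("at least two paired values are required")

    mean_p = sum(predicted) / len(predicted)
    mean_o = sum(observed) / len(observed)

    # Sum of cross-deviations and sums of squared deviations from the means.
    cross = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    ss_p = sum((p - mean_p) ** 2 for p in predicted)
    ss_o = sum((o - mean_o) ** 2 for o in observed)

    if ss_p == 0 or ss_o == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cross / math.sqrt(ss_p * ss_o)


# Hypothetical relative abundances for three genes (illustrative values only).
toy_predicted = [1.0, 2.0, 3.0]
toy_observed = [2.0, 4.0, 6.0]
r = pearson_r(toy_predicted, toy_observed)
```

A value of r close to 1 indicates that the predicted abundances rise and fall in step with the observed ones; the significance test quoted in the abstract is a separate calculation not shown here.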
To facilitate sharing of Omics data, many groups of scientists have been working to establish the relevant data standards. The main components of data sharing standards are experiment description standards, data exchange standards, terminology standards, and experiment execution standards. Here we provide a survey of existing and emerging standards that are intended to assist the free and open exchange of large-format data.
Data sharing; Data exchange; Data standards; MGED; MIAME; Ontology; Data format; Microarray; Proteomics; Metabolomics
This manuscript calls for an international effort to generate a comprehensive catalog from genome sequences of all the archaeal and bacterial type strains.
Microbes hold the key to life. They hold the secrets to our past (as the descendants of the earliest forms of life) and the prospects for our future (as we mine their genes for solutions to some of the planet's most pressing problems, from global warming to antibiotic resistance). However, the piecemeal approach that has defined efforts to study microbial genetic diversity for over 20 years and in over 30,000 genome projects risks squandering that promise. These efforts have covered less than 20% of the diversity of the cultured archaeal and bacterial species, which represent just 15% of the overall known prokaryotic diversity. Here we call for the funding of a systematic effort to produce a comprehensive genomic catalog of all cultured Bacteria and Archaea by sequencing, where available, the type strain of each species with a validly published name (currently ∼11,000). This effort will provide an unprecedented level of coverage of our planet's genetic diversity, allow for the large-scale discovery of novel genes and functions, and lead to an improved understanding of microbial evolution and function in the environment.
This report summarizes the proceedings of the 14th workshop of the Genomic Standards Consortium (GSC) held at the University of Oxford in September 2012. The primary goal of the workshop was to work towards the launch of the Genomic Observatories (GOs) Network under the GSC. For the first time, it brought together potential GOs sites, GSC members, and a range of interested partner organizations. It thus represented the first meeting of the GOs Network (GOs1). Key outcomes include the formation of a core group of “champions” ready to take the GOs Network forward, as well as the formation of working groups. The workshop also served as the first meeting of a wide range of participants in the Ocean Sampling Day (OSD) initiative, a first GOs action. Three projects with complementary interests – COST Action ES1103, MG4U and Micro B3 – organized joint sessions at the workshop. A two-day GSC Hackathon followed the main three days of meetings.
The co-authors of this paper hereby state their intention to work together to launch the Genomic Observatories Network (GOs Network) for which this document will serve as its Founding Charter. We define a Genomic Observatory as an ecosystem and/or site subject to long-term scientific research, including (but not limited to) the sustained study of genomic biodiversity from single-celled microbes to multicellular organisms.
An international group of 64 scientists first published the call for a global network of Genomic Observatories in January 2012. The vision for such a network was expanded in a subsequent paper and developed over a series of meetings in Bremen (Germany), Shenzhen (China), Moorea (French Polynesia), Oxford (UK), Pacific Grove (California, USA), Washington (DC, USA), and London (UK). While this community-building process continues, here we express our mutual intent to establish the GOs Network formally, and to describe our shared vision for its future. The views expressed here are ours alone as individual scientists, and do not necessarily represent those of the institutions with which we are affiliated.
Biodiversity; Genomics; Biocode; Earth observations
The Genomic Standards Consortium (GSC) is an open-membership community that was founded in 2005 to work towards the development, implementation and harmonization of standards in the field of genomics. Starting with the defined task of establishing a minimal set of descriptions, the GSC has evolved into an active standards-setting body that currently has 18 ongoing projects, with additional projects regularly proposed from within and outside the GSC. Here we describe our recently enacted policy for proposing new activities that are intended to be taken on by the GSC, along with the template for proposing such new activities.
The UK Science and Innovation Network UK-USA workshop ‘Beating the Superbugs: Hospital Microbiome Studies for tackling Antimicrobial Resistance’ was held on October 14th 2013 at the UK Department of Health, London. The workshop was designed to promote US-UK collaboration on hospital microbiome studies to add a new facet to our collective understanding of antimicrobial resistance. The assembled researchers debated the importance of the hospital microbial community in transmission of disease and as a reservoir for antimicrobial resistance genes, and discussed methodologies, hypotheses, and priorities. A number of complementary approaches were explored, although the importance of the built environment microbiome in disease transmission was not universally accepted. Current whole genome epidemiological methods are being pioneered in the UK and the benefits of moving to community analysis are not necessarily obvious to the pioneers; however, rapid progress in other areas of microbiology suggests to some researchers that hospital microbiome studies will be exceptionally fruitful even in the short term. Collaborative studies will recombine different strengths to tackle the international problems of antimicrobial resistance and hospital and healthcare associated infections.
Antibiotic resistance; Nosocomial infections; Hospital microbiome; Superbugs
Metagenomics is a relatively recently established but rapidly expanding field that uses high-throughput next-generation sequencing technologies to characterize the microbial communities inhabiting different ecosystems (including oceans, lakes, soil, tundra, plants and body sites). Metagenomics brings with it a number of challenges, including the management, analysis, storage and sharing of data. In response to these challenges, we have developed a new metagenomics resource (http://www.ebi.ac.uk/metagenomics/) that allows users to easily submit raw nucleotide reads for functional and taxonomic analysis by a state-of-the-art pipeline, and have them automatically stored (together with descriptive, standards-compliant metadata) in the European Nucleotide Archive.
The Global Biodiversity Information Facility and the Genomic Standards Consortium convened a joint workshop at the University of Oxford, 27-29 February 2012, with a small group of experts from Europe, USA, China and Japan, to continue the alignment of the Darwin Core with the MIxS and related genomics standards. Several reference mappings were produced as well as test expressions of MIxS in RDF. The use and management of controlled vocabulary terms was considered in relation to both GBIF and the GSC, and tools for working with terms were reviewed. Extensions for publishing genomic biodiversity data to the GBIF network via a Darwin Core Archive were prototyped and work begun on preparing translations of the Darwin Core to Japanese and Chinese. Five genomic repositories were identified for engagement to begin the process of testing the publishing of genomic data to the GBIF network commencing with the SILVA rRNA database.
To make full use of research data, the bioscience community needs to adopt technologies and reward mechanisms that support interoperability and promote the growth of an open ‘data commoning’ culture. Here we describe the prerequisites for data commoning and present an established and growing ecosystem of solutions using the shared ‘Investigation-Study-Assay’ framework to support that vision.
Building on the planning efforts of the RCN4GSC project, a workshop was convened in San Diego to bring together experts from genomics and metagenomics, biodiversity, ecology, and bioinformatics with the charge to identify potential for positive interactions and progress, especially building on successes at establishing data standards by the GSC and by the biodiversity and ecological communities. Until recently, the contribution of microbial life to the biomass and biodiversity of the biosphere was largely overlooked (because it was resistant to systematic study). Now, emerging genomic and metagenomic tools are making investigation possible. Initial research findings suggest that major advances are in the offing. Although different research communities share some overlapping concepts and traditions, they differ significantly in sampling approaches, vocabularies and workflows. Likewise, their definitions of ‘fitness for use’ for data differ significantly, as this concept stems from the specific research questions of most importance in the different fields. Nevertheless, there is little doubt that there is much to be gained from greater coordination and integration. As a first step toward interoperability of the information systems used by the different communities, participants agreed to conduct a case study on two of the leading data standards from the two formerly disparate fields: (a) GSC’s standard checklists for genomics and metagenomics and (b) TDWG’s Darwin Core standard, used primarily in taxonomy and systematic biology.
Variability in the extent of the descriptions of data (‘metadata’) held in public repositories forces users to assess the quality of records individually, which rapidly becomes impractical. The scoring of records on the richness of their description provides a simple, objective proxy measure for quality that enables filtering that supports downstream analysis. Pivotally, such descriptions should spur on improvements. Here, we introduce such a measure, the ‘Metadata Coverage Index’ (MCI): the percentage of available fields actually filled in a record or description. MCI scores can be calculated across a database, for individual records or for their component parts (e.g., fields of interest). There are many potential uses for this simple metric, for example: to filter, rank or search for records; to assess the metadata availability of an ad hoc collection; to determine the frequency with which fields in a particular record type are filled, especially with respect to standards compliance; to assess the utility of specific tools and resources, and of data capture practice more generally; to prioritize records for further curation; to serve as performance metrics of funded projects; or to quantify the value added by curation. Here we demonstrate the utility of MCI scores using metadata from the Genomes Online Database (GOLD), including records compliant with the ‘Minimum Information about a Genome Sequence’ (MIGS) standard developed by the Genomic Standards Consortium. We discuss challenges and address the further application of MCI scores: to show improvements in annotation quality over time, to inform the work of standards bodies and repository providers on the usability and popularity of their products, and to assess and credit the work of curators. Such an index provides a step towards putting metadata capture practices and, in the future, standards compliance, into a quantitative and objective framework.
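The definition of the MCI given above (the percentage of available fields actually filled) can be illustrated with a short sketch. This is one possible reading of the definition, not the implementation used with GOLD; the function names, the rule for what counts as "filled" (not missing and not blank), and the example field names are all assumptions.

```python
def is_filled(value):
    """Assumed rule: a field counts as filled if it is not None and not blank."""
    return value is not None and str(value).strip() != ""


def metadata_coverage_index(record, fields):
    """Percentage of `fields` that carry a value in `record` (a dict)."""
    if not fields:
        raise ValueError("at least one field is required")
    filled = sum(1 for name in fields if is_filled(record.get(name)))
    return 100.0 * filled / len(fields)


def mean_mci(records, fields):
    """Average MCI across a collection of records."""
    if not records:
        raise ValueError("at least one record is required")
    return sum(metadata_coverage_index(r, fields) for r in records) / len(records)


# Hypothetical field names and record, for illustration only.
example_fields = ["project_name", "collection_date", "geo_location", "habitat"]
example_record = {
    "project_name": "Example project",
    "collection_date": "2012-03-05",
    "geo_location": "",        # blank, so not counted as filled
    "habitat": "marine",
}
score = metadata_coverage_index(example_record, example_fields)  # 3 of 4 fields
```

Passing a subset of fields (for instance, only those required by a checklist) gives the component-level scores mentioned in the abstract, and `mean_mci` gives a database-wide figure.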
We are entering a new era in genomics–that of large-scale, place-based, highly contextualized genomic research. Here we review this emerging paradigm shift and suggest that sites of utmost scientific importance be expanded into ‘Genomic Observatories’ (GOs). Investment in GOs should focus on the digital characterization of whole ecosystems, from all-taxa biotic inventories to time-series ’omics studies. The foundational layer of biodiversity–genetic variation–would thus be mainstreamed into Earth Observation systems enabling predictive modelling of biodiversity dynamics and resultant impacts on ecosystem services.
Ecogenomics; Earth observation; Biodiversity; Ecosystems; Biocode; Genomic observatory; DNA
Computing of sequence similarity results is becoming a limiting factor in metagenome analysis. Sequence similarity search results encoded in an open, exchangeable format have the potential to limit the needs for computational reanalysis of these data sets. A prerequisite for sharing of similarity results is a common reference.
We introduce a mechanism for automatically maintaining a comprehensive, non-redundant protein database and for creating a quarterly release of this resource. In addition, we present tools for translating similarity searches into many annotation namespaces, e.g. KEGG or NCBI's GenBank.
The data and tools we present allow the creation of multiple result sets using a single computation, permitting computational results to be shared between groups for large sequence data sets.
Here we present a standard developed by the Genomic Standards Consortium (GSC) for reporting marker gene sequences—the minimum information about a marker gene sequence (MIMARKS). We also introduce a system for describing the environment from which a biological sample originates. The ‘environmental packages’ apply to any genome sequence of known origin and can be used in combination with MIMARKS and other GSC checklists. Finally, to establish a unified standard for describing sequence data and to provide a single point of entry for the scientific community to access and learn about GSC checklists, we present the minimum information about any (x) sequence (MIxS). Adoption of MIxS will enhance our ability to analyze natural genetic diversity documented by massive DNA sequencing efforts from myriad ecosystems in our ever-changing biosphere.
Robust seasonal dynamics in microbial community composition have previously been observed in the English Channel L4 marine observatory. These could be explained either by seasonal changes in the taxa present at the L4 site, or by the continuous modulation of abundance of taxa within a persistent microbial community. To test these competing hypotheses, deep sequencing of 16S rRNA from one randomly selected time point to a depth of 10 729 927 reads was compared with existing taxonomic survey data covering 6 years. When compared against the 6-year survey of 72 shallow sequenced time points, the deep sequenced time point maintained 95.4% of the combined shallow OTUs. Additionally, on average, 99.75%±0.06 (mean±s.d.) of the operational taxonomic units found in each shallow sequenced sample were also found in the single deep sequenced sample. This suggests that the vast majority of taxa identified in this ecosystem are always present, but just in different proportions that are predictable. Thus observed changes in community composition are actually variations in the relative abundance of taxa, not, as was previously believed, demonstrating extinction and recolonization of taxa in the ecosystem through time.
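The two overlap figures reported above (the share of the combined shallow OTUs recovered in the deep sample, and the mean per-sample share) are set-intersection percentages. The following is a minimal sketch of that arithmetic, assuming each sample is represented as a set of OTU identifiers; the function names and the toy identifiers are hypothetical, and the OTU clustering itself is not shown.

```python
import statistics


def shared_percentage(shallow_otus, deep_otus):
    """Percentage of OTUs in `shallow_otus` that also occur in `deep_otus`."""
    shallow = set(shallow_otus)
    if not shallow:
        raise ValueError("shallow sample contains no OTUs")
    return 100.0 * len(shallow & set(deep_otus)) / len(shallow)


def overlap_summary(shallow_samples, deep_otus):
    """Return (combined-pool percentage, mean per-sample percentage)."""
    if not shallow_samples:
        raise ValueError("at least one shallow sample is required")
    # Pool every OTU seen in any shallow sample, then compare with the deep sample.
    combined = set().union(*shallow_samples)
    per_sample = [shared_percentage(s, deep_otus) for s in shallow_samples]
    return shared_percentage(combined, deep_otus), statistics.mean(per_sample)


# Toy OTU identifiers, for illustration only.
deep = {"otu_a", "otu_b", "otu_c", "otu_d"}
shallow = [{"otu_a", "otu_b"}, {"otu_b", "otu_c", "otu_e"}]
combined_pct, mean_pct = overlap_summary(shallow, deep)
```

In this toy case three of the four pooled shallow OTUs appear in the deep sample, so `combined_pct` is 75.0, while the per-sample mean is higher because the first shallow sample is fully contained in the deep one.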
16S rRNA; bacteria; community; diversity; seed bank
This report details the outcome of the 13th Meeting of the Genomic Standards Consortium. The three-day conference was held at the Kingkey Palace Hotel, Shenzhen, China, on March 5–7, 2012, and was hosted by the Beijing Genomics Institute. The meeting, titled From Genomes to Interactions to Communities to Models, highlighted the role of data standards associated with genomic, metagenomic, and amplicon sequence data and the contextual information associated with the sample. To this end the meeting focused on genomic projects for animals, plants, fungi, and viruses; metagenomic studies in host-microbe interactions; and the dynamics of microbial communities. In addition, the meeting hosted a Genomic Observatories Network session, a Genomic Standards Consortium biodiversity working group session, and a Microbiology of the Built Environment session sponsored by the Alfred P. Sloan Foundation.
Genomic Standards Consortium; microbiome; microbial metagenomics; fungal genomics; viral genomics; Genomic Observatories Network
A steep drop in the cost of next-generation sequencing during recent years has made the technology affordable to the majority of researchers, but downstream bioinformatic analysis still poses a resource bottleneck for smaller laboratories and institutes that do not have access to substantial computational resources. Sequencing instruments are typically bundled with only the minimal processing and storage capacity required for data capture during sequencing runs. Given the scale of sequence datasets, scientific value cannot be obtained from acquiring a sequencer unless it is accompanied by an equal investment in informatics infrastructure.
Cloud BioLinux is a publicly accessible Virtual Machine (VM) that enables scientists to quickly provision on-demand infrastructures for high-performance bioinformatics computing using cloud platforms. Users have instant access to a range of pre-configured command line and graphical software applications, including a full-featured desktop interface, documentation and over 135 bioinformatics packages for applications including sequence alignment, clustering, assembly, display, editing, and phylogeny. Each tool's functionality is fully described in the documentation directly accessible from the graphical interface of the VM. Besides the Amazon EC2 cloud, we have started instances of Cloud BioLinux on a private Eucalyptus cloud installed at the J. Craig Venter Institute, and demonstrated access to the bioinformatic tools interface through a remote connection to EC2 instances from a local desktop computer. Documentation for using Cloud BioLinux on EC2 is available from our project website, while a Eucalyptus cloud image and VirtualBox Appliance is also publicly available for download and use by researchers with access to private clouds.
Cloud BioLinux provides a platform for developing bioinformatics infrastructures on the cloud. An automated and configurable process builds Virtual Machines, allowing the development of highly customized versions from a shared code base. This shared community toolkit enables application specific analysis platforms on the cloud by minimizing the effort required to prepare and maintain them.
Microbial ecology has been enhanced greatly by the ongoing ‘omics revolution, bringing half the world's biomass and most of its biodiversity into analytical view for the first time; indeed, it feels almost like the invention of the microscope and the discovery of the new world at the same time. With major microbial ecology research efforts accumulating prodigious quantities of sequence, protein, and metabolite data, we are now poised to address environmental microbial research at macro scales, and to begin to characterize and understand the dimensions of microbial biodiversity on the planet. What is currently impeding progress is the need for a framework within which the research community can develop, exchange and discuss predictive ecosystem models that describe the biodiversity and functional interactions. Such a framework must encompass data and metadata transparency and interoperation; data and results validation, curation, and search; application programming interfaces for modeling and analysis tools; and human and technical processes and services necessary to ensure broad adoption. Here we discuss the need for focused community interaction to augment and deepen established community efforts, beginning with the Genomic Standards Consortium (GSC), to create a science-driven strategic plan for a Genomic Software Institute (GSI).
Here we describe the longest microbial time-series analyzed to date using high-resolution 16S rRNA tag pyrosequencing of samples taken monthly over 6 years at a temperate marine coastal site off Plymouth, UK. Data treatment affected the estimation of community richness over a 6-year period, whereby 8794 operational taxonomic units (OTUs) were identified using single-linkage preclustering and 21 130 OTUs were identified by denoising the data. The Alphaproteobacteria were the most abundant Class, and the most frequently recorded OTUs were members of the Rickettsiales (SAR 11) and Rhodobacteriales. This near-surface ocean bacterial community showed strong repeatable seasonal patterns, which were defined by winter peaks in diversity across all years. Environmental variables explained far more variation in seasonally predictable bacteria than did data on protists or metazoan biomass. Change in day length alone explains >65% of the variance in community diversity. The results suggested that seasonal changes in environmental variables are more important than trophic interactions. Interestingly, microbial association network analysis showed that correlations in abundance were stronger within bacterial taxa than between bacteria and eukaryotes, or between bacteria and environmental variables.
16S rRNA; microbial; bacteria; community; diversity; model
This report details the outcome of the 1st International Earth Microbiome Project Conference. The 2-day conference was held at the Kingkey Palace Hotel, Shenzhen, China, on the 14th-15th June 2011, and was hosted by BGI (formerly the Beijing Genomics Institute). The conference was arranged as a formal launch for the Earth Microbiome Project, to highlight some of the exciting research projects and the results of the preliminary pilot studies, and to provide a discussion forum for the types of technology and experimental approaches that will come to define the standard operating procedures of this project.