Metagenomic investigations hold great promise for informing the genetics, physiology, and ecology of environmental microorganisms. Current challenges for metagenomic analysis are related to our ability to connect the dots between sequencing reads, their population of origin, and their encoding functions. Assembly-based methods reduce dataset size by extending overlapping reads into larger contiguous sequences (contigs), providing contextual information for genetic sequences that does not rely on existing references. These methods, however, tend to be computationally intensive and are again challenged by sequencing errors as well as by genomic repeats While numerous tools have been developed based on these methodological concepts, they present confounding choices and training requirements to metagenomic investigators. To help with accessibility to assembly tools, this review also includes an IPython Notebook metagenomic assembly tutorial. This tutorial has instructions for execution any operating system using Amazon Elastic Cloud Compute and guides users through downloading, assembly, and mapping reads to contigs of a mock microbiome metagenome. Despite its challenges, metagenomic analysis has already revealed novel insights into many environments on Earth. As software, training, and data continue to emerge, metagenomic data access and its discoveries will to grow.
metagenomes; assembly; review; challenges; tutorial
Francisella philomiragia is a saprophytic gammaproteobacterium found only occasionally in immunocompromised individuals and is the nearest neighbor to the causative agent of tularemia and category A select agent Francisella tularensis. To shed insight into the key genetic differences and the evolution of these two distinct lineages, we sequenced the first complete genome of F. philomiragia strain ATCC 25017, which was isolated as a free-living microorganism from water in Bear River Refuge, Utah.
Historically, cholera outbreaks have been linked to V. cholerae O1 serogroup strains or its derivatives of the O37 and O139 serogroups. A genomic study on the 2010 Haiti cholera outbreak strains highlighted the putative role of non O1/non-O139 V. cholerae in causing cholera and the lack of genomic sequences of such strains from around the world. Here we address these gaps by scanning a global collection of V. cholerae strains as a first step towards understanding the population genetic diversity and epidemic potential of non O1/non-O139 strains. Whole Genome Mapping (Optical Mapping) based bar coding produces a high resolution, ordered restriction map, depicting a complete view of the unique chromosomal architecture of an organism. To assess the genomic diversity of non-O1/non-O139 V. cholerae, we applied a Whole Genome Mapping strategy on a well-defined and geographically and temporally diverse strain collection, the Sakazaki serogroup type strains. Whole Genome Map data on 91 of the 206 serogroup type strains support the hypothesis that V. cholerae has an unprecedented genetic and genomic structural diversity. Interestingly, we discovered chromosomal fusions in two unusual strains that possess a single chromosome instead of the two chromosomes usually found in V. cholerae. We also found pervasive chromosomal rearrangements such as duplications and indels in many strains. The majority of Vibrio genome sequences currently in public databases are unfinished draft sequences. The Whole Genome Mapping approach presented here enables rapid screening of large strain collections to capture genomic complexities that would not have been otherwise revealed by unfinished draft genome sequencing and thus aids in assembling and finishing draft sequences of complex genomes. Furthermore, Whole Genome Mapping allows for prediction of novel V. cholerae non-O1/non-O139 strains that may have the potential to cause future cholera outbreaks.
Bacillus cereus strain 03BB87, a blood culture isolate, originated in a 56-year-old male muller operator with a fatal case of pneumonia in 2003. Here we present the finished genome sequence of that pathogen, including a 5.46-Mb chromosome and two plasmids (209 and 52 Kb, respectively).
Next generation sequencing (NGS) technologies that parallelize the sequencing process and produce thousands to millions, or even hundreds of millions of sequences in a single sequencing run, have revolutionized genomic and genetic research. Because of the vagaries of any platform’s sequencing chemistry, the experimental processing, machine failure, and so on, the quality of sequencing reads is never perfect, and often declines as the read is extended. These errors invariably affect downstream analysis/application and should therefore be identified early on to mitigate any unforeseen effects.
Here we present a novel FastQ Quality Control Software (FaQCs) that can rapidly process large volumes of data, and which improves upon previous solutions to monitor the quality and remove poor quality data from sequencing runs. Both the speed of processing and the memory footprint of storing all required information have been optimized via algorithmic and parallel processing solutions. The trimmed output compared side-by-side with the original data is part of the automated PDF output. We show how this tool can help data analysis by providing a few examples, including an increased percentage of reads recruited to references, improved single nucleotide polymorphism identification as well as de novo sequence assembly metrics.
FaQCs combines several features of currently available applications into a single, user-friendly process, and includes additional unique capabilities such as filtering the PhiX control sequences, conversion of FASTQ formats, and multi-threading. The original data and trimmed summaries are reported within a variety of graphics and reports, providing a simple way to do data quality control and assurance.
Electronic supplementary material
The online version of this article (doi:10.1186/s12859-014-0366-2) contains supplementary material, which is available to authorized users.
Quality control; Trimming; Next generation sequencing analysis; Data preprocessing
Assembly of metagenomic samples is a very complex process, with algorithms designed to address sequencing platform-specific issues, (read length, data volume, and/or community complexity), while also faced with genomes that differ greatly in nucleotide compositional biases and in abundance. To address these issues, we have developed a post-assembly process: MetaGenomic Assembly by Merging (MeGAMerge). We compare this process to the performance of several assemblers, using both real, and in-silico generated samples of different community composition and complexity. MeGAMerge consistently outperforms individual assembly methods, producing larger contigs with an increased number of predicted genes, without replication of data. MeGAMerge contigs are supported by read mapping and contig alignment data, when using synthetically-derived and real metagenomic data, as well as by gene prediction analyses and similarity searches. MeGAMerge is a flexible method that generates improved metagenome assemblies, with the ability to accommodate upcoming sequencing platforms, as well as present and future assembly algorithms.
We present draft genome sequences of 15 clinical Vibrio isolates of various serogroups. These are valuable data for use in studying Vibrio cholerae genetic diversity, epidemic potential, and strain attribution.
Acinetobacter baumannii is an emerging nosocomial pathogen, and therefore high-quality genome assemblies for this organism are needed to aid in detection, diagnostic, and treatment technologies. Here we present the improved draft assembly of A. baumannii ATCC 19606 in two scaffolds. This 3,953,621-bp genome contains 3,750 coding regions and has a 39.1% G+C content.
This manuscript calls for an international effort to generate a comprehensive catalog from genome sequences of all the archaeal and bacterial type strains.
Microbes hold the key to life. They hold the secrets to our past (as the descendants of the earliest forms of life) and the prospects for our future (as we mine their genes for solutions to some of the planet's most pressing problems, from global warming to antibiotic resistance). However, the piecemeal approach that has defined efforts to study microbial genetic diversity for over 20 years and in over 30,000 genome projects risks squandering that promise. These efforts have covered less than 20% of the diversity of the cultured archaeal and bacterial species, which represent just 15% of the overall known prokaryotic diversity. Here we call for the funding of a systematic effort to produce a comprehensive genomic catalog of all cultured Bacteria and Archaea by sequencing, where available, the type strain of each species with a validly published name (currently∼11,000). This effort will provide an unprecedented level of coverage of our planet's genetic diversity, allow for the large-scale discovery of novel genes and functions, and lead to an improved understanding of microbial evolution and function in the environment.
Thanks to high-throughput sequencing technologies, genome sequencing has become a common component in nearly all aspects of viral research; thus, we are experiencing an explosion in both the number of available genome sequences and the number of institutions producing such data. However, there are currently no common standards used to convey the quality, and therefore utility, of these various genome sequences. Here, we propose five “standard” categories that encompass all stages of viral genome finishing, and we define them using simple criteria that are agnostic to the technology used for sequencing. We also provide genome finishing recommendations for various downstream applications, keeping in mind the cost-benefit trade-offs associated with different levels of finishing. Our goal is to define a common vocabulary that will allow comparison of genome quality across different research groups, sequencing platforms, and assembly techniques.
Escherichia coli O145:H28 strain RM12581 was isolated from bagged romaine lettuce during a 2010 U.S. lettuce-associated outbreak. E. coli O145:H28 strain RM12761 was isolated from ice cream during a 2007 ice cream-associated outbreak in Belgium. Here we report the complete genome sequences and annotation of both strains.
Marine Group I (MGI) Thaumarchaeota are one of the most abundant and cosmopolitan chemoautotrophs within the global dark ocean. To date, no representatives of this archaeal group retrieved from the dark ocean have been successfully cultured. We used single cell genomics to investigate the genomic and metabolic diversity of thaumarchaea within the mesopelagic of the subtropical North Pacific and South Atlantic Ocean. Phylogenetic and metagenomic recruitment analysis revealed that MGI single amplified genomes (SAGs) are genetically and biogeographically distinct from existing thaumarchaea cultures obtained from surface waters. Confirming prior studies, we found genes encoding proteins for aerobic ammonia oxidation and the hydrolysis of urea, which may be used for energy production, as well as genes involved in 3-hydroxypropionate/4-hydroxybutyrate and oxidative tricarboxylic acid pathways. A large proportion of protein sequences identified in MGI SAGs were absent in the marine cultures Cenarchaeum symbiosum and Nitrosopumilus maritimus, thus expanding the predicted protein space for this archaeal group. Identifiable genes located on genomic islands with low metagenome recruitment capacity were enriched in cellular defense functions, likely in response to viral infections or grazing. We show that MGI Thaumarchaeota in the dark ocean may have more flexibility in potential energy sources and adaptations to biotic interactions than the existing, surface-ocean cultures.
The human body consists of innumerable multifaceted environments that predispose colonization by a number of distinct microbial communities, which play fundamental roles in human health and disease. In addition to community surveys and shotgun metagenomes that seek to explore the composition and diversity of these microbiomes, there are significant efforts to sequence reference microbial genomes from many body sites of healthy adults. To illustrate the utility of reference genomes when studying more complex metagenomes, we present a reference-based analysis of sequence reads generated from 55 shotgun metagenomes, selected from 5 major body sites, including 16 sub-sites. Interestingly, between 13% and 92% (62.3% average) of these shotgun reads were aligned to a then-complete list of 2780 reference genomes, including 1583 references for the human microbiome. However, no reference genome was universally found in all body sites. For any given metagenome, the body site-specific reference genomes, derived from the same body site as the sample, accounted for an average of 58.8% of the mapped reads. While different body sites did differ in abundant genera, proximal or symmetrical body sites were found to be most similar to one another. The extent of variation observed, both between individuals sampled within the same microenvironment, or at the same site within the same individual over time, calls into question comparative studies across individuals even if sampled at the same body site. This study illustrates the high utility of reference genomes and the need for further site-specific reference microbial genome sequencing, even within the already well-sampled human microbiome.
OP9 is a yet-uncultivated bacterial lineage found in geothermal systems, petroleum reservoirs, anaerobic digesters, and wastewater treatment facilities. Here we use single-cell and metagenome sequencing to obtain two distinct, nearly-complete OP9 genomes, one constructed from single cells sorted from hot spring sediments and the other derived from binned metagenomic contigs from an in situ-enriched cellulolytic, thermophilic community. Phylogenomic analyses support the designation of OP9 as a candidate phylum for which we propose the name ‘Atribacteria’. Although a plurality of predicted proteins is most similar to those from Firmicutes, the presence of key genes suggests a diderm cell envelope. Metabolic reconstruction from the core genome suggests an anaerobic lifestyle based on sugar fermentation by Embden-Meyerhof glycolysis with production of hydrogen, acetate, and ethanol. Putative glycohydrolases and an endoglucanase may enable catabolism of (hemi)cellulose in thermal environments. This study lays a foundation for understanding the physiology and ecological role of the ‘Atribacteria’.
Comparison of genome-wide, high-resolution restriction maps of Klebsiella pneumoniae clinical isolates, including an NDM-1 producer, and in silico-generated restriction maps of sequenced genomes revealed a highly heterogeneous region we designated the “high heterogeneity zone” (HHZ). The HHZ consists of several regions including a “hot spot” prone to insertions and other rearrangements. The HHZ is a characteristic genomic area that can be used in the identification and tracking of outbreak-causing strains.
Klebsiella pneumoniae; Genomic analysis; optical map; NDM-1; ICE
Desulfosporosinus species are sulfate-reducing bacteria belonging to the Firmicutes. Their genomes will give insights into the genetic repertoire and evolution of sulfate reducers typically thriving in terrestrial environments and able to degrade toluene (Desulfosporosinus youngiae), to reduce Fe(III) (Desulfosporosinus meridiei, Desulfosporosinus orientis), and to grow under acidic conditions (Desulfosporosinus acidiphilus).
We describe the draft genome sequence of Pseudomonas putida strain LS46, a novel isolate that synthesizes medium-chain-length polyhydroxyalkanoates. The draft genome of P. putida LS46 consists of approximately 5.86 million bp, with a G+C content of 61.69%. A total of 5,316 annotated genes and 5,219 coding sequences (CDS) were identified.
Kingella kingae is a human oral bacterium that can cause infections of the skeletal system in children. The bacterium is also a cardiovascular pathogen causing infective endocarditis in children and adults. We report herein the draft genome sequence of septic arthritis K. kingae strain PYKK081.
In May of 2011, an enteroaggregative Escherichia coli O104:H4 strain that had acquired a Shiga toxin 2-converting phage caused a large outbreak of bloody diarrhea in Europe which was notable for its high prevalence of hemolytic uremic syndrome cases. Several studies have described the genomic inventory and phylogenies of strains associated with the outbreak and a collection of historical E. coli O104:H4 isolates using draft genome assemblies. We present the complete, closed genome sequences of an isolate from the 2011 outbreak (2011C–3493) and two isolates from cases of bloody diarrhea that occurred in the Republic of Georgia in 2009 (2009EL–2050 and 2009EL–2071). Comparative genome analysis indicates that, while the Georgian strains are the nearest neighbors to the 2011 outbreak isolates sequenced to date, structural and nucleotide-level differences are evident in the Stx2 phage genomes, the mer/tet antibiotic resistance island, and in the prophage and plasmid profiles of the strains, including a previously undescribed plasmid with homology to the pMT virulence plasmid of Yersinia pestis. In addition, multiphenotype analysis showed that 2009EL–2071 possessed higher resistance to polymyxin and membrane-disrupting agents. Finally, we show evidence by electron microscopy of the presence of a common phage morphotype among the European and Georgian strains and a second phage morphotype among the Georgian strains. The presence of at least two stx2 phage genotypes in host genetic backgrounds that may derive from a recent common ancestor of the 2011 outbreak isolates indicates that the emergence of stx2 phage-containing E. coli O104:H4 strains probably occurred more than once, or that the current outbreak isolates may be the result of a recent transfer of a new stx2 phage element into a pre-existing stx2-positive genetic background.
The Deepwater Horizon oil spill in the Gulf of Mexico resulted in a deep-sea hydrocarbon plume that caused a shift in the indigenous microbial community composition with unknown ecological consequences. Early in the spill history, a bloom of uncultured, thus uncharacterized, members of the Oceanospirillales was previously detected, but their role in oil disposition was unknown. Here our aim was to determine the functional role of the Oceanospirillales and other active members of the indigenous microbial community using deep sequencing of community DNA and RNA, as well as single-cell genomics. Shotgun metagenomic and metatranscriptomic sequencing revealed that genes for motility, chemotaxis and aliphatic hydrocarbon degradation were significantly enriched and expressed in the hydrocarbon plume samples compared with uncontaminated seawater collected from plume depth. In contrast, although genes coding for degradation of more recalcitrant compounds, such as benzene, toluene, ethylbenzene, total xylenes and polycyclic aromatic hydrocarbons, were identified in the metagenomes, they were expressed at low levels, or not at all based on analysis of the metatranscriptomes. Isolation and sequencing of two Oceanospirillales single cells revealed that both cells possessed genes coding for n-alkane and cycloalkane degradation. Specifically, the near-complete pathway for cyclohexane oxidation in the Oceanospirillales single cells was elucidated and supported by both metagenome and metatranscriptome data. The draft genome also included genes for chemotaxis, motility and nutrient acquisition strategies that were also identified in the metagenomes and metatranscriptomes. These data point towards a rapid response of members of the Oceanospirillales to aliphatic hydrocarbons in the deep sea.
Gulf oil spill; Deepwater Horizon; Oceanospirillales; single-cell genomics; metagenomics; metatranscriptomics
Here we present a standard developed by the Genomic Standards Consortium (GSC) for reporting marker gene sequences—the minimum information about a marker gene sequence (MIMARKS). We also introduce a system for describing the environment from which a biological sample originates. The ‘environmental packages’ apply to any genome sequence of known origin and can be used in combination with MIMARKS and other GSC checklists. Finally, to establish a unified standard for describing sequence data and to provide a single point of entry for the scientific community to access and learn about GSC checklists, we present the minimum information about any (x) sequence (MIxS). Adoption of MIxS will enhance our ability to analyze natural genetic diversity documented by massive DNA sequencing efforts from myriad ecosystems in our ever-changing biosphere.
Microbial hydrolysis of polysaccharides is critical to ecosystem functioning and is of great interest in diverse biotechnological applications, such as biofuel production and bioremediation. Here we demonstrate the use of a new, efficient approach to recover genomes of active polysaccharide degraders from natural, complex microbial assemblages, using a combination of fluorescently labeled substrates, fluorescence-activated cell sorting, and single cell genomics. We employed this approach to analyze freshwater and coastal bacterioplankton for degraders of laminarin and xylan, two of the most abundant storage and structural polysaccharides in nature. Our results suggest that a few phylotypes of Verrucomicrobia make a considerable contribution to polysaccharide degradation, although they constituted only a minor fraction of the total microbial community. Genomic sequencing of five cells, representing the most predominant, polysaccharide-active Verrucomicrobia phylotype, revealed significant enrichment in genes encoding a wide spectrum of glycoside hydrolases, sulfatases, peptidases, carbohydrate lyases and esterases, confirming that these organisms were well equipped for the hydrolysis of diverse polysaccharides. Remarkably, this enrichment was on average higher than in the sequenced representatives of Bacteroidetes, which are frequently regarded as highly efficient biopolymer degraders. These findings shed light on the ecological roles of uncultured Verrucomicrobia and suggest specific taxa as promising bioprospecting targets. The employed method offers a powerful tool to rapidly identify and recover discrete genomes of active players in polysaccharide degradation, without the need for cultivation.
The filamentous cyanobacterium Microcoleus vaginatusis found in arid land soils worldwide. The genome of M. vaginatusstrain FGP-2 allows exploration of genes involved in photosynthesis, desiccation tolerance, alkane production, and other features contributing to this organism's ability to function as a major component of biological soil crusts in arid lands.
Ochrobactrum anthropi is a common soil alphaproteobacterium that colonizes a wide spectrum of organisms and is being increasingly recognized as an opportunistic human pathogen. Potentially life-threatening infections, such as endocarditis, are included in the list of reported O. anthropi infections. These reports, together with the scant number of studies and the organism's phylogenetic proximity to the highly pathogenic brucellae, make O. anthropi an attractive model of bacterial pathogenicity. Here we report the genome sequence of the type strain O. anthropi ATCC 49188, which revealed the presence of two chromosomes and four plasmids.
Burkholderia phytofirmans PsJNT is able to efficiently colonize the rhizosphere, root, and above-ground plant tissues of a wide variety of genetically unrelated plants, such as potatoes, canola, maize, and grapevines. Strain PsJN shows strong plant growth-promoting effects and was reported to enhance plant vigor and resistance to biotic and abiotic stresses. Here, we report the genome sequence of this strain, which indicates the presence of multiple traits relevant for endophytic colonization and plant growth promotion.