Metatranscriptomic sequencing is a highly sensitive bioassay of functional activity in a microbial community, providing complementary information to the metagenomic sequencing of the community. The acquisition of the metatranscriptomic sequences will enable us to refine the annotations of the metagenomes, and to study the gene activities and their regulation in complex microbial communities and their dynamics.
In this paper, we present TransGeneScan, a software tool for finding genes in assembled transcripts from metatranscriptomic sequences. By incorporating several features of metatranscriptomic sequencing, including strand-specificity, short intergenic regions, and putative antisense transcripts into a Hidden Markov Model, TranGeneScan can predict a sense transcript containing one or multiple genes (in an operon) or an antisense transcript.
We tested TransGeneScan on a mock metatranscriptomic data set containing three known bacterial genomes. The results showed that TranGeneScan performs better than metagenomic gene finders (MetaGeneMark and FragGeneScan) on predicting protein coding genes in assembled transcripts, and achieves comparable or even higher accuracy than gene finders for microbial genomes (Glimmer and GeneMark). These results imply, with the assistance of metatranscriptomic sequencing, we can obtain a broad and precise picture about the genes (and their functions) in a microbial community.
TransGeneScan is available as open-source software on SourceForge at https://sourceforge.net/projects/transgenescan/.
Metatranscriptomics; Gene finding; Hidden Markov Model; Operons; Antisense RNA (asRNA)
Diversity-generating retroelements (DGRs) are a unique family of retroelements that confer selective advantages to their hosts by accelerating the evolution of target genes through a specialized, error-prone, reverse transcription process. First identified in a Bordetella phage (BPP-1), which mediates the phage tropism specificity by generating variability in an involved gene, DGRs were predicted to be present in a larger collection of viral and bacterial species. A minimal DGR system is comprised of a reverse transcriptase (RTase) gene, a template sequence (TR) and a variable region (VR) within a target gene. We developed a computational tool, DGRscan, to allow either de novo identification (based on the prediction of potential template-variable region pairs) or similarity-based searches of DGR systems using known template sequences as the reference. The application of DGRscan to the human microbiome project (HMP) datasets resulted in the identification of 271 non-redundant DGR systems, doubling the size of the collection of known DGR systems. We further identified a large number of putative target genes (651, which share no more than 90% sequence identity at the amino acid level) that are potentially under diversification by the DGR systems. Our study provides the first survey of the DGR systems in the human microbiome, showing that the DGR systems are frequently found in human-associated bacterial communities, although they are of low incidence in individual genomes. Our study also provides functional clues for a large number of genes (reverse transcriptases and target genes) that were previously annotated as proteins of unknown functions or nonspecific functions.
DGR; human microbiome; DGRscan; reverse transcriptase (RTase); template region (TR); variable region (VR)
POSA (Partial Order Structure Alignment), available at http://posa.godziklab.org, is a server for multiple protein structure alignment introduced in 2005 (Ye,Y. and Godzik,A. (2005) Multiple flexible structure alignment using partial order graphs. Bioinformatics, 21, 2362–2369). It is free and open to all users, and there is no login requirement, albeit there is an option to register and store results in individual, password-protected directories. In the updated POSA server described here, we introduce two significant improvements. First is an interface allowing the user to provide additional information by defining segments that anchor the alignment in one or more input structures. This interface allows users to take advantage of their intuition and biological insights to improve the alignment and guide it toward a biologically relevant solution. The second improvement is an interactive visualization with options that allow the user to view all superposed structures in one window (a typical solution for visualizing results of multiple structure alignments) or view them individually in a series of synchronized windows with extensive, user-controlled visualization options. The user can rotate structure(s) in any of the windows and study similarities or differences between structures clearly visible in individual windows.
The molecular basis of allosteric regulation remains a subject of intense interest. Staphylococcus aureus CzrA is a member of the ubiquitous arsenic repressor (ArsR) family of bacterial homodimeric metal sensing proteins, and has emerged as a model system for understanding allosteric regulation of operator DNA binding by transition metal ions. Using unnatural amino acid substitution and a standard linkage analysis, we show that a His97’ NHε2•••O=C-His67 quaternary structural hydrogen bond is an energetically significant contributor to the magnitude of the allosteric coupling free energy, ΔGc. A “cavity” introduced just beneath this hydrogen bond in V66A/L68V CzrA results in a dramatic loss of regulation by Zn(II) despite adopting a wild-type global structure and Zn(II) binding and DNA binding affinities only minimally affected from wild-type. The energetics of Zn(II) binding and heterotropic coupling free energies (ΔHc, −TΔSc) of the double mutant are also radically altered and suggest that increased internal dynamics leads to poorer allosteric negative regulation in V66A/L68V CzrA. A statistical coupling analysis of 3000 ArsR proteins reveals a sector that links the DNA-binding determinants and the α5 Zn(II) sensing sites through V66/L68 in CzrA. We propose that distinct regulatory sites uniquely characteristic of individual ArsR proteins results from evolution of distinct connectivities to this sector, each capable of driving the same biological outcome, transcriptional derepression.
allostery; zinc sensor; metal homeostasis; statistical coupling analysis; ArsR
The CRISPR (clusters of regularly interspaced short palindromic repeats)–Cas adaptive immune system is an important defense system in bacteria, providing targeted defense against invasions of foreign nucleic acids. CRISPR–Cas systems consist of CRISPR loci and cas (CRISPR-associated) genes: sequence segments of invaders are incorporated into host genomes at CRISPR loci to generate specificity, while adjacent cas genes encode proteins that mediate the defense process. We pursued an integrated approach to identifying putative cas genes from genomes and metagenomes, combining similarity searches with genomic neighborhood analysis. Application of our approach to bacterial genomes and human microbiome datasets allowed us to significantly expand the collection of cas genes: the sequence space of the Cas9 family, the key player in the recently engineered RNA-guided platforms for genome editing in eukaryotes, is expanded by at least two-fold with metagenomic datasets. We found genes in cas loci encoding other functions, for example, toxins and antitoxins, confirming the recently discovered potential of coupling between adaptive immunity and the dormancy/suicide systems. We further identified 24 novel Cas families; one novel family contains 20 proteins, all identified from the human microbiome datasets, illustrating the importance of metagenomics projects in expanding the diversity of cas genes.
The wide applications of next-generation sequencing (NGS) technologies in metagenomics have raised many computational challenges. One of the essential problems in metagenomics is to estimate the taxonomic composition of a microbial community, which can be approached by mapping shotgun reads acquired from the community to previously characterized microbial genomes followed by quantity profiling of these species based on the number of mapped reads. This procedure, however, is not as trivial as it appears at first glance. A shotgun metagenomic dataset often contains DNA sequences from many closely-related microbial species (e.g., within the same genus) or strains (e.g., within the same species), thus it is often difficult to determine which species/strain a specific read is sampled from when it can be mapped to a common region shared by multiple genomes at high similarity. Furthermore, high genomic variations are observed among individual genomes within the same species, which are difficult to be differentiated from the inter-species variations during reads mapping. To address these issues, a commonly used approach is to quantify taxonomic distribution only at the genus level, based on the reads mapped to all species belonging to the same genus; alternatively, reads are mapped to a set of representative genomes, each selected to represent a different genus. Here, we introduce a novel approach to the quantity estimation of closely-related species within the same genus by mapping the reads to their genomes represented by a de Bruijn graph, in which the common genomic regions among them are collapsed. Using simulated and real metagenomic datasets, we show the de Bruijn graph approach has several advantages over existing methods, including (1) it avoids redundant mapping of shotgun reads to multiple copies of the common regions in different genomes, and (2) it leads to more accurate quantification for the closely-related species (and even for strains within the same species).
closely-related genomes; de Bruijn graph; metagenomics; quantification
Bacteria and archaea develop immunity against invading genomes by incorporating pieces of the invaders' sequences, called spacers, into a clustered regularly interspaced short palindromic repeats (CRISPR) locus between repeats, forming arrays of repeat-spacer units. When spacers are expressed, they direct CRISPR-associated (Cas) proteins to silence complementary invading DNA. In order to characterize the invaders of human microbiomes, we use spacers from CRISPR arrays that we had previously assembled from shotgun metagenomic datasets, and identify contigs that contain these spacers' targets.
We discover 95,000 contigs that are putative invasive mobile genetic elements, some targeted by hundreds of CRISPR spacers. We find that oral sites in healthy human populations have a much greater variety of mobile genetic elements than stool samples. Mobile genetic elements carry genes encoding diverse functions: only 7% of the mobile genetic elements are similar to known phages or plasmids, although a much greater proportion contain phage- or plasmid-related genes. A small number of contigs share similarity with known integrative and conjugative elements, providing the first examples of CRISPR defenses against this class of element. We provide detailed analyses of a few large mobile genetic elements of various types, and a relative abundance analysis of mobile genetic elements and putative hosts, exploring the dynamic activities of mobile genetic elements in human microbiomes. A joint analysis of mobile genetic elements and CRISPRs shows that protospacer-adjacent motifs drive their interaction network; however, some CRISPR-Cas systems target mobile genetic elements lacking motifs.
We identify a large collection of invasive mobile genetic elements in human microbiomes, an important resource for further study of the interaction between the CRISPR-Cas immune system and invaders.
CRISPR-Cas system; human microbiome; mobile genetic element (MGE)
Mycobacterium tuberculosis is an obligate human respiratory pathogen that encodes approximately ten arsenic repressor (ArsR) family regulatory proteins that allow the organism to respond to a wide range of changes in its immediate microenvironment. How individual ArsR repressors have evolved to respond to selective stimuli is of intrinsic interest. The Ni(II)/Co(II)-specific repressor NmtR and related actinomycete nickel sensors harbor a conserved N-terminal αNH2-Gly2-His3-Gly4 sequence. Here, we present the solution structure of homodimeric apo-NmtR and show that the core of the molecule adopts a typical winged-helix ArsR repressor (α1-α2-α3-αR-β1-β2-α5) “open conformation” that is similar to the related zinc sensor Staphylococcus aureus CzrA, but harboring long, flexible N-terminal (residues 2-16) and C-terminal (residues 109-120) extensions. Ni(II) binding to the regulatory sites induces strong paramagnetic broadening of the α5 helical region and the extreme N-terminal tail to residue 10. Ratiometric pulse chase amidination mass spectrometry reveals that the rate of amidination of the Gly2 α-amino group is strongly attenuated in the Ni(II) complex relative to the apo-state and non-cognate Zn(II) complex. Ni(II) binding also induces dynamic disorder in the μs-ms timescale of key DNA interacting regions that likely contributes to the negative regulation of DNA binding by Ni(II). Molecular dynamics simulations and quantum chemical calculations reveal that NmtR readily accommodates a distal Ni(II) hexacoordination model involving the α-amine and His3 of the N-terminal region and α5 residues Asp91′, His93′, His104 and His107, which collectively define a new metal sensing site configuration in ArsR family regulators.
Shotgun metagenomics has been applied to the studies of the functionality of various microbial communities. As a critical analysis step in these studies, biological pathways are reconstructed based on the genes predicted from metagenomic shotgun sequences. Pathway reconstruction provides insights into the functionality of a microbial community and can be used for comparing multiple microbial communities. The utilization of pathway reconstruction, however, can be jeopardized because of imperfect functional annotation of genes, and ambiguity in the assignment of predicted enzymes to biochemical reactions (e.g., some enzymes are involved in multiple biochemical reactions). Considering that metabolic functions in a microbial community are carried out by many enzymes in a collaborative manner, we present a probabilistic sampling approach to profiling functional content in a metagenomic dataset, by sampling functions of catalytically promiscuous enzymes within the context of the entire metabolic network defined by the annotated metagenome. We test our approach on metagenomic datasets from environmental and human-associated microbial communities. The results show that our approach provides a more accurate representation of the metabolic activities encoded in a metagenome, and thus improves the comparative analysis of multiple microbial communities. In addition, our approach reports likelihood scores of putative reactions, which can be used to identify important reactions and metabolic pathways that reflect the environmental adaptation of the microbial communities. Source code for sampling metabolic networks is available online at http://omics.informatics.indiana.edu/mg/MetaNetSam/.
We present a probabilistic sampling approach to profiling metabolic reactions in a microbial community from metagenomic shotgun reads, in an attempt to understand the metabolism within a microbial community and compare them across multiple communities. Different from the conventional pathway reconstruction approaches that aim at a definitive set of reactions, our method estimates how likely each annotated reaction can occur in the metabolism of the microbial community, given the shotgun sequencing data. This probabilistic measure improves our prediction of the actual metabolism in the microbial communities and can be used in the comparative functional analysis of metagenomic data.
The NIH Human Microbiome Project (HMP) has produced several hundred metagenomic data sets, allowing studies of the many functional elements in human-associated microbial communities. Here, we survey the distribution of oral spirochetes implicated in dental diseases in normal human individuals, using recombination sites associated with the chromosomal integron in Treponema genomes, taking advantage of the multiple copies of the integron recombination sites (repeats) in the genomes, and using a targeted assembly approach that we have developed. We find that integron-containing Treponema species are present in ∼80% of the normal human subjects included in the HMP. Further, we are able to de novo assemble the integron gene cassettes using our constrained assembly approach, which employs a unique application of the de Bruijn graph assembly information; most of these cassette genes were not assembled in whole-metagenome assemblies and could not be identified by mapping sequencing reads onto the known reference Treponema genomes due to the dynamic nature of integron gene cassettes. Our study significantly enriches the gene pool known to be carried by Treponema chromosomal integrons, totaling 826 (598 97% nonredundant) genes. We characterize the functions of these gene cassettes: many of these genes have unknown functions. The integron gene cassette arrays found in the human microbiome are extraordinarily dynamic, with different microbial communities sharing only a small number of common genes.
Integron systems are now recognized as important agents of bacterial evolution and are prevalent in most environments. One of the human pathogens known to harbor chromosomal integrons, the Treponema spirochetes are the only clade among spirochete species found to carry integrons. With the recent release of many new Treponema genomes, we were able to study the distribution of chromosomal integrons in this genus.
We find that the Treponema spirochetes implicated in human periodontal diseases and those isolated from cow and swine intestines contain chromosomal integrons, but not the Treponema species isolated from termite guts. By examining the species tree of selected spirochetes (based on 31 phylogenetic marker genes) and the phylogenetic tree of predicted integron integrases, and assisted by our analysis of predicted integron recombination sites, we found that all integron systems identified in Treponema spirochetes are likely to have evolved from a common ancestor—a horizontal gain into the clade. Subsequent to this event, the integron system was lost in the branch leading to the speciation of T. pallidum and T. phagedenis (the Treponema sps. implicated in sexually transmitted diseases). We also find that the lengths of the integron attC sites shortened through Treponema speciation, and that the integron gene cassettes of T. denticola are highly strain specific.
This is the first comprehensive study to characterize the chromosomal integron systems in Treponema species. By characterizing integron distribution and cassette contents in the Treponema sps., we link the integrons to the speciation of the various species, especially to the pathogens T. pallidum and T. phagedenis.
Chromosomal integron; Treponema species; Integron integrase; attC site
We describe the draft genome of the microcrustacean Daphnia pulex, which is only 200 Mb and contains at least 30,907 genes. The high gene count is a consequence of an elevated rate of gene duplication resulting in tandem gene clusters. More than 1/3 of Daphnia’s genes have no detectable homologs in any other available proteome, and the most amplified gene families are specific to the Daphnia lineage. The co-expansion of gene families interacting within metabolic pathways suggests that the maintenance of duplicated genes is not random, and the analysis of gene expression under different environmental conditions reveals that numerous paralogs acquire divergent expression patterns soon after duplication. Daphnia-specific genes – including many additional loci within sequenced regions that are otherwise devoid of annotations – are the most responsive genes to ecological challenges.
A variety of microbial communities and their genes (microbiome) exist throughout the human body, playing fundamental roles in human health and disease. The NIH funded Human Microbiome Project (HMP) Consortium has established a population-scale framework which catalyzed significant development of metagenomic protocols resulting in a broad range of quality-controlled resources and data including standardized methods for creating, processing and interpreting distinct types of high-throughput metagenomic data available to the scientific community. Here we present resources from a population of 242 healthy adults sampled at 15 to 18 body sites up to three times, which to date, have generated 5,177 microbial taxonomic profiles from 16S rRNA genes and over 3.5 Tb of metagenomic sequence. In parallel, approximately 800 human-associated reference genomes have been sequenced. Collectively, these data represent the largest resource to date describing the abundance and variety of the human microbiome, while providing a platform for current and future studies.
Motivation: One of the difficulties in metagenomic assembly is that homologous genes from evolutionarily closely related species may behave like repeats and confuse assemblers. As a result, small contigs, each representing a short gene fragment, instead of complete genes, may be reported by an assembler. This further complicates annotation of metagenomic datasets, as annotation tools (such as gene predictors or similarity search tools) typically perform poorly on configs encoding short gene fragments.
Results: We present a novel way of using the de Bruijn graph assembly of metagenomes to improve the assembly of genes. A network matching algorithm is proposed for matching the de Bruijn graph of contigs against reference genes, to derive ‘gene paths’ in the graph (sequences of contigs containing gene fragments) that have the highest similarities to known genes, allowing gene fragments contained in multiple contigs to be connected to form more complete (or intact) genes. Tests on simulated and real datasets show that our approach (called GeneStitch) is able to significantly improve the assembly of genes from metagenomic sequences, by connecting contigs with the guidance of homologous genes—information that is orthogonal to the sequencing reads. We note that the improvement of gene assembly can be observed even when only distantly related genes are available as the reference. We further propose to use ‘gene graphs’ to represent the assembly of reads from homologous genes and discuss potential applications of gene graphs to improving functional annotation for metagenomics.
Availability: The tools are available as open source for download at http://omics.informatics.indiana.edu/GeneStitch
The goal of the Human Microbiome Project (HMP) is to generate a comprehensive catalog of human-associated microorganisms including reference genomes representing the most common species. Toward this goal, the HMP has characterized the microbial communities at 18 body habitats in a cohort of over 200 healthy volunteers using 16S rRNA gene (16S) sequencing and has generated nearly 1,000 reference genomes from human-associated microorganisms. To determine how well current reference genome collections capture the diversity observed among the healthy microbiome and to guide isolation and future sequencing of microbiome members, we compared the HMP’s 16S data sets to several reference 16S collections to create a ‘most wanted’ list of taxa for sequencing. Our analysis revealed that the diversity of commonly occurring taxa within the HMP cohort microbiome is relatively modest, few novel taxa are represented by these OTUs and many common taxa among HMP volunteers recur across different populations of healthy humans. Taken together, these results suggest that it should be possible to perform whole-genome sequencing on a large fraction of the human microbiome, including the ‘most wanted’, and that these sequences should serve to support microbiome studies across multiple cohorts. Also, in stark contrast to other taxa, the ‘most wanted’ organisms are poorly represented among culture collections suggesting that novel culture- and single-cell-based methods will be required to isolate these organisms for sequencing.
We explore the microbiota of 18 body sites in over 200 individuals using sequences amplified V1–V3 and the V3–V5 small subunit ribosomal RNA (16S) hypervariable regions as part of the NIH Common Fund Human Microbiome Project. The body sites with the greatest number of core OTUs, defined as OTUs shared amongst 95% or more of the individuals, were the oral sites (saliva, tongue, cheek, gums, and throat) followed by the nose, stool, and skin, while the vaginal sites had the fewest number of OTUs shared across subjects. We found that commonalities between samples based on taxonomy could sometimes belie variability at the sub-genus OTU level. This was particularly apparent in the mouth where a given genus can be present in many different oral sites, but the sub-genus OTUs show very distinct site selection, and in the vaginal sites, which are consistently dominated by the Lactobacillus genus but have distinctly different sub-genus V1–V3 OTU populations across subjects. Different body sites show approximately a ten-fold difference in estimated microbial richness, with stool samples having the highest estimated richness, followed by the mouth, throat and gums, then by the skin, nasal and vaginal sites. Richness as measured by the V1–V3 primers was consistently higher than richness measured by V3–V5. We also show that when such a large cohort is analyzed at the genus level, most subjects fit the stool “enterotype” profile, but other subjects are intermediate, blurring the distinction between the enterotypes. When analyzed at the finer-scale, OTU level, there was little or no segregation into stool enterotypes, but in the vagina distinct biotypes were apparent. Finally, we note that even OTUs present in nearly every subject, or that dominate in some samples, showed orders of magnitude variation in relative abundance emphasizing the highly variable nature across individuals.
CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) loci, together with cas (CRISPR–associated) genes, form the CRISPR/Cas adaptive immune system, a primary defense strategy that eubacteria and archaea mobilize against foreign nucleic acids, including phages and conjugative plasmids. Short spacer sequences separated by the repeats are derived from foreign DNA and direct interference to future infections. The availability of hundreds of shotgun metagenomic datasets from the Human Microbiome Project (HMP) enables us to explore the distribution and diversity of known CRISPRs in human-associated microbial communities and to discover new CRISPRs. We propose a targeted assembly strategy to reconstruct CRISPR arrays, which whole-metagenome assemblies fail to identify. For each known CRISPR type (identified from reference genomes), we use its direct repeat consensus sequence to recruit reads from each HMP dataset and then assemble the recruited reads into CRISPR loci; the unique spacer sequences can then be extracted for analysis. We also identified novel CRISPRs or new CRISPR variants in contigs from whole-metagenome assemblies and used targeted assembly to more comprehensively identify these CRISPRs across samples. We observed that the distributions of CRISPRs (including 64 known and 86 novel ones) are largely body-site specific. We provide detailed analysis of several CRISPR loci, including novel CRISPRs. For example, known streptococcal CRISPRs were identified in most oral microbiomes, totaling ∼8,000 unique spacers: samples resampled from the same individual and oral site shared the most spacers; different oral sites from the same individual shared significantly fewer, while different individuals had almost no common spacers, indicating the impact of subtle niche differences on the evolution of CRISPR defenses. We further demonstrate potential applications of CRISPRs to the tracing of rare species and the virus exposure of individuals. This work indicates the importance of effective identification and characterization of CRISPR loci to the study of the dynamic ecology of microbiomes.
Human bodies are complex ecological systems in which various microbial organisms and viruses interact with each other and with the human host. The Human Microbiome Project (HMP) has resulted in >700 datasets of shotgun metagenomic sequences, from which we can learn about the compositions and functions of human-associated microbial communities. CRISPR/Cas systems are a widespread class of adaptive immune systems in bacteria and archaea, providing acquired immunity against foreign nucleic acids: CRISPR/Cas defense pathways involve integration of viral- or plasmid-derived DNA segments into CRISPR arrays (forming spacers between repeated structural sequences), and expression of short crRNAs from these single repeat-spacer units, to generate interference to future invading foreign genomes. Powered by an effective computational approach (the targeted assembly approach for CRISPR), our analysis of CRISPR arrays in the HMP datasets provides the very first global view of bacterial immunity systems in human-associated microbial communities. The great diversity of CRISPR spacers we observed among different body sites, in different individuals, and in single individuals over time, indicates the impact of subtle niche differences on the evolution of CRISPR defenses and indicates the key role of bacteriophage (and plasmids) in shaping human microbial communities.
16S rRNA gene profiling has recently been boosted by the development of pyrosequencing methods. A common analysis is to group pyrosequences into Operational Taxonomic Units (OTUs), such that reads in an OTU are likely sampled from the same species. However, species diversity estimated from error-prone 16S rRNA pyrosequences may be inflated because the reads sampled from the same 16S rRNA gene may appear different, and current OTU inference approaches typically involve time-consuming pairwise/multiple distance calculation and clustering. I propose a novel approach AbundantOTU based on a Consensus Alignment (CA) algorithm, which infers consensus sequences, each representing an OTU, taking advantage of the sequence redundancy for abundant species. Pyrosequencing reads can then be recruited to the consensus sequences to give quantitative information for the corresponding species. As tested on 16S rRNA pyrosequence datasets from mock communities with known species, AbundantOTU rapidly reported identified sequences of the source 16S rRNAs and the abundances of the corresponding species. AbundantOTU was also applied to 16S rRNA pyrosequence datasets derived from real microbial communities and the results are in general agreement with previous studies.
16S rRNA gene; pyrosequencing; Operational Taxonomic Unit (OTU); abundant species
Summary: With the wide application of next-generation sequencing (NGS) techniques, fast tools for protein similarity search that scale well to large query datasets and large databases are highly desirable. In a previous work, we developed RAPSearch, an algorithm that achieved a ~20–90-fold speedup relative to BLAST while still achieving similar levels of sensitivity for short protein fragments derived from NGS data. RAPSearch, however, requires a substantial memory footprint to identify alignment seeds, due to its use of a suffix array data structure. Here we present RAPSearch2, a new memory-efficient implementation of the RAPSearch algorithm that uses a collision-free hash table to index a similarity search database. The utilization of an optimized data structure further speeds up the similarity search—another 2–3 times. We also implemented multi-threading in RAPSearch2, and the multi-thread modes achieve significant acceleration (e.g. 3.5X for 4-thread mode). RAPSearch2 requires up to 2G memory when running in single thread mode, or up to 3.5G memory when running in 4-thread mode.
Availability and implementation: Implemented in C++, the source code is freely available for download at the RAPSearch2 website: http://omics.informatics.indiana.edu/mg/RAPSearch2/.
Supplementary information: Available at the RAPSearch2 website.
Next Generation Sequencing (NGS) is producing enormous corpuses of short DNA reads, affecting emerging fields like metagenomics. Protein similarity search--a key step to achieve annotation of protein-coding genes in these short reads, and identification of their biological functions--faces daunting challenges because of the very sizes of the short read datasets.
We developed a fast protein similarity search tool RAPSearch that utilizes a reduced amino acid alphabet and suffix array to detect seeds of flexible length. For short reads (translated in 6 frames) we tested, RAPSearch achieved ~20-90 times speedup as compared to BLASTX. RAPSearch missed only a small fraction (~1.3-3.2%) of BLASTX similarity hits, but it also discovered additional homologous proteins (~0.3-2.1%) that BLASTX missed. By contrast, BLAT, a tool that is even slightly faster than RAPSearch, had significant loss of sensitivity as compared to RAPSearch and BLAST.
RAPSearch is implemented as open-source software and is accessible at http://omics.informatics.indiana.edu/mg/RAPSearch. It enables faster protein similarity search. The application of RAPSearch in metageomics has also been demonstrated.
short reads; similarity search; suffix array; reduced amino acid alphabet; metagenomics
Metagenomics is the study of microbial communities sampled directly from their natural environment, without prior culturing. Among the computational tools recently developed for metagenomic sequence analysis, binning tools attempt to classify the sequences in a metagenomic dataset into different bins (i.e., species), based on various DNA composition patterns (e.g., the tetramer frequencies) of various genomes. Composition-based binning methods, however, cannot be used to classify very short fragments, because of the substantial variation of DNA composition patterns within a single genome. We developed a novel approach (AbundanceBin) for metagenomics binning by utilizing the different abundances of species living in the same environment. AbundanceBin is an application of the Lander-Waterman model to metagenomics, which is based on the l-tuple content of the reads. AbundanceBin achieved accurate, unsupervised, clustering of metagenomic sequences into different bins, such that the reads classified in a bin belong to species of identical or very similar abundances in the sample. In addition, AbundanceBin gave accurate estimations of species abundances, as well as their genome sizes—two important parameters for characterizing a microbial community. We also show that AbundanceBin performed well when the sequence lengths are very short (e.g., 75 bp) or have sequencing errors. By combining AbundanceBin and a composition-based method (MetaCluster), we can achieve even higher binning accuracy. Supplementary Material is available at www.liebertonline.com/cmb.
binning; EM algorithm; metagenomics; Poisson distribution
The advances of next-generation sequencing technology have facilitated metagenomics research that attempts to determine directly the whole collection of genetic material within an environmental sample (i.e. the metagenome). Identification of genes directly from short reads has become an important yet challenging problem in annotating metagenomes, since the assembly of metagenomes is often not available. Gene predictors developed for whole genomes (e.g. Glimmer) and recently developed for metagenomic sequences (e.g. MetaGene) show a significant decrease in performance as the sequencing error rates increase, or as reads get shorter. We have developed a novel gene prediction method FragGeneScan, which combines sequencing error models and codon usages in a hidden Markov model to improve the prediction of protein-coding region in short reads. The performance of FragGeneScan was comparable to Glimmer and MetaGene for complete genomes. But for short reads, FragGeneScan consistently outperformed MetaGene (accuracy improved ∼62% for reads of 400 bases with 1% sequencing errors, and ∼18% for short reads of 100 bases that are error free). When applied to metagenomes, FragGeneScan recovered substantially more genes than MetaGene predicted (>90% of the genes identified by homology search), and many novel genes with no homologs in current protein sequence database.
Metagenomics is the study of microbial communities sampled directly from their natural environment, without prior culturing. By enabling an analysis of populations including many (so-far) unculturable and often unknown microbes, metagenomics is revolutionizing the field of microbiology, and has excited researchers in many disciplines that could benefit from the study of environmental microbes, including those in ecology, environmental sciences, and biomedicine. Specific computational and statistical tools have been developed for metagenomic data analysis and comparison. New studies, however, have revealed various kinds of artifacts present in metagenomics data caused by limitations in the experimental protocols and/or inadequate data analysis procedures, which often lead to incorrect conclusions about a microbial community. Here, we review some of the artifacts, such as overestimation of species diversity and incorrect estimation of gene family frequencies, and discuss emerging computational approaches to address them. We also review potential challenges that metagenomics may encounter with the extensive application of next-generation sequencing (NGS) techniques.
Metagenomics; next-generation sequencing (NGS); taxonomic/functional profiling; statistical approaches; comparative metagenomics
Metagenomics is an emerging methodology for the direct genomic analysis of a mixed community of uncultured microorganisms. The current analyses of metagenomics data largely rely on the computational tools originally designed for microbial genomics projects. The challenge of assembling metagenomic sequences arises mainly from the short reads and the high species complexity of the community. Alternatively, individual (short) reads will be searched directly against databases of known genes (or proteins) to identify homologous sequences. The latter approach may have low sensitivity and specificity in identifying homologous sequences, which may further bias the subsequent diversity analysis. In this paper, we present a novel approach to metagenomic data analysis, called Metagenomic ORFome Assembly (MetaORFA). The whole computational framework consists of three steps. Each read from a metagenomics project will first be annotated with putative open reading frames (ORFs) that likely encode proteins. Next, the predicted ORFs are assembled into a collection of peptides using an EULER assembly method. Finally, the assembled peptides (i.e., ORFome) are used for database searching of homologs and subsequent diversity analysis. We applied MetaORFA approach to several metagenomics datasets with low coverage short reads. The results show that MetaORFA can produce long peptides even when the sequence coverage of reads is extremely low. Hence, the ORFome assembly significantly increased the sensitivity of homology searching, and may potentially improve the diversity analysis of the metagenomic data. This improvement is especially useful for the metagenomic projects when the genome assembly does not work because of the low sequence coverage.
Metagenomics; ORFome; ORFome assembly; Function annotation
A common biological pathway reconstruction approach—as implemented by many automatic biological pathway services (such as the KAAS and RAST servers) and the functional annotation of metagenomic sequences—starts with the identification of protein functions or families (e.g., KO families for the KEGG database and the FIG families for the SEED database) in the query sequences, followed by a direct mapping of the identified protein families onto pathways. Given a predicted patchwork of individual biochemical steps, some metric must be applied in deciding what pathways actually exist in the genome or metagenome represented by the sequences. Commonly, and straightforwardly, a complete biological pathway can be identified in a dataset if at least one of the steps associated with the pathway is found. We report, however, that this naïve mapping approach leads to an inflated estimate of biological pathways, and thus overestimates the functional diversity of the sample from which the DNA sequences are derived. We developed a parsimony approach, called MinPath (Minimal set of Pathways), for biological pathway reconstructions using protein family predictions, which yields a more conservative, yet more faithful, estimation of the biological pathways for a query dataset. MinPath identified far fewer pathways for the genomes collected in the KEGG database—as compared to the naïve mapping approach—eliminating some obviously spurious pathway annotations. Results from applying MinPath to several metagenomes indicate that the common methods used for metagenome annotation may significantly overestimate the biological pathways encoded by microbial communities.
Even though there is only a single large biological network within any cell and all pathways are to some extent connected, the partition of the entire cellular network into smaller units (e.g., KEGG pathways) is extremely important for understanding biological processes. Biological pathway reconstruction, therefore, is essential for understanding the biological functions that a newly sequenced genome encodes and recently for studying the functionality of a natural environment via metagenomics. The common practice of pathway reconstruction in metagenomics first identifies functions encoded by the metagenomic sequences and then reconstructs pathways from the annotated functions by mapping the functions to reference pathways. To address the issues of both incomplete data (e.g., metagenomes, unlike individual genomes, are most likely incomplete) and pathway redundancy (e.g., the same function is involved in multiple pathway units), we formulate a parsimony version of the pathway reconstruction/inference problem, called MinPath (Minimal set of Pathways): given a set of reference pathways and a set of functions that can be mapped to one or more pathways, MinPath aims at finding a minimum number of pathways that can explain all functions. MinPath achieves a more conservative, yet more faithful, estimation of the biological pathways encoded by genomes and metagenomes.