The wide applications of next-generation sequencing (NGS) technologies in metagenomics have raised many computational challenges. One of the essential problems in metagenomics is to estimate the taxonomic composition of a microbial community, which can be approached by mapping shotgun reads acquired from the community to previously characterized microbial genomes followed by quantity profiling of these species based on the number of mapped reads. This procedure, however, is not as trivial as it appears at first glance. A shotgun metagenomic dataset often contains DNA sequences from many closely-related microbial species (e.g., within the same genus) or strains (e.g., within the same species), so it is often difficult to determine which species/strain a specific read is sampled from when it can be mapped to a common region shared by multiple genomes at high similarity. Furthermore, high genomic variations are observed among individual genomes within the same species, and these are difficult to differentiate from inter-species variations during read mapping. To address these issues, a commonly used approach is to quantify taxonomic distribution only at the genus level, based on the reads mapped to all species belonging to the same genus; alternatively, reads are mapped to a set of representative genomes, each selected to represent a different genus. Here, we introduce a novel approach to the quantity estimation of closely-related species within the same genus by mapping the reads to their genomes represented by a de Bruijn graph, in which the common genomic regions among them are collapsed.
Using simulated and real metagenomic datasets, we show the de Bruijn graph approach has several advantages over existing methods, including (1) it avoids redundant mapping of shotgun reads to multiple copies of the common regions in different genomes, and (2) it leads to more accurate quantification for the closely-related species (and even for strains within the same species).
closely-related genomes; de Bruijn graph; metagenomics; quantification
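The idea of collapsing common regions can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes exact k-mer matching, stores the graph only as a dictionary of k-mer edges labelled with the genomes that contain them, and uses hypothetical names (`build_colored_edges`, `assign_reads`) and toy sequences. Each shared k-mer is stored once, so a read from a common region is counted a single time against the set of genomes it is compatible with.

```python
from collections import defaultdict


def build_colored_edges(genomes, k):
    """Map each k-mer (an edge of the de Bruijn graph) to the genomes containing it.

    A k-mer shared by several genomes appears once, labelled with all of them,
    which is how common regions end up collapsed.
    """
    edges = defaultdict(set)
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            edges[seq[i:i + k]].add(name)
    return edges


def assign_reads(reads, edges, k):
    """Count reads per set of genomes compatible with every k-mer of the read."""
    counts = defaultdict(int)
    for read in reads:
        compatible = None
        for i in range(len(read) - k + 1):
            owners = edges.get(read[i:i + k])
            if not owners:            # k-mer absent from the graph: read unmapped
                compatible = set()
                break
            compatible = set(owners) if compatible is None else compatible & owners
        if compatible:                # skips unmapped reads and reads shorter than k
            counts[frozenset(compatible)] += 1
    return dict(counts)


# Toy genomes sharing the prefix ACGTAC (illustrative data only).
genomes = {"A": "ACGTACGGA", "B": "ACGTACTTA"}
edges = build_colored_edges(genomes, k=4)
counts = assign_reads(["ACGTAC", "TACGGA", "TACTTA"], edges, k=4)
```

In this toy case the first read falls in the shared region and is counted once for the pair {A, B}, while the other two reads are specific to A and B respectively. How ambiguous counts are then distributed among genomes is left out of the sketch.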
Mycobacterium tuberculosis is an obligate human respiratory pathogen that encodes approximately ten arsenic repressor (ArsR) family regulatory proteins that allow the organism to respond to a wide range of changes in its immediate microenvironment. How individual ArsR repressors have evolved to respond to selective stimuli is of intrinsic interest. The Ni(II)/Co(II)-specific repressor NmtR and related actinomycete nickel sensors harbor a conserved N-terminal αNH2-Gly2-His3-Gly4 sequence. Here, we present the solution structure of homodimeric apo-NmtR and show that the core of the molecule adopts a typical winged-helix ArsR repressor (α1-α2-α3-αR-β1-β2-α5) “open conformation” that is similar to the related zinc sensor Staphylococcus aureus CzrA, but harboring long, flexible N-terminal (residues 2-16) and C-terminal (residues 109-120) extensions. Ni(II) binding to the regulatory sites induces strong paramagnetic broadening of the α5 helical region and the extreme N-terminal tail to residue 10. Ratiometric pulse chase amidination mass spectrometry reveals that the rate of amidination of the Gly2 α-amino group is strongly attenuated in the Ni(II) complex relative to the apo-state and non-cognate Zn(II) complex. Ni(II) binding also induces dynamic disorder in the μs-ms timescale of key DNA interacting regions that likely contributes to the negative regulation of DNA binding by Ni(II). Molecular dynamics simulations and quantum chemical calculations reveal that NmtR readily accommodates a distal Ni(II) hexacoordination model involving the α-amine and His3 of the N-terminal region and α5 residues Asp91′, His93′, His104 and His107, which collectively define a new metal sensing site configuration in ArsR family regulators.
Shotgun metagenomics has been applied to the studies of the functionality of various microbial communities. As a critical analysis step in these studies, biological pathways are reconstructed based on the genes predicted from metagenomic shotgun sequences. Pathway reconstruction provides insights into the functionality of a microbial community and can be used for comparing multiple microbial communities. The utilization of pathway reconstruction, however, can be jeopardized because of imperfect functional annotation of genes, and ambiguity in the assignment of predicted enzymes to biochemical reactions (e.g., some enzymes are involved in multiple biochemical reactions). Considering that metabolic functions in a microbial community are carried out by many enzymes in a collaborative manner, we present a probabilistic sampling approach to profiling functional content in a metagenomic dataset, by sampling functions of catalytically promiscuous enzymes within the context of the entire metabolic network defined by the annotated metagenome. We test our approach on metagenomic datasets from environmental and human-associated microbial communities. The results show that our approach provides a more accurate representation of the metabolic activities encoded in a metagenome, and thus improves the comparative analysis of multiple microbial communities. In addition, our approach reports likelihood scores of putative reactions, which can be used to identify important reactions and metabolic pathways that reflect the environmental adaptation of the microbial communities. Source code for sampling metabolic networks is available online at http://omics.informatics.indiana.edu/mg/MetaNetSam/.
We present a probabilistic sampling approach to profiling metabolic reactions in a microbial community from metagenomic shotgun reads, in an attempt to understand the metabolism within a microbial community and compare it across multiple communities. Unlike conventional pathway reconstruction approaches that aim at a definitive set of reactions, our method estimates how likely each annotated reaction is to occur in the metabolism of the microbial community, given the shotgun sequencing data. This probabilistic measure improves our prediction of the actual metabolism in the microbial communities and can be used in the comparative functional analysis of metagenomic data.
The NIH Human Microbiome Project (HMP) has produced several hundred metagenomic data sets, allowing studies of the many functional elements in human-associated microbial communities. Here, we survey the distribution of oral spirochetes implicated in dental diseases in normal human individuals, using recombination sites associated with the chromosomal integron in Treponema genomes, taking advantage of the multiple copies of the integron recombination sites (repeats) in the genomes, and using a targeted assembly approach that we have developed. We find that integron-containing Treponema species are present in ∼80% of the normal human subjects included in the HMP. Further, we are able to de novo assemble the integron gene cassettes using our constrained assembly approach, which employs a unique application of the de Bruijn graph assembly information; most of these cassette genes were not assembled in whole-metagenome assemblies and could not be identified by mapping sequencing reads onto the known reference Treponema genomes due to the dynamic nature of integron gene cassettes. Our study significantly enriches the gene pool known to be carried by Treponema chromosomal integrons, totaling 826 (598 97% nonredundant) genes. We characterize the functions of these gene cassettes: many of these genes have unknown functions. The integron gene cassette arrays found in the human microbiome are extraordinarily dynamic, with different microbial communities sharing only a small number of common genes.
Integron systems are now recognized as important agents of bacterial evolution and are prevalent in most environments. Among the human pathogens known to harbor chromosomal integrons, the Treponema spirochetes are the only clade of spirochete species found to carry integrons. With the recent release of many new Treponema genomes, we were able to study the distribution of chromosomal integrons in this genus.
We find that the Treponema spirochetes implicated in human periodontal diseases and those isolated from cow and swine intestines contain chromosomal integrons, but not the Treponema species isolated from termite guts. By examining the species tree of selected spirochetes (based on 31 phylogenetic marker genes) and the phylogenetic tree of predicted integron integrases, and assisted by our analysis of predicted integron recombination sites, we found that all integron systems identified in Treponema spirochetes are likely to have evolved from a common ancestor—a horizontal gain into the clade. Subsequent to this event, the integron system was lost in the branch leading to the speciation of T. pallidum and T. phagedenis (the Treponema sps. implicated in sexually transmitted diseases). We also find that the lengths of the integron attC sites shortened through Treponema speciation, and that the integron gene cassettes of T. denticola are highly strain specific.
This is the first comprehensive study to characterize the chromosomal integron systems in Treponema species. By characterizing integron distribution and cassette contents in the Treponema sps., we link the integrons to the speciation of the various species, especially to the pathogens T. pallidum and T. phagedenis.
Chromosomal integron; Treponema species; Integron integrase; attC site
Motivation: One of the difficulties in metagenomic assembly is that homologous genes from evolutionarily closely related species may behave like repeats and confuse assemblers. As a result, small contigs, each representing a short gene fragment, instead of complete genes, may be reported by an assembler. This further complicates annotation of metagenomic datasets, as annotation tools (such as gene predictors or similarity search tools) typically perform poorly on contigs encoding short gene fragments.
Results: We present a novel way of using the de Bruijn graph assembly of metagenomes to improve the assembly of genes. A network matching algorithm is proposed for matching the de Bruijn graph of contigs against reference genes, to derive ‘gene paths’ in the graph (sequences of contigs containing gene fragments) that have the highest similarities to known genes, allowing gene fragments contained in multiple contigs to be connected to form more complete (or intact) genes. Tests on simulated and real datasets show that our approach (called GeneStitch) is able to significantly improve the assembly of genes from metagenomic sequences, by connecting contigs with the guidance of homologous genes—information that is orthogonal to the sequencing reads. We note that the improvement of gene assembly can be observed even when only distantly related genes are available as the reference. We further propose to use ‘gene graphs’ to represent the assembly of reads from homologous genes and discuss potential applications of gene graphs to improving functional annotation for metagenomics.
Availability: The tools are available as open source for download at http://omics.informatics.indiana.edu/GeneStitch
The goal of the Human Microbiome Project (HMP) is to generate a comprehensive catalog of human-associated microorganisms including reference genomes representing the most common species. Toward this goal, the HMP has characterized the microbial communities at 18 body habitats in a cohort of over 200 healthy volunteers using 16S rRNA gene (16S) sequencing and has generated nearly 1,000 reference genomes from human-associated microorganisms. To determine how well current reference genome collections capture the diversity observed among the healthy microbiome and to guide isolation and future sequencing of microbiome members, we compared the HMP’s 16S data sets to several reference 16S collections to create a ‘most wanted’ list of taxa for sequencing. Our analysis revealed that the diversity of commonly occurring taxa within the HMP cohort microbiome is relatively modest, few novel taxa are represented by these OTUs and many common taxa among HMP volunteers recur across different populations of healthy humans. Taken together, these results suggest that it should be possible to perform whole-genome sequencing on a large fraction of the human microbiome, including the ‘most wanted’, and that these sequences should serve to support microbiome studies across multiple cohorts. Also, in stark contrast to other taxa, the ‘most wanted’ organisms are poorly represented among culture collections suggesting that novel culture- and single-cell-based methods will be required to isolate these organisms for sequencing.
We explore the microbiota of 18 body sites in over 200 individuals using sequences amplified from the V1–V3 and V3–V5 small subunit ribosomal RNA (16S) hypervariable regions as part of the NIH Common Fund Human Microbiome Project. The body sites with the greatest number of core OTUs, defined as OTUs shared amongst 95% or more of the individuals, were the oral sites (saliva, tongue, cheek, gums, and throat) followed by the nose, stool, and skin, while the vaginal sites had the smallest number of OTUs shared across subjects. We found that commonalities between samples based on taxonomy could sometimes belie variability at the sub-genus OTU level. This was particularly apparent in the mouth where a given genus can be present in many different oral sites, but the sub-genus OTUs show very distinct site selection, and in the vaginal sites, which are consistently dominated by the Lactobacillus genus but have distinctly different sub-genus V1–V3 OTU populations across subjects. Different body sites show approximately a ten-fold difference in estimated microbial richness, with stool samples having the highest estimated richness, followed by the mouth, throat and gums, then by the skin, nasal and vaginal sites. Richness as measured by the V1–V3 primers was consistently higher than richness measured by V3–V5. We also show that when such a large cohort is analyzed at the genus level, most subjects fit the stool “enterotype” profile, but other subjects are intermediate, blurring the distinction between the enterotypes. When analyzed at the finer-scale OTU level, there was little or no segregation into stool enterotypes, but in the vagina distinct biotypes were apparent. Finally, we note that even OTUs present in nearly every subject, or that dominate in some samples, showed orders of magnitude variation in relative abundance, emphasizing the highly variable nature of the microbiota across individuals.
CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) loci, together with cas (CRISPR–associated) genes, form the CRISPR/Cas adaptive immune system, a primary defense strategy that eubacteria and archaea mobilize against foreign nucleic acids, including phages and conjugative plasmids. Short spacer sequences separated by the repeats are derived from foreign DNA and direct interference to future infections. The availability of hundreds of shotgun metagenomic datasets from the Human Microbiome Project (HMP) enables us to explore the distribution and diversity of known CRISPRs in human-associated microbial communities and to discover new CRISPRs. We propose a targeted assembly strategy to reconstruct CRISPR arrays, which whole-metagenome assemblies fail to identify. For each known CRISPR type (identified from reference genomes), we use its direct repeat consensus sequence to recruit reads from each HMP dataset and then assemble the recruited reads into CRISPR loci; the unique spacer sequences can then be extracted for analysis. We also identified novel CRISPRs or new CRISPR variants in contigs from whole-metagenome assemblies and used targeted assembly to more comprehensively identify these CRISPRs across samples. We observed that the distributions of CRISPRs (including 64 known and 86 novel ones) are largely body-site specific. We provide detailed analysis of several CRISPR loci, including novel CRISPRs. For example, known streptococcal CRISPRs were identified in most oral microbiomes, totaling ∼8,000 unique spacers: samples resampled from the same individual and oral site shared the most spacers; different oral sites from the same individual shared significantly fewer, while different individuals had almost no common spacers, indicating the impact of subtle niche differences on the evolution of CRISPR defenses. We further demonstrate potential applications of CRISPRs to the tracing of rare species and the virus exposure of individuals. 
This work indicates the importance of effective identification and characterization of CRISPR loci to the study of the dynamic ecology of microbiomes.
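The recruit-then-extract step of the targeted assembly strategy can be sketched in Python as follows. This is only an illustration under strong assumptions: it uses exact string matching against the repeat consensus (a real pipeline would presumably tolerate mismatches), it skips the assembly of recruited reads, and the function names, the length bounds `min_len` and `max_len`, and the example sequences are all hypothetical.

```python
def recruit_reads(reads, repeat):
    """Keep only the reads that contain an exact copy of the repeat consensus."""
    return [read for read in reads if repeat in read]


def extract_spacers(sequence, repeat, min_len=20, max_len=60):
    """Return the substrings lying between consecutive exact copies of the repeat."""
    positions = []
    start = sequence.find(repeat)
    while start != -1:
        positions.append(start)
        start = sequence.find(repeat, start + len(repeat))
    spacers = []
    for left, right in zip(positions, positions[1:]):
        spacer = sequence[left + len(repeat):right]
        if min_len <= len(spacer) <= max_len:   # drop implausibly short or long gaps
            spacers.append(spacer)
    return spacers


def unique_spacers(reads, repeat, min_len=20, max_len=60):
    """Collect the distinct spacers found across all recruited reads."""
    found = set()
    for read in recruit_reads(reads, repeat):
        found.update(extract_spacers(read, repeat, min_len, max_len))
    return found


# Illustrative data: a made-up 8-base repeat and two made-up 6-base spacers.
repeat = "GTTTTAGA"
array = repeat + "ACGTAC" + repeat + "TTGACA" + repeat
spacers = extract_spacers(array, repeat, min_len=4, max_len=10)
```

Comparing the sets returned by `unique_spacers` for two samples (for example with a set intersection) gives the kind of shared-spacer count discussed above.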
Human bodies are complex ecological systems in which various microbial organisms and viruses interact with each other and with the human host. The Human Microbiome Project (HMP) has resulted in >700 datasets of shotgun metagenomic sequences, from which we can learn about the compositions and functions of human-associated microbial communities. CRISPR/Cas systems are a widespread class of adaptive immune systems in bacteria and archaea, providing acquired immunity against foreign nucleic acids: CRISPR/Cas defense pathways involve integration of viral- or plasmid-derived DNA segments into CRISPR arrays (forming spacers between repeated structural sequences), and expression of short crRNAs from these single repeat-spacer units, to generate interference to future invading foreign genomes. Powered by an effective computational approach (the targeted assembly approach for CRISPR), our analysis of CRISPR arrays in the HMP datasets provides the very first global view of bacterial immunity systems in human-associated microbial communities. The great diversity of CRISPR spacers we observed among different body sites, in different individuals, and in single individuals over time, indicates the impact of subtle niche differences on the evolution of CRISPR defenses and indicates the key role of bacteriophage (and plasmids) in shaping human microbial communities.
16S rRNA gene profiling has recently been boosted by the development of pyrosequencing methods. A common analysis is to group pyrosequences into Operational Taxonomic Units (OTUs), such that reads in an OTU are likely sampled from the same species. However, species diversity estimated from error-prone 16S rRNA pyrosequences may be inflated because the reads sampled from the same 16S rRNA gene may appear different, and current OTU inference approaches typically involve time-consuming pairwise/multiple distance calculation and clustering. I propose a novel approach AbundantOTU based on a Consensus Alignment (CA) algorithm, which infers consensus sequences, each representing an OTU, taking advantage of the sequence redundancy for abundant species. Pyrosequencing reads can then be recruited to the consensus sequences to give quantitative information for the corresponding species. As tested on 16S rRNA pyrosequence datasets from mock communities with known species, AbundantOTU rapidly reported identified sequences of the source 16S rRNAs and the abundances of the corresponding species. AbundantOTU was also applied to 16S rRNA pyrosequence datasets derived from real microbial communities and the results are in general agreement with previous studies.
16S rRNA gene; pyrosequencing; Operational Taxonomic Unit (OTU); abundant species
Summary: With the wide application of next-generation sequencing (NGS) techniques, fast tools for protein similarity search that scale well to large query datasets and large databases are highly desirable. In a previous work, we developed RAPSearch, an algorithm that achieved a ~20–90-fold speedup relative to BLAST while still achieving similar levels of sensitivity for short protein fragments derived from NGS data. RAPSearch, however, requires a substantial memory footprint to identify alignment seeds, due to its use of a suffix array data structure. Here we present RAPSearch2, a new memory-efficient implementation of the RAPSearch algorithm that uses a collision-free hash table to index a similarity search database. The optimized data structure further speeds up the similarity search by another factor of 2–3. We also implemented multi-threading in RAPSearch2, and the multi-threaded modes achieve significant acceleration (e.g. 3.5X for 4-thread mode). RAPSearch2 requires up to 2 GB of memory when running in single-thread mode, or up to 3.5 GB of memory when running in 4-thread mode.
Availability and implementation: Implemented in C++, the source code is freely available for download at the RAPSearch2 website: http://omics.informatics.indiana.edu/mg/RAPSearch2/.
Supplementary information: Available at the RAPSearch2 website.
Next Generation Sequencing (NGS) is producing enormous corpuses of short DNA reads, affecting emerging fields like metagenomics. Protein similarity search--a key step to achieve annotation of protein-coding genes in these short reads, and identification of their biological functions--faces daunting challenges because of the very sizes of the short read datasets.
We developed a fast protein similarity search tool RAPSearch that utilizes a reduced amino acid alphabet and suffix array to detect seeds of flexible length. For short reads (translated in 6 frames) we tested, RAPSearch achieved ~20-90 times speedup as compared to BLASTX. RAPSearch missed only a small fraction (~1.3-3.2%) of BLASTX similarity hits, but it also discovered additional homologous proteins (~0.3-2.1%) that BLASTX missed. By contrast, BLAT, a tool that is even slightly faster than RAPSearch, had significant loss of sensitivity as compared to RAPSearch and BLAST.
RAPSearch is implemented as open-source software and is accessible at http://omics.informatics.indiana.edu/mg/RAPSearch. It enables faster protein similarity search. The application of RAPSearch in metagenomics has also been demonstrated.
short reads; similarity search; suffix array; reduced amino acid alphabet; metagenomics
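Seed detection in a reduced amino acid alphabet can be illustrated with the Python sketch below. It makes several simplifying assumptions that differ from the tool described above: the residue grouping in `GROUPS` is an illustrative choice and not RAPSearch's actual alphabet, seeds have a fixed length instead of a flexible one, and a plain dictionary stands in for the suffix array. All function names are hypothetical.

```python
from collections import defaultdict

# Illustrative grouping of the 20 amino acids by broad physicochemical similarity.
# This is an assumption for the sketch, not the alphabet used by RAPSearch.
GROUPS = ["AST", "C", "DE", "FY", "G", "H", "ILMV", "KR", "NQ", "P", "W"]
REDUCE = {aa: group[0] for group in GROUPS for aa in group}


def reduce_seq(protein):
    """Rewrite a protein in the reduced alphabet (unknown residues become 'X')."""
    return "".join(REDUCE.get(aa, "X") for aa in protein)


def build_seed_index(database, seed_len):
    """Index every reduced-alphabet word of length seed_len in the database."""
    index = defaultdict(list)
    for name, protein in database.items():
        reduced = reduce_seq(protein)
        for i in range(len(reduced) - seed_len + 1):
            index[reduced[i:i + seed_len]].append((name, i))
    return index


def find_seed_hits(query, index, seed_len):
    """Return (query_pos, subject_name, subject_pos) for each matching seed."""
    reduced = reduce_seq(query)
    hits = []
    for i in range(len(reduced) - seed_len + 1):
        for name, j in index.get(reduced[i:i + seed_len], []):
            hits.append((i, name, j))
    return hits


# Toy example: the two peptides share no identical residue,
# yet they match at every seed position in the reduced alphabet.
index = build_seed_index({"p1": "MKVLDE"}, seed_len=4)
hits = find_seed_hits("LRIVEE", index, seed_len=4)
```

In a full search each seed hit would then be extended and scored in the original 20-letter alphabet. That extension step is omitted here.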
Metagenomics is the study of microbial communities sampled directly from their natural environment, without prior culturing. Among the computational tools recently developed for metagenomic sequence analysis, binning tools attempt to classify the sequences in a metagenomic dataset into different bins (i.e., species), based on various DNA composition patterns (e.g., the tetramer frequencies) of various genomes. Composition-based binning methods, however, cannot be used to classify very short fragments, because of the substantial variation of DNA composition patterns within a single genome. We developed a novel approach (AbundanceBin) for metagenomics binning by utilizing the different abundances of species living in the same environment. AbundanceBin is an application of the Lander-Waterman model to metagenomics, which is based on the l-tuple content of the reads. AbundanceBin achieved accurate, unsupervised, clustering of metagenomic sequences into different bins, such that the reads classified in a bin belong to species of identical or very similar abundances in the sample. In addition, AbundanceBin gave accurate estimations of species abundances, as well as their genome sizes—two important parameters for characterizing a microbial community. We also show that AbundanceBin performed well when the sequence lengths are very short (e.g., 75 bp) or have sequencing errors. By combining AbundanceBin and a composition-based method (MetaCluster), we can achieve even higher binning accuracy. Supplementary Material is available at www.liebertonline.com/cmb.
binning; EM algorithm; metagenomics; Poisson distribution
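The abundance-based idea can be illustrated with a generic EM fit of a Poisson mixture, sketched in Python below. This is not the AbundanceBin implementation: it assumes the input is simply a list of integer l-tuple counts, that the number of bins and starting rates are supplied by the caller, and that each count is assigned to the bin with the highest posterior probability. Names such as `em_poisson_mixture` and the example counts are hypothetical.

```python
import math


def poisson_log_pmf(n, lam):
    """Log of the Poisson probability of observing count n at rate lam."""
    return n * math.log(lam) - lam - math.lgamma(n + 1)


def em_poisson_mixture(counts, init_lambdas, n_iter=50):
    """Fit a Poisson mixture by EM; return (weights, lambdas, bin per count)."""
    lambdas = list(init_lambdas)
    k = len(lambdas)
    weights = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each bin for each count,
        # computed in log space to avoid underflow.
        resp = []
        for n in counts:
            logs = [math.log(w) + poisson_log_pmf(n, lam)
                    for w, lam in zip(weights, lambdas)]
            top = max(logs)
            probs = [math.exp(x - top) for x in logs]
            total = sum(probs)
            resp.append([p / total for p in probs])
        # M-step: re-estimate the mixing weights and Poisson rates.
        for j in range(k):
            mass = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = max(mass / len(counts), 1e-12)
            lambdas[j] = max(sum(r[j] * n for r, n in zip(resp, counts)) / mass, 1e-12)
    bins = [max(range(k), key=lambda j: r[j]) for r in resp]
    return weights, lambdas, bins


# Made-up l-tuple counts from a low-abundance and a high-abundance species.
counts = [2, 3, 2, 4, 3, 2, 3, 3, 28, 30, 31, 29, 32, 30, 27, 33]
weights, lambdas, bins = em_poisson_mixture(counts, init_lambdas=[1.0, 20.0])
```

The fitted rates play the role of abundance estimates for the bins. Relating them to genome sizes requires additional modelling that this sketch does not attempt.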
The advances of next-generation sequencing technology have facilitated metagenomics research that attempts to determine directly the whole collection of genetic material within an environmental sample (i.e. the metagenome). Identification of genes directly from short reads has become an important yet challenging problem in annotating metagenomes, since the assembly of metagenomes is often not available. Gene predictors developed for whole genomes (e.g. Glimmer) and recently developed for metagenomic sequences (e.g. MetaGene) show a significant decrease in performance as the sequencing error rates increase, or as reads get shorter. We have developed a novel gene prediction method FragGeneScan, which combines sequencing error models and codon usages in a hidden Markov model to improve the prediction of protein-coding region in short reads. The performance of FragGeneScan was comparable to Glimmer and MetaGene for complete genomes. But for short reads, FragGeneScan consistently outperformed MetaGene (accuracy improved ∼62% for reads of 400 bases with 1% sequencing errors, and ∼18% for short reads of 100 bases that are error free). When applied to metagenomes, FragGeneScan recovered substantially more genes than MetaGene predicted (>90% of the genes identified by homology search), and many novel genes with no homologs in current protein sequence database.
Metagenomics is the study of microbial communities sampled directly from their natural environment, without prior culturing. By enabling an analysis of populations including many (so-far) unculturable and often unknown microbes, metagenomics is revolutionizing the field of microbiology, and has excited researchers in many disciplines that could benefit from the study of environmental microbes, including those in ecology, environmental sciences, and biomedicine. Specific computational and statistical tools have been developed for metagenomic data analysis and comparison. New studies, however, have revealed various kinds of artifacts present in metagenomics data caused by limitations in the experimental protocols and/or inadequate data analysis procedures, which often lead to incorrect conclusions about a microbial community. Here, we review some of the artifacts, such as overestimation of species diversity and incorrect estimation of gene family frequencies, and discuss emerging computational approaches to address them. We also review potential challenges that metagenomics may encounter with the extensive application of next-generation sequencing (NGS) techniques.
Metagenomics; next-generation sequencing (NGS); taxonomic/functional profiling; statistical approaches; comparative metagenomics
Metagenomics is an emerging methodology for the direct genomic analysis of a mixed community of uncultured microorganisms. The current analyses of metagenomics data largely rely on the computational tools originally designed for microbial genomics projects. The challenge of assembling metagenomic sequences arises mainly from the short reads and the high species complexity of the community. Alternatively, individual (short) reads will be searched directly against databases of known genes (or proteins) to identify homologous sequences. The latter approach may have low sensitivity and specificity in identifying homologous sequences, which may further bias the subsequent diversity analysis. In this paper, we present a novel approach to metagenomic data analysis, called Metagenomic ORFome Assembly (MetaORFA). The whole computational framework consists of three steps. Each read from a metagenomics project will first be annotated with putative open reading frames (ORFs) that likely encode proteins. Next, the predicted ORFs are assembled into a collection of peptides using an EULER assembly method. Finally, the assembled peptides (i.e., ORFome) are used for database searching of homologs and subsequent diversity analysis. We applied MetaORFA approach to several metagenomics datasets with low coverage short reads. The results show that MetaORFA can produce long peptides even when the sequence coverage of reads is extremely low. Hence, the ORFome assembly significantly increased the sensitivity of homology searching, and may potentially improve the diversity analysis of the metagenomic data. This improvement is especially useful for the metagenomic projects when the genome assembly does not work because of the low sequence coverage.
Metagenomics; ORFome; ORFome assembly; Function annotation
A common biological pathway reconstruction approach—as implemented by many automatic biological pathway services (such as the KAAS and RAST servers) and the functional annotation of metagenomic sequences—starts with the identification of protein functions or families (e.g., KO families for the KEGG database and the FIG families for the SEED database) in the query sequences, followed by a direct mapping of the identified protein families onto pathways. Given a predicted patchwork of individual biochemical steps, some metric must be applied in deciding what pathways actually exist in the genome or metagenome represented by the sequences. Commonly, and straightforwardly, a complete biological pathway can be identified in a dataset if at least one of the steps associated with the pathway is found. We report, however, that this naïve mapping approach leads to an inflated estimate of biological pathways, and thus overestimates the functional diversity of the sample from which the DNA sequences are derived. We developed a parsimony approach, called MinPath (Minimal set of Pathways), for biological pathway reconstructions using protein family predictions, which yields a more conservative, yet more faithful, estimation of the biological pathways for a query dataset. MinPath identified far fewer pathways for the genomes collected in the KEGG database—as compared to the naïve mapping approach—eliminating some obviously spurious pathway annotations. Results from applying MinPath to several metagenomes indicate that the common methods used for metagenome annotation may significantly overestimate the biological pathways encoded by microbial communities.
Even though there is only a single large biological network within any cell and all pathways are to some extent connected, the partition of the entire cellular network into smaller units (e.g., KEGG pathways) is extremely important for understanding biological processes. Biological pathway reconstruction, therefore, is essential for understanding the biological functions that a newly sequenced genome encodes and recently for studying the functionality of a natural environment via metagenomics. The common practice of pathway reconstruction in metagenomics first identifies functions encoded by the metagenomic sequences and then reconstructs pathways from the annotated functions by mapping the functions to reference pathways. To address the issues of both incomplete data (e.g., metagenomes, unlike individual genomes, are most likely incomplete) and pathway redundancy (e.g., the same function is involved in multiple pathway units), we formulate a parsimony version of the pathway reconstruction/inference problem, called MinPath (Minimal set of Pathways): given a set of reference pathways and a set of functions that can be mapped to one or more pathways, MinPath aims at finding a minimum number of pathways that can explain all functions. MinPath achieves a more conservative, yet more faithful, estimation of the biological pathways encoded by genomes and metagenomes.
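The parsimony formulation is an instance of minimum set cover, and the contrast with naive mapping can be sketched in Python. This illustration is not the MinPath code: it swaps in an exhaustive search over pathway subsets, which is only practical for very small inputs, and all function names, pathway names and function identifiers are made up for the example.

```python
from itertools import combinations


def naive_pathways(functions, pathway_members):
    """Naive mapping: report every pathway with at least one observed function."""
    return {p for p, members in pathway_members.items() if members & functions}


def minimal_pathways(functions, pathway_members):
    """Smallest set of pathways that explains every observed, mappable function.

    Exhaustive search by increasing subset size, so the first cover found
    is guaranteed to be of minimum size.
    """
    candidates = sorted(naive_pathways(functions, pathway_members))
    explainable = {f for f in functions
                   if any(f in pathway_members[p] for p in candidates)}
    for size in range(len(candidates) + 1):
        for subset in combinations(candidates, size):
            covered = set()
            for p in subset:
                covered |= pathway_members[p]
            if explainable <= covered:
                return set(subset)
    return set()


# Hypothetical reference pathways and observed functions.
pathway_members = {
    "P1": {"F1", "F2", "F3"},
    "P2": {"F2", "F3", "F4"},
    "P3": {"F5", "F6"},
    "P4": {"F3"},
}
observed = {"F1", "F2", "F3", "F5"}
naive = naive_pathways(observed, pathway_members)      # all four pathways
minimal = minimal_pathways(observed, pathway_members)  # only P1 and P3
```

Here the naive mapping reports P2 and P4 even though every function they contain is already explained by P1, which is the inflation effect described above.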
Protein structure analysis and comparison are major challenges in structural bioinformatics. Despite the existence of many tools and algorithms, very few of them have managed to capture the intuitive understanding of protein structures developed in structural biology, especially in the context of rapid database searches. Such intuitions could help speed up similarity searches and make it easier to understand the results of such analyses.
We developed a TOPS++FATCAT algorithm that uses an intuitive description of the proteins' structures as captured in the popular TOPS diagrams to limit the search space of the aligned fragment pairs (AFPs) in the flexible alignment of protein structures performed by the FATCAT algorithm. The TOPS++FATCAT algorithm is faster than FATCAT by more than an order of magnitude with a minimal cost in classification and alignment accuracy. For beta-rich proteins its accuracy is better than that of FATCAT, because the TOPS+ string models contain important information about the parallel and anti-parallel hydrogen-bond patterns between the beta-strand SSEs (Secondary Structural Elements). We show that the TOPS++FATCAT errors, rare as they are, can be clearly linked to oversimplifications of the TOPS diagrams and can be corrected by the development of more precise secondary structure element definitions.
The benchmark analysis results and the compressed archive of the TOPS++FATCAT program for Linux platform can be downloaded from the following web site:
TOPS++FATCAT provides FATCAT accuracy and insights into protein structural changes at a speed comparable to sequence alignments, opening up a possibility of interactive protein structure similarity searches.
Domain rearrangements in the innate immune network of amphioxus suggest that domain shuffling has shaped the evolution of immune systems.
Regulation in protein networks often utilizes specialized domains that 'join' (or 'connect') the network through specific protein-protein interactions. The innate immune system, which provides a first and, in many species, the only line of defense against microbial and viral pathogens, is regulated in this way. Amphioxus (Branchiostoma floridae), whose genome was recently sequenced, occupies a unique position in the evolution of innate immunity, having diverged within the chordate lineage prior to the emergence of the adaptive immune system in vertebrates.
The repertoire of several families of innate immunity proteins is expanded in amphioxus compared to both vertebrates and protostome invertebrates. Part of this expansion consists of genes encoding proteins with unusual domain architectures, which often contain both upstream receptor and downstream activator domains, suggesting a potential role for direct connections (shortcuts) that bypass usual signal transduction pathways.
Domain rearrangements can potentially alter the topology of protein-protein interaction (and regulatory) networks. The extent of such rearrangements in the innate immune network of amphioxus suggests that domain shuffling, which is an important mechanism in the evolution of multidomain proteins, has also shaped the development of immune systems.
A comparative genomics approach revealed that the genes for several components of the apoptosis network with single copies in vertebrates have multiple paralogs in cnidarian-bilaterian ancestors, suggesting a complex evolutionary history for this network.
Apoptosis, one of the main types of programmed cell death, is regulated and performed by a complex protein network. Studies in model organisms, mostly in the nematode Caenorhabditis elegans, identified a relatively simple apoptotic network consisting of only a few proteins. However, analysis of several recently sequenced invertebrate genomes, ranging from the cnidarian sea anemone Nematostella vectensis, representing one of the morphologically simplest metazoans, to the deuterostomes sea urchin and amphioxus, contradicts the current paradigm of a simple ancestral network that expanded in vertebrates.
Here we show that the apoptosome-forming CED-4/Apaf-1 protein, present in single copy in vertebrate, nematode, and insect genomes, had multiple paralogs in the cnidarian-bilaterian ancestor. Different members of this ancestral Apaf-1 family led to the extant proteins in nematodes/insects and in deuterostomes, explaining significant functional differences between proteins that until now were believed to be orthologous. Similarly, the evolution of the Bcl-2 and caspase protein families appears surprisingly complex and apparently included significant gene loss in nematodes and insects and expansions in deuterostomes.
The emerging picture of the evolution of the apoptosis network is one of a succession of lineage-specific expansions and losses, which combined with the limited number of 'apoptotic' protein families, resulted in apparent similarities between networks in different organisms that mask an underlying complex evolutionary history. Similar results are beginning to surface for other regulatory networks, contradicting the intuitive notion that regulatory networks evolved in a linear way, from simple to complex.
Protein structures are flexible, changing their shapes not only upon substrate binding, but also during evolution as a collective effect of mutations, deletions and insertions. A new generation of protein structure comparison algorithms allows for such flexibility; they go beyond identifying the largest common part between two proteins and find hinge regions and patterns of flexibility in protein families. Here we present a Flexible Structural Neighborhood (FSN), a database of structural neighbors of proteins deposited in PDB as seen by a flexible protein structure alignment program FATCAT, developed previously in our group. The database, searchable by a protein PDB code, provides lists of proteins with statistically significant structural similarity and on lower menu levels provides detailed alignments, interactive superposition of structures and positions of hinges that were identified in the comparison. While superficially similar to other structural protein alignment resources, FSN provides a unique resource to study not only protein structural similarity, but also how protein structures change. FSN is available from a server and by direct links from the PDB database.
The release of the 1000th complete microbial genome will occur in the next two to three years. In anticipation of this milestone, the Fellowship for Interpretation of Genomes (FIG) launched the Project to Annotate 1000 Genomes. The project is built around the principle that the key to improved accuracy in high-throughput annotation technology is to have experts annotate single subsystems over the complete collection of genomes, rather than having an annotation expert attempt to annotate all of the genes in a single genome. Using the subsystems approach, all of the genes implementing the subsystem are analyzed by an expert in that subsystem. An annotation environment was created where populated subsystems are curated and projected to new genomes. A portable notion of a populated subsystem was defined, and tools developed for exchanging and curating these objects. Tools were also developed to resolve conflicts between populated subsystems. The SEED is the first annotation environment that supports this model of annotation. Here, we describe the subsystem approach, and offer the first release of our growing library of populated subsystems. The initial release of data includes 180 177 distinct proteins with 2133 distinct functional roles. This data comes from 173 subsystems and 383 different organisms.
Defining the blocks that form the global protein structure on the basis of local structural regularity is a very fruitful idea, used extensively in describing protein structure and in predicting it from sequence information alone. For many years, secondary structure elements have been used as building blocks with great success. Specially prepared sets of possible structural motifs can be used to describe similarity between very distant, non-homologous proteins. The reason for using structural information in the description of proteins is straightforward: structural comparison can detect approximately twice as many distant relationships as sequence comparison at the same error rate.
Here we provide a new fragment library for Local Structure Segment (LSS) prediction called FRAGlib, which is integrated with a previously described segment alignment algorithm, SEA. A joined FRAGlib/SEA server provides easy access to both algorithms, offering a one-stop alignment service built on a novel approach to protein sequence alignment based on network matching. Used as a secondary structure predictor, FRAGlib achieves only 73% accuracy in the Q3 measure, but when combined with the SEA alignment it significantly improves pairwise sequence alignment quality compared to the previous SEA implementation and other public alignment algorithms. The FRAGlib algorithm takes ~2 min to search the FRAGlib database for a typical query protein of 500 residues. The SEA service aligns two typical proteins in ~5 min. All supplementary materials (detailed results of all the benchmarks, the list of test proteins and the whole fragment library) are available for download online at .
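For readers unfamiliar with the Q3 measure quoted above, the following minimal sketch shows how it is conventionally computed: the fraction of residues whose predicted three-state secondary structure (helix H, strand E, coil C) matches the observed state. The function name `q3_accuracy` and the example strings are hypothetical and are not taken from the FRAGlib benchmarks.

```python
def q3_accuracy(predicted, observed):
    """Fraction of residues with a matching three-state label (H, E or C)."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    if not predicted:
        raise ValueError("cannot score an empty sequence")
    # Count positions where the predicted state equals the observed state.
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(predicted)


# Hypothetical 10-residue example with mismatches at positions 4 and 10.
predicted = "HHHHCCEEEC"
observed = "HHHCCCEEEE"
q3 = q3_accuracy(predicted, observed)
print(f"Q3 = {q3:.0%}")
```

Q3 scores each residue independently, so it says nothing about whether whole segments are placed correctly. That is one reason a modest Q3 can still support good alignments when the predicted segments are used as input to an alignment step.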
The joined FRAGlib/SEA server will be a valuable tool both for molecular biologists working on protein sequence analysis and for bioinformaticians developing computational methods of structure prediction and alignment of proteins.
Library of protein motifs; Profile-profile sequence similarity (BLAST; FFAS); Fragments library (FRAGlib); Predicted Local Structure Segments (PLSSs); Segment Alignment (SEA); Network matching problem
Protein structure comparison, an important problem in structural biology, has two main applications: (i) comparing two protein structures in order to identify the similarities and differences between them, and (ii) searching for structures similar to a query structure. Many web-based resources for both applications are available, but all are based on rigid structural alignment algorithms. The FATCAT server implements the recently developed flexible protein structure comparison algorithm FATCAT, which automatically identifies hinges and internal rearrangements in two protein structures. The server provides access to two algorithms: FATCAT-pairwise for pairwise flexible structure comparison and FATCAT-search for database searching for structurally similar proteins. Given two protein structures [in the Protein Data Bank (PDB) format], FATCAT-pairwise reports their structural alignment and the corresponding statistical significance of the similarity measured as a P-value. Users can view the superposition of the structures online in web browsers that support the Chime plug-in, or download the superimposed structures in PDB format. In FATCAT-search, users provide one query structure and the server returns a list of protein structures that are similar to the query, ordered by their P-values. In addition, the FATCAT server can report the conformational changes of the query structure as compared to other proteins in the structure database. The FATCAT server is available at http://fatcat.burnham.org.