How DNA is organized in three dimensions inside the cell nucleus and how that affects the ways in which cells access, read and interpret genetic information are among the longest standing questions in cell biology. Using newly developed molecular, genomic, and computational approaches based on the chromosome conformation capture technology (such as 3C, 4C, 5C and Hi-C) the spatial organization of genomes is being explored at unprecedented resolution. Interpreting the increasingly large chromatin interaction datasets is now posing novel challenges. Here we describe several types of statistical and computational approaches that have recently been developed to analyze chromatin interaction data.
Chromosome conformation capture; chromatin looping; long-range gene regulation; chromatin domains; 3D modeling; polymer physics; genomics; integrative modeling; topology; fractal globule
We have determined the three-dimensional (3D) architecture of the Caulobacter crescentus genome by combining genome-wide chromatin interaction detection, live-cell imaging, and computational modeling. Using chromosome conformation capture carbon copy (5C) technology, we derive ~13 Kb resolution 3D models of the Caulobacter genome. These models illustrate that the genome is ellipsoidal with periodically arranged arms. The parS sites, a pair of short contiguous sequence elements involved in chromosome segregation, are positioned at one pole of this structure, where they nucleate a compact chromatin conformation. Both 5C and imaging experiments demonstrate that placing these sequence elements at new genomic positions yields large-scale rotations of the genome within the cell. Utilizing automated fluorescent imaging, we orient the genome within the cell and illustrate that within the resolution of our data the parS proximal region is the only portion of the genome stably attached to the cell envelope. Our approach provides an experimental paradigm for deriving insight into the cis-determinants of 3D genome architecture.
A remarkable feature of the self-renewing population of embryonic stem cells (ESCs) is their phenotypic heterogeneity: Nanog and other marker proteins of ESCs show large cell-to-cell variation in their expression level, which should significantly influence the differentiation process of individual cells. The molecular mechanism and biological implication of this heterogeneity, however, still remain elusive. We address this problem by constructing a model of the core gene-network of mouse ESCs. The model takes account of processes of binding/unbinding of transcription factors, formation/dissolution of transcription apparatus, and modification of histone code at each locus of genes in the network. These processes are hierarchically interrelated to each other forming the dynamical feedback loops. By simulating stochastic dynamics of this model, we show that the phenotypic heterogeneity of ESCs can be explained when the chromatin at the Nanog locus undergoes the large scale reorganization in formation/dissolution of transcription apparatus, which should have the timescale similar to the cell cycle period. With this slow transcriptional switching of Nanog, the simulated ESCs fluctuate among multiple transient states, which can trigger the differentiation into the lineage-specific cell states. From the simulated transitions among cell states, the epigenetic landscape underlying transitions is calculated. The slow Nanog switching gives rise to the wide basin of ESC states in the landscape. The bimodal Nanog distribution arising from the kinetic flow running through this ESC basin prevents transdifferentiation and promotes the definite decision of the cell fate. These results show that the distribution of timescales of the regulatory processes is decisively important to characterize the fluctuation of cells and their differentiation process. 
The analyses through the epigenetic landscape and the kinetic flow on the landscape should provide a guideline to engineer cell differentiation.
Embryonic stem cells (ESCs) can proliferate indefinitely by keeping pluripotency, i.e., the ability to differentiate into any cell-lineage. ESCs, therefore, have been the focus of intense biological and medical interests. A remarkable feature of ESCs is their phenotypic heterogeneity: ESCs show large cell-to-cell fluctuation in the expression level of Nanog, which is a key factor to maintain pluripotency. Since Nanog regulates many genes in ESCs, this fluctuation should seriously affect individual cells when they start differentiation. In this paper we analyze this phenotypic fluctuation by simulating the stochastic dynamics of gene network in ESCs. The model takes account of the mutually interrelated processes of gene regulation such as binding/unbinding of transcription factors, formation/dissolution of transcription apparatus, and histone-code modification. We show the distribution of timescales of these processes is decisively important to characterize the dynamical behavior of the gene network, and that the slow formation/dissolution of transcription apparatus at the Nanog locus explains the observed large fluctuation of ESCs. The epigenetic landscapes are calculated based on the stochastic simulation, and the role of the phenotypic fluctuation in the differentiation process is analyzed through the landscape picture.
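The role of a slow switching timescale can be illustrated with a much smaller system than the gene network described above. The following is a minimal sketch, not the authors' model: a single gene that flips between OFF and ON states and produces a protein only while ON, simulated with the standard Gillespie algorithm. The function name and all rate constants (`k_on`, `k_off`, `k_syn`, `k_deg`) are hypothetical and chosen only for illustration.

```python
import random


def simulate_two_state_gene(k_on, k_off, k_syn, k_deg, t_end, seed=0):
    """Gillespie simulation of a gene that switches between OFF and ON.

    Returns a list of (time, gene_on, protein_count) tuples.
    All rates are illustrative assumptions, not fitted values.
    """
    rng = random.Random(seed)
    t, gene_on, protein = 0.0, False, 0
    trajectory = [(t, gene_on, protein)]
    while True:
        # Propensities of the three possible reactions in the current state.
        a_switch = k_off if gene_on else k_on
        a_syn = k_syn if gene_on else 0.0
        a_deg = k_deg * protein
        a_total = a_switch + a_syn + a_deg
        if a_total == 0.0:
            break
        # Waiting time to the next reaction is exponentially distributed.
        t += rng.expovariate(a_total)
        if t > t_end:
            break
        # Pick the reaction that fires, in proportion to its propensity.
        r = rng.random() * a_total
        if r < a_switch:
            gene_on = not gene_on
        elif r < a_switch + a_syn:
            protein += 1
        else:
            protein -= 1
        trajectory.append((t, gene_on, protein))
    return trajectory
```

When `k_on` and `k_off` are small compared with `k_deg`, the protein level has time to relax to a low or a high value between switches, which is the kind of timescale separation the summary associates with large cell-to-cell fluctuation.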
Mycobacterium tuberculosis, the causative agent of tuberculosis (TB), infects an estimated two billion people worldwide and is the leading cause of mortality due to infectious disease. The development of new anti-TB therapeutics is required because of the emergence of multidrug-resistant strains as well as co-infection with other pathogens, especially HIV. Recently, the pharmaceutical company GlaxoSmithKline published the results of a high-throughput screen (HTS) of their two million compound library for anti-mycobacterial phenotypes. The screen revealed 776 compounds with significant activity against the M. tuberculosis H37Rv strain, including a subset of 177 prioritized compounds with high potency and low in vitro cytotoxicity. The next major challenge is the identification of the target proteins. Here, we use a computational approach that integrates historical bioassay data, chemical properties and structural comparisons of selected compounds to propose their potential targets in M. tuberculosis. We predicted 139 target-compound links, providing a necessary basis for further studies to characterize the mode of action of these compounds. The results from our analysis, including the predicted structural models, are available to the wider scientific community in an open source mode, to encourage further development of novel TB therapeutics.
Mycobacterium tuberculosis is a major worldwide pathogen infecting millions of individuals every year. Additionally, the number of antibiotic-resistant strains has dramatically increased over the last decades. To address this challenge, the pharmaceutical company GlaxoSmithKline has recently published the results of a large-scale high-throughput screen (HTS) that resulted in the release of 776 chemical compound structures active against tuberculosis. We have used this dataset of compounds as input to our computational approach that integrates historical bioassay data, chemical properties and structural comparisons. We propose 139 targets alongside their respective hit compounds and have made them open to the wider scientific community. Our hope is that the availability of the experimental data from GSK and our computational analysis will encourage further research providing validated therapeutic targets against this devastating disease.
Resistance to macrolide antibiotics is conferred by mutation of A2058 to G or methylation by Erm methyltransferases of the exocyclic N6 of A2058 (E. coli numbering) that forms the macrolide binding site in the 50S subunit of the ribosome. Ketolides such as telithromycin mitigate A2058G resistance yet remain susceptible to Erm-based resistance. Molecular details associated with macrolide resistance due to the A2058G mutation and methylation at N6 of A2058 by Erm methyltransferases were investigated using empirical force field-based simulations. To address the buried nature of the macrolide binding site, the number of waters within the pocket was allowed to fluctuate via the use of a Grand Canonical Monte Carlo (GCMC) methodology. The GCMC water insertion/deletion steps were alternated with Molecular Dynamics (MD) simulations to allow for relaxation of the entire system. From this GCMC/MD approach information on the interactions between telithromycin and the 50S ribosome was obtained. In the wild-type (WT) ribosome, the 2′-OH to A2058 N1 hydrogen bond samples short distances with a higher probability, while the effectiveness of telithromycin against the A2058G mutation is explained by a rearrangement of the hydrogen bonding pattern of the 2′-OH to 2058 that maintains the overall antibiotic-ribosome interactions. In both the WT and A2058G mutation there is significant flexibility in telithromycin's imidazole-pyridine side chain (ARM), indicating that entropic effects contribute to the binding affinity. Methylated ribosomes show lower sampling of short 2′-OH to 2058 distances and also demonstrate enhanced G2057-A2058 stacking leading to disrupted A752-U2609 Watson-Crick (WC) interactions as well as hydrogen bonding between telithromycin's ARM and U2609. This information will be of utility in the rational design of novel macrolide analogs with improved activity against methylated A2058 ribosomes.
Bacterial resistance to antibiotics is a serious public health problem that requires the continuous development of new antibiotics. Bacteria acquire resistance to macrolide antibiotics by (1) effluxing the drug from the cell, (2) modifying the drug, or (3) modifying the drug target (i.e., the 50S subunit of the ribosome) to abrogate or completely abolish binding. While newer antibiotics are able to avoid the first two mechanisms, they remain unable to overcome resistance due to ribosomal modification, particularly due to methyltransferase (i.e., erm) enzymes. We have applied computer-aided drug design methods designed explicitly for studies of the ribosome to better understand the relationship between modification of the ribosome by erms and the binding of telithromycin, a 3rd generation ketolide antibiotic derived from erythromycin. While we confirm that ribosomal modification leads to decreased binding due to disruption of key interactions with the drug, we find these modifications effect a structural rearrangement of the entire region of the ribosome responsible for binding macrolide antibiotics. This information will be useful in the design of novel antibiotics that are effective against resistant bacteria possessing modified ribosomes.
The binding of proteins can shield DNA from mutagenic processes but also interfere with efficient repair. How the presence of DNA-binding proteins shapes intra-genomic differences in mutability and, ultimately, sequence variation in natural populations, however, remains poorly understood. In this study, we examine sequence evolution in Escherichia coli in relation to the binding of four abundant nucleoid-associated proteins: Fis, H-NS, IhfA, and IhfB. We find that, for a subset of mutations, protein occupancy is associated with both increased and decreased mutability in the underlying sequence depending on when the protein is bound during the bacterial growth cycle. On average, protein-bound DNA exhibits reduced mutability compared to protein-free DNA. However, this net protective effect is weak and can be abolished or even reversed during stages of colony growth where binding coincides – and hence likely interferes with – DNA repair activity. We suggest that the four nucleoid-associated proteins analyzed here have played a minor but significant role in patterning extant sequence variation in E. coli.
Mutations can be more or less likely to occur depending on whether DNA is naked or bound by proteins. On the one hand, DNA-binding proteins can shield the DNA from certain mutagenic processes. On the other hand, the very same proteins can interfere with efficient DNA repair. In this study, we reconstruct the history of mutations across 54 E. coli genomes and ask whether mutation risk is higher or lower in regions occupied by proteins that help organize bacterial DNA into chromatin. Intriguingly, we find that the effect of binding depends on its timing. When we consider genomic regions bound during stationary phase, we observe that binding is associated with lower mutation risk for some mutation classes compared to naked DNA, albeit weakly. However, when binding occurs during exponential phase, bound regions actually experience more mutations on average. We argue that this is because, during exponential phase, the major effect of binding is that it interferes with efficient DNA repair, whereas in stationary phase – when many repair pathways are inactive – the protective effect of binding dominates. Our results suggest that the four DNA-binding proteins considered here have a small but significant growth phase-specific effect on mutation dynamics in E. coli.
The vast majority of membrane proteins are anchored to biological membranes through hydrophobic α-helices. Sequence analysis of high-resolution membrane protein structures shows that ionizable amino acid residues are present in transmembrane (TM) helices, often with a functional and/or structural role. Here, using as a scaffold the hydrophobic TM domain of the model membrane protein glycophorin A (GpA), we address the consequences of replacing specific residues with ionizable amino acids on TM helix insertion and packing, both in detergent micelles and in biological membranes. Our findings demonstrate that ionizable residues are stably inserted in hydrophobic environments, and tolerated in the dimerization process when oriented toward the lipid face, emphasizing the complexity of protein-lipid interactions in biological membranes.
PA-824 is a promising drug candidate for the treatment of tuberculosis (TB). It is in phase II clinical trials as part of the first newly designed regimen containing multiple novel antituberculosis drugs (PA-824 in combination with moxifloxacin and pyrazinamide). However, given that the genes involved in resistance against PA-824 are not fully conserved in the Mycobacterium tuberculosis complex (MTBC), this regimen might not be equally effective against different MTBC genotypes. To investigate this question, we sequenced two PA-824 resistance genes (fgd1 [Rv0407] and ddn [Rv3547]) in 65 MTBC strains representing major phylogenetic lineages. The MICs of representative strains were determined using the modified proportion method in the Bactec MGIT 960 system. Our analysis revealed single-nucleotide polymorphisms in both genes that were specific either for several genotypes or for individual strains, yet none of these mutations significantly affected the PA-824 MICs (≤0.25 μg/ml). These results were supported by in silico modeling of the mutations identified in Fgd1. In contrast, “Mycobacterium canettii” strains displayed a higher MIC of 8 μg/ml. In conclusion, we found a large genetic diversity in PA-824 resistance genes that did not lead to elevated PA-824 MICs. In contrast, M. canettii strains had MICs that were above the plasma concentrations of PA-824 documented so far in clinical trials. As M. canettii is also intrinsically resistant against pyrazinamide, new regimens containing PA-824 and pyrazinamide might not be effective in treating M. canettii infections. This finding has implications for the design of multiple ongoing clinical trials.
Assembly of the ribosome from its protein and RNA constituents has been studied extensively over the past 50 years, and experimental evidence suggests that prokaryotic ribosomal proteins undergo conformational changes during assembly. However, to date, no studies have attempted to elucidate these conformational changes. The present work utilizes computational methods to analyze protein dynamics and to investigate the linkage between dynamics and binding of these proteins during the assembly of the ribosome. Ribosomal proteins are known to be positively charged and we find the percentage of positive residues in r-proteins to be about twice that of the average protein: Lys+Arg is 18.7% for E. coli and 21.2% for T. thermophilus. Also, positive residues constitute a large proportion of RNA-contacting residues: 39% for E. coli and 46% for T. thermophilus. This affirms the known importance of charge-charge interactions in the assembly of the ribosome. We studied the dynamics of three primary proteins from E. coli and T. thermophilus 30S subunits that bind early in the assembly (S15, S17, and S20) with all-atom molecular dynamics simulations, followed by a study of all r-proteins using elastic network models. Molecular dynamics simulations show that solvent-exposed proteins (S15 and S17) tend to adopt more stable solution conformations than an RNA-embedded protein (S20). We also find that protein residues that contact the 16S rRNA are generally more mobile than the other residues. This is because a larger proportion of contacting residues is located in flexible loop regions. By the use of elastic network models, which are computationally more efficient, we show that this trend holds for most of the 30S r-proteins.
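The Lys+Arg percentages quoted above are simple composition statistics. As a minimal sketch, assuming a protein is given as a one-letter amino acid string (the function name is hypothetical and this is not the authors' code), the quantity can be computed as follows:

```python
def positive_residue_percent(sequence):
    """Percentage of Lys (K) and Arg (R) residues in a one-letter protein sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    # Count only lysine and arginine, matching the "Lys+Arg" definition in the text.
    positives = sum(1 for residue in seq if residue in "KR")
    return 100.0 * positives / len(seq)
```

Histidine is deliberately excluded here because the text defines the statistic as Lys+Arg.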
Ribosomes are complex cellular machines that synthesize new proteins in the cell. The accurate and efficient assembly of ribosomal proteins (r-proteins) and ribosomal RNA (rRNA) to form a functional ribosome is important for cell growth, metabolic reactions, and other cellular processes. Additionally, some antibacterial drugs are believed to target the bacterial ribosome during its construction. Hence, ribosomal assembly has been an active research topic for many years because understanding the assembly mechanisms can provide insight into protein/RNA recognition important in many other cellular processes, as well as aid the development of antibacterial therapeutics. Experimental studies have so far provided only a limited understanding of the assembly process. To further understand the assembly process, we have computationally studied the dynamic properties that r-proteins exhibit during assembly and the relationship between dynamics, physical properties, and binding propensity. We observe significant charged interactions between r-proteins and rRNA. We also detect a strong correlation between contact residues and their dynamic mobilities. Protein residues contacting rRNA are observed to be more mobile than other residues. We also relate the location of the r-protein in the fully assembled ribosome to its susceptibility to large conformational changes prior to binding.
Histone tails play an important role in nucleosome structure and dynamics. Here we investigate the effect of truncation of histone tails H3, H4, H2A and H2B on nucleosome structure with 100 ns all-atom molecular dynamics simulations. Tail domains of H3 and H2B show a propensity for α-helix formation during the intact nucleosome simulation. On truncation of H4 or H2B tails no structural change occurs in histones. However, H3 or H2A tail truncation results in structural alterations in the histone core domain, and in both cases the structural change occurs in the H2A α3 domain. We also find that the contacts between the histone H2A C-terminal docking domain and surrounding residues are destabilized upon H3 tail truncation. The relation between the present observations and corresponding experiments is discussed.
Histone tails are the most common sites of post-translational modifications. Tail modifications alter both inter- and intra-nucleosomal interactions to disrupt the condensed chromatin structure, thereby playing a crucial role in gene access. Here we investigated histone tail functions on the stability of a single nucleosome in atomic detail by selectively truncating tail domains in molecular dynamics simulations. Our study revealed that truncation of the H3 or H2A tail results in structural alterations in the nucleosome core whereas truncation of the H4 or H2B tail does not. A potential role of the H2A C-terminal tail in regulating nucleosome stability is discussed. Finally, formation of an α-helical domain was observed in one of the H3 tails and, upon truncation of this tail, structural changes occurred in closely lying histone domains. The correlation between tail truncation and structural changes likely sheds light on allosteric regulation of nucleosome stability.
Over the last decade, and especially after the advent of fluorescent in situ hybridization imaging and chromosome conformation capture methods, the availability of experimental data on genome three-dimensional organization has dramatically increased. We now have access to unprecedented details of how genomes organize within the interphase nucleus. Development of new computational approaches to leverage this data has already resulted in the first three-dimensional structures of genomic domains and genomes. Such approaches expand our knowledge of the chromatin folding principles, which has been classically studied using polymer physics and molecular simulations. Our outlook describes computational approaches for integrating experimental data with polymer physics, thereby bridging the resolution gap for structural determination of genomes and genomic domains.
The crystal structure of the TLR4-MD-2-LPS complex responsible for triggering powerful pro-inflammatory cytokine responses has recently become available. Central to cell surface complex formation is binding of lipopolysaccharide (LPS) to soluble MD-2. We have previously shown, in biologically based experiments, that a generation 3.5 PAMAM dendrimer with 64 peripheral carboxylic acid groups acts as an antagonist of pro-inflammatory cytokine production after surface modification with 8 glucosamine molecules. We have also shown using molecular modelling approaches that this partially glycosylated dendrimer has the flexibility, cluster density, surface electrostatic charge, and hydrophilicity to make it a therapeutically useful antagonist of complex formation. These studies enabled the computational study of the interactions of the unmodified dendrimer, glucosamine, and of the partially glycosylated dendrimer with TLR4 and MD-2 using molecular docking and molecular dynamics techniques. They demonstrate that dendrimer glucosamine forms co-operative electrostatic interactions with residues lining the entrance to MD-2's hydrophobic pocket. Crucially, dendrimer glucosamine interferes with the electrostatic binding of: (i) the 4′phosphate on the di-glucosamine of LPS to Ser118 on MD-2; (ii) LPS to Lys91 on MD-2; (iii) the subsequent binding of TLR4 to Tyr102 on MD-2. This is followed by additional co-operative interactions between several of the dendrimer glucosamine's carboxylic acid branches and MD-2. Collectively, these interactions block the entry of the lipid chains of LPS into MD-2's hydrophobic pocket, and also prevent TLR4-MD-2-LPS complex formation. Our studies have therefore defined the first nonlipid-based synthetic MD-2 antagonist using both animal model-based studies of pro-inflammatory cytokine responses and molecular modelling studies of a whole dendrimer with its target protein. 
Using this approach, it should now be possible to computationally design additional macromolecular dendrimer based antagonists for other Toll Like Receptors. They could be useful for treating a spectrum of infectious, inflammatory and malignant diseases.
Dendrimers are well-defined branched symmetrical macromolecules. In biologically based experiments, we have shown that a generation 3.5 PAMAM dendrimer whose surface was modified with 8 glucosamine molecules inhibited TLR4-mediated cytokine inflammation in both primary human cells and a clinically validated rabbit model of tissue scarring. Molecular dynamics simulations also showed that these molecules had the flexibility, surface electrostatic charge, and hydrophilicity to make them therapeutically useful antagonists. Central to the TLR4-MD-2-LPS cell surface complex is binding of lipopolysaccharide (LPS) to soluble MD-2. We now show that dendrimer glucosamine forms co-operative electrostatic interactions with residues lining the entrance to MD-2's hydrophobic pocket. These interactions block the entry of the lipid chains of LPS into MD-2's hydrophobic pocket and prevent complex formation. We have therefore defined the first nonlipid-based synthetic MD-2 antagonist using both animal model-based studies and molecular modelling studies of a whole dendrimer with its target protein. Using this approach, it should now be possible to computationally design additional macromolecular dendrimer-based antagonists for other Toll-like receptors. They could be useful for treating a spectrum of infectious, inflammatory and malignant diseases.
We developed a general approach that combines Chromosome Conformation Capture Carbon Copy with the Integrated Modeling Platform to generate high-resolution three-dimensional models of chromatin at the Mb scale. We applied this approach to the ENm008 domain on human chromosome 16 containing the α-globin locus, which is expressed in K562 cells and silenced in lymphoblastoid cells (GM12878). The models accurately reproduce the known looping interactions between the α-globin genes and their distal regulatory elements. Further, we find that the domain folds into a single globular conformation in GM12878 cells, whereas two globules are formed in K562 cells. The central cores of these globules are enriched for transcribed genes, whereas non-transcribed chromatin is more peripheral. We propose that globule formation represents a higher-order folding state related to clustering of transcribed genes around shared transcription machineries, as observed by microscopy.
Comparing the structures of proteins is crucial to gaining insight into protein evolution and function. Here, we align the sequences of multiple protein structures by a dynamic programming optimization of a scoring function that is a sum of an affine gap penalty and terms dependent on various sequence and structure features (SALIGN). The features include amino acid residue type, residue position, residue accessible surface area, residue secondary structure state and the conformation of a short segment centered on the residue. The multiple alignment is built by following the ‘guide’ tree constructed from the matrix of all pairwise protein alignment scores. Importantly, the method does not depend on the exact values of various parameters, such as feature weights and gap penalties, because the optimal alignment across a range of parameter values is found. Using multiple structure alignments in the HOMSTRAD database, SALIGN was benchmarked against MUSTANG for multiple alignments as well as against TM-align and CE for pairwise alignments. On average, SALIGN produces a 15% improvement in structural overlap over HOMSTRAD and 14% over MUSTANG, and yields more equivalent structural positions than TM-align and CE in 90% and 95% of cases, respectively. The utility of accurate multiple structure alignment is illustrated by its application to comparative protein structure modeling.
multiple structure alignment; dynamic programming; guide tree; RMSD; structure overlap
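Dynamic programming with an affine gap penalty, as mentioned in the abstract above, can be sketched with the classic three-matrix (Gotoh) recurrences. This is a generic illustration, not SALIGN itself: it uses only residue identity, whereas SALIGN's scoring function combines several sequence and structure features. The function name and the default scores (`match`, `mismatch`, `gap_open`, `gap_extend`) are assumptions made for the example.

```python
def affine_gap_score(a, b, match=1.0, mismatch=-1.0, gap_open=-2.0, gap_extend=-0.5):
    """Optimal global alignment score with affine gaps (Gotoh's recurrences).

    A gap of length k contributes gap_open + (k - 1) * gap_extend.
    Scores are illustrative; only residue identity is used.
    """
    neg_inf = float("-inf")
    n, m = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]; X: a[i-1] opposite a gap; Y: b[j-1] opposite a gap.
    M = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    X = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    Y = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Opening a gap costs gap_open; continuing the same gap costs gap_extend.
            X[i][j] = max(M[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)
    return max(M[n][m], X[n][m], Y[n][m])
```

Replacing the `match`/`mismatch` term with a weighted sum of feature dissimilarities would move this toy closer to the kind of scoring function the abstract describes, but the weights would have to come from the method itself.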
Motivation: Several strategies have been developed to predict the fold of a target protein sequence, most of which are based on aligning the target sequence to other sequences of known structure. Previously, we demonstrated that the consideration of protein–protein interactions significantly increases the accuracy of fold assignment compared with PSI-BLAST sequence comparisons. A drawback of our method was the low number of proteins to which a fold could be assigned. Here, we present an improved version of the method that addresses this limitation. We also compare our method to other state-of-the-art fold assignment methodologies.
Results: Our approach (ModLink+) has been tested on 3716 proteins with domain folds classified in the Structural Classification Of Proteins (SCOP) as well as known interacting partners in the Database of Interacting Proteins (DIP). For this test set, the ratio of success [positive predictive value (PPV)] on fold assignment increases from 75% for PSI-BLAST, 83% for HHSearch and 81% for PRC to >90% for ModLink+ at the e-value cutoff of 10⁻³. Under this e-value, ModLink+ can assign a fold to 30–45% of the proteins in the test set, while our previous method could cover <25%. When applied to 6384 proteins with unknown fold in the yeast proteome, ModLink+ combined with PSI-BLAST assigns a fold for domains in 3738 proteins, while PSI-BLAST alone covers only 2122 proteins, HHSearch 2969 and PRC 2826 proteins, using a threshold e-value that would represent a PPV >82% for each method in the test set.
Availability: The ModLink+ server is freely accessible on the World Wide Web at http://sbi.imim.es/modlink/.
Supplementary information: Supplementary data are available at Bioinformatics online.
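The two figures of merit used in the Results above, positive predictive value and coverage, are plain ratios. A minimal sketch follows, with hypothetical function names; it assumes that a "success" is a fold assignment that agrees with the reference classification.

```python
def ppv(true_positives, false_positives):
    """Positive predictive value: fraction of assigned folds that are correct."""
    assigned = true_positives + false_positives
    if assigned == 0:
        raise ValueError("no predictions were made")
    return true_positives / assigned


def coverage(assigned, total):
    """Fraction of proteins in a set that received any fold assignment."""
    if total <= 0:
        raise ValueError("total must be positive")
    return assigned / total
```

For instance, `coverage(3738, 6384)` with the yeast counts quoted above gives roughly 0.59, compared with roughly 0.33 for `coverage(2122, 6384)`.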
In recent years, the number of available RNA structures has grown rapidly, reflecting the increased interest in RNA biology. As with the studies carried out two decades ago for proteins, which laid the foundations for comparative protein structure prediction methods, we are now able to quantify the relationship between sequence and structure conservation in RNA.
Here we introduce an all-against-all sequence- and three-dimensional (3D) structure-based comparison of a representative set of RNA structures, which has allowed us to quantitatively confirm that: (i) there is a measurable relationship between sequence and structure conservation that weakens for alignments with below 60% sequence identity, (ii) evolution tends to conserve more RNA structure than sequence, and (iii) there is a twilight zone for RNA homology detection.
The computational analysis presented here quantitatively describes the relationship between sequence and structure for RNA molecules and defines a twilight zone region for detecting RNA homology. Our work could represent the theoretical basis and limitations for future developments in comparative RNA 3D structure prediction.
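The 60% threshold mentioned above refers to the sequence identity of an alignment. As a minimal sketch, assuming two already-aligned sequences of equal length with `-` as the gap character, identity can be computed as shown below. Normalizing by the number of gap-free columns is one common convention and is an assumption here; the study may normalize differently.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where both aligned sequences have a residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip columns that contain a gap in either sequence
        compared += 1
        if x == y:
            identical += 1
    if compared == 0:
        raise ValueError("no aligned residue pairs")
    return 100.0 * identical / compared
```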
Recent interest in non-coding RNA transcripts has resulted in a rapid increase of deposited RNA structures in the Protein Data Bank. However, a characterization and functional classification of the RNA structure and function space have only been partially addressed. Here, we introduce the SARA program for pair-wise alignment of RNA structures as a web server for structure-based RNA function assignment. The SARA server relies on the SARA program, which aligns two RNA structures based on a unit-vector root-mean-square approach. The likely accuracy of the SARA alignments is assessed by three different P-values estimating the statistical significance of the sequence, secondary structure and tertiary structure identity scores, respectively. Our benchmarks, which relied on a set of 419 RNA structures with known SCOR structural class, indicate that at a negative logarithm of mean P-value greater than or equal to 2.5, SARA can assign the correct or a similar SCOR class to 81.4% and 95.3% of the benchmark set, respectively. The SARA server is freely accessible via the World Wide Web at http://sgu.bioinfo.cipf.es/services/SARA/.
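The cutoff on the negative logarithm of the mean P-value can be sketched as below. Two details are assumptions, since the abstract does not state them: the logarithm is taken to be base 10, and the mean is taken to be the arithmetic mean of the three P-values. The function names are hypothetical and are not part of the SARA server.

```python
import math


def neg_log_mean_p(p_values):
    """Negative base-10 logarithm of the arithmetic mean of the given P-values."""
    if not p_values:
        raise ValueError("at least one P-value is required")
    if any(p <= 0.0 or p > 1.0 for p in p_values):
        raise ValueError("P-values must lie in (0, 1]")
    return -math.log10(sum(p_values) / len(p_values))


def passes_threshold(p_values, threshold=2.5):
    """True if the combined score reaches the cutoff (2.5 is taken from the text)."""
    return neg_log_mean_p(p_values) >= threshold
```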
Conventional patent-based drug development incentives work badly for the developing world, where commercial markets are usually small to non-existent. For this reason, the past decade has seen extensive experimentation with alternative R&D institutions ranging from private–public partnerships to development prizes. Despite extensive discussion, however, one of the most promising avenues—open source drug discovery—has remained elusive. We argue that the stumbling block has been the absence of a critical mass of preexisting work that volunteers can improve through a series of granular contributions. Historically, open source software collaborations have almost never succeeded without such “kernels”.
Here, we use a computational pipeline for: (i) comparative structure modeling of target proteins, (ii) predicting the localization of ligand binding sites on their surfaces, and (iii) assessing the similarity of the predicted ligands to known drugs. Our kernel currently contains 143 and 297 protein targets from ten pathogen genomes that are predicted to bind a known drug or a molecule similar to a known drug, respectively. The kernel provides a source of potential drug targets and drug candidates around which an online open source community can nucleate. Using NMR spectroscopy, we have experimentally tested our predictions for two of these targets, confirming one and invalidating the other.
The TDI kernel, which is being offered under the Creative Commons attribution share-alike license for free and unrestricted use, can be accessed on the World Wide Web at http://www.tropicaldisease.org. We hope that the kernel will facilitate collaborative efforts towards the discovery of new drugs against parasites that cause tropical diseases.
Open source drug discovery, a promising alternative avenue to conventional patent-based drug development, has so far remained elusive with few exceptions. A major stumbling block has been the absence of a critical mass of preexisting work that volunteers can improve through a series of granular contributions. This paper introduces the results from a newly assembled computational pipeline for identifying protein targets for drug discovery in ten organisms that cause tropical diseases. We have also experimentally tested two promising targets for their binding to commercially available drugs, validating one and invalidating the other. The resulting kernel provides a base of drug targets and lead candidates around which an open source community can nucleate. We invite readers to donate their judgment and in silico and in vitro experiments to develop these targets to the point where drug optimization can begin.
MODBASE (http://salilab.org/modbase) is a database of annotated comparative protein structure models. The models are calculated by MODPIPE, an automated modeling pipeline that relies primarily on MODELLER for fold assignment, sequence–structure alignment, model building and model assessment (http://salilab.org/modeller). MODBASE currently contains 5 152 695 reliable models for domains in 1 593 209 unique protein sequences; only models based on statistically significant alignments and/or models assessed to have the correct fold are included. MODBASE also allows users to calculate comparative models on demand, through an interface to the MODWEB modeling server (http://salilab.org/modweb). Other resources integrated with MODBASE include databases of multiple protein structure alignments (DBAli), structurally defined ligand binding sites (LIGBASE), predicted ligand binding sites (AnnoLyze), structurally defined binary domain interfaces (PIBASE) and annotated single nucleotide polymorphisms and somatic mutations found in human proteins (LS-SNP, LS-Mut). MODBASE models are also available through the Protein Model Portal (http://www.proteinmodelportal.org/).
A number of studies have used protein interaction data alone for protein function prediction. Here, we introduce a computational approach for annotation of enzymes, based on the observation that similar protein sequences are more likely to perform the same function if they share similar interacting partners.
The method has been tested against the PSI-BLAST program using a set of 3,890 protein sequences for which interaction data were available. For protein sequences that align with at least 40% sequence identity to a known enzyme, the specificity of our method in predicting the first three EC digits increased from 80% to 90% at 80% coverage when compared to PSI-BLAST.
Our method can also be applied to proteins for which homologous sequences with known interacting partners can be detected. Thus, it could increase by 10% the specificity of genome-wide enzyme predictions that are based on sequence matching by PSI-BLAST alone.
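The idea of combining sequence similarity with shared interaction partners can be illustrated with a small sketch. It is not the published method: the overlap measure (Jaccard index), the 0.2 overlap cutoff, the data layout and the function names are all assumptions made for illustration; only the 40% identity cutoff and the use of the first three EC digits come from the text above.

```python
def jaccard(a, b):
    """Jaccard index of two collections of partner identifiers."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def transfer_ec(query_partners, hits, min_identity=40.0, min_overlap=0.2, digits=3):
    """Transfer an EC number from a sequence hit that also shares partners.

    hits: iterable of (ec_number, percent_identity, partners) tuples,
    e.g. produced by a sequence search against annotated enzymes
    (hypothetical layout). Returns the first `digits` EC digits of the
    best supported hit, or None if no hit passes both filters.
    """
    best_key, best_ec = None, None
    for ec_number, identity, partners in hits:
        if identity < min_identity:
            continue  # too dissimilar in sequence
        overlap = jaccard(query_partners, partners)
        if overlap < min_overlap:
            continue  # sequence match not supported by shared partners
        key = (identity, overlap)
        if best_key is None or key > best_key:
            best_key, best_ec = key, ec_number
    if best_ec is None:
        return None
    return ".".join(best_ec.split(".")[:digits])
```

In this sketch a hit with higher sequence identity is discarded when it shares no interaction partners with the query, which is the filtering effect the approach relies on.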
So-called ‘evolutionary potentials’ for protein structure prediction are derived using a single experimental protein structure and all three-dimensional models of its homologous sequences.
We introduce a new type of knowledge-based potential for protein structure prediction, called 'evolutionary potentials', which are derived using a single experimental protein structure and all three-dimensional models of its homologous sequences. The new potentials have been benchmarked against other knowledge-based potentials, resulting in a significant increase in accuracy for model assessment. We propose that, in contrast to standard knowledge-based potentials, evolutionary potentials capture key determinants of thermodynamic stability and specific sequence constraints required for fast folding.
The characterization of protein interactions is essential for understanding biological systems. While genome-scale methods are available for identifying interacting proteins, they do not pinpoint the interacting motifs (e.g., a domain, sequence segments, a binding site, or a set of residues). Here, we develop and apply a method for delineating the interacting motifs of hub proteins (i.e., highly connected proteins). The method relies on the observation that proteins with common interaction partners tend to interact with these partners through a common interacting motif. The sole input for the method is a set of binary protein interactions; neither sequence nor structure information is needed. The approach is evaluated by comparing the inferred interacting motifs with domain families defined for 368 proteins in the Structural Classification of Proteins (SCOP). The positive predictive value of the method for detecting proteins with common SCOP families is 75% at a sensitivity of 10%. Most of the inferred interacting motifs were significantly associated with sequence patterns, which could be responsible for the common interactions. We find that yeast hubs with multiple interacting motifs are more likely to be essential than hubs with one or two interacting motifs, thus rationalizing the previously observed correlation between essentiality and the number of interacting partners of a protein. We also find that yeast hubs with multiple interacting motifs evolve more slowly than the average protein, in contrast to hubs with one or two interacting motifs. The proposed method will help us discover unknown interacting motifs and provide biological insights about protein hubs and their roles in interaction networks.
Recent advances in experimental methods have produced a deluge of protein–protein interaction data. However, these methods do not supply information on which specific protein regions are physically in contact during the interactions. Identifying these regions (interfaces) is fundamental for scientific disciplines that require detailed characterizations of protein interactions. In this work, we present a computational method that identifies groups of proteins with similar interfaces. This is achieved by relying on the observation that proteins with common interaction partners tend to interact through similar interfaces. The proposed method retrieves protein interactions from public data repositories and groups proteins that share a substantial number of interacting partners. Proteins within the same group are then labeled with the same “interacting motif” identifier (iMotif). The evaluation performed using known protein domains and structural binding sites suggests that the method is better suited for proteins with multiple interacting partners (hubs). Using yeast data, we show that the cellular essentiality of a gene correlates better with the number of interacting motifs than with the absolute number of interactions.
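The grouping step described above, in which proteins that share interaction partners receive a common label, can be sketched as follows. This is an illustration under stated assumptions rather than the published algorithm: the text does not specify how proteins are clustered, so connected components over a "shares at least `min_shared` partners" relation are used here as a stand-in, and the function name and the default cutoff are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def group_by_shared_partners(interactions, min_shared=2):
    """Group proteins that share at least `min_shared` interaction partners.

    interactions: iterable of (protein_a, protein_b) binary interactions,
    the only input the approach requires. Returns a sorted list of groups
    (each a sorted list of protein names); singletons are omitted.
    """
    # Build the partner set of every protein, ignoring self-interactions.
    partners = defaultdict(set)
    for a, b in interactions:
        if a != b:
            partners[a].add(b)
            partners[b].add(a)

    # Union-find over proteins; linked proteins end up in the same group.
    parent = {p: p for p in partners}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for p, q in combinations(sorted(partners), 2):
        if len(partners[p] & partners[q]) >= min_shared:
            parent[find(p)] = find(q)

    groups = defaultdict(set)
    for p in partners:
        groups[find(p)].add(p)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)
```

Each returned group would then be given one interacting-motif identifier. The pairwise comparison is quadratic in the number of proteins, which is acceptable for a sketch but would need indexing by partner for genome-scale networks.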