Related Articles

1.  Synthetic in vitro transcriptional oscillators 
A fundamental goal of synthetic biology is to understand design principles through engineering biochemical systems. Three in vitro synthetic transcriptional oscillators were constructed and analyzed: a two-node negative-feedback oscillator, an amplified negative-feedback oscillator, and a three-node ring oscillator. The in vitro oscillators are governed by similar design principles as previous theoretical studies and synthetic oscillators in vivo. Because of unintended reactions that arise even without the complexity of living cells, several challenges remain for predictive and robust oscillator performance.
Fundamental goals for synthetic biology are to understand the principles of biological circuitry from an engineering perspective and to establish engineering methods for creating biochemical circuitry to control molecular processes, both in vitro and in vivo (Benner and Sismour, 2005; Adrianantoandro et al, 2006). Here, we make use of a previously proposed class of in vitro biochemical systems, transcriptional circuits, that can be modularly wired into arbitrarily complex networks by changing the regulatory and coding sequence domains of DNA templates (Kim et al, 2006; Subsoontorn et al, 2011). Using design motifs for inhibitory and excitatory regulations, three different oscillator designs were constructed and characterized: a two-switch negative-feedback oscillator, loosely analogous to the p53–Mdm2-feedback loop (Bar-Or et al, 2000); the same oscillator augmented with a positive-feedback loop, loosely analogous to a synthetic relaxation oscillator (Atkinson et al, 2003); and a three-switch ring oscillator analogous to the repressilator (Elowitz and Leibler, 2000).
DNA and RNA hybridization reactions (Figure 1B) can be assembled to create either an inhibitable switch (Figure 1A, right and bottom) with a threshold set by the total concentration of its DNA activator strand (Figure 1C, bottom), or an activatable switch (Figure 1A, left and top) with a threshold set by its DNA inhibitor strand concentration (Figure 1C, top). This threshold mechanism is analogous to biological threshold mechanisms such as ‘inhibitor ultrasensitivity' (Ferrell, 1996) and ‘molecular titration' (Buchler and Louis, 2008). Using these design motifs, we constructed a two-switch negative-feedback oscillator (Figure 1A, inset): RNA activator rA1 activates the production of RNA inhibitor rI2 by modulating switch Sw21, while RNA inhibitor rI2, in turn, inhibits the production of RNA activator rA1 by modulating switch Sw12. A total of seven DNA strands are used, in addition to the two enzymes, bacteriophage T7 RNA polymerase and Escherichia coli ribonuclease H. The fact that such a negative-feedback loop can lead to temporal oscillations can be seen from a mathematical model of transcriptional networks. Experimental results showed qualitative agreement with predicted oscillator behavior from simple model simulations.
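The threshold behaviour of an inhibitable switch can be illustrated with an idealized titration calculation. The sketch below is not the authors' model. It assumes that the RNA inhibitor binds the DNA activator one-to-one and irreversibly, and that any activator left over binds the switch template. The function name and the concentrations are hypothetical and chosen only for illustration.

```python
def active_switch(template_total, activator_total, inhibitor_total):
    """Concentration of ON switch in an idealized strong-binding limit.

    Assumptions (not taken from the source): the inhibitor sequesters the
    activator stoichiometrically, and all remaining activator binds the
    template until the template is saturated.
    """
    free_activator = max(0.0, activator_total - inhibitor_total)
    return min(template_total, free_activator)


if __name__ == "__main__":
    # Hypothetical concentrations in arbitrary units.
    for inhibitor in (0.0, 50.0, 75.0, 100.0, 150.0):
        print(inhibitor, active_switch(50.0, 100.0, inhibitor))
```

In this toy picture the switch stays fully ON until the inhibitor has used up the excess activator and is fully OFF once the inhibitor matches the total activator, so the total activator concentration sets where the transition happens.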
The fully optimized system revealed five complete oscillation cycles with a nearly 50% amplitude swing (Figure 3A) until, after ∼20 h, the production rate could no longer be sustained in the batch reaction. Gel measurements verified oscillations in RNA concentrations and switch states (Figure 3B and C). However, to our surprise, rather than oscillating with constant amplitude and constant mean, the RNA inhibitor concentration builds up after each cycle. An extended mathematical model that incorporated an interference reaction from ‘waste’ product (Figure 3B and C) could qualitatively capture this behavior.
Using a new autoregulatory switch Sw11, we added a positive-feedback loop to the two-node oscillator to make an amplified negative feedback oscillator (Design II, Figure 1D). Further, we replaced the excitatory connection of Sw21 by a chain of two inhibitory connections, Sw23 and Sw31, to construct a three-switch ring oscillator (Design III, Figure 1D). All three oscillator designs could be tuned to reach the oscillatory regime in parameter space.
Reassuringly, our in vitro oscillators exhibit several design principles previously observed in vivo. (1) Introducing delay in a simple negative-feedback loop can help achieve stable oscillation (Novák and Tyson, 2008; Stricker et al, 2008). (2) The addition of a positive-feedback self-loop to a negative-feedback oscillator provides access to rich dynamics and improved tunability (Tsai et al, 2008). (3) Oscillations in biochemical ring oscillators (such as the repressilator) are sensitive to parameter asymmetry among individual components (Tuttle et al, 2005). (4) The saturation of degradation machinery and the management of waste products could play an important role.
However, several significant difficulties remain for predictive and robust oscillator performance: the limited lifetime of closed batch reactions, interference from waste products, and asymmetry of switch components make quantitative modeling and prediction difficult. As a complementary approach to the top-down view of systems biology, cell-free in vitro systems offer a valuable training ground to create and explore increasingly interesting and powerful information-based chemical systems (Simpson, 2006). In vitro oscillators could be used to orchestrate other chemical processes such as DNA nanomachines (Dittmer and Simmel, 2004) and to provide embedded controllers within prototype artificial cells (Noireaux and Libchaber, 2004; Griffiths and Tawfik, 2006).
The construction of synthetic biochemical circuits from simple components illuminates how complex behaviors can arise in chemistry and builds a foundation for future biological technologies. A simplified analog of genetic regulatory networks, in vitro transcriptional circuits, provides a modular platform for the systematic construction of arbitrary circuits and requires only two essential enzymes, bacteriophage T7 RNA polymerase and Escherichia coli ribonuclease H, to produce and degrade RNA signals. In this study, we design and experimentally demonstrate three transcriptional oscillators in vitro. First, a negative feedback oscillator comprising two switches, regulated by excitatory and inhibitory RNA signals, showed up to five complete cycles. To demonstrate modularity and to explore the design space further, a positive-feedback loop was added that modulates and extends the oscillatory regime. Finally, a three-switch ring oscillator was constructed and analyzed. Mathematical modeling guided the design process, identified experimental conditions likely to yield oscillations, and explained the system's robust response to interference by short degradation products. Synthetic transcriptional oscillators could prove valuable for systematic exploration of biochemical circuit design principles and for controlling nanoscale devices and orchestrating processes within artificial cells.
PMCID: PMC3063688  PMID: 21283141
cell free; in vitro; oscillation; synthetic biology; transcriptional circuits
2.  ModeRNA: a tool for comparative modeling of RNA 3D structure 
Nucleic Acids Research  2011;39(10):4007-4022.
RNA is a large group of functionally important biomacromolecules. In striking analogy to proteins, the function of RNA depends on its structure and dynamics, which in turn is encoded in the linear sequence. However, while there are numerous methods for computational prediction of protein three-dimensional (3D) structure from sequence, with comparative modeling being the most reliable approach, there are very few such methods for RNA. Here, we present ModeRNA, a software tool for comparative modeling of RNA 3D structures. As an input, ModeRNA requires a 3D structure of a template RNA molecule, and a sequence alignment between the target to be modeled and the template. It must be emphasized that a good alignment is required for successful modeling, and for large and complex RNA molecules the development of a good alignment usually requires manual adjustments of the input data based on previous expertise of the respective RNA family. ModeRNA can model post-transcriptional modifications, a functionally important feature analogous to post-translational modifications in proteins. ModeRNA can also model DNA structures or use them as templates. It is equipped with many functions for merging fragments of different nucleic acid structures into a single model and analyzing their geometry. Windows and UNIX implementations of ModeRNA with comprehensive documentation and a tutorial are freely available.
PMCID: PMC3105415  PMID: 21300639
3.  RNA Nanotechnology: Engineering, Assembly and Applications in Detection, Gene Delivery and Therapy 
Biological macromolecules including DNA, RNA, and proteins, have intrinsic features that make them potential building blocks for the bottom-up fabrication of nanodevices. RNA is unique in nanoscale fabrication due to its amazing diversity of function and structure. RNA molecules can be designed and manipulated with a level of simplicity characteristic of DNA while possessing versatility in structure and function similar to that of proteins. RNA molecules typically contain a large variety of single stranded loops suitable for inter- and intra-molecular interaction. These loops can serve as mounting dovetails obviating the need for external linking dowels in fabrication and assembly.
The self-assembly of nanoparticles from RNA involves cooperative interaction of individual RNA molecules that spontaneously assemble in a predefined manner to form a larger two- or three-dimensional structure. Within the realm of self-assembly there are two main categories, namely template and non-template. Template assembly involves interaction of RNA molecules under the influence of specific external sequence, forces, or spatial constraints such as RNA transcription, hybridization, replication, annealing, molding, or replicas. In contrast, non-template assembly involves formation of a larger structure by individual components without the influence of external forces. Examples of non-template assembly are ligation, chemical conjugation, covalent linkage, and loop/loop interaction of RNA, especially the formation of RNA multimeric complexes. The best characterized RNA multimer and the first to be described in RNA nanotechnological application is the motor pRNA of bacteriophage phi29, which forms dimers, trimers, and hexamers via hand-in-hand interaction. phi29 pRNA can be redesigned to form a variety of structures and shapes including twins, tetramers, rods, triangles, and 3D arrays several microns in size via interaction of programmed helical regions and loops. 3D RNA array formation requires a defined nucleotide number for twisting and a palindromic sequence. Such arrays are unusually stable and resistant to a wide range of temperatures, salt concentrations, and pH. Both the therapeutic siRNA or ribozyme and a receptor-binding RNA aptamer or other ligands have been engineered into individual pRNAs. Individual chimeric RNA building blocks harboring siRNA or other therapeutic molecules have been fabricated subsequently into a trimer through hand-in-hand interaction of the engineered right and left interlocking RNA loops.
The incubation of these particles containing the receptor-binding aptamer or other ligands results in the binding and co-entry of trivalent therapeutic particles into cells. Such particles were subsequently shown to modulate the apoptosis of cancer cells in both cell cultures and animal trials. The use of such antigen-free 20–40 nm particles holds promise for the repeated long-term treatment of chronic diseases. Other potentially useful RNA molecules that form multimers include HIV RNA, which contains a kissing loop to form dimers, tecto-RNA that forms a “jigsaw puzzle,” and the Drosophila bicoid mRNA that forms multimers via “hand-by-arm” interactions.
Applications of RNA molecules involving replication, molding, embossing, and other related techniques, have recently been described that allow the utilization of a variety of materials to enhance diversity and resolution of nanomaterials. It should eventually be possible to adapt RNA to facilitate construction of ordered, patterned, or pre-programmed arrays or superstructures. Given the potential for 3D fabrication, the chance to produce reversible self-assembly, and the ability of self-repair, editing and replication, RNA self-assembly will play an increasingly significant role in integrated biological nanofabrication. A random 100-nucleotide RNA library may exist in 1.6 × 10^60 varieties with multifarious structure to serve as a vital system for efficient fabrication, with a complexity and diversity far exceeding that of any current nanoscale system.
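The size of the sequence space quoted above follows from having four possible bases (A, C, G, U) at each of 100 positions, that is, 4^100 sequences. The short check below makes that arithmetic explicit; the function name is hypothetical.

```python
def count_sequences(length, alphabet_size=4):
    """Number of distinct sequences of the given length over the alphabet."""
    return alphabet_size ** length


if __name__ == "__main__":
    n = count_sequences(100)
    # Python integers are exact, so n is the full 61-digit value.
    print(f"{n:.1e}")  # prints 1.6e+60
```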
This review covers the basic concepts of RNA structure and function, certain methods for the study of RNA structure, the approaches for engineering or fabricating RNA into nanoparticles or arrays, and special features of RNA molecules that form multimers. The most recent development in exploration of RNA nanoparticles for pathogen detection, drug/gene delivery, and therapeutic application is also introduced in this review.
PMCID: PMC2842999  PMID: 16430131
RNA; Nanotechnology; Self-Assembly; RNA Application; phi29 pRNA
4.  Extension of a genetic network model by iterative experimentation and mathematical analysis 
Molecular Systems Biology  2005;1:2005.0013.
We extend the current model of the plant circadian clock, in order to accommodate new and published data. Throughout our model development we use a global parameter search to ensure that any limitations we find are due to the network architecture and not to our selection of the parameter values, which have not been determined experimentally. Our final model includes two interlocked loops of gene regulation and is reminiscent of the circuit structures previously identified by experiments on insect and fungal clocks. It is the first Arabidopsis clock model to show such good correspondence to experimental data. Our interlocked feedback loop model predicts the regulation of two unknown components. Experiments motivated by these predictions identify the GIGANTEA gene as a strong candidate for one component, with an unexpected pattern of light regulation.*
This study involves an iterative approach of mathematical modelling and experiment to develop an accurate mathematical model of the circadian clock in the higher plant Arabidopsis thaliana. Our approach is central to systems biology and should lead to a greater, quantitative understanding of the circadian clock, as well as being more widely relevant to research into genetic networks.
The day–night cycle caused by the Earth's rotation affects most organisms, and has resulted in the evolution of the circadian clock. The circadian clock controls 24-h rhythms in processes from metabolism to behaviour; in higher eukaryotes, the circadian clock controls the rhythmic expression of 5–10% of genes. In plants, the clock controls leaf and petal movements, the opening and closing of stomatal pores, the discharge of floral fragrances and many metabolic activities, especially those associated with photosynthesis.
The relatively small number of components involved in the central circadian network makes it an ideal candidate for mathematical modelling of complex biological regulation. Genetic studies in a variety of model organisms have shown that the circadian rhythm is generated by a central network of between 6 and 12 genes. These genes form feedback loops generating a rhythm in mRNA production. One negative feedback loop in which a gene encodes a protein that, after several hours, turns off transcription is, in principle, capable of creating a circadian rhythm. However, real circadian clocks have proven to be more complicated than this, with interlocked feedback loops. Networks of this complexity are more easily understood through mathematical modelling.
The clock mechanism in the model plant, A. thaliana, was first proposed to comprise a feedback loop in which two partially redundant genes, LATE ELONGATED HYPOCOTYL (LHY) and CIRCADIAN CLOCK ASSOCIATED 1 (CCA1), repress the expression of their activator, TIMING OF CAB EXPRESSION 1 (TOC1). We previously modelled this preliminary network and showed that it was not capable of recreating several important pieces of experimental data (Locke et al, 2005). Here, we extend the LHY/CCA1–TOC1 network in new mathematical models. To check the effects of each addition to the network, the outputs of the extended models are compared to published data and to new experiments.
As is the case for most biological networks, the parameter values in our model, such as the translation rate of TOC1 protein, are unknown. We employ here an optimisation method, which works well with noisy and varied data and allows a global search of parameter space. This should ensure that the limitations we find in our networks are due to the network structure, and not to our parameter choices.
Our final interlocked feedback loop model requires two hypothetical components, genes X and Y (Figure 4), but is the first Arabidopsis clock model to exhibit such a good correspondence with experimental data. The model simulates a residual short-period oscillation in the cca1;lhy mutant, as characterised by our experiments. No single-loop model is able to do this. Our model also matches experimental data under constant light (LL) conditions and correctly senses photoperiod. The model predicts an interlocked feedback loop structure similar to that seen in the circadian clock mechanisms of other organisms.
The interlocked feedback loop model predicts a distinctive pattern of Y mRNA accumulation in the wild type (WT) and in the cca1;lhy double mutant, with Y mRNA levels increasing transiently at dawn. We designed an experiment to identify Y based on this prediction. GIGANTEA (GI) mRNA levels fit very well to our predicted profile for Y (Figure 6), identifying GI as a strong candidate for Y.
The approach described here could act as a template for experimental biologists seeking to extend models of small genetic networks. Our results illustrate the usefulness of mathematical modelling in guiding experiments, even if the models are based on limited data. Our method provides a way of identifying suitable candidate networks and quantifying how these networks better describe a wide variety of experimental measurements. The characteristics of new putative genes are thereby obtained, facilitating the experimental search for new components. To facilitate future experimental design, we provide user-friendly software that is specifically designed for numerical simulation of circadian experiments using models for several species (Brown, 2004b).
*Footnote: Synopsis highlights were added on 5 July 2005.
Circadian clocks involve feedback loops that generate rhythmic expression of key genes. Molecular genetic studies in the higher plant Arabidopsis thaliana have revealed a complex clock network. The first part of the network to be identified, a transcriptional feedback loop comprising TIMING OF CAB EXPRESSION 1 (TOC1), LATE ELONGATED HYPOCOTYL (LHY) and CIRCADIAN CLOCK ASSOCIATED 1 (CCA1), fails to account for significant experimental data. We develop an extended model that is based upon a wider range of data and accurately predicts additional experimental results. The model comprises interlocking feedback loops comparable to those identified experimentally in other circadian systems. We propose that each loop receives input signals from light, and that each loop includes a hypothetical component that had not been explicitly identified. Analysis of the model predicted the properties of these components, including an acute light induction at dawn that is rapidly repressed by LHY and CCA1. We found this unexpected regulation in RNA levels of the evening-expressed gene GIGANTEA (GI), supporting our proposed network and making GI a strong candidate for this component.
PMCID: PMC1681447  PMID: 16729048
biological rhythms; gene network; mathematical modelling; parameter estimation
5.  A multiple-template approach to protein threading 
Proteins  2011;79(6):1930-1939.
Most threading methods predict the structure of a protein using only a single template. Due to the increasing number of solved structures, a protein without a solved structure is very likely to have more than one similar template structure. Therefore, a natural question to ask is whether we can improve modeling accuracy using multiple templates. This paper describes a new multiple-template threading method to answer this question. At the heart of this multiple-template threading method is a novel probabilistic-consistency algorithm that can accurately align a single protein sequence simultaneously to multiple templates. Experimental results indicate that our multiple-template method can improve pairwise sequence-template alignment accuracy and generate models with better quality than single-template models even if they are built from the best single templates (P-value < 10^-6), while many popular multiple sequence/structure alignment tools fail to do so. The underlying reason is that our probabilistic-consistency algorithm can generate accurate multiple sequence/template alignments. In other words, without an accurate multiple sequence/template alignment, the modeling accuracy cannot be improved by simply using multiple templates to increase alignment coverage. Blindly tested on the CASP9 targets with more than one good template structure, our method outperforms all other CASP9 servers except two (Zhang-Server and QUARK of the same group). Our probabilistic-consistency algorithm can possibly be extended to align multiple protein/RNA sequences and structures.
PMCID: PMC3092796  PMID: 21465564
protein modeling; multiple-template threading; probabilistic alignment matrix; probabilistic-consistency algorithm; multiple sequence/template alignment
After decades of research, protein structure prediction remains a very challenging problem. In order to address the different levels of complexity of structural modeling, two types of modeling techniques, template-based modeling and template-free modeling, have been developed. Template-based modeling can often generate a moderate- to high-resolution model when a similar, homologous template structure is found for a query protein but fails if no template or only incorrect templates are found. Template-free modeling, such as fragment-based assembly, may generate models of moderate resolution for small proteins of low topological complexity. Seldom have the two techniques been integrated together to improve protein modeling. Here we develop a recursive protein modeling approach to selectively and collaboratively apply template-based and template-free modeling methods to model template-covered (i.e. certain) and template-free (i.e. uncertain) regions of a protein. A preliminary implementation of the approach was tested on a number of hard modeling cases during the 9th Critical Assessment of Techniques for Protein Structure Prediction (CASP9) and successfully improved the quality of modeling in most of these cases. Recursive modeling can significantly reduce the complexity of protein structure modeling and integrate template-based and template-free modeling to improve the quality and efficiency of protein structure prediction.
PMCID: PMC3622867  PMID: 22809379
Protein structure prediction; recursive protein modeling; template-free modeling; template-based modeling; CASP
7.  A comprehensive assessment of sequence-based and template-based methods for protein contact prediction 
Bioinformatics (Oxford, England)  2008;24(7):924-931.
Pair-wise residue-residue contacts in proteins can be predicted from both threading templates and sequence-based machine learning. However, most structure modeling approaches only use the template-based contact predictions in guiding the simulations; this is partly because the sequence-based contact predictions are usually considered to be less accurate than that by threading. With the rapid progress in sequence databases and machine-learning techniques, it is necessary to have a detailed and comprehensive assessment of the contact-prediction methods in different template conditions.
We develop two methods for protein-contact predictions: SVM-SEQ is a sequence-based machine learning approach which trains a variety of sequence-derived features on contact maps; SVM-LOMETS collects consensus contact predictions from multiple threading templates. We test both methods on the same set of 554 proteins which are categorized into ‘Easy’, ‘Medium’, ‘Hard’ and ‘Very Hard’ targets based on the evolutionary and structural distance between templates and targets. For the Easy and Medium targets, SVM-LOMETS obviously outperforms SVM-SEQ; but for the Hard and Very Hard targets, the accuracy of the SVM-SEQ predictions is higher than that of SVM-LOMETS by 12–25%. If we combine the SVM-SEQ and SVM-LOMETS predictions together, the total number of correctly predicted contacts in the Hard proteins will increase by more than 60% (or 70% for the long-range contact with a sequence separation ≥24), compared with SVM-LOMETS alone. The advantage of SVM-SEQ is also shown in the CASP7 free modeling targets where the SVM-SEQ is around four times more accurate than SVM-LOMETS in the long-range contact prediction. These data demonstrate that the state-of-the-art sequence-based contact prediction has reached a level which may be helpful in assisting tertiary structure modeling for the targets which do not have close structure templates. The maximum yield should be obtained by the combination of both sequence- and template-based predictions.
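To make the comparison above concrete, the sketch below shows one common way to score contact predictions: keep only long-range pairs (sequence separation ≥ 24, as in the text) and compute the fraction of predicted contacts that occur in the native structure. This is an assumed scoring scheme, not necessarily the exact protocol of the paper, and the residue pairs are made-up toy data. The union of the two prediction sets is used only as a naive stand-in for combining the methods.

```python
def long_range(pairs, min_separation=24):
    """Keep residue pairs (i, j) whose sequence separation is >= min_separation."""
    return {(i, j) for i, j in pairs if abs(i - j) >= min_separation}


def accuracy(predicted, native):
    """Fraction of predicted contacts that are present in the native contact set."""
    if not predicted:
        return 0.0
    return len(predicted & native) / len(predicted)


# Hypothetical toy data: native contacts and two sets of predictions.
native = {(1, 30), (5, 40), (10, 12), (20, 60)}
seq_pred = {(1, 30), (5, 40), (7, 50), (10, 12)}   # sequence-based (illustrative)
tmpl_pred = {(20, 60), (3, 33)}                    # template-based (illustrative)

if __name__ == "__main__":
    native_lr = long_range(native)
    print(accuracy(long_range(seq_pred), native_lr))
    print(accuracy(long_range(tmpl_pred), native_lr))
    # Naive combination: correct long-range contacts found by either method.
    print(len(long_range(seq_pred | tmpl_pred) & native_lr))
```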
PMCID: PMC2648832  PMID: 18296462
8.  (PS)2-v2: template-based protein structure prediction server 
BMC Bioinformatics  2009;10:366.
Template selection and target-template alignment are critical steps for template-based modeling (TBM) methods. Identifying a template in the twilight zone of 15–25% sequence similarity between targets and templates is still difficult for template-based protein structure prediction. This study presents the (PS)2-v2 server, based on our original server with numerous enhancements and modifications, to improve reliability and applicability.
To detect homologous proteins with remote similarity, the (PS)2-v2 server utilizes the S2A2 matrix, which is a 60 × 60 substitution matrix using the secondary structure propensities of 20 amino acids, and the position-specific sequence profile (PSSM) generated by PSI-BLAST. In addition, our server uses multiple templates and multiple models to build and assess models. Our method was evaluated on the Lindahl benchmark for fold recognition and the ProSup benchmark for sequence alignment. Evaluation results indicated that our method outperforms sequence-profile approaches, and had comparable performance to that of structure-based methods on these benchmarks. Finally, we tested our method using the 154 TBM targets of the CASP8 (Critical Assessment of Techniques for Protein Structure Prediction) dataset. Experimental results show that (PS)2-v2 is ranked 6th among 72 servers and is faster than the five top-ranked servers, which utilize ab initio methods.
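A 60 × 60 matrix is consistent with pairing each of the 20 amino acids with one of three secondary-structure states (20 × 3 = 60). The sketch below shows how such a combined state could be mapped to a row or column index. The three-state alphabet, the ordering of states, and the function name are assumptions made for illustration; the abstract does not specify the actual layout of the S2A2 matrix.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # 20 standard residues, one-letter codes
SS_STATES = "HEC"                      # assumed states: helix, strand, coil


def state_index(amino_acid, secondary_structure):
    """Map an (amino acid, secondary structure) pair to an index in 0..59.

    The block ordering (all helix states first, then strand, then coil)
    is a hypothetical convention, not the documented S2A2 layout.
    """
    ss = SS_STATES.index(secondary_structure)
    aa = AMINO_ACIDS.index(amino_acid)
    return ss * len(AMINO_ACIDS) + aa


if __name__ == "__main__":
    print(state_index("A", "H"))  # first state
    print(state_index("Y", "C"))  # last state
```

A lookup into the matrix would then be `matrix[state_index(a1, s1)][state_index(a2, s2)]` for a target residue and a template residue.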
Experimental results demonstrate that (PS)2-v2 with the S2A2 matrix is useful for template selections and target-template alignments by blending the amino acid and structural propensities. The multiple-template and multiple-model strategies are able to significantly improve the accuracies for target-template alignments in the twilight zone. We believe that this server is useful in structure prediction and modeling, especially in detecting homologous templates with sequence similarity in the twilight zone.
PMCID: PMC2775752  PMID: 19878598
9.  Inferring Noncoding RNA Families and Classes by Means of Genome-Scale Structure-Based Clustering 
PLoS Computational Biology  2007;3(4):e65.
The RFAM database defines families of ncRNAs by means of sequence similarities that are sufficient to establish homology. In some cases, such as microRNAs and box H/ACA snoRNAs, functional commonalities define classes of RNAs that are characterized by structural similarities, and typically consist of multiple RNA families. Recent advances in high-throughput transcriptomics and comparative genomics have produced very large sets of putative noncoding RNAs and regulatory RNA signals. For many of them, evidence for stabilizing selection acting on their secondary structures has been derived, and at least approximate models of their structures have been computed. The overwhelming majority of these hypothetical RNAs cannot be assigned to established families or classes. We present here a structure-based clustering approach that is capable of extracting putative RNA classes from genome-wide surveys for structured RNAs. The LocARNA (local alignment of RNA) tool implements a novel variant of the Sankoff algorithm that is sufficiently fast to deal with several thousand candidate sequences. The method is also robust against false positive predictions, i.e., a contamination of the input data with unstructured or nonconserved sequences. We have successfully tested the LocARNA-based clustering approach on the sequences of the RFAM-seed alignments. Furthermore, we have applied it to a previously published set of 3,332 predicted structured elements in the Ciona intestinalis genome (Missal K, Rose D, Stadler PF (2005) Noncoding RNAs in Ciona intestinalis. Bioinformatics 21 (Supplement 2): i77–i78). In addition to recovering, e.g., tRNAs as a structure-based class, the method identifies several RNA families, including microRNA and snoRNA candidates, and suggests several novel classes of ncRNAs for which to date no representative has been experimentally characterized.
Author Summary
For a long time, it was believed that the control of processes in living organisms is performed almost exclusively by proteins. Only recently, scientists learned that a further class of molecules, namely special RNAs, plays an important role in cell control. In consequence, research on such RNAs has enjoyed increasing attention over the last few years. These RNAs were called noncoding RNAs (ncRNA), because, unlike most other RNAs, these molecules do not code for proteins. Due to recent research successes, one can predict a lot of potential new ncRNAs by comparing the genomes of related organisms. Technically, comparing such RNAs is challenging and computationally expensive, since related ncRNAs often show only weak similarity on the sequence level, but share similar structures. In the paper, we present the new method LocARNA for fast and accurate comparison of RNAs with respect to their sequence and structure. Using this method, we define a distance measure between pairs of ncRNAs based on sequence and structure. This is then used for combining RNAs into a cluster for identifying groups of similar RNAs in large unorganized sets of RNA. The final aim of such a comparison is to identify new classes of ncRNAs. We applied our clustering procedure to a previously published set of 3,332 predicted ncRNAs in the C. intestinalis genome. In addition to rediscovering known classes of RNAs, e.g., tRNAs, the method predicts microRNA candidates, and suggests several novel, experimentally uncharacterized classes of ncRNAs. For verification, we clustered about 4,000 RNAs of RFAM, which is a large database that contains RNAs with an already known classification into families. Our results show good performance of the presented structure-based clustering approach.
PMCID: PMC1851984  PMID: 17432929
10.  Quantitative analysis of regulatory flexibility under changing environmental conditions 
Day length changes with the seasons in temperate latitudes, affecting the many biological rhythms that entrain to the day/night cycle: we measure these effects on the expression of Arabidopsis clock genes, using RNA and reporter gene readouts, with a new method of phase analysis. Dusk sensitivity is proposed as a simple, natural and general mathematical measure to analyse and manipulate the changing phase of a clock output relative to the change in the day/night cycle. Dusk sensitivity shows how increasing the numbers of feedback loops in the Arabidopsis clock models allows more flexible regulation, consistent with a previously-proposed, general operating principle of biological networks. The Arabidopsis clock genes show flexibility of regulation that is characteristic of a three-loop clock model, validating aspects of the model and the operating principle, but some clock output genes show greater flexibility arising from direct light regulation.
The analysis of dynamic, non-linear regulation with the aid of mechanistic models is central to Systems Biology. This study compares the predictions of mechanistic, mathematical models of the circadian clock with molecular time-series data on rhythmic gene expression in the higher plant Arabidopsis thaliana. Analysis of the models helps us to understand (explain and predict) how the clock gene circuit balances regulation by external and endogenous factors to achieve particular behaviours. Such multi-factorial regulation is ubiquitous in, and characteristic of, living systems.
The Earth's rotation causes predictable changes in the environment, notably in the availability of sunlight for photosynthesis. Many biological processes are driven by the environmental input via sensory pathways, for example, from photoreceptors. Circadian clocks provide an alternative strategy. These endogenous, 24-h rhythms can drive biological processes that anticipate the regular environmental changes, rather than merely responding. Many rhythmic processes have both light and clock control. Indeed, the clock components themselves must balance internal timing with external inputs, because circadian clocks are reset daily through light regulation of one or more clock components. This process of entrainment is complicated by the change in day length. When the times of dawn and dusk move apart in summer, and closer together in winter, does the clock track dawn, track dusk or interpolate between them?
In plants, the clock controls leaf and petal movements, the opening and closing of stomatal pores, the discharge of floral fragrances, and many metabolic activities, especially those associated with photosynthesis. Centuries of physiological studies have shown that these rhythms can behave differently. Flowering in Ipomoea nil (Pharbitis nil, Japanese morning glory) is controlled by a rhythm that tracks the time of dusk, to give a classic example. We showed that two other rhythms associated with vegetative growth track dawn in this species (Figure 5A), so the clock system allows flexible regulation.
The relatively small number of components involved in the circadian clockwork makes it an ideal candidate for mathematical modelling. Molecular genetic studies in a variety of model eukaryotes have shown that the circadian rhythm is generated by a network of 6–20 genes. These genes form feedback loops generating a rhythm in mRNA production. A single negative feedback loop in which a gene encodes a protein that, after several hours, turns off transcription is capable of generating a circadian rhythm, in principle. A single light input can entrain the clock to ‘local time', synchronised with a light–dark cycle. However, real circadian clocks have proven to be more complicated than this, with multiple light inputs and interlocked feedback loops.
We have previously argued from mathematical analysis that multi-loop networks increase the flexibility of regulation (Rand et al, 2004) and have shown that appropriately deployed flexibility can confer functional robustness (Akman et al, 2010). Here we test whether that flexibility can be demonstrated in vivo, in the model plant, A. thaliana. The Arabidopsis clock mechanism comprises a feedback loop in which two partially redundant, myb transcription factors, LATE ELONGATED HYPOCOTYL (LHY) and CIRCADIAN CLOCK ASSOCIATED 1 (CCA1), repress the expression of their activator, TIMING OF CAB EXPRESSION 1 (TOC1). We previously modelled this single-loop circuit and showed that it was not capable of recreating important data (Locke et al, 2005a). An extended, two-loop model was developed to match observed behaviours, incorporating a hypothetical gene Y, for which the best identified candidate was the GIGANTEA gene (GI) (Locke et al, 2005b). Two further models incorporated the TOC1 homologues PSEUDO-RESPONSE REGULATOR (PRR) 9 and PRR7 (Locke et al, 2006; Zeilinger et al, 2006). In these circuits, a morning oscillator (LHY/CCA1–PRR9/7) is coupled to an evening oscillator (Y/GI–TOC1) via the original LHY/CCA1–TOC1 loop.
These clock models, like those for all other organisms, were developed using data from simple conditions of constant light, darkness or 12-h light–12-h dark cycles. We therefore tested how the clock genes in Arabidopsis responded to light–dark cycles with different photoperiods, from 3 h light to 18 h light per 24-h cycle (Edinburgh, 56° North latitude, has 17.5 h light in midsummer). The time-series assays of mRNA and in vivo reporter gene images showed a range of peak times for different genes, depending on the photoperiod (Figure 5C). A new data analysis method, mFourfit, was introduced to measure the peak times, in the Biological Rhythms Analysis Software Suite (BRASS v3.0). None of the genes showed the dusk-tracking behaviour characteristic of the Ipomoea flowering rhythm. The one-, two- and three-loop models were analysed to understand the observed patterns. A new mathematical measure, dusk sensitivity, was introduced to measure the change in timing of a model component versus a change in the time of dusk. The one- and two-loop models tracked dawn and dusk, respectively, under all conditions. Only the three-loop model (Figure 5B) had the flexibility required to match the photoperiod-dependent changes that we found in vivo, and in particular the unexpected, V-shaped pattern in the peak time of TOC1 expression. This pattern of regulation depends on the structure and light inputs to the model's evening oscillator, so the in vivo data supported this aspect of the model. LHY and CCA1 gene expression under short photoperiods showed greater dusk sensitivity, in the interval 2–6 h before dawn, than the three-loop model predicted, so these data will help to constrain future models.
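The idea of measuring the change in timing of a component against a change in the time of dusk can be illustrated with a finite-difference sketch. This is only an approximation under stated assumptions, not the published definition: dawn is taken as time 0 so that the dusk time equals the photoperiod, peak times are given in hours after dawn, circular wrap-around of phase is ignored, and the function name and the toy peak times are hypothetical.

```python
def dusk_sensitivity(photoperiods_h, peak_times_h):
    """Finite-difference slope of peak time against dusk time.

    Assumes dawn at time 0, so dusk time equals the photoperiod (hours).
    Returns one slope per pair of consecutive photoperiods.
    """
    pairs = sorted(zip(photoperiods_h, peak_times_h))
    slopes = []
    for (dusk0, peak0), (dusk1, peak1) in zip(pairs, pairs[1:]):
        slopes.append((peak1 - peak0) / (dusk1 - dusk0))
    return slopes


# Illustrative (made-up) peak times, in hours after dawn.
photoperiods = [6.0, 9.0, 12.0, 15.0, 18.0]
dawn_tracker = [2.0] * len(photoperiods)        # peak stays fixed after dawn
dusk_tracker = [d + 1.0 for d in photoperiods]  # peak stays 1 h after dusk

dawn_slopes = dusk_sensitivity(photoperiods, dawn_tracker)
dusk_slopes = dusk_sensitivity(photoperiods, dusk_tracker)
```

Under these assumptions a rhythm that tracks dawn gives slopes of 0 and one that tracks dusk gives slopes of 1; intermediate or photoperiod-dependent values would correspond to the more flexible regulation described above.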
The approach described here could act as a template for experimental biologists seeking to understand biological regulation using dynamic, experimental perturbations and time-series data. Simulation of mathematical models (despite known imperfections) can provide contrasting hypotheses that guide understanding. The system's detailed behaviour is complex, so a natural and general measure such as dusk sensitivity is helpful to focus on one property of the system. We used the measure to compare models, and to predict how this property could be manipulated. To enable additional analysis of this system, we provide the time-series data and experimental metadata online.
The circadian clock controls 24-h rhythms in many biological processes, allowing appropriate timing of biological rhythms relative to dawn and dusk. Known clock circuits include multiple, interlocked feedback loops. Theory suggested that multiple loops contribute the flexibility for molecular rhythms to track multiple phases of the external cycle. Clear dawn- and dusk-tracking rhythms illustrate the flexibility of timing in Ipomoea nil. Molecular clock components in Arabidopsis thaliana showed complex, photoperiod-dependent regulation, which was analysed by comparison with three contrasting models. A simple, quantitative measure, Dusk Sensitivity, was introduced to compare the behaviour of clock models with varying loop complexity. Evening-expressed clock genes showed photoperiod-dependent dusk sensitivity, as predicted by the three-loop model, whereas the one- and two-loop models tracked dawn and dusk, respectively. Output genes for starch degradation achieved dusk-tracking expression through light regulation, rather than a dusk-tracking rhythm. Model analysis predicted which biochemical processes could be manipulated to extend dusk tracking. Our results reveal how an operating principle of biological regulators applies specifically to the plant circadian clock.
PMCID: PMC3010117  PMID: 21045818
Arabidopsis thaliana; biological clocks; dynamical systems; gene regulatory networks; mathematical models; photoperiodism
11.  MULTICOM: a multi-level combination approach to protein structure prediction and its assessments in CASP8 
Bioinformatics  2010;26(7):882-888.
Motivation: Protein structure prediction is one of the most important problems in structural bioinformatics. Here we describe MULTICOM, a multi-level combination approach to improve the various steps in protein structure prediction. In contrast to those methods which look for the best templates, alignments and models, our approach tries to combine complementary and alternative templates, alignments and models to achieve on average better accuracy.
Results: The multi-level combination approach was implemented via five automated protein structure prediction servers and one human predictor which participated in the eighth Critical Assessment of Techniques for Protein Structure Prediction (CASP8), 2008. The MULTICOM servers and human predictor were consistently ranked among the top predictors on the CASP8 benchmark. The methods can predict moderate- to high-resolution models for most template-based targets and low-resolution models for some template-free targets. The results show that the multi-level combination of complementary templates, alternative alignments and similar models aided by model quality assessment can systematically improve both template-based and template-free protein modeling.
Availability: The MULTICOM server is freely available at
PMCID: PMC2844995  PMID: 20150411
12.  Sequence signatures and mRNA concentration can explain two-thirds of protein abundance variation in a human cell line 
We provide a large-scale dataset on absolute protein and matching mRNA concentrations from the human medulloblastoma cell line Daoy. The correlation between mRNA and protein concentrations is significant and positive (Rs=0.46, R2=0.29, P-value<2e16), although non-linear. Out of ∼200 tested sequence features, sequence length, frequency and properties of amino acids, as well as translation initiation-related features are the strongest individual correlates of protein abundance when accounting for variation in mRNA concentration. When integrating mRNA expression data and all sequence features into a non-parametric regression model (Multivariate Adaptive Regression Splines), we were able to explain up to 67% of the variation in protein concentrations. Half of the contributions were attributed to mRNA concentrations, the other half to sequence features relating to regulation of translation and protein degradation. The sequence features are primarily linked to the coding and 3′ untranslated region. To our knowledge, this is the most comprehensive predictive model of human protein concentrations achieved so far.
mRNA decay, translation regulation and protein degradation are essential parts of eukaryotic gene expression regulation (Hieronymus and Silver, 2004; Mata et al, 2005), which enable the dynamics of cellular systems and their responses to external and internal stimuli without having to rely exclusively on transcription regulation. The importance of these processes is emphasized by the generally low correlation between mRNA and protein concentrations. For many prokaryotic and eukaryotic organisms, <50% of the variation in protein abundance is explained by variation in mRNA concentrations (de Sousa Abreu et al, 2009).
Given the plethora of regulatory mechanisms involved, most studies have focused so far on individual regulators and specific targets. Particularly in human, we currently lack system-wide, quantitative analyses that evaluate the relative contribution of regulatory elements encoded in the mRNA and protein sequence. Existing studies have been carried out only in bacteria and yeast (Nie et al, 2006; Brockmann et al, 2007; Tuller et al, 2007; Wu et al, 2008). Here, we present the first comprehensive analysis on the impact of translation and protein degradation on protein abundance variation in a human cell line. For this purpose, we experimentally measured absolute protein and mRNA concentrations in the Daoy medulloblastoma cell line, using shotgun proteomics and microarrays, respectively (Figure 1). These data comprise one of the largest such sets available today for human. We focused on sequence features that likely impact protein translation and protein degradation, including length, nucleotide composition, structure of the untranslated regions (UTRs), coding sequence, composition of the translation initiation site, presence of upstream open reading frames, putative target sites of miRNAs, codon usage, amino-acid composition and protein degradation signals.
Three types of tests have been conducted: (a) we examined partial Spearman's rank correlation of numerical features (e.g. length) with protein concentration, accounting for variation in mRNA concentrations; (b) for numerical and categorical features (e.g. function), we compared two extreme populations with Welch's t-test and (c) using a Multivariate Adaptive Regression Splines model, we analyzed the combined contributions of mRNA expression and sequence features to protein abundance variation (Figure 1). To account for the non-linearity of many relationships, we use non-parametric approaches throughout the analysis.
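Test (a) can be sketched as a first-order partial Spearman correlation: rank all three variables, compute the pairwise rank correlations, and remove the part shared with mRNA concentration. The code below is a minimal pure-Python illustration and not the authors' pipeline; the function names and the five-point toy vectors are hypothetical, and the formula is undefined when either variable is perfectly rank-correlated with the control variable.

```python
from math import sqrt


def ranks(values):
    """Ranks starting at 1, with tied values given their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


def partial_spearman(feature, protein, mrna):
    """Rank correlation of feature and protein, controlling for mrna."""
    r_fp = spearman(feature, protein)
    r_fm = spearman(feature, mrna)
    r_pm = spearman(protein, mrna)
    return (r_fp - r_fm * r_pm) / sqrt((1 - r_fm ** 2) * (1 - r_pm ** 2))


# Toy values only; not data from the study.
feature = [1, 2, 3, 4, 5]
protein = [2, 1, 4, 3, 5]
mrna = [1, 3, 2, 5, 4]
partial = partial_spearman(feature, protein, mrna)
```

Working on ranks rather than raw values is what makes the measure robust to the non-linear, monotonic relationships mentioned in the text.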
We observed a significant positive correlation between mRNA and protein concentrations, larger than many previous measurements (de Sousa Abreu et al, 2009). We also show that the contribution of translation and protein degradation is at least as important as the contribution of mRNA transcription and stability to the abundance variation of the final protein products. Although variation in mRNA expression explains ∼25–30% of the variation in protein abundance, another 30–40% can be accounted for by characteristics of the sequences, which we identified in a comparative assessment of global correlates. Among these characteristics, sequence length, amino-acid frequencies and also nucleotide frequencies in the coding region are of strong influence (Figure 3A). Characteristics of the 3′UTR and of the 5′UTR, that is length, nucleotide composition and secondary structures, describe another part of the variation, leaving 33% expression variation unexplained. The unexplained fraction may be accounted for by mechanisms not considered in this analysis (e.g. regulation by RNA-binding proteins or gene-specific structural motifs), as well as expression and measurement noise.
Our combined model including mRNA concentration and sequence features can explain 67% of the variation of protein abundance in this system—and thus has the highest predictive power for human protein abundance achieved so far (Figure 3B).
Transcription, mRNA decay, translation and protein degradation are essential processes during eukaryotic gene expression, but their relative global contributions to steady-state protein concentrations in multi-cellular eukaryotes are largely unknown. Using measurements of absolute protein and mRNA abundances in cellular lysate from the human Daoy medulloblastoma cell line, we quantitatively evaluate the impact of mRNA concentration and sequence features implicated in translation and protein degradation on protein expression. Sequence features related to translation and protein degradation have an impact similar to that of mRNA abundance, and their combined contribution explains two-thirds of protein abundance variation. mRNA sequence lengths, amino-acid properties, upstream open reading frames and secondary structures in the 5′ untranslated region (UTR) were the strongest individual correlates of protein concentrations. In a combined model, characteristics of the coding region and the 3′UTR explained a larger proportion of protein abundance variation than characteristics of the 5′UTR. The absolute protein and mRNA concentration measurements for >1000 human genes described here represent one of the largest datasets currently available, and reveal both general trends and specific examples of post-transcriptional regulation.
PMCID: PMC2947365  PMID: 20739923
gene expression regulation; protein degradation; protein stability; translation
13.  A multi-template combination algorithm for protein comparative modeling 
Multiple protein templates are commonly used in manual protein structure prediction. However, few automated algorithms of selecting and combining multiple templates are available.
Here we develop an effective multi-template combination algorithm for protein comparative modeling. The algorithm selects templates according to the similarity significance of the alignments between template and target proteins. It combines the whole template-target alignments whose similarity significance score is close to that of the top template-target alignment within a threshold, whereas it only takes alignment fragments from a less similar template-target alignment that align with a sizable uncovered region of the target.
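The selection rule described above can be sketched as follows. This is a simplified illustration and not the published implementation: it assumes that a higher score means a more significant alignment, represents each alignment only by the target residue range it covers, and takes the longest contiguous uncovered stretch as the fragment. The class, the function names and the default values of score_window and min_fragment are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    template: str
    score: float  # assumed: higher means more significant
    start: int    # first aligned target residue (inclusive)
    end: int      # last aligned target residue (inclusive)


def longest_run(positions):
    """Return (start, end) of the longest run of consecutive integers, or None."""
    best = None
    run_start = previous = None
    for p in sorted(positions):
        if previous is None or p != previous + 1:
            run_start = p
        previous = p
        if best is None or p - run_start > best[1] - best[0]:
            best = (run_start, p)
    return best


def combine_templates(alignments, score_window=5.0, min_fragment=20):
    """Pick whole alignments near the top score, then gap-filling fragments."""
    ranked = sorted(alignments, key=lambda a: a.score, reverse=True)
    if not ranked:
        return []
    top_score = ranked[0].score
    covered = set()
    selected = []
    for aln in ranked:
        positions = set(range(aln.start, aln.end + 1))
        if top_score - aln.score <= score_window:
            # Close to the top template: use the whole alignment.
            selected.append((aln.template, aln.start, aln.end))
            covered |= positions
        else:
            # Less similar template: keep only a sizable uncovered fragment.
            run = longest_run(positions - covered)
            if run is not None and run[1] - run[0] + 1 >= min_fragment:
                selected.append((aln.template, run[0], run[1]))
                covered |= set(range(run[0], run[1] + 1))
    return selected


# Toy alignments; template names and scores are made up.
chosen = combine_templates([
    Alignment("A", 100.0, 1, 80),
    Alignment("B", 97.0, 10, 90),
    Alignment("C", 60.0, 70, 150),
    Alignment("D", 50.0, 85, 100),
])
```

In this toy case A and B are used whole, C contributes only the residues that A and B leave uncovered, and D is dropped because it adds no new coverage.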
We compare the algorithm with the traditional method of using a single top template on the 45 comparative modeling targets (i.e. easy template-based modeling targets) used in the seventh edition of Critical Assessment of Techniques for Protein Structure Prediction (CASP7). The multi-template combination algorithm improves the GDT-TS scores of predicted models by 6.8% on average. The statistical analysis shows that the improvement is significant (p-value < 10^-4). Compared with the ideal approach that always uses the best template, the multi-template approach yields only slightly better performance. During the CASP7 experiment, the preliminary implementation of the multi-template combination algorithm (FOLDpro) was ranked second among 67 servers in the category of high-accuracy structure prediction in terms of GDT-TS measure.
We have developed a novel multi-template algorithm to improve protein comparative modeling.
PMCID: PMC2311309  PMID: 18366648
14.  Analysis of multiple compound–protein interactions reveals novel bioactive molecules 
The authors use machine learning of compound-protein interactions to explore drug polypharmacology and to efficiently identify bioactive ligands, including novel scaffold-hopping compounds for two pharmaceutically important protein families: G-protein coupled receptors and protein kinases.
We have demonstrated that machine learning of multiple compound–protein interactions is useful for efficient ligand screening and for assessing drug polypharmacology. This approach successfully identified novel scaffold-hopping compounds for two pharmaceutically important protein families: G-protein-coupled receptors and protein kinases. These bioactive compounds were not detected by existing computational ligand-screening methods in comparative studies. The results of this study indicate that data derived from chemical genomics can be highly useful for exploring chemical space, and this systems biology perspective could accelerate drug discovery processes.
The discovery of novel bioactive molecules advances our systems-level understanding of biological processes and is crucial for innovation in drug development. Perturbations of biological systems by chemical probes provide broader applications not only for analysis of complex systems but also for intentional manipulations of these systems. Nevertheless, the lack of well-characterized chemical modulators has limited their use. Recently, chemical genomics has emerged as a promising area of research applicable to the exploration of novel bioactive molecules, and researchers are currently striving toward the identification of all possible ligands for all target protein families (Wang et al, 2009). Chemical genomics studies have shown that patterns of compound–protein interactions (CPIs) are too diverse to be understood as simple one-to-one events. There is an urgent need to develop appropriate data mining methods for characterizing and visualizing the full complexity of interactions between chemical space and biological systems. However, no existing screening approach has so far succeeded in identifying novel bioactive compounds using multiple interactions among compounds and target proteins.
High-throughput screening (HTS) and computational screening have greatly aided in the identification of early lead compounds for drug discovery. However, the large number of assays required for HTS to identify drugs that target multiple proteins render this process very costly and time-consuming. Therefore, interest in using in silico strategies for screening has increased. The most common computational approaches, ligand-based virtual screening (LBVS) and structure-based virtual screening (SBVS; Oprea and Matter, 2004; Muegge and Oloff, 2006; McInnes, 2007; Figure 1A), have been used for practical drug development. LBVS aims to identify molecules that are very similar to known active molecules and generally has difficulty identifying compounds with novel structural scaffolds that differ from reference molecules. The other popular strategy, SBVS, is constrained by the number of three-dimensional crystallographic structures available. To circumvent these limitations, we have shown that a new computational screening strategy, chemical genomics-based virtual screening (CGBVS), has the potential to identify novel, scaffold-hopping compounds and assess their polypharmacology by using a machine-learning method to recognize conserved molecular patterns in comprehensive CPI data sets.
The CGBVS strategy used in this study was made up of five steps: CPI data collection, descriptor calculation, representation of interaction vectors, predictive model construction using training data sets, and predictions from test data (Figure 1A). Importantly, step 1, the construction of a data set of chemical structures and protein sequences for known CPIs, did not require the three-dimensional protein structures needed for SBVS. In step 2, compound structures and protein sequences were converted into numerical descriptors. These descriptors were used to construct chemical or biological spaces in which decreasing distance between vectors corresponded to increasing similarity of compound structures or protein sequences. In step 3, we represented multiple CPI patterns by concatenating these chemical and protein descriptors. Using these interaction vectors, we could quantify the similarity of molecular interactions for compound–protein pairs, despite the fact that the ligand and protein similarity maps differed substantially. In step 4, concatenated vectors for CPI pairs (positive samples) and non-interacting pairs (negative samples) were input into an established machine-learning method. In the final step, the classifier constructed using training sets was applied to test data.
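Steps 3 to 5 can be sketched in a few lines. The example is a toy under loud assumptions: the descriptors are made-up numbers, and a nearest-centroid rule stands in for the unspecified "established machine-learning method", so it illustrates only the data flow (concatenate, train, predict) and not the classifier used in the study. All function names are hypothetical.

```python
def interaction_vector(compound_desc, protein_desc):
    """Step 3: concatenate compound and protein descriptors."""
    return list(compound_desc) + list(protein_desc)


def centroid(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train(positive_vectors, negative_vectors):
    """Step 4 (stand-in): store one centroid per class."""
    return centroid(positive_vectors), centroid(negative_vectors)


def predict(model, vector):
    """Step 5: True if the pair lies closer to the interacting centroid."""
    positive_centroid, negative_centroid = model
    return (squared_distance(vector, positive_centroid)
            < squared_distance(vector, negative_centroid))


# Toy descriptors (not real chemical or sequence descriptors).
interacting = [
    interaction_vector([0.9, 0.1], [1.0, 0.0]),
    interaction_vector([0.8, 0.2], [0.9, 0.1]),
]
non_interacting = [
    interaction_vector([0.1, 0.9], [1.0, 0.0]),
    interaction_vector([0.2, 0.8], [0.9, 0.1]),
]
model = train(interacting, non_interacting)
hit = predict(model, interaction_vector([0.7, 0.3], [0.8, 0.2]))
miss = predict(model, interaction_vector([0.3, 0.7], [0.8, 0.2]))
```

Because each sample describes a compound and a protein together, one model can score any compound against any protein in the descriptor space, which is what allows predictions without a three-dimensional structure of the target.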
To evaluate the predictive value of CGBVS, we first compared its performance with that of LBVS by fivefold cross-validation. CGBVS performed with considerably higher accuracy (91.9%) than did LBVS (84.4%; Figure 1B). We next compared CGBVS and SBVS in a retrospective virtual screening based on the human β2-adrenergic receptor (ADRB2). Figure 1C shows that CGBVS provided higher hit rates than did SBVS. These results suggest that CGBVS is more successful than conventional approaches for prediction of CPIs.
We then evaluated the ability of the CGBVS method to predict the polypharmacology of ADRB2 by attempting to identify novel ADRB2 ligands from a group of G-protein-coupled receptor (GPCR) ligands. We ranked the prediction scores for the interactions of 826 reported GPCR ligands with ADRB2 and then analyzed the 50 highest-ranked compounds in greater detail. Of 21 commercially available compounds, 11 showed ADRB2-binding activity and were not previously reported to be ADRB2 ligands. These compounds included ligands not only for aminergic receptors but also for neuropeptide Y-type 1 receptors (NPY1R), which have low protein homology to ADRB2. Most ligands we identified were not detected by LBVS and SBVS, which suggests that only CGBVS could identify this unexpected cross-reaction for a ligand developed to target a peptidergic receptor.
The true value of CGBVS in drug discovery must be tested by assessing whether this method can identify scaffold-hopping lead compounds from a set of compounds that is structurally more diverse. To assess this ability, we analyzed 11 500 commercially available compounds to predict compounds likely to bind to two GPCRs and two protein kinases. Functional assays revealed that nine ADRB2 ligands, three NPY1R ligands, five epidermal growth factor receptor (EGFR) inhibitors, and two cyclin-dependent kinase 2 (CDK2) inhibitors were concentrated in the top-ranked compounds (hit rate=30, 15, 25, and 10%, respectively). We also evaluated the extent of scaffold hopping achieved in the identification of these novel ligands. One ADRB2 ligand, two NPY1R ligands, and one CDK2 inhibitor exhibited scaffold hopping (Figure 4), indicating that CGBVS can use this characteristic to rationally predict novel lead compounds, a crucial and very difficult step in drug discovery. This feature of CGBVS is critically different from existing predictive methods, such as LBVS, which depend on similarities between test and reference ligands, and focus on a single protein or highly homologous proteins. In particular, CGBVS is useful for targets with undefined ligands because this method can use CPIs with target proteins that exhibit lower levels of homology.
In summary, we have demonstrated that data mining of multiple CPIs is of great practical value for exploration of chemical space. As a predictive model, CGBVS could provide an important step in the discovery of such multi-target drugs by identifying the group of proteins targeted by a particular ligand, leading to innovation in pharmaceutical research.
The discovery of novel bioactive molecules advances our systems-level understanding of biological processes and is crucial for innovation in drug development. For this purpose, the emerging field of chemical genomics is currently focused on accumulating large assay data sets describing compound–protein interactions (CPIs). Although new target proteins for known drugs have recently been identified through mining of CPI databases, using these resources to identify novel ligands remains unexplored. Herein, we demonstrate that machine learning of multiple CPIs can not only assess drug polypharmacology but can also efficiently identify novel bioactive scaffold-hopping compounds. Through a machine-learning technique that uses multiple CPIs, we have successfully identified novel lead compounds for two pharmaceutically important protein families, G-protein-coupled receptors and protein kinases. These novel compounds were not identified by existing computational ligand-screening methods in comparative studies. The results of this study indicate that data derived from chemical genomics can be highly useful for exploring chemical space, and this systems biology perspective could accelerate drug discovery processes.
PMCID: PMC3094066  PMID: 21364574
chemical genomics; data mining; drug discovery; ligand screening; systems chemical biology
15.  Template-based and free modeling by RAPTOR++ in CASP8 
Proteins  2009;77(Suppl 9):133-137.
We developed and tested RAPTOR++ in CASP8 for protein structure prediction. RAPTOR++ contains four modules: threading, model quality assessment, multiple protein alignment and template-free modeling. RAPTOR++ first threads a target protein to all the templates using three methods and then predicts the quality of the 3D model implied by each alignment using a model quality assessment method. Based upon the predicted quality, RAPTOR++ employs different strategies as follows. If multiple alignments have good quality, RAPTOR++ builds a multiple protein alignment between the target and top templates and then generates a 3D model using MODELLER. If all the alignments have very low quality, RAPTOR++ uses template-free modeling. Otherwise, RAPTOR++ submits a threading-generated 3D model with the best quality. RAPTOR++ was not ready for the first 1/3 targets and was under development during the whole CASP8 season. The template-based and template-free modeling modules in RAPTOR++ are not closely integrated. We are using our template-free modeling technique to refine template-based models.
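The three-way strategy described above amounts to a small decision rule over the predicted alignment qualities. The sketch below is illustrative only: the quality scale, the two thresholds and the minimum count of two good alignments are hypothetical assumptions, since the abstract does not state them, and the function name is not from RAPTOR++.

```python
def choose_strategy(predicted_qualities, good=0.6, very_low=0.2):
    """Pick a modeling strategy from predicted alignment qualities.

    The thresholds `good` and `very_low` are hypothetical placeholders.
    """
    if not predicted_qualities:
        raise ValueError("at least one alignment is required")
    n_good = sum(1 for q in predicted_qualities if q >= good)
    if n_good >= 2:
        # Multiple good alignments: build a multiple protein alignment.
        return "multi-template"
    if all(q < very_low for q in predicted_qualities):
        # Every alignment is very poor: fall back to template-free modeling.
        return "template-free"
    # Otherwise submit the threading-generated model with the best quality.
    return "best-single-threading"


example = choose_strategy([0.75, 0.70, 0.30])
```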
PMCID: PMC2785131  PMID: 19722267
template-based modeling; template-free modeling; protein threading; model quality assessment
16.  CASP9 Target Classification 
Proteins  2011;79(Suppl 10):21-36.
The Critical Assessment of Protein Structure Prediction round 9 (CASP9) aimed to evaluate predictions for 129 experimentally determined protein structures. To assess tertiary structure predictions, these target structures were divided into domain-based evaluation units that were then classified into two assessment categories: template based modeling (TBM) and template free modeling (FM). CASP9 targets were split into domains of structurally compact evolutionary modules. For the targets with more than one defined domain, the decision to split structures into domains for evaluation was based on server performance. Target domains were categorized based on their evolutionary relatedness to existing templates as well as their difficulty levels indicated by server performance. Those target domains with sequence-related templates and high server prediction performance were classified as TBM, while those targets without identifiable templates and low server performance were classified as FM. However, using these generalizations for classification resulted in a blurred boundary between CASP9 assessment categories. Thus, the FM category included those domains without sequence detectable templates (25 target domains) as well as some domains with difficult to detect templates whose predictions were as poor as those without templates (5 target domains). Several interesting examples are discussed, including targets with sequence related templates that exhibit unusual structural differences, targets with homologous or analogous structure templates that are not detectable by sequence, and targets with new folds.
PMCID: PMC3226894  PMID: 21997778
Protein Structure; CASP9; Classification; Fold space; sequence homologs; structure analogs; free modeling; template based modeling; structure prediction
17.  Low-homology protein threading 
Bioinformatics  2010;26(12):i294-i300.
Motivation: The challenge of template-based modeling lies in the recognition of correct templates and generation of accurate sequence-template alignments. Homologous information has proved to be very powerful in detecting remote homologs, as demonstrated by the state-of-the-art profile-based method HHpred. However, HHpred does not fare well when the proteins under consideration are low-homology. A protein is low-homology if we cannot obtain a sufficient amount of homologous information for it from existing protein sequence databases.
Results: We present a profile-entropy dependent scoring function for low-homology protein threading. This method will model correlation among various protein features and determine their relative importance according to the amount of homologous information available. When proteins under consideration are low-homology, our method will rely more on structure information; otherwise, homologous information. Experimental results indicate that our threading method greatly outperforms the best profile-based method HHpred and all the top CASP8 servers on low-homology proteins. Tested on the CASP8 hard targets, our threading method is also better than all the top CASP8 servers but slightly worse than Zhang-Server. This is significant considering that Zhang-Server and other top CASP8 servers use a combination of multiple structure-prediction techniques including consensus method, multiple-template modeling, template-free modeling and model refinement while our method is a classical single-template-based threading method without any post-threading refinement.
PMCID: PMC2881377  PMID: 20529920
18.  SpaK/SpaR Two-component System Characterized by a Structure-driven Domain-fusion Method and in Vitro Phosphorylation Studies 
PLoS Computational Biology  2009;5(6):e1000401.
Here we introduce a quantitative structure-driven computational domain-fusion method, which we used to predict the structures of proteins believed to be involved in regulation of the subtilin pathway in Bacillus subtilis, and used to predict a protein-protein complex formed by interaction between the proteins. Homology modeling of SpaK and SpaR yielded preliminary structural models based on a best template for SpaK comprising a dimer of a histidine kinase, and for SpaR a response regulator protein. Our LGA code was used to identify multi-domain proteins with structure homology to both modeled structures, yielding a set of domain-fusion templates then used to model a hypothetical SpaK/SpaR complex. The models were used to identify putative functional residues and residues at the protein-protein interface, and bioinformatics was used to compare functionally and structurally relevant residues in corresponding positions among proteins with structural homology to the templates. Models of the complex were evaluated in light of known properties of the functional residues within two-component systems involving His-Asp phosphorelays. Based on this analysis, a phosphotransferase complexed with a beryllofluoride was selected as the optimal template for modeling a SpaK/SpaR complex conformation. In vitro phosphorylation studies performed using wild type and site-directed SpaK mutant proteins validated the predictions derived from application of the structure-driven domain-fusion method: SpaK was phosphorylated in the presence of 32P-ATP and the phosphate moiety was subsequently transferred to SpaR, supporting the hypothesis that SpaK and SpaR function as sensor and response regulator, respectively, in a two-component signal transduction system, and furthermore suggesting that the structure-driven domain-fusion approach correctly predicted a physical interaction between SpaK and SpaR. 
Our domain-fusion algorithm leverages quantitative structure information and provides a tool for generation of hypotheses regarding protein function, which can then be tested using empirical methods.
Author Summary
Because proteins so frequently function in coordination with other proteins, identification and characterization of the interactions among proteins are essential for understanding how proteins work. Computational methods for identification of protein-protein interactions have been limited by the degree to which proteins are similar in sequence. However, methods that leverage structure information can overcome this limitation of sequence-based methods; the three-dimensional information provided by structure enables identification of related proteins even when their sequences are dissimilar. In this work we present a quantitative method for identification of protein interacting partners, and we demonstrate its use in modeling the structure of a hypothetical complex between two proteins that function in a bacterial signaling system. This quantitative approach comprises a tool for generation of hypotheses regarding protein function, which can then be tested using empirical methods, and provides a basis for high-throughput prediction of protein-protein interactions, which could be applied on a whole-genome scale.
PMCID: PMC2686270  PMID: 19503843
19.  mRNA turnover rate limits siRNA and microRNA efficacy 
Based on a simple model of the mRNA life cycle, we predict that mRNAs with high turnover rates in the cell are more difficult to perturb with RNAi. We test this hypothesis using a luciferase reporter system and obtain additional evidence from a variety of large-scale data sets, including microRNA overexpression experiments and RT–qPCR-based efficacy measurements for thousands of siRNAs. Our results suggest that mRNA half-lives will influence how mRNAs are differentially perturbed whenever small RNA levels change in the cell, not only after transfection but also during differentiation, pathogenesis and normal cell physiology.
What determines how strongly an mRNA responds to a microRNA or an siRNA? We know that properties of the sequence match between the small RNA and the mRNA are crucial. However, large-scale validations of siRNA efficacies have shown that certain transcripts remain recalcitrant to perturbation even after repeated redesign of the siRNA (Krueger et al, 2007). Weak response to RNAi may thus be an inherent property of the mRNA, but the underlying factors have proven difficult to uncover.
siRNAs induce degradation by sequence-specific cleavage of their target mRNAs (Elbashir et al, 2001). MicroRNAs, too, induce mRNA degradation, and ∼80% of their effect on protein levels can be explained by changes in transcript abundance (Hendrickson et al, 2009; Guo et al, 2010). Given that multiple factors act simultaneously to degrade individual mRNAs, we here consider whether variable responses to micro/siRNA regulation may, in part, be explained simply by the basic dynamics of mRNA turnover. If a transcript is already under strong destabilizing regulation, it is theoretically possible that the relative change in abundance after the addition of a novel degrading factor would be less pronounced compared with a stable transcript (Figure 1). mRNA turnover is achieved by a multitude of factors, and the influence of such factors on targetability can be explored. However, their combined action, including yet unknown factors, is summarized into a single property: the mRNA decay rate.
First, we explored the theoretical relationship between the pre-existing turnover rate of an mRNA, and its expected susceptibility to perturbation by a small RNA. We assumed a basic model of the mRNA life cycle, in which the rate of transcription is constant and the rate of degradation is described by first-order kinetics. Under this model, the relative change in steady-state expression level will become smaller as the pre-existing decay rate grows larger, independent of the transcription rate. This relationship persists also if we assume various degrees of synergy and antagonism between the pre-existing factors and the external factor, with increasing synergism leading to transcripts being more equally targetable, regardless of their pre-existing decay rate.
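The basic model described above can be written out as a short sketch. It assumes constant transcription at rate `k_tx`, first-order decay at rate `k_decay`, and a small RNA that simply adds an independent first-order decay term `k_small_rna` (no synergy or antagonism). The steady state is then `k_tx / k_decay`, and the fraction of expression remaining after perturbation is `k_decay / (k_decay + k_small_rna)`. The function and parameter names, and the example half-lives, are illustrative assumptions rather than values from the study.

```python
import math


def half_life_to_rate(t_half_min):
    """Convert a half-life in minutes to a first-order decay rate (per minute)."""
    return math.log(2.0) / t_half_min


def steady_state(k_tx, k_decay):
    """Steady-state mRNA level for dm/dt = k_tx - k_decay * m."""
    return k_tx / k_decay


def remaining_fraction(k_decay, k_small_rna):
    """Steady-state level with the small RNA, relative to the level without it.

    Assumes the small RNA adds an independent decay term (purely additive case).
    The transcription rate cancels, so it does not appear here.
    """
    return k_decay / (k_decay + k_small_rna)
```

For example, if the small RNA alone would give a 300 min half-life, a transcript with a 100 min half-life retains 75% of its expression, whereas one with a 2000 min half-life retains about 13%. This is the predicted trend: the faster the pre-existing decay, the smaller the relative change.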
We next generated a series of four luciferase reporter constructs with destabilizing AU-rich elements (AREs) of various strengths incorporated into their 3′ UTRs. To evaluate how the different constructs would respond to perturbation, we performed co-transfections with an siRNA targeted at the coding region of the luciferase gene. This reduced the signal of the non-destabilized construct to 26% compared with a control siRNA. In contrast, the most destabilized construct showed 42% remaining reporter activity, and we could observe a dose–response relationship across the series.
The reporter experiment encouraged an investigation of this effect on real-world mRNAs. We analyzed a set of 2622 siRNAs, for which individual efficacies were determined using RT–qPCR 48 h post-transfection in HeLa cells. Of these, 1778 could be associated with an experimentally determined decay rate (Figure 4A). Although the overall correlation between the two variables was modest (Spearman's rank correlation rs=0.22, P<1e−20), we found that siRNAs directed at high-turnover (t1/2<200 min) and medium-turnover (200<t1/2<1000 min) transcripts were significantly less effective than those directed at low-turnover (t1/2>1000 min) transcripts (P<8e−11 and 4e−9, respectively, two-tailed KS-test, Figure 4B). While 41.6% (498/1196) of the siRNAs directed at low-turnover transcripts reached 10% remaining expression or better, only 16.7% (31/186) of the siRNAs that targeted high-turnover mRNAs reached this high degree of silencing (Figure 4B). Reduced targetability (25.2%, 100/396) was also seen for transcripts with medium-turnover rate.
Our results based on siRNA data suggested that turnover rates could also influence microRNA targeting. By assembling genome-wide mRNA expression data from 20 published microRNA transfections in HeLa cells, we found that predicted target mRNAs with short and medium half-life were significantly less repressed after transfection than their long-lived counterparts (P<8e−5 and P<0.03, respectively, two-tailed KS-test). Specifically, 10.2% (293/2874) of long-lived targets versus 4.4% (41/942) of short-lived targets were strongly (z-score <−3) repressed. siRNAs are known to cause off-target effects that are mediated, in part, by microRNA-like seed complementarity (Jackson et al, 2006). We analyzed changes in transcript levels after transfection of seven different siRNAs, each with a unique seed region (Jackson et al, 2006). Putative ‘off-targets' were identified by mapping of non-conserved seed matches in 3′ UTRs. We found that low-turnover mRNAs (t1/2 >1000 min) were more affected by seed-mediated off-target silencing than high-turnover mRNAs (t1/2 <200 min), with twice as many long-lived seed-containing transcripts (3.8 versus 1.9%) being strongly (z-score <−3) repressed.
In summary, mRNA turnover rates have an important influence on the changes exerted by small RNAs on mRNA levels. It can be assumed that mRNA half-lives will influence how mRNAs are differentially perturbed whenever small RNA levels change in the cell, not only after transfection but also during differentiation, pathogenesis and normal cell physiology.
The microRNA pathway participates in basic cellular processes and its discovery has enabled the development of si/shRNAs as powerful investigational tools and potential therapeutics. Based on a simple kinetic model of the mRNA life cycle, we hypothesized that mRNAs with high turnover rates may be more resistant to RNAi-mediated silencing. The results of a simple reporter experiment strongly supported this hypothesis. We followed this with a genome-wide scale analysis of a rich corpus of experiments, including RT–qPCR validation data for thousands of siRNAs, siRNA/microRNA overexpression data and mRNA stability data. We find that short-lived transcripts are less affected by microRNA overexpression, suggesting that microRNA target prediction would be improved if mRNA turnover rates were considered. Similarly, short-lived transcripts are more difficult to silence using siRNAs, and our results may explain why certain transcripts are inherently recalcitrant to perturbation by small RNAs.
PMCID: PMC3010119  PMID: 21081925
microRNA; mRNA decay; RNAi; siRNA
20.  Restricted N-glycan Conformational Space in the PDB and Its Implication in Glycan Structure Modeling 
PLoS Computational Biology  2013;9(3):e1002946.
Understanding glycan structure and dynamics is central to understanding protein-carbohydrate recognition and its role in protein-protein interactions. Given the difficulties in obtaining the glycan's crystal structure in glycoconjugates due to its flexibility and heterogeneity, computational modeling could play an important role in providing glycosylated protein structure models. To address whether glycan structures available in the PDB can be used as templates or fragments for glycan modeling, we present a survey of the N-glycan structures of 35 different sequences in the PDB. Our statistical analysis shows that the N-glycan structures found on homologous glycoproteins are significantly conserved compared to the random background, suggesting that N-glycan chains can be confidently modeled with template glycan structures whose parent glycoproteins share sequence similarity. On the other hand, N-glycan structures found on non-homologous glycoproteins do not show significant global structural similarity. Nonetheless, the internal substructures of these N-glycans, particularly the substructures that are closer to the protein, show significantly similar structures, suggesting that such substructures can be used as fragments in glycan modeling. Increased interactions with protein might be responsible for the restricted conformational space of N-glycan chains. Our results suggest that structure prediction/modeling of N-glycans of glycoconjugates using structure databases could be effective, and that different modeling approaches would be needed depending on the availability of template structures.
Author Summary
An N-glycan is a carbohydrate chain covalently linked to the side chain of asparagine. Due to the flexibility of carbohydrate chains, it is believed that the N-glycan chains would not have a well-defined structure. However, our survey of N-glycan structures in the PDB shows that the N-glycan structures found on the surfaces of homologous glycoproteins are significantly conserved. This suggests that the interaction between the carbohydrate and the protein structure around the glycan chain plays an important role in determining the N-glycan structure. While the global N-glycan structures found on the surfaces of non-homologous glycoproteins are not conserved, the conformations of the carbohydrate residues that are closer to the protein appear to be more conserved. Our analysis highlights the applicability of template-based approaches used in protein structure prediction to structure prediction and modeling of N-glycans of glycoproteins.
PMCID: PMC3597548  PMID: 23516343
21.  GalaxyTBM: template-based modeling by building a reliable core and refining unreliable local regions 
BMC Bioinformatics  2012;13:198.
Protein structures can be reliably predicted by template-based modeling (TBM) when experimental structures of homologous proteins are available. However, it is challenging to obtain structures more accurate than the single best templates by either combining information from multiple templates or by modeling regions that vary among templates or are not covered by any templates.
We introduce GalaxyTBM, a new TBM method in which the more reliable core region is modeled first from multiple templates and less reliable, variable local regions, such as loops or termini, are then detected and re-modeled by an ab initio method. This TBM method is based on “Seok-server,” which was tested in CASP9 and assessed to be amongst the top TBM servers. The accuracy of the initial core modeling is enhanced by focusing on more conserved regions in the multiple-template selection and multiple sequence alignment stages. Additional improvement is achieved by ab initio modeling of up to 3 unreliable local regions in the fixed framework of the core structure. Overall, GalaxyTBM reproduced the performance of Seok-server, with GalaxyTBM and Seok-server resulting in average GDT-TS of 68.1 and 68.4, respectively, when tested on 68 single-domain CASP9 TBM targets. For application to multi-domain proteins, GalaxyTBM must be combined with domain-splitting methods.
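For readers unfamiliar with the GDT-TS values quoted above, the score averages the percentage of residues whose model positions fall within 1, 2, 4 and 8 Å of the experimental structure. The sketch below is a simplification: it assumes the per-residue Cα distances come from one fixed superposition, whereas the full measure searches over superpositions to maximize each percentage. The function name is illustrative.

```python
def gdt_ts(distances, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT-TS (0 to 100) from per-residue distances in angstroms.

    Assumes a single fixed superposition; the full measure maximizes the
    fraction of residues under each cutoff over many superpositions.
    """
    n = len(distances)
    fractions = [sum(d <= c for d in distances) / n for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)
```

A model with four residues at distances 0.5, 1.5, 3.0 and 10.0 Å scores 56.25 under this simplification.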
Application of GalaxyTBM to CASP9 targets demonstrates that accurate protein structure prediction is possible by use of a multiple-template-based approach, and ab initio modeling of variable regions can further enhance the model quality.
PMCID: PMC3462707  PMID: 22883815
Protein structure prediction; Model refinement; Loop modeling; Terminus modeling
22.  Modeling of loops in proteins: a multi-method approach 
Template-target sequence alignment and loop modeling are key components of protein comparative modeling. Short loops can be predicted with high accuracy using structural fragments from other, not necessarily homologous proteins, or by various minimization methods. For longer loops, multiscale approaches employing coarse-grained de novo modeling techniques should be more effective.
For a representative set of protein structures of various structural classes, test predictions of loop regions have been performed using MODELLER, ROSETTA, and a CABS coarse-grained de novo modeling tool. Loops of various lengths, from 4 to 25 residues, were modeled assuming an ideal target-template alignment of the remaining portions of the protein. It has been shown that classical modeling with MODELLER is usually better for short loops, while coarse-grained de novo modeling is more effective for longer loops. Even very long missing fragments in protein structures could be effectively modeled. Resolution of such models is usually at the level of 2-6 Å, which could be sufficient for guiding protein engineering. Further improvement of modeling accuracy could be achieved by the combination of different methods. In particular, we used 10 top ranked models from sets of 500 models generated by MODELLER as multiple templates for CABS modeling. On average, the resulting molecular models were better than the models from individual methods.
Accuracy of protein modeling, as demonstrated for the problem of loop modeling, could be improved by the combinations of different modeling techniques.
PMCID: PMC2837870  PMID: 20149252
23.  The multiple-specificity landscape of modular peptide recognition domains 
Using large scale experimental datasets, the authors show how modular protein interaction domains such as PDZ, SH3 or WW domains, frequently display unexpected multiple binding specificity. The observed multiple specificity leads to new structural insights and accurately predicts new protein interactions.
Modular protein domains interacting with short linear peptides, such as PDZ, SH3 or WW domains, display a rich binding specificity with significant interplay (or correlation) between ligand residues. The binding specificity of these domains is more accurately described with a multiple specificity model. The multiple specificity reveals new structural insights and predicts new protein interactions.
Modular protein domains have a central role in the complex network of signaling pathways that governs cellular processes. Many of them, called peptide recognition domains, bind short linear regions in their target proteins, such as the well-known SH3 or PDZ domains. These domain–peptide interactions are the predominant form of protein interaction in signaling pathways.
Because of the relative simplicity of the interaction, their binding specificity is generally represented using a simple model, analogous to transcription factor binding: the domain binds a short stretch of amino acids and at each position some amino acids are preferred over other ones. Thus, for each position, a probability can be assigned to each amino acid and these probabilities are often grouped into a matrix called position weight matrix (PWM) or position-specific scoring matrix. Such a matrix can then be represented in a highly intuitive manner as a so-called sequence logo (see Figure 1).
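The position weight matrix described above can be sketched in a few lines. The example assumes a set of aligned, equal-length peptides and adds a pseudocount so that unobserved amino acids keep a non-zero probability. The function names and the pseudocount value are illustrative assumptions, not the procedure of the study.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pwm(peptides, pseudocount=1.0):
    """Return one {amino acid: probability} dict per peptide position."""
    length = len(peptides[0])
    if any(len(p) != length for p in peptides):
        raise ValueError("all peptides must have the same length")
    total = len(peptides) + pseudocount * len(AMINO_ACIDS)
    pwm = []
    for pos in range(length):
        counts = Counter(p[pos] for p in peptides)
        pwm.append({aa: (counts.get(aa, 0) + pseudocount) / total
                    for aa in AMINO_ACIDS})
    return pwm


def log_score(pwm, peptide):
    """Sum of per-position log-probabilities; positions are treated as independent."""
    return sum(math.log(column[aa]) for column, aa in zip(pwm, peptide))
```

Because `log_score` is a plain sum over positions, it encodes exactly the independence assumption that such a matrix makes about ligand residues.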
A main shortcoming of this specificity model is that, although intuitive and interpretable, it inherently assumes that all residues in the peptide contribute independently to binding. On the basis of statistical analyses of large data sets of peptides binding to PDZ, SH3 and WW domains, we show that for most domains, this is not the case. Indeed, there is complex and highly significant interplay between the ligand residues. To overcome this issue, we develop a computational model that can both take into account such correlations and also preserve the advantages of PWMs, namely its straightforward interpretability.
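One simple way to check whether two ligand positions are independent is to compute their mutual information from a set of aligned binding peptides. The sketch below uses this standard measure as a stand-in: the text above does not state which statistic the authors used, and the function name and toy peptides are hypothetical.

```python
import math
from collections import Counter


def mutual_information(peptides, i, j):
    """Mutual information (in nats) between positions i and j of aligned peptides.

    Zero means the two positions look independent in this sample; larger
    values indicate that the residue at one position predicts the other.
    """
    n = len(peptides)
    pair_counts = Counter((p[i], p[j]) for p in peptides)
    i_counts = Counter(p[i] for p in peptides)
    j_counts = Counter(p[j] for p in peptides)
    mi = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((i_counts[a] / n) * (j_counts[b] / n)))
    return mi
```

On the toy set `["AK", "AK", "DR", "DR"]` the two positions are perfectly coupled and the value is ln 2, while on `["AK", "AR", "DK", "DR"]` it is zero. With real data, small samples inflate this estimate, so a significance test against shuffled peptides would be needed before drawing conclusions.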
Briefly, our method detects whether the domain is capable of binding its targets not only with a single specificity but also with multiple specificities. If so, it will determine all the relevant specificities (see Figure 1). This is accomplished by using a machine learning algorithm based on mixture models, and the results can be effectively visualized as multiple sequence logos. In other words, based on experimentally derived data sets of binding peptides, we determine for every domain, in addition to the known specificity, one or more new specificities. As such, we capture more real information, and our model performs better than previous models of binding specificity.
A crucial question is what these new specificities correspond to: are they simply mathematical artifacts coming out of some algorithm or do they represent something we can understand on a biophysical or structural level? Overall, the new specificities provide us with substantial new intuitive insight about the structural basis of binding for these domains. We can roughly identify two cases.
First, we have neighboring (or very close in sequence) amino acids in the ligand that show significant correlations. These usually correspond to amino acids whose side chains point in the same directions and often occupy the same physical space, and therefore can directly influence each other.
In other cases, we observe that multiple specificities found for a single domain are very different from each other. They correspond to different ways that the domain accommodates its binders. Often, conformational changes are required to switch from one binding mode to another. In almost all cases, only one canonical binding mode was previously known, and our analysis enables us to predict several interesting non-canonical ones. Specifically, we discuss one example in detail in Figure 5. In a PDZ domain of DLG1, we identify a novel binding specificity that differs from the canonical one by the presence of an additional tryptophan at the C terminus of the ligand. From a structural point of view, this would require a flexible loop to move out of the way to accommodate this rather large side chain. We find evidence of this predicted new binding mode based on both existing crystal structures and structural modeling.
Finally, our model of binding specificity leads to predictions of many new and previously unknown protein interactions. We validate a number of these using the membrane yeast two-hybrid approach.
In summary, we show here that multiple specificity is a general and underappreciated phenomenon for modular peptide recognition domains and that it leads to substantial new insight into the basis of protein interactions.
Modular protein interaction domains form the building blocks of eukaryotic signaling pathways. Many of them, known as peptide recognition domains, mediate protein interactions by recognizing short, linear amino acid stretches on the surface of their cognate partners with high specificity. Residues in these stretches are usually assumed to contribute independently to binding, which has led to a simplified understanding of protein interactions. Conversely, we observe in large binding peptide data sets that different residue positions display highly significant correlations for many domains in three distinct families (PDZ, SH3 and WW). These correlation patterns reveal a widespread occurrence of multiple binding specificities and give novel structural insights into protein interactions. For example, we predict a new binding mode of PDZ domains and structurally rationalize it for DLG1 PDZ1. We show that multiple specificity more accurately predicts protein interactions and experimentally validate some of the predictions for the human proteins DLG1 and SCRIB. Overall, our results reveal a rich specificity landscape in peptide recognition domains, suggesting new ways of encoding specificity in protein interaction networks.
PMCID: PMC3097085  PMID: 21525870
binding specificity; peptide recognition domains; PDZ; phage display; residue correlations
24.  Template-based protein structure modeling using TASSERVMT 
Proteins  2011;10.1002/prot.23183.
Template-based protein structure modeling is commonly used for protein structure prediction. Based on the observation that multiple template-based methods often perform better than single template-based methods, we further explore the use of a variable number of multiple templates for a given target in the latest variant of TASSER, TASSERVMT. We first develop an algorithm that improves the target-template alignment for a given template. The improved alignment, called the SP3 alternative alignment, is generated by a parametric alignment method coupled with short TASSER refinement on models selected using knowledge-based scores. The refined top model is then structurally aligned to the template to produce the SP3 alternative alignment. Templates identified using SP3 threading are combined with the SP3 alternative and HHSEARCH alignments to provide target alignments to each template. These template models are then grouped into sets containing a variable number of template/alignment combinations. For each set, we run short TASSER simulations to build full-length models. Then, the models from all sets of templates are pooled, and the top 20–50 models are selected using the FTCOM ranking method. These models are then subjected to a single longer TASSER refinement run for final prediction. We benchmarked our method by comparison with our previously developed approach, pro-sp3-TASSER, on a set with 874 Easy and 318 Hard targets. The average GDT-TS score improvements for the first model are 3.5% and 4.3% for Easy and Hard targets, respectively. When tested on the 112 CASP9 targets, our method improves the average GDT-TS scores as compared to pro-sp3-TASSER by 8.2% and 9.3% for the 80 Easy and 32 Hard targets, respectively. It also shows slightly better results than the top ranked CASP9 Zhang-Server, QUARK and HHpredA methods. The program is available for download at
PMCID: PMC3291807  PMID: 22105797
template-based modeling; threading; alignment; SP3; TASSER
25.  tRNA Signatures Reveal a Polyphyletic Origin of SAR11 Strains among Alphaproteobacteria 
PLoS Computational Biology  2014;10(2):e1003454.
Molecular phylogenetics and phylogenomics are subject to noise from horizontal gene transfer (HGT) and bias from convergence in macromolecular compositions. Extensive variation in size, structure and base composition of alphaproteobacterial genomes has complicated their phylogenomics, sparking controversy over the origins and closest relatives of the SAR11 strains. SAR11 are highly abundant, cosmopolitan aquatic Alphaproteobacteria with streamlined, A+T-biased genomes. A dominant view holds that SAR11 are monophyletic and related to both Rickettsiales and the ancestor of mitochondria. Other studies dispute this, finding evidence of a polyphyletic origin of SAR11 with most strains distantly related to Rickettsiales. Although careful evolutionary modeling can reduce bias and noise in phylogenomic inference, entirely different approaches may be useful to extract robust phylogenetic signals from genomes. Here we develop simple phyloclassifiers from bioinformatically derived tRNA Class-Informative Features (CIFs), features predicted to target tRNAs for specific interactions within the tRNA interaction network. Our tRNA CIF-based model robustly and accurately classifies alphaproteobacterial genomes into one of seven undisputed monophyletic orders or families, despite great variability in tRNA gene complement sizes and base compositions. Our model robustly rejects monophyly of SAR11, classifying all but one strain as Rhizobiales with strong statistical support. Yet remarkably, conventional phylogenetic analysis of tRNAs classifies all SAR11 strains identically as Rickettsiales. We attribute this discrepancy to convergence of SAR11 and Rickettsiales tRNA base compositions. Thus, tRNA CIFs appear more robust to compositional convergence than tRNA sequences generally. Our results suggest that tRNA-CIF-based phyloclassification is robust to HGT of components of the tRNA interaction network, such as aminoacyl-tRNA synthetases. 
We explain why tRNAs are especially advantageous for prediction of traits governing macromolecular interactions from genomic data, and why such traits may be advantageous in the search for robust signals to address difficult problems in classification and phylogeny.
Author Summary
If gene products work well in the networks of foreign cells, their genes may transfer horizontally between unrelated genomes. What factors dictate the ability to integrate into foreign networks? Different RNAs and proteins must interact specifically in order to function well as a system. For example, tRNA functions are determined by the interactions they have with other macromolecules. We have developed ways to predict, from genomic data alone, how tRNAs distinguish themselves to their specific interaction partners. Here, as proof of concept, we built a robust computational model from these bioinformatic predictions in seven lineages of Alphaproteobacteria. We validated our model by classifying hundreds of diverse alphaproteobacterial taxa and tested it on eight strains of SAR11, a phylogenetically controversial group that is highly abundant in the world's oceans. We found that different strains of SAR11 are more distantly related, both to each other and to mitochondria, than widely believed. We explain conflicting results about SAR11 as an artifact of bias created by the variability in base contents of alphaproteobacterial genomes. While this bias affects tRNAs too, our classifier appears unexpectedly robust to it. More broadly, our results suggest that traits governing macromolecular interactions may be more faithfully vertically inherited than the macromolecules themselves.
PMCID: PMC3937112  PMID: 24586126
