Over the last decade, and especially after the advent of fluorescent in situ hybridization imaging and chromosome conformation capture methods, the availability of experimental data on genome three-dimensional organization has dramatically increased. We now have access to unprecedented details of how genomes organize within the interphase nucleus. Development of new computational approaches to leverage these data has already resulted in the first three-dimensional structures of genomic domains and genomes. Such approaches expand our knowledge of the principles of chromatin folding, which have classically been studied using polymer physics and molecular simulations. Our outlook describes computational approaches for integrating experimental data with polymer physics, thereby bridging the resolution gap for structural determination of genomes and genomic domains.
The central dogma of molecular biology has provided a meaningful principle
for data integration in the field of genomics. In this context, integration reflects
the known transitions from a chromosome to a protein sequence: transcription,
intron splicing, exon assembly and translation. There is no such clear principle for
integrating proteomics data, since the laws governing protein folding and interactivity
are not quite understood. In our effort to bring together independent pieces of
information relative to proteins in a biologically meaningful way, we assess the bias of
bioinformatics resources and consequent approximations in the framework of small-scale
studies. We analyse proteomics data while following both a data-driven (focus
on proteins smaller than 10 kDa) and a hypothesis-driven (focus on whole bacterial
proteomes) approach. These applications are potentially the source of specialized
complements to classical biological ontologies.
All complex life on Earth is eukaryotic. All eukaryotic cells share a common ancestor that arose just once in four billion years of evolution. Prokaryotes show no tendency to evolve greater morphological complexity, despite their metabolic virtuosity. Here I argue that the eukaryotic cell originated in a unique prokaryotic endosymbiosis, a singular event that transformed the selection pressures acting on both host and endosymbiont.
The reductive evolution and specialisation of endosymbionts to mitochondria resulted in an extreme genomic asymmetry, in which the residual mitochondrial genomes enabled the expansion of bioenergetic membranes over several orders of magnitude, overcoming the energetic constraints on prokaryotic genome size, and permitting the host cell genome to expand (in principle) over 200,000-fold. This energetic transformation was permissive, not prescriptive; I suggest that the actual increase in early eukaryotic genome size was driven by a heavy early bombardment of genes and introns from the endosymbiont to the host cell, producing a high mutation rate. Unlike prokaryotes, with lower mutation rates and heavy selection pressure to lose genes, early eukaryotes without genome-size limitations could mask mutations by cell fusion and genome duplication, as in allopolyploidy, giving rise to a proto-sexual cell cycle. The side effect was that a large number of shared eukaryotic basal traits accumulated in the same population, a sexual eukaryotic common ancestor, radically different to any known prokaryote.
The combination of massive bioenergetic expansion, release from genome-size constraints, and high mutation rate favoured a protosexual cell cycle and the accumulation of eukaryotic traits. These factors explain the unique origin of eukaryotes, the absence of true evolutionary intermediates, and the evolution of sex in eukaryotes but not prokaryotes.
This article was reviewed by: Eugene Koonin, William Martin, Ford Doolittle and Mark van der Giezen. For complete reports see the Reviewers' Comments section.
Significant progress has been made in recent years in a variety of seemingly unrelated fields such as sequencing, protein structure prediction, and high-throughput transcriptomics and metabolomics. At the same time, new microscopic models were developed that made it possible to analyze the evolution of genes and genomes from first principles. The results from these efforts enable, for the first time, a comprehensive insight into the evolution of complex systems and organisms on all scales, from sequences to organisms and populations. Every newly sequenced genome uncovers new genes, families, and folds. Where do these new genes come from? How do gene duplication and subsequent divergence of sequence and structure affect the fitness of the organism? What role does regulation play in the evolution of proteins and folds? Emerging synergism between data and modeling provides the first robust answers to these questions.
The primary role of the nucleus as an information storage, retrieval, and replication site requires the physical organization and compaction of meters of DNA. Although it has been clear for many years that nucleosomes constitute the first level of chromatin compaction, this contributes a relatively small fraction of the condensation needed to fit the typical genome into an interphase nucleus or set of metaphase chromosomes, indicating that there are additional “higher order” levels of chromatin condensation. Identifying these levels, their interrelationships, and the principles that govern their occurrence has been a challenging and much discussed problem. In this article, we focus on recent experimental advances and the emerging evidence indicating that structural plasticity and chromatin dynamics play dominant roles in genome organization. We also discuss novel approaches likely to yield important insights in the near future, and suggest research areas that merit further study.
How chromosomes are folded and organized within the nucleus is intensely debated. Recent work indicates their higher-order structure is surprisingly dynamic, which may be critical for functional plasticity.
Motivation: Eugene Myers in his string graph paper suggested that in a string graph or equivalently a unitig graph, any path spells a valid assembly. As a string/unitig graph also encodes every valid assembly of reads, such a graph, provided that it can be constructed correctly, is in fact a lossless representation of reads. In principle, every analysis based on whole-genome shotgun sequencing (WGS) data, such as SNP and insertion/deletion (INDEL) calling, can also be achieved with unitigs.
Results: To explore the feasibility of using de novo assembly in the context of resequencing, we developed a de novo assembler, fermi, that assembles Illumina short reads into unitigs while preserving most of the information in the input reads. SNPs and INDELs can be called by mapping the unitigs against a reference genome. By applying the method to 35-fold human resequencing data, we showed that in comparison to the standard pipeline, our approach yields similar accuracy for SNP calling and better results for INDEL calling. It has higher sensitivity than other de novo assembly based methods for variant calling. Our work suggests that variant calling with de novo assembly can be a beneficial complement to the standard variant calling pipeline for whole-genome resequencing. On the methodological side, we propose the FMD-index for forward–backward extension of DNA sequences, a fast algorithm for finding all super-maximal exact matches, and one-pass construction of unitigs from an FMD-index.
Analysis of a large collection of short insertions and deletions in primates and flies shows that the rate of insertions or deletions of specific lengths can vary by more than 100 fold, depending on the surrounding sequence.
Insertions and deletions (indels) are an important evolutionary force, making the evolutionary process more efficient and flexible by copying and removing genomic fragments of various lengths instead of rediscovering them by point mutations. As a mutational process, indels are known to be more active in specific sequences (like micro-satellites) but not much is known about the more general and mechanistic effect of sequence context on the insertion and deletion susceptibility of genomic loci.
Here we analyze a large collection of high confidence short insertions and deletions in primates and flies, revealing extensive correlations between sequence context and indel rates and building principled models for predicting these rates from sequence. According to our results, the rate of insertion or deletion of specific lengths can vary by more than 100-fold, depending on the surrounding sequence. These mutational biases can strongly influence the composition of the genome and the rate at which particular sequences appear. We exemplify this by showing how degenerate loci in human exons are selected to reduce their frame shifting indel propensity.
Insertions and deletions are strongly affected by sequence context. Consequently, genomes must adapt to significant variation in the mutational input at indel-prone and indel-immune loci.
Preterm delivery (PTD) is a significant public health problem associated with greater risk of mortality and morbidity in infants and mothers. Pathophysiologic processes that may lead to PTD start early in pregnancy. We investigated early pregnancy peripheral blood global gene expression and PTD risk.
As part of a prospective study, ribonucleic acid was extracted from blood samples (collected at 16 weeks gestational age) from 14 women who had PTD (cases) and 16 women who delivered at term (controls). Gene expression was measured using the GeneChip® Human Genome U133 Plus 2.0 Array. Student's t-test and fold change analysis were used to identify differentially expressed genes. We used hierarchical clustering and principal components analysis to characterize signature gene expression patterns among cases and controls. Pathway and promoter sequence analyses were used to investigate functions and functional relationships as well as regulatory regions of differentially expressed genes.
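The per-gene screen described above (a t-test combined with a fold-change filter) can be sketched as follows. This is a minimal illustration, not the study's pipeline: the function names, the example values, and both cutoffs (`t_cut`, `lfc_cut`) are assumptions introduced here, and a real analysis would convert the t statistic to a p-value and correct for multiple testing.

```python
import math
from statistics import mean, variance


def student_t(cases, controls):
    """Two-sample Student's t statistic with pooled variance (equal variances assumed)."""
    n1, n2 = len(cases), len(controls)
    pooled = ((n1 - 1) * variance(cases) + (n2 - 1) * variance(controls)) / (n1 + n2 - 2)
    return (mean(cases) - mean(controls)) / math.sqrt(pooled * (1 / n1 + 1 / n2))


def log2_fold_change(cases, controls):
    """log2 of the ratio of mean expression in cases to mean expression in controls."""
    return math.log2(mean(cases) / mean(controls))


def flag_genes(expr, t_cut=2.0, lfc_cut=1.0):
    """Return genes whose |t| and |log2 fold change| both reach the (hypothetical) cutoffs.

    `expr` maps a gene name to a (case_values, control_values) pair.
    """
    flagged = []
    for gene, (cases, controls) in expr.items():
        t = student_t(cases, controls)
        lfc = log2_fold_change(cases, controls)
        if abs(t) >= t_cut and abs(lfc) >= lfc_cut:
            flagged.append(gene)
    return flagged


# Hypothetical expression values, for illustration only.
example = {
    "geneA": ([8.0, 9.0, 10.0], [2.0, 2.5, 3.0]),
    "geneB": ([5.0, 5.1, 4.9], [5.0, 4.9, 5.1]),
}
print(flag_genes(example))
```

Requiring both criteria keeps genes with a large but noisy change, or a consistent but tiny change, out of the candidate list.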
A total of 209 genes, including potential candidate genes (e.g. PTGDS, prostaglandin D2 synthase 21 kDa), were differentially expressed. A set of these genes achieved accurate pre-diagnostic separation of cases and controls. These genes participate in functions related to the immune system and inflammation, organ development, metabolism (lipid, carbohydrate and amino acid) and cell signaling. Binding sites of putative transcription factors such as EGR1 (early growth response 1), TFAP2A (transcription factor AP2A), Sp1 (specificity protein 1) and Sp3 (specificity protein 3) were overrepresented in promoter regions of differentially expressed genes. Real-time PCR confirmed microarray expression measurements of selected genes.
PTD is associated with maternal early pregnancy peripheral blood gene expression changes. Maternal early pregnancy peripheral blood gene expression patterns may be useful for better understanding of PTD pathophysiology and PTD risk prediction.
The spatial organization of chromosomes inside the cell nucleus is still poorly understood. This organization is guided by intra- and interchromosomal contacts and by interactions of specific chromosomal loci with relatively fixed nuclear “landmarks” such as the nuclear envelope and the nucleolus. New molecular genome-wide mapping techniques have begun to uncover both types of molecular interactions, providing insights into the fundamental principles of interphase chromosome folding.
Osteoclasts are the principal bone-resorbing cells. Precise control of balanced osteoclast activity is indispensable for bone homeostasis. Osteoclast activation mediated by the RANK-TRAF6 axis has been clearly identified. However, the negative regulatory machinery in osteoclasts remains unclear. TRAF family member-associated NF-κB activator (TANK) is induced about 10-fold during osteoclastogenesis, according to a genome-wide analysis of gene expression before and after osteoclast maturation, as confirmed by western blot and quantitative RT-PCR. Bone marrow macrophages (BMMs) transduced with lentivirus carrying Tank-shRNA were induced to form osteoclasts in the presence of RANKL and M-CSF. Tank expression was downregulated by 90% by Tank-shRNA, as confirmed by western blot. Compared with wild-type (WT) cells, osteoclastogenesis of Tank-silenced BMMs was increased, according to tartrate-resistant acid phosphatase (TRAP) staining on day 5 and day 7. The number of bone resorption pits formed by Tank-silenced osteoclasts was increased by 176% compared with WT cells, as shown by wheat germ agglutinin (WGA) staining and scanning electron microscopy (SEM) analysis. The survival rate of Tank-silenced mature osteoclasts was also increased. However, acid production of Tank-knockdown cells was unchanged compared with control cells. IκBα phosphorylation was increased in Tank-silenced cells, indicating that TANK may negatively regulate NF-κB activity in osteoclasts. In conclusion, Tank, whose expression is increased during osteoclastogenesis, inhibits osteoclast formation, activity and survival by regulating NF-κB activity and c-FLIP expression. Tank thus participates in a negative feedback loop in bone resorption. These results may provide means for therapeutic intervention in diseases of excessive bone resorption.
TANK; RANKL; NF-κB; Osteoclast.
Microarray technology is a widely used approach for monitoring genome-wide gene expression. For Arabidopsis, there are over 1,800 microarray hybridizations representing many different experimental conditions on Affymetrix™ ATH1 gene chips alone. This huge amount of data offers a unique opportunity to infer the principles that govern the regulation of gene expression in plants.
We used bioinformatics methods to analyze publicly available data obtained using the ATH1 chip from Affymetrix. A total of 1887 ATH1 hybridizations were normalized and filtered to eliminate low-quality hybridizations. We classified and compared control and treatment hybridizations and determined differential gene expression. The largest differences in gene expression were observed when comparing samples obtained from different organs. On average, ten-fold more genes were differentially expressed between organs as compared to any other experimental variable. We defined "gene responsiveness" as the number of comparisons in which a gene changed its expression significantly. We defined genes with the highest and lowest responsiveness levels as hypervariable and housekeeping genes, respectively. Remarkably, housekeeping genes were best distinguished from hypervariable genes by differences in methylation status in their transcribed regions. Moreover, methylation in the transcribed region was inversely correlated (R² = 0.8) with gene responsiveness on a genome-wide scale. We provide an example of this negative relationship using genes encoding TCA cycle enzymes, by contrasting their regulatory responsiveness to nitrate and methylation status in their transcribed regions.
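The definition of gene responsiveness given above is a simple count, which can be sketched as follows. The comparison labels, gene names, and the `low`/`high` cutoffs are hypothetical; the abstract does not state the thresholds used to call a gene housekeeping or hypervariable.

```python
from collections import Counter


def responsiveness(significant_by_comparison):
    """Count, per gene, the number of comparisons in which it changed significantly.

    `significant_by_comparison` maps a comparison label to the genes called
    differentially expressed in that comparison.
    """
    counts = Counter()
    for genes in significant_by_comparison.values():
        counts.update(set(genes))  # a gene counts at most once per comparison
    return counts


def classify(counts, all_genes, low, high):
    """Label genes by responsiveness using hypothetical cutoffs `low` and `high`."""
    labels = {}
    for gene in all_genes:
        n = counts.get(gene, 0)
        if n <= low:
            labels[gene] = "housekeeping"
        elif n >= high:
            labels[gene] = "hypervariable"
        else:
            labels[gene] = "intermediate"
    return labels


# Hypothetical comparisons and gene names, for illustration only.
calls = {
    "root_vs_leaf": ["g1", "g2", "g3"],
    "nitrate_vs_control": ["g1", "g3"],
    "cold_vs_control": ["g1"],
}
counts = responsiveness(calls)
print(classify(counts, ["g1", "g2", "g3", "g4"], low=0, high=3))
```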
Our results indicate that the Arabidopsis transcriptome is largely established during development and is comparatively stable when faced with external perturbations. We suggest a novel functional role for DNA methylation in the transcribed region as a key determinant capable of restraining the capacity of a gene to respond to internal/external cues. Our findings suggest a prominent role for epigenetic mechanisms in the regulation of gene expression in plants.
Specific attachment of chromosomal sites to the nuclear matrix is crucial to the control of transcription and DNA replication.
Although the principles governing chromosomal architecture are largely unresolved, there is evidence that higher-order chromatin folding is mediated by the anchoring of specific DNA sequences to the nuclear matrix. These genome anchors are also crucial regulators of gene expression and DNA replication, and play a role in pathogenesis.
Members of the cupin superfamily exhibit large variations in their sequences, functions, organization of domains, quaternary associations and the nature of the bound metal ion, despite having a conserved β-barrel structural scaffold. Here, an attempt has been made to understand structure-function relationships among the members of this diverse superfamily and identify the principles governing functional diversity. The cupin superfamily also contains proteins for which structures are available through world-wide structural genomics initiatives but which are characterized as “hypothetical”. We have explored the feasibility of obtaining clues to the functions of such proteins by means of comparative analysis with cupins of known structure and function.
A 3-D structure-based phylogenetic approach was undertaken. Interestingly, a dendrogram generated solely on the basis of structural dissimilarity measure at the level of domain folds was found to cluster functionally similar members. This clustering also reflects an independent evolution of the two domains in bicupins. Close examination of structural superposition of members across various functional clusters reveals structural variations in regions that not only form the active site pocket but are also involved in interaction with another domain in the same polypeptide or in the oligomer.
Structure-based phylogeny of cupins can inform the identification of functions for cupin-fold proteins whose function is as yet unknown. This approach can be extended to other proteins with a common fold that show high evolutionary divergence, and is expected to influence function annotation in structural genomics initiatives.
Cytokinesis requires duplication of cellular structures followed by bipolarization of the predivisional cell. As a common principle, this applies to prokaryotes as well as eukaryotes. With respect to eukaryotes, the discussion has focused mainly on Saccharomyces cerevisiae and on Schizosaccharomyces pombe. Escherichia coli and to a lesser extent Bacillus subtilis have been used as prokaryotic examples. To establish a bipolar cell, duplication of a eukaryotic origin of DNA replication as well as its genome is not sufficient. Duplication of the microtubule-organizing center is required as a prelude to mitosis, and it is here that the dynamic cytoskeleton with all its associated proteins comes to the fore. In prokaryotes, a cytoskeleton that pervades the cytoplasm appears to be absent. DNA replication and the concomitant DNA segregation seem to occur without help from extensive cytosolic supramacromolecular assemblies but with help from the elongating cellular envelope. Prokaryotic cytokinesis proceeds through a contracting ring, which has a roughly 100-fold-smaller circumference than its eukaryotic counterpart. Although the ring contains proteins that can be considered as predecessors of actin, tubulin, and microtubule-associated proteins, its macromolecular composition is essentially different.
Spatial organization of chromatin in the interphase nucleus plays a role in gene expression and inheritance. Although it appears not to be random, the principles of this organization are largely unknown. In this work, we show an explicit relationship between the intranuclear localization of various chromosome segments and the pattern of gene distribution along the genome sequence. Using a 7-megabase-long region of the Drosophila melanogaster chromosome 2 as a model, we observed that the six gene-poor chromosome segments identified in the region interact with components of the nuclear matrix to form a compact stable cluster. The six gene-rich segments form a spatially segregated unstable cluster dependent on nonmatrix nuclear proteins. The resulting composite structure formed by clusters of gene-rich and gene-poor regions is reproducible between the nuclei. We suggest that certain aspects of chromosome folding in interphase are predetermined and can be inferred through in silico analysis of chromosome sequence, using gene density profile as a manifestation of “folding code.”
This article describes a simple and inexpensive hands-on simulation of protein folding suitable for use in large lecture classes. This activity uses a minimum of parts, tools, and skill to simulate some of the fundamental principles of protein folding. The major concepts targeted are that proteins begin as linear polypeptides and fold to three-dimensional structures, noncovalent interactions drive this folding process, and the final folded shape of a protein depends on its amino acid sequence. At the start of the activity, students are given pieces of insulated wire from which they each construct and fold their own polypeptide. This activity was evaluated in three ways. A random sample of student-generated polypeptides collected after the activity shows that most students were able to create an appropriate structure. After this activity, students (n = 154) completed an open-ended survey. Their responses showed that more than three-quarters of the students learned one or more of the core concepts being demonstrated. Finally, a follow-up survey was conducted seven weeks after the activity; responses to this survey (n = 63) showed that a similar fraction of students still retained these key concepts. This activity should be useful in large introductory-level college biology or biochemistry lectures.
Protein function is generated and maintained by the proteostasis network (PN) (Balch et al. (2008) Science, 319:916). The PN is a modular, yet integrated system unique to each cell type that is sensitive to signaling pathways that direct development and aging, and respond to folding stress. Mismanagement of protein folding and function triggered by genetic, epigenetic, and environmental causes poses a major challenge to human health and lifespan. Herein, we address the impact of proteostasis defined by the FoldFx model on our understanding of protein folding and function in biology. FoldFx describes how general proteostasis control (GPC) enables the polypeptide chain sequence to achieve functional balance in the context of the cellular proteome. By linking together the chemical and energetic properties of the protein fold with the composition of the PN we discuss the principle of the proteostasis boundary (PB) as a key component of GPC. The curved surface of the PB observed in 3-dimensional space suggests that the polypeptide chain sequence and the PN operate as an evolutionarily conserved functional unit to generate and sustain protein dynamics required for biology. Modeling general proteostasis provides a rational basis for tackling some of the most challenging diseases facing mankind in the 21st century.
The recognition of functional binding sites in genomic DNA remains one of the fundamental challenges of genome research. During the last decades, a plethora of different and well-adapted models has been developed, but only little attention has been paid to the development of different and similarly well-adapted learning principles. Only recently was it noticed that discriminative learning principles can be superior to generative ones in diverse bioinformatics applications, too.
Here, we propose a generalization of generative and discriminative learning principles containing the maximum likelihood, maximum a posteriori, maximum conditional likelihood, maximum supervised posterior, generative-discriminative trade-off, and penalized generative-discriminative trade-off learning principles as special cases, and we illustrate its efficacy for the recognition of vertebrate transcription factor binding sites.
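One way to see how a single objective can contain these learning principles as special cases is to write it as a weighted sum of a conditional log-likelihood, a joint log-likelihood, and a log-prior. The notation below is an assumption introduced here for illustration (weights $\beta_0, \beta_1, \beta_2 \ge 0$, sequences $x_n$ with class labels $c_n$, parameters $\lambda$, prior $Q$); the abstract itself does not give the formula.

```latex
\hat{\lambda} = \arg\max_{\lambda} \Big[
    \beta_0 \sum_{n} \log P(c_n \mid x_n, \lambda)
  + \beta_1 \sum_{n} \log P(c_n, x_n \mid \lambda)
  + \beta_2 \log Q(\lambda)
\Big]
% Special cases under this sketch:
%   maximum likelihood:              (\beta_0, \beta_1, \beta_2) = (0, 1, 0)
%   maximum a posteriori:            (0, 1, 1)
%   maximum conditional likelihood:  (1, 0, 0)
%   maximum supervised posterior:    (1, 0, 1)
%   generative-discriminative trade-off:            \beta_2 = 0, \; \beta_0, \beta_1 > 0
%   penalized generative-discriminative trade-off:  \beta_0, \beta_1, \beta_2 > 0
```

Under this reading, moving weight between $\beta_0$ and $\beta_1$ interpolates between purely discriminative and purely generative training, and $\beta_2$ switches the prior on or off.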
We find that the proposed learning principle helps to improve the recognition of transcription factor binding sites, enabling better computational approaches for extracting as much information as possible from valuable wet-lab data. We make all implementations available in the open-source library Jstacs so that this learning principle can be easily applied to other classification problems in the field of genome and epigenome analysis.
We provide an overview of lipid-dependent polytopic membrane protein folding and topogenesis. Lipid dependence of this process was determined by employing Escherichia coli cells in which specific lipids can be eliminated, substituted, tightly titrated or controlled temporally during membrane protein synthesis and assembly. The secondary transport protein lactose permease (LacY) was used to establish general principles underlying the molecular basis of lipid-dependent effects on protein domain folding, protein transmembrane domain (TM) orientation, and function. These principles were then extended to several other secondary transport proteins of E. coli. The methods used to follow proper conformational organization of protein domains and the topological organization of protein TMs in whole cells and membranes are described. The proper folding of an extramembrane domain of LacY that is crucial for energy-dependent uphill transport function depends on specific lipids acting as non-protein molecular chaperones. Correct TM topogenesis is dependent on charge interactions between the cytoplasmic surface of membrane proteins and a proper balance of the membrane surface net charge defined by the lipid head groups. Short-range interactions between the nascent protein chain and the translocon are necessary but not sufficient for establishment of final topology. After release from the translocon, short-range interactions between lipid head groups and the nascent protein chain, partitioning of protein hydrophobic domains into the membrane bilayer, and long-range interactions within the protein thermodynamically drive final membrane protein organization.
Given the diversity of membrane lipid compositions throughout nature, it is tempting to speculate that during the course of evolution the physical and chemical properties of proteins and lipids have co-evolved in the context of the lipid environment of membrane systems, in which both mutually depend on each other for functional organization of proteins.
phosphatidylethanolamine; lactose permease; protein topology; lipochaperone; positive-inside rule
In this review, we give an overview of recent literature on the structure and stability of unimolecular G-rich quadruplex structures that are relevant to drug design and to in vivo function. The unifying theme in this review is energetics. The thermodynamic stability of quadruplexes has not been studied in the same detail as that of DNA and RNA duplexes, and there are important differences in the balance of forces between these classes of folded oligonucleotides. We provide an overview of the principles of stability and, where available, the experimental data that report on these principles. Significant gaps in the literature have been identified that should be filled by a systematic study of well-defined quadruplexes, both to provide the basic understanding of stability needed for design purposes and as it relates to the in vivo occurrence of quadruplexes. Techniques that are commonly applied to the determination of structure, stability and folding are discussed in terms of information content and limitations. Quadruplex structures fold and unfold comparatively slowly, and DNA unwinding events associated with transcription and replication may be operating far from equilibrium. The kinetics of formation and resolution of quadruplexes, and the associated methodologies, are discussed in the context of stability and their possible biological occurrence.
Identification of the structural domains of proteins is important for our understanding of the organizational principles and mechanisms of protein folding, and for insights into protein function and evolution. Algorithmic methods of dissecting protein of known structure into domains developed so far are based on an examination of multiple geometrical, physical and topological features. Successful as many of these approaches are, they employ a lot of heuristics, and it is not clear whether they illuminate any deep underlying principles of protein domain organization. Other well-performing domain dissection methods rely on comparative sequence analysis. These methods are applicable to sequences with known and unknown structure alike, and their success highlights a fundamental principle of protein modularity, but this does not directly improve our understanding of protein spatial structure.
We present a novel graph-theoretical algorithm for the identification of domains in proteins with known three-dimensional structure. We represent the protein structure as an undirected, unweighted and unlabeled graph whose nodes correspond to the secondary structure elements and whose edges represent physical proximity of at least one pair of alpha carbon atoms from two elements. Domains are identified as constrained partitions of the graph, corresponding to sets of vertices obtained by the maximization of the cycle distributions found in the graph. The decision to accept or reject a tentative cut position is based on a specific classifier. When a partition is accepted, the algorithm is applied iteratively to each of the resulting subgraphs, and it terminates automatically when partitions are no longer accepted. The distribution of cycles is the only type of information on which the decision about protein dissection is based. Despite the bare-bones simplicity of the approach, our algorithm approaches the best heuristic algorithms in accuracy.
Our graph-theoretical algorithm uses only topological information present in the protein structure itself to find the domains and does not rely on any geometrical or physical information about the protein molecule. Perhaps unexpectedly, these drastic constraints on resources, which result in a seemingly approximate description of protein structures and leave only a handful of parameters available for analysis, do not lead to any significant deterioration of algorithm accuracy. It appears that protein structures can be rigorously treated as topological rather than geometrical objects and that the majority of information about protein domains can be inferred from the coarse-grained measure of pairwise proximity between secondary structure elements.
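The graph representation described above can be sketched in a few lines. This is an illustration under stated assumptions, not the published algorithm: the 7.0 Å cutoff, the element names, and the coordinates are hypothetical, and `independent_cycles` computes only the cyclomatic number of a candidate subgraph as one simple cycle-based quantity. The cycle-distribution criterion and the classifier used to accept a cut are not reproduced here.

```python
import math
from itertools import combinations


def build_sse_graph(sse_ca_coords, cutoff=7.0):
    """Build an undirected, unweighted graph over secondary structure elements (SSEs).

    `sse_ca_coords` maps an SSE name to a list of alpha-carbon (x, y, z) coordinates.
    Two SSEs are joined if at least one pair of their alpha carbons lies within
    `cutoff` angstroms (the cutoff value is an assumption).
    """
    graph = {name: set() for name in sse_ca_coords}
    for a, b in combinations(sse_ca_coords, 2):
        close = any(
            math.dist(p, q) <= cutoff
            for p in sse_ca_coords[a]
            for q in sse_ca_coords[b]
        )
        if close:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def independent_cycles(graph, nodes):
    """Cyclomatic number (edges - vertices + components) of the subgraph induced by `nodes`."""
    nodes = set(nodes)
    edges = sum(len(graph[n] & nodes) for n in nodes) // 2
    seen, components = set(), 0
    for start in nodes:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:  # depth-first traversal of one connected component
            current = stack.pop()
            for neighbour in graph[current] & nodes:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
    return edges - len(nodes) + components


# Hypothetical coordinates: three mutually close elements and one peripheral element.
toy = {
    "H1": [(0.0, 0.0, 0.0)],
    "E1": [(5.0, 0.0, 0.0)],
    "E2": [(2.5, 4.0, 0.0)],
    "H2": [(11.0, 0.0, 0.0), (20.0, 0.0, 0.0)],
}
g = build_sse_graph(toy)
print(independent_cycles(g, ["H1", "E1", "E2"]), independent_cycles(g, ["E1", "H2"]))
```

In the toy example the three mutually close elements contain one independent cycle, while the pair joined by a single edge contains none, which gives a feel for why densely packed elements tend to group together under a cycle-based criterion.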
Difficult problems in structural bioinformatics are often studied in simple exact models to gain insights and to derive general principles. Protein folding, for example, has long been studied in the lattice model. Recently, researchers have also begun to apply the lattice model to the study of RNA folding.
We present a novel method for predicting RNA secondary structures with pseudoknots: first simulate the folding dynamics of the RNA sequence on the 3D triangular lattice, then extract and select a set of disjoint base pairs from the best lattice conformation found by the folding simulation. Experiments on sequences from PseudoBase show that our prediction method outperforms the HotKnots algorithm of Ren, Rastegari, Condon and Hoos, a leading method for RNA pseudoknot prediction. Our method for RNA secondary structure prediction can be adapted into an efficient reconstruction method that, given an RNA sequence and an associated secondary structure, finds a conformation of the sequence on the 3D triangular lattice that realizes the base pairs in the secondary structure. We implemented a suite of computer programs for the simulation and visualization of RNA folding on the 3D triangular lattice. These programs come with detailed documentation and are accessible from the companion website of this paper at http://www.cs.usu.edu/~mjiang/rna/DeltaIS/.
Folding simulation on the 3D triangular lattice is an effective method for RNA secondary structure prediction and lattice conformation reconstruction. The visualization software for the lattice conformations of RNA structures is a valuable tool for the study of RNA folding and a useful pedagogic device.
A complete understanding of a protein folding mechanism requires description of the distribution of microscopic pathways that connect the folded and unfolded states. This distribution can, in principle, be described by computer simulations and theoretical models of protein folding, but is hidden in conventional experiments on large ensembles of molecules because only average properties are measured. A long-term goal of single molecule fluorescence studies is to time-resolve the structural events as individual molecules make transitions between folded and unfolded states. Although such studies are still in their infancy, the work up to now shows great promise and has already produced novel and important information on current issues in protein folding that has been impossible or difficult to obtain from ensemble measurements.
To understand the physical and evolutionary determinants of protein folding, we map out the complete organization of thermodynamic and kinetic properties for protein sequences that share the same fold. The exhaustive nature of our study necessitates using simplified models of protein folding. We obtain a stability map and a folding rate map in sequence space. Comparison of the two maps reveals a common organizational principle: optimality decreases more or less uniformly with distance from the optimal sequence in the sequence space. This gives a funnel-shaped optimality surface. Evolutionary dynamics of a sequence population on these two maps reveal how the simple organization of sequence space affects the distributions of stability and folding rate preferred by evolution.
protein folding; protein sequence structure relationships; lattice model; hydrophobic polar; protein evolution
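To make the idea of scanning sequence space on a fixed fold concrete, the following sketch scores every hydrophobic-polar (HP) sequence of length six on one conformation. The 2D square lattice, the six-residue fold, and the contact energy of -1 per non-bonded H-H contact are assumptions chosen for brevity. The score is the energy on this one fold only; a true stability map would also compare each sequence against all competing conformations.

```python
from itertools import product


def hp_energy(sequence, coords):
    """HP-model energy of `sequence` threaded onto the lattice conformation `coords`.

    Each pair of H residues that are lattice neighbours but not chain neighbours
    contributes -1. `coords` is a list of (x, y) sites forming a self-avoiding walk.
    """
    if len(sequence) != len(coords) or len(set(coords)) != len(coords):
        raise ValueError("coords must be a self-avoiding walk matching the sequence length")
    position = {xy: i for i, xy in enumerate(coords)}
    energy = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that each contact is counted once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = position.get(neighbour)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                energy -= 1
    return energy


# A hypothetical compact 2 x 3 fold for a six-residue chain.
fold = [(0, 0), (1, 0), (2, 0), (2, 1), (1, 1), (0, 1)]

# Energy of every HP sequence of length six on this fold.
energy_map = {"".join(s): hp_energy(s, fold) for s in product("HP", repeat=6)}
best = min(energy_map.values())
print(best, sorted(s for s, e in energy_map.items() if e == best))
```

Sorting sequences by their distance (in point substitutions) from the lowest-energy sequences is one way to inspect how the score falls off across sequence space in such a toy model.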
Existing strategies for creating biosensors mainly rely on large conformational changes to transduce a binding event to an output signal. Most molecules, however, do not exhibit large-scale structural changes upon substrate binding. Here, we present a general approach (alternate frame folding, or AFF) for engineering allosteric control into ligand binding proteins. AFF can in principle be applied to any protein to establish a binding-induced conformational change, even if none exists in the natural molecule. The AFF design duplicates a portion of the amino acid sequence, creating an additional “frame” of folding. One frame corresponds to the wild-type sequence, and folding produces the normal structure. Folding in the second frame yields a circularly permuted protein. Because the two native structures compete for a shared sequence, they fold in a mutually exclusive fashion. Binding energy is used to drive the conformational change from one fold to the other. We demonstrate the approach by converting the protein calbindin D9k into a molecular switch that senses Ca2+. The structures of Ca2+-free and Ca2+-bound calbindin are nearly identical. Nevertheless, the AFF mechanism engineers a robust conformational change that we detect using two covalently attached fluorescent groups. Biological fluorophores can also be employed to create a genetically encoded sensor. AFF should be broadly applicable to create sensors for a variety of small molecules.