Over the last decade, and especially after the advent of fluorescent in situ hybridization imaging and chromosome conformation capture methods, the availability of experimental data on genome three-dimensional organization has dramatically increased. We now have access to unprecedented details of how genomes organize within the interphase nucleus. Development of new computational approaches to leverage this data has already resulted in the first three-dimensional structures of genomic domains and genomes. Such approaches expand our knowledge of the chromatin folding principles, which has been classically studied using polymer physics and molecular simulations. Our outlook describes computational approaches for integrating experimental data with polymer physics, thereby bridging the resolution gap for structural determination of genomes and genomic domains.
The central dogma of molecular biology has provided a meaningful principle
for data integration in the field of genomics. In this context, integration reflects
the known transitions from a chromosome to a protein sequence: transcription,
intron splicing, exon assembly and translation. There is no such clear principle for
integrating proteomics data, since the laws governing protein folding and interactivity
are not quite understood. In our effort to bring together independent pieces of
information relative to proteins in a biologically meaningful way, we assess the bias of
bioinformatics resources and consequent approximations in the framework of small-scale
studies. We analyse proteomics data while following both a data-driven (focus
on proteins smaller than 10 kDa) and a hypothesis-driven (focus on whole bacterial
proteomes) approach. These applications are potentially the source of specialized
complements to classical biological ontologies.
This paper outlines the history behind open access principles and describes the development of a managed access data-sharing process for the UK10K Project, currently Britain’s largest genomic sequencing consortium (2010 to 2013). Funded by the Wellcome Trust, the purpose of UK10K was two-fold: to investigate how low-frequency and rare genetic variants contribute to human disease, and to provide an enduring data resource for future research into human genetics. In this paper, we discuss the challenge of reconciling data-sharing principles with the practicalities of delivering a sequencing project of UK10K’s scope and magnitude. We describe the development of a sustainable, easy-to-use managed access system that allowed rapid access to UK10K data, while protecting the interests of participants and data generators alike. Specifically, we focus in depth on the three key issues that emerge in the data pipeline: study recruitment, data release and data access.
All complex life on Earth is eukaryotic. All eukaryotic cells share a common ancestor that arose just once in four billion years of evolution. Prokaryotes show no tendency to evolve greater morphological complexity, despite their metabolic virtuosity. Here I argue that the eukaryotic cell originated in a unique prokaryotic endosymbiosis, a singular event that transformed the selection pressures acting on both host and endosymbiont.
The reductive evolution and specialisation of endosymbionts to mitochondria resulted in an extreme genomic asymmetry, in which the residual mitochondrial genomes enabled the expansion of bioenergetic membranes over several orders of magnitude, overcoming the energetic constraints on prokaryotic genome size, and permitting the host cell genome to expand (in principle) over 200,000-fold. This energetic transformation was permissive, not prescriptive; I suggest that the actual increase in early eukaryotic genome size was driven by a heavy early bombardment of genes and introns from the endosymbiont to the host cell, producing a high mutation rate. Unlike prokaryotes, with lower mutation rates and heavy selection pressure to lose genes, early eukaryotes without genome-size limitations could mask mutations by cell fusion and genome duplication, as in allopolyploidy, giving rise to a proto-sexual cell cycle. The side effect was that a large number of shared eukaryotic basal traits accumulated in the same population, a sexual eukaryotic common ancestor, radically different to any known prokaryote.
The combination of massive bioenergetic expansion, release from genome-size constraints, and high mutation rate favoured a protosexual cell cycle and the accumulation of eukaryotic traits. These factors explain the unique origin of eukaryotes, the absence of true evolutionary intermediates, and the evolution of sex in eukaryotes but not prokaryotes.
This article was reviewed by: Eugene Koonin, William Martin, Ford Doolittle and Mark van der Giezen. For complete reports see the Reviewers' Comments section.
Significant progress has been made in recent years in a variety of seemingly unrelated fields such as sequencing, protein structure prediction, and high-throughput transcriptomics and metabolomics. At the same time, new microscopic models were developed that made it possible to analyze the evolution of genes and genomes from first principles. The results of these efforts enable, for the first time, a comprehensive insight into the evolution of complex systems and organisms on all scales, from sequences to organisms and populations. Every newly sequenced genome uncovers new genes, families, and folds. Where do these new genes come from? How do gene duplication and the subsequent divergence of sequence and structure affect the fitness of the organism? What role does regulation play in the evolution of proteins and folds? The emerging synergism between data and modeling provides the first robust answers to these questions.
The bacterial chromosome must be compacted over 1000-fold to fit into its cellular compartment. How it is condensed, organized and ultimately segregated has been a puzzle for over half a century. Recent advances in live-cell imaging and genome-scale analyses have led to new insights into these problems. We argue that the key feature of compaction is orderly folding of DNA along adjacent segments, and that this organization provides easy and efficient access for protein-DNA transactions and plays a central role in driving segregation. Similar principles and common proteins are used in eukaryotes to condense and resolve sister chromatids at metaphase.
Accumulating evidence demonstrates that the three-dimensional (3D) organization of chromosomes within the eukaryotic nucleus reflects and influences genomic activities, including transcription, DNA replication, recombination and DNA repair. In order to uncover structure-function relationships, it is necessary first to understand the principles underlying the folding and the 3D arrangement of chromosomes. Chromosome conformation capture (3C) provides a powerful tool for detecting interactions within and between chromosomes. A high throughput derivative of 3C, chromosome conformation capture on chip (4C), executes a genome-wide interrogation of interaction partners for a given locus. We recently developed a new method, a derivative of 3C and 4C, which, similar to Hi-C, is capable of comprehensively identifying long-range chromosome interactions throughout a genome in an unbiased fashion. Hence, our method can be applied to decipher the 3D architectures of genomes. Here, we provide a detailed protocol for this method.
Chromatin; chromosome; chromosome conformation capture (3C); chromosome conformation capture on chip (4C); genome architecture, three-dimensional (3D) organization
Analysis of a large collection of short insertions and deletions in primates and flies shows that the rate of insertions or deletions of specific lengths can vary by more than 100 fold, depending on the surrounding sequence.
Insertions and deletions (indels) are an important evolutionary force, making the evolutionary process more efficient and flexible by copying and removing genomic fragments of various lengths instead of rediscovering them by point mutations. As a mutational process, indels are known to be more active in specific sequences (like micro-satellites) but not much is known about the more general and mechanistic effect of sequence context on the insertion and deletion susceptibility of genomic loci.
Here we analyze a large collection of high confidence short insertions and deletions in primates and flies, revealing extensive correlations between sequence context and indel rates and building principled models for predicting these rates from sequence. According to our results, the rate of insertion or deletion of specific lengths can vary by more than 100-fold, depending on the surrounding sequence. These mutational biases can strongly influence the composition of the genome and the rate at which particular sequences appear. We exemplify this by showing how degenerate loci in human exons are selected to reduce their frame shifting indel propensity.
Insertions and deletions are strongly affected by sequence context. Consequently, genomes must adapt to significant variation in the mutational input at indel-prone and indel-immune loci.
The primary role of the nucleus as an information storage, retrieval, and replication site requires the physical organization and compaction of meters of DNA. Although it has been clear for many years that nucleosomes constitute the first level of chromatin compaction, this contributes a relatively small fraction of the condensation needed to fit the typical genome into an interphase nucleus or set of metaphase chromosomes, indicating that there are additional “higher order” levels of chromatin condensation. Identifying these levels, their interrelationships, and the principles that govern their occurrence has been a challenging and much discussed problem. In this article, we focus on recent experimental advances and the emerging evidence indicating that structural plasticity and chromatin dynamics play dominant roles in genome organization. We also discuss novel approaches likely to yield important insights in the near future, and suggest research areas that merit further study.
How chromosomes are folded and organized within the nucleus is intensely debated. Recent work indicates their higher-order structure is surprisingly dynamic, which may be critical for functional plasticity.
Motivation: Eugene Myers in his string graph paper suggested that in a string graph or equivalently a unitig graph, any path spells a valid assembly. As a string/unitig graph also encodes every valid assembly of reads, such a graph, provided that it can be constructed correctly, is in fact a lossless representation of reads. In principle, every analysis based on whole-genome shotgun sequencing (WGS) data, such as SNP and insertion/deletion (INDEL) calling, can also be achieved with unitigs.
Results: To explore the feasibility of using de novo assembly in the context of resequencing, we developed a de novo assembler, fermi, that assembles Illumina short reads into unitigs while preserving most of the information in the input reads. SNPs and INDELs can be called by mapping the unitigs against a reference genome. By applying the method to 35-fold human resequencing data, we showed that in comparison to the standard pipeline, our approach yields similar accuracy for SNP calling and better results for INDEL calling. It has higher sensitivity than other de novo assembly based methods for variant calling. Our work suggests that variant calling with de novo assembly can be a beneficial complement to the standard variant calling pipeline for whole-genome resequencing. On the methodological side, we propose the FMD-index for forward-backward extension of DNA sequences, a fast algorithm for finding all super-maximal exact matches, and one-pass construction of unitigs from an FMD-index.
Microarray technology is a widely used approach for monitoring genome-wide gene expression. For Arabidopsis, there are over 1,800 microarray hybridizations representing many different experimental conditions on Affymetrix™ ATH1 gene chips alone. This huge amount of data offers a unique opportunity to infer the principles that govern the regulation of gene expression in plants.
We used bioinformatics methods to analyze publicly available data obtained using the ATH1 chip from Affymetrix. A total of 1887 ATH1 hybridizations were normalized and filtered to eliminate low-quality hybridizations. We classified and compared control and treatment hybridizations and determined differential gene expression. The largest differences in gene expression were observed when comparing samples obtained from different organs. On average, ten-fold more genes were differentially expressed between organs as compared to any other experimental variable. We defined "gene responsiveness" as the number of comparisons in which a gene changed its expression significantly. We defined genes with the highest and lowest responsiveness levels as hypervariable and housekeeping genes, respectively. Remarkably, housekeeping genes were best distinguished from hypervariable genes by differences in methylation status in their transcribed regions. Moreover, methylation in the transcribed region was inversely correlated (R² = 0.8) with gene responsiveness on a genome-wide scale. We provide an example of this negative relationship using genes encoding TCA cycle enzymes, by contrasting their regulatory responsiveness to nitrate and methylation status in their transcribed regions.
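The "gene responsiveness" count defined above can be illustrated with a minimal sketch. The gene identifiers, the representation of each comparison as a set of significantly changed genes, and the `fraction` cutoff used to pick the extremes are all assumptions made for illustration; the text does not state how the highest and lowest responsiveness levels were delimited.

```python
from collections import Counter


def gene_responsiveness(comparisons):
    """Count, per gene, the comparisons in which it changed significantly.

    `comparisons` is an iterable of collections of gene IDs, one collection
    per control-vs-treatment comparison (hypothetical input format).
    """
    counts = Counter()
    for changed in comparisons:
        counts.update(set(changed))  # each gene counts once per comparison
    return counts


def split_extremes(counts, all_genes, fraction=0.1):
    """Return (lowest, highest) responsiveness genes.

    `fraction` is an illustrative cutoff, not a value from the text.
    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(all_genes, key=lambda g: (counts.get(g, 0), g))
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k], ranked[-k:]


# Toy usage with made-up gene IDs.
comparisons = [{"A", "B"}, {"A"}, {"A", "C"}]
counts = gene_responsiveness(comparisons)
housekeeping, hypervariable = split_extremes(counts, ["A", "B", "C", "D"], fraction=0.25)
```

In this toy input, gene `D` never changes and lands in the low-responsiveness group, while `A` changes in every comparison and lands in the high-responsiveness group.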
Our results indicate that the Arabidopsis transcriptome is largely established during development and is comparatively stable when faced with external perturbations. We suggest a novel functional role for DNA methylation in the transcribed region as a key determinant capable of restraining the capacity of a gene to respond to internal/external cues. Our findings suggest a prominent role for epigenetic mechanisms in the regulation of gene expression in plants.
Specific attachment of chromosomal sites to the nuclear matrix is crucial to the control of transcription and DNA replication.
Although the principles governing chromosomal architecture are largely unresolved, there is evidence that higher-order chromatin folding is mediated by the anchoring of specific DNA sequences to the nuclear matrix. These genome anchors are also crucial regulators of gene expression and DNA replication, and play a role in pathogenesis.
Members of the cupin superfamily exhibit large variations in their sequences, functions, organization of domains, quaternary associations and the nature of the bound metal ion, despite having a conserved β-barrel structural scaffold. Here, an attempt has been made to understand structure-function relationships among the members of this diverse superfamily and to identify the principles governing functional diversity. The cupin superfamily also contains proteins whose structures are available through world-wide structural genomics initiatives but which are characterized as “hypothetical”. We have explored the feasibility of obtaining clues to the functions of such proteins by means of comparative analysis with cupins of known structure and function.
A 3-D structure-based phylogenetic approach was undertaken. Interestingly, a dendrogram generated solely on the basis of structural dissimilarity measure at the level of domain folds was found to cluster functionally similar members. This clustering also reflects an independent evolution of the two domains in bicupins. Close examination of structural superposition of members across various functional clusters reveals structural variations in regions that not only form the active site pocket but are also involved in interaction with another domain in the same polypeptide or in the oligomer.
Structure-based phylogeny of cupins can influence identification of functions of proteins of yet unknown function with cupin fold. This approach can be extended to other proteins with a common fold that show high evolutionary divergence. This approach is expected to have an influence on the function annotation in structural genomics initiatives.
Cytokinesis requires duplication of cellular structures followed by bipolarization of the predivisional cell. As a common principle, this applies to prokaryotes as well as eukaryotes. With respect to eukaryotes, the discussion has focused mainly on Saccharomyces cerevisiae and on Schizosaccharomyces pombe. Escherichia coli and to a lesser extent Bacillus subtilis have been used as prokaryotic examples. To establish a bipolar cell, duplication of a eukaryotic origin of DNA replication as well as its genome is not sufficient. Duplication of the microtubule-organizing center is required as a prelude to mitosis, and it is here that the dynamic cytoskeleton with all its associated proteins comes to the fore. In prokaryotes, a cytoskeleton that pervades the cytoplasm appears to be absent. DNA replication and the concomitant DNA segregation seem to occur without help from extensive cytosolic supramacromolecular assemblies but with help from the elongating cellular envelope. Prokaryotic cytokinesis proceeds through a contracting ring, which has a roughly 100-fold-smaller circumference than its eukaryotic counterpart. Although the ring contains proteins that can be considered as predecessors of actin, tubulin, and microtubule-associated proteins, its macromolecular composition is essentially different.
Spatial organization of chromatin in the interphase nucleus plays a role in gene expression and inheritance. Although it appears not to be random, the principles of this organization are largely unknown. In this work, we show an explicit relationship between the intranuclear localization of various chromosome segments and the pattern of gene distribution along the genome sequence. Using a 7-megabase-long region of the Drosophila melanogaster chromosome 2 as a model, we observed that the six gene-poor chromosome segments identified in the region interact with components of the nuclear matrix to form a compact stable cluster. The six gene-rich segments form a spatially segregated unstable cluster dependent on nonmatrix nuclear proteins. The resulting composite structure formed by clusters of gene-rich and gene-poor regions is reproducible between the nuclei. We suggest that certain aspects of chromosome folding in interphase are predetermined and can be inferred through in silico analysis of chromosome sequence, using gene density profile as a manifestation of “folding code.”
Preterm delivery (PTD) is a significant public health problem associated with greater risk of mortality and morbidity in infants and mothers. Pathophysiologic processes that may lead to PTD start early in pregnancy. We investigated early pregnancy peripheral blood global gene expression and PTD risk.
As part of a prospective study, ribonucleic acid was extracted from blood samples (collected at 16 weeks gestational age) from 14 women who had PTD (cases) and 16 women who delivered at term (controls). Gene expression was measured using the GeneChip® Human Genome U133 Plus 2.0 Array. Student's t-test and fold change analysis were used to identify differentially expressed genes. We used hierarchical clustering and principal component analysis to characterize signature gene expression patterns among cases and controls. Pathway and promoter sequence analyses were used to investigate functions and functional relationships as well as regulatory regions of differentially expressed genes.
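As a rough illustration of combining a Student's t-test with a fold change filter, the sketch below uses only the Python standard library. It assumes expression values are already on a log2 scale, uses the pooled-variance (equal variance) form of the t statistic, and applies cutoffs (`t_cut`, `lfc_cut`) that are hypothetical; the thresholds actually used in the study are not given in the text.

```python
import math
from statistics import mean, variance


def student_t(a, b):
    """Two-sample Student's t statistic with pooled variance.

    Raises ZeroDivisionError if both groups have zero variance.
    """
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))


def log2_fold_change(a, b):
    """Difference of group means; inputs are assumed to be log2 intensities."""
    return mean(a) - mean(b)


def flag_genes(cases, controls, t_cut=2.0, lfc_cut=1.0):
    """Return genes passing both illustrative cutoffs (not the study's values)."""
    flagged = []
    for gene in sorted(cases):
        t = student_t(cases[gene], controls[gene])
        lfc = log2_fold_change(cases[gene], controls[gene])
        if abs(t) >= t_cut and abs(lfc) >= lfc_cut:
            flagged.append(gene)
    return flagged


# Toy usage with made-up gene names and log2 intensities.
cases = {"g1": [8.0, 8.2, 7.9], "g2": [5.0, 5.1, 4.9]}
controls = {"g1": [6.0, 6.1, 5.9], "g2": [5.0, 4.9, 5.1]}
flagged = flag_genes(cases, controls)
```

A real analysis would convert the statistic to a p-value and correct for multiple testing across all probes; both steps are omitted here to keep the sketch short.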
A total of 209 genes, including potential candidate genes (e.g. PTGDS, prostaglandin D2 synthase 21 kDa), were differentially expressed. A set of these genes achieved accurate pre-diagnostic separation of cases and controls. These genes participate in functions related to the immune system and inflammation, organ development, metabolism (lipid, carbohydrate and amino acid) and cell signaling. Binding sites of putative transcription factors such as EGR1 (early growth response 1), TFAP2A (transcription factor AP2A), Sp1 (specificity protein 1) and Sp3 (specificity protein 3) were overrepresented in promoter regions of differentially expressed genes. Real-time PCR confirmed microarray expression measurements of selected genes.
PTD is associated with maternal early pregnancy peripheral blood gene expression changes. Maternal early pregnancy peripheral blood gene expression patterns may be useful for better understanding of PTD pathophysiology and PTD risk prediction.
The spatial organization of chromosomes inside the cell nucleus is still poorly understood. This organization is guided by intra- and interchromosomal contacts and by interactions of specific chromosomal loci with relatively fixed nuclear “landmarks” such as the nuclear envelope and the nucleolus. New molecular genome-wide mapping techniques have begun to uncover both types of molecular interactions, providing insights into the fundamental principles of interphase chromosome folding.
Osteoclasts are the principal bone-resorbing cells. Precise control of balanced osteoclast activity is indispensable for bone homeostasis. Osteoclast activation mediated by the RANK-TRAF6 axis has been clearly identified. However, the machinery of negative regulation in osteoclasts remains unclear. TRAF family member-associated NF-κB activator (TANK) is induced about 10-fold during osteoclastogenesis, according to a genome-wide analysis of gene expression before and after osteoclast maturation, as confirmed by western blot and quantitative RT-PCR. Bone marrow macrophages (BMMs) transduced with lentivirus carrying Tank-shRNA were induced to form osteoclasts in the presence of RANKL and M-CSF. Tank expression was downregulated by 90% by Tank-shRNA, as confirmed by western blot. Compared with wild-type (WT) cells, osteoclastogenesis of Tank-silenced BMMs was increased, according to tartrate-resistant acid phosphatase (TRAP) staining on day 5 and day 7. The number of bone resorption pits formed by Tank-silenced osteoclasts was increased by 176% compared with WT cells, as shown by wheat germ agglutinin (WGA) staining and scanning electron microscopy (SEM) analysis. The survival rate of Tank-silenced mature osteoclasts was also increased. However, acid production of Tank-knockdown cells was not changed compared with control cells. IκBα phosphorylation was increased in Tank-silenced cells, indicating that TANK may negatively regulate NF-κB activity in osteoclasts. In conclusion, Tank, whose expression is increased during osteoclastogenesis, inhibits osteoclast formation, activity and survival by regulating NF-κB activity and c-FLIP expression. Tank thus participates in a negative feedback loop in bone resorption. These results may provide means for therapeutic intervention in diseases of excessive bone resorption.
TANK; RANKL; NF-κB; Osteoclast.
Metabolic and stoichiometric theories of ecology have provided broad complementary principles to understand ecosystem processes across different levels of biological organization. We tested several of their cornerstone hypotheses by measuring the nucleic acid (NA) and phosphorus (P) content of crustacean zooplankton species in 22 high mountain lakes (Sierra Nevada and the Pyrenees mountains, Spain). The P-allocation hypothesis (PAH) proposes that the genome size is smaller in cladocerans than in copepods as a result of selection for fast growth towards P-allocation from DNA to RNA under P limitation. Consistent with the PAH, the RNA:DNA ratio was >8-fold higher in cladocerans than in copepods, although ‘fast-growth’ cladocerans did not always exhibit higher RNA and lower DNA contents in comparison to ‘slow-growth’ copepods. We also showed strong associations among growth rate, RNA, and total P content supporting the growth rate hypothesis, which predicts that fast-growing organisms have high P content because of the preferential allocation to P-rich ribosomal RNA. In addition, we found that ontogenetic variability in NA content of the copepod Mixodiaptomus laciniatus (intra- and interstage variability) was comparable to the interspecific variability across other zooplankton species. Further, according to the metabolic theory of ecology, temperature should enhance growth rate and hence RNA demands. RNA content in zooplankton was correlated with temperature, but the relationships were nutrient-dependent, with a positive correlation in nutrient-rich ecosystems and a negative one in those with scarce nutrients. Overall our results illustrate the mechanistic connections among organismal NA content, growth rate, nutrients and temperature, contributing to the conceptual unification of metabolic and stoichiometric theories.
This article describes a simple and inexpensive hands-on simulation of protein folding suitable for use in large lecture classes. This activity uses a minimum of parts, tools, and skill to simulate some of the fundamental principles of protein folding. The major concepts targeted are that proteins begin as linear polypeptides and fold to three-dimensional structures, noncovalent interactions drive this folding process, and the final folded shape of a protein depends on its amino acid sequence. At the start of the activity, students are given pieces of insulated wire from which they each construct and fold their own polypeptide. This activity was evaluated in three ways. A random sample of student-generated polypeptides collected after the activity shows that most students were able to create an appropriate structure. After this activity, students (n = 154) completed an open-ended survey. Their responses showed that more than three-quarters of the students learned one or more of the core concepts being demonstrated. Finally, a follow-up survey was conducted seven weeks after the activity; responses to this survey (n = 63) showed that a similar fraction of students still retained these key concepts. This activity should be useful in large introductory-level college biology or biochemistry lectures.
Protein function is generated and maintained by the proteostasis network (PN) (Balch et al. (2008) Science, 319:916). The PN is a modular, yet integrated system unique to each cell type that is sensitive to signaling pathways that direct development and aging, and respond to folding stress. Mismanagement of protein folding and function triggered by genetic, epigenetic, and environmental causes poses a major challenge to human health and lifespan. Herein, we address the impact of proteostasis defined by the FoldFx model on our understanding of protein folding and function in biology. FoldFx describes how general proteostasis control (GPC) enables the polypeptide chain sequence to achieve functional balance in the context of the cellular proteome. By linking together the chemical and energetic properties of the protein fold with the composition of the PN we discuss the principle of the proteostasis boundary (PB) as a key component of GPC. The curved surface of the PB observed in 3-dimensional space suggests that the polypeptide chain sequence and the PN operate as an evolutionarily conserved functional unit to generate and sustain protein dynamics required for biology. Modeling general proteostasis provides a rational basis for tackling some of the most challenging diseases facing mankind in the 21st century.
In this review, we give an overview of recent literature on the structure and stability of unimolecular G-rich quadruplex structures that are relevant to drug design and for in vivo function. The unifying theme in this review is energetics. The thermodynamic stability of quadruplexes has not been studied in the same detail as DNA and RNA duplexes, and there are important differences in the balance of forces between these classes of folded oligonucleotides. We provide an overview of the principles of stability and, where available, the experimental data that report on these principles. Significant gaps in the literature have been identified; these should be filled by a systematic study of well-defined quadruplexes, to provide a basic understanding of stability both for design purposes and as it relates to the in vivo occurrence of quadruplexes. Techniques that are commonly applied to the determination of structure, stability and folding are discussed in terms of information content and limitations. Quadruplex structures fold and unfold comparatively slowly, and DNA unwinding events associated with transcription and replication may be operating far from equilibrium. The kinetics of formation and resolution of quadruplexes, and the associated methodologies, are discussed in the context of stability and their possible biological occurrence.
We provide an overview of lipid-dependent polytopic membrane protein folding and topogenesis. Lipid dependence of this process was determined by employing Escherichia coli cells in which specific lipids can be eliminated, substituted, tightly titrated or controlled temporally during membrane protein synthesis and assembly. The secondary transport protein lactose permease (LacY) was used to establish general principles underlying the molecular basis of lipid-dependent effects on protein domain folding, protein transmembrane domain (TM) orientation, and function. These principles were then extended to several other secondary transport proteins of E. coli. The methods used to follow proper conformational organization of protein domains and the topological organization of protein TMs in whole cells and membranes are described. The proper folding of an extramembrane domain of LacY that is crucial for energy-dependent uphill transport function depends on specific lipids acting as non-protein molecular chaperones. Correct TM topogenesis is dependent on charge interactions between the cytoplasmic surface of membrane proteins and a proper balance of the membrane surface net charge defined by the lipid head groups. Short-range interactions between the nascent protein chain and the translocon are necessary but not sufficient for establishment of final topology. After release from the translocon, short-range interactions between lipid head groups and the nascent protein chain, partitioning of protein hydrophobic domains into the membrane bilayer, and long-range interactions within the protein thermodynamically drive final membrane protein organization.
Given the diversity of membrane lipid compositions throughout nature, it is tempting to speculate that during the course of evolution the physical and chemical properties of proteins and lipids have co-evolved in the context of the lipid environment of membrane systems, in which both mutually depend on each other for the functional organization of proteins.
phosphatidylethanolamine; lactose permease; protein topology; lipochaperone; positive-inside rule
The recognition of functional binding sites in genomic DNA remains one of the fundamental challenges of genome research. During the last decades, a plethora of different and well-adapted models has been developed, but only little attention has been paid to the development of different and similarly well-adapted learning principles. Only recently was it noticed that discriminative learning principles can be superior to generative ones in diverse bioinformatics applications, too.
Here, we propose a generalization of generative and discriminative learning principles containing the maximum likelihood, maximum a posteriori, maximum conditional likelihood, maximum supervised posterior, generative-discriminative trade-off, and penalized generative-discriminative trade-off learning principles as special cases, and we illustrate its efficacy for the recognition of vertebrate transcription factor binding sites.
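One way such a generalization could be organized is as a weighted sum of the conditional log-likelihood, the joint log-likelihood and the log prior, with non-negative weights that sum to one. The sketch below follows that assumed form; the weighting scheme, the weight triples and the function name are illustrative and are not taken from the text or from the Jstacs API.

```python
# Assumed weight triples: (conditional likelihood, joint likelihood, prior).
# Rescaling an objective by a positive constant does not change its maximizer,
# so (0, 0.5, 0.5) has the same optimum as log-likelihood + log-prior.
SPECIAL_CASES = {
    "maximum likelihood": (0.0, 1.0, 0.0),
    "maximum a posteriori": (0.0, 0.5, 0.5),
    "maximum conditional likelihood": (1.0, 0.0, 0.0),
    "maximum supervised posterior": (0.5, 0.0, 0.5),
}


def unified_objective(cond_loglik, joint_loglik, log_prior, weights):
    """Weighted objective under the assumed three-term form.

    All three inputs are evaluated at the same parameter value; the
    weights must be non-negative and sum to one.
    """
    b0, b1, b2 = weights
    if min(weights) < 0 or abs(b0 + b1 + b2 - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return b0 * cond_loglik + b1 * joint_loglik + b2 * log_prior


# Toy usage with made-up log-scores for a single parameter setting.
ml_score = unified_objective(-1.0, -5.0, -2.0, SPECIAL_CASES["maximum likelihood"])
map_score = unified_objective(-1.0, -5.0, -2.0, SPECIAL_CASES["maximum a posteriori"])
```

Under this assumed form, the trade-off principles named above would correspond to weight triples in the interior of the simplex rather than at its corners or edges.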
We find that the proposed learning principle helps to improve the recognition of transcription factor binding sites, enabling better computational approaches for extracting as much information as possible from valuable wet-lab data. We make all implementations available in the open-source library Jstacs so that this learning principle can be easily applied to other classification problems in the field of genome and epigenome analysis.
Computer generated trajectories can, in principle, reveal the folding pathways of a protein at atomic resolution and possibly suggest general and simple rules for predicting the folded structure of a given sequence. While such reversible folding trajectories can only be determined ab initio using all-atom transferable force-fields for a few small proteins, they can be determined for a large number of proteins using coarse-grained and structure-based force-fields, in which a known folded structure is by construction the absolute energy and free-energy minimum. Here we use a model of the fast folding helical λ-repressor protein to generate trajectories in which native and non-native states are in equilibrium and transitions are accurately sampled. Yet, representation of the free-energy surface, which underlies the thermodynamic and dynamic properties of the protein model, from such a trajectory remains a challenge. Projections over one or a small number of arbitrarily chosen progress variables often hide the most important features of such surfaces. The results unequivocally show that an unprojected representation of the free-energy surface provides important and unbiased information and allows a simple and meaningful description of many-dimensional, heterogeneous trajectories, providing new insight into the possible mechanisms of fast-folding proteins.
The process of protein folding is a complex transition from a disordered to an ordered state. Here, we simulate a specific fast-folding protein at the point at which the native and denatured states are at equilibrium and show that obtaining an accurate description of the mechanisms of folding and unfolding is far from trivial. Using simple quantities which quantify the degree of native order is, in the case of this protein, clearly misleading. We show that an unbiased representation of the free-energy surface can be obtained; using such a representation we are able to redesign the landscape and thus modify, upon site-specific “mutations”, the folding and unfolding rates. This leads us to formulate a hypothesis to explain the very fast folding of many proteins.
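For contrast with the unprojected representation discussed above, the sketch below shows the conventional projection it is compared against: a one-dimensional free-energy profile estimated from a trajectory as F(q) = -kT ln p(q), where p(q) is the histogram of a single progress variable q. The variable, the bin width and the units (kT = 1) are assumptions for illustration; this is not the authors' method.

```python
import math
from collections import Counter


def free_energy_profile(q_values, bin_width, kT=1.0):
    """Histogram-based free-energy profile along one progress variable.

    Returns a dict mapping each occupied bin's left edge to the free energy
    relative to the most populated bin. Empty bins are simply absent.
    """
    counts = Counter(math.floor(q / bin_width) for q in q_values)
    total = sum(counts.values())
    profile = {b * bin_width: -kT * math.log(n / total) for b, n in counts.items()}
    f_min = min(profile.values())
    return {q: f - f_min for q, f in sorted(profile.items())}


# Toy usage: made-up values of a progress variable sampled along a trajectory.
profile = free_energy_profile([0.1, 0.2, 0.3, 0.6], bin_width=0.5)
```

Distinct conformations that happen to share the same value of q fall into the same bin, which is one way such a projection can hide barriers and basins.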