|Home | About | Journals | Submit | Contact Us | Français|
Computational modeling is essential for structural characterization of biomolecular mechanisms across the broad spectrum of scales. Adequate understanding of biomolecular mechanisms inherently involves our ability to model them. Structural modeling of individual biomolecules and their interactions has been rapidly progressing. However, in terms of the broader picture, the focus is shifting toward larger systems, up to the level of a cell. Such modeling involves a more dynamic and realistic representation of the interactomes in vivo, in a crowded cellular environment, as well as membranes and membrane proteins, and other cellular components. Structural modeling of a cell complements computational approaches to cellular mechanisms based on differential equations, graph models, and other techniques to model biological networks, imaging data, etc. Structural modeling along with other computational and experimental approaches will provide a fundamental understanding of life at the molecular level and lead to important applications to biology and medicine. A cross section of diverse approaches presented in this review illustrates the developing shift from the structural modeling of individual molecules to that of cell biology. Studies in several related areas are covered: biological networks; automated construction of three-dimensional cell models using experimental data; modeling of protein complexes; prediction of non-specific and transient protein interactions; thermodynamic and kinetic effects of crowding; cellular membrane modeling; and modeling of chromosomes. The review presents an expert opinion on the current state-of-the-art in these various aspects of structural modeling in cellular biology, and the prospects of future developments in this emerging field.
Structural characterization of biomolecular mechanisms across a broad spectrum of scales is key to our understanding of life at the molecular level. Along with experimental techniques, computational modeling is an essential part of this characterization, as a source of structural information and the means of predicting new, experimentally unobserved/unobservable phenomena. An adequate understanding of biomolecular mechanisms inherently involves our ability to model them.
Structural modeling of individual biomolecules and their interactions has been rapidly progressing [1, 2], with many challenges to be addressed in the coming years. However, in large part due to this progress, in terms of a broader picture, the focus is inevitably shifting toward larger systems, up to the level of a cell. Such modeling should involve a more dynamic and realistic representation of the interactomes in vivo, in a crowded cellular environment, as well as of membranes, membrane proteins, and other cellular components. The atomistic modeling methodology requires vigorous development. The efforts in structural modeling of a cell do not negate the need for this development. To the contrary - they will spur it, and expand its scope to larger and more heterogeneous systems.
Whole cell modeling is “the grand challenge of the 21st century” . Specifically it is important for a variety of reasons, including integration of heterogeneous datasets into a unified representation of knowledge about a given organism, prediction of complex multi-network phenotypes, identification of gaps in our knowledge of cellular processes, and development of our ability to modulate them .
Emerging experimental techniques, such as femtosecond crystallography with X-ray free-electron lasers, small angle x-ray scattering, and advances in widely adopted methods such as high-resolution cryoelectron microscopy [5-10] provide new data and experimental validation for the modeling. Great examples of joint experimental and computational techniques are rapidly developing approaches to identification of biological assemblies and construction of large protein complexes with hybrid methods, such as integrative modeling .
At this point, molecular and cellular modelers use substantially different approaches and, in fact, speak largely different “scientific languages.” Modeling of structures in molecular biology usually means predicting the structure or simulating the folding of a protein, or modeling the interactions between two isolated molecules. It is usually assumed that folding or binding occurs in dilute solutions, so the only environmental concerns are modeling the effect of water and possibly ionic strength. While these calculations are far from easy, and require substantive understanding of biophysics, sophisticated simulation software and significant computational efforts, modeling in cellular biology generally deals with much more complex systems. Accordingly, simulations at the cellular level usually require substantial coarse-graining and simplifications, and existing models of “virtual cell” are largely based on differential equations, imaging data, and other integrative approaches [12, 13]. It is clear that closing the gap between such different levels of approximation will require significant effort, and a number of groups with backgrounds in structural modeling at the molecular level have made progress toward the development of multi-scale approaches to introduce a higher degree of structural information into the modeling of cells. Some of these approaches are coarse-grained and some are atomic resolution, but if put together would potentially provide an integral and self-consistent model of the whole system.
A cross section of diverse approaches presented in this review illustrates the developing shift from the structural modeling of individual molecules to that of cell biology. Studies in several related areas are covered: biological networks; automated construction of three-dimensional (3D) molecular cell models using experimental data; structural modeling of protein complexes; prediction of the non-specific and transient protein-protein interactions that occur in the crowded environment of the cell; atomistic modeling of the thermodynamic and kinetic effects of crowding on folding and binding of macromolecules; all-atom cellular membrane modeling and simulation; and modeling of chromosomes.
This review originated from a discussion at the 2014 meeting on Modeling of Protein Interactions (http://conferences.compbio.ku.edu), and presents an expert opinion on the current state-of-the-art in various aspects of structural modeling in cellular biology, and the prospects of future developments in this emerging field.
Structures of proteins, nucleic acids, and their complexes have provided foundational knowledge to gain insight into the molecular mechanisms of biological processes . An overall understanding of complex cellular phenotypes, however, requires considering multiple molecular species and their interactions. There are growing interests in understanding the nature of protein-protein and protein-nucleic acid interactions, as well as in computational docking studies . Since many different species of molecules participate in determining the outcome of cellular processes, the discovery of relevant molecular players, their post-translational modifications, and the formation of networks for gene regulation and signal transduction have been the focus of many experimental investigations.
There are a large number of resources where experimental knowledge and computational tools of networks are organized and curated. The Systems Biology Markup Language (SBML) provides an open interchange format for modeling metabolic networks, cell signaling networks, and other biological processes . The Biological Pathway Exchange BioPAX provides a standard language to facilitate integration, exchange, visualization and analysis of biological pathway data . The KEGG (Kyoto Encyclopedia of Genes and Genomes) database contains rich information on genomes, biological pathways, diseases, drugs, and chemical substances . The BioModels is a repository of computational models of biological processes curated from literature and enriched with cross-references, providing valuable resources for studying behavior of metabolic and signal transduction networks .
Developing quantitative models to account for experimental facts and to predict emergent biological behaviors are key to gaining mechanistic understanding of cellular processes. As an example, much has been learned from quantitative models of the lysogeny-lysis decision network based on classic studies of phage lambda, including understanding of system stability against perturbations, robustness against genetic mutations, regulation of cellular fate and rare-event transitions, and as network architectural determinants for heritable epigenetic state [20-22]. Studies based on quantitative models of network systems biology integrating many cellular components will continue to contribute to our understanding of broad biological questions such as stem cell differentiations [23-27] and cancer development [28, 29].
There exists a hierarchy of modeling frameworks for studying biological networks. These include graph models , Boolean networks , ordinary differential equations (ODE) , stochastic differential equations (SDE) , and chemical master equations (CME) [34-38], in increasing accuracy but also complexity. There are several important considerations in modeling complex biological phenomena. First, we need to gather sufficient and unambiguous biological facts to construct an insightful and appropriate biological network model. This often requires in-depth biological knowledge and can benefit from large amount of data from the convergence of modern high-throughput measuring techniques. Second, we need to obtain a model at the appropriate level of details. Different choice of ODE models, stochastic differential equation model, or the chemical master equation model may lead to different conclusion [38, 39]. At high concentrations such as those found in a metabolic network, an ODE model is preferred, since detailed stochastic models limit the scale of problems that can be examined, with no additional advantages. However, at low concentrations as in gene regulatory networks and signal transduction networks, the copy numbers of involved molecules may be very small (e.g., nM concentration), and stochasticity often plays an important role . At this level, the choice of SDE (Langevin or Fokker-Planck) or CME formulation may be required. As these different model choices may yield different results, a challenging issue is to develop hybrid models and to determine when a particular modeling formalism is appropriate and when an alternative approach is necessary, e.g. when will an ODE model break down and an SDE model be used, and when will an SDE model break down and a CME model be used. Third, describing the complex geometry of the cellular environment may also be necessary for the modeling of transport and communication among its various spatial regions and compartments , requiring the construction of a more realistic “virtual cell” and the use of partial differential equations (PDE). Fourth, we need to ensure that computational methods and algorithms can yield the correct solutions, or at least we should be aware of the limitation of the algorithms and recognize possible errors in the computational answers. Correct computational results may not be found even though the problem is formulated correctly. This is especially relevant for problems where stochasticity is significant and when sampling methods are employed. For example, the widely used method of Stochastic Simulation Algorithm does not work well in studying rare events important in many biological phenomena [41-43]. Recent progress at the most detailed level in finding an exact solution of the probability landscape governed by the chemical master equation, in developing optimal method for state enumeration, and in formulating a theoretical framework for a priori estimation of truncation error of finite state space, and in biased Monte Carlo sampling of reaction trajectories for rare events have shown promise in resolving these issues [37, 38, 41-47].
Developing quantitative models requires knowledge of reaction rates and binding constants as parameters. Whether it is a large comprehensive network or a minimalistic network most germane to the question at hand, the availability and validity of model parameters are a challenging issue, as it is unrealistic to expect to have in vivo measurement of every model parameter. The study of the epigenetic circuit of phage lambda showed that introduction of modest protein-protein cooperative interaction of CI-dimers can lead to desired probabilistic landscape of deep-threshold and efficient switch , illustrating the importance of protein-protein binding in regulating network phenotypes. Studies of computational biophysics on binding affinities and reaction rates therefore are of growing importance for developing effective models of systems biology [48-51]. However, there is a gap between single valued rate and binding affinity parameters and the rich information contained in ensembles of interacting protein-protein or protein-nucleic acid complexes examined in studies such as protein docking. Identification of potential interactions and binding partners from experimental and computational structural biology studies of protein-protein complexes will provide valuable information for improving network models of cellular processes. How biophysical studies of protein stability and binding interactions can inform developing systems biology models to go beyond single parameter and homogeneous systems remains open and progress will likely be fruitful [48-51].
The cell is a hierarchy of structures that span from atoms to organelles, all of which interact in an intricate choreography with tempos that range from femtoseconds to hours. The biological mesoscale range includes biological structures from 10 to 100 nanometers. Structures of this size include viruses, cellular organelles, large molecular complexes, and any other internal cellular environments within that range. The mesoscale is important because it represents the scale of cellular systems that is not fully accessible to a single experimental technique.
Structural data is now available at a wide range of length scales – from atomic resolution structures of cellular protein and nucleic acid components to organelle and larger cellular structures. Biophysical techniques range from atomic resolution X-ray crystallography and NMR spectroscopy, to electron and light microscopy. In addition, spatial distributions and dynamics are accessible by a variety of fluorescence microscopy methods, and expression and concentration levels are obtainable via technologies ranging from chip arrays and other mRNA technologies to mass spectrometry and other proteomic analyses.
Over the past several years there have been a number of efforts to build complete structural models of cellular environments at molecular detail. This type of work has typically focused on a particular portion of a cell, for example E. coli cytoplasm , M. genitalium cytoplasm , bacterial division machinery , synaptic vesicles , and an entire synaptic bouton . Because of the size and complexity of cellular structure, there are numerous challenges that must be faced before building a structural model of a complete cell becomes a reality. Among these challenges are: 1) development of a model building framework that can unify the various cellular components at multiple scales; 2) the implementation of accelerated computation through parallelization and custom hardware solutions; 3) the data analysis and visualization software capable of handling large complex models; 4) the development of metrics to quantify and validate the models; and 5) the development of communities and collaborations to be able to approach such large and complex modeling tasks, and to continually improve and curate the models.
Here we focus on cellPACK [57, 58] which has been developed as a computational framework that attempts to address some of these challenges. The cellPACK software uses structural and distribution data for a given mesoscale environment gathered from different experimental methods and automatically synthesizes one or many 3D models that are statistically consistent with all of this available information. For a given cellular or subcellular structure, the geometry of the large components such as organelles or intact virions seen with electron microscopy can define specific volumes and surfaces to fill with the smaller molecular entities. Since the locations of the contents of these larger components are constantly changing, cellPACK uses statistical measures to place these molecular components into the compartmental volumes and membrane surfaces. Thus, a filled model is one snapshot of many possible fills.
cellPACK uses distance field grid to discretize and describe a volume, enabling multiple modular packing algorithms to interoperate on the same model and can combine several complex packing algorithms to integrate three different major localization modes – volumetric, surface, and procedural – into unified models. It has numerous modules for cell/molecule-specific packing. In the resultant model, each molecular object retains a connection to various other forms of data to enable deeper analysis, in preparation for systems integration or large-scale simulations, or for modifications of the molecule’s representations.
To date, cellPACK has been used to generate models of blood plasma, the immature and mature HIV virion, the packing of synaptic vesicles and a preliminary model of a mycoplasma (Fig. 1). These models contain thousands to 10s of thousands of individual biomolecules in the context of cellular environments. Prior to cellPACK, such models had to be built by hand taking weeks or months, and presented a serious bottleneck in preparing the starting conditions for input to large-scale simulations such as Brownian dynamics [52, 59, 60]. With the automated procedures in cellPACK, such models can be produced in minutes, making possible the construction of a large ensemble of models each different in detail, but each consistent with the input experimental data. This enables the possibility to run many parallel simulations, each with a different initial model. Additionally, cellPACK enables the exploration of different structural hypotheses, creating models that can be compared with experimental observation.
Frameworks such as cellPACK will enable the structural modeling community to create and share models of complex molecular environments and make possible new analyses and simulations of these environments (see http://cellpack.org).
Protein-protein interactions are central for cellular processes. Experimental approaches to determining the interaction networks have limited reliability [61, 62]. Thus computational prediction of interactors is important . For proper training and validation of such approaches one needs representative databases of interacting proteins , as well as those that do not interact . Structural characterization of proteins is essential for understanding molecular processes in the cell. However, only a fraction of known proteins have experimentally determined structures. That fraction is even smaller for protein-protein complexes. Thus, modeling is key to their structural determination [66-68].
An important insight into the basic rules of protein recognition is provided by the studies of large-scale structural recognition factors in macromolecular assemblies , and binding-related anisotropy of protein shape [70, 71]. Such factors in protein association have to do with the funnel-like intermolecular energy landscape . It has been shown that simple energy functions, including coarse-grained (low-resolution) models, reveal major landscape characteristics, such as the number and distribution of the funnel-like energy basins, transition between low and high resolution, and funnel size . The intermolecular energy landscapes are further characterized by conformational properties of interacting proteins [74-76].
The docking degrees of freedom involve six external degrees of the rigid body movement (3 translation coordinates and 3 angles of rotation), as well as internal degrees of freedom, which determine the conformation of the proteins. To make the number of the internal degrees of freedom manageable, approximations are essential. The rigid-body approximation leaves only the external degrees of freedom and approximates internal degrees of freedom by making the proteins soft and thus tolerant to local structural mismatches. The rigid-body approximation is adequate for bound docking (separated proteins from co-crystallized complexes), low-resolution unbound docking (structures determined outside of the complex), as well as in some cases of high-resolution unbound docking. However, in general, for the atomic resolution unbound docking, some form of conformational sampling is required. For most crystallographically determined complexes, the unbound to bound conformational change is largely restricted to the surface side chains , thus drastically limiting the combinatorics of the conformational search. Protein docking approaches are extensively evaluated in the community-wide experiment on Critical Assessment of Predicted Interactions (CAPRI) , and in numerous studies based on benchmarking sets (e.g. [77, 78]). Protein docking procedures were also shown to be successful in packing protein structural motifs  and predicting complexes of membrane proteins .
The coarse-graining of protein structures allows exploration of structural dynamics of large-scale (microseconds or longer) processes [81, 82]. It also allows comparison with low-resolution experimental data, which often is the only available structural information on the system . Coarse-grained elastic networks modeling of structure fluctuations showed that, on average, the interface is more rigid than the rest of the protein surface [84, 85], and the interface mobility is correlated with the interface type, size and obligate nature of the complex . In structural modeling of protein-protein complexes, the coarse-graining approaches are used to model structural flexibility in protein assembly [75, 81, 86, 87]. Low-resolution allows implicit accounting for local conformational flexibility without sampling the internal degrees of freedom, and thus is useful in docking [88, 89].
The number of experimentally determined protein structures accounts only for a fraction of known proteins. Thus docking often has to rely on the modeled structures of the interactors, especially in the case of large protein-protein interaction (PPI) networks. Structures of modeled proteins are typically less accurate than the ones determined by X-ray crystallography or NMR. The goal of the modeling should determine the accuracy of the models. The accuracy of the output complex cannot be higher than the accuracy of the input structures. Thus the necessary level of structural accuracy of the complex determines the required accuracy of the modeling of the individual proteins. The question then is: what is that necessary level of structural accuracy for protein complexes? In protein-protein interactions, many experimental (and theoretical) studies require simple knowledge of the residues at the interfaces (e.g. for further experimental analysis) and have no use for atomic resolution structural details of the complex (specific atom-atom, or even residue-residue contacts across the interface). The same is true for small ligand – protein docking, when the goal is identification of the binding/functional site on the protein. For the interface (binding, functional site) prediction, the high-resolution protein structures, generally, are not needed [90-92]. That has been extensively demonstrated by systematic studies over a number of years [66, 89]. However, when a high-resolution structure of the complex is required, in protein-protein interactions (e.g. for estimation of the binding affinity) or in small ligand – protein docking (e.g. for identification of specific ligands), higher accuracy protein models are needed.
High-throughput modeling for entire genomes requires a computationally tractable methodology. A statistical analysis of target-template sequence alignments for systematic evaluation of potential accuracy in high-throughput modeling of binding sites was performed on a representative set of protein complexes . The modeling was performed in a high-throughput fashion based on standard sequence alignment and comparative modeling, as opposed to more detailed and sophisticated (but also more computationally expensive) multi-template procedures. Overall, ~50% of protein pairs with the interfaces modeled by high-throughput techniques had accuracy suitable for structural modeling of their complexes (Fig. 2a).
Although structural modeling of protein complexes primarily has to rely on modeled structures of the individual proteins, such “double” modeling remains so far largely untested in a systematic way, largely due to the absence of an adequate benchmark set that would contain protein structures with accuracy levels according to a full array of pre-defined root-mean-square-deviation (RMSD) values. Such sets were generated based on crystallographically determined complexes from the Dockground resource [78, 94, 95]. A comprehensive benchmarking of template-based docking by structure alignment  and free docking [97, 98] techniques was performed on a set of 165×6 model protein structures with accuracy levels at 1 - 6 Å Cα RMSD. The results (Anishchenko et al. submitted) show that many docking models fall into acceptable quality category, according to the CAPRI Challenge  criteria, even for highly distorted models (Fig. 2b). The template-based methodology is less sensitive to the inaccuracies of protein models compared to the free docking. However, both can be applied to the structural modeling of the protein interactome.
Proteome-scale modeling of PPI networks [99-102] is essential for modeling of a cell. Templates are available for a significant part of soluble proteins in genomes , including those in known PPIs . The approaches to genome-wide structural modeling of PPIs are either “traditional” template-free docking [105, 106] or the template-based docking [63, 104, 107-111]. The latter, while potentially providing much greater success rate , critically depends on the availability of the templates [63, 104, 108, 109]. The X-ray structures of the proteins were complemented by homology models and the templates for their complexes were detected in PDB . Figure 2c shows the results for five genomes with the largest number of known PPIs. Structural alignments yielded a dramatic increase in the structural coverage of complexes, from the coverage provided by the sequence alignment. The structural templates were found for nearly all (33,537 out of 33,840, or 99%) complexes in which both components could be built. Thus, the limiting factor in interactome modeling is actually the availability of the templates for the individual proteins (more protein-protein templates are still needed for greater accuracy of modeling). Still the free docking is necessary, and its importance growing, for many protein encounters in the crowded cell environment, which are not likely to correspond to energetically stable co-crystallized templates.
The challenge in the development of protein docking methodology is to adequately incorporate internal degrees of freedom into the docking protocols. This includes structural flexibility of the interacting proteins, especially in case of significant conformational changes upon binding, as well as structural inaccuracies of the proteins, especially models. Another grand challenge is to understand and simulate the environment in which proteins interact in vivo. This environment is densely populated, which strongly affects protein diffusion, binding and conformational transitions. For large-scale structural modeling of PPI networks, such approaches have to be high-throughput, taking advantage of new algorithms and hardware resources.
There is growing recognition that the cellular context and the cellular environment have fundamental influences on biochemical processes [112, 113]. Missing in typical in vitro biophysical studies done in dilute solution are the many “bystander” macromolecules, which have considerable consequences on the biomolecules of direct interest.
The many bystander macromolecules together occupy a high fraction of the volume of the given cellular compartment. An early focus was on how this condition, known as macromolecular crowding, impacts thermodynamic and kinetic properties of protein folding, binding, and aggregation. Of particular note are in vitro experiments in which crowding agents are added to mimic bystander macromolecules in cellular compartments [114-116]. Experimental and computational studies have now converged on the conclusion that effects of macromolecular crowding are relatively modest, on the order of 0.5 kcal/mol in energetic terms, for the folding and binding of single-domain proteins, and become progressively greater as the sizes of the reactant species increase, and reach striking magnitudes for protein aggregation [52, 112, 117].
Protein folding under macromolecular crowding has been modeled in two complementary approaches. In the direct simulation approach, one mixes the protein of interest with crowders (i.e., bystander macromolecules), similar to the in vitro experiments with crowding agents. To adequately sample the folding and unfolding transitions while also simulating the movements of the crowders, it was necessary to use coarse-grained representations for the protein and crowders , although some aspects of folding have been studied using an all-atom representation . In the alternative approach [48, 52], now known as postprocessing , one runs simulations of the crowders by themselves and runs separate simulations of the protein at end states, e.g., the folded and unfolded states. In this way, one avoids the expensive simulations of the rare transitions between the end states. One then computes the transfer free energies of the protein in the end states from a dilute solution into the crowder solution.
The basis for the transfer free energy calculations was provided by Widom’s particle insertion method . A brute-force implementation turned out to incur “very significant computational expense” . Recognizing that this problem has much similarity to the docking of a ligand to a protein and the use of the fast Fourier transform (FFT) technique in the latter problem [97, 122], FMAP (FFT-based method for Modeling Atomistic Proteins-crowder interactions) was developed for computing the transfer free energies [123, 124].
To model subcellular problems, a reasonable (perhaps necessary) choice is to represent water (and other small molecules of the solvent) implicitly. A number of groups have carried out simulations of subcellular compartments, modeled with implicit solvent [52, 59, 60]. A focus of these simulation studies is the diffusion coefficient of a tracer protein in these crowded environments. Then one faces the problem of parameterizing the effective protein-crowder interaction energies. These interactions contain hard-core strong repulsion and longer-distance weak attraction or repulsion. The soft attraction can lead to protein-crowder weak association. Parameterizing energy functions is of course not a new problem, but data from in vitro experiments with crowding agents can be very useful. For example, the second virial coefficient in the expansion of the osmotic pressure in terms of macromolecular concentration contains rich information on intermolecular interactions and can be easily measured by techniques such as static light scattering .
Injecting new interest into modeling cellular context and the cellular environment are experimental studies demonstrating emergent behaviors of proteins and nucleic acids under crowded conditions. The first is the nonrandom nature of protein-crowder weak association. In particular, some proteins were found to associate with specific cellular targets. For example, the neural protein tau when injected into X. laevis oocytes binds to microtubules . In E. coli, the MetJ repressor forms extensive nonspecific interactions with genomic DNA . In other cases, there is evidence implicating a specific site of a protein in the nonspecific interactions. Pin1 uses the substrate recognition site for nonspecific interactions. Nonspecific interactions are apparently abrogated when either the substrate recognition site is phosphorylated or a substrate is bound . Similarly, the maltose binding protein (MBP) forms nonspecific interactions with proteins and synthetic polymers, but this ability is weakened or lost when maltose is bound  (Fig. 3). FMAP-enabled calculations are capturing the nonrandom nature of the weak association (Qin and Zhou, to be published).
The weak nonspecific association with bystander macromolecules often can be inferred to impart biological function. For example, the binding of tau to microtubules is thought to be important for the latter’s stability. Nonspecific binding of the MetJ repressor to genomic DNA may facilitate the search for a specific site. Nonspecific association with endogenous proteins via the substrate recognition site may be the mechanism for subcellular localization. For MBP, it has been proposed that nonspecific association with the outer membrane-attached peptidoglycan primes the protein for receiving maltose; binding of maltose releases the protein, allowing it to diffuse to the inner membrane-bound ABC transporter and hand over the maltose for translocation into the cytoplasm  (Fig. 3).
It is remarkable that nonspecific association can be tuned out by phosphorylation or substrate binding , or by ligand binding . Calmodulin gains nonspecific interactions upon binding Ca2+ but loses this ability again upon further binding a substrate peptide . Apparently, nonspecific association can be regulated by some of the same mechanisms, e.g., phosphorylation or ligand or substrate binding, as for specific association.
Another emergent behavior is the formation of mesoscale cellular structures. The cytoskeleton offers a prime example, but other subcellular organizations are being recognized as well. In particular, it is now well known that enzymes in the same metabolic pathway are co-localized , possibly to facilitate substrate channeling between successive enzymes.
Perhaps the most exciting emergent behavior is liquid-liquid phase separation, between the protein-poor cytoplasm and the protein-rich cellular bodies. These bodies, commonly referred to as droplets, are membrane-less organelles and are implicated in many cellular functions, such as for protein or RNA storage [132-136]. Interestingly, phase separation has been achieved in vitro using reconstituted or even designed components . While this protein-rich phase is liquid-like, other proteins can form crystalline assemblies; and macromolecular crowding can drive their formation . Modeling such as that enabled by FMAP (Qin and Zhou, submitted) and other techniques and in vitro experiments mimicking cellular conditions will allow us to reach quantitative understanding of all these emergent behaviors in the cellular context.
To recognize a specific partner, a protein must align its binding interface, usually a small fraction of the total surface, with a similarly small binding interface on the other protein. The goal of docking methods is to identify this specific association as the global minimum of a free energy landscape. However, nonspecific interactions among macromolecules are also important, particularly in a crowded environment of a cell, since the high frequency of such encounters can substantially affect the stability of the equilibrium state [48, 52, 112, 113, 139-141]. Indeed, it was shown in crowding experiments that the energetics of interactions with crowders impacts the formation of specific complexes and of non-native aggregates beyond simple excluded volume effects [140, 141]. Since global docking methods systematically sample the entire conformational space of protein-protein complexes, in principle such methods can be used to study both specific and non-specific associations.
The first step toward modeling nonspecific association is the analysis of encounter complexes . An encounter complex can be thought of as an ensemble of transition states in which the two molecules can rotationally diffuse along each other, or participate in a series of “microcollisions.” A particular type of encounter complex, a late near-native intermediate, referred to as the transient complex, is a key concept in modeling the kinetics of specific association and predicting the association rate constant [49, 143]. A well-studied example of non-specific association is the N-terminal domain of Enzyme I (EIN) and the histidine-containing phosphocarrier protein (HPr) [142, 144]. For a computational study of this interaction we systematically sampled the relative orientations of the two molecules. Fig. 4a shows the interface root mean square deviation (IRMSD) from the native EIN/HPr complex versus the interaction energy score of the docked structures, and reveals 3 large clusters. We note that the interaction energy is given in kcal/mol units, but it does not account either for any entropy loss or for the desolvation of the component proteins, and hence has no absolute thermodynamic meaning. In fact, it was shown that the probability of each cluster of low energy docked structures is proportional to the relative population of the cluster [145, 146], and hence one can use cluster size rather than energy values for selecting putative complex models.
The structures in the largest cluster (Cluster 1, shown in blue in Fig. 4b) overlap with the native state. The structures in this cluster, akin to the aforementioned transient complex, are the results of rigid body rotations and small translations around the native binding mode. The two other clusters (red and magenta in Figures 4b and 4c) consist of structures that can coexist with the native complex. The existence of the three clusters was experimentally verified using NMR paramagnetic relaxation enhancement (PRE), a technique that is exquisitely sensitive to the presence of lowly populated states in the fast exchange regime [144, 147, 148], indicating that these docked structure have physical meaning and represent encounter complexes.
The specific association between EIN and HPr has an equilibrium dissociation constant of 7 μM , whereas the KD value of encounter complexes may be as high as 10 mM . In spite of the large difference in binding affinity, the existence of transitional nonspecific association has biological implications. It is estimated that under cellular conditions at least 1% of HPr exists in form of a tertiary complex HPrnonspecific /EIN/HPr, in which the native EIN/HPr complex nonspecifically binds an additional HPr molecule . The formation of transient HPrnonspecific/EIN/HPr ternary complexes may help EIN compete for the cellular pool of HPr. Intracellular overcrowding and compartmentalization may favor the ternary complex further, possibly making these nonspecific interactions even more important for enhancing enzymatic turnover in vivo .
A large fraction of protein pairs that are present in the cell do not form specific complexes with any measurable affinity, as demonstrated by the Negatome database . Nevertheless, some level of nonspecific association always occurs at high protein concentrations . As an example, Fig. 5a shows the energy landscape of the interaction between two HPr monomers that are known not to form a specific stable homodimer. Since there is no native structure in this case, the IRMSD is calculated from an arbitrary structure in the lowest energy region. In contrast to the specific association between EIN and HPr, resulting in a deep and broad minimum (Fig. 4a), the nonspecific binding results in a higher number of minima with comparable energies, some of them > 50 Å from each other. The small blue spheres in Fig. 5b represent the centers of low energy docked structures of the second HPr molecule, and show that the distribution covers a large fraction of the surface. However, the energy-IRMSD plot still shows some large low energy clusters, indicating that some docked structures are more likely than some others, supporting the nonrandom nature of weak association noted above. Indeed, protein aggregation generally does not lead to the formation of entirely amorphous globules, but to the occurrence of some highly preferred interactions between monomeric proteins, or interactions that differ from those seen in the native ones if the protein tends to form a complex.
The biomedical importance of nonspecific interactions is due to the fact that neurodegenerative diseases such as Alzheimer’s disease, Parkinson’s disease, Huntington’s disease, amyotrophic lateral sclerosis and prion diseases appear to have common cellular and molecular mechanisms including protein aggregation . The aggregates usually consist of fibers containing misfolded protein with a β-sheet conformation, termed amyloid. The likelihood of aggregation is generally increased by increasing protein concentration, which can be caused by genetic dosage alterations. In the case of protein-coding mutations, the altered primary structure can also make the protein more prone to aggregate. Another important factor modulating aggregation is covalent modification, particularly phosphorylation. For example, α-synuclein purified from Lewy bodies in Parkinson’s disease patients is extensively phosphorylated . Some of the factors modulating the interactions have already been discussed in the context of crowding.
In spite of its well-recognized importance, modeling aggregation is challenging, and substantial methodology development is needed. While hydrophobic patches signal aggregation prone regions of proteins , no reliable and computationally feasible methods can predict the stability of aggregates and the rate of aggregation. These problems occur in many areas of molecular interactions, as scoring functions do not provide adequate estimates of the binding free energy, whereas more sophisticated tools such as free energy perturbation require detailed structural information and high computational efforts. It is clear that exploring aggregation in the crowded and inhomogeneous cellular environment is even more difficult. Another difficulty is caused by the limited availability of experimental data. Some data are available on peptides that can form amyloid-fibrils or amorphous β-aggregates and on potential aggregation prone regions in proteins, but aggregation rates upon mutations have been experimentally determined only for a dozen of amyloidogenic proteins .
Cell membranes made up of a wide variety of lipids act as a matrix to host integral membrane proteins, to recruit peripheral membrane proteins, and thus to actively participate in cellular membrane functions together with these proteins. Complexity of biological membrane systems arises from a considerable heterogeneity in the spatial distribution of lipids and proteins on the cell membrane and between the bilayer leaflets. The outer membrane of gram-negative bacteria provides an extreme example, where the lipid component of the outer leaflet is predominantly lipopolysaccharides and those of the inner leaflet are typical phospholipids (Fig. 6) . To a lesser extent, the outer leaflet of the plasma membrane contains more lipids with the phosphatidylcholine head group and sphingolipids (e.g., sphingomyelin) than the inner leaflet, and glycosphingolipids (e.g., gangliosides) exist only in the outer leaflet .
Membrane proteins play important roles in many cellular processes, such as transmembrane signaling [157, 158], transport of ions and small molecules [159-163], energy transduction [164, 165], and cell-cell recognition . They are quantitatively significant as well: 20-30% of the protein-encoding regions of known genomes encode membrane proteins . Furthermore, about 50% of these membrane proteins are considered putative drug targets . Hydrophobic match between the hydrophobic length of the protein transmembrane domain and that of the lipid bilayer has been thought to play an important role in membrane protein function and organization [169, 170]. Responses to an energetically unfavorable hydrophobic mismatch include lipid-induced changes in conformation and association of the transmembrane domains, as well as protein-induced changes in the lipid chain order, and bilayer thickness and curvature. Therefore, membrane proteins require optimal lipid compositions (lipid types and cholesterol concentration) in a bilayer for their optimal function, and membrane protein organization is also largely dependent on lipid compositions [171-178].
Given the aforementioned complexity of cell membranes, membrane proteins, and their distribution and organization, as well as the importance of delicate protein-lipid (or protein-bilayer) interactions in the structural integrity of membrane proteins  and in cellular membrane functions, carrying out molecular dynamics simulations of these complex systems based on realistic all-atom models presents a difficult challenge. Even the construction of initial simulation systems requires knowledge of structural models of individual proteins and lipids, as well as their ratios and locations. Furthermore, considerable computational resources are required to simulate such large systems for a sufficiently long time to obtain meaningful information. Such difficulties are the reason why most all-atom simulation studies of membrane proteins have been limited to either one or a few proteins, and mostly one or two lipid types [180-182]. Simpler low-resolution coarse-grained models have, however, been used to simulate systems with a large number of membrane proteins [183-186]. In addition, the long timescales required present unique challenges in studying both the folding and insertion of transmembrane and peripheral membrane proteins using traditional molecular dynamics simulations. Recently, Tajkhorshid and coworkers developed the highly mobile membrane-mimetic (HMMM) model with accelerated lipid motion by replacing the lipid tails with small organic molecules . The HMMM model provides accelerated lipid diffusion by 1-2 orders of magnitude, and is particularly useful in studying membrane-protein associations [188, 189].
There are various approaches to construct bilayers around membrane proteins [190-197]. Membrane Builder [190-192] (http://www.charmm-gui.org/input/membrane) in CHARMM-GUI  uses lipid-like pseudo atoms that are first distributed and packed around a protein and then replaced by lipid molecules one at a time. This so-called “replacement” method [199, 200] (which essentially corresponds to a reverse coarse-graining operation) allows easy control of the lipid types and the number of each lipid type based on the area per each lipid type (estimated from pure bilayer simulations) in a complex membrane system. Similar packing algorithm is adopted in packmol , MembraneEditor , MemBuilder , and insane , but detailed protocol may have varying degrees of complexity. For example, insane program can only build coarse-grained membrane and the lipid molecules with straight conformations are used, and packmol provides a sophisticated packing algorithm and any lipid molecule can be adopted if provided by a user. InflateGRO2  adopts an approach that removes the lipids that are overlapping to the membrane protein upon insertion. Deleting overlapping lipid molecules can be tricky because membrane proteins may have cavities in the transmembrane region. Therefore, InflateGRO2 adopts a grid-based search that detects protein cavities and assigns scores to the lipids based on the degree of overlap with protein, so the lipids that are ranked high can be deleted. GRIFFIN  and g_membed  adopt similar algorithms, where proteins are inserted into the membrane and overlapping lipids are either pushed away slowly or simply deleted.
Over the years, the complexity of the membrane systems that one has been able to build and simulate has increased, in line with the developmental stages of CHARMM-GUI Membrane Builder. Membrane Builder was first developed in 2007 as a publicly available web resource . The first implementation allowed users to generate an initial configuration of a protein in homogeneous lipid bilayers; three lipid types were available. In 2009, Membrane Builder was further developed to allow generation of membrane-only and a protein-membrane system in heterogeneous bilayers of multiple lipid types ; 35 lipid types were available. In 2014, Membrane Builder was expanded to handle > 180 lipid types including phosphoinositides, cardiolipin, sphingolipids, bacterial lipids, and ergosterol, which make it possible to build biologically realistic membrane systems for many single-celled organisms and models for membranes in the human body . Importantly, Membrane Builder also provides well-validated equilibration and production inputs for many molecular dynamics packages (CHARMM , NAMD , GROMACS , AMBER , OpenMM , and CHARMM/OpenMM) .
In the context of any biomolecular simulation, simulation timescale and molecular force field accuracy have always been challenging issues, and they will be so, as we are all interested in more challenging biological problems that require lager system size and longer simulation time ever. Simulation time and force field accuracy are also coupled as we often see inaccuracies in force fields as simulation timescales of complex systems are extended. In membrane simulations, one might ask the following challenging questions: Can all-atom molecular simulations predict lateral lipid organization and domain formation? Are the current force fields good for liquid ordered (rafts), ripple, or gel phases? Can simulations reveal specific protein-lipid interactions that activate protein functions? Can all-atom modeling and simulation handle interactions of peripheral membrane proteins with specific lipid types on the membrane surface?
With these questions in mind, we would like to turn into challenges and progress of cellular membrane modeling that requires various lipids and general assembly procedures. Having most (phospho- and sphingo-) lipid types covered, the next challenges are in building biological membranes containing glycolipids such as gangliosides, glycophosphatidylinositol (GPI) linkages, and lipopolysaccharide (LPS in Gram-negative bacterial outer membranes) , as the CHARMM force fields already cover a variety of carbohydrates [211-213]. These various types of lipids containing carbohydrates are necessary to model realistic extracellular membrane surface, but it is challenging to model and assemble them together, as glycans come in a diversity of sequences and structures by linking individual sugar units in a multitude of ways. Together Figs. 6 and and77 show the current progress of all-atom modeling and simulation of complex membrane systems [214-216] including LPS and GPI-anchor. It is also now possible to build various glycolipid models in Glycolipid Modeler in CHARMM-GUI (http://www.charmm-gui.org/input/glycolipid), and it will be possible to incorporate Glycolipid Modeler into Membrane Builder in the future to model realistic cellular membranes.
Understanding the spatial organization of chromatin in the cell nucleus is key to gaining insights into the mechanism of gene activities, nuclear functions and maintenance of cellular epigenetic states [217, 218]. Chromosome conformation capture (3C) and related techniques as well as single cell imaging studies have provided a wealth of information on the spatial architecture of the cell nucleus [217, 218]. Ensemble models of 3D structures of chromatins can help decipher physical mechanisms of long-range gene interactions and control of gene expression [219-221]. A challenging task is to infer 3D structures of folded chromosomes from frequency maps measured in 3C-based studies. This bears some resemblance to the problem of inferring protein structures from contact maps obtained from NMR measurements, despite the obvious difference in size and in scale. While chromatin chains are fundamentally different from protein chains and unlikely to fold into a unique native structure, both possess basic physical properties such as constraints on excluded volume, chain connectivity, and spatial confinement. It is likely that techniques developed in studying protein structures will have some relevance in modeling 3D structures of chromosomes. For example, built upon methods developed in protein folding studies [222, 223], the chromosome self-avoiding chain (C-SAC) model and the geometric sequential importance sampling technique were developed (Fig. 8). As a result, the equilibrium ensemble of randomly folded chromosomes in the confined nuclear volume was successfully generated, a challenging task as effective sampling under small volume constraint is extremely difficult. The results explain various experimentally observed scaling properties of spatial distance and looping probability . These results suggest that spatial confinement has dominant effects and offers an alternative interpretation of 3C studies to the earlier fractal globule model and the Strings and Binders Switch (SBS) model [217, 225]. It further suggests that the formation of topological domains can arise spontaneously from basic chain connectivity in severe spatial confinement. It is expected that experimental development such as high-resolution and single-cell Hi-C measurement will provide detailed information to understand how chromosome folding and its dynamic changes related to the control of cellular phenotype during development of individual cells . Further development in modeling 3D chromosome structures will help understand the overall architecture of chromosomes in the nuclear space, identify novel specific spatial interactions among gene elements, and gain mechanistic understanding of changes in the folding landscape of chromosomes which undergoes significant tissue- and developmental stage-specific size and shape changes.
Biological science is on the cusp of a new and transformational way to view living systems – the creation of physical molecular models of the fundamental unit of life, the cell. Developing 3D models to account for experimental observations and to predict emergent biological behaviors is key to gaining mechanistic understanding of cellular processes. We described emerging approaches to the structural modeling on a broad scale, from individual molecules to cell biology. The cross section of these diverse approaches covers 3D molecular cell models based on experimental data, genome-wide structural modeling of protein interactions, atomistic modeling of protein-crowder interactions, nonspecific protein interactions, cellular membrane modeling and simulation, and modeling of chromosomes. The list is far from a complete roster of methodologies needed for structural modeling of a cell, and simply represents one sample of approaches and techniques for such modeling.
Structural modeling of a cell complements computational approaches to cellular mechanisms based on differential equations, graph models, and other techniques to model biological networks, imaging data, etc. The structural modeling along with other computational and experimental approaches will provide a fundamental understanding of life at the molecular level and lead to important applications to biology and medicine.
WI acknowledges grant support from NIH R01 GM092950, NIH U54 GM087519, NSF MCB1516154, NSF MCB1157677, NSF DBI1145987, and NSF IIA1359530; JL thanks Youfang Cao, Gamze Gursoy, Yun Xu, and Jieling Zhao for their work on chromatin folding, and acknowledges grant support from NIH R01 GM079804, NSF MCB1415589, and the Chicago Biomedical Consortium with support from the Searle Funds at The Chicago Community Trust; AJO thanks Graham Johnson, Ludovic Autin, Michel Sanner and David Goodsell for their work on cellPACK, and acknowledges grant support from P41 GM103426 an NIH Research Resource (R. Amaro, Director); HXZ acknowledges grant support from NIH R01 GM088187; SV thanks Dima Kozakov for contributing to research on encounter complexes and nonspecific protein-protein interactions, and acknowledges grant support from NIH R01 GM093147, NIH R01 GM061867, and NSF DBI1147082; IAV thanks Petras Kundrotas and Ivan Anishchenko for their work on modeling of protein interactome, and acknowledges grant support from NIH R01GM074255, NSF DBI1262621 and NSF CNS1337899.
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.