|Home | About | Journals | Submit | Contact Us | Français|
The number of solved structures of macromolecules that have the same fold and thus exhibit some degree of conformational variability is rapidly increasing. It is consequently advantageous to develop a standardized terminology for describing this variability and automated systems for processing protein structures in different conformations. We have developed such a system as a ‘front-end’ server to our database of macromolecular motions. Our system attempts to describe a protein motion as a rigid-body rotation of a small ‘core’ relative to a larger one, using a set of hinges. The motion is placed in a standardized coordinate system so that all statistics between any two motions are directly comparable. We find that while this model can accommodate most protein motions, it cannot accommodate all; the degree to which a motion can be accommodated provides an aid in classifying it. Furthermore, we perform an adiabatic mapping (a restrained interpolation) between every two conformations. This gives some indication of the extent of the energetic barriers that need to be surmounted in the motion, and as a by-product results in a ‘morph movie’. We make these movies available over the Web to aid in visualization. Many instances of conformational variability occur between proteins with somewhat different sequences. We can accommodate these differences in a rough fashion, generating an ‘evolutionary morph’. Users have already submitted hundreds of examples of protein motions to our server, producing a comprehensive set of statistics. So far the statistics show that the median submitted motion has a rotation of ~10° and a maximum Cα displacement of 17 Å. Almost all involve at least one large torsion angle change of >140°. The server is accessible at http://bioinfo.mbb.yale.edu/MolMovDB
Solved structures and related structural information on proteins is growing at an exponential rate. This is due chiefly to continuous technological progress in X-ray crystallography, NMR spectroscopy and computer technology. As researchers solve structures at an ever increasing rate, there occurs an obvious need for processing techniques to relate such structures to one another, beyond classification or structural alignment. Protein motions, as an essential link between structure and function, are an obvious area of relationships between protein structures in the databases. Motion is intimately related to the way a structure fulfills a particular function. Protein motions are involved in a wide variety of basic functions, including regulation, transport of metabolites, formation of large assemblies, and cellular locomotion. Examples can be found throughout nature, from local conformational changes involved in the binding of ligands (1) that occur in enzymatic reactions to the complex rearrangement of covalent bonds (2).
Obviously, one of the best ways to represent and communicate protein motions is through ‘movies’, especially when they are made available over the web. There have been a number of previous efforts in this area. Vonrhein et al. (3,4) made a custom movie of calmodulin and placed it on the Web. Similar work has been done by Sawaya et al., who created movies of crystal structures of polymerase beta (5). Ray-traced 3D molecular dynamics simulation of acetylcholinesterase from mutagenesis data have also been made available (6–8). More recently, movies from molecular dynamics simulations of protein folding (plp group) (9,10) have been available on the Internet. Xu et al. used the techniques of normal mode analysis to produce a morph movie of GroEL from structural data (11–14).
In this paper we present a perspective on how protein motions can be put into standardized, consistent terms. We develop a simple model for protein motions involving rigid-body motion of parts, apply our model to actual cases and measure how well it fits. Our approach is embodied in an integrated Web server that provides tools to compare solved conformations of proteins involved in motion, generates statistics to characterize and classify them into a database and automatically makes a morph movie to represent them. In addition, the server presents a database linking protein motions with custom movies of motions available at other sites, along with our own morphs generated automatically by the server upon request by members of the Internet community. Our server and database have been used by Internet users to analyze a number of recent structures including human interleukin 5 (15), bc1 complex (16,17), glycerol kinase (18,19) and lactoferrin (20,21). It has also been used as a source of raw data in visualization tools (22) and in relation to other biological databases (23). The Web server is accessible at: http://bioinfo.mbb.yale.edu/MolMovDB/morph . It is integrated with the Database of Macromolecular Motions (24,25) and is also connected with a variety of tools for aligning protein folds and studying their occurrence in genomes (26–29).
The best way to understand our approach is in terms of the ‘information flow’ diagrammed in Figure Figure1.1. One starts by submitting two or more conformations of a given protein to the server. Given one conformation, a number of online tools and databases, such as PDB, FSSP, SCOP, CATH, CE and VAST can suggest a second conformation. Then, through a variety of transformations, the server classifies the motion in the database and produces an appealing movie.
Solved conformations analysis as performed by the server’s tools requires two kinds of information: (i) 3D atomic coordinates of protein conformations as solved structure files (such as those at the PDB) and, more importantly, (ii) information relating two or more of these solved structures, thus selecting them for analysis. Such information, for instance, could come from the SCOP Database (30,31), from automated searching of databases for proteins related by structure or sequence or from a simple user input form on the Web. A selection scheme is important because the number of ordered pairs of PDB structures is rather large (more than 10 0002). Figure Figure22 diagrams the server in the larger context of data sources.
Once a string of structures has been given to the server, the first step is to establish equivalence (an alignment) between residues in the various proteins. This is necessary because the protein structures compared, while sharing some evolutionary or structural similarity, will, in general, not share the same amino acid sequence. Consequently, an alignment is necessary.
Because the server may be asked to simultaneously compare more than two sequences, an algorithm capable of simultaneously aligning multiple sequences (or structures) and potentially building an evolutionary tree must be used. For this purpose, we have chosen the AMPS algorithm (32–34). In cases in which sequence alignment is inappropriate, such as for highly diverged homologs, we use the technique of structure alignment (28,29). The latter method relies primarily on the use of 3D coordinates (i.e., solved PDB structures of proteins) to produce a sequence alignment otherwise analogous to an alignment produced purely from sequence information. As a result, the structural method is able to generate meaningful sequence alignments from both highly related proteins and completely unrelated proteins sharing similar structural features due to convergent evolution. Sequence alignment is used unless sequence similarity is below a user-defined cut-off, at which point structure alignment is used. The choice of approach (sequence or structural alignment) may also be forced by the user upon morph submission.
One of the major aims of the server is to collect standardized statistics on the proteins involved in motions. Standardized statistics, such as maximum rotation or maximum Cα displacement, are computed with respect to a specific superposition and reference frame, and so the superposition algorithm is central to any conformational analysis tool.
The output of the alignment procedure establishes residue equivalencies that are used in an intelligent superposition of the structures onto one another. Traditional ‘all-atom’ RMS superposition minimizes the RMS difference between Cα atoms in the open and closed conformations. In a simple hinge motion, e.g., calmodulin, such an alignment fits the closed conformation symmetrically inside the open conformation (Fig. (Fig.3).3). Amongst other things, the maximum Cα displacement computed from such a superposition is considerably underestimated from the common sense alignment, and the morph movie gives the impression of motion far more complicated than a simple opening of a hinge. Instead, we perform the superposition with a modified ‘sieve-fit’ procedure (35,36). The procedure is iterative. On each iteration the remaining Cα atoms are superimposed by a standard RMS fit, and then the pair of corresponding Cα atoms furthest apart are eliminated. This is repeated until approximately half of the atoms in the protein have been eliminated. Previously described uses of the ‘sieve-fit’ procedure (36,37) used some sort of cut-off value to determine when to stop the procedure, typically RMS deviation. No single RMS deviation cut-off value has consistently worked well. However, we have found that by stopping the procedure after approximately half the atoms have been discarded, one of the ‘domains’ thus selected generally corresponds approximately to a superset or a subset of a real domain in the structure, and is thus well suited for performing the subsequent axes transformations.
To locate the screw-axis, a ‘fit–refit’ procedure, as described by Lesk and Chothia (38) is used. Following superposition of the starting and ending conformations, we only consider the set of eliminated atoms. We perform an RMS-fit of that set between the starting and ending conformations; the server performs the new superposition (arbitrarily) on the ending conformation. A comparison of the new position of the ending conformation following this latest fit with its position following the ‘sieve-fit’ procedure yields a geometric transformation whose screw axis is (approximately) the axis of the hinge motion, i.e., the location of the hinge, as has been published elsewhere (39). Straightforward calculations allow characterization of the angle of rotation around the hinge axis.
If a significant hinge motion is present, the software uses these transformations to align the Z-axis of the coordinate frame parallel to the hinge axis so that, when the motion is rendered, viewers will look down the screw-axis of the hinge motion. The longest moment of the protein (long axis) is rotated (optionally) so that it is parallel to the Y-axis. Finally, the coordinate frame is translated so that the centroid of the initial conformation is in the center of the field of view.
The software also attempts to locate putative hinge regions using a simple and relatively fast algorithm. The algorithm looks for a persistent transition between the two domains identified by the program. The algorithm constructs a search window, initially with 24 residues. It examines each position along the peptide backbone in this window. If there is a persistent transition (i.e., one-half of the algorithm’s search window belonging to one ‘domain’ and the other half to the other), a hinge is detected. If the program fails to find any hinges along the backbone chain, the window size is reduced by two, and the procedure is repeated until the window size has been shrunk down to 12 residues, at which point the program reports failure. Empirically, this crude but computationally inexpensive algorithm successfully finds many hinge regions, such as the hinge region for calmodulin, which agree well with published residue selections. In other cases, the algorithm comes close, identifying a residue selection that borders on a hinge. Hinges may be displayed graphically via a ‘hinge movie’ identifying the putative hinge region or regions in red.
In related work, Wriggers et al. presented techniques to identify protein domains and common hinges using an adaptive least-squares fitting technique (40); the user is presented with a number of options (spatial connectivity maintenance, significant structural difference filters) to ensure optimal hinge finding. For the remote user’s convenience, our own hinge finder is at present fully automatic and presents no options to the user. It may be advantageous for us to provide such options in the future so that the user can override and improve on the putative hinge initially selected by our algorithm, although this would partially defeat our efforts at standardization. Maiorov et al. (41) has developed a system which detects hinges by large-scale sampling of torsion angle space; this technique, while presumably more accurate, is also much more computationally expensive then our current technique. It may be useful for us to give the user the option of using alternate hinge finding engines in the future.
To illustrate the putative hinge finder, a frame from one such ‘hinge movie’ is given in Figure Figure4,4, with the putative hinge identified in black. Superposition, orientation and hinge-finding are relatively fast steps, requiring a fraction of a second of computer time on our server.
We have modified the X-PLOR package (42) to homogenize the stored coordinates. This problem is non-trivial (43,44). The initial, solved intermediate and final conformations are parsed by X-PLOR and examined for missing non-hydrogen coordinates. These are filled in using energy minimization with the known coordinates of the molecule fixed at their solved positions. If these missing coordinates are available in another solved conformation, the coordinates from the superimposed and rotated conformation are used as an initial guess as to their likely positions. As written, filling-in of missing non-hydrogen coordinates is necessary for the energy minimization subsystems to work robustly with a large number of PDB files. It also ensures homogenized output of PDB files, which is required by the visual rendering subsystem.
The next step is in the dominion of what we refer to as the ‘interpolation engine.’ Once the structures have been homogenized in terms of solved atomic coordinates, interpolation may proceed. Under command of the script, the custom X-PLOR interpolation function is repeatedly called, each time evenly reducing the distance between the current structure and the final structure. When more than two solved conformations are present, the distance between the current structure and a solved intermediate conformation is evenly reduced instead. Each step is followed by a round of energy minimization to correct molecular stereochemistry and enforce rules of chemical reality on the structure. To ensure that the final frames are as accurate as possible, the solved endpoint structures are used for these. When solved intermediates are present, these are inserted as frames at regular intervals. The entire process takes only a few minutes to produce 10 frames running on a 500 MHz Intel Pentium III workstation running Linux.
There are many possible interpolation strategies, and a number of tradeoffs between accuracy, various computational resources, time and others are involved in the choice. For this reason, in addition to our original adiabatic mapping engine, we offer the user two engines based on LSQMAN (45,46) (one Cartesian-based and another based on internal phi, psi coordinates), which are faster but appear to be less realistic. Users wishing to add their own, non-trivial interpolation engines may contact the authors to make arrangements to do so. For example, a user wishing to analyze a very large number of trajectories (10 000 or more from, e.g., samplings from molecular dynamics simulations) might wish to supply a simplified interpolation engine and make other arrangements to allow the computations to be completed in a reasonable amount of time.
We chose our original technique, known in the literature as adiabatic mapping (47) for reasons of computational efficiency. It is a technique that produces chemically reasonable morphs with a modest amount of computational power and thus is most suitable for a Web-based server. This remains the default interpolation engine for the server. Using this engine, the server can produce a realistic interpolation of a protein and have the results rendered and returned to the user in < 3 min on a fast Pentium III machine. Using adiabatic mapping, we have also produced our own morph of the motion in GroEL which, although probably less accurate than the considerably more expensive technique of normal mode analysis (11–14), is probably good enough for most researchers seeking only a visual representation. Nevertheless, we believe that, for many proteins, most real motions will occur along the interpolated trajectories, and the morph server may be used to predict intermediate conformations should they exist. How close our predicted pathways come to reality is perhaps best answered through the emerging technique of time-resolved X-ray crystallography (48,49). Thus, an adiabatic mapping engine is much more suited to our goal of automatically interpolating a large percentage of the motions in our database.
With the intermediate conformations morphed, the molecule is now visually rendered. We have written a Perl script that produces VRML 2.0 (Moving 3D Worlds) code (50,51) on-the-fly from the intermediate PDB files. The VRML 2.0 output is suitable for interactively viewing the moving 3D macromolecule in a VRML 2.0 Internet browser, such as SGI CosmoPlayer 2.0. The advantage of the 3D display format is that the remote Internet user may easily choose a preferred orientation and vantage point.
The molecule is also rendered as a 2D movie in the MultiGif, Quicktime, and MPEG formats, as well as an Adobe Portable Document Format (PDF) (52) page showing the individual frames. Remote adjustment of vantage point and orientation is not possible in the simpler 2D video format, so the molecule is rendered with the screw axis perpendicular to the plane of the display device, as was computed during the orientation process. The molecule is rendered in three display types (53,54): ribbons (with secondary structure indicated), lines (as a simple alpha chain), and ball and stick (showing all individual non-hydrogen atoms). The first two formats are also rendered into a small moving MultiGif icon to afford the database user with a quickly downloaded preview of the larger movies available.
In the process, key standardized statistics are recorded. These include maximum Cα displacement, rotation angle in degrees around putative hinge regions, sequences of the putative hinge regions, average torsion angle change in the hinge region versus the overall average, distance of the putative hinge region from the screw axis, distance of the screw axis from the centroid, a structural comparison score between the two domains and a number of additional, useful statistics, such as the differences in torsion angles at every aligned position and the pseudo CHARMM/X-PLOR (42,55) energy at each point in the morph.
These statistics are detailed enough to perform an automatic preliminary classification of the motion and determine the location of the hinge relative to the transformed axes. (For example, a large rotation angle indicates a probable hinge motion.) A detailed description of our statistical results is given in Table Table11 for five motions. Ranges and averages of some of these statistics after several hundred alignments are given in Table Table22 along with similar but sparse statistics culled manually from the scientific literature for comparison.
For example, over approximately 175 motions submitted for analysis, the median motion has a maximum rotation of 9.5° over a range of 0–150° as computed by our algorithm, whereas the 12 motions culled from the scientific literature had an average rotation of 24° over a range of 5–148°. Similarly, our algorithms found a median maximum Cα displacement of 17 Å ranging from 0 to 81 Å for the submitted motions, whereas 11 motions reported in the scientific literature average 12 Å over a range of 1.5–60 Å. Although most of the structures are very similar in sequence, the server has been able to accommodate sequence identity down to 8% for some motions (see Table Table3).3). Most motions have at least one large torsion angle change (see Table Table44).
The sparseness of manually culled data in Table Table22 is due to the lack of a standardized nomenclature for these statistics in the scientific literature. It is worth noting that a different set of proteins had to be used for each of the manually culled tallies in Table Table2.2. Because these statistics predate the server, they serve as a manual “gold standard” against which the results of the server may be compared. Table Table11 presents a statistical description of motions in the database, a main scientific benefit of the server.
Privacy is a concern with some submissions, so users are afforded the option to either keep their submissions secret until the results have been published or to cause the submission to appear immediately in an index. For each successfully completed morph, the server produces a Web page allowing easy download of the coordinates (as an archive of PDB files or in NMR format) or movies (in a number of video formats), in addition to displaying the molecule in the moving VRML format. The page includes the standardized statistics discussed above generated for the conformations used in the morph. This page may be accessed through a URL containing a special code that is emailed back to the submitting user when the morph is complete; for users seeking to keep their morphs private (for publication reasons), this URL serves as the user’s password, allowing access to the morph page in the server. For public morphs, these pages are also accessible through an index, http://bioinfo.mbb.yale.edu/MolMovDB/movies
The ultimate flow of information is circular. For each motion we either link it via a motion ID to an existing entry in the Macromolecular Motions Database or we generate a new entry in the database. The results of analyzing particular ordered sets of structures (‘strings’ of structures) are entered under an appropriate identifier into the Database of Macromolecular Motions for further reference, and, in many cases, suggest further structures to study and analyze. Each comparison is assigned a unique ID entered into the ‘comparison table’ in the database that references the IDs of the PDB structures involved. These comparisons are, in turn, referenced by entries in the motions database (these references may be generated by comparing the IDs of the PDB structures referenced in each comparison table entry with the PDB structures referenced in each motion table entry.). Because many motions in the database are associated with more than two structures, more than one comparison is often possible and some database entries do reference multiple comparisons.
New movies, which lack a motion entry in the Database of Macromolecular Motions, have an entry automatically created with minimal or no annotation. This is indicated in the entry by setting the annotation level to zero. (Annotation levels range from 0 to 10. A level of ‘0’ indicates the entry was automatically created with no human intervention. ‘10’ indicates significant human intervention, typically in the form of a large amount of descriptive text present in the entry.) The user can annotate the new entry using an easy-to-use edit form displayed in his or her Web browser. Existing entries are also editable by the community through the same Web form with prior authorization from the database’s maintainers. All changes are subsequently reviewed by the maintainers to assure quality control. In this way, the Database of Macromolecular Motions is used to classify and organize morphs submitted to the morph server.
To illustrate the technique of adiabatic mapping as implemented by the server, Figure Figure55 depicts the frames in five automatically generated morphs produced by the server’s adiabatic mapping interpolation engine.
ADH. First is a ‘trivial’ morph, alcohol dehydrogenase (56,57). This morph is considered ‘trivial’ because a true motion is involved, and the endpoint conformations are sufficiently close together that a pathway between the two— not involving chain breaks or clearly distorted geometry— is easy to construct in the mind’s eye. Therefore, one would intuitively expect software that claims to perform morphing to handle this case with similar ease. This is indeed the case, as can be seen in the figure, which depicts the frames generated by the morph server for the morph of alcohol dehydrogenase. The protein has very little movement; the figure shows the frames in the motion generated by our server with an arrow to indicate the region of movement. When the actual animation is played back, the arrow is not necessary, as the eye has evolved to be especially sensitive to motion and easily picks out the movement in the movie.
Recoverin. Recoverin (58–60) is an example of a ‘typical’ morph. The morph is considered ‘typical’ because a true motion is involved, as can be seen in the figure, the motion involves most of the molecule and is therefore qualitatively more extensive than that of a ‘trivial’ motion such as alcohol dehydrogenase. The motion is sufficiently complicated that a simple linear interpolation would produce at least some obvious distortion and physical impossibilities. Nevertheless, the adiabatic mapping interpolation engine produces a realistic morph without chain breaks or clearly distorted geometry.
GroEL. The user should be aware of a number of problems that can be encountered in the adiabatic mapping method. Problems arise for large deformations if the energy minimization methods cannot effectively remove the accumulated stresses (62). These problems are endemic to all adiabatic mapping systems, including our Web server. This problem is illustrated in Figure Figure5,5, which shows a morph of one subunit in GroEL (11–14), a ‘medium difficulty’ morph because of the considerable atomic displacements between the starting and ending conformations. However, this GroEL motion still represents a true protein motion, and the server still produces a fairly realistic interpolation. One means of improving this morph would be to have the user select additional interpolation frames (and, hence, additional energy minimizations). (This is in a sense a ‘feature’ that highlights which motions are sterically more difficult to achieve.)
DT. Our model will, of course, break when fed an ‘impossible’ morph, as shown in Figure Figure5.5. The endpoint conformations are that of diphtheria toxin (DT), not a true motion but rather an example of domain swapping (63,64) in which the domains in DT have been solved while bound in two different configurations. For one conformation to ‘morph’ into another, the easiest physically realistic route would be for one domain to unfold and refold. Indeed, the morph generated by the server does suggest a process of this sort.
While the morph server is unable to generate a physically realistic movie of this ‘motion’, it does suggest that the morph server may be used as a quick visual tool in evaluating the validity of a proposed motion. Comprehensive statistics for all five morphs may be found in Table Table11.
In a majority of cases the structures of a given macromolecule involved in motions have been solved in two or more conformations, so that endpoints for the motion are available. This, in turn, means that automatic conformation comparison tools are possible, which, when applied en masse to the motions database, allow the generation of a consistent, standardized set of statistics characterizing the motions in the database. In the process of analyzing the structures, pathway interpolation is possible as well.
Since the goal of the server was to output only a single interpolated pathway ‘morph’, it is necessary to define more precisely what is desired. Define the ‘optimal morph’ as the most likely (or most frequently taken) pathway between two conformations. In the large dimensional space of macromolecular atomic coordinate space, an infinite number of paths between conformations exist, so that establishing that a given local ensemble of pathways is the most statistically probable is, in general, computationally intractable. A more realistic approach would be simply to find a morph that is a reasonably good reaction coordinate that does not produce any large chemical distortions. This reduced the computational complexity of the problem, yet ensured that the resulting morph would be insightful, yet likely be very similar to the ‘optimal’ morph.
Unlike the adiabatic interpolation engine used in the server, a number of interpolation engines on proteins have taken approaches that do not meet these criteria. With the exception of the simplest motions, simple linear interpolations of atomic coordinates without consideration of physical reality yields intermediates with clearly distorted geometry. In some cases, atoms may be significantly closer than their van der Waals radii would permit, or further apart than a chemical bond would reasonably be expected to allow. A more sophisticated approach to morph movies not currently taken by the server due to its stringent computational requirements, but one which might be added in the future, involves the use of normal mode analysis, such as was done on GroEL by Xu et al. (11–14).
We have developed an integrated set of protein conformation comparison tools on the Web for use in conjunction with the Macromolecular Motions Database or as a stand-alone, publicly accessible server. When solved endpoint structures are available, the server can produce a useful comparison of the structures involved in protein motions. The server also implements a database of protein motions accessible on the Web or generated by Internet users through our server; this database is integrated into the Molecular Motions Database.
The server collects a number of statistics on the motion, including maximum Cα displacement and maximum rotation around the putative hinge, which are useful both in analyzing and classifying individual proteins and in generating a statistical picture of motions in the motions database as a whole. The software also homogenizes the incoming structures, attempting to solve for missing atoms using a molecular dynamics algorithm. The server then uses an adiabatic mapping technique to generate a visually rendered interpolated pathway, or ‘morph’, of the motion or evolution of the protein. The homogenized endpoint coordinates and the generated intermediate coordinates are made available for download.
The software presents the visual representation, statistics, orientation, alignment and interpolated coordinates to the user. At user option, these results may become public immediately or remain private until paper publication. Through an easy-to-use Web form, the user is afforded an opportunity to create a descriptive entry in the Database of Macromolecular Motions for the protein structures involved, referencing the morph results, as well. We have found the server useful in the analysis of protein motions and anticipate that use of the server will help standardize statistics and nomenclature for protein motions subsequently presented in the scientific literature.
The authors gratefully acknowledge the financial support of the Keck Foundation and the National Science Foundation (Grant DBI-9723182). Numerous people have also either contributed entries or information to the database and morph server or have given us feedback on what the user community wants. The authors also wish to thank Informix Software, Inc. for providing a grant of its database software.