The best way to understand our approach is in terms of the ‘information flow’ diagrammed in Figure 1. One starts by submitting two or more conformations of a given protein to the server. Given one conformation, a number of online tools and databases, such as PDB, FSSP, SCOP, CATH, CE and VAST, can suggest a second conformation. Then, through a variety of transformations, the server classifies the motion in the database and produces an appealing movie.
Figure 1 Diagram of our approach. The information flow from databases, through the server, and then back again to databases is broken down into its component steps. Experimental data in the PDB and other databases is converted into a motion entry in the Database ...
Analysis of solved conformations, as performed by the server’s tools, requires two kinds of information: (i) 3D atomic coordinates of protein conformations as solved structure files (such as those at the PDB) and, more importantly, (ii) information relating two or more of these solved structures, thus selecting them for analysis. Such information could come, for instance, from the SCOP Database (30), from automated searching of databases for proteins related by structure or sequence, or from a simple user input form on the Web. A selection scheme is important because the number of ordered pairs of PDB structures is rather large (more than 10 000²). Figure 2 diagrams the server in the larger context of data sources.
Figure 2 (Left) Here, the information flow may be visualized as a series of linked Web pages. Users submit new motions to the server via either the Server Submission Form or via a simplified interface through the Structural Alignment Server’s submission ...
Once a string of structures has been given to the server, the first step is to establish equivalence (an alignment) between residues in the various proteins. This is necessary because the protein structures compared, while sharing some evolutionary or structural similarity, will not, in general, share the same amino acid sequence.
Because the server may be asked to simultaneously compare more than two sequences, an algorithm capable of simultaneously aligning multiple sequences (or structures) and potentially building an evolutionary tree must be used. For this purpose, we have chosen the AMPS algorithm (32). In cases in which sequence alignment is inappropriate, such as for highly diverged homologs, we use the technique of structure alignment (28). The latter method relies primarily on the use of 3D coordinates (i.e., solved PDB structures of proteins) to produce a sequence alignment otherwise analogous to an alignment produced purely from sequence information. As a result, the structural method is able to generate meaningful sequence alignments from both highly related proteins and completely unrelated proteins sharing similar structural features due to convergent evolution. Sequence alignment is used unless sequence similarity is below a user-defined cut-off, at which point structure alignment is used. The choice of approach (sequence or structural alignment) may also be forced by the user upon morph submission.
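The selection rule described above can be illustrated with a minimal sketch. It assumes that similarity is measured as percent identity over the ungapped columns of a pairwise alignment; the function names, the gap character and the default cut-off of 25% are hypothetical choices for illustration and are not taken from the server.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip gapped columns
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def choose_alignment_method(aligned_a, aligned_b, cutoff=25.0, forced=None):
    """Return 'sequence' or 'structure'.

    cutoff is the user-defined similarity threshold (hypothetical default);
    forced lets the submitter override the automatic choice.
    """
    if forced is not None:
        if forced not in ("sequence", "structure"):
            raise ValueError("forced must be 'sequence' or 'structure'")
        return forced
    if percent_identity(aligned_a, aligned_b) < cutoff:
        return "structure"
    return "sequence"
```

For example, `choose_alignment_method("ACDE-F", "ACDEGF")` returns `'sequence'`, whereas two sequences with no identical aligned residues fall back to `'structure'` unless the user forces the choice.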
One of the major aims of the server is to collect standardized statistics on the proteins involved in motions. Standardized statistics, such as maximum rotation or maximum Cα displacement, are computed with respect to a specific superposition and reference frame, and so the superposition algorithm is central to any conformational analysis tool.
The output of the alignment procedure establishes residue equivalencies that are used in an intelligent superposition of the structures onto one another. Traditional ‘all-atom’ RMS superposition minimizes the RMS difference between Cα atoms in the open and closed conformations. In a simple hinge motion, e.g., calmodulin, such an alignment fits the closed conformation symmetrically inside the open conformation (Fig. 3). Amongst other things, the maximum Cα displacement computed from such a superposition is considerably underestimated relative to the common-sense alignment, and the morph movie gives the impression of a motion far more complicated than the simple opening of a hinge. Instead, we perform the superposition with a modified ‘sieve-fit’ procedure (35). The procedure is iterative. On each iteration the remaining Cα atoms are superimposed by a standard RMS fit, and then the pair of corresponding Cα atoms furthest apart is eliminated. This is repeated until approximately half of the atoms in the protein have been eliminated. Previously described uses of the ‘sieve-fit’ procedure (36) used some sort of cut-off value, typically an RMS deviation, to determine when to stop the procedure. No single RMS deviation cut-off value has consistently worked well. However, we have found that by stopping the procedure after approximately half the atoms have been discarded, one of the ‘domains’ thus selected generally corresponds approximately to a superset or a subset of a real domain in the structure, and is thus well suited for performing the subsequent axes transformations.
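The iteration can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the server's code: it uses NumPy, takes two arrays of already-equivalenced Cα coordinates, and uses the standard SVD-based (Kabsch) least-squares fit as the "standard RMS fit". The function names and the `keep_fraction` parameter are hypothetical.

```python
import numpy as np


def kabsch_fit(mobile, target):
    """Least-squares rotation r and translation t with mobile @ r.T + t ~ target."""
    mc, tc = mobile.mean(axis=0), target.mean(axis=0)
    h = (mobile - mc).T @ (target - tc)            # 3x3 covariance matrix
    u, _, vt = np.linalg.svd(h)
    # Flip the last axis if needed so that r is a proper rotation (no reflection).
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    t = tc - r @ mc
    return r, t


def sieve_fit(mobile, target, keep_fraction=0.5):
    """Iteratively refit and discard the worst-fitting atom pair.

    Stops when about keep_fraction of the pairs remain (half, as in the text).
    Returns the final rotation, translation and the indices of the kept pairs.
    """
    mobile = np.asarray(mobile, dtype=float)
    target = np.asarray(target, dtype=float)
    n_keep = max(3, int(round(len(mobile) * keep_fraction)))
    kept = np.arange(len(mobile))
    while True:
        r, t = kabsch_fit(mobile[kept], target[kept])
        if len(kept) <= n_keep:
            break
        dist = np.linalg.norm(mobile[kept] @ r.T + t - target[kept], axis=1)
        kept = np.delete(kept, np.argmax(dist))    # drop the furthest-apart pair
    return r, t, kept
```

The indices in `kept` define the core on which the structures are superimposed; the complementary (eliminated) set is the one considered in the later screw-axis calculation.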
Figure 3 Superposition of a calmodulin-like protein undergoing a hinge motion. Structures 1 and 2 indicate the closed and open conformations, respectively. Compare ‘Global Fit’, the superposition produced by a traditional least-squares fit of ...
Orientation and hinge location
To locate the screw-axis, a ‘fit–refit’ procedure, as described by Lesk and Chothia (38), is used. Following superposition of the starting and ending conformations, we only consider the set of eliminated atoms. We perform an RMS-fit of that set between the starting and ending conformations; the server performs the new superposition (arbitrarily) on the ending conformation. A comparison of the new position of the ending conformation following this latest fit with its position following the ‘sieve-fit’ procedure yields a geometric transformation whose screw axis is (approximately) the axis of the hinge motion, i.e., the location of the hinge, as has been published elsewhere (39). Straightforward calculations allow characterization of the angle of rotation around the hinge axis.
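As an illustration of these calculations, the sketch below extracts the screw parameters from a rigid-body transformation x' = R x + t using textbook rotation-matrix identities. It assumes NumPy; the function name and return layout are hypothetical, and rotations very close to 0° or 180° are simply rejected rather than handled.

```python
import numpy as np


def screw_axis(r, t):
    """Screw parameters of the rigid motion x' = r @ x + t.

    Returns (angle in degrees, unit axis direction, a point on the axis,
    translation along the axis).
    """
    r = np.asarray(r, dtype=float)
    t = np.asarray(t, dtype=float)
    cos_theta = np.clip((np.trace(r) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if np.sin(theta) < 1e-8:
        raise ValueError("rotation angle too close to 0 or 180 degrees for this sketch")
    # The antisymmetric part of r gives the axis direction scaled by 2*sin(theta).
    w = np.array([r[2, 1] - r[1, 2], r[0, 2] - r[2, 0], r[1, 0] - r[0, 1]])
    axis = w / (2.0 * np.sin(theta))
    shift = float(t @ axis)                  # translation along the axis
    perp = t - shift * axis                  # translation perpendicular to it
    # Points p on the axis satisfy (I - r) p = perp; lstsq returns the one
    # closest to the origin because (I - r) is singular along the axis.
    point = np.linalg.lstsq(np.eye(3) - r, perp, rcond=None)[0]
    return float(np.degrees(theta)), axis, point, shift
```

A rotation of 30° about an axis parallel to Z passing through (1, 2, 0), combined with a 0.5 Å slide along that axis, is recovered as an angle of 30°, an axis direction of (0, 0, 1), an axis point of (1, 2, 0) and a shift of 0.5.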
If a significant hinge motion is present, the software uses these transformations to align the Z-axis of the coordinate frame parallel to the hinge axis so that, when the motion is rendered, viewers will look down the screw-axis of the hinge motion. The longest moment of the protein (long axis) is rotated (optionally) so that it is parallel to the Y-axis. Finally, the coordinate frame is translated so that the centroid of the initial conformation is in the center of the field of view.
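The reorientation just described can be sketched as below. This is a minimal illustration, assuming NumPy and using Rodrigues' rotation formula to bring the hinge axis onto Z; the optional rotation of the long axis onto Y is omitted, and the function names are hypothetical.

```python
import numpy as np


def rotation_to_z(axis):
    """Rotation matrix that maps the given direction onto the +Z axis."""
    u = np.asarray(axis, dtype=float)
    u = u / np.linalg.norm(u)
    z = np.array([0.0, 0.0, 1.0])
    v = np.cross(u, z)                 # rotation axis, length sin(angle)
    s = np.linalg.norm(v)
    c = float(u @ z)                   # cos(angle)
    if s < 1e-12:
        # Already parallel to Z, or antiparallel (then rotate 180 degrees about X).
        return np.eye(3) if c > 0 else np.diag([1.0, -1.0, -1.0])
    k = np.array([[0.0, -v[2], v[1]],
                  [v[2], 0.0, -v[0]],
                  [-v[1], v[0], 0.0]])
    return np.eye(3) + k + (k @ k) * ((1.0 - c) / s ** 2)


def orient_along_z(coords, axis):
    """Center coords on their centroid and rotate so that axis points along +Z."""
    coords = np.asarray(coords, dtype=float)
    rot = rotation_to_z(axis)
    centered = coords - coords.mean(axis=0)
    return centered @ rot.T, rot
```

In a multi-frame morph, the rotation and centroid computed for the initial conformation would be applied unchanged to every frame so that the viewer's frame of reference stays fixed.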
The software also attempts to locate putative hinge regions using a simple and relatively fast algorithm. The algorithm looks for a persistent transition between the two domains identified by the program. The algorithm constructs a search window, initially with 24 residues. It examines each position along the peptide backbone in this window. If there is a persistent transition (i.e., one-half of the algorithm’s search window belonging to one ‘domain’ and the other half to the other), a hinge is detected. If the program fails to find any hinges along the backbone chain, the window size is reduced by two, and the procedure is repeated until the window size has been shrunk down to 12 residues, at which point the program reports failure. Empirically, this crude but computationally inexpensive algorithm successfully finds many hinge regions, such as the hinge region for calmodulin, which agree well with published residue selections. In other cases, the algorithm comes close, identifying a residue selection that borders on a hinge. Hinges may be displayed graphically via a ‘hinge movie’ identifying the putative hinge region or regions in red.
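The window search can be sketched as follows. The sketch assumes that each residue carries a domain label (for example 0 for atoms kept by the sieve-fit and 1 for eliminated atoms), that a 'persistent transition' means every residue in the first half of the window has one label and every residue in the second half has the other, and that the 12-residue window is itself tried before failure is reported. These are interpretations for illustration; the function name and 0-based indexing are hypothetical.

```python
def find_hinges(domains, start_window=24, min_window=12, step=2):
    """Search for putative hinges in a per-residue list of domain labels.

    Returns (window_size, [(first_index, last_index), ...]) for the largest
    window size that yields at least one hit, or (None, []) on failure.
    Indices are 0-based positions along the chain.
    """
    n = len(domains)
    for window in range(start_window, min_window - 1, -step):
        half = window // 2
        hits = []
        for i in range(n - window + 1):
            left = domains[i:i + half]
            right = domains[i + half:i + window]
            # Persistent transition: each half is uniform and the halves differ.
            if len(set(left)) == 1 and len(set(right)) == 1 and left[0] != right[0]:
                hits.append((i, i + window - 1))
        if hits:
            return window, hits
    return None, []
```

With 30 residues in one domain followed by 30 in the other, the full 24-residue window already succeeds and reports the single region spanning indices 18 to 41. Shorter domain segments are only found after the window has been shrunk.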
In related work, Wriggers et al. presented techniques to identify protein domains and common hinges using an adaptive least-squares fitting technique (40); the user is presented with a number of options (spatial connectivity maintenance, significant structural difference filters) to ensure optimal hinge finding. For the remote user’s convenience, our own hinge finder is at present fully automatic and presents no options to the user. It may be advantageous for us to provide such options in the future so that the user can override and improve on the putative hinge initially selected by our algorithm, although this would partially defeat our efforts at standardization. Maiorov et al. have developed a system which detects hinges by large-scale sampling of torsion angle space; this technique, while presumably more accurate, is also much more computationally expensive than our current technique. It may be useful for us to give the user the option of using alternate hinge finding engines in the future.
To illustrate the putative hinge finder, a frame from one such ‘hinge movie’ is given in Figure 4, with the putative hinge identified in black. Superposition, orientation and hinge-finding are relatively fast steps, requiring a fraction of a second of computer time on our server.
Figure 4 Putative hinge movie. A frame from a ‘hinge movie’ of ras protein (PDB ID 4Q21 morph intermediate frame) showing the putative hinge regions as identified by the server. The server identifies 71:82 and 118:129 as putative hinge regions ...
We have modified the X-PLOR package (42) to homogenize the stored coordinates. This problem is non-trivial (43). The initial, solved intermediate and final conformations are parsed by X-PLOR and examined for missing non-hydrogen coordinates. These are filled in using energy minimization with the known coordinates of the molecule fixed at their solved positions. If these missing coordinates are available in another solved conformation, the coordinates from the superimposed and rotated conformation are used as an initial guess as to their likely positions. As written, filling-in of missing non-hydrogen coordinates is necessary for the energy minimization subsystems to work robustly with a large number of PDB files. It also ensures homogenized output of PDB files, which is required by the visual rendering subsystem.
The next step is in the domain of what we refer to as the ‘interpolation engine.’ Once the structures have been homogenized in terms of solved atomic coordinates, interpolation may proceed. Under command of the script, the custom X-PLOR interpolation function is repeatedly called, each time evenly reducing the distance between the current structure and the final structure. When more than two solved conformations are present, the distance between the current structure and a solved intermediate conformation is evenly reduced instead. Each step is followed by a round of energy minimization to correct molecular stereochemistry and enforce rules of chemical reality on the structure. To ensure that the final frames are as accurate as possible, the solved endpoint structures are used for these. When solved intermediates are present, these are inserted as frames at regular intervals. The entire process takes only a few minutes to produce 10 frames on a 500 MHz Intel Pentium III workstation running Linux.
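The stepping scheme can be illustrated with a short sketch for the two-conformation case. It is not the X-PLOR function: coordinates are plain lists of (x, y, z) tuples with matching atom order, the energy minimization round is represented by a caller-supplied function (a no-op placeholder by default), and all names are hypothetical.

```python
def step_toward(current, target, steps_remaining):
    """Move every atom 1/steps_remaining of the way from current to target."""
    return [
        tuple(c + (t - c) / steps_remaining for c, t in zip(atom_c, atom_t))
        for atom_c, atom_t in zip(current, target)
    ]


def no_minimization(coords):
    """Placeholder for the energy minimization round; returns coords unchanged."""
    return coords


def morph(start, end, n_frames, minimize=no_minimization):
    """Return n_frames coordinate sets, with the solved endpoints used as-is."""
    if n_frames < 2:
        raise ValueError("need at least the two endpoint frames")
    frames = [list(start)]
    current = list(start)
    for k in range(n_frames - 2):
        remaining = n_frames - 1 - k          # steps left to reach the endpoint
        current = minimize(step_toward(current, end, remaining))
        frames.append(current)
    frames.append(list(end))                  # final frame is the solved structure
    return frames
```

Because each step starts from the minimized current structure rather than from the original starting structure, a real minimizer can pull intermediates off the straight line while the remaining distance to the endpoint is still divided evenly. With the placeholder minimizer the result reduces to plain linear interpolation.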
There are many possible interpolation strategies, and the choice involves a number of tradeoffs among accuracy, computational resources, time and other factors. For this reason, in addition to our original adiabatic mapping engine, we offer the user two engines based on LSQMAN (45) (one Cartesian-based and another based on internal phi, psi coordinates), which are faster but appear to be less realistic. Users wishing to add their own, non-trivial interpolation engines may contact the authors to make arrangements to do so. For example, a user wishing to analyze a very large number of trajectories (10 000 or more from, e.g., samplings from molecular dynamics simulations) might wish to supply a simplified interpolation engine and make other arrangements to allow the computations to be completed in a reasonable amount of time.
We chose our original technique, known in the literature as adiabatic mapping (47), for reasons of computational efficiency. It is a technique that produces chemically reasonable morphs with a modest amount of computational power and thus is most suitable for a Web-based server. This remains the default interpolation engine for the server. Using this engine, the server can produce a realistic interpolation of a protein and have the results rendered and returned to the user in <3 min on a fast Pentium III machine. Using adiabatic mapping, we have also produced our own morph of the motion in GroEL which, although probably less accurate than the considerably more expensive technique of normal mode analysis (11), is probably good enough for most researchers seeking only a visual representation. Nevertheless, we believe that, for many proteins, most real motions will occur along the interpolated trajectories, and the morph server may be used to predict intermediate conformations should they exist. How close our predicted pathways come to reality is perhaps best answered through the emerging technique of time-resolved X-ray crystallography (48). Thus, an adiabatic mapping engine is much more suited to our goal of automatically interpolating a large percentage of the motions in our database.
With the intermediate conformations morphed, the molecule is now visually rendered. We have written a Perl script that produces VRML 2.0 (Moving 3D Worlds) code (50,51) on-the-fly from the intermediate PDB files. The VRML 2.0 output is suitable for interactively viewing the moving 3D macromolecule in a VRML 2.0 Internet browser, such as SGI CosmoPlayer 2.0. The advantage of the 3D display format is that the remote Internet user may easily choose a preferred orientation and vantage point.
The molecule is also rendered as a 2D movie in the MultiGif, Quicktime, and MPEG formats, as well as an Adobe Portable Document Format (PDF) (52) page showing the individual frames. Remote adjustment of vantage point and orientation is not possible in the simpler 2D video format, so the molecule is rendered with the screw axis perpendicular to the plane of the display device, as was computed during the orientation process. The molecule is rendered in three display types (53): ribbons (with secondary structure indicated), lines (as a simple alpha chain), and ball and stick (showing all individual non-hydrogen atoms). The first two formats are also rendered into a small moving MultiGif icon to give the database user a quickly downloaded preview of the larger movies available.
In the process, key standardized statistics are recorded. These include maximum Cα displacement, rotation angle in degrees around putative hinge regions, sequences of the putative hinge regions, average torsion angle change in the hinge region versus the overall average, distance of the putative hinge region from the screw axis, distance of the screw axis from the centroid, a structural comparison score between the two domains and a number of additional, useful statistics, such as the differences in torsion angles at every aligned position and the pseudo CHARMM/X-PLOR (42) energy at each point in the morph.
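As a small worked example of one such statistic, the sketch below computes the maximum Cα displacement between two conformations that have already been superimposed and placed in residue correspondence. The function name and the returned dictionary keys are hypothetical, and the overall RMSD is included only for comparison. It relies on `math.dist`, available in Python 3.8 and later.

```python
import math


def ca_displacement_stats(start, end):
    """Per-residue C-alpha displacement summary for two superimposed conformations.

    start and end are equal-length lists of (x, y, z) tuples in the same
    residue order, in the same reference frame.
    """
    if len(start) != len(end) or not start:
        raise ValueError("need two non-empty coordinate lists of equal length")
    dists = [math.dist(a, b) for a, b in zip(start, end)]
    max_index = max(range(len(dists)), key=dists.__getitem__)
    rmsd = math.sqrt(sum(d * d for d in dists) / len(dists))
    return {
        "max_displacement": dists[max_index],
        "max_index": max_index,          # 0-based position of the largest mover
        "rmsd": rmsd,
    }
```

The value obtained depends entirely on the superposition used, which is why a common superposition procedure is needed before such numbers can be compared across proteins.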
These statistics are detailed enough to perform an automatic preliminary classification of the motion and determine the location of the hinge relative to the transformed axes. (For example, a large rotation angle indicates a probable hinge motion.) A detailed description of our statistical results is given in Table for five motions. Ranges and averages of some of these statistics after several hundred alignments are given in Table along with similar but sparse statistics culled manually from the scientific literature for comparison.
Comprehensive statistics for alcohol dehydrogenase, recoverin, DNA polymerase beta, GroEL and diphtheria toxin as reported by the server
Comparison of statistics between automatically gathered (server gathered) and manually gathered statistics for maximum Cα displacement and maximum rotation
For example, over approximately 175 motions submitted for analysis, the median motion has a maximum rotation of 9.5° over a range of 0–150° as computed by our algorithm, whereas the 12 motions culled from the scientific literature had an average rotation of 24° over a range of 5–148°. Similarly, our algorithms found a median maximum Cα displacement of 17 Å ranging from 0 to 81 Å for the submitted motions, whereas 11 motions reported in the scientific literature average 12 Å over a range of 1.5–60 Å. Although most of the structures are very similar in sequence, the server has been able to accommodate sequence identity down to 8% for some motions (see Table ). Most motions have at least one large torsion angle change (see Table ).
Morph similarity score statistics
The sparseness of manually culled data in Table is due to the lack of a standardized nomenclature for these statistics in the scientific literature. It is worth noting that a different set of proteins had to be used for each of the manually culled tallies in Table . Because these statistics predate the server, they serve as a manual “gold standard” against which the results of the server may be compared. Table presents a statistical description of motions in the database, a main scientific benefit of the server.
Integration with database
Privacy is a concern with some submissions, so users are afforded the option to either keep their submissions secret until the results have been published or to cause the submission to appear immediately in an index. For each successfully completed morph, the server produces a Web page allowing easy download of the coordinates (as an archive of PDB files or in NMR format) or movies (in a number of video formats), in addition to displaying the molecule in the moving VRML format. The page includes the standardized statistics discussed above generated for the conformations used in the morph. This page may be accessed through a URL containing a special code that is emailed back to the submitting user when the morph is complete; for users seeking to keep their morphs private (for publication reasons), this URL serves as the user’s password, allowing access to the morph page in the server. For public morphs, these pages are also accessible through an index, http://bioinfo.mbb.yale.edu/MolMovDB/movies
The ultimate flow of information is circular. For each motion we either link it via a motion ID to an existing entry in the Macromolecular Motions Database or we generate a new entry in the database. The results of analyzing particular ordered sets of structures (‘strings’ of structures) are entered under an appropriate identifier into the Database of Macromolecular Motions for further reference, and, in many cases, suggest further structures to study and analyze. Each comparison is assigned a unique ID entered into the ‘comparison table’ in the database that references the IDs of the PDB structures involved. These comparisons are, in turn, referenced by entries in the motions database (these references may be generated by comparing the IDs of the PDB structures referenced in each comparison table entry with the PDB structures referenced in each motion table entry). Because many motions in the database are associated with more than two structures, more than one comparison is often possible and some database entries do reference multiple comparisons.
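The cross-referencing by PDB ID can be sketched as follows. The sketch assumes that a comparison is linked to a motion when every structure in the comparison is listed under that motion; this matching rule, the dictionary layout, the function name and the identifiers used in the example are all hypothetical illustrations rather than the database's actual schema.

```python
def link_comparisons(motions, comparisons):
    """Link comparison entries to motion entries by shared PDB IDs.

    motions:     {motion_id: iterable of PDB IDs associated with the motion}
    comparisons: {comparison_id: ordered PDB IDs of the structures compared}
    Returns ({motion_id: [comparison_id, ...]}, [unmatched comparison IDs]).
    """
    motion_sets = {mid: {p.upper() for p in ids} for mid, ids in motions.items()}
    links = {mid: [] for mid in motions}
    unmatched = []
    for cid, pdb_ids in comparisons.items():
        ids = {p.upper() for p in pdb_ids}
        hits = [mid for mid, m_ids in motion_sets.items() if ids <= m_ids]
        for mid in hits:
            links[mid].append(cid)
        if not hits:
            unmatched.append(cid)        # candidate for a new motion entry
    return links, unmatched


# Placeholder identifiers, not real database entries.
motions = {"m1": {"1AAA", "1BBB", "1CCC"}, "m2": {"2AAA", "2BBB"}}
comparisons = {"c1": ("1AAA", "1BBB"), "c2": ("1bbb", "1CCC"), "c3": ("2AAA", "9ZZZ")}
links, unmatched = link_comparisons(motions, comparisons)
```

In this example the motion `m1`, which is associated with three structures, ends up referencing two comparisons, while `c3` matches no existing motion and is returned as unmatched.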
New movies, which lack a motion entry in the Database of Macromolecular Motions, have an entry automatically created with minimal or no annotation. This is indicated in the entry by setting the annotation level to zero. (Annotation levels range from 0 to 10. A level of ‘0’ indicates the entry was automatically created with no human intervention. ‘10’ indicates significant human intervention, typically in the form of a large amount of descriptive text present in the entry.) The user can annotate the new entry using an easy-to-use edit form displayed in his or her Web browser. Existing entries are also editable by the community through the same Web form with prior authorization from the database’s maintainers. All changes are subsequently reviewed by the maintainers to assure quality control. In this way, the Database of Macromolecular Motions is used to classify and organize morphs submitted to the morph server.