To understand basic biological processes such as cell division, cell signaling, development, metabolism, and cell death, detailed knowledge of the three-dimensional structures of the active participants is necessary. Structures of a large number of proteins have been determined experimentally, primarily by X-ray crystallography and NMR spectroscopy. There are currently more than 50,000 entries in the Protein Data Bank (PDB)1, which archives experimentally determined structures of proteins and protein complexes, as well as nucleic acids and other biological macromolecules. However, a list of protein sequences from the PDB in which no two sequences are more than 20% identical to each other contains only about 4000 sequences2. Many of these proteins can be grouped further into about 1000 folds in 2000 superfamilies3.
Since it was first recognized that proteins can share similar structures4, computational methods have been developed to build models of proteins of unknown structure based on related proteins of known structure5. Most such modeling efforts, referred to as homology modeling or comparative modeling, follow a basic protocol: 1) for a target sequence of unknown structure, identify a template
structure with sequence related to the target and align the target sequence to the template sequence and structure; 2) for core secondary structures and all well-conserved parts of the alignment, borrow the backbone coordinates of the template according to the sequence alignment of the target and template; 3) build side chains onto the backbone model according to the target sequence; 4) for segments of the target sequence for which coordinates cannot be borrowed from the template because of insertions and deletions in the alignment (usually in loop regions of the protein) or because of missing coordinates in the template, rebuild these regions using loop modeling methods or other ab initio structure prediction methods; 5) refine the structure, modeling likely differences in the relative positions of α helices, β sheet strands, and other elements of structure.
The identification step necessarily involves sequence alignment, but even once a template has been identified and aligned to the target, a number of different methods may be used to improve the alignment, including fold recognition methods6 and profile-profile alignment7. Manual editing based on visualization of the template structure is frequently used to improve the alignment. Steps 3 and 4, side-chain and backbone modeling, may be coupled, since certain backbone conformations may be unable to accommodate the required side chains in any low-energy conformation. The refinement step involves moving all parts of the structure, including the backbone model produced in Step 2, allowing them to adjust to the new sequence. For instance, two helices packed against each other may move apart to accommodate larger side chains in the target than in the template. Many methods have been proposed to perform each of the steps in the homology modeling process. Other procedures based on reconstructing structures (rather than perturbing a starting structure) by satisfying spatial restraints using distance geometry8 or molecular dynamics and energy minimization9-12 have also been developed. The popular program Modeller is one of these9.
Homology modeling of proteins has been of great value in interpreting the relationships of sequence, structure, and function. In particular, orthologous proteins usually show a pattern of conserved residues that can be interpreted in terms of three-dimensional models of the proteins. Orthologues are genes and proteins in two different organisms that have descended from a single common ancestor without duplication. Conserved residues often form a contiguous active site or interaction surface of the protein, even if they are distant from each other in the sequence. With a structural model, a multiple alignment of orthologous proteins can be interpreted in terms of the constraints that natural selection places on protein folding, stability, dynamics, and function. Paralogues, on the other hand, arise from gene duplication and subsequent divergence of sequence and function. For paralogous proteins, three-dimensional models can be used to interpret the similarities and differences in the sequences in terms of the related structures but usually different functions of the proteins concerned13. In many cases, there are significant insertions, deletions, and amino acid changes in the active or binding site between paralogues. Indeed, homology models can help identify which protein belongs to which functional group by the conservation of important residues in the active or binding site14. A number of groups have used comparative modeling to predict protein function15-19.
Another important use of homology modeling is to understand functional changes due to point mutations in protein sequences that arise either by natural processes or by experimental manipulation. The human genome project has produced significant amounts of data concerning polymorphisms and other mutations potentially related to differences in susceptibility, prognosis, and treatment of human disease. There are now many such examples, including the Factor V/Leiden R506Q mutation20, which causes increased occurrence of thrombosis, mutations in the serine protease HTRA2 associated with Parkinson's disease21, and BRCA1, for which many sequence differences are known, some of which may lead to breast cancer22. At the same time, there are many polymorphisms in important genes that have no discernible effect on those who carry them. At least for some of these, there may be some effect that has yet to be measured in a large enough population of patients, and therefore the risk of cancer, heart disease, or other illness to these patients is unknown. This is yet another important application of homology modeling, since a good model may readily indicate which mutations pose a likely risk and which do not23.
Homology models may also be used in computer-aided drug design, especially when a closely related template structure is available for the target sequence. In such cases, the active site may be sufficiently conserved that a model of the protein provides a reasonable target for computer programs that can suggest the most likely compounds that will bind to the active site. This has been used successfully in the early development of HIV protease inhibitors24,25 and in the development of anti-malarial compounds that target the cysteine protease of P. falciparum26. It has been used recently for high-throughput computational screening of alpha-glucosidase inhibitors for treatment of diabetes27.
In this paper, we provide protocols for using two tools for protein structure prediction. The first is SCWRL, which is a computer program for side-chain prediction28,29. SCWRL is perhaps the most popular program for modeling of side chains (2,636 licenses in 72 countries as of 12 August 2008), and offers a good tradeoff between accuracy and speed as described in a recent independent benchmark30. SCWRL may be used for homology modeling by inputting a sequence different from that of the input backbone coordinates. It may also be used to complete a structure that is missing some or all of its side chains. This may happen if some side chains are not present in a PDB structure because of lack of electron density, and yet it would be useful to have at least a predicted position for these residues. SCWRL can be used to build mutations into proteins by converting one side-chain type into another, although it does not model any changes in backbone structure that might result.
The second program we describe is MolIDE31. MolIDE provides a graphical interface to the basic protocol of homology modeling, except for the final refinement step. That is, it provides for identification and alignment of the target to a template, modeling of the backbone according to this alignment, building of side chains of the target sequence, and loop modeling of insertion-deletion regions. For many purposes, such modeling is sufficient, since it provides rough locations of all amino acids within the protein: whether they are on the surface or buried, whether they are in active sites or not, and their proximity in space versus proximity in sequence. The refinement step is quite difficult and computationally expensive, and may provide little added benefit for many users of homology models. This is an area of active research32.
SCWRL itself has been designed to be an easy-to-use command-line program that builds side chains onto an input backbone structure. The current version, SCWRL3.0 (ref. 29), is based on a graph-theory algorithm that represents the interactions of side chains within a protein as a graph. It then uses the graph and an energy function to determine the lowest energy conformations of all of the side chains. Most side-chain types in proteins have a limited number of discrete conformations referred to as rotamers. These arise from steric repulsions of atoms separated by three or four covalent bonds. The shortest side chains, serine, threonine, valine, and cysteine, have only three available conformations. Leucine and isoleucine have nine conformations, although some of these are high in energy and are rare in proteins. The longer side chains have many possible conformations, although in practice many of these are also high in energy, and many such residues are exposed to the solvent and sample many conformations in a dynamically active structure. SCWRL uses a backbone-dependent rotamer library33,34, which expresses the frequency of rotamers as a function of the backbone dihedral angles φ and ψ for each of the amino acid types. It also contains information on the average side-chain dihedral angles in a backbone-dependent manner. SCWRL3.0 uses the 2002 version of the backbone-dependent rotamer library35. This library was based on 850 chains in the PDB with resolution better than or equal to 1.7 Å. Residues with high B-factors or atomic clashes were removed from the data set, as suggested by Lovell et al.36, and the terminal dihedrals of Asn, Gln, and His residues were flipped if there was a clear hydrogen bond formed by doing so37.
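The rotamer counts quoted above follow from combining three staggered positions per side-chain dihedral. The following Python sketch illustrates this, together with a toy lookup in a backbone-dependent library. It is not SCWRL's code: the idealized angles, the 10° binning, the function names, and the probabilities in `TOY_LIBRARY` are all assumptions made up for illustration.

```python
from itertools import product

# Idealized staggered chi-angle values in degrees (an assumption; real
# libraries store backbone-dependent average angles instead).
STAGGERED = (60.0, 180.0, -60.0)

# Number of three-state side-chain dihedrals for the residue types named
# in the text (illustrative subset only).
N_CHI = {"SER": 1, "THR": 1, "VAL": 1, "CYS": 1, "LEU": 2, "ILE": 2}


def ideal_rotamers(residue):
    """Return every combination of staggered chi angles for a residue type."""
    return list(product(STAGGERED, repeat=N_CHI[residue]))


def most_probable_rotamer(library, residue, phi, psi, bin_width=10):
    """Return the (probability, chi_angles) entry with the highest probability
    in the (phi, psi) bin nearest to the given backbone angles."""
    key = (residue,
           bin_width * round(phi / bin_width),
           bin_width * round(psi / bin_width))
    return max(library[key], key=lambda entry: entry[0])


# Hypothetical library fragment: the probabilities are invented, not taken
# from any published rotamer library.
TOY_LIBRARY = {
    ("SER", -60, -40): [(0.5, (60.0,)), (0.3, (-60.0,)), (0.2, (180.0,))],
}

print(len(ideal_rotamers("SER")), len(ideal_rotamers("LEU")))
print(most_probable_rotamer(TOY_LIBRARY, "SER", -63.0, -41.0))
```

A real library also stores how the average dihedral angles shift with φ and ψ, which this sketch leaves out.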
MolIDE performs homology modeling of single proteins in a visual environment. The basic protocol, described step-by-step below, is shown in (Please note that the description for MolIDE starts at step 5 in the PROCEDURES section). Its first step is to perform a database search with the program PSI-BLAST38. PSI-BLAST is an iterative program that searches a database for sequences similar to the target or query sequence. Usually the sequence database for this search is the non-redundant database (“nr”) of protein sequences provided by the NCBI39; currently this database contains over 6 million sequences. The first iteration of PSI-BLAST is just like the BLAST program, finding the sequences in the database most closely related to the query. PSI-BLAST creates a multiple sequence alignment of the target sequence with these sequences. This multiple sequence alignment is then transformed into a sequence profile, which expresses the frequency of each of the 20 amino acid types in each column of the multiple sequence alignment. That is, for each residue of the query, the profile contains a numerical value for each of the 20 amino acids, depending on how often that residue type is found aligned to the query residue amongst homologues.
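To make the idea of a sequence profile concrete, here is a minimal Python sketch that turns a small multiple sequence alignment into per-column amino acid frequencies. It is a simplification under stated assumptions: PSI-BLAST itself applies sequence weighting and pseudocounts, which are omitted here, and the function name and the toy alignment are invented for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(alignment):
    """Return one {amino acid: frequency} dict per alignment column.

    Gap characters and any non-standard symbols are ignored when counting.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(row[col] for row in alignment if row[col] in AMINO_ACIDS)
        total = sum(counts.values())
        profile.append(
            {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}
        )
    return profile


# Toy alignment (invented): the query is the first row.
msa = ["GAVL", "GSIL", "GTV-", "GALM"]
profile = column_frequencies(msa)
print(profile[0]["G"], profile[1]["A"], profile[2]["V"])
```

The first column of the toy alignment is invariant glycine, so its frequency is 1.0, while the second column is split among small residues.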
The second iteration of PSI-BLAST searches the database with the profile instead of just the query sequence, and scores each sequence in the database by how well it aligns to the profile. For instance, if the profile shows that a particular position in the multiple sequence alignment is 100% glycine, then the profile will score glycine in a database sequence very highly, and all other residue types either neutrally or negatively. If another column is a mixture of hydrophobic residues, then all hydrophobic residues will be scored positively but charged residues will be scored quite negatively. Some positions in the multiple sequence alignment contain most or all of the 20 amino acids, and such positions will score all of the amino acids neutrally. The second iteration will produce a new list of hits. Many of these will be the same as in the first round, but the alignments may be different. True hits will usually have longer alignments, and the expectation values or E-values will be much better. An E-value for a hit is the number of hits expected by chance with a raw score the same as or better than that of the hit; thus a very small E-value (<0.001) indicates a high statistical significance. In the second round, many new hits with good E-values will be found that were not found in the first round. PSI-BLAST then builds a new multiple sequence alignment and profile from the hits in the second iteration, and then performs a search with this profile. The number of iterations is controlled by the user from within MolIDE. We have added to the most recent version of MolIDE (version 1.6) the ability to open the PSI-BLAST output file from the search against the nr database. This produces a table that is sortable by E-value, sequence identity, starting and stopping residues of the alignment, protein name, and species. As such, MolIDE provides a graphical interface for PSI-BLAST searches of large databases such as nr, in addition to tools for structure prediction.
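The scoring behaviour described for invariant, hydrophobic, and highly variable columns can be illustrated with a simple log-odds calculation. The Python sketch below is not PSI-BLAST's actual scoring scheme: the uniform background frequency, the flat pseudocount, and the function names are assumptions chosen to keep the example small.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # uniform background, an assumption


def log_odds_column(frequencies, pseudocount=0.01):
    """Convert one column of observed frequencies into log2-odds scores.

    A flat pseudocount keeps unobserved residues from scoring minus infinity.
    """
    norm = 1.0 + pseudocount * len(AMINO_ACIDS)
    return {
        aa: math.log2(((frequencies.get(aa, 0.0) + pseudocount) / norm) / BACKGROUND)
        for aa in AMINO_ACIDS
    }


def score_segment(pssm, segment):
    """Sum per-position scores of an ungapped segment aligned to the profile."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


glycine_column = log_odds_column({"G": 1.0})
hydrophobic_column = log_odds_column({"V": 0.4, "I": 0.3, "L": 0.3})
variable_column = log_odds_column({aa: 0.05 for aa in AMINO_ACIDS})
pssm = [glycine_column, hydrophobic_column, variable_column]

print(score_segment(pssm, "GVK") > score_segment(pssm, "DDK"))
```

In this toy profile, glycine scores positively only in the invariant column, hydrophobic residues score positively in the mixed hydrophobic column while aspartate scores negatively, and every residue scores approximately zero in the fully variable column.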
Once PSI-BLAST has created these profiles, each profile is used to search a database of sequences of proteins of known structure, called pdbaa. We provide an updated copy of this database on our website each week as new structures are added to the PDB2. This PSI-BLAST search of pdbaa runs automatically after the search of nr that creates the profiles. The resulting files, one for each round of PSI-BLAST search of nr, contain lists of possible template structures for modeling the target sequence as well as alignments between the target and the template sequences. The version of PSI-BLAST provided in MolIDE is modified from that provided by NCBI. In particular, it outputs a profile matrix for each round of PSI-BLAST with a different filename (e.g. file1, file2, etc.) rather than overwriting a single filename. It also outputs a profile after the last search round of PSI-BLAST, while NCBI's version does not.
It is helpful at this point to perform a prediction of secondary structure within MolIDE using PSIPRED40. The PSIPRED predictions can be used to identify folded domains of the protein, in regions where there are significant amounts of predicted secondary structure, and disordered regions, where little secondary structure is predicted. Also, the PSIPRED predictions will later be displayed in the alignments of the target sequence to a template. In this situation, they can be used to help determine whether the template is correct (at least some agreement of predicted and experimental secondary structure is expected) and to identify regions that may be misaligned (poor agreement of major secondary structure elements). PSIPRED uses the sequence profiles produced by PSI-BLAST to predict the positions of α helices and β strands in the target sequence.
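The comparison of predicted and experimental secondary structure can be made quantitative in a simple way. The Python sketch below assumes both are given as aligned strings over H (helix), E (strand), and C (coil), with '-' for alignment gaps. The three-state encoding, the function names, and the minimum run length are assumptions for illustration and do not describe how MolIDE displays or evaluates the alignment.

```python
def secondary_structure_agreement(predicted, experimental):
    """Fraction of aligned, ungapped positions where the two states match."""
    if len(predicted) != len(experimental):
        raise ValueError("strings must be aligned to the same length")
    pairs = [(p, e) for p, e in zip(predicted, experimental)
             if p != "-" and e != "-"]
    if not pairs:
        return 0.0
    return sum(p == e for p, e in pairs) / len(pairs)


def mismatched_elements(predicted, experimental, min_length=4):
    """Return (first, last) column ranges where a helix or strand in one
    string disagrees with the other for at least min_length columns."""
    runs, start = [], None
    for i, (p, e) in enumerate(zip(predicted, experimental)):
        bad = p != e and "-" not in (p, e) and (p in "HE" or e in "HE")
        if bad and start is None:
            start = i
        elif not bad and start is not None:
            if i - start >= min_length:
                runs.append((start, i - 1))
            start = None
    if start is not None and len(predicted) - start >= min_length:
        runs.append((start, len(predicted) - 1))
    return runs


predicted = "CHHHHHHCCEEEEEC"     # hypothetical prediction for the target
experimental = "CHHHHHCCCHHHHHC"  # hypothetical template assignment
print(secondary_structure_agreement(predicted, experimental))
print(mismatched_elements(predicted, experimental))
```

Long mismatched runs, such as a predicted strand aligned to a template helix, are the regions worth inspecting for possible misalignment.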
The next step in modeling is to choose a template by opening the list of hits from the PDB. MolIDE parses each PSI-BLAST output file and creates a table of hits, including their experimental source (XRAY or NMR), the resolution of X-ray structures, the E-value, sequence identity, the beginning and ending points in the target sequence, and the alignment length. This table is sortable by any of these elements, which provides a rapid way of finding the highest resolution structure or the one with the highest sequence identity. Quite frequently, a multi-domain protein may have homologues of known structure that cover non-overlapping regions of the target sequence. The table can be sorted by beginning or ending residue numbers of the target sequence in order to locate templates that cover the domain of interest among a list of possibly hundreds of hits. The template for modeling is chosen based on the best combination of a number of factors. First, it must cover the region or regions of interest in the target protein. From among those hits, usually the template with the best E-value or highest sequence identity is chosen. When there are a number of templates with about the same evolutionary distance from the target sequence, the one with highest resolution is usually the best choice of template. Another consideration may be the number and positions of gaps in the sequence alignment. Often one template or another may contain a ligand or binding partner of interest while others will not. This ligand may be nucleic acid, small organic molecules, ions, or other proteins. Our database program ProtBuD41 can be used to search within a protein family for particular ligands. Structures of such complexes may be used to identify important binding residues in the model of the target.
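The selection criteria just described (coverage of the region of interest, then evolutionary closeness, then resolution) can be sketched as a filter followed by a sort. The Python example below is one possible reading, not MolIDE's implementation. The `Hit` fields, the 5-percentage-point identity bands used to approximate "about the same evolutionary distance", and the template identifiers are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    template_id: str
    method: str                  # "XRAY" or "NMR"
    resolution: Optional[float]  # in angstroms; None when not applicable
    e_value: float
    identity: float              # percent sequence identity
    target_start: int            # first aligned target residue
    target_end: int              # last aligned target residue


def covers(hit: Hit, start: int, end: int) -> bool:
    """True if the hit's alignment spans the whole region of interest."""
    return hit.target_start <= start and hit.target_end >= end


def rank_templates(hits: List[Hit], start: int, end: int) -> List[Hit]:
    """Keep hits covering the region, then order them by identity band
    (assumed 5-point bands), resolution, and finally E-value."""
    candidates = [h for h in hits if covers(h, start, end)]
    return sorted(
        candidates,
        key=lambda h: (
            -(h.identity // 5),
            h.resolution if h.resolution is not None else float("inf"),
            h.e_value,
        ),
    )


hits = [
    Hit("tmplA", "XRAY", 2.5, 1e-40, 42.0, 10, 250),
    Hit("tmplB", "XRAY", 1.6, 1e-38, 41.0, 5, 260),
    Hit("tmplC", "NMR", None, 1e-45, 44.0, 1, 120),
    Hit("tmplD", "XRAY", 1.9, 1e-10, 24.0, 1, 300),
]
print([h.template_id for h in rank_templates(hits, 20, 240)])
```

Criteria such as the presence of a ligand of interest or the placement of gaps are not captured by this ordering and still call for inspection by the user.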
With a simple click within the list of templates, MolIDE downloads the template structure from the PDB and then displays the structure of the template and the sequence alignment of the target and template. The predicted secondary structure of the target is shown above the target sequence and the experimental secondary structure of the template is shown below the template sequence. At this point, the alignment can be manually edited. PSI-BLAST alignments are reasonably accurate at sequence identities above 30%42, but even then the gaps that represent insertions or deletions in the alignment may be placed within regular secondary structures of the template. The visualization tool within MolIDE allows the user to see the placement of deletions from the structure (deleted residues marked by red balls) and insertions into the structure (marked by yellow balls at the point of insertion). The gaps can be moved within the alignment as described in the protocol and the changes are marked in real time on the structure. For deletions from the structure, it is best if the end points of the deletion are relatively close to one another in space on the template structure.
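As an illustration of how insertions and deletions can be located in a pairwise alignment, and how the end points of a deletion can be checked for spatial proximity, consider the following Python sketch. It assumes the alignment is given as two equal-length gapped strings and that one Cα coordinate is available per template residue. The function names, the toy alignment, and the evenly spaced coordinates are invented for illustration.

```python
import math


def find_gaps(target_aln, template_aln):
    """Return (insertions, deletions) as lists of (first_col, last_col).

    Insertion: target residues aligned to '-' in the template.
    Deletion: template residues aligned to '-' in the target.
    """
    insertions, deletions = [], []
    for gapped, found in ((template_aln, insertions), (target_aln, deletions)):
        start = None
        for i, ch in enumerate(gapped + "X"):  # sentinel closes a trailing gap
            if ch == "-" and start is None:
                start = i
            elif ch != "-" and start is not None:
                found.append((start, i - 1))
                start = None
    return insertions, deletions


def deletion_span(deletion, template_aln, ca_coords):
    """Distance between the template CA atoms flanking a deletion, or None
    if the deletion lies at a terminus of the template."""
    first, last = deletion
    left = len(template_aln[:first].replace("-", "")) - 1
    right = len(template_aln[:last + 1].replace("-", ""))
    if left < 0 or right >= len(ca_coords):
        return None
    return math.dist(ca_coords[left], ca_coords[right])


target_aln = "MKT--LVAGGSDE"
template_aln = "MKTPELV--GSDE"
# Hypothetical coordinates: template CA atoms on a straight line, 3.8 A apart.
ca_coords = [(3.8 * i, 0.0, 0.0) for i in range(11)]

insertions, deletions = find_gaps(target_aln, template_aln)
print(insertions, deletions)
print(deletion_span(deletions[0], template_aln, ca_coords))
```

The smaller the distance between the flanking residues, the less the backbone has to be rearranged to close the gap left by the deleted residues.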
Once the alignment has been edited, the modeling process consists of three steps performed within the MolIDE graphical interface. The first is simply to copy the backbone coordinates to a new file and renumber the sequence and the residue names to the target sequence according to the alignment. Only aligned residues are copied. If the residue types are the same at a given position in the target and template sequences, the side-chain coordinates are also copied to the new file. If the template PDB contains modified residues (selenomethionine or phosphorylated residues), only the backbone is copied. The second step is to run SCWRL to build the side chains of the target sequence onto the model backbone. SCWRL is able to preserve the Cartesian coordinates of side chains that are conserved, and this generally results in more accurate side-chain prediction in homology modeling. The third step is to model each loop in turn. MolIDE uses the program Loopy43, which is a relatively fast program for loop structure prediction. It is one of the few stand-alone loop-modeling programs. To model the first loop, the user selects positions for the left and right anchors of the region to be remodeled. These are positions in the structure that will be kept fixed while the residues in between will be modeled by Loopy. The left anchor can be chosen as the last residue of the secondary structure preceding the gap, while the right anchor is the first residue of the secondary structure following the gap. Alternatively, longer or shorter regions may be chosen in order to choose the anchor positions as the closest conserved residues to the gap. Once the anchors are set, the loop can be modeled with a click. The anchors are then set around the next gap in the same manner and so on, until all the insertion and deletion regions have been remodeled with Loopy.
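The first of these steps, copying aligned backbone coordinates and renumbering them according to the target sequence, can be sketched in a few lines of Python. This is a simplified illustration, not MolIDE's code: the in-memory representation of residues, the function names, and the toy alignment are assumptions, and no PDB file parsing or writing is attempted.

```python
BACKBONE = {"N", "CA", "C", "O"}

ONE_TO_THREE = {
    "A": "ALA", "C": "CYS", "D": "ASP", "E": "GLU", "F": "PHE",
    "G": "GLY", "H": "HIS", "I": "ILE", "K": "LYS", "L": "LEU",
    "M": "MET", "N": "ASN", "P": "PRO", "Q": "GLN", "R": "ARG",
    "S": "SER", "T": "THR", "V": "VAL", "W": "TRP", "Y": "TYR",
}


def build_backbone_model(target_aln, template_aln, template_residues):
    """Copy coordinates of aligned template residues onto the target numbering.

    template_residues is a list of (three_letter_name, {atom: (x, y, z)}) in
    sequence order. Returns a list of (target_number, target_letter, atoms).
    Side-chain atoms are kept only when the residue type is conserved, so a
    modified residue such as MSE contributes its backbone only.
    """
    model = []
    t_index = 0  # position in template_residues
    number = 0   # 1-based target residue number
    for tgt, tpl in zip(target_aln, template_aln):
        if tgt != "-":
            number += 1
        if tpl != "-":
            name, atoms = template_residues[t_index]
            t_index += 1
        if tgt == "-" or tpl == "-":
            continue  # only aligned residues are copied
        conserved = ONE_TO_THREE.get(tgt) == name
        kept = {a: xyz for a, xyz in atoms.items() if conserved or a in BACKBONE}
        model.append((number, tgt, kept))
    return model


def dummy_atoms(*names):
    """Placeholder coordinates for the toy example."""
    return {n: (0.0, 0.0, 0.0) for n in names}


template_residues = [
    ("ALA", dummy_atoms("N", "CA", "C", "O", "CB")),
    ("SER", dummy_atoms("N", "CA", "C", "O", "CB", "OG")),
    ("THR", dummy_atoms("N", "CA", "C", "O", "CB", "OG1", "CG2")),
    ("MSE", dummy_atoms("N", "CA", "C", "O", "CB", "CG", "SE", "CE")),
]
model = build_backbone_model("AGS-M", "A-STM", template_residues)
print([(num, aa, sorted(atoms)) for num, aa, atoms in model])
```

In the toy alignment, target residue 2 has no template partner and is therefore absent from the model; such positions are the ones left for loop modeling between the chosen anchors.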
If a protein is very long (>500 amino acids) and contains multiple domains and long disordered regions, it is sometimes helpful to use target sequences of the single domains of interest. Many proteins contain long regions that are intrinsically disordered. Several web servers are available to predict these regions, including DISOPRED44. Such regions can be removed from the target sequence before the PSI-BLAST search step.
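Trimming predicted disordered regions from a target sequence is straightforward once the regions are known. The Python sketch below assumes the disorder prediction has already been reduced to a list of 1-based inclusive residue ranges. It does not parse the output of DISOPRED or any other server, and the function name, the minimum segment length, and the toy sequence are assumptions.

```python
def ordered_segments(sequence, disordered, min_length=30):
    """Split a sequence into ordered segments outside the disordered ranges.

    disordered is a list of 1-based inclusive (start, end) ranges. Returns a
    list of (start, end, subsequence) tuples, also 1-based and inclusive,
    keeping only segments of at least min_length residues.
    """
    mask = [False] * len(sequence)
    for start, end in disordered:
        for i in range(max(start, 1) - 1, min(end, len(sequence))):
            mask[i] = True
    segments, seg_start = [], None
    for i, flagged in enumerate(mask + [True]):  # sentinel closes last segment
        if not flagged and seg_start is None:
            seg_start = i
        elif flagged and seg_start is not None:
            if i - seg_start >= min_length:
                segments.append((seg_start + 1, i, sequence[seg_start:i]))
            seg_start = None
    return segments


# Toy sequence (invented): two ordered stretches separated by a linker.
sequence = "A" * 40 + "S" * 20 + "L" * 50
for start, end, sub in ordered_segments(sequence, [(41, 60)]):
    print(start, end, len(sub))
```

Each returned segment can then be submitted as a separate target sequence, with the original residue numbers retained for later reference.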
The homology modeling procedure provided by MolIDE is simple and straightforward. It produces models assuming that aligned regions of the target to the template do not change backbone conformation. While this is not in general true, it is a reasonable approximation when the sequence identity is above 30%. Even below this value, the model is still useful in understanding the relative positions of residues in the protein. In any sequence alignment, some regions are more conserved than others, and these regions usually have functional or structural significance. Thus, even such a simple modeling procedure will predict whether residues are on the surface or buried: mutation of buried residues may lead to unfolding, while mutations on the surface may abolish binding to other molecules. For protein-protein interactions, a patch of conserved surface residues may be close together on the surface but far apart in the sequence. Such a conserved patch may be used to locate likely binding surfaces, which can be tested using site-directed mutagenesis.
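The observation that conserved residues may be far apart in sequence but close together in space can be checked directly on a model. The Python sketch below lists pairs of conserved residues that satisfy both conditions, using Cα coordinates only. The distance cutoff, the minimum sequence separation, the function name, and the coordinates are assumptions chosen for illustration, and no test for surface exposure is included.

```python
import math
from itertools import combinations


def conserved_contacts(ca_coords, conserved, max_distance=8.0, min_separation=10):
    """Return pairs of conserved residues that are distant in sequence but
    close in space.

    ca_coords maps residue number to an (x, y, z) CA coordinate in angstroms.
    """
    pairs = []
    for i, j in combinations(sorted(conserved), 2):
        far_in_sequence = j - i >= min_separation
        close_in_space = math.dist(ca_coords[i], ca_coords[j]) <= max_distance
        if far_in_sequence and close_in_space:
            pairs.append((i, j))
    return pairs


# Hypothetical CA coordinates for four conserved residues of a model.
ca_coords = {
    5: (0.0, 0.0, 0.0),
    7: (3.0, 0.0, 0.0),
    60: (4.0, 3.0, 0.0),
    120: (30.0, 0.0, 0.0),
}
print(conserved_contacts(ca_coords, [5, 7, 60, 120]))
```

Residues that appear in several such pairs are natural candidates for a conserved patch to be tested by site-directed mutagenesis.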
It should be noted that there are web servers that will also produce homology models, including SwissModel45, as well as databases of models, including ModBase46. These are certainly valid alternatives to performing homology modeling with SCWRL and MolIDE. However, there are many choices in homology modeling that a user may wish to make with particular biological questions in mind, and MolIDE is designed to allow the user to make these choices while handling the detailed computational steps with a few clicks. This is especially true in the choice of template. Some templates may be preferred because they contain ligands of interest, including other proteins, small molecules, or nucleic acids. Some templates may have better conservation near residues that the user is interested in, for instance given existing mutation data. Other templates may have fewer insertions or deletions in regions of interest, for instance near an interface. Further, MolIDE provides user interaction during the model-building process that may be highly beneficial. This is especially true of manual editing of the target-template alignment using additional information that the user may possess, including multiple sources for target-template alignment and visualization of the template structure. Choosing the positions where loop modeling begins and ends (the loop anchors) by visualizing the structure may also lead to better structure predictions.