SCWRL and MolIDE are software applications for prediction of protein structures. SCWRL is designed specifically for the task of prediction of side-chain conformations given a fixed backbone usually obtained from an experimental structure determined by X-ray crystallography or NMR. SCWRL is a command-line program that typically runs in a few seconds. MolIDE provides a graphical interface for basic comparative (homology) modeling using SCWRL and other programs. MolIDE takes an input target sequence, and uses PSI-BLAST to identify and align templates for comparative modeling of the target. The sequence alignment to any template can be manually modified within a graphical window of the target-template alignment and visualization of the alignment on the template structure. MolIDE builds the model of the target structure based on the template backbone, predicted side-chain conformations with SCWRL, and a loop-modeling program for insertion-deletion regions with user-selected sequence segments. SCWRL and MolIDE can be obtained at http://dunbrack.fccc.edu/Software.php.
To understand basic biological processes such as cell division, cell signaling, development, metabolism, and cell death, detailed knowledge of the three-dimensional structures of the active participants is necessary. Structures of a large number of proteins have been determined experimentally, primarily by X-ray crystallography and NMR spectroscopy. There are currently more than 50,000 entries in the Protein Data Bank (PDB)1, which archives experimentally determined structures of proteins and protein complexes, as well as nucleic acids and other biological macromolecules. However, a list of protein sequences from the PDB in which no two sequences are more than 20% identical to each other contains only about 4000 sequences2. Many of these proteins can be grouped further into about 1000 folds in 2000 superfamilies3.
Since it was first recognized that proteins can share similar structures4, computational methods have been developed to build models of proteins of unknown structure based on related proteins of known structure5. Most such modeling efforts, referred to as homology modeling or comparative modeling, follow a basic protocol: 1) for a target sequence of unknown structure, identify a template structure with sequence related to the target and align the target sequence to the template sequence and structure; 2) for core secondary structures and all well-conserved parts of the alignment, borrow the backbone coordinates of the template according to the sequence alignment of the target and template; 3) build side chains onto the backbone model according to the target sequence; 4) for segments of the target sequence for which coordinates cannot be borrowed from the template because of insertions and deletions in the alignment (usually in loop regions of the protein) or because of missing coordinates in the template, rebuild these regions using loop modeling methods or other ab initio structure prediction methods; 5) refine the structure, modeling likely differences in the relative positions of α helices, β sheet strands, and other elements of structure.
The identification step necessarily involves sequence alignment, but even once a template has been identified and aligned to the target, a number of different methods may be used to improve the alignment, including fold recognition methods6 and profile-profile alignment7. Manual editing based on visualization of the template structure is frequently used to improve the alignment. Steps 3 and 4, side-chain and backbone modeling, may be coupled, since certain backbone conformations may be unable to accommodate the required side chains in any low-energy conformation. The refinement step involves moving all parts of the structure, including the backbone model produced in Step 2, allowing them to adjust to the new sequence. For instance, two helices packed against each other may move apart to accommodate larger side chains in the target than in the template. Many methods have been proposed to perform each of the steps in the homology modeling process. Other procedures based on reconstructing structures (rather than perturbing a starting structure) by satisfying spatial restraints using distance geometry8 or molecular dynamics and energy minimization9-12 have also been developed. The popular program Modeller is one of these9.
Homology modeling of proteins has been of great value in interpreting the relationships of sequence, structure, and function. In particular, orthologous proteins usually show a pattern of conserved residues that can be interpreted in terms of three-dimensional models of the proteins. Orthologues are genes and proteins in two different organisms that have descended from a single common ancestor without duplication. Conserved residues often form a contiguous active site or interaction surface of the protein, even if they are distant from each other in the sequence. With a structural model, a multiple alignment of orthologous proteins can be interpreted in terms of the constraints that natural selection places on protein folding, stability, dynamics, and function. Paralogues, on the other hand, arise from gene duplication and subsequent divergence of sequence and function. For paralogous proteins, three-dimensional models can be used to interpret the similarities and differences in the sequences in terms of the related structure but usually different functions of the proteins concerned.13 In many cases, there are significant insertions and deletions and amino acid changes in the active or binding site between paralogues. Indeed, homology models can help identify which protein belongs to which functional group by the conservation of important residues in the active or binding site14. A number of groups have used comparative modeling to predict protein function.15-19
Another important use of homology modeling is to understand functional changes due to point mutations in protein sequences that arise either by natural processes or by experimental manipulation. The human genome project has produced significant amounts of data concerning polymorphisms and other mutations potentially related to differences in susceptibility, prognosis, and treatment of human disease. There are now many such examples, including the Factor V/Leiden R506Q mutation20 that causes increased occurrence of thrombosis, mutations in the serine protease HTRA2 associated with Parkinson's disease21, and BRCA1 for which many sequence differences are known, some of which may lead to breast cancer22. At the same time, there are many polymorphisms in important genes that have no discernible effect on those who carry them. At least for some of these, there may be some effect that has yet to be measured in a large enough population of patients, and therefore the risk of cancer, heart disease, or other illness to these patients is unknown. This is yet another important application of homology modeling, since a good model may indicate readily which mutations pose a likely risk and which do not23.
Homology models may also be used in computer-aided drug design, especially when a closely related template structure is available for the target sequence. In such cases, the active site may be sufficiently conserved such that a model of the protein provides a reasonable target for computer programs that can suggest the most likely compounds that will bind to the active site. This has been used successfully in the early development of HIV protease inhibitors24,25 and in the development of anti-malarial compounds that target the cysteine protease of P. falciparum26. It has been used recently for high-throughput computational screening of alpha glucosidase inhibitors for treatment of diabetes27.
In this paper, we provide protocols for using two tools for protein structure prediction. The first is SCWRL, which is a computer program for side-chain prediction28,29. SCWRL is perhaps the most popular program for modeling of side chains (2,636 licenses in 72 countries as of 12 August 2008), and offers a good tradeoff between accuracy and speed as described in a recent independent benchmark30. SCWRL may be used for homology modeling by inputting a sequence different from the input backbone coordinates. It may also be used to complete a structure that is missing some or all of its side chains. This may happen if some side chains are not present in a PDB structure because of lack of electron density, and yet it would be useful to have at least a predicted position for these residues. SCWRL can be used to build mutations into proteins by converting one side-chain type into another, although it does not model any changes in backbone structure that might result.
The second program we describe is MolIDE31. MolIDE provides a graphical interface to the basic protocol of homology modeling, except for the final refinement step. That is, it provides for identification and alignment of the target to a template, modeling of the backbone according to this alignment, building of side chains of the target sequence, and loop modeling of insertion-deletion regions. For many purposes, such modeling is sufficient, since it provides rough locations of all amino acids within the protein: whether they are on the surface or buried, in active sites or not, and proximity in space versus proximity in sequence, etc. The refinement step is quite difficult and computationally expensive, and may provide little added benefit for many users of homology models. This is an area of active research32.
SCWRL itself has been designed to be an easy-to-use command-line program that builds side chains onto an input backbone structure. The current version, SCWRL3.029, is based on a graph-theory algorithm that represents the interactions of side chains within a protein as a graph. It then uses the graph and an energy function to determine the lowest energy conformations of all of the side chains. Most side-chain types in proteins have a limited number of discrete conformations referred to as rotamers. These arise from steric repulsions of atoms separated by three or four covalent bonds. The shortest side chains, serine, threonine, valine, and cysteine, have only three available conformations. Leucine and isoleucine have nine conformations, although some of these are high in energy and are rare in proteins. The longer side chains have many possible conformations, although in practice many of these are also high in energy and many such residues are exposed to the solvent and sample many conformations in a dynamically active structure. SCWRL uses a backbone-dependent rotamer library33,34, which expresses the frequency of rotamers as a function of the backbone dihedral angles ϕ and Ψ for each of the amino acid types. It also contains information on the average side-chain dihedral angles in a backbone-dependent manner. SCWRL3.0 uses the 2002 version of the backbone-dependent rotamer library35. This library was based on 850 chains in the PDB with resolution better than or equal to 1.7 Å. Residues with high B-factors or atomic clashes were removed from the data set, as suggested by Lovell et al.36, and the terminal dihedrals of Asn, Gln, and His residues were flipped if there was a clear hydrogen bond formed by doing so37.
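The rotamer counts quoted above follow from simple combinatorics: if each side-chain dihedral angle about a bond between two tetrahedral atoms can adopt roughly three staggered positions, a side chain with n such dihedrals has 3^n nominal rotamers. The following Python sketch enumerates them. The idealized angle values and the small `CHI_COUNT` table are illustrative assumptions; they are not taken from the backbone-dependent rotamer library, which stores backbone-dependent frequencies and average angles rather than idealized values.

```python
from itertools import product

# Idealized staggered positions for a dihedral angle, in degrees (assumed values).
STAGGERED = (-60.0, 180.0, 60.0)

# Number of side-chain dihedrals for a few residue types (illustrative subset).
CHI_COUNT = {"SER": 1, "THR": 1, "VAL": 1, "CYS": 1, "LEU": 2, "ILE": 2}


def nominal_rotamers(residue):
    """Return every combination of staggered chi angles for one residue type."""
    return list(product(STAGGERED, repeat=CHI_COUNT[residue]))


for name in ("VAL", "LEU"):
    # One chi angle gives 3 combinations; two chi angles give 3 * 3 = 9.
    print(name, len(nominal_rotamers(name)))
```

In a real structure, many of these nominal combinations are rare because they are high in energy, which is why a frequency-based library is more useful than a plain enumeration.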
MolIDE performs homology modeling of single proteins in a visual environment. The basic protocol, described step-by-step below, is shown in Figure 1 (Please note that the description for MolIDE starts at step 5 in the PROCEDURES section). Its first step is to perform a database search with the program PSI-BLAST38. PSI-BLAST is an iterative program that searches a database for sequences similar to the target or query sequence. Usually the sequence database for this search is the non-redundant database (“nr”) of protein sequences provided by the NCBI39; currently this database contains over 6 million sequences. The first iteration of PSI-BLAST works just like the BLAST program, finding the sequences in the database most closely related to the query. PSI-BLAST creates a multiple sequence alignment of the target sequence with these sequences. This multiple sequence alignment is then transformed into a sequence profile, which expresses the frequency of each of the 20 amino acid types in each column of the multiple sequence alignment. That is, for each residue of the query, the profile contains a numerical value for each of the 20 amino acids, depending on how often that residue type is found aligned to the query residue amongst homologues.
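The idea of a sequence profile can be illustrated with a minimal Python sketch that computes raw per-column amino acid frequencies from a toy alignment. This is only an illustration: PSI-BLAST itself applies sequence weighting and pseudocounts, and the function name and toy sequences here are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(alignment):
    """Return, for each alignment column, the fraction of each amino acid type.

    Gap characters ('-') are ignored when computing the fractions.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        counts = Counter(row[col] for row in alignment if row[col] in AMINO_ACIDS)
        total = sum(counts.values())
        profile.append(
            {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}
        )
    return profile


# Toy alignment: the query in the first row, three hypothetical homologues below it.
toy_alignment = ["GAVL", "GSIL", "GAL-", "GTVM"]
profile = column_frequencies(toy_alignment)
print(profile[0]["G"], profile[1]["A"])
```

The first column is an invariant glycine, so its glycine frequency is 1.0, while the second column is split among alanine, serine, and threonine.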
The second iteration of PSI-BLAST searches the database with the profile instead of just the query sequence, and scores each sequence in the database by how well it aligns to the profile. For instance, if the profile shows that a particular position in the multiple sequence alignment is 100% glycine, then the profile will score glycine in a database sequence very highly, and all other residue types either neutrally or negatively. If another column is a mixture of hydrophobic residues, then all hydrophobic residues will be scored positively but charged residues will be scored quite negatively. Some positions in the multiple sequence alignment contain most or all of the 20 amino acids, and such positions will score all of the amino acids neutrally. The second iteration will produce a new list of hits. Many of these will be the same as in the first round, but the alignments may be different. True hits will usually have longer alignments, and the expectation values or E-values will be much better. An E-value for a hit is the number of hits expected with a raw score the same as or better than the hit; thus a very small E-value (<0.001) indicates a high statistical significance. In the second round, many new hits with good E-values will be found that were not found in the first round. PSI-BLAST then builds a new multiple sequence alignment and profile from the hits in the second iteration, and then performs a search with this profile. The number of iterations is controlled by the user from within MolIDE. We have added to the most recent version of MolIDE (version 1.6) the ability to open the PSI-BLAST output file against the nr database. This produces a table that is sortable by E-value, sequence identity, starting and stopping residues of the alignment, protein name, and species. As such, MolIDE provides a graphical interface for PSI-BLAST searches of large databases such as nr, in addition to tools for structure prediction.
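The scoring behavior described above (high scores for residues that dominate a column, neutral scores where a residue occurs at about its background frequency, negative scores otherwise) can be sketched with a simple log-odds calculation. This is an illustrative simplification and not PSI-BLAST's actual scoring scheme: the uniform background frequency, the pseudocount value, and the function names are all assumptions, and gaps are not handled.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(AMINO_ACIDS)  # uniform background frequency (an assumption)


def log_odds(observed_fraction, pseudocount=0.01):
    """Log2 ratio of a smoothed column frequency to the background frequency."""
    smoothed = (observed_fraction + pseudocount) / (
        1.0 + pseudocount * len(AMINO_ACIDS)
    )
    return math.log2(smoothed / BACKGROUND)


def score_sequence(profile, sequence):
    """Sum the per-column log-odds scores for an ungapped sequence."""
    if len(profile) != len(sequence):
        raise ValueError("profile and sequence must have the same length")
    return sum(
        log_odds(column.get(aa, 0.0)) for column, aa in zip(profile, sequence)
    )


# Column 1 is an invariant glycine; column 2 is a mixture of hydrophobic residues.
toy_profile = [
    {"G": 1.0},
    {"V": 0.4, "I": 0.3, "L": 0.3},
]
print(score_sequence(toy_profile, "GV"))  # matches the profile: positive score
print(score_sequence(toy_profile, "DK"))  # charged residues: negative score
```

A residue observed at exactly the background frequency receives a score of approximately zero, which corresponds to the neutral scoring of highly variable columns.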
Once PSI-BLAST has created these profiles, each profile is used to search a database of sequences of proteins of known structure, called pdbaa. We provide access to this database from our website each week as new structures are added to the PDB2. This PSI-BLAST of the PDB runs automatically after the search of nr to create the profiles. The resulting files, one for each round of PSI-BLAST search of nr, contain lists of possible template structures for modeling the target sequence as well as alignments between the target and the template sequences. The version of PSI-BLAST provided in MolIDE is modified from that provided by NCBI. In particular, it outputs a profile matrix for each round of PSI-BLAST with a different filename (e.g. file1, file2, etc.) rather than overwriting a single filename. It also outputs a profile after the last search round of PSI-BLAST, while NCBI's version does not.
It is helpful at this point to perform a prediction of secondary structure within MolIDE using PSIPRED40. The PSIPRED predictions can be used to identify folded domains of the proteins in regions where there are significant amounts of predicted secondary structure, and disordered regions where little secondary structure is predicted. Also, the PSIPRED predictions will later be displayed in the alignments of the target sequence to a template. In this situation, they can be used to help determine whether the template is correct (at least some agreement of predicted and experimental secondary structure is expected) and to identify regions that may be misaligned (poor agreement of major secondary structure elements). PSIPRED uses the sequence profiles produced by PSI-BLAST to predict the positions of α helices and β sheets in the target sequence.
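Both uses of the secondary structure prediction can be expressed as small calculations on three-state strings (H for helix, E for strand, C for coil). The Python sketch below is an illustration only and is not part of MolIDE; the window length, the threshold, and the function names are assumptions chosen for the example.

```python
def structured_fraction(ss, start, end):
    """Fraction of positions in ss[start:end] assigned as helix (H) or strand (E)."""
    window = ss[start:end]
    return sum(1 for s in window if s in "HE") / len(window) if window else 0.0


def low_structure_windows(ss, window=10, threshold=0.2):
    """Return start indices of windows whose structured fraction is below threshold."""
    return [
        i
        for i in range(0, len(ss) - window + 1)
        if structured_fraction(ss, i, i + window) < threshold
    ]


def ss_agreement(predicted, experimental):
    """Fraction of aligned positions where two secondary-structure strings agree.

    Positions where either string has a gap character ('-') are skipped.
    """
    pairs = [(p, e) for p, e in zip(predicted, experimental) if p != "-" and e != "-"]
    return sum(p == e for p, e in pairs) / len(pairs) if pairs else 0.0


# Hypothetical prediction: a 12-residue coil segment followed by a 10-residue helix.
predicted = "C" * 12 + "H" * 10
print(low_structure_windows(predicted))
print(ss_agreement("HHHC", "HHEC"))
```

Runs of consecutive low-structure windows suggest candidate disordered regions, while a low agreement value over a template alignment is a hint that the template or the alignment deserves a closer look.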
The next step in modeling is to choose a template by opening the list of hits from the PDB. MolIDE parses each PSI-BLAST output file and creates a table of hits, including their experimental source (XRAY or NMR), the resolution of X-ray structures, the E-value, sequence identity, the beginning and ending points in the target sequence, and the alignment length. This table is sortable by any of these elements, which provides a rapid way of finding the highest resolution structure or the one with the highest sequence identity. Quite frequently, a multi-domain protein may have homologues of known structure that cover non-overlapping regions of the target sequence. The table can be sorted by beginning or ending residue numbers of the target sequence in order to locate templates that cover the domain of interest among a list of possibly hundreds of hits. The template for modeling is chosen based on the best combination of a number of factors. First, it must cover the region or regions of interest in the target protein. From among those hits, usually the best E-value or highest sequence identity template is chosen. When there are a number of templates with about the same evolutionary distance from the target sequence, the one with highest resolution is usually the best choice of template. Another consideration may be the number and positions of gaps in the sequence alignment. Often one template or another may contain a ligand or binding partner of interest while others will not. This ligand may be nucleic acid, small organic molecules, ions, or other proteins. Our database program ProtBuD41 can be used to search within a protein family for particular ligands. Structures of such complexes may be used to identify important binding residues in the model of the target.
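The template selection criteria above can be summarized as a filter followed by a multi-key sort. The following Python sketch is one possible ordering of those criteria, not the procedure MolIDE applies (in MolIDE the user sorts the table interactively); the `Hit` fields, the rounding used to treat near-equal identities as ties, and the template identifiers are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    pdb_id: str
    method: str                  # "XRAY" or "NMR"
    resolution: Optional[float]  # in angstroms; None for NMR structures
    e_value: float
    identity: float              # percent sequence identity to the target
    start: int                   # first aligned target residue
    end: int                     # last aligned target residue


def rank_templates(hits: List[Hit], region_start: int, region_end: int) -> List[Hit]:
    """Keep hits covering the region of interest, then sort by identity and resolution."""
    covering = [h for h in hits if h.start <= region_start and h.end >= region_end]
    return sorted(
        covering,
        key=lambda h: (
            -round(h.identity),  # higher identity first; rounding makes near-ties equal
            h.resolution if h.resolution is not None else float("inf"),
            h.e_value,
        ),
    )


hits = [
    Hit("tmplA", "XRAY", 2.5, 1e-40, 45.2, 10, 200),
    Hit("tmplB", "XRAY", 1.8, 1e-38, 44.9, 5, 210),
    Hit("tmplC", "NMR", None, 1e-50, 60.0, 150, 300),
]
for hit in rank_templates(hits, region_start=20, region_end=180):
    print(hit.pdb_id, hit.identity, hit.resolution)
```

In this toy example the third hit is dropped because it does not cover the region of interest, and the two remaining hits, which have nearly the same identity, are ordered by resolution. Other factors mentioned above, such as gap positions and bound ligands, still require inspection by the user.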
With a simple click within the list of templates, MolIDE downloads the template structure from the PDB and then displays the structure of the template and the sequence alignment of the target and template. The predicted secondary structure of the target is shown above the target sequence and the experimental secondary structure of the template is shown below the template sequence. At this point, the alignment can be manually edited. PSI-BLAST alignments are reasonably accurate at sequence identities above 30%42, but even then gaps corresponding to insertions or deletions may be placed within regular secondary structures of the template. The visualization tool within MolIDE allows the user to see the placement of deletions from the structure (deleted residues marked by red balls) and insertions into the structure (marked by yellow balls at the point of insertion). The gaps can be moved within the alignment as described in the protocol and the changes are marked in real time on the structure. For deletions from the structure, it is best if the end points of the deletion are relatively close to one another in space on the template structure.
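The advice about deletions can be checked numerically by measuring the distance between the Cα atoms of the two template residues that flank the deletion. The Python sketch below illustrates this; it is not a MolIDE feature, and the 7.0 Å cutoff, the function names, and the toy coordinates are assumptions made for the example.

```python
import math


def ca_distance(ca_coords, res_a, res_b):
    """Distance in angstroms between the C-alpha atoms of two template residues."""
    return math.dist(ca_coords[res_a], ca_coords[res_b])


def deletion_is_bridgeable(ca_coords, res_before, res_after, max_gap=7.0):
    """Heuristic check that the residues flanking a deletion are close in space.

    max_gap is an assumed cutoff, not a value taken from MolIDE.
    """
    return ca_distance(ca_coords, res_before, res_after) <= max_gap


# Toy C-alpha coordinates keyed by template residue number (hypothetical values).
toy_ca = {
    40: (0.0, 0.0, 0.0),
    47: (5.0, 2.0, 0.0),
    60: (20.0, 0.0, 0.0),
}
print(deletion_is_bridgeable(toy_ca, 40, 47))  # flanking residues are near each other
print(deletion_is_bridgeable(toy_ca, 40, 60))  # flanking residues are far apart
```

If the flanking residues are far apart, moving the gap within the alignment to a position where the end points are closer may give a deletion that is easier to close during loop modeling.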
Once the alignment has been edited, the modeling process consists of three steps performed within the MolIDE graphical interface. The first is simply to copy the backbone coordinates to a new file and renumber the sequence and the residue names to the target sequence according to the alignment. Only aligned residues are copied. If the residue types are the same at a given position in the target and template sequences, the side-chain coordinates are also copied to the new file. If the template PDB contains modified residues (selenomethionine or phosphorylated residues), only the backbone is copied. The second step is to run SCWRL to build the side chains of the target sequence onto the model backbone. SCWRL is able to preserve the Cartesian coordinates of side chains that are conserved, and this generally results in more accurate side-chain prediction in homology modeling. The third step is to model each loop in turn. MolIDE uses the program Loopy43 which is a relatively fast program for loop structure prediction. It is one of the few stand-alone loop-modeling programs. To model the first loop, the user selects positions for the left and right anchors of the region to be remodeled. These are positions in the structure that will be kept fixed while the residues in between will be modeled by Loopy. The left anchor can be chosen as the last residue of the secondary structure preceding the gap, while the right anchor is the first residue of the secondary structure following the gap. Alternatively, longer or shorter regions may be chosen in order to choose the anchor positions as the closest conserved residues to the gap. Once the anchors are set, the loop can be modeled with a click. The anchors are then set around the next gap in the same manner and so on, until all the insertion and deletion regions have been remodeled with Loopy.
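The rules of the first modeling step (which residues are copied, and which keep their side chains) can be written down as a classification of alignment columns. The Python sketch below is an illustration of those rules and not MolIDE's implementation; the label names are hypothetical, and modified template residues such as selenomethionine, for which only the backbone is copied, are not handled.

```python
def classify_alignment(target_aln, template_aln):
    """Label each alignment column for the backbone-copy step.

    Returns a list of labels, one per non-empty column:
      'keep_side_chain' - aligned, identical residue types (side chain copied)
      'backbone_only'   - aligned, different residue types (side chain rebuilt)
      'insertion'       - target residue with no template coordinates
      'deletion'        - template residue absent from the target (not copied)
    """
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned strings must have the same length")
    labels = []
    for t, m in zip(target_aln, template_aln):
        if t == "-" and m == "-":
            continue  # empty column, nothing to label
        if t == "-":
            labels.append("deletion")
        elif m == "-":
            labels.append("insertion")
        elif t == m:
            labels.append("keep_side_chain")
        else:
            labels.append("backbone_only")
    return labels


# Toy alignment with one insertion and one deletion relative to the template.
print(classify_alignment("ACDK-F", "AC-RGF"))
```

Columns labeled as insertions or deletions mark the regions around which loop anchors would later be placed.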
If a protein is very long (>500 amino acids) and contains multiple domains and long disordered regions, it is sometimes helpful to use target sequences of the single domains of interest. Many proteins contain long regions that are intrinsically disordered. Several web servers are available to predict these regions, including DISOPRED44. Such regions can be removed from the target sequence before the PSI-BLAST search step.
The homology modeling procedure provided by MolIDE is simple and straightforward. It produces models assuming that aligned regions of the target to the template do not change backbone conformation. While this is not in general true, it is a reasonable approximation when the sequence identity is above 30%. Even below this value, the model is still useful in understanding the relative positions of residues in the protein. In any sequence alignment, some regions are more conserved than others, and these regions usually have functional or structural significance. Thus, even such a simple modeling procedure will predict whether residues are on the surface or buried: mutation of buried residues may lead to unfolding, while mutations on the surface may abolish binding to other molecules. For protein-protein interactions, a patch of conserved surface residues may be close together on the surface but far apart in the sequence. Such a conserved patch may be used to locate likely binding surfaces, which can be tested using site-directed mutagenesis.
It should be noted that there are web servers that will also produce homology models, including SwissModel45 as well as databases of models, including ModBase46. These are certainly valid alternatives to performing homology modeling with SCWRL and MolIDE. However, there are many choices in homology modeling that a user may wish to make with consideration of particular biological questions in mind, and MolIDE is designed to allow the user to make these choices while handling the nitty-gritty computational steps with a few clicks. This is especially true in the choice of template. Some templates may be preferred because they contain ligands of interest, including other proteins, small molecules, or nucleic acids. Some templates may have better conservation near residues that the user is interested in, for instance given existing mutation data. Other templates may have fewer insertions or deletions in regions of interest, for instance near an interface. Further, MolIDE provides user interaction during the model-building process that may be highly beneficial. This is especially true of manual editing of the target-template alignment using additional information that the user may possess, including multiple sources for target-template alignment and visualization of the template structure. Choosing the positions where loop modeling begins and ends (the loop anchors) by visualizing the structure may also lead to better structure predictions.
Any computer running Windows XP or Vista or Linux may be used. PSI-BLAST will run more quickly with 512 MBytes of RAM or more.
All of the software and various components described here can be downloaded from http://dunbrack.fccc.edu as described in the procedures below.
The sequence databases required can be downloaded from our web site and publicly available websites as described in the procedures below.
1 | From the SCWRL webpage, http://dunbrack.fccc.edu/SCWRL3.php, follow the link labeled “Download” and fill out the license form. SCWRL is free to non-profit institutions. Commercial institutions should contact Roland.Dunbrack@fccc.edu. Fill out the form and click the “I agree” button at the bottom of the page. This leads to a verification page for the input information. Click “Send request.” The request is sent to the Fox Chase Cancer Center for approval. On approval, the user will receive an e-mail message with the subject heading “SCWRL3.0 Download.” Click the link in this e-mail message to obtain SCWRL3.0 for various platforms, including Windows (both XP and Vista), Linux, Mac OS X, SGI Irix, and SunOS. Click “download” to begin downloading of an archive that contains the SCWRL program and the binary rotamer library file used by SCWRL.
The installation kits for each operating system have the following names respectively:
2 | The procedure for installing SCWRL is slightly different on Windows (Option A) and the Unix-related platforms (Option B).
Double-click on scwrl3_win.msi and follow the instructions in the installer. By default, the installer will place SCWRL3 in the folder C:\FCCC\scwrl3_win\. This is where MolIDE expects to find SCWRL, and so placing it in this location makes installation of MolIDE simpler.
(i) Move the archive to a location on your hard drive where you want to keep the SCWRL program. From a terminal window, give the following commands (for example, for the Linux distribution):
cd scwrl_path/
gzip -d scwrl3_lin.tar.gz
tar -xvf scwrl3_lin.tar
where “scwrl_path/” is the name of the directory that contains the file scwrl3_lin.tar.gz. Ordinarily on Linux systems, a typical directory for SCWRL might be /usr/local/bin.
The previous step will create a new directory, scwrl3_lin/, and unpack four files and a folder into that directory:
On the command line of the terminal window, now type:
This command will modify the executable file scwrl3_ and move it to the filename scwrl3. This executable now contains within it the location of the rotamer library file, BBDep.bin. The executable can be moved elsewhere on the computer and can be executed in any directory, as long as the rotamer library remains in its location where ./setup was run. If you decide to move the BBDep.bin to a different location, repeat the installation procedure for that directory, beginning with uncompressing the kit.
3 | SCWRL is a command-line program. That is, a command must be issued from a console window on Windows (Option A) or a terminal window on Unix systems (Option B).
Open a console window by selecting “Command Prompt” from the “Start” menu. On some systems, the Command Prompt may be found by selecting “All Programs” from the “Start” menu, and then selecting “Accessories” and then “Command Prompt.”
Unix systems vary on how to open a terminal window. On Mac OS X, for example, the Terminal program is located in the Utilities folder within the Applications folder. Consult your system administrator for assistance if necessary.
4 | SCWRL may be used in various ways using some optional flags on the command line. SCWRL can predict side-chain conformations for an input backbone structure without modification of the sequence. In this case, SCWRL removes all side-chain atoms from the input file (if any), and rebuilds all of the side chains according to the residue names of the backbone atom coordinates (Option A). SCWRL can predict side chain conformation in the presence of non-protein atoms, such as ions, ligands, and nucleic acids. The ligand atoms are treated only with a simple steric repulsive energy function, so that the predicted side chains will not overlap the ligand atoms. SCWRL determines the element (N, C, Zn, Mg, etc.) from the atom name, and assigns a radius to each atom based on its element type. The procedure for modeling in the presence of ligands is as described in Option B. SCWRL can change the sequence of the input file by reading an additional file containing the new sequence. This sequence file should contain one-letter codes for the new sequence, and must contain exactly the same number of residues that the input PDB file contains. If it does not, SCWRL will report an error. The new sequence is placed on the backbone, retaining the input chain identifiers (A, B, C, etc.) and residue numbering. The input file may contain multiple chains, as long as the input sequence file contains the new sequence for each chain in the same order as the input coordinate file. The input sequence file also may contain information to indicate whether the Cartesian coordinates of the side chain in the input file should be kept. This is useful in homology modeling, since better predictions will usually be produced when conserved side chains (same residue type in the target and template sequences) are kept fixed during side-chain prediction (Option C).
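A sequence file for this purpose can be prepared programmatically. The Python sketch below reads the residue order from the Cα atoms of a PDB-format file, checks that the new sequence has the same number of residues, and writes conserved positions in lower case. It assumes that lower case marks residues whose input coordinates are to be kept and upper case marks residues to be rebuilt, which is our reading of the note on lower-case residue types under Option C; the function names and the toy input are hypothetical, and alternate locations and modified residues are not handled.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def residues_from_pdb(pdb_lines):
    """Return one-letter codes in file order, read from CA atoms of ATOM records."""
    seq = []
    for line in pdb_lines:
        # PDB format: atom name in columns 13-16, residue name in columns 18-20.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            seq.append(THREE_TO_ONE.get(line[17:20], "X"))
    return seq


def scwrl_sequence(pdb_lines, new_sequence):
    """Build a sequence string with conserved residues written in lower case.

    Raises ValueError when the lengths differ, mirroring the requirement that
    the sequence file match the number of residues in the input PDB file.
    """
    old = residues_from_pdb(pdb_lines)
    new = new_sequence.upper()
    if len(old) != len(new):
        raise ValueError(f"sequence has {len(new)} residues, PDB file has {len(old)}")
    return "".join(n.lower() if n == o else n for o, n in zip(old, new))


def toy_ca_line(serial, resname):
    """Format a minimal ATOM record for a CA atom (coordinates omitted)."""
    return f"ATOM  {serial:>5}  CA  {resname} A{serial:>4}"


toy_pdb = [toy_ca_line(1, "ALA"), toy_ca_line(2, "GLY"), toy_ca_line(3, "SER")]
print(scwrl_sequence(toy_pdb, "AGT"))
```

Here the first two residues are conserved and are written in lower case, while the serine-to-threonine change at the third position is written in upper case so that its side chain is predicted.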
(for Windows systems)
scwrl_path\scwrl3.exe -i inputpdbfile -o outputpdbfile > logfile
(for Unix systems)
scwrl_path/scwrl3 -i inputpdbfile -o outputpdbfile > logfile
The filenames can be any desired names for the input file (“inputpdbfile”), the output PDB-format file (“outputpdbfile”), and the log file (“logfile”). So for example, the actual commands on Windows or Unix systems, respectively, might be:
C:\FCCC\scwrl3_win\scwrl3.exe -i myfile.pdb -o mymodel.pdb > mylog.log
/usr/local/bin/scwrl3_lin/scwrl3 -i myfile.pdb -o mymodel.pdb > mylog.log
The log file will contain some output from SCWRL about how the prediction problem was solved, including details about the graph theory algorithm process as described in the paper29. For most users, the information in this file is not relevant.
(for Windows systems)
scwrl_path\scwrl3.exe -i inputpdbfile -o outputpdbfile -f framefile > logfile
(for Unix-based systems)
scwrl_path/scwrl3 -i inputpdbfile -o outputpdbfile -f framefile > logfile
where framefile contains the ligand coordinates.
If the lower-case residue type does not agree with the input PDB file residue type in the same position within the structure, then the input PDB file residue type will be used and the coordinates will be predicted by SCWRL.
(for Windows systems)
scwrl_path\scwrl3.exe -i inputpdbfile -o outputpdbfile -s sequencefile > logfile
(for Unix-based systems)
scwrl_path/scwrl3 -i inputpdbfile -o outputpdbfile -s sequencefile > logfile
where sequencefile contains the new sequence. SCWRL can combine the optional sequencefile and framefile by using both flags on the command line. It has two further flags, -u and -d. The flag -u tells SCWRL not to predict disulfides. This is useful for proteins in reducing environments, especially if they contain cysteines in proximity to one another around a metal ion such as zinc. The other option is -d, which tells SCWRL to write the dihedral angles of the predicted structure to a file called outputpdbfile.dihed. The order of flags is not important.
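When SCWRL is run on many structures, it can be convenient to assemble the command line in a script. The Python sketch below builds an argument list from the flags described in this step (-i, -o, -s, -f, -u, -d) and, optionally, runs it with the output redirected to a log file. The function and parameter names are hypothetical, and the example only prints the command rather than executing SCWRL.

```python
import subprocess


def build_scwrl_command(executable, input_pdb, output_pdb,
                        sequence_file=None, frame_file=None,
                        skip_disulfides=False, write_dihedrals=False):
    """Assemble the SCWRL argument list from the flags described in the text."""
    cmd = [executable, "-i", input_pdb, "-o", output_pdb]
    if sequence_file:
        cmd += ["-s", sequence_file]   # new sequence to place on the backbone
    if frame_file:
        cmd += ["-f", frame_file]      # ligand coordinates
    if skip_disulfides:
        cmd.append("-u")               # do not predict disulfides
    if write_dihedrals:
        cmd.append("-d")               # also write outputpdbfile.dihed
    return cmd


def run_scwrl(cmd, log_path):
    """Run SCWRL, sending its standard output to the log file."""
    with open(log_path, "w") as log:
        return subprocess.run(cmd, stdout=log, check=True)


# Example: the path below is the Linux location used earlier in this protocol.
command = build_scwrl_command(
    "/usr/local/bin/scwrl3_lin/scwrl3", "myfile.pdb", "mymodel.pdb",
    sequence_file="mysequence.txt", skip_disulfides=True,
)
print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems with filenames that contain spaces, which is mainly a concern on Windows.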
5 | From the MolIDE webpage, http://dunbrack.fccc.edu/molide, follow the link labeled “Download” and fill out the license form. MolIDE is free to both non-profit and commercial institutions. Fill out the form and click the “I agree” button at the bottom of the page. This leads to a verification page for the input information. Click “Send request.” The request is sent to the Fox Chase Cancer Center for approval. On approval, the user will receive an e-mail message with the subject heading “MolIDE Download.” Click the link in this e-mail message to obtain MolIDE for either Windows or Linux. Click “download” to begin downloading of an archive that contains MolIDE and its associated files. Because of incompatibilities in the wxWindows framework used in MolIDE to build a Windows/Linux cross-platform graphical user interface, MolIDE is not available for other systems, such as Mac OS X, although it can be installed on Intel-based Macs that are running Windows, and run from the Windows operating system.
The installation kits for each operating system have the following names
6 | Installing and setting up MolIDE on Windows (Option A) and Linux systems (Option B) follow different procedures. The procedure on Windows is significantly simpler.
where “molide_path/” is the name of the directory that contains the file molide1.6_lin.tar.gz. Ordinarily on Linux systems, a typical directory for MolIDE might be /usr/local/bin.
gzip -d molide1.6_lin.tar.gz
tar -xvf molide1.6_lin.tar
!Caution. If a later version of wxWindows is already installed, the system may return error messages with the previous command. To override this, the flag “-f” can also be given in the rpm command. However, this may compromise other programs on your system that may use wxWindows. It is unlikely that most users will face this problem, because wxWindows is not that common on Linux machines.
rpm -U *.rpm
!Caution. If you already have the NCBI package for BLAST installed on your machine, make a back-up copy of the file .ncbirc in your home directory, if it exists. It will be overwritten by setup.
7 | MolIDE uses PSI-BLAST38, PSIPRED40, SCWRL29, and Loopy43 to perform the basic steps in homology modeling. MolIDE comes with PSI-BLAST, PSIPRED, and Loopy in default locations. However, SCWRL must be installed as a separate step since it requires a separate license. It does not matter which program is installed first. After installing MolIDE on each system, SCWRL must be installed if it is not already installed and the location of SCWRL must be set within MolIDE. To obtain and install SCWRL, follow the instructions above. On Windows, the SCWRL installer will place SCWRL within C:\FCCC\scwrl3_win. This default location is already set in MolIDE on Windows. In both Windows and Linux, from within the Tools menu, select “Options” and then “Scwrl.” The location of the executable for SCWRL can be entered into the box using the “browse” button. If it is already correct, then this step can be skipped.
8 | MolIDE depends on two sequence databases for producing homology models. The first is the non-redundant protein sequence database, “nr,” from NCBI, currently about 6 million sequences. This sequence database is used to produce sequence profiles for the target sequence based on multiple sequence alignment of many homologues. The second is the PDB protein sequence database, “pdbaa,” which must be obtained from our website. This version of pdbaa is different from NCBI's version in a number of ways. It is more up-to-date than NCBI's, and contains additional information on the header line for each sequence, including experiment type, resolution, sequence length, and R-factors. It also has distinct names for different sequences in a single PDB file, based on the gene name for that protein.
The PDB is updated weekly on Wednesdays, and the pdbaa database is updated on our website within a couple of days. The pdbaa database within MolIDE therefore can be updated as often as weekly when MolIDE is in use. Over 100 structures are added to the PDB every week. To update pdbaa, select “Update DB” from the Tools menu. A window appears with the option to update either “PDBAA” or “NR.” Select “PDBAA,” click “OK”, and then “Download.” The PSI-BLAST formatted files will be installed in the appropriate location.
After MolIDE is installed for the first time, download the nr database via the “Update DB” option in the Tools menu. The nr sequence database will be downloaded from the NCBI ftp site, and automatically formatted for PSI-BLAST using the NCBI program formatdb, included with MolIDE. Depending on download speed, it may take 30 minutes or more to download nr, and 10 minutes or more to uncompress it and format it for PSI-BLAST. All of this will be done automatically by MolIDE. The nr database can be updated periodically. A monthly update is more than sufficient.
Other databases may be used instead of the nr database from NCBI. For instance the UniRef databases from UniProt are suitable. To use the uniref100 database from uniprot:
!Caution. The PDB sequence database pdbaa must be obtained from our website for use in MolIDE rather than NCBI's file of the same name.
9 | Prepare a target sequence for modeling. This should be placed in a single file in FASTA format with the extension “.seq”. Such a sequence can be obtained from NCBI by keyword search. NCBI's site can format a sequence in FASTA format. The target sequence file should look something like this:
>P53 [Homo sapiens]
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAA PRVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKT CPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRN TFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVHVCACPGR DRRTEEENLRKKGEPHHELPPGSTKRALSNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALEL KDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD
A FASTA-formatted sequence file includes “>Name” as the first line in the file with the sequence following, starting on the next line. The sequence can be spread over as many lines as necessary and the lines can be of any length. Other text can follow the name on the first line, and there should only be one sequence in the file. Spaces and numbers within the sequence will be ignored.
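The FASTA conventions just described can be checked before a file is opened in MolIDE. The sketch below is an illustration only, assuming the rules stated above (one “>Name” line, a single sequence, spaces and numbers ignored); the function name is hypothetical and MolIDE's own parser may behave differently.

```python
from typing import Tuple


def read_single_fasta(text: str) -> Tuple[str, str]:
    """Return (header, sequence) from FASTA text holding exactly one sequence."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("the first line must start with '>'")
    if any(line.startswith(">") for line in lines[1:]):
        raise ValueError("only one sequence is allowed per file")
    header = lines[0][1:].strip()
    # Spaces and digits inside the sequence are ignored, as the text states.
    sequence = "".join(
        ch for line in lines[1:] for ch in line
        if not (ch.isspace() or ch.isdigit())
    )
    return header, sequence
```

For example, read_single_fasta(open("target.seq").read()) returns the name line without the leading “>” and the sequence as one unbroken string; the file name target.seq is a placeholder.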
Within MolIDE, open a sequence file via the File menu, selecting “Open” and then “Sequence.” In what follows, we describe choosing items under one of the menus with the following notation, as in this case: “File->Open->Sequence”.
10 | While a sequence file is open, run a multiple-round PSI-BLAST by selecting PSIBLAST from the Tools menu. PSI-BLAST is run first against the non-redundant protein sequence database (or a database of the user's choosing that can be set under Tools->Options->PSIBLAST) with a customized version of PSI-BLAST that comes with MolIDE. This version of PSI-BLAST outputs profiles after every round (including the last round), each with a unique name. Once the non-redundant sequence database search is completed, the PDB is automatically searched with each of the profiles and a separate PDB alignment file is created.
You can change other parameters used during the PSI-BLAST runs using Tools->Options->PSIBLAST. For instance, for common protein domains such as kinases or immunoglobulins, it may be desirable to use a stricter E-value cutoff for creating the profiles in PSI-BLAST. If you know that you have close homologues to your target sequence in the PDB, then rather than searching nr to create profiles, you can use pdbaa for this step as well.
Once PSI-BLAST is finished, close the PSI-BLAST run window by clicking “Done.” The PSI-BLAST output of the search against the non-redundant protein database can be opened by choosing “Open->Seq HITS Alignment”. A window opens with a table of all the hits found in the non-redundant database. This is shown in Figure 3.
11 | Run PSIPRED to predict the secondary structure of the target by selecting Tools->PSIPRED. PSIPRED uses the output PSI-BLAST sequence profiles from the nr database search. PSIPRED should therefore be run only after the PSI-BLAST run in the previous step is completed. A secondary structure prediction for each round of PSI-BLAST is created.
To view the secondary structure predictions, select File->Open->Sec Struct Pred. After choosing a “.psipred” file, you are given the option to display all of the predictions based on the matrices generated by PSI-BLAST after each round. These are displayed in a single window with each prediction on a separate line. A predicted sheet is colored in green and a predicted helix is in red. Predicted coil regions (loops and unstructured regions) are depicted in gray. A screenshot of predicted secondary structure from several rounds of PSI-BLAST is also shown in Figure 3.
The intensity of the color is proportional to the prediction confidence; the darker the color, the higher the prediction confidence. This view allows you to see if the secondary structure prediction changes as more remotely related sequences are added to the profiles. For proteins with few close relatives, the predictions may be more accurate in later rounds as distantly related sequences provide information on likely secondary structure patterns. However, for proteins with a good number of close relatives (>50), the addition of distantly related sequences with potentially large structural changes (additional secondary structure or missing secondary structure) may degrade the secondary structure prediction.
12 | Open the PSI-BLAST file containing the alignments of your query sequence with sequences of proteins from PDB. There will be a separate hits file for each round of PSI-BLAST. Open a file of hits from the PDB by selecting “Open->PDB Hits Alignment”. Only files with the correct extension, .pdbout, will be shown.
The results are displayed as a table, as shown in Figure 4. To sort the table by some feature, click on the column header of that feature, such as E-value, sequence identity, or starting and ending positions of the alignment. Clicking again on the same column header will reverse the sorting order for that column. In a multi-domain protein, for instance, it is common to have many available templates for one domain, but only a few templates for a different domain. These are hard to locate in the raw PSI-BLAST output text file, but the sorting feature in MolIDE makes them easy to find.
Because there is a file of hits for each round of PSI-BLAST, one may ask which file should be used. There are no strict rules for determining this. A rough rule of thumb would be the earliest iteration that produces a complete alignment of the target sequence or domain of the target to a known structure. For example, if the target is a single-domain protein, then the earliest round that aligns the entire target domain to a complete domain of known structure should be used. In many cases, the first or second round may only align a central well-conserved region, but not the whole domain. In that case, later rounds should be examined. This can be determined by observing the alignment and the template structure simultaneously, as detailed in the next step.
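The rule of thumb above can be written down as a small search over rounds. This is a rough sketch under stated assumptions: the hit data structure (alignment start and end positions on the target, grouped by round), the function name, and the 90% coverage threshold are all hypothetical, and the sketch checks only coverage of the target domain, not whether the template domain is complete. Visual inspection as described in the next step remains necessary.

```python
from typing import Dict, List, Optional, Tuple


def earliest_complete_round(
    hits_by_round: Dict[int, List[Tuple[int, int]]],
    domain_start: int,
    domain_end: int,
    min_coverage: float = 0.9,
) -> Optional[int]:
    """Return the first PSI-BLAST round with a hit covering the target domain.

    hits_by_round maps a round number to (start, end) target positions of each
    PDB hit in that round. min_coverage is an assumed threshold, not a value
    taken from the protocol.
    """
    domain_length = domain_end - domain_start + 1
    for round_number in sorted(hits_by_round):
        for hit_start, hit_end in hits_by_round[round_number]:
            # Number of domain residues that fall inside this alignment.
            overlap = min(hit_end, domain_end) - max(hit_start, domain_start) + 1
            if overlap / domain_length >= min_coverage:
                return round_number
    return None  # no round aligns enough of the domain
```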
13 | While sorting the table enables the user to locate the best templates by resolution, sequence identity, and other features, viewing the alignment of the target to a possible template is a key step in this choice. From the hits table, double-click on the hit number in the first column for a template of interest. The alignment of the target to that structure in the PDB will be extracted from the PSI-BLAST output file, and saved in a separate file in the same directory where the sequence file resides. The extension of this file is .alnonet, which stands for “alignment with onetemplate.”
At the same time, a window will appear for downloading the coordinates of the template structure from the PDB. Click “Download.” The default ftp server, ftp.wwpdb.org, should work well (at times it may be slow), but the user can change the ftp server by selecting Tools->Options->Servers. MolIDE uses the XML-format files from the PDB and converts these to PDB format. It also extracts information from the XML file on how the template sequence (numbered from 1 to N, its length) corresponds to residues in the coordinates. PDB files are sometimes missing coordinates due to disorder, and the coordinates may be numbered starting from any number. This information is critical in converting a template into a target model given an alignment, and MolIDE handles it automatically from information contained in the XML format. This information is not present in the PDB's PDB-format files. For a PDB entry with accession code 1abc, for example, the coordinates will appear in file 1abc.pdb and the sequence-coordinate residue correspondence will appear in file 1abc.sc.
Once the XML file is downloaded, read, and converted, a window appears with a view of the target-template sequence alignment, the secondary structure prediction of the target, and the experimental secondary structure of the template. Above the alignment, the backbone of the template structure appears. The whole template protein is displayed in gray, while the part of the structure used in the alignment is displayed in green. Insertions in the target (target longer than template) are marked by 2 adjacent yellow spheres on the template structure Cα atoms surrounding the insertion point. Deletions from the template (target shorter than template) are represented by red spheres on the Cα atom of residues to be deleted from the template structure.
Viewing the alignment and the structure simultaneously allows the user to determine visually whether the alignment covers an entire template domain and to identify where insertions and deletions are located on the structure. MolIDE's viewer is quite simple and is not suitable for making images for publication. Its purpose is to examine and edit the target-template sequence alignment. The PDB files produced by MolIDE can be read into any molecular viewer, such as PyMol (W. L. Delano, http://pymol.org).
To manipulate the structure view in MolIDE:
|Left_button_drag||Rotates the structure|
|Right_button_drag||Zoom in and out (Z-direction)|
|Middle_button_drag||Move in plane of screen (X-Y plane)|
|Double_left_click||On an atom in the structure displays the residue numbers at the bottom of the window|
|Left_click||Show spacefill in template viewer|
|Middle_click||Spacefill additional residues in viewer|
In the View Menu, the options are:
|Backbone||Displays connected Cα atoms|
|Spacefill||Displays spheres on each atom|
|Aligned fragment||Displays only that part of the template structure that aligns with the target|
|Whole template||Displays the entire template with the part that is aligned in green and the rest in gray|
14 | Generally it is a good idea to edit the target-template sequence alignment manually. An example of alignment editing is shown in Figure 5. Deletions from the structure are least disruptive if the N- and C-terminal endpoints of the deletion are near each other in space. Insertions are best placed in the middle of loop regions, not immediately next to regular secondary structure. The correspondence of predicted secondary structure of the target and the experimental secondary structure of the template can be used to guide the alignment. PSI-BLAST often fails to align some regions correctly, so if other information is available, on conserved residues for instance, the alignment can be edited accordingly. For sequence identities below 30%, it is advisable to seek alternative alignments from servers that provide profile-profile alignments, which are generally more accurate than PSI-BLAST. One of the best and most usable of these is FFAS7. The MolIDE alignment can be edited according to the alignment provided by FFAS.
Moving the mouse over the alignment will display in the status bar at the bottom of the window the sequence numbers for query and template sequences, as well as the corresponding PDB coordinate residue number in the template PDB. The color-coding scheme for the secondary structure of the template is the same one used for the secondary structure prediction (helix=red; sheet=green). The third column of the status bar displays the number of identities in the alignment.
To move a gap over several residues, delete it first, then move to the place of insertion and insert the appropriate number of gap characters as follows:
These operations can be performed on either the target sequence or the template sequence. Only gap characters can be inserted or deleted.
15 | Once the alignment editing is completed, choose “Copy backbone” from the Tools menu. This step produces a file with extension .model that contains a model of the target sequence based on the aligned residues in the current target-template alignment window. Side chains for conserved residues (identical and aligned in the target and template alignment) are also copied to the model.
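The logic of this step can be illustrated with a short sketch that walks the alignment column by column. It is not MolIDE's implementation: the function name, the use of “-” as the gap character, and the list mapping each template sequence position to a PDB residue number (None where coordinates are missing) are assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def plan_backbone_copy(
    target_aln: str,
    template_aln: str,
    template_resnums: List[Optional[int]],
) -> List[Tuple[int, str, int, bool]]:
    """List (target_index, target_residue, template_pdb_resnum, copy_side_chain).

    template_resnums[i] is the PDB residue number of template sequence
    position i + 1, or None if that residue has no coordinates.
    """
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have equal length")
    plan = []
    target_index = 0    # 1-based position in the ungapped target sequence
    template_index = 0  # 1-based position in the ungapped template sequence
    for target_res, template_res in zip(target_aln, template_aln):
        if target_res != "-":
            target_index += 1
        if template_res != "-":
            template_index += 1
        if target_res == "-" or template_res == "-":
            continue  # insertion or deletion: no backbone to copy
        resnum = template_resnums[template_index - 1]
        if resnum is None:
            continue  # template residue is missing from the coordinates
        # Side chains are copied only for identical aligned residues.
        plan.append((target_index, target_res, resnum, target_res == template_res))
    return plan
```

Target residues that do not appear in the returned plan are the ones left for loop building in a later step.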
16 | Select “Build Side Chains (SCWRL)” from the Tools menu. Click “Run SCWRL” in the window that appears. The conserved side chains are left in the original conformation from the template crystal structure. This option can be changed with Tools->Options->Scwrl. SCWRL should run very quickly (seconds). If it takes a long time (>5 minutes), the run should be canceled in the window, and another template selected. This occurs when the backbone of the template will not accommodate the target side chains very easily.
17 | Loop building is done by first selecting residues for the left and right anchors. These are residues that will be kept fixed while the intervening sequence is modeled using the Loopy program43. It is usually a good idea to allow at least 2-3 residues on either side of the insertion or deletion to move during the loop-building process. One option is to make the left and right anchors the last and first residues of the flanking secondary structures respectively. However, if part of a long loop is well conserved, it may be better to select a smaller region that contains less conserved segments. Loopy will sometimes be unable to build a loop if the loop length is too short and the distance to be spanned by the predicted loop is too large. In this case the anchors should be reset (cleared) and then selected again further apart and Loopy should be run again.
Also note that if residues are missing from the structure due to poor electron density, they will be marked with blue squares below the template sequence. These regions should also be rebuilt with Loopy.
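The guideline of leaving 2-3 movable residues on either side of an insertion or deletion can be expressed as simple arithmetic on target residue numbers. The sketch below is illustrative only: the function name and the default flank of 3 are assumptions, and the result is a starting suggestion that should still be adjusted by eye, for example to the ends of the flanking secondary structures.

```python
from typing import Tuple


def suggest_anchors(
    indel_left: int, indel_right: int, sequence_length: int, flank: int = 3
) -> Tuple[int, int]:
    """Return (left_anchor, right_anchor) as 1-based target residue numbers.

    indel_left and indel_right are the aligned target residues immediately
    before and after the insertion, deletion, or missing region. The anchors
    stay fixed; `flank` residues on each side, plus everything between them,
    are left free to move.
    """
    if not 1 <= indel_left < indel_right <= sequence_length:
        raise ValueError("expected 1 <= indel_left < indel_right <= sequence_length")
    left_anchor = max(1, indel_left - flank)
    right_anchor = min(sequence_length, indel_right + flank)
    return left_anchor, right_anchor
```

If Loopy cannot close the loop, calling the function again with a larger flank moves the anchors further apart, which mirrors the advice given above.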
To build loops, Right_Click on a query residue in the sequence alignment to display a pop-up menu:
!Caution. Click on the query sequence not the template sequence to select anchors, to reset the anchors, and to build the loop.
After choosing the loop's anchor residues, proceed with “Build Loop”. Click “Run Loopy” in the window that appears. Loopy should take about a minute or less to build a loop of up to 15 residues.
Proceed with loop building of each insertion-deletion region in turn until all the insertions/deletions/missing residues are modeled. The model is contained in a PDB-format file. The file name follows this convention: ProteinName_x_TemplatePDBChain_y.pdb where x is the round number of PSI-BLAST run and y is the fragment number of the query sequence that is aligned with that particular template PDB. This file is first generated after the side chains are built with SCWRL3. It is subsequently overwritten by Loopy output after each loop is built. When all loops are built, this file will contain the final homology model.
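When several models are produced from different rounds and templates, the naming convention above makes it possible to build and parse file names in scripts. The following sketch assumes that the template PDB code and chain identifier form a single token without underscores (for example 1abcA); this detail, and both function names, are assumptions rather than documented MolIDE behavior.

```python
from typing import Tuple


def model_filename(
    protein_name: str, psiblast_round: int, template_chain: str, fragment: int
) -> str:
    """Compose ProteinName_x_TemplatePDBChain_y.pdb from its four parts."""
    return f"{protein_name}_{psiblast_round}_{template_chain}_{fragment}.pdb"


def parse_model_filename(filename: str) -> Tuple[str, int, str, int]:
    """Split a model file name back into its four parts.

    Splitting from the right lets the protein name itself contain underscores.
    """
    if not filename.endswith(".pdb"):
        raise ValueError("expected a .pdb file name")
    stem = filename[: -len(".pdb")]
    protein_name, round_text, template_chain, fragment_text = stem.rsplit("_", 3)
    return protein_name, int(round_text), template_chain, int(fragment_text)
```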
18 | It may be desirable to remodel the side chains in the presence of a ligand using SCWRL on the command line. MolIDE creates a second sequence file when the “Copy backbone” command is given. This sequence file contains the complete target sequence over the region of the target-template alignment. So once loops are built, this sequence file can be used as input to SCWRL. To perform this step, first copy and paste the ligand coordinates of interest from the template PDB file produced by MolIDE into a new file, called a frame file (also see SCWRL instructions above). The sequence file has extension .s3seqall. To remodel the side chains with SCWRL, type this command in the console window (on Windows) or the terminal window (on Linux):
scwrl_path\scwrl3.exe -i inputpdbfile -o outputfile -f framefile -s file.s3seqall > logfile
scwrl_path/scwrl3 -i inputpdbfile -o outputfile -f framefile -s file.s3seqall > logfile
where framefile contains the ligand coordinates. An example of this is shown in Figure 6.
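Copying the ligand coordinates into a frame file by hand can be replaced by a short script. The sketch below assumes that the ligand is stored as HETATM records in the template PDB file and that it can be selected by its three-letter residue name, which occupies columns 18-20 of a PDB-format line; the function name is hypothetical. Whether a given ligand is stored this way should be checked in the file itself.

```python
from typing import List


def extract_ligand_records(pdb_text: str, residue_name: str) -> List[str]:
    """Return the HETATM lines whose residue name matches residue_name."""
    wanted = residue_name.strip().upper()
    records = []
    for line in pdb_text.splitlines():
        # PDB format: record type in columns 1-6, residue name in columns 18-20.
        if line.startswith("HETATM") and line[17:20].strip().upper() == wanted:
            records.append(line)
    return records
```

Writing the returned lines to a new file, one per line, gives a file that can be passed to SCWRL with the -f flag as in the commands above.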
To give a brief idea about the length of the homology modeling procedure with SCWRL and MolIDE, we list below the number of minutes required in each step. The estimated time is based on our modeling experience on a machine with an AMD Dual Core Processor and 2GB RAM, given a query sequence of 200 ~ 500 residues.
Step 1: 2 ~ 5 minutes.
Step 2: 1 ~ 2 minutes.
Step 3: < 1 minute.
Step 4: 1 ~ 2 minutes.
Step 5: 2 ~ 5 minutes.
Step 6: 1 ~ 2 minutes.
Step 7: 1 ~ 5 minutes.
Step 8: 20 ~ 40 minutes.
Step 9: < 1 minute.
Step 10: 10 ~ 20 minutes.
Step 11: < 1 minute.
Step 12: 1 ~ 2 minutes.
Step 13: 1 ~ 2 minutes.
Step 14: 5 ~ 10 minutes.
Step 15: < 1 minute.
Step 16: 1 ~ 2 minutes.
Step 17: 1 ~ 10 minutes.
Step 18: 1 ~ 2 minutes.
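Adding up the per-step estimates gives an overall range. The short sketch below does the arithmetic, treating “< 1 minute” as a range of 0 to 1 minutes (an assumption made only so the values can be summed). The total for all 18 steps comes to roughly 48 to 113 minutes; Steps 9 to 18, from preparing the target sequence to the final model, account for about 20 to 51 minutes of that.

```python
from typing import List, Tuple

# (minimum, maximum) minutes for Steps 1-18, taken from the timing list;
# "< 1 minute" is represented as (0, 1).
STEP_MINUTES: List[Tuple[int, int]] = [
    (2, 5), (1, 2), (0, 1), (1, 2), (2, 5), (1, 2),
    (1, 5), (20, 40), (0, 1), (10, 20), (0, 1), (1, 2),
    (1, 2), (5, 10), (0, 1), (1, 2), (1, 10), (1, 2),
]


def total_minutes(steps: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Sum the lower and upper estimates over the given steps."""
    return sum(low for low, _ in steps), sum(high for _, high in steps)
```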
The accuracy of protein structure prediction depends critically on sequence similarity between the target and the template. When the sequence identity is higher than 30%, usually most or all of the alignment is correct, and the relative positions of structural elements are therefore reliable. Below 30%, this is no longer true, and there may be significant changes in structure between the template structure and the target structure (if it were known). On native backbones SCWRL3 is able to predict about 83% of side chains with the first dihedral angle of the side chain (χ1) within 40° of the experimental structure29. At 50% sequence identity between target and template and keeping conserved side chains fixed according to the template structure, SCWRL3 predicts about 72% of side chains correctly (unpublished data).
MolIDE provides a simple and fast modeling procedure based on sequence alignments with PSI-BLAST. At sequence identities above 30%, PSI-BLAST alignments are reasonably accurate42. Below this value, there may be poorly conserved regions that are not accurately aligned, or the alignment may not be complete. In this case, it may be advisable to use profile-profile alignment methods to obtain a more accurate alignment. For instance, the FFAS server7 produces more accurate alignments than PSI-BLAST. The alignment that MolIDE produces can then be edited to conform to that provided by FFAS or any other server or program.
Also, MolIDE does not take account of any structural changes, other than side chains and loops, between the target and template. Parts of the modeled structure produced by alignments with large and/or frequent gaps and/or low sequence conservation are therefore quite suspect. At lower sequence identities, the backbone model will not be very accurate. In these cases, SCWRL may predict side-chain conformations with significant steric overlaps with other side chains or the backbone. An energy minimization with CHARMM48 or other programs will remove these steric overlaps, although the resulting model is unlikely to be any closer to the target structure, if it were known. However, sequence conservation may vary substantially in different parts of the alignment. Often a few well-conserved motifs are noticeable in the alignment, and these are likely to be modeled reasonably well based on the template structure.
It is possible that some users may have trouble downloading the database and PDB XML files because of local IT security policies. In the case of the nr database, the FASTA formatted file can be manually downloaded from NCBI by putting this address into a web browser:
The file must first be uncompressed (ungzipped). On Linux, this is accomplished by typing:
gzip -d nr.gz
On Windows, the user must obtain software for uncompressing files such as FreeZip (http://www.versiontracker.com/dyn/moreinfo/win/10360) and follow the instructions in the program.
Once downloaded and uncompressed, the file should be put in the MolIDE database directory, C:\FCCC\MolIDE\db on Windows or molide_path/db on Linux. Then the database can be formatted using the program formatdb distributed with MolIDE from the console window in Windows and a command-line terminal window on Linux:
C:\FCCC\MolIDE\bin_aux\NCBI\formatdb.exe -i nr -t nr
molide_path/bin_aux/NCBI/formatdb -i nr -t nr -o
The PDBAA file can be downloaded using a browser from this address:
and the same procedure followed to format the database for PSI-BLAST.
If running PSI-BLAST fails for some reason, copy the PSI-BLAST command given in the PSI-BLAST runtime window, and paste it into a console window (Windows) or terminal window (Linux), and hit “return.” PSI-BLAST will now run, and any error messages it sends to the window will now be visible. These messages may help to diagnose the problem.
For users at some institutions, the IT security setup may not allow MolIDE to access the PDB's ftp site for the XML files. In this case, the user can go to http://www.rcsb.org, search for the PDB code, and download the “PDBML/XML gz” or “PDBML/XML text” files and place them in the working directory. In this case, clicking on the row number in the PDB Hits Alignment (.pdbout) table will use the manually downloaded XML file instead of going to the ftp server to get it. The rest of the processing is the same.
This work was supported by NIH grants R01-HG02302 and R01-GM84453 (to R.L.D.) and P30-CA06927 to Fox Chase Cancer Center. We thank Mark Andrake and Radka Stoyanova for testing MolIDE 1.6.