|Home | About | Journals | Submit | Contact Us | Français|
Over the last years, large scale proteomics studies have generated a wealth of information of biomolecular complexes. Adding the structural dimension to the resulting interactomes represents a major challenge that classical structural experimental methods alone will have difficulties to confront. To meet this challenge, complementary modeling techniques such as docking are thus needed. Among the current docking methods, HADDOCK (High Ambiguity-Driven DOCKing) distinguishes itself from others by the use of experimental and/or bioinformatics data to drive the modeling process and has shown a strong performance in the critical assessment of prediction of interactions (CAPRI), a blind experiment for the prediction of interactions. Although most docking programs are limited to binary complexes, HADDOCK can deal with multiple molecules (up to six), a capability that will be required to build large macromolecular assemblies. We present here a novel web interface of HADDOCK that allows the user to dock up to six biomolecules simultaneously. This interface allows the inclusion of a large variety of both experimental and/or bioinformatics data and supports several types of cyclic and dihedral symmetries in the docking of multibody assemblies. The server was tested on a benchmark of six cases, containing five symmetric homo-oligomeric protein complexes and one symmetric protein-DNA complex. Our results reveal that, in the presence of either bioinformatics and/or experimental data, HADDOCK shows an excellent performance: in all cases, HADDOCK was able to generate good to high quality solutions and ranked them at the top, demonstrating its ability to model symmetric multicomponent assemblies. Docking methods can thus play an important role in adding the structural dimension to interactomes. However, although the current docking methodologies were successful for a vast range of cases, considering the variety and complexity of macromolecular assemblies, inclusion of some kind of experimental information (e.g. from mass spectrometry, nuclear magnetic resonance, cryoelectron microscopy, etc.) will remain highly desirable to obtain reliable results.
Proteins are the wheels and millstones of the complex machinery that underlies human life. Catalyzing a huge diversity of chemical processes, proteins work in close association with other biomolecules: nucleic acids, sugars, lipids, and other proteins. This huge network of protein interactions enables the cell to respond quickly to changes in the environment, such as temperature, oxygen, or nutrient concentration. However, to fully understand this network, insights at the atomic level are needed.
In the wake of the elucidation of the human genome (1, 2), many structural genomics projects are solving the structures of what is now becoming a considerable fraction of the human proteome (3). These projects are now moving to the next level, which is solving the atomic resolution structures of protein complexes. However, this is a challenge that is considerably greater than obtaining the structures of single proteins. First of all, a protein can take part in 10 interactions on average; thus, the number of complexes is expected to be at least an order of magnitude larger than the proteome, and their composition can even vary over time. Second, associations between subunits in protein complexes are often weak and reversible, which make purification and crystallization difficult. Finally, there are some very well studied classes of interactions, such as enzyme-inhibitor, antibody-antigen, and GTPase-GAP (GTPase-activating protein) interactions, but these classes represent binary interactions between proteins. In contrast, many of the most important functions in the cell are carried out by large, dynamic molecular assemblies, such as the ribosome, the proteasome, the spliceosome, RNA polymerases, and the nuclear pore complex (4, 5). For such assemblies, high resolution methods such as x-ray crystallography and NMR spectroscopy often provide atomic level information at the level of individual subunits or subcomplexes, but they typically encounter difficulties at the level of the full complex.
Fortunately, low resolution information about protein complexes can often be obtained. Affinity purification (6, 7) followed by mass spectrometry is a high throughput technique to study the composition of a complex. However, dissociation inside the mass spectrometer can be a problem for transient or unstable complexes in which case chemical cross-linking can help. Once the composition of the complex is known, there is a variety of experimental techniques available to obtain structural information on the complex. The most detailed information can be gathered by using data obtained from various NMR experiments, for example chemical shift perturbations (8) or residual dipolar couplings (9); unfortunately, NMR is limited to complexes that are fairly small in size, making its applicability in the context of large assemblies less suited. Techniques that provide information about the shape of a protein complex, such as small angle x-ray scattering (SAXS),1 cryoelectron tomography, and single molecule cryoelectron microscopy (cryo-EM), are more suited to characterize large complexes. Unfortunately, all of these techniques suffer from limitations in resolution that are either fundamental or caused by structural heterogeneities of the complex.
A well known approach to obtain information on residues at an interface is site-directed mutagenesis (10). In principle, a loss of binding affinity indicates that the mutated residue mediates the interaction, although the reverse is not true. Also, one must take care of secondary effects, such as unfolding or conformational change caused by the mutation. Apart from that, very detailed information about interface residues can be obtained by extensive mutagenesis experiments, such as alanine scanning and double mutant cycles. Mass spectrometry offers the opportunity to get peptide level or residue level information about protein interfaces by accurate mass measurements of peptides from the protein complex, generated either a priori through proteolytic cleavage, or inside the mass spectrometer (MS/MS). For example, interface residues can be identified as residues that undergo slower hydrogen/deuterium exchange upon complex formation. This process can be monitored at the peptide level by mass spectrometry (or in smaller complexes, at the residue level by NMR), although this method is very sensitive to noise caused by conformational changes upon binding. In the same way, radical probe MS (RP-MS) uses differences in oxidation of residues by hydroxyl radicals generated in the mass spectrometer to identify interface residues. Finally, chemical cross-linking followed by MS can provide direct information about residue contact sites between different binding partners of the complex. Several cross-linking reagents can provide complementary information. However, it has been reported that the cross-linkers may disrupt the structure of the protein complex and that care should therefore be taken to interpret the results (11).
There is a need for computational approaches to translate this low resolution information into atomic resolution models that can provide functional and mechanistic insights. One of the most promising approaches is docking, the prediction of the structure of a complex starting from the free, unbound structures of its constituents. In recent years, docking methods have made much progress in the blind prediction of the structure of protein complexes as seen in the recent rounds of the critical assessment of prediction of interactions (CAPRI) experiment (12, 13). Most docking methods are ab initio, which means that experimental data are not required. However, it is possible in several ab initio methods to use experimentally determined interface residues in the docking: in MolFit (14, 15) and ATTRACT (16, 17), it is possible to up-weight the interaction scores of interface residues; in ZDOCK (18, 19), it is possible to block non-interface residues; and in PatchDock (20, 21), ZDOCK, pyDock (22, 23), and several other methods, it is possible to filter the docking results based on experimental information. Next to purely ab initio approaches, there are also methods that make use of different types experimental information, for example PROXIMO (24), based on RP-MS data, and MultiFit (25), a hybrid fitting/docking approach based on electron microscopy data.
A method that distinguishes itself from the variety of above mentioned docking approaches is HADDOCK (26–28). In HADDOCK, the docking can be driven by a variety of experimental data using information about interface, contacts, and relative orientations inside a complex simultaneously. Originally developed for NMR data, HADDOCK is able to deal with a large variety of experimental data as shown in Table I. Interface residues are defined as “active residues” that are believed to participate in the formation of the interface, and “passive residues” are those that are possibly at the interface; other kinds of data can be entered directly. (See the original HADDOCK studies (26–28) and “Materials and Methods” for more details.) HADDOCK has performed very well in translating these data into structures and structural models. More than 60 Protein Data Bank structures calculated using HADDOCK have been deposited to date as experimental structures in the Protein Data Bank (29). Moreover, HADDOCK has shown a strong performance in CAPRI. Finally, HADDOCK is a general purpose program that can integrate many kinds of data, but even with a single source of data it is able to perform as well as more specialized programs: for example, HADDOCK was able to closely reproduce the NMR-calculated E2A-HPr complex using only chemical shift perturbation data. For the ribonuclease S-protein-peptide complex (Protein Data Bank code 1J80 (30)) for which RP-MS data are available, PROXIMO was able to closely reproduce the crystal structure (root mean square deviation (r.m.s.d.) of the top scoring model from the reference crystal structure is 1.26 Å); using the same data, HADDOCK could get even closer with an r.m.s.d. of only 0.68 Å from the crystal structure (results not shown).
Most docking methods are designed to deal with just two molecules, making their application limited with regard to large macromolecular assemblies. In most programs, multicomponent complexes can be assembled by adding each component one at a time, whereas simultaneous docking of the whole complex is typically not possible. Recently five ab initio docking programs (MolFit (31, 32), ClusPro (33), Rosetta (34), M-ZDOCK (35), and SymmDock (36)) gave birth to specific versions for the prediction of the symmetric multimers. Among these programs, MolFit, ClusPro, and Rosetta perform a rotational/translational search about the proper symmetry axes. These programs can deal with different types of cyclic and dihedral symmetries. Different than the other two, Rosetta is able to assemble complexes having helical and icosahedral symmetries. M-ZDOCK and SymmDock are suited for the prediction of macromolecules with cyclic symmetries. However, the ability to deal with arbitrary large molecular assemblies is currently rare. CombDock (37), which was developed by the team of SymmDock, can build hetero-oligomer complexes, but it does not have a symmetry option. Only HADDOCK can deal with molecular complexes that are hetero-oligomers or homo-oligomers with arbitrary symmetry operators between and within each component.
The flexibility of HADDOCK comes at a price: it requires the user to have the structure calculation program CNS (38) installed and a considerable degree of expertise in its usage and molecular modeling in general, and it requires a cluster of computers. To alleviate this problem and to open up HADDOCK for a wide community, we have recently developed the HADDOCK web server (27). The server offers multiple web interfaces, ranging from very simple and user-friendly to very powerful and flexible, exposing the full range of HADDOCK options to the expert user. However, up until now, the HADDOCK server was unable to deal with more than two molecules. Here we present a novel web interface for multibody docking of complexes. Like the HADDOCK program itself, the server supports the docking of up to six molecules simultaneously; all HADDOCK options, including symmetry restraints, are made available to the user. Even larger assemblies can in principle be modeled if the docking is performed in an incremental way. Here we demonstrate the performance of the multibody server on a small benchmark comprising complexes of various symmetries and increasing numbers of components (from three to five). To drive the docking, bioinformatics interface predictions and/or available experimental information were used. The HADDOCK server is available on line. http://haddock.chem.uu.nl.
HADDOCK uses experimental and/or bioinformatics data to drive the complex formation in silico. The experimental and/or prediction data are used to define active and passive residues. Active residues are described as the identified interface residues, and passive residues correspond to their solvent-accessible neighbors. These are used to define a network of ambiguous interaction restraints (AIRs) between the molecules to be docked. An AIR defines that a residue on the surface of a biomolecule should be in close vicinity to another residue or group of residues on the partner biomolecule when they form the complex. By default, this is described as an ambiguous distance restraint between all atoms of the source residue to all atoms of all target residue(s) that are assumed to be in the interface of the complex (Fig. 1). The effective distance between all those atoms, diABeff, is calculated as follows.
Here NAatom indicates all atoms of the source residue on molecule A, NresB indicates the residues defined to be at the interface of the target molecule B, and NBatom indicates all atoms of a residue on molecule B. The 1/r6 summation somewhat mimics the attractive part of a Lennard-Jones potential and ensures that the AIRs are satisfied as soon as any two atoms of the biomolecules are in contact. The AIRs are incorporated as an additional energy term to the energy function that is minimized during the docking. The ambiguous nature of these restraints easily allows experimental data that often provide evidence for a residue making contacts to be used as a driving force for the docking. As such, the AIRs define a network of restraints between the possible interaction interface(s) of the molecules to be docked without defining the relative orientation of the molecules, minimizing the necessary search through conformational space needed to assemble the interfaces.
The docking protocol in HADDOCK consists of three stages: (i) rigid body energy minimization (it0), (ii) semiflexible refinement in torsion angle space (it1), and (iii) a final explicit solvent refinement (water). In the last two stages, flexible segments are typically defined automatically based on the identified intermolecular contacts. The solutions are ranked at the end of each docking stage based on the following HADDOCK scoring functions.
The weighted parameters that are used in different stages of the scoring are van der Waals (EvdW), electrostatics (EElec), restraint violation (EAIR), desolvation (EDesolv) (39), symmetry restraint energies and buried surface area (BSA). The solutions are clustered using a 7.5-Å cutoff based on their pairwise r.m.s.d. values, and the cluster ranks are determined according to the average energy of the four best structures of each cluster.
HADDOCK can deal with biomolecules having cyclic (C2, C3, or C5) symmetries or any combination thereof. This also allows dealing with dihedral symmetries because dihedral symmetry can be interpreted as a combination of cyclic symmetry pairs (e.g. D2 symmetry is a combination of six C2 symmetry pairs (see Table II)). The symmetry restraints can be applied both within and between molecules. Compared with other docking programs supporting symmetric molecules, the unique characteristic of HADDOCK is that it applies symmetry on the molecules while docking them simultaneously. For the generation of symmetric complexes, two types of restraints should be used in combination: non-crystallographic symmetry (NCS) (40) and distance symmetry restraints (41, 42), both available within CNS.
NCS restraints force two (or more) monomers to be identical without defining any symmetry operation between them. This is achieved through minimization of the following potential energy function.
This energy term is calculated after superposition of the monomers onto the first monomer. In the potential expression, A is the number of atoms, M is the number of monomers, kNCS is a constant, (x′am, y′am, z′am) are the Cartesian coordinates of the ath atom on the mth monomer, and (a, a, a) corresponds to the average position of ath atom with respect to the superimposed coordinates (41). Using NCS restraints in HADDOCK only requires the user to define pairs of segments on which the NCS restraints will be applied. These can belong either to the same molecule or to separate molecules, allowing to define both intra- and intermolecular symmetries. The only requirement is that the number and type of atoms should be identical in both segments.
The implementation of this type of symmetry in HADDOCK is based on the symmetry distance restraints defined by Nilges (41, 42) for the NMR structure calculation of symmetrical dimers. The symmetry is imposed by requiring that pairs of intermolecular distances between all symmetric Cα atoms should have identical values. In the case of a dimer composed of molecules A and B, this condition can be illustrated as follows.
Δ is summed over all distances between Cα atoms of the defined segments. Here the idea is to minimize Δ so that the symmetric distances between the monomers are equal to each other. This is illustrated in Table II. The major advantage of this approach is that it does not require knowledge of the position of the symmetry axis, and it can be applied to different symmetries (C2, C3, or C5 as shown in Table II) and oligomeric proteins. The symmetric pairs should be defined as explained above for NCS restraints.
All test cases, except for the protein-DNA complex, were docked using the multibody web interface of HADDOCK. The procedure followed to dock the protein-DNA complex differs from the generic multibody docking protocol in the sense that two subsequent docking rounds were performed: in the second round custom-built DNA models that captured the conformational changes in the DNA from the first docking are used as starting structures. This approach allows modeling rather large deformations in the DNA and is explained comprehensively in a recent work of van Dijk et al.2
In four of the test cases (Protein Data Bank codes 1QU9, 1OUS, 1VIM, and 1VPN), the interface information was obtained through the consensus interface prediction server CPORT3 using the “very sensitive” option. In the case of Protein Data Bank code 1URZ, a former CAPRI target, we used the same interface definition as was used previously in CAPRI (43). For the protein-DNA complex, Protein Data Bank code 3CRO, sequence conservation and experimental data (mutagenesis and ethylation interference) were used to define the protein-DNA interaction site.
The interface information was converted into AIRs via the setup page of the HADDOCK web site. The generated AIR files together with the input structures were then supplied to the multibody server as an input for the docking. To favor compactness of the solution, center of mass restraints were enabled. For each complex, the proper combination of NCS and symmetry restraints were defined. Sampling of 180° rotated solutions was disabled. The number of structures was increased to 5000, 400, and 400 for it0, it1, and water, respectively. All other parameters were left at their default settings.
The models were evaluated according to the CAPRI criteria (13). For a complex to be classified as acceptable (one star), its interface root mean square deviation (i-r.m.s.d.) from the complex had to be lower than 4 Å, or its ligand r.m.s.d. (l-r.m.s.d.) had to be lower than 10 Å. In addition, the fraction of native contacts (fnat) had to be ≥0.1. For good predictions (two stars), the criteria were i-r.m.s.d. ≤2 Å or l-r.m.s.d. ≤5 Å and fnat ≥0.3. For high quality predictions (three stars), the criteria were i-r.m.s.d. ≤1 Å or l-r.m.s.d. ≤1 Å and fnat ≥0.5. A cluster was considered one/two/three star(s) if at least one of its top four members was of one-/two-/three-star quality or better.
We have compiled a benchmark of six multimer assemblies. The complexes are homomeric with different numbers of components and various symmetries (see Table III). One of them corresponds to a dimeric protein-DNA complex. In four cases, the docking was performed starting from the separated components of the crystal structure (“bound docking”). In one case (1URZ), the starting structures correspond to the dimeric form of the complex, whereas the trimeric form had to be predicted; this complex corresponds to a viral envelope protein that was a target in CAPRI (target 10). For the protein-DNA complex (3CRO), the docking was performed from the unbound conformation of the monomers and a canonical B-DNA model. In summary, our benchmark consists of four bound cases and two unbound cases.
For modeling of the benchmark complexes, we made use of the new multibody interface of the HADDOCK web server. The web server provides a user-friendly interface that gives full control over the various HADDOCK parameters and supports a wide range of experimental restraints (Fig. 2). This interface is freely accessible to non-profit users requiring “guru” access. It allows the simultaneous docking of up to six molecules and supports several types of cyclic symmetries (C2, C3, or C5) and any type of dihedral symmetry that can be expressed as a combination of the available cyclic symmetry pairs (see “Materials and Methods”). Our server is the first to support cyclic and dihedral symmetries at the same time and to allow simultaneous docking of up to six molecules.
The performance of our multibody docking approach was demonstrated for six complexes (Table III) using a combination of experimental and/or bioinformatics predictions. For four of the complexes (1QU9, 1OUS, 1VIM, and 1VPN), active and passive residues were defined based on consensus bioinformatics interface predictions from CPORT (see “Materials and Methods”). For the other two complexes, a combination of experimental and predicted information was used. The list of active and passive residues for each complex is given in Table IV. Using the above information, the HADDOCK multibody server produced and ranked at the top high quality models, demonstrating the excellent performance of our approach. Both the top ranked models and the top ranked clusters according to the HADDOCK score contained at least a medium quality (two-star) prediction (see Table V). Furthermore, analysis of the results showed that the imposed symmetries are fulfilled.
In four cases, bound docking was performed, including a trimer (1QU9), two tetramers (1OUS and 1VIM), and a pentamer (1VPN). For each of them, the first ranked HADDOCK model corresponds to a high quality prediction. Considering the increased docking complexity due to the large interaction space to be sampled in the case of multicomponent systems, this demonstrates an outstanding performance. In the two unbound cases consisting of a CAPRI target and a protein-DNA complex, good results (two-star quality predictions) were also obtained (see Table V). The performance of the protein-DNA docking (3CRO) and the ability of HADDOCK to catch the conformational changes in the DNA demonstrate that the excellent capabilities of HADDOCK are not limited to just protein-protein complexes. An overlay of the top predictions onto their respective reference crystal structures is shown in Fig. 3.
In the structural characterization of biomolecules, most of the developments on the modeling side are limited to rather “small” binary systems often only applicable to proteins. Just a few molecular docking programs can deal with multibody assemblies, and they are generally restricted to the prediction of symmetric homomeric complexes (32–37). So far, HADDOCK (26, 28, 43) is the only molecular docking program that is able to perform simultaneous docking of multibody complexes up to six components. A multibody docking server was recently released on the HADDOCK web portal, allowing the users, through a user-friendly interface, to exploit the full range of experimental data supported by HADDOCK and to fully customize the docking process. The performance of this web server was evaluated in this study against a benchmark set of six multimeric complexes, including a protein-DNA complex. Cyclic or dihedral symmetries, which are present in the large majority of homomers (44), were defined combined with the interface information derived from experimental evidences and/or predictions made by our consensus interface prediction program, CPORT. The results show that HADDOCK is able to generate native to near-native predictions for all cases with i-r.m.s.d. values for the best model ranging between 0.7 and 2.2 Å. Although we could produce excellent results even with the inclusion of bioinformatics predictions, which usually contain a considerable amount of false positives, one should always keep in mind that the information supplied to HADDOCK should be of high quality. This is because the complexity of the interaction space is larger in the case of multibody docking compared with the two-body docking.
The vast quantity of low resolution experimental data that could further be used in HADDOCK paves the route for the prediction of large macromolecular complexes. By combining distance and interaction restraints from low resolution methods with molecular docking, architectural or even atomic models might be generated. These restraints can be derived from a variety of experimental measurements including MS of intact complexes, chemical cross-linking, cryo-EM, SAXS, fluorescence resonance energy transfer, and analytical ultracentrifugation. One very recent addition to this series of biophysical tools is ion mobility separation (IM) coupled to MS (45). IM is an established technique for studying shape and conformation of small molecules and individual proteins. When coupled with MS, mass and subunit composition of a protein complex can be determined simultaneously with its overall topology and shape (46, 47). The cross-sections of amyloid oligomers formed in the early steps of amyloid fibril formation calculated by IM-MS (48) could be used as a restraint in data-driven docking to discriminate between quaternary topologies for a specific oligomeric state. This can be done for example by inferring a radius of gyration restraint from this cross-section measurement or by predicting the cross-sections from the docking models and using the experimental data as a filter.
In today's proteomics era, large scale screening techniques are used to characterize protein-protein interactions (PPIs) in vivo (49, 50). Despite the massive number of interactions detected by protein complex purification techniques using MS (originally either by high throughput MS protein complex identification (51) or by high throughput mass spectrometry protein complex identification coupled to tandem affinity purification (52), systematic yeast two-hybrid screening (53–55), complementary mapping techniques (e.g. protein fragment complementation assay (56), and in vitro proteome chip screening (57))), the interactome coverage remains low, roughly 50 and 10% for the yeast and human interactomes, respectively (58). This becomes apparent in the rather limited overlap between various data sets obtained with different approaches (59). This can be explained by a limit in proteome coverage (up to 70% for the best approaches) and by the inherent high fraction of false positives (previous estimations mention that more than half of all current high throughput data are spurious (59)). It also highlights the difficulties encountered by some methods for certain types of interactions, strengthening the complementarities between the different techniques. Finally, proteomics data sets derived to map PPIs, even when a similar detection method is used (7, 60), have a limited overlap (only 18%) (61).
Computational methods to predict protein assemblies could in principle play a complementary role in the study of interactomes, providing additional insights with leverage of the structural models. But can present scoring functions used in protein-protein docking methods characterize the binding affinity of a macromolecular complex, a requisite to predict interactomes? To answer this question, we have tested nine of the currently best performing scoring functions against a large set of high quality binding affinity data derived from the literature (62). The results (data not shown) reveal that scoring is orthogonal to binding affinity prediction (the highest calculated r2 was 0.09!) even though scoring functions are successfully being used in discriminating native structure from decoys. Hence, even if structural modeling tools and molecular docking approaches can significantly improve the selection accuracy of PPI networks (63), these computational methods need to be optimized for both purposes, e.g. annotation and prediction of PPIs.
By combining a variety of experimental approaches, one can easily increase our knowledge about biologically relevant interaction (64). The experimental information can guide large scale docking studies to upgrade the information contained in interactome maps by adding the three-dimensional structural dimension to the PPIs. Moving toward systems biology, computational methods could aim at predicting how the proteome is wired and how dynamic changes in the interactome occur in response to different environmental factors. In that regard, mass spectrometry techniques that determine the composition and stoichiometry of macromolecular complexes will be of indispensable value.
But how far are we from a high throughput method to screen for protein complex structures? Recently, we have linked HADDOCK to MTMDAT, an automated software for the analysis of mass spectrometry data (65), creating effectively a pipeline for high throughput, MS-based structural modeling of complexes.4 This pipeline allows feeding automatically into HADDOCK the interface information identified by MS from digestion experiments. This is only one example of how experiments and modeling can be coupled, and we expect that many other related applications will be developed in the future to open the route to large scale annotation of interactomes.
The Dutch BiG Grid project with financial support from the Netherlands Organization for Scientific Research (NWO) is acknowledged for the use of the computing and storage facilities. We thank Dr. Marc van Dijk (Utrecht University) for providing the data for the protein-DNA complex discussed in this work.
* This work was supported by the Netherlands Organization for Scientific Research (VICI Grant 700.56.442 to A. B.) and the European Community (FP6 integrated Project SPINE2-COMPLEX Contract 032220 and FP7 e-Infrastructure “e-NMR” I3 project, Grant 213010).
2 M. van Dijk and A. M. J. J. Bonvin (2010). Pushing the limits of what is achievable in protein-DNA docking. Benchmarking HADDOCK's performance. Nucl. Acid Res., in press (2010).
3 S. de Vries and A. M. J. J. Bonvin, submitted manuscript.
4 J. Hennig, S. J. de Vries, K. D. M. Hennig, M. Sunnerghagen, and A. M. J. J. Bonvin, manuscript in preparation.
1 The abbreviations used are: