|Home | About | Journals | Submit | Contact Us | Français|
Over the last fifteen years, the role of NMR spectroscopy in the lead identification and optimization stages of pharmaceutical drug discovery has steadily increased. NMR occupies a unique niche in the biophysical analysis of drug-like compounds because of its ability to identify binding sites, affinities, and ligand poses at the level of individual amino acids without necessarily solving the structure of the protein-ligand complex. However, it can also provide structures of flexible proteins and low-affinity (Kd > 10-6 M) complexes, which often fail to crystallize. This article emphasizes a throughput-focused protocol that aims to identify practical aspects of binding site characterization, automated and semi-automated NMR assignment methods, and structure determination of protein-ligand complexes by NMR.
The incorporation of structure-based knowledge in drug design has led to a three-fold increase in identification of high potency compounds (IC50 < 100 nM) from fragment leads (Hajduk and Greer, 2007). Nuclear magnetic resonance spectroscopy (NMR) is an invaluable tool for drug discovery, particularly with the growth of fragment-based screening in the pharmaceutical industry. The SAR (structure-activity relationships) by NMR technique (Shuker et al., 1996) refocused the niche of NMR in drug design from structure determination to characterization and optimization of lead compounds (Peng et al., 2001). With steady progress in production of isotope-labeled proteins (Frederick et al., 2007; Jeon et al., 2005; Tyler et al., 2005a; Tyler et al., 2005b; Zhao et al., 2004), instrument sensitivity (Hajduk et al., 1999; Jensen et al., 2010), and methods for rapid data collection and analysis (Frydman et al., 2002; Rovnyak et al., 2004a; Rovnyak et al., 2004b; Schanda and Brutscher, 2005), NMR is being utilized earlier in the process of identifying lead molecules because of its ability to quickly recognize and characterize ligand binding regardless of target protein function (Hajduk and Greer, 2007).
This chapter presents a workflow for spectroscopists interested in examining protein-ligand interactions with an emphasis on identifying the binding site and solving the structure of the complex. We first present an efficient process for spectral acquisition, processing, and assignment of a target protein or complex. We then describe an approach to protein-based compound screening that includes experimental validation of in silico docking results and quantitative affinity measurements to prioritize further studies. Finally, we detail how to solve the structure of protein-ligand complexes. Various approaches toward completely automating NMR structure determination have been recently reviewed and discussed and are beyond the scope of this chapter (Baran et al., 2004; Guntert, 2009; Markley et al., 2009a; Williamson and Craven, 2009). Instead, we focus on the semi-automatic approach used in our laboratory to determine protein structures and on its application to solving protein-ligand complexes.
Before NMR data collection begins, 15N-labeled proteins must first be screened by 1D and 2D NMR to assess folding, aggregation state, stability, ligand binding, and to detect unstructured regions (see the chapter in this volume for information on NMR sample preparation methods). If necessary, individual protein domains can be isolated from other regions of the protein to increase solubility and reduce aggregation (Peterson et al., 2007; Waltner et al., 2005). Proteins that are judged amenable to NMR structure determination are then produced in [U-15N,13C]–labeled form and put into a general workflow for NMR-based structure determination as outlined in Figure 1.
A standardized scheme for acquisition of 2D and 3D spectra sufficient for NMR structure determination is outlined in Table I. On a 600 MHz spectrometer equipped with a cryogenic probe, these experiments can be acquired in 7-10 days for proteins under 25 kDa at concentrations > 0.5 mM. This time is significantly shortened (to as little as 30 h) if the protein concentration is high enough that data can be recorded without signal averaging or using nonlinear sampling methods. Continuous data collection on a single sample reduces spectral variations, resulting in improved peak picking and automated chemical shift assignment later on. In the case of protein-ligand complexes, several additional experiments, outlined in Section 2.5, are necessary for ligand assignment. Additionally, intermolecular distance constraints for protein-ligand complexes are determined from one additional experiment, the 3D F1-13C/15N-filtered/F3-13C-edited NOESY-HSQC. This experiment selects for (edits) protons attached to 13C in the F3 dimension, and selects against (filters) protons bound to 13C and 15N in the F1 dimension. Thus, if unlabeled ligand is bound to labeled protein, this experiment will yield only intermolecular NOEs between protein and ligand protons.
As outlined in Table I, all data for the assignment and structure analysis are acquired in three blocks, starting with the NOESY spectra, followed by backbone triple-resonance experiments and then TOCSY-based experiments that provide sidechain assignment information. 1D 1H and 2D 15N–1H HSQC spectra are collected before and after every 3D experiment to provide a record of sample and instrument stability over the data collection period. Standard data acquisition parameters, including up-to-date pulse calibrations and spectral parameters, are stored in a common repository on each NMR instrument, so that data acquisition for each new protein can be started rapidly after only a few routine steps (VT adjustment, 2H lock, 1H/15N/13C tuning, shimming, 1H 90° pulse calibration). Accurate chemical shift referencing in the direct and indirect dimensions using 4,4-dimethyl-4-silapentane-1-sulfonic acid (DSS) is important for consistency between spectra (Markley et al., 1998). Standard 2D and 3D pulse sequences are provided in pulse sequence libraries from the NMR spectrometer manufacturers and are also available from the BioMagResBank (BMRB) website (http://www.bmrb.wisc.edu/tools/choose_pulse_info.php).
As with data collection, we use a standardized set of scripts to convert and process the time-domain NMR data using NMRpipe (Delaglio et al., 1995). Conversion of the Bruker-formatted data to NMRpipe format is accomplished with the bruk2pipe script, supplied with NMRpipe. For processing, we typically zero-fill the direct 1H dimension to 1024 complex points and apply both forward-backward linear prediction and zero-filling in each indirect dimension to 128 or 256 complex points. In depth analysis of spectral processing is beyond the scope of this text and interested readers are directed to Hoch and Stern's treatise (Hoch and Stern, 1996). After transformation, the resulting frequency-domain spectra are converted to XEASY format (Bartels et al., 1995); however NMRpipe is compatible with any of the other commonly used spectral analysis programs such as Sparky (Goddard and Kneller, 2001), CARA (Keller, 2004/2005), or NMRView (Johnson and Blevins, 1994).
Reduced dimensionality methods for data collection encode multiple indirect frequencies simultaneously rather than independently, enabling the acquisition of four- or higher-dimensional spectra in a short time and reducing the number of necessary experiments for structure determination. Examples include the G-matrix Fourier transform (GFT) method (Kim and Szyperski, 2003; Shen et al., 2005) and HIFI-NMR (Eghbalnia et al., 2005). Non-uniform sampling (NUS), in which data is collected for only a subset of all incremented evolution periods at non-fixed intervals, is another approach to decreased data collection times that is especially promising for larger proteins(Barna et al., 1987; Rovnyak et al., 2004a; Rovnyak et al., 2004b). The recently released TopSpin 3.0 Bruker software package incorporates NUS as part of a routine workflow, including optimized sampling schemes and data processing via Multi-Dimensional Decomposition (Orekhov et al., 2003). Alternatively, processing via maximum entropy reconstruction (Mobli et al., 2007) has been made more accessible to the general protein NMR community by use of the Rowland NMR Toolkit Script Generator, available as a web server at: http://sbtools.uchc.edu/nmr/nmr_toolkit/.
Once all 2D and 3D data have been processed, the position and intensity of peaks in each experiment must be recorded for submission to an automated chemical shift assignment protocol. Peak picking is one step for which there are relatively few reliable automated programs. As has been discussed by others this is due to issues with noise, artifacts, and strong peak overlap (Altieri and Byrd, 2004; Williamson and Craven, 2009). Because accurate peak picking is crucial for attaining good results with automated assignment programs, most labs traditionally use manual or semi-automated methods assisted by software such as NMRView (Johnson and Blevins, 1994), XEASY (Bartels et al., 1995), or Sparky (Goddard and Kneller, 2001). AUTOPSY (Koradi et al., 1998) is one of the few commonly used automated peak picking programs; however, even with this sophisticated algorithm, peak lists may still require further editing especially in regions of overlapped peaks (Malmodin et al., 2003). It remains to be seen whether recent developments such as the PICKY automated peak picking program (Alipanahi et al., 2009b) or more error-tolerant backbone assignment programs such as MARS (Jung and Zweckstetter, 2004) and IPASS (Alipanahi et al., 2009a) will have an effect on currently established peak picking methods.
Our method for picking peaks of the 3D heteronuclear spectra is a semi-automated approach that uses the automated peak picker of XEASY. The basic steps of our peak picking protocol are outlined below:
With the advent of numerous semi-automated and automated assignment protocols over the last decade, the backbone resonance assignment process, utilizing standard triple-resonance NMR data, has become a relatively routine process in many laboratories. What once took weeks can now be completed in a day or less for proteins under 25 kDa. Automated resonance assignment programs include GARANT (Bartels et al., 1996; Bartels et al., 1997), AutoAssign (Zimmerman et al., 1994; Zimmerman et al., 1997), MARS (Jung and Zweckstetter, 2004), ABACUS (Grishaev et al., 2005b), MATCH (Volk et al., 2008), and PINE (Bahrami et al., 2009), as well as many others which have been previously reviewed (Altieri and Byrd, 2004; Baran et al., 2004; Grishaev et al., 2005a). The programs PINE and AutoAssign have the additional advantage of being readily available as web-based servers: http://pine.nmrfam.wisc.edu/ and http://nmr.cabm.rutgers.edu/autoassign/, respectively. Furthermore, the AutoAssign web server has the option to submit jobs simultaneously to PINE, the results of which can be shown in a comparison chart where the user can easily identify resonance assignment differences between the two programs.
Much of our past work has utilized GARANT for backbone assignments, but PINE is a convenient new alternative that generates equally good results. PINE accepts peak lists from up to 13 3D heteronuclear experiments in either XEASY or Sparky format. The output of PINE is a probabilistic assignment for every residue. This means that it is possible for some residues to have more than one candidate spin system for their chemical shifts. In these cases, an associated probability is assigned to each candidate spin system assigned to a residue. Assigned XEASY peak lists are obtained using PINE as follows:
In contrast to the backbone assignments, side chain assignments typically require greater manual intervention due to incomplete data and degeneracy of aliphatic chemical shifts. With ideal C(CO)NH, H(CCO)NH and HCCH-TOCSY peak lists, both PINE and GARANT are capable of generating automatic sidechain assignments. In reality, however, at least 90% completeness of ordered regions is required for automated structure calculations and, therefore, nearly complete assignments require manual analysis (Jee and Guntert, 2003). First, the partial aliphatic sidechain assignments obtained from PINE are verified. From these assignments, the side chain 1H-13C crosspeaks in the patterns expected for an HCCH-TOCSY spectrum can be generated using GARANT. This peak list forms the starting point for manual completion of all aliphatic sidechain assignments; depending on the completeness of HCCH-TOCSY assignments the manual processing can be completed within one day. Finally, aromatic side chain assignments are completed manually using the 13C aromatic NOESY and the previously determined aliphatic side chain assignments.
The assignment strategy for small molecules is complicated by the lack of isotopic enrichment of 15N and 13C. In the free state, assignment of all protons in the small molecule can be completed using a set of 2D homonuclear spectra that include 2D 1H-TOCSY, 2D 1H-NOESY and/or 2D 1H-ROESY. However, ligand binding to its target protein induces chemical shift changes in both the protein and ligand with varying magnitudes that are dependent on the local chemical environment. To assign an unlabeled ligand in the bound state, one can use a 2D F1/F2-13C/15N-doubly-filtered NOESY spectrum (Ikura and Bax, 1992) as previously described (Alam et al., 1998; Peterson et al., 2010). In the doubly-filtered NOESY spectrum, only NOE cross peaks involving 12C-attached protons as the source and destination are present in the final spectrum. Since 12C-attached protons only occur within the small molecule, the doubly-filtered NOESY spectrum will only show intramolecular NOEs for the ligand. These NOEs can then be used to sequentially assign the proton shifts for the small molecule.
Because the completeness and accuracy of the chemical shift assignments is crucial for the robustness of the subsequent steps of structure determination, the assignments should be checked for errors before proceeding with structure determination. TALOS and CYANA, programs described in Sections 4.2.1 and 4.2.2, automatically check for chemical shifts that are significantly outside the typical range for the particular atom/residue and also report statistics on the completeness of the assignments. A variety of other programs available for validating chemical shift referencing and/or assignments have recently been evaluated (Wang et al., 2010).
Fragment-based drug design is most effective for identifying relatively weak (Kd > 10-6 M) compounds with the potential for development into high-affinity molecules. NMR is an optimal technique to test the quality of these fragment leads because it can provide atom-specific information that is most sensitive to low-affinity interactions. A basic outline of how NMR can contribute to the rational drug development process is outlined in Figure 2. Once the chemical shift assignments of the target protein are known (see Section 2), [U-15N,13C]-enriched protein can be titrated with a putative ligand or binding partner and the resulting changes in chemical shift monitored by 2D correlation spectroscopy. This procedure is called chemical shift mapping and can be used to determine binding constants, stoichiometry, and ligand position. 2D 1H-15N heteronuclear single-quantum coherence (HSQC) spectroscopy is the most common method for mapping backbone chemical shifts because every residue type (except proline) contains a highly sensitive probe to changes in local magnetic environment. HMQC(Mandal and Majumdar, 2004) and TROSY (Pervushin et al., 1997) are alternative pulse schemes that provide advantages in certain cases and supply equivalent spectral information. As most protein-ligand interactions are sidechain mediated, 2D 1H-13C HSQCs can provide additional information, but the sidechain signals suffer from overlap in large target proteins. Thus selective 13C-labeling of Ile, Leu, and Val residues can significantly improve resolution without compromising sensitivity (Meyer and Peters, 2003).
Titrations are typically performed two ways: 1) add known ligand volumes to a target-protein sample, or 2) prepare multiple samples with a constant protein concentration and increasing ligand concentrations. The former method consumes substantially less protein, but suffers from less accurate binding constants due to changing protein concentrations. Using cryoprobe technology without a sample-handling robot, three unique compounds can typically be analyzed per day using this scheme:
NMR validation can be the most time-consuming aspect of screening large numbers of compounds. To improve throughput, compounds can be grouped together and simultaneously screened by NMR. The most promising mixtures are then analyzed individually. Alternatively, “chromatographic” NMR experiments have been described to use 2D diffusion information to identify high-affinity small molecules from mixtures (Gonnella et al., 1998; Gounarides et al., 1999; Hajduk et al., 1997a; Hajduk et al., 1997b). The gain in efficiency of mixture analysis strategies is difficult to assess; Ross and colleagues argue the high binding success rate (~1:3) of compounds suggests the extra effort spent to deconvolute spectra should be redirected to initially preparing more reagents for individual analysis (Ross et al., 2000). Automated sample handling and rapid 2D NMR acquisition schemes, such as SOFAST-HMQC (Schanda and Brutscher, 2006; Schanda et al., 2005) and single-scan spatial-encoding (Frydman et al., 2002; Gal et al., 2006), can be expected to increase the throughput of NMR-based ligand screening in the near future.
As many small molecules are hydrophobic, the solubility of ligands is often improved by reconstitution in an organic solvent; thus, the solvent itself may interact with the target protein. These interactions may be non-saturable but can produce perturbations of substantial magnitude, and may even reveal the locations of druggable sites in the protein (Byerly et al., 2002; Dalvit et al., 1999; Liepinsh and Otting, 1997). Solvent perturbations are relatively simple to detect by performing a solvent-only titration for comparison with spectra reporting on candidate ligands (Fig. 3B).
Another facet of non-specific binding is the orientation or mode that a compound adopts when associated with a target protein. Lead compounds that exhibit multiple-mode binding should be modified to maintain a single-mode prior to optimization. Analysis of titration lineshapes can quickly identify binding modes (Reibarkh et al., 2006); single-mode ligands with micromolar affinity should produce line-broadening at substoichiometric concentrations that sharpen upon saturation. Protein resonances that do not sharpen generally reflect multiple-mode ligand binding. However, this does not preclude the possibility that ligands can modulate the oligomeric state (Veldkamp et al., 2005) or conformation of the protein (Vajpai et al., 2008).
As chemical shift perturbations can result from both ligand-binding and ligand-induced conformational changes, discrimination between direct and indirect effects is crucial to effective chemical shift mapping. Residues with the largest changes in chemical shift are presumed to be specifically involved in the binding interaction as ligand-induced conformation changes are rare in low-affinity interactions (Cavanagh, 2007). However, the binding site of ligands interacting with target proteins with large conformation changes can be indentified by the use of differential chemical shifts. Fesik and colleagues outline a protocol to compare chemical shift perturbations of highly similar compounds (Medek et al., 2000). The chemical shift changes resulting from conformation change should be similar while ligand-induced perturbations will differ by chemical uniqueness. We performed an similar analysis of differential chemical shifts between sulfated and unsulfated peptides to identify the sulfotyrosine recognition site of CXCL12 in the absence of explicit structural data (Veldkamp et al., 2006).
The determination of binding constants during compound screening is crucial to ranking compounds for optimization and lead selection. The ability to quantitate affinity by NMR is highly dependent on the binding kinetics. The following is a general outline of the theoretical principals for determining binding reactions on the NMR time scale [for detailed texts on NMR kinetics, see (Cavanagh, 2007; Levitt, 2008)]. Assuming a simple two-state equilibrium, for a protein (P) binding to a ligand (L) to form a complex (PL), with second-order binding kinetics,
where kon and koff are the rate constants of association and dissociation, respectively. The rate constant for exchange (kex), or the chemical exchange lifetime (τex = 1/kex), is calculated as
Comparison of the exchange rate constant (kex) with the difference in angular frequency (Δω) between an unbound (ωP) and bound (ωPL) resonance,
categorizes the binding interaction into three basic exchange regimes:
The particular exchange regime a binding reaction belongs to can be readily identified by observation of each resonance throughout a titration; it is important to note that each atom can exhibit its own rate of exchange for a particular binding reaction.
Protein resonances in fast exchange do not change significantly in intensity and continuously exhibit chemical shift changes through the titration. Fast-intermediate exchange is similar to fast-exchange with the exception of diminished intensity and linewidth as the chemical shift changes from unbound and bound. Resonances that significantly diminish, possibly below detection, and then reappear at a different chemical shift with increasing ligand concentrations are in intermediate exchange. In practice, resonances in intermediate exchange may not reappear because it is not possible to get high enough ligand concentrations depending on solubility and protein concentration. Slow exchange is characterized by a resonance that decreases in intensity but does not change chemical shift with greater ligand concentration; at the same time, a resonance with a unique chemical shift will begin to increase in intensity. Visual inspection of binding kinetics provide a qualitative interpretation of binding affinity as fast exchange is weak (Kd > 10-6 M), intermediate exchange is generally between 10 and 100 μM, and slow exchange is strong (Kd <10-6 M).
To facilitate high-throughput quantitative analysis we will only consider binding reactions in fast exchange. We presume this is valid because most compounds screened will have affinities weaker than 10-6 M; however, to minimize error the need to ensure fast-exchange binding kinetics cannot be understated. Interested readers are directed toward excellent reviews on quantitating slow and intermediate exchange kinetics (Feeney et al., 1979; Fielding, 2003). This is especially important during lead optimization as compounds are expected to possess high affinities. HSQC chemical shift changes are quantified for each titration point using an adjusted chemical shift value (Δ) of the form,
in which ΔδH and ΔδN are the chemical shift change of amide proton and nitrogen, respectively. Many values for the scaling factor, c, have been proposed ranging from 0.1 to 0.25 (Geyer et al., 1997; Mulder et al., 1999); we typically employ a value of 0.2 to balance the contributions of 1H and 15N to the magnitude of the perturbation. More complex formulations that use scaling factors specific to each amino acid have also been proposed (Geyer et al., 1997; Schumann et al., 2007). A similar range of scaling factors has been proposed for the measurement of combined 1H-13C shift perturbations (Schumann et al., 2007). The increase in adjusted chemical shift perturbation as a function of ligand concentration is then fitted by non-linear regression using an equation that accounts for ligand depletion, single-site binding, and non-specific binding:
in which Δ is the adjusted chemical shift change, Δmax is the maximum adjusted chemical shift change at saturation, [PT] is total protein concentration, Kd is the equilibrium dissociation constant, [LT] is the total ligand concentration, and NS is non-specific binding.
The chemical features of a protein-ligand interface that are responsible for a given biological function define a pharmacophore. They include the hydrogen bond vectors, hydrophobic volumes, and steric properties of active ligands that identify potential inhibitors. It is important to note that pharmacophores do not correspond to common functional groups, and therefore occupy a unique niche in structure-based drug design. A pharmacophore model is often used as a pre-filter for in silico screening because it can be evaluated more rapidly than explicit docking and may be less prone to susceptible to false positive hits. In a ligand-based computational process, the pharmacophore definition can be enhanced by NMR titration and structure data. Specifically, using the method for comparative chemical shift perturbation analysis (Medek et al., 2000) outlined in Section 3.2, one can identify the binding site of a series of related lead compounds. Once the binding modes of several compounds have been determined by comparative analysis, the pharmacophore elucidation can be performed by any number of programs [as reviewed by (Leach et al., 2010)]. Structural knowledge of a protein provides invaluable information of exclusion regions (Greenidge et al., 1998; Rella et al., 2006; Tintori et al., 2008); knowledge of binding site volumes accelerates compound docking considerably as only molecules that will fit are examined.
Both traditionally druggable proteins (enzymes), with distinct binding pockets, and protein-protein interfaces, with relatively flat surfaces, are being successfully targeted for small-molecule inhibitors (Veldkamp et al., 2010; Wells and McClendon, 2007). An integral part of rational drug design is identification of biologically relevant binding surfaces or epitopes. When target protein structures are known, knowledge of epitopes are invaluable for directed compound docking and reducing the computational load of screening chemical libraries. Traditionally, epitope definition utilized structures solved by crystallography; however, NMR is an established alternative that is especially suited for solving weak complexes that are frequently not crystallizable. When full structure determination of the complex is not feasible, NMR can still yield epitope information from HSQC titrations. In the absence of high-resolution structural data, ligand binding epitopes can be discerned through site-directed mutagenesis or antibody inhibition.
A variety of docking programs have been described for computational screening of chemical libraries, including AutoDock (Morris et al., 1998), DOCK (Ewing et al., 2001), eHiTs (Zsoldos et al., 2006), FlexX (Rarey et al., 1996), and Fred (McGann et al., 2003). In collaboration with other experts, we have successfully implemented epitope-directed compound screening against cytokine-receptor signaling systems using both the DOCK and AutoDock packages. These programs can be used for screening of both private and public chemical libraries. For example, we recently used the NMR structure of the chemokine CXCL12 in complex with a 38-residue peptide of the CXCR4 receptor (Veldkamp et al., 2008) to identify a small molecule ligand for CXCL12 that inhibits CXCR4-mediated calcium flux (Veldkamp et al., 2010). Specifically, we used the CXCL12-binding site of two CXCR4 residues to define an epitope. Using DOCK, we targeted 1.4 million compounds from the ZINC database (Irwin and Shoichet, 2005), a free virtual library of commercially available small molecules, at this site. In an NMR-based assay of the top-ranked molecules, four of the five tested compounds bound specifically to the targeted epitope, including one with a Kd value of 64 μM. The success of this study and others (Arkin and Wells, 2004; Berg, 2008; Fuller et al., 2009) demonstrates the value of NMR binding data to identify epitopes for virtual screening within the rational drug development process.
Solution NMR structure calculations rely mainly on two types of geometric constraints: (1) distance constraints based on analysis of 3D 15N- and 13C-edited NOESY spectra and (2) dihedral angle constraints predicted from chemical shifts. When disulfide bonds, bound metal ions or other cofactors of known geometry are present, additional constraints are typically included in the form of generic upper and lower distance limits to enforce the covalent crosslinks or coordinating bonds. The use of other restraints, such as residual dipolar couplings (RDCs) and pseudocontact shifts, can be directly incorporated into CYANA 3.0, an updated version of the CYANA structure determination program (Herrmann et al., 2002). However, in our experience these constraint types are not essential to obtain high quality structural models for most diamagnetic globular domains under ~25 kDa.
After the chemical shift assignments have been completed, constraints for the backbone and ψ dihedral angles are generated using the program TALOS+ (Shen et al., 2009). Alternatively, 3J-scalar coupling measurements can be used to estimate dihedral angles from empirically-defined relationships (Karplus, 1959); however, our streamlined approach employs chemical shift-derived dihedral angle constraints as they can be are obtained without the need for additional spectral acquisition. TALOS+, like the original program TALOS (Cornilescu et al., 1999), compares the HN, N, C′, Hα, Cα, and Cβ chemical shifts from the protein against a database of chemical shifts and backbone geometry in tripeptide segments from known protein structures. The TALOS+ database currently contains 200 proteins. As with the original TALOS program, the ten best database matches for each residue (based on up to 18 individual chemical shifts: six atoms each from three amino acids) are compared to see if they cluster in a consistent region of the Ramachandran ( vs. ψ angle) plot. TALOS+ utilizes two additional parameters for making predictions. The first is a neural network component whose output is used as an additional empirical term in the conventional TALOS database search. The second is an estimated backbone order parameter S2 derived from the chemical shifts using the random coil index method (RCI; (Berjanskii and Wishart, 2005)), which biases against the prediction of backbone dihedral angles in flexible regions. An interface for inspecting and modifying the results is provided, after which the acceptable predictions are converted into a dihedral angle constraint file for use in structure calculations (Cornilescu et al., 1999). TALOS+ also provides an additional feature to pre-check chemical shift referencing and possible chemical shift errors. Tabular chemical shift data in the TALOS input format is easily generated either by using the CYANA macro ‘taloslist’ for chemical shifts in the CYANA/XEASY(.prot) format or with the bmrb2talos script supplied with the TALOS+ software. We have further simplified the data conversion and TALOS+ calculation process with an in-house script (available upon request).
Distance restraints derived from proton-proton nuclear Overhauser effects (NOEs) are the primary NMR data used to define the secondary and tertiary structure of a protein. NOEs develop due to through-space dipole-dipole interactions rather than through-bond interactions. Thus, they contain information on the distances between hydrogen atoms separated by 5 Å or less in space even though the residues may be distal in primary sequence. The intensity of a NOE, that is the volume of the corresponding crosspeak in a NOESY spectrum, is proportional to the inverse of the sixth power of the distance between the two nuclei due to averaging caused by rotational motion. However, because the NOE is not always a precise reflection of the internuclear distance, structure calculation programs generally translate NOESY crosspeak intensities into upper bounds on interatomic distances rather than fixed distance restraints.
With the availability of a variety of automated assignment programs, the generation of NOE distance constraints from NOESY spectra is a relatively robust and automatic process, at least for the calculation of initial structures. These structures can then be used, if necessary, to expand and correct the preliminary list of NOE assignments for input into another round of structure calculations. The process continues in an iterative fashion, making adjustments to the input parameters and experimental constraints until an acceptable final ensemble of structures is obtained. Programs used for automated assignment include NOAH (Mumenthaler and Braun, 1995), ARIA (Linge et al., 2003a; Nilges, 1995; Rieping et al., 2007), CYANA (Guntert, 2003, 2004; Herrmann et al., 2002), PASD (Kuszewski et al., 2004; Kuszewski et al., 2008), and AutoStructure (Huang et al., 2005; Huang et al., 2006), all of which have been reviewed previously (Guntert, 2003; Williamson and Craven, 2009). Although the automated NOE assignment methods require NOESY peak lists that are accurately referenced relative to the chemical shift assignments, some programs can deal robustly with noisy peak lists. It has been demonstrated that the NOEASSIGN module of CYANA, version 2.1, can yield reliable NOE distance constraints even when one-half to two-thirds of the crosspeaks in the initial NOESY peak lists are noisy or artifactual (Lopez-Mendez and Guntert, 2006), while PASD can tolerate up to 80% incorrectly picked peaks (Kuszewski et al., 2004).
The NOEASSIGN module of CYANA is our preferred tool for automated NOESY peak assignments. Interproton NOE distance constraints generated by NOEASSIGN are derived from three different 3D NOESY spectra: 15N-edited NOESY-HSQC, 13C-aliphatic-edited NOESY-HSQC, and 13C-aromatic-edited NOESY-HSQC. An additional experiment, the 3D F1-13C/15N-filtered/F3-13C-edited NOESY, is added for protein-ligand structures. Because the NOEASSIGN module can cope with a high number of artifacts in the peak lists, only minor manual editing should be necessary at this stage as long as the NOESY spectral referencing matches that of the 15N-HSQC/HNCO and HCCH-TOSCY spectra used to obtain the chemical shift values. Because automated NOE assignment is inseparable from the structure calculation process, the details are described in Section 4.2.
Over the past six years, we have successfully used NOEASSIGN-generated distance constraints, in combination with dihedral angle constraints, to solve more than 20 unique structures with CYANA (de la Cruz et al., 2007; Lytle et al., 2004; Lytle et al., 2006; Peterson et al., 2007; Peterson et al., 2006; Peterson et al., 2005; Tuinstra et al., 2007; Vinarov et al., 2004; Waltner et al., 2005). This list includes not only single domain proteins but also symmetrical homodimers (Peterson et al., 2010; Peterson et al., 2006) and protein-ligand complexes (Veldkamp et al., 2008). The NOEASSIGN algorithm introduced in CYANA 2.0 is an improved version of the original CANDID algorithm (Herrmann et al., 2002) that is based on a probabilistic treatment of the NOE assignment process. An overall probability of the correctness of possible NOE assignments is calculated as the product of the probabilities for each of the criteria for assignment: agreement between the chemical shift values and the peak positions, compatibility with a preliminary 3D structure, and network-anchoring, i.e. the extent of embedding in the network of all other NOE assignments. Automated NOE assignment and structure calculation by the CYANA torsion angle dynamics (TAD) algorithm (Guntert et al., 1997) are combined in an iterative process that comprises seven cycles of automated NOE assignment, followed by a final structure calculation using only unambiguously assigned distance constraints. The structural result of a given cycle is used to guide the NOE assignments in the following cycle, with the precision of the structure normally improving with each subsequent cycle. In the first two cycles, “constraint combination” is applied to medium- and long-range distance restraints in order to reduce the impact of erroneous distance restraints on the resulting structure. In all seven cycles, ambiguous distance constraints are used to generate conformational restraints from NOESY cross peaks with multiple possible assignments.
In our experience the NOEASSIGN routine, given the proper input, consistently results in a converged set of initial structures for monomeric proteins. In favorable cases, the result is a well-defined structural ensemble, with backbone RMSD < 1 Å and target function < 5 Å2. Structures can typically be obtained in less than an hour and often under thirty minutes, depending on the size of the protein and available computing power. For example, for a 160-residue protein computation time is 50 minutes using an 8-core Apple Mac Pro or about 15 minutes with a 16 node/32 processor Linux cluster.
An example script for running a CYANA calculation with the automated NOE assignment module can be found on the CYANA homepage (http://www.cyana.org/wiki/index.php/Main_Page). The order of the tolerance parameters used by NOEASSIGN is (1) direct 1H, (2) indirect 1H, and (3) indirect 15N/13C. We typically use values of 0.03 ppm for both 1H dimensions and 0.3 ppm for the 15N and 13C dimensions. By leaving the calibration parameter line blank, these parameters are determined automatically such that the median of the upper distance limits for each peak list equals the value of the reference distance variable dref, which has a default value of 4.0 Å. In each cycle, and in the final structure calculation, 100 conformers are calculated using the standard simulated annealing schedule with 10000 torsion angle dynamics steps per conformer. The 20 conformers with the lowest final target function values are analyzed using a variety of statistics, including overall target function, restraint totals, and ensemble coordinate precision (atomic r.m.s.d). In our protocol, NOEASSIGN is initially run in triplicate with identical input files (sequence file, prot file, aco file, and peak lists), changing only the value of ‘randomseed’ for each run. The results of the three NOEASSIGN calculations are then evaluated. While each final structure should converge to fairly similar RMSD and target function values, as reported in the final.ovw file, it is more important to assess whether all three structural ensembles display the same overall tertiary fold by using a structure visualization program, such as PyMol or Molmol (Koradi et al., 1996). The presence of any significant deviations is indicative of misassigned NOEs, which may reflect inaccuracies in the chemical shift assignment list. These misassignments can have a significant impact on the structure that may go undetected as refinement proceeds; therefore, it is necessary that they be corrected at the outset. Constraint combination applied during the first two NOEASSIGN cycles helps to minimize these potential structural distortions, but it is not a panacea. It is also important to inspect the statistics of the initial structural ensemble using the ‘cyanatable’ script, because a poorly defined structure ensemble from cycle 1 may ultimately converge in later cycles on a precise but inaccurate final ensemble. NOEASSIGN output can be further evaluated based on the following two criteria for a successful calculation: 1) Less than 25% of the long-range NOEs have been discarded by the automated NOESY assignment routine for the final structure calculation and 2) the backbone RMSD to the mean for the structure bundle of cycle 1 is < 3 Å (Guntert, 2004).
When a satisfactory, consistent ensemble has been obtained using NOEASSIGN, it is typically necessary inspect and edit the NOESY peak assignments in XEASY (starting from the peaklists generated in cycle 7), focusing on the largest consistent distance and angle violations listed in the *.ovw file as well as any disordered regions or inconsistent conformations between replicate ensembles. Erroneous or missing chemical shift assignments should also be corrected by looking for spin systems (NOESY strips) that remain largely unassigned. NOEASSIGN calculations may then be repeated to satisfy the two criteria defined above. After correcting and verifying all peak lists and chemical shift assignments, a lack of convergence in cycle 1 could indicate the presence of a homodimeric assembly that was not apparent from earlier analysis of the spectra or hydrodynamic measurements of oligomeric state. After obtaining a satisfactory ensemble of structures by NOEASSIGN, manual structure refinement is performed in order to prepare the ensemble from torsion angle dynamics for a final Cartesian-space refinement calculation in explicit water and subsequent validation steps.
Manual structure refinement proceeds iteratively, using CYANA to calculate an ensemble of structures based on the assigned peak lists (an example script can be found at the CYANA homepage). At the outset of manual refinement, the calibration of NOE intensity/distance conversions should be optimized to decrease both the RMSD and target function. Initial runs are performed with automated calibration in which the parameters for the backbone, sidechain, and methyl calibration functions are calculated automatically by the CYANA macro ‘calibration’. The calculated parameters found in the output file are then used as a starting point for fine-tuning the calibration. This is accomplished by testing a grid of calibration values (at least three for each peaklist; nine calculations altogether) to determine the combination that results in the lowest target function and RMSD. The calibration should also be checked by comparing known interproton distances within secondary structures from well-resolved X-ray structures (Hur and Karplus, 2005) to the corresponding distances in the upper limit distance constraint (.upl) file.
Iterative rounds of CYANA calculations are then performed until no further improvement can be made in the structure. Between each CYANA calculation, the NOESY peak lists are inspected and peak assignments are added, removed, or corrected based on the results of the structure calculations. The consistently violated constraints should be corrected by reassigning crosspeaks, referring to the structure as necessary when there are multiple possibilities. PyMol or Molmol can be used to obtain a list of all atoms within 5.0 Å of an atom of interest. Special attention should be given to verifying the aromatic side chain assignments, since these residues often occupy central locations in the hydrophobic core. In addition to adding and correcting NOE assignments, any consistently violated dihedral angle constraints should be reevaluated or eliminated. Since TALOS constraints rely on chemical shift-based predictions with a finite error rate, a conservative approach preserves only those dihedral angle constraints that correspond to regions of recognizable secondary structure.
Automated NOE structure refinement of protein-ligand complexes using CYANA requires the creation of a molecular topology description for each component of the complex that is not found in the standard CYANA library (e.g. amino acids and nucleotides). The library definition or topology file for a small molecule ligand contains a description of atom types and nomenclature, covalent connectivities, dihedral angle definitions, and standard geometry of the small molecule. The requirement for such topology files in solving the structures of protein-ligand complexes has increased in parallel with structure-based drug design and was initially addressed by the PRODRG server (Schuttelkopf and van Aalten, 2004) (http://davapc1.bioch.dundee.ac.uk/prodrg/). The PRODRG server takes a PDB file or simple text drawings and generates topology files for GROMOS, GROMACS, WHAT_IF, REFMAC5, CNS, O, SHELX, HEX, and MOL2. However, these formats are not suitable for a CYANA topology file. The pdb2reslib module of Olivia (http://fermi.pharm.hokudai.ac.jp/olivia/) addresses this need and allows the user to generate a CYANA topology file from a PDB file. In this section, we describe the process of generating the CYANA topology description for an arbitrary ligand molecule.
Ligand library entries can be generated using a PDB coordinate file as a convenient, widely available description of molecular structure. If a PDB file is not readily available, one can be created by importing a simplified molecular input line entry specification (SMILES) string (Weininger, 1988; Weininger et al., 1989) for the molecule into a graphical chemical editor such as Marvin Sketch (http://www.chemaxon.com/products/marvin/marvinsketch/). The ligand model can then be exported in standard Protein Data Bank format and converted to the CYANA library format using the pdb2reslib module of Olivia. After conversion, the new ligand definition should be inspected and edited to add missing dihedral angles (or remove unnecessary angles), adjust atom nomenclature and numbering, and adjust dihedral angle naming prior to use in CYANA (Figure 4). The edited entry can then be appended to the standard CYANA library file with a unique residue identifier (e.g. DRG). Finally, because molecular dynamics calculations in torsion angle space requires that all components of the system be linked by covalent or virtual bonds in a single chain, the ligand must be incorporated into the protein complex by appending linker residues (e.g. PL, LL, LL2, etc.) and the ligand residue to the sequence (.seq) file, which CYANA will automatically interpret and use to link the ligand to the polypeptide chain.
The structure of a protein-ligand complex is highly valuable in the early stages of structure-based drug design, where lead compounds that interact weakly with their targets may not readily crystallize. In general, the NMR structure of a protein-ligand complex is solved in the same way as the protein itself, as described in the previous sections. The availability of chemical shift and NOE assignments for the ligand-free protein, along with ligand-induced chemical shift perturbation data, simplifies the assignment process for the protein complex. However, structure determination of a protein-ligand complex presents a challenge similar to what we have previously described for NMR structure determination of symmetric protein homodimers (Markley et al., 2009b). As with a protein homodimer, the identification and assignment of intermolecular NOEs between protein and ligand is the principal hurdle, as they are indistinguishable from intramolecular protein NOEs in standard isotope-edited NOESY spectra. Chemical shift degeneracy and their low relative abundance add to the difficulty of detecting NOE contacts that define protein-ligand interface. To overcome these problems, we acquire a 3D F1-13C/15N-filtered, F3-13C-edited NOESY spectrum (Stuart et al., 1999) on a sample of 15N/13C labeled protein saturated with unlabeled ligand (Figure 5). This experiment detects only NOEs arising between protons directly bound to 13C (protein) and protons bound to any non-15N or 13C atom (ligand) while suppressing all other cross-peaks by isotope filtering and editing. Constraints generated from the filtered NOESY spectrum can therefore be unambiguously assigned to protons on the small molecule ligand. Additionally, CYANA 3.0 now allows the user to specify NOEs as intermolecular by setting the ‘color’ value of the peak list (column 5) to 9. If 13C/15N labeled ligand is available, a 3D F1-13C/15N-filtered, F3-13C-edited NOESY spectrum can be collected on a sample containing unlabeled protein and labeled ligand, thereby providing a complementary dataset that further disambiguates atom assignments for intermolecular NOEs. We employed this approach to solve the structures of two protein-peptide complexes, for which [U-15N/13C] peptide ligands were produced using standard methods for expression and labeleing in E. coli (Tyler et al., 2010; Veldkamp et al., 2008).
The final stages of refinement are reached when the desired goals for precision (e.g. backbone RMSD below ~ 0.6 Å), agreement with experimental constraints (none violated by more than 0.5 Å or 5°), and good torsion angle geometry (no residues in disallowed regions of /ψ space) have been achieved. Typically, this requires manual inspection and editing of the NOE assignments that result in consistently violated constraints as we have previously described (Markley et al., 2009b). NMR structures defined by > 15 – 20 non-trivial constraints/residue typically exhibit high coordinate precision, with RMSD values of 0.5 – 0.7 Å for backbone atoms and 0.8 – 1.2 Å for all non-hydrogen atoms. However, it is important to assess the local mobility of the polypeptide backbone using 15N relaxation measurements such as 15N-1H heteronuclear NOE, since flexible regions will often be devoid of long-range NOEs. Consequently, the local NOE constraint density in flexible regions is expected to be low and no extra effort should be made to identify long-range NOE constraints for those residues. Once a consistent set of experimental constraints has been obtained, and the resulting structural ensemble meets the basic acceptance criteria, a final refinement calculation is performed in explicit water with physical force field parameters.
To simplify the water refinement routine, we have developed a script that converts the output of a CYANA calculation into input for water refinement and initiates the water refinement routine as previously described (Markley et al., 2009b). Our protocol utilizes a method for Cartesian space molecular dynamics in explicit water using XPLOR-NIH that was shown previously to improve the stereochemical quality of NMR structural ensembles generated with simple force fields (Linge et al., 2003b). Improvements in a variety of validation criteria were observed upon water refinement of NMR ensembles, particularly for the percentage of residues in the most favored region of the Ramachandran ( vs. ψ angle) plot. While this routine is not fully integrated into CYANA calculations, the user can initiate a water refinement run with a single script once a consistent set of experimental constraints has been achieved. The initial structural model and the model refined using restrained molecular dynamics in explicit water are validated against the distance and dihedral angle constraints using WHAT IF and PROCHECK-NMR. Comparison of the input and refinement statistics shows that the Ramachandran statistics of the TAD ensemble from CYANA and the WHAT IF RMS Z scores are systematically improved by refinement in explicit water. Additional information, protocols and scripts for final structural refinement in explicit water can be obtained from the Northeast Structural Genomics wiki (http://www.nmr2.buffalo.edu/nesg.wiki/Structure_Refinement_Using_CNS_Energy_Minimization_With_Explicit_Water).
Solution NMR is a powerful tool for the characterization and structural determination of protein-ligand complexes. As a result of structural genomics efforts aimed toward automation of the principal bottlenecks in protein structure determination by NMR, structures of proteins and domains under ~ 30 kDa can now be solved in a few weeks. Consequently, NMR spectroscopy is a feasible alternative to X-ray structure determination for drug discovery applications, particularly for flexible proteins or membrane proteins that are not amenable to crystallization. The efficiency of our assignment and structure determination protocol is greatly enhanced by maintaining the essential tools: NMR pulse programs and instrument calibrations, spectral conversion and processing scripts, peak picking scripts, and standardized scripts for running programs such as TALOS+, CYANA, and NIH-XPLOR, all of which are available from the authors upon request. In addition to its capacity to define the structures of protein complexes, NMR remains a robust method to screen and characterize protein-ligand complexes at the atomic level. Each of these applications will be useful as new targets are validated and enter the pipelines for structure-based drug discovery.