The macrodomains are a multifunctional protein family that function as receptors and enzymes acting on poly(ADP-ribose), ADP-ribosylated proteins, and other metabolites of nicotinamide adenine dinucleotide (NAD+). Several new functions for macrodomains, such as nucleic acid binding and protein/protein interaction, have recently been identified in this family. Here, we discuss methods for the identification of new macrodomains in viruses and the prediction of their function. This is followed by the expression and purification of these proteins following overexpression in bacterial cells and confirmation of folding and function using biophysical methods.
In 2012, a new and dangerous human virus was identified in travelers to the Middle East and Asia. The virus spread rapidly with a fatality rate of about 40% [1,2]. The Middle East respiratory syndrome (MERS) virus belongs to the betacoronavirus genus which also hosts the pandemic 2003-2005 severe acute respiratory syndrome (SARS) virus that affected more than 30 countries . These viruses have large, complex positive-sense RNA genomes that encode replicase, structural and accessory proteins. The replicase protein is cleaved by a viral protease to form the fifteen or sixteen nonstructural proteins . The nonstructural proteins (nsps) are responsible for replication of the positive-sense RNA genome, transcription of subgenomic RNAs, and other RNA processing activities. Activities necessary for interference with host cell innate immune responses have been identified [5,6].
The nonstructural protein 3 is a multifunctional protein consisting of several functional domains and approximately 2000 amino acid residues. This protein harbors one papain-like cysteine protease, transmembrane regions, RNA-binding proteins, and one or more macrodomains [7-9]. The macrodomains are of significant interest with respect to poly(ADP-ribose) (PAR) interaction. These domains comprise a multifunctional protein family that act on NAD+ metabolites [10,11]. They act as modulators of posttranslational modifications, including PARylation, mono(ADP-ribosylation), and related modifications. In addition to their known roles in PAR binding, some macrodomains function as enzymes, for example in O-acetyl ADP-ribose deacetylase and mono-ADP-ribose hydrolase reactions . The catalytic domain of poly(ADP-ribose) glycosylase is also a macrodomain . These domains have been termed readers, erasers and interpreters of PARylation and mARylation . Divergent macrodomains have been discovered with new functions, including nucleic acid binding [9,14] and protein/protein interaction , suggesting that this protein family represents a conserved structural scaffold on which much functional variation may occur.
In viruses, the nonstructural protein 3 has several roles. Sequence analysis identified this protein as subject to positive selection during coronavirus evolution and adaptation to new hosts . Other lines of evidence show this protein being involved in inhibition of host innate immune responses [17-19]. The macrodomain of this protein is conserved among coronaviruses, hepatitis virus, alphavirus, and rubella virus [20-22]. This protein is able to dephosphorylate ADP-ribose-1″-phosphate; the importance of this reaction to viral infection is unknown. Some conserved sequence motifs, such as a ‘GGG’ and the ‘SAGIF’ which is involved in the stabilization of the phosphate groups, and ‘NAAN’ motif which employs the second asparagine to form hydrogen bonds with the terminal ribose, are characteristic for this protein family . In some viral macrodomains, the latter asparagine participates in the dephosphorylation reaction . Other functional residues are variable, and more difficult to identify on the basis of sequence . The macrodomain is important to infection through interactions with the host immune system: deletion or mutation of the conserved macrodomain led to a loss of virulence [18,24-26]. In the severe acute respiratory syndrome coronavirus (SARS-CoV), deletion of either the second macrodomain or the domain of unknown function led to a significant loss of viral RNA synthesis, indicating a critical role in the viral replicase-transcriptase complex .
In addition to this conserved macrodomain, many coronaviruses contain one or two additional macrodomains. These proteins lack conserved sequence motifs. Originally identified in the SARS-CoV, a typical structure for this region comprises a conserved macrodomain followed by two divergent macrodomains and a small domain of unknown function . Based on in vitro studies, these proteins were shown to have guanine quadruplex-binding activity. The C-terminal domain was shown to regulate the specificity of RNA binding of the macrodomains . The region containing divergent macrodomains was termed the ‘SARS-unique domain’ based on its lack of sequence identity to other viruses and known proteins .
Here, we present methods for the identification of new macrodomains and associated proteins in viruses and other species, and the prediction of their likely domain boundaries and functions. This is followed by the experimental characterization of these proteins by bacterial overexpression, purification, and biophysical and functional characterization, including binding studies with potential ligands.
All solutions were prepared with ultrapurified water from an ELGA PURELAB Ultra system at a resistivity of 18.2 MΩ·cm at 25 °C. Specified solutions were filtered with Millex® GV syringe filter units (0.22 μm, PVDF, 33 mm diameter). Solutions were prepared and stored at 4 °C unless indicated otherwise or specified by the manufacturer.
Add 8 mL 5 M HCl. Make up to 1 L. Aliquot and store at -20 °C.
Buffers were filtered with a vacuum filtering unit (0.22 μm) and degassed for 5 min in a sonicator bath before use. Purification was performed with an ÄKTA™ purifier system equipped with a Frac-920 fraction collector, P-960 pump, and 50 mL Superloop™ (GE Healthcare). Columns included a 320 mL HiLoad™ 26/600 Superdex™ 200 PG column and a 5 mL HisTrap™ FF affinity column. Procedures were controlled with UNICORN™ 5.31 software (GE Healthcare).
Electrophoretic Mobility Shift Assays (EMSA)
Bioinformatic analysis of macrodomains uses several programs for sequence alignment and secondary structure prediction. Beginning with an unknown protein or nucleotide sequence, similar sequences may be identified using a BLAST search. If the search does not yield many hits, an iterative method such as PSI-BLAST or PHI-BLAST may be employed. Building on these results, a fold prediction algorithm may also be used to detect remote sequence similarities. FFAS (Fold and Function Assignment) is one such algorithm: it performs protein sequence profile alignment to detect low levels of sequence similarity. All sequences used in an FFAS search must contain fewer than 1,000 amino acids. Options are available for multiple sequences or for a pairwise sequence alignment. Several different databases may be searched, for example the Protein Data Bank (PDB), Structural Classification of Proteins (SCOP), Protein Families (Pfam), Classification Of proteins in complete Genomes (COG), and structural genomics sequence collections, as well as complete genomes of individual organisms. Search results include a score that is derived from the FFAS algorithm, and the matching profile alignments are ranked with the highest-confidence matches at the top. An FFAS score of -9.5 or lower is the criterion for a protein to match a known fold.
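The length limit and score cutoff described above can be applied in a script when many hits need to be triaged. The following is a minimal sketch only: the function names and the `(template_id, score)` tuple layout are hypothetical and do not reflect the actual output format of the FFAS server.

```python
FFAS_MAX_LENGTH = 1000    # queries must contain fewer residues than this
FFAS_SCORE_CUTOFF = -9.5  # scores at or below this value suggest a matching fold


def is_valid_ffas_query(sequence):
    """Return True if the sequence is non-empty and short enough for FFAS."""
    return 0 < len(sequence) < FFAS_MAX_LENGTH


def significant_hits(hits, cutoff=FFAS_SCORE_CUTOFF):
    """Keep hits scoring at or below the cutoff, most negative score first.

    `hits` is a list of (template_id, score) tuples; this layout is an
    assumption made for illustration.
    """
    kept = [hit for hit in hits if hit[1] <= cutoff]
    return sorted(kept, key=lambda hit: hit[1])
```

More negative scores indicate higher confidence, so the sort places the strongest candidate templates first.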
Jpred  provides secondary structure predictions about the protein sequence. Initially, Jpred uses PSI-BLAST to create multiple alignments, and the JNet neural network algorithm  is then used to make secondary structure predictions based on the multiple sequence alignment combined with predicted solvent accessibility. A one-letter sequence is used or a file with multiple sequences can be uploaded (plain text, FASTA, MSF, BLC, or Batch). The size limit per sequence is 800 residues in plain text format. The results include the secondary structure prediction confidence for each residue. Secondary structure components are marked with the letter H for α-helix regions, E for β strands and the letter B for buried residues. Each residue will also have a reliability score from 0 to 9, in which residues with a high score are predicted with higher confidence than those with lower scores . Figure 1 shows an example secondary structure prediction for the putative conserved macrodomain in the bat coronavirus strain HKU4.
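A per-residue prediction string can be collapsed into an ordered list of helices and strands, which makes it easier to compare a predicted topology against known macrodomain folds. The sketch below assumes the prediction is available as a plain string in which `H` marks helix, `E` marks strand, and any other character marks coil. The `min_length` filter is an illustrative assumption, not a Jpred parameter.

```python
from itertools import groupby


def secondary_structure_segments(prediction, min_length=3):
    """Return (state, start, end) for each helix or strand run.

    Positions are 1-based and inclusive. Runs shorter than `min_length`
    are ignored (an assumed filter to suppress isolated predictions).
    """
    segments = []
    position = 1
    for state, run in groupby(prediction):
        length = len(list(run))
        if state in ("H", "E") and length >= min_length:
            segments.append((state, position, position + length - 1))
        position += length
    return segments


def topology(segments):
    """Summarize segments as a string of α (helix) and β (strand) symbols."""
    return "".join("α" if state == "H" else "β" for state, _, _ in segments)
```

For example, the toy string `"--EEEEE---HHHHHHHH--EEEE-HH-"` yields three segments and the topology `βαβ`; the final two-residue helix is dropped by the length filter.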
The likely locations of domain boundaries in proteins may be predicted by combining the analyses of protein sequences with these and other programs [55-57]. First, based on sequence alignments and conserved protein sequence motifs, putative macrodomains may be identified. Macrodomains share a globular structure of β-sheets flanked by α-helices. While several human, bacterial, and viral macrodomains have been found to share a βαβααββαβαβ architecture, many proteins in the macrodomain family contain additional α-helices and/or β-sheets in their folds . The arrangement of these predicted secondary structures assists in identifying this protein family. Slight variations, such as the addition of a single α-helix at the beginning or end of the protein may occur [54,59]. For example, in the SARS-CoV, the SUD-N domain has a βαβαββαβαβ fold whereas the SUD-M domain has a βαβαββαβαβα fold . In addition, conserved protein sequence motifs are associated with the macrodomain family. For example, the macrodomain ADP-ribose hydrolases employ the backbones of certain residues such as ‘GGGV’ and ‘GIYG’ to coordinate with the phosphate groups of ADP-ribose . The MacroD1-like proteins employ a conserved aspartate and histidine as part of a catalytic triad to deprotonate a water molecule which can act as a nucleophile to attack the carbonyl carbon of an ADP-ribosylated protein [61,62]. Moreover, additional motifs are involved in the binding of ADP-ribose: the ‘NAAN’ motif stabilizes the distal ribose through the sidechains of the second asparagine, the ‘SAGIF’ stabilizes the two phosphate groups, and in the ‘DAIQ’ motif, the carbonyl group of the aspartate forms hydrogen bonds with the adenine moiety of ADP-ribose in the binding pocket [23,59,63]. 
PAR glycosylases also contain a conserved ‘GGG’ motif used to coordinate ADP-ribose but employ a ‘QEE’ sequence where the two glutamate residues and a conserved N-terminal aspartate are responsible for trapping a water molecule for a nucleophilic attack on an ADP-ribose substrate for PAR cleavage [64,65].
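A first-pass scan for the sequence motifs named above can be done with simple string matching. This is a rough sketch under the assumption that exact matches are sufficient; real macrodomain motifs are degenerate, so an alignment-based approach is needed for divergent sequences. The function name and the default motif list are illustrative choices.

```python
import re

# Motifs named in the text; exact-match scanning is a simplification.
DEFAULT_MOTIFS = ("GGG", "NAAN", "SAGIF", "DAIQ", "GIYG", "QEE")


def find_motifs(sequence, motifs=DEFAULT_MOTIFS):
    """Return {motif: [1-based start positions]} for exact motif matches.

    A lookahead pattern is used so that overlapping matches
    (for example 'GGG' within 'GGGG') are all reported.
    """
    found = {}
    for motif in motifs:
        pattern = "(?=" + re.escape(motif) + ")"
        found[motif] = [m.start() + 1 for m in re.finditer(pattern, sequence)]
    return found
```

Positions are reported with 1-based numbering to match the residue numbering convention used for proteins.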
Putative macrodomains that lack conserved motifs may be identified by a fold prediction algorithm such as FFAS. This algorithm identifies remote fold similarities and, together with secondary structure prediction, may be applied to identify domain boundaries. Predicted secondary structure patterns from Jpred, PSIPRED, and other programs can identify disordered loop regions that often correspond to a domain boundary. To ensure that only the domain of interest is chosen and that disordered tails are not introduced, only a few residues beyond the predicted domain are incorporated into the construct. Typically, hydrophilic residues such as serine or aspartic acid are chosen as the endpoints of a protein domain. For many biochemical and biophysical studies, it is also helpful to reduce the number of cysteines and prolines if these residues are near a domain boundary and are not conserved or necessary for the protein's activity. Cysteines oxidize to form disulfide bridges that can lead to protein oligomerization, while prolines introduce the possibility of conformational isomerization about the peptide bond. If cysteines are present in the sequence, reducing agents are introduced during purification. Prolines lack a backbone amide proton and are therefore not detectable in several common NMR experiments. At the N-terminus, destabilizing amino acids are avoided. Disordered regions, domains, or proteins may also be predicted with software specifically designed to identify them, such as DISOPRED, SEG, or DisEMBL™.
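The boundary heuristics above (extend a few residues past the predicted secondary structure, prefer a hydrophilic endpoint) can be sketched as a small helper. Everything here is an illustrative assumption rather than a published procedure: the hydrophilic residue set, the `max_pad` window, and the rule of taking the outermost hydrophilic residue within that window.

```python
HYDROPHILIC = set("STDENQKRH")  # assumed set of acceptable endpoint residues


def _first_hydrophilic(sequence, candidates, fallback):
    """Return the first 1-based candidate position holding a hydrophilic residue."""
    for position in candidates:
        if sequence[position - 1] in HYDROPHILIC:
            return position
    return fallback


def choose_boundaries(sequence, first_ss, last_ss, max_pad=5):
    """Pick construct endpoints just outside the predicted structured core.

    `first_ss` and `last_ss` are the 1-based positions of the first and last
    residues in predicted secondary structure. Each side is scanned from the
    edge of the allowed window inward, and the first hydrophilic residue is
    used; if none is found, the edge of the window is used.
    """
    low = max(1, first_ss - max_pad)
    high = min(len(sequence), last_ss + max_pad)
    start = _first_hydrophilic(sequence, range(low, first_ss), low)
    end = _first_hydrophilic(sequence, range(high, last_ss, -1), high)
    return start, end
```

A proposed boundary should still be checked by eye against the alignment, since conserved or functionally required residues near the termini may justify a longer construct.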
Bioinformatics analysis was applied to identify a potential new member of the macrodomain family in the bat coronavirus strain HKU4, which is phylogenetically close to the human MERS virus . The sequence of the bat nonstructural protein 3, a large, multidomain protein, was analyzed by BLAST and FFAS. The sequence similarity to other proteins was searched by FFAS against the PDB database. As shown in Figure 1, the sequence had strong similarity to known viral macrodomains. Secondary structure predictions from Jpred further showed a predicted secondary structure pattern consistent with that of a macrodomain. Multiple alignments identified conserved residues present in other macrodomains, such as the ‘DA’ following the first β-sheet, the ‘GGGI’ motif on the N-terminal end of α-helix 2, the ‘NAAN’ that has the catalytically important second asparagine following β-sheet 2, and the ‘SAGIF’ region N-terminal to α-helix 5. This was also supported by functional predictions, described in the next section. Using this information, we hypothesized that the bat coronavirus HKU4 nonstructural protein 3 (nsp3) contains a conserved macrodomain. To isolate this domain and study the protein further, we predicted domain boundaries. These boundaries are shown in Figure 1. These boundaries were chosen a few residues outside the predicted secondary structures to maintain the predicted globular fold.
Sequence-based or structure-based function prediction aims to generate a functional hypothesis about a protein before experimental data has been collected . Online servers can be employed to predict protein structure and function, such as I-TASSER  and BindN . I-TASSER is a program created by Yang Zhang's group in 2008 to predict 3D protein structure from a given primary sequence. This prediction is useful when the protein structure is not known. I-TASSER will output a .pdb file that can be viewed via Pymol and Chimera , or similar molecular graphics programs. To start, all sequences must be in FASTA format and have a length starting at 10 residues and not exceeding 1500. Each submission will generate an ID that can be used to track the run's progress. However, only one submission per IP address is allowed. I-TASSER generates a large ensemble of models and uses a clustering program, SPICKER , to select the most significant models based on pairwise structure similarities.
I-TASSER results are fed into the protein function database, BioLiP , and use a function prediction multi-server program, COACH , to further classify proteins. BioLiP is a database of protein-ligand interactions that has compiled structural information from the PDB database and data from published work. This database is used to provide a list of residues involved in potential ligand-binding sites that are responsible for catalytic activity. For uncharacterized proteins, the COACH algorithm was developed to predict ligand-binding sites and carry out functional analysis for I-TASSER. COACH analyzes sequences by incorporating comparative programs and ligand binding site prediction servers such as COFACTOR , FINDSITE , and ConCavity . These programs work in a complementary fashion to predict ligand-binding sites for a queried protein and potential ligands that may interact with those given residues. A complete list of structural data from I-TASSER and functional data from COACH is provided in the output page provided in a link sent via email. COACH provides a list of the top 5 potential ligands and their binding residues based on their matches in the BioLiP database. Enzymatic profiles are ranked by the COACH algorithm based on the structural similarity of the queried protein to template proteins.
COFACTOR can also be used independently for novel protein sequences to produce information based on ligand binding sites and predicted functions . COFACTOR uses both primary sequences in FASTA format and .pdb files as input for binding site and binding ligand predictions. Results from a COFACTOR prediction are focused on structural homology and three protein function types. First, a structural analysis against PDB structures is conducted, where the top 10 best alignments are provided in the output. Next, protein functions are from three independent libraries: ligand binding prediction confidence (C), enzyme classification (EC), and gene ontology (GO). These identifiers are scored based on the confidence of their prediction . These identifiers are linked to specific binding sites and protein functions from their respective databases. A description and related proteins with similar function are associated with each identifier.
Other online servers can be used for protein function prediction, such as ProFunc, SIFTER, and RaptorX. ProFunc uses PDB input files or identifiers to predict likely biochemical function . The server will run a PDBsum analysis , a server that provides a brief structural overview, for every ProFunc submission. ProFunc uses sequence and structural batch processes to compare against known protein structures . ProFunc also uses 3D template searches of known protein ligand interactions to predict function . Once ProFunc is complete, a color-coded output is produced. The output provides predictions on conserved motifs and folds, enzymatic activity, ligand binding site, and binding pocket similarities. Next, SIFTER (Statistical Inference of Function Through Evolutionary Relationships) is a server that makes predictions based on statistical methods . Only a FASTA sequence is required for this tool. SIFTER will compile a list of proteins that match the queried sequence and organize the related protein function through a confidence score. The list of significant protein function matches is classified by a GO identifier which offers further explanation for each shared function. Lastly, RaptorX, a protein threading program, uses protein sequences to predict structure, binding sites, residue contact maps, solvent accessibility per residue, and structure properties through a multistep threading approach . RaptorX constructs a structure template list from known proteins and optimizes the energy function to determine the quality of each predicted structure . Using pairwise interaction preferences, RaptorX aligns the input sequence to the backbone of templates to output a final predicted structure . The structural prediction method is employed by computational learning models that determine solvent accessibility, structured and disordered regions or the entire queried sequence and predicts enzymatic activity and ligands that may bind to the protein sequence . 
To use RaptorX, only a name, email address, and FASTA sequence are needed. Results from the structural prediction provide a summary of secondary structure, solvent-accessible residues, and disordered regions. Results from the binding tool provide a list of ligands that are predicted to bind to the protein. Furthermore, the predicted contacts tool provides a contact map, the top five predicted models in .pdb format, and an interactive model of the protein via a Jmol viewer.
The computational approaches described here were used to analyze the predicted domain in the bat CoV HKU4. Structural analysis predicted that the domain should be significantly similar to known macrodomains, such as the conserved macrodomains from the SARS-CoV [23,59] and Middle East respiratory syndrome coronavirus (MERS-CoV) nonstructural protein 3. These results are shown in Table 3. The prediction of this protein as a macrodomain is consistent across the I-TASSER, ProFunc, and RaptorX analyses. Consensus results from the COACH, SIFTER, and ProFunc servers predicted functions including phosphate metabolic processes, hydrolase activity, and nucleic acid recognition, which suggests that this domain is involved in ADP-ribose and/or nucleic acid recognition. BioLiP, COACH, and RaptorX provide consistent predictions that ADP-ribose is likely to be a ligand for this domain. These servers also predict that the conserved aspartate (D18), the second asparagine (N36) in the ‘NAAN’ region, the residues G42-G43-G44, the S124-A125-G126-I128-F129 region, and the C-terminal asparagine (N153) are involved in the recognition of ADP-ribose. These motifs are also seen in other ADP-ribose binding macrodomains, in particular those of SARS-CoV and MERS-CoV [23,59,76].
BindN is a server that predicts DNA- and RNA-binding regions of a protein. This program was created by Liangjiang Wang and Susan Brown in 2006 . Neural networks were trained to incorporate queried sequence information and determine residues that were solvent-accessible to provide reasonable DNA/RNA binding predictions. To increase the predictive capabilities of BindN, evolutionary data from a position-specific scoring matrix (PSSM) derived from a PSI-Blast search was introduced . The most recent addition of support vector machines (SVMs), a set of learning algorithms, further increased the server's sensitivity performance . BindN only requires a primary sequence in FASTA format to run. Sensitivity and specificity parameters can be adjusted to increase the number of hits or reduce the number of false positives. The output will provide an assessment of binding propensity to DNA or RNA. A confidence value range from 0 (lowest) to 9 (highest) and prediction marker for binding residues (+) and non-binding residues (-) are indicated directly below each amino acid in the sequence. This program is useful in analyzing novel macrodomain sequences with potential nucleic acid-binding function [9,78].
Protein interactions with binding partners such as other proteins, peptides, nucleotides, and small molecules are the central focus of functional studies . Computational docking has proven to be a valuable tool in the study of binding events, with the help of structural information from experiments and bioinformatics. There are two types of docking procedures in general: without any restraints (ab-initio) and with restraints . Restraints typically comprise experimentally derived information. Here we describe three different programs that were found to be useful in our laboratory, for different applications.
First, ZDOCK  is an ab initio docking program authored and maintained by Zhiping Weng and colleagues. The program searches all possible binding modes in the translational and rotational space between two molecules and evaluates each pose using an energy-based scoring function. The scoring function comprises interface atomic contact energy (IFACE) statistical potential, shape complementarity, and electrostatics .
Second, HADDOCK [84,85] (High Ambiguity Driven protein-protein DOCKing) differs from ab initio docking methods such as ZDOCK because it uses ambiguous interaction restraints (AIRs) to drive the docking process. AIRs can be obtained from information about the proteins, such as NMR chemical shift perturbation data, mutagenesis data, or any other information regarding the interaction interface. HADDOCK requires a Linux or Unix environment with CNS (Crystallography and NMR System) for generating experimental restraints. A web server provides an interface with different levels of control over the docking process, termed Easy, Expert and Guru. The Easy interface is freely accessible to non-profit users, while Expert and Guru require upgraded access granted by the program administrators.
Third, Glide (Grid-based Ligand Docking with Energetics) [87-89] is a high-throughput docking program that has been developed and updated by Schrödinger, LLC and is now incorporated into the Maestro suite with other programs required for small molecule-protein interaction studies. It is a conformational search-based program that employs a grid approximation of the nonbonded ligand-receptor interaction energy. A post-docking minimization is performed to return 10 poses per ligand based on their energy score, distance and conformation. Glide provides extra precision (XP), standard precision (SP) and virtual high-throughput screening modes to balance speed against accuracy.
The docking techniques described above can be used to recreate known binding interactions and predict new ones. For instance, Figure 2a shows an example of a Glide re-docking result using the crystal structure of the SARS-CoV macrodomain and its ligand ADP-ribose (PDB ID: 2FAV). In this example, all water molecules, including interacting ones, were removed, and the protein was minimized with the ligand bound during protein preparation. The interacting amino acid residues (D23, I24, N41, G47, G49, V50, S129, G131, I132, F133, N157) match experimentally determined ADP-ribose-binding motifs. In Figure 2a, the D23 and I24 residues are shown to stabilize the adenine ring, the second asparagine (N41) from the ‘NAAN’ motif coordinates the distal ribose, and both the ‘GGG’ (G47-G48-G49) and the ‘SAGIF’ (S129-G131-I132-F133) motifs interact with the two phosphate groups of ADP-ribose. The predicted binding mode of ADP-ribose closely matches the experimentally determined X-ray crystallographic position, with a root-mean-square deviation (RMSD) of heavy-atom positions of 0.74 Å (Fig. 2a).
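The heavy-atom RMSD used to judge a re-docking result can be computed directly from matched coordinate lists. The sketch below assumes that the docked and crystallographic ligand coordinates are already in the same receptor frame and listed in the same atom order, so no superposition is performed. The function name and input layout are illustrative and are not part of Glide.

```python
import math


def heavy_atom_rmsd(coords_a, coords_b):
    """RMSD in the input units between two matched lists of (x, y, z) tuples.

    Atoms must be in the same order in both lists, and hydrogens should be
    excluded beforehand. No alignment is applied.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```

Skipping the superposition step is deliberate for re-docking, because fitting the two poses onto each other would hide any rigid-body displacement of the ligand within the binding pocket.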
In Figure 2b, ZDOCK was used to investigate a macrodomain-nucleic acid interaction. The SARS-unique domain M is known as a G-quadruplex binding protein , but the exact binding model is still unknown. Figure 2b shows the ZDOCK-predicted complex between the SUD-M domain (PDB: 2JZE)  and I14-Tel23, an antiparallel basket G-quadruplex from human telomeric DNA (PDB: 2KKA) . The resulting binding model shows predicted specific interactions between protein and nucleic acid in the conserved surface cavity of the macrodomain. These interacting residues, N532, L533 (at the back side, not shown), I556, and V611, correlate closely with previous studies by NMR that showed significant chemical shift changes upon binding of G-quadruplex oligonucleotides [9,90]. The predicted binding mode includes hydrogen bonds to the G-quadruplex phosphate backbone, and hydrophobic interactions between amino acid sidechains and nucleobases (Fig. 2b).
Figure 1 summarizes functional predictions for the HKU4 putative macrodomain. FFAS and Jpred predictions, used to identify domain boundaries and produce a construct for study, are shown. Functionally important residues identified by the functional prediction programs described above are marked with stars. In addition, we identify residues in the bat HKU4 macrodomain that are conserved with the residues that interact with ADP-ribose in the SARS-CoV macrodomain, based on the docking run shown in Figure 2a and described above. These residues make key interactions with the ADP-ribose ligand, and most of them are conserved, as shown in Figure 1. For example, the aspartate (D18) after the first β-strand, the second asparagine after the second β-strand (N36), the G42-G43-G44 loop before the second α-helix, the S124-A125-G126-I128-F129 region after the fifth β-strand, and the C-terminal asparagine (N153) are all predicted to be involved in the binding of ADP-ribose. Structurally, these residues align with conserved ADP-ribose binding regions of both the SARS-CoV and MERS-CoV macrodomains. For example, D18 in the HKU4 macrodomain aligns with D23 in the SARS-CoV macrodomain. The ‘GGG’ motifs in both macrodomains, G42-G43-G44 in HKU4 and G47-G48-G49 in SARS-CoV, align identically, further supporting the idea that the function of these proteins is similar.
Here, we describe methods for the bacterial overexpression and purification of viral macrodomain proteins.
The protocol below is a purification description for predicted macrodomain proteins of the bat coronavirus HKU4.
The protocol described is based on affinity purification by nickel affinity chromatography followed by cleavage with tobacco etch virus protease, a second nickel affinity chromatography step and a final size-exclusion chromatography step. In this system we make use of the 6×His tag that was expressed N-terminal to the protein which binds with high affinity to Ni2+. The affinity difference of the expressed protein versus cellular proteins allows for a highly selective separation. Another advantage of this purification method is its speed. Using the ÄKTA™ purification system, we can quickly remove a large amount of impurities with ease. Below is a step-by-step description of nickel affinity purification using the ÄKTA™ purifier system. Each step was conducted at 4 °C.
Using tobacco etch virus protease, the 6×His tag is cleaved from the protein, decreasing its affinity to the Ni2+ column. The imidazole is dialyzed out of the protein sample prior to overnight cleavage to optimize protease activity.
SEC is the final step in this purification scheme. This chromatographic approach is a popular finishing step due to its ability to separate protein samples based on their molecular weight. Inactive protein aggregates are separated and removed .
As a final step, purification of the recombinant, overexpressed HKU4 domain was completed by size-exclusion chromatography. Because SEC separates proteins by molecular size, the elution volume of a protein can be calibrated against standards to estimate its molecular weight (smaller proteins have a longer path and elute later, while larger proteins have a shorter path and elute earlier). The estimated elution volume of an 18 kDa protein is 245 mL based on molecular weight standards, and the resolved peak at 250 mL is the purified protein. As shown in Figure 3, SDS-PAGE analysis confirmed the molecular weight of the protein and showed that the sample did not contain detectable impurities. The secondary peak at 330 mL is indicative of low-MW impurities that were present in the protein sample prior to SEC; these were too low in concentration to appear on the gel.
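Calibration of a size-exclusion column is commonly done by fitting elution volume against the logarithm of molecular weight for a set of standards. The following is a minimal sketch of that fit, assuming a linear relationship over the working range of the column. The function names are hypothetical, and any standards passed in must come from the user's own calibration run; none are taken from this protocol.

```python
import math


def fit_sec_calibration(standards):
    """Least-squares fit of elution volume against log10(molecular weight).

    `standards` is a list of (mw_kda, elution_volume_ml) pairs.
    Returns (slope, intercept) for: volume = slope * log10(mw) + intercept.
    """
    xs = [math.log10(mw) for mw, _ in standards]
    ys = [volume for _, volume in standards]
    n = len(standards)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_mw(volume_ml, slope, intercept):
    """Invert the calibration line to estimate molecular weight in kDa."""
    return 10 ** ((volume_ml - intercept) / slope)
```

The slope is negative for a well-behaved column, since larger proteins elute earlier. Estimates outside the range spanned by the standards should be treated with caution.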
Nuclear magnetic resonance (NMR) spectroscopy takes advantage of the characteristic nuclear spin angular momentum of atomic nuclei to probe the behavior of molecules through interactions with a radiofrequency field. For biomolecular studies, the nuclei of greatest interest are 1H, 13C, 15N, 19F, and 31P . Each nucleus resonates at a characteristic frequency that changes when introduced to different local microenvironments. The observed changes provide valuable information about the molecular structure and conformation of a given sample . For example, one can use the information collected from 1H, 15N, 13C and 2H-based experiments to gain information on high-resolution structure, protein stability, dynamics, folding pathways, enzyme mechanisms, and protein complex assembly, in both the solution and solid state [96-100]. The methods described below provide guidance to conducting and analyzing NMR techniques for protein-ligand interactions by saturation transfer difference (STD) NMR and by chemical shift mapping.
The STD-NMR experiment is a commonly used ligand screening technique for rapid analysis of protein-ligand interactions. This method is based on the transfer of magnetization from protein to ligand. Saturation is achieved by applying a low-power radiofrequency pulse to the protein, and it spreads through the entire protein by intramolecular 1H-1H nuclear Overhauser effect transfer. This saturation also transfers to other protons located close to the protein in a binding pocket. Thus, ligands that bind to a selectively irradiated protein also become saturated. Experimentally, this process is useful because the saturated ligand dissociates into solution, where its protons relax more slowly than in the bound state. The saturation may be observed in a difference spectrum, in which the spectrum with on-resonance saturation is subtracted from the spectrum with off-resonance saturation. The difference spectrum contains only signals from the ligand involved in the binding event. Consequently, this method has become widely used to identify compounds that bind to proteins of interest, and to carry out epitope mapping of binding compounds.
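The subtraction that produces the difference spectrum can be illustrated with a few lines of code. This sketch assumes that both spectra have already been processed identically and exported as equal-length lists of intensities; the function names are illustrative and do not correspond to any Bruker or TopSpin command.

```python
def std_difference(off_resonance, on_resonance):
    """Point-by-point difference spectrum: off-resonance minus on-resonance."""
    if len(off_resonance) != len(on_resonance):
        raise ValueError("spectra must have the same number of points")
    return [off - on for off, on in zip(off_resonance, on_resonance)]


def fractional_std(off_intensity, on_intensity):
    """Fractional STD effect for one ligand signal: (I0 - Isat) / I0.

    `off_intensity` (I0) is the peak intensity in the off-resonance spectrum
    and `on_intensity` (Isat) is the intensity with on-resonance saturation.
    """
    return (off_intensity - on_intensity) / off_intensity
```

Comparing the fractional STD effect across the protons of a single ligand is the basis of epitope mapping, because protons in closer contact with the protein receive more saturation.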
Notably, there are a few drawbacks. If a ligand binds too weakly, it will not stay bound long enough to be saturated sufficiently; conversely, if it binds too tightly, the saturated ligand will relax too quickly to be detected. For a good STD-NMR experiment, the ligand should have a dissociation constant, KD, in the range of approximately 10⁻³ to 10⁻⁸ M. Within this range, the method is robust and detects protein-ligand interactions with ease.
The method below describes an STD-NMR analysis with a putative macrodomain from the bat HKU4 CoV and ADP-ribose, using Bruker instrumentation.
The STD-NMR difference spectrum will yield only signals from ligand hydrogens that are involved in interactions with the protein. Because each ligand 1H has a distinct chemical shift, the signals that appear reveal how the ligand binds to the receptor.
The protocol discussed in this section was used to detect ligand binding to the HKU4 putative macrodomain by showing enhancement of the NMR signals from ADP-ribose. Further analysis using this protocol can be used to map the ligand in the binding pocket and determine a dissociation constant (KD). The KD of the HKU4 macrodomain is expected to be similar to those of other viral macrodomains, such as that of the SARS-CoV (~24 μM).
The local chemical microenvironments of a protein are sensitive to changes in solvent and molecular interactions. This phenomenon is useful when mapping structural changes and classifying binding interactions of a protein-ligand complex. The method described here measures the chemical shifts of a protein, via a [15N,1H]-HSQC spectrum, at different ligand concentrations. Based on the relative movement of peaks at each titration step, a map of the binding pocket can be determined.
The control spectrum is the baseline spectrum in the absence of ligand, used to determine the chemical shifts of the apoprotein. Once ligand is added to the protein sample, changes in the spectrum due to ligand binding may be observed. In a [15N,1H]-HSQC, each signal results from a covalently bonded 15N,1H pair. With a few exceptions, each correlation peak represents one amino acid residue in the protein sequence, and the peaks are used to map the binding pocket of the protein. However, for unambiguous characterization of the binding pocket, assignment of the backbone is necessary.
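The per-residue peak movement between the control and ligand-bound spectra is often summarized as a single combined value. The sketch below assumes a widely used weighting, Δδ = sqrt(ΔδH² + (α·ΔδN)²) with α = 0.14, and a cutoff of one standard deviation above the mean. Both choices are assumptions for illustration, not values taken from this protocol, and the residue numbers and shifts are invented.

```python
import math
import statistics


def combined_csp(delta_h, delta_n, alpha=0.14):
    """Combined 1H/15N shift change in ppm; alpha down-weights the 15N axis."""
    return math.sqrt(delta_h ** 2 + (alpha * delta_n) ** 2)


def perturbed_residues(free, bound, alpha=0.14):
    """Return per-residue CSPs, the cutoff, and residues above the cutoff.

    free and bound map residue number -> (1H shift, 15N shift) in ppm.
    The cutoff (mean + 1 population SD) is an assumed convention.
    """
    csp = {}
    for res in free:
        if res in bound:
            d_h = bound[res][0] - free[res][0]
            d_n = bound[res][1] - free[res][1]
            csp[res] = combined_csp(d_h, d_n, alpha)
    values = list(csp.values())
    cutoff = statistics.mean(values) + statistics.pstdev(values)
    hits = sorted(res for res, value in csp.items() if value > cutoff)
    return csp, cutoff, hits


# Hypothetical assigned peaks: residue -> (1H ppm, 15N ppm).
free = {10: (8.20, 120.0), 11: (7.90, 118.5), 12: (8.50, 122.0),
        13: (8.10, 115.0), 14: (7.60, 110.0)}
bound = {10: (8.21, 120.0), 11: (7.90, 118.6), 12: (8.70, 123.0),
         13: (8.11, 115.1), 14: (7.60, 110.0)}

csp, cutoff, hits = perturbed_residues(free, bound)
```

Residues returned in `hits` are candidates for the binding pocket. Without backbone assignments, the same calculation can still be run on unassigned peak labels to judge whether binding occurs at all.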
Perturbation of chemical shifts occurs under three general conditions: fast, slow and intermediate exchange rates. Fast exchange indicates that the off-rate (koff) for protein-ligand interaction is much greater than the frequency difference (chemical shift difference) in Hz between the bound and free states of a given peak. This condition is often seen for weakly binding ligands, with dissociation constants (KD) in the high μM to mM range. In a titration with increasing ligand amount, peaks in the HSQC spectrum are observed to gradually shift from their original position toward the chemical shift characteristic of the bound state. In fast exchange conditions, the observed chemical shift is a population-weighted average of the shifts in the free and bound states, so KD may be determined by fitting the shift change as a function of ligand concentration. Slow exchange occurs when the off-rate is much less than the chemical shift difference between these two states, a condition that is often satisfied for ligands that bind with sub-micromolar affinity. In this condition, two separate signals for the bound and free states of a given peak are observed. Intermediate exchange is observed when the off-rate is comparable to the chemical shift difference and is characterized by peak broadening together with chemical shift changes [96,104].
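For the fast-exchange case, the fitting procedure can be illustrated with a one-site binding model in which the observed shift change equals the bound fraction of protein multiplied by the maximal shift change. The sketch below is one simple way to do this and is not the fitting routine used by the authors: it scans KD on a logarithmic grid and solves for the maximal shift change by linear least squares at each grid point. All names, concentrations, and shift values are hypothetical, and the titration data are synthetic.

```python
import math


def fraction_bound(p_total, l_total, kd):
    """Bound fraction of protein for 1:1 binding (all concentrations in M)."""
    s = p_total + l_total + kd
    return (s - math.sqrt(s * s - 4.0 * p_total * l_total)) / (2.0 * p_total)


def fit_kd(p_total, ligand_concs, observed_shifts,
           kd_min=1e-7, kd_max=1e-1, steps=2000):
    """Grid search over KD; the maximal shift is solved in closed form."""
    best = None
    for i in range(steps + 1):
        kd = kd_min * (kd_max / kd_min) ** (i / steps)
        f = [fraction_bound(p_total, lig, kd) for lig in ligand_concs]
        denom = sum(x * x for x in f)
        if denom == 0.0:
            continue
        # Least-squares maximal shift for this trial KD.
        d_max = sum(x * y for x, y in zip(f, observed_shifts)) / denom
        sse = sum((d_max * x - y) ** 2 for x, y in zip(f, observed_shifts))
        if best is None or sse < best[0]:
            best = (sse, kd, d_max)
    return best[1], best[2]


# Synthetic, noise-free titration: 100 uM protein, KD = 500 uM, 0.25 ppm maximum.
protein = 1e-4
ligand = [0.0, 1e-4, 2.5e-4, 5e-4, 1e-3, 2e-3, 5e-3]
shifts = [0.25 * fraction_bound(protein, lig, 5e-4) for lig in ligand]

kd_fit, d_max_fit = fit_kd(protein, ligand, shifts)
```

With real data, a nonlinear least-squares routine and a global fit over several shifting peaks would usually be preferred. The grid search is used here only to keep the example free of external dependencies.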
The chemical shift perturbation (CSP) method is useful in screening potential ligands for novel proteins. Testing multiple ligands and mapping their interactions with the protein can help shed light on proteins with unknown function. In particular, predicted macrodomains have characteristic functions that can aid in their classification, such as ADP-ribose recognition. Furthermore, CSP aids in the analysis of protein-ligand binding affinity and offers structural information characteristic of a protein-ligand interface.
The steps in this protocol describe the electrophoretic mobility shift assay (EMSA), a biochemical assay that is used to identify protein-nucleic acid complexes. The protocol can be applied to a variety of protein and ligand samples under native conditions.
This method is useful when determining if the protein is a receptor for a given oligonucleotide. If so, the protein-ligand complex will appear as discrete bands at higher molecular weight, whereas free oligonucleotides will travel farther through the gel due to their lower molecular weight. The increased molecular weight of the complex prevents it from sieving through the gel as fast as the free oligonucleotide. The gels are stained with protein stain to confirm the colocalization of the oligonucleotide and protein.
EMSA has been used to identify nucleic acid binding partners for macrodomains. Assays based on varying DNA and RNA sequences were used to determine sequence specificity preferences and estimate binding affinities of novel viral macrodomains. Analysis of structured oligonucleotides was conducted to compare size and sequence preferences of the HKU4 macrodomain. Preliminary studies suggest that nucleic acid binding may be affected by sequences surrounding the core domain (data not shown).
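As an illustration of how band intensities from such a gel might be turned into an affinity estimate, the sketch below computes the fraction of oligonucleotide shifted in each lane and interpolates the protein concentration at half saturation. This is a rough approach that assumes the oligonucleotide concentration is well below KD, so that free protein approximates total protein. The densitometry values and concentrations are invented, and the function names are hypothetical.

```python
def fraction_shifted(bound_intensity, free_intensity):
    """Fraction of oligonucleotide in the shifted (complex) band of one lane."""
    total = bound_intensity + free_intensity
    if total <= 0:
        raise ValueError("lane has no signal")
    return bound_intensity / total


def half_saturation(protein_concs, fractions):
    """Linearly interpolate the protein concentration giving 50% shifted oligo.

    Returns None when the titration never crosses 50%.
    """
    points = list(zip(protein_concs, fractions))
    for (c0, f0), (c1, f1) in zip(points, points[1:]):
        if f0 <= 0.5 <= f1 and f1 > f0:
            return c0 + (0.5 - f0) * (c1 - c0) / (f1 - f0)
    return None


# Hypothetical densitometry: (shifted band, free band) per lane, arbitrary units.
lanes = [(0.0, 100.0), (20.0, 80.0), (40.0, 60.0), (60.0, 40.0), (80.0, 20.0)]
protein_um = [0.0, 1.0, 2.0, 4.0, 8.0]  # protein concentration per lane, in uM

fractions = [fraction_shifted(b, f) for b, f in lanes]
apparent_kd_um = half_saturation(protein_um, fractions)
```

Comparing the apparent half-saturation point across oligonucleotides of different sequence or length gives a relative ranking even when the absolute value is approximate.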
CD spectroscopy is a useful technique for determining the secondary structure of a protein. Common features of α-helix (minima at 222 nm and 208 nm; maximum at 190 nm), β-sheet (minimum at 218 nm; maximum at 196 nm), and random coil (minimum at 195 nm; maximum at 212 nm) can be seen in the CD spectra. The spectra can be deconvoluted to determine the global secondary structure content of a protein, using various methods and software (recently reviewed by N.J. Greenfield). The assumption used by all methods is that the observed spectrum is a linear combination of the spectra of its secondary structural components plus a noise contribution from aromatic groups and prosthetic groups. For example, a CD spectrum that has strong minima at 222 nm and 208 nm is expected to come from a protein with a large fraction of α-helix structure. This is helpful when verifying the fold type and stability of a protein sample. These analyses are implemented by software such as SELCON3, which takes mean residue ellipticity, or θMRE, as input. SELCON3 analyzes these data and reports the content of each secondary structure in the analyte.
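Because deconvolution programs work with mean residue ellipticity, the raw instrument reading (in millidegrees) must first be converted. A minimal sketch of the standard conversion is given below, using θMRE = θobs × MRW / (10 × l × c), with the path length l in cm, the concentration c in mg/mL, and the mean residue weight MRW equal to the molecular mass divided by the number of peptide bonds. The sample values are hypothetical and are not measurements from this work.

```python
def mean_residue_weight(molecular_mass_da, n_residues):
    """Mean residue weight: molecular mass divided by number of peptide bonds."""
    return molecular_mass_da / (n_residues - 1)


def mean_residue_ellipticity(theta_mdeg, molecular_mass_da, n_residues,
                             path_length_cm, conc_mg_per_ml):
    """Convert observed ellipticity (mdeg) to theta_MRE in deg cm^2 dmol^-1."""
    mrw = mean_residue_weight(molecular_mass_da, n_residues)
    return theta_mdeg * mrw / (10.0 * path_length_cm * conc_mg_per_ml)


# Hypothetical sample: 20 kDa protein, 181 residues, 0.1 cm cuvette, 0.2 mg/mL,
# with an observed signal of -20 mdeg at 222 nm.
mre_222 = mean_residue_ellipticity(-20.0, 20000.0, 181, 0.1, 0.2)
```

Applying the same function to every wavelength in the scan gives the spectrum in the units that secondary structure analysis programs expect. The exact input file format for a given program should be taken from its own documentation.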
The combination of bioinformatic approaches, protein function prediction and computational docking has allowed the identification of new members of the macrodomain family based on protein sequence alone. These sequences have also been used to develop homology models. To validate these predictions, a system of biochemical and biophysical approaches for characterizing protein structure and ligand interactions may be employed. These methods are useful for the study of proteins with no previously known structural or functional data, as described here for a putative macrodomain in the bat coronavirus HKU4.
This work was supported by University of Alabama at Birmingham Faculty Startup Funding, NIH NIGMS GM119456-01, the University of Alabama at Birmingham Department of Chemistry, and the NSF Bridge to Doctorate Program. We thank Jeffrey McDonald and Sadanandan Velu for assistance with the Glide program. We thank members of the Johnson laboratory for helpful discussions and technical assistance. The UAB Cancer Center NMR facility is supported by a CCSG Grant P30 CA-13148 from the National Cancer Institute.