In selecting a method to produce a recombinant protein, a researcher is faced with a bewildering array of choices as to where to start. To facilitate decision-making, we describe a consensus ‘what to try first’ strategy based on our collective analysis of the expression and purification of over 10,000 different proteins. This review presents methods that could be applied at the outset of any project, a prioritized list of alternate strategies and a list of pitfalls that trip many new investigators.
Recombinant proteins are used throughout biological and biomedical science. Their production was once the domain of experts, but the development of simple, commercially available systems has made the technology more widespread. As a result, also more widespread is an appreciation of the difficult, strategic choices inherent to the process. Commonly confronted questions include: should the protein(s) be expressed in bacteria, in yeast, in insect cells or in human cells? Which expression vector should be used? If bacterial expression is used, which strain(s) should be chosen? Should one express the full-length protein or a fragment thereof? Should the protein be tagged, and which affinity tag is the best? What is a good purification strategy, and what are the common pitfalls? Unfortunately, because every protein is different, there can be no ‘right’ answer to any of these questions a priori, and purification protocols and strategies must be worked out for each individual protein and with an eye to its intended use. This said, each project must begin somewhere, and purification strategies can now be guided by evidence-based trends, probabilities and cautionary notes that have emerged from large-scale structural genomics studies. In this review, which is targeted to the researcher with limited experience in protein expression and purification, we draw on our collective experiences to suggest a ‘consensus’ starting point for soluble protein expression and purification.
Over the past decade, our laboratories have collectively targeted and purified tens of thousands of different proteins from the Eubacteria and Archaea, and thousands from the Eukarya, including fungal, nematode, parasite, plant and human proteins (Table 1). These proteins belong to many different classes, including proteins with no predictable structure, human proteins of therapeutic relevance, proteins from parasites and viruses, integral membrane proteins and multiprotein complexes. A near-complete list of these proteins is available in a database (TargetDB) maintained by the Protein Data Bank (PDB; http://targetdb.pdb.org/) under the auspices of the US National Institute of General Medical Sciences (NIGMS)-funded Protein Structure Initiative (http://www.nigms.nih.gov/Initiatives/PSI/). The European research network Structural Proteomics in Europe (SPINE) also provides detailed target lists online (http://www.spineurope.org/).
In efforts to identify optimal approaches for the initial production and purification of a ‘typical’ protein, our groups have explored many different technologies and strategies. Our common objective has been to balance success rates with ease and breadth of use, speed, cost and versatility1–16. A comparison of our independently optimized approaches shows that our preferred methods have, in many instances, evolved to be quite similar, but by no means identical (Table 2). Accordingly, in an effort to provide guidance to scientists interested in generating purified recombinant proteins, representatives from our research groups collaborated to articulate our ‘consensus’ advice (Box 1), along with a brief rationale for each choice. In essence, we tried to answer the question “What would you try first?”, understanding that several choices are often possible or even desirable. We also provide guidance for those cases in which the initial attempt fails or problems are encountered, in other words, “What next?” In Supplementary Methods online, we provide links to online protocols offered by several structural genomics groups as well as detailed experimental protocols for the methods described here.
It is important to emphasize three aspects of this review. First, it is meant to serve as a guide to those members of the research community who are interested in expressing recombinant proteins, but who feel that they may not have the breadth of experience to decide among the various possible approaches. Second, we selected this consensus strategy because it is simple and has the widest use. There are other methods that are perhaps equivalent, but space limitations preclude an in-depth discussion of all possible cloning, expression and purification strategies. Third, the methods described here were developed with the intention to produce purified, soluble protein in close-to-milligram quantities; there are many applications for purified protein (biochemical assays, antibody production) that may not have such requirements.
There are two important provisos to the methods and strategies described in this review. First, our experience is dominated by studies with nonmembrane cytosolic proteins and/or fragments of proteins that comprise soluble domains. Second, although the protocols for the ‘first attempt’ described here have proven to be optimal for the broadest range of proteins, in any individual case, the methods will fail more often than they succeed.
Recently, sequencing efforts and various cDNA consortia have made available large libraries of full-length, sequence-verified cDNAs. Although there are inevitably issues with clone contamination and mix-up, the resources are in general trustworthy. Among the most comprehensive and best annotated is the Mammalian Gene Collection, which maintains a repository of >19,000 human cDNAs, covering ~65% of all annotated genes. For genes or splice variants not easily obtained through more traditional routes, total gene synthesis can be used. Over the past few years, the cost of gene synthesis has dropped almost fivefold, and it will undoubtedly continue to decrease. One advantage of gene synthesis is the ability to change the codon bias of the gene to be more compatible with the recombinant host. However, for Escherichia coli, expression strains supplemented with additional tRNAs can often overcome the codon bias of the recombinant gene17. For example, in a study of 30 human genes by the Structural Genomics Consortium (SGC), there was no clear advantage in the use of codon-optimized genes compared with the natural sequence expressed in tRNA-supplemented strains (N.A. Burgess-Brown, S. Sharma, F. Sobott, C. Loenarz, U. Oppermann and O. Gileadi; submitted).
The objective of recombinant protein expression is usually to produce a sample that supports a certain biochemical or biological activity, such as enzyme catalysis or protein-ligand interactions. Frequently, the desired activity is supported by a discrete domain, and thus it is often not necessary to express the full-length protein to address a particular biological question. In expressing a protein domain, the choice of the N- and C-terminal boundaries represents an important consideration because even small differences can dramatically influence both solubility and expression. For example, Klock and colleagues18 evaluated a nested set of 2,143 N- and C-terminal truncations from 96 targets and found considerable variation in both solubility and aggregation behavior by altering the protein length by just a few amino acids.
Despite the best efforts, and even for proteins whose domain structure is well-defined, it is not currently possible to predict which specific N- and C-terminal boundaries are most compatible with the expression of a soluble protein. Thus, pragmatism dictates testing many truncated forms of the protein to select one or more for scale-up production. For proteins of known or readily predicted three-dimensional structure, the borders should be engineered to encompass the domain of interest. As an example, ten constructs of the targeted domain might be made at the outset of every project, one corresponding to the full-length protein and nine representing the clones derived from amplifying a combination of three different 5′-end primers and three different 3′-end primers. Gräslund and colleagues have compared the success rate of the nested-primer approach with the predicted success rate if one had chosen only a single ‘optimal’ construct. In a sample set of 400 human protein domains, the use of multiple constructs increased the probability of generating a soluble protein twofold19.
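The ten-construct design described above (one full-length clone plus every pairing of three 5′-end primers with three 3′-end primers) can be enumerated with a short script. This is a minimal sketch: the function name `nested_constructs`, the 450-residue protein and all boundary positions are hypothetical values chosen for illustration, not boundaries taken from any of the studies cited here.

```python
from itertools import product

def nested_constructs(protein_length, n_starts, c_ends):
    """Return (start, end) residue boundaries, 1-based and inclusive.

    The full-length protein is listed first, followed by every pairing
    of an N-terminal start with a C-terminal end.
    """
    constructs = [(1, protein_length)]
    for start, end in product(n_starts, c_ends):
        if not 1 <= start < end <= protein_length:
            raise ValueError(f"invalid boundary pair: {start}-{end}")
        if (start, end) not in constructs:
            constructs.append((start, end))
    return constructs

# Hypothetical example: a 450-residue protein whose domain of interest
# is predicted to lie near residues 120-310.
constructs = nested_constructs(450, n_starts=[112, 118, 124], c_ends=[305, 311, 318])
for start, end in constructs:
    print(f"{start}-{end}: {end - start + 1} residues")
```

Three starts and three ends give nine truncations, so the list holds ten constructs once the full-length clone is included. Spacing the starts and ends a few residues apart keeps the length differences within the range of step sizes commonly used for nested primers.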
To select the sets of PCR primers for proteins with a predictable three-dimensional structure, one should consider prior knowledge of the structure of a related protein, sequence conservation patterns, and predictions of secondary structure or unfolded/disordered regions20,21. Widely accepted guidelines are to: (i) remove predicted membrane-spanning regions; (ii) avoid disrupting predicted secondary structural elements; (iii) respect the boundaries of globular domains, if known; and (iv) avoid inclusion of low-complexity regions or hydrophobic residues at the termini22. The optimal step size between the nested primers is not yet fully understood; we commonly make constructs to encode proteins that vary in length by 2–10 amino acids at each end19. For proteins without a predictable three-dimensional structure, the approximate boundaries of the region of interest might be identified using functional assays and scanning deletion mutagenesis, and then optimal boundaries for expression can be identified using nested sets of PCR primers, as above23. Boundaries of structured domains can also be determined experimentally by using limited proteolysis combined with mass spectrometry analysis24. Clearly, when using protein fragments, caution should be used in interpreting unexpected biological results.
The most common methods now used in our groups to clone target genes into the requisite expression vector rely on homology-based approaches, using either recombination enzymes25 or ligation-independent cloning (LIC)26. Restriction enzyme–based approaches are used less frequently. A comparison of the methods is shown in Supplementary Table 1 online.
Recombination-based methods include, for example, the bacteriophage lambda integrase system27 and the Cre-lox recombination system28. These methods are rapid, easy and produce few false positives. However, the requirement for special cloning sites imposes constraints: either additional amino acid codons are inserted at either end of the gene, making the PCR primers quite long, or the work-around cloning strategies are more complicated. The unique feature of these methods is the ability to transfer the cloned sequence among a series of compatible vectors that can be used to express the gene in different hosts or with different tags. For bacterial expression, however, the probability of identifying a clone that expresses a soluble protein is increased by making different variants of a single protein in the same E. coli host rather than by cloning a single variant into vectors with different tags and expression hosts19,29.
Ligation-independent cloning, which is used by most of our groups, has the disadvantage compared with recombination-based approaches in that one needs to clone sequences independently into each vector (if this is required). However, the method is inexpensive and simple. One scientist can routinely generate two 96-well plates of distinct clones in a week without the benefit of automation.
The stably folded, globular domains of prokaryotic and eukaryotic proteins (for example, catalytic domains or protein interaction domains) are a major focus both of the biomedical research community and of our laboratories. These proteins are generally suitable for expression in E. coli. Over the years, much effort has been put into optimizing E. coli as an expression host for proteins from higher organisms30. This strategy has generated a wide arsenal of tools that can be used to increase the yield of soluble protein.
A surprising variety of other classes of proteins, from full-length bacterial and human proteins, to protein complexes, and even some human integral membrane proteins can also be produced in E. coli. In terms of full-length proteins, analysis of large-scale protein expression trials shows that up to 50% of proteins from the Eubacteria or Archaea and 10% of proteins from the Eukarya can be expressed in E. coli in soluble form31 (http://targetdb.pdb.org/). Overall, the probability of successfully expressing a soluble protein decreases considerably at molecular weights above ~60 kDa (Fig. 1). Proteins that do not express in soluble form may not be modified or folded properly, or may precipitate within E. coli through formation of inclusion bodies. Remarkably, expression in a heterologous host does not solely account for the poor success rates; even after extensive screens of expression conditions, 30% of proteins from E. coli itself cannot be produced in soluble form when overexpressed in E. coli32.
On the basis of these studies, our view is that the first attempt at recombinant production of any protein, whatever the source, should use E. coli as the expression host. It is fast and inexpensive to test a wide variety of possible strategies in E. coli, and one can complete a fairly comprehensive analysis within a relatively short period of time. Alternative systems should be used only after the E. coli system has been reasonably explored. This view balances the fact that there is definitely a lower probability of expressing some classes of proteins in E. coli (full-length eukaryotic proteins, integral membrane proteins) compared with other systems (human or insect cells), with the fact that the E. coli system is useful in many cases, and also is far more cost-effective and convenient.
For high-level protein production purposes, BL21(DE3) is an appropriate E. coli strain. It has the advantage of being deficient in both lon and ompT proteases and it is compatible with the T7 lacO promoter system33. For eukaryotic proteins, it is often important to use BL21(DE3) derivatives carrying additional tRNAs to overcome the effects of codon bias. Historically, ampicillin has been the most commonly used antibiotic-selection marker, but it is being replaced by carbenicillin, which is more stable. Vectors encoding resistance to kanamycin or chloramphenicol are now widely used as well.
We suggest that the protein should be produced as a fusion to an affinity tag because tags dramatically aid in protein purification and rarely adversely affect biological or biochemical activity34. However, in selecting which tag to use, one is faced with a daunting number of choices. Our groups have explored most of the available options, and we observed that no affinity tag emerged as significantly more efficacious in successfully producing soluble, active recombinant proteins35. Despite the lack of a clear winner based on success rate, most of our research groups selected an N-terminal hexahistidine tag that can be removed by a site-specific protease, such as the tobacco etch virus (TEV) protease36. However, many other instances can be found in which proteins can be expressed in soluble form only as fusions to other affinity tags29.
The rationale for the choice of an N-terminal hexahistidine tag is manifold. First, an N-terminal tag ensures that the bacterial transcription and translation machineries always encounter 5′ and N-terminal sequences that are compatible with robust RNA synthesis and protein expression, respectively. Second, oligohistidine-tagged proteins can be purified using a relatively simple protocol based on immobilized metal affinity chromatography (IMAC)37. Third, histidine tags rarely affect the characteristics of the protein, which distinguishes them, for example, from glutathione S-transferase (GST), which itself is a dimer that then imposes dimerization on the recombinant protein. Fourth, the hexahistidine tag is relatively small and usually does not dramatically alter the solubility properties of the target protein. By contrast, larger tags, such as the maltose-binding protein (MBP), can often increase the apparent solubility of the recombinant moiety, even when the protein is either insoluble by nature, or unstable or unfolded and, therefore, less likely to be active38–40. Fifth, for the specific application of protein crystallography, short histidine tags appear to be neutral actors; in most of our projects, we routinely attempt crystallization and NMR structure determination with both cleaved and uncleaved proteins, and their relative representation among the resulting three-dimensional structures is roughly equivalent. A recent PDB-wide survey41 also indicates that hexahistidine tags do not have a consistent impact on the N-terminal structure of the target protein.
The most commonly used expression systems are based on pET vectors (Merck/EMD; the pET System manual, 2006), which drive expression of a recombinant gene under the control of the T7 RNA polymerase promoter and lac operator33,42. The vectors are designed for use in λDE3 lysogen strains of E. coli, which harbor a genomic copy of the gene for T7 RNA polymerase under the control of the lac repressor. Under repressive conditions, T7 RNA polymerase is not produced, and transcription of the target gene is negligible. After induction, when the T7 RNA polymerase is produced, most of the cellular protein synthesis machinery will be devoted to producing the target protein. On occasion, low-level expression of T7 polymerase within these strains leads to expression of the recombinant protein and may slow or prevent growth of the transformed bacteria. The expression of such highly toxic proteins can be effected by using T7 lysozyme-expressing strains42, strains in which the T7 RNA polymerase is under the control of the arabinose promoter43, by producing the protein in a cell-free system44 or by driving expression of the recombinant protein directly by the more tightly regulated arabinose promoter system45.
Using T7 systems, protein expression can be induced either with the chemical inducer isopropyl-β-d-thiogalactoside (IPTG) or by manipulating the carbon sources during E. coli growth (auto-induction; ref. 46 and the pET System manual; Merck/EMD, 2006).
In both cases, the cells can, and should, be grown to high densities (OD600 of 4–20) in highly enriched medium47 in baffled shake flasks48,49. Whatever the final cell density, it is advisable to induce the expression of the T7 RNA polymerase at mid-to-late log phase of the growth curve to ensure maximal yield while avoiding the problems associated with cells going into stationary phase (for example, induction of proteases). One feature of the T7 system is that many recombinant proteins precipitate when expressed at 37 °C, but are soluble when the temperature during induction is 15–25 °C, presumably because slower rates of protein production allow newly synthesized recombinant proteins time to fold properly50. Thus, lower temperatures during induction should be used as the default.
Small-scale test expression is widely used as a predictive tool to determine which of the derivative clones actually produces soluble protein and to establish the optimal scale for the large-scale growth. A major concern is that the expression level and solubility of a recombinant protein are influenced by the culture conditions and the degree of aeration, and these parameters do not always scale with culture volume. The results from small- and large-scale growth also vary owing to differences in the sample preparation and protein purification methods that are used for each scale of growth. Therefore, whereas positive small-scale experiments are often predictive of the results from large-scale growth, there will inevitably be a substantial proportion of false negatives in which an apparently nonexpressed or insoluble protein can, in fact, be expressed in soluble form when grown on a larger scale. If the total number of constructs to be tested is small (for example, <20 constructs), it may be wiser to proceed immediately to larger-scale cultures to avoid any potential complications.
For analysis of large numbers of constructs, parallel small-scale protein purification can be performed efficiently in volumes of 1–20 ml, in 96-well format. This scale typically produces 10–200 µg of protein, which is sufficient for many analytical tests. The results can be used to optimize the construct design and experimental conditions before embarking on larger scale purifications49,51,52.
As a chromatographic procedure, IMAC has the advantages of having strong, specific binding, mild elution conditions and the ability to control selectivity by including low concentrations of imidazole in chromatography buffers. There is a broad array of common resins with slightly different binding capacities and binding strengths, but all tolerate harsh cleaning procedures (TALON Metal Affinity Resins User Manual, Clontech, 2007; the QIAexpressionist, Qiagen, 2003; and HisTrap HP, 1 ml and 5 ml (instructions), Amersham Biosciences, GE Healthcare, 2003). Most purification steps can be integrated by high-performance liquid chromatography; the most commonly used devices are the ÄKTA systems from GE Healthcare.
The final purity of the protein can be optimized by controlling the ratio of recombinant protein to the column size; lower-affinity contaminants can be competed with a relative excess of the histidine-tagged recombinant protein. Accordingly, it is beneficial to determine the amount of the soluble target protein to be loaded on the column, and this can be estimated from small-scale expression trials. As a general rule, to maximize purity, one should load the column with a slight excess over the predicted binding capacity. Although not necessary, it is relatively straightforward to implement these protein purification protocols on automated chromatography systems, which have proven reliable, effective and simple to use.
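The loading rule above reduces to simple arithmetic: scale the small-scale yield to the planned culture volume, then choose a resin volume whose capacity is slightly below that amount. The sketch below assumes a linear scale-up, which, as noted for small-scale trials, does not always hold. The function name, the binding capacity of 10 mg per ml of resin and the 1.25-fold excess are hypothetical; the real capacity should be taken from the resin manufacturer's documentation.

```python
def resin_volume_ml(yield_ug_per_ml_culture, culture_volume_ml,
                    capacity_mg_per_ml_resin, excess=1.25):
    """Resin bed volume (ml) at which the expected load of tagged protein
    exceeds the resin's binding capacity by the factor `excess`."""
    if excess < 1.0:
        raise ValueError("excess must be >= 1 to load at or above capacity")
    if capacity_mg_per_ml_resin <= 0:
        raise ValueError("binding capacity must be positive")
    # Expected soluble target protein, assuming yield scales linearly
    # with culture volume (an assumption, not a guarantee).
    expected_mg = yield_ug_per_ml_culture * culture_volume_ml / 1000.0
    return expected_mg / (capacity_mg_per_ml_resin * excess)

# Hypothetical example: 100 ug recovered from a 10 ml test culture
# (10 ug per ml), scaled to a 2 l culture, on a resin assumed to bind
# 10 mg of protein per ml.
volume = resin_volume_ml(10, 2000, capacity_mg_per_ml_resin=10, excess=1.25)
print(f"use about {volume:.1f} ml of resin")
```

A smaller bed than the one calculated leaves more target protein in the flowthrough, and a larger bed leaves free sites for lower-affinity contaminants, so the excess factor is a trade-off between yield and purity.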
Preparation of the bacterial lysate is a critical step. Optimal conditions maximize cell lysis and the fraction of the recombinant protein that is extracted while minimizing protein oxidation, unwanted proteolysis and sample contamination with genomic DNA. Mechanical lysis by high-pressure homogenization or sonication, or lysis by freeze-thaw procedures with lysozyme are equivalent in most cases. The lysis buffer should contain a strong buffer (50–100 mM phosphate or HEPES) to overcome the contribution of the bacterial lysate, high ionic strength (equivalent to 300–500 mM NaCl) to enhance protein solubility and stability, protease inhibitors and a reducing agent such as Tris(2-carboxyethyl) phosphine hydrochloride (TCEP) to prevent oxidation of the protein. Loading large amounts of bacterial lysate (>1 l culture volume) on small (<1 ml) affinity columns may require prior removal of any particulate or viscous material. This can be accomplished by using enzymes that degrade nucleic acid and cell-wall material, such as DNase or Benzonase (Merck/EMD) and lysozyme, respectively. Some of the enzymes used in lysis are less active in the presence of reducing agents or high salt concentration; optimal lysis may require sequential addition of the components. Clarified lysates can also be filtered before loading on the affinity columns.
IMAC purification is performed in phosphate buffer, pH 8.0, at an ionic strength equivalent to 300–500 mM NaCl. HEPES buffer (and, to a lesser extent, Tris buffer) at pH 7.5–8.0 can also be used. It has been consistently observed that conditions of high ionic strength (for example, 500 mM NaCl) maintain solubility and stability of the widest variety of proteins. Indeed, a substantial fraction of proteins precipitate if the salt concentration is reduced to physiological levels, particularly as the protein becomes more pure and concentrated. The choice of NaCl as the salt is mainly historical and, although not systematically explored, there is no reason to believe that sodium and chloride are optimal. Indeed, sodium and chloride levels in the cell are very low and are probably never the physiologically relevant counter-ions for intracellular proteins. A modest amount of imidazole (see resin manufacturer’s recommendations) should be included in the cell extraction buffer to reduce binding of less histidine-rich proteins to the IMAC column. For intracellular proteins, care should be taken to maintain a reducing environment. TCEP, unlike dithiothreitol (DTT), is compatible with all known IMAC matrices. Finally, inclusion of glycerol (10%) during protein purification enhances the solubility and stability of many proteins.
After the lysate is loaded on the IMAC column, the column should be washed with buffer including an intermediate concentration of imidazole (see manufacturer’s instructions), which will elute weakly bound contaminants without sacrificing large amounts of the recombinant protein. It is sometimes necessary to optimize the wash step with respect to the concentration of imidazole as well as the volume of the wash. Finally, the recombinant protein should be eluted with a step gradient (for example, 300 mM imidazole). If EDTA and DTT are added after IMAC, add the EDTA first to sequester any nickel that has leached off and that could react with the DTT.
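Making up a buffer of this composition from concentrated stocks is a dilution calculation (stock volume = final volume × final concentration / stock concentration). The sketch below is illustrative only. The phosphate, NaCl and glycerol targets follow the ranges given in the text, whereas every stock concentration, the 10 mM imidazole and the 0.5 mM TCEP are hypothetical placeholders. The imidazole level in particular should come from the resin manufacturer's recommendations.

```python
def stock_volumes(final_volume_ml, recipe):
    """Volume (ml) of each stock, plus water, needed for the final buffer.

    `recipe` maps a component name to (stock_conc, final_conc); the two
    concentrations must share a unit (for example mM, or % v/v).
    """
    volumes = {}
    for name, (stock_conc, final_conc) in recipe.items():
        if final_conc > stock_conc:
            raise ValueError(f"{name}: stock is too dilute")
        volumes[name] = final_volume_ml * final_conc / stock_conc
    water = final_volume_ml - sum(volumes.values())
    if water < 0:
        raise ValueError("stock volumes exceed the final volume")
    volumes["water"] = water
    return volumes

# Hypothetical stocks and targets for 1 l of IMAC binding buffer.
recipe = {
    "sodium phosphate pH 8.0": (1000, 50),   # mM stock, mM final
    "NaCl": (5000, 500),                     # mM
    "glycerol": (50, 10),                    # % v/v
    "imidazole": (2000, 10),                 # mM, placeholder value
    "TCEP": (500, 0.5),                      # mM, placeholder value
}
for name, ml in stock_volumes(1000, recipe).items():
    print(f"{name}: {ml:.1f} ml")
```

The calculation fixes the volumes only. The pH of the finished buffer still has to be checked and adjusted after all components, including imidazole, have been added.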
The choice of gel filtration as the next step may be surprising, considering its lower resolving power compared with ion exchange or other adsorption chromatography methods, but this step is often sufficient after IMAC if the protein was abundant in the lysate. Moreover, gel filtration is more generic, can be performed in any buffer condition, and can be used to resolve the oligomerization state of the target protein. In some cases, if the protein is judged insufficiently pure for the intended purpose, one can remove the tag with a histidine-tagged TEV protease and perform IMAC again as an additional ‘generic’ purification step, collecting the recombinant protein in the flowthrough. This step very efficiently removes histidine-rich proteins derived from the expression host, which may have copurified in the primary IMAC procedure, as well as the cleaved tag and the histidine-tagged protease.
Characterizing the purified protein in some detail reduces the risk of wasting resources on protein material of inadequate quality. It also provides a means to ensure that different batches of the same protein have similar properties. Below, we outline a simple, generic protein characterization protocol that allows the experimentalist to judge whether the correct protein has been purified, whether additional molecular species are present and to estimate the approximate protein concentration. Other characterization methods that are very informative but not as widely applied, such as mass spectrometry, static or dynamic light scattering, and measuring protein thermal stability, are described in Supplementary Methods.
If size exclusion chromatography was used as the last purification step, a close look at the chromatogram is essential. Symmetric elution profiles are characteristic of homogeneous proteins, whereas asymmetric profiles reflect inhomogeneous or partially aggregated samples (Fig. 2), or a column that is itself in poor condition. The elution profiles will also reveal the primary oligomerization state. The presence of additional oligomerization states may be of biological significance, or may be a sign of nonspecific aggregation. If the protein elutes in the void volume of the column, the protein is most likely forming large, nonspecific aggregates, which may be an indication of improper folding and compromised activity. It is also of value to examine the protein in each individual peak by SDS-PAGE or mass spectrometry.
After protein purification, samples should be resolved by denaturing SDS-PAGE. If stained with a dye such as Coomassie brilliant blue, the intensity of the bands will usually be proportional to the amount of protein53. This allows one to estimate the purity of the sample and to check whether the purified protein is of the expected size.
To quantify the amount and concentration of purified protein, the simplest and most common method is the Bradford assay53, which measures the binding of Coomassie brilliant blue to the protein. As some proteins bind the dye anomalously, it is also useful to measure the UV absorbance at 280 nm (A280) and calculate the concentration of the protein by using the predicted molar extinction coefficient at 280 nm (http://www.expasy.org/tools/protparam.html). By taking a UV absorption spectrum, it is also possible to uncover contamination with DNA or RNA, or reveal common copurifying cofactors (for example, NAD, FAD, heme).
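The absorbance-based estimate follows the Beer-Lambert law, c = A280 / (ε × path length). The sketch below predicts ε from the sequence using the residue coefficients commonly used for this estimate (5,500 for Trp, 1,490 for Tyr and 125 per disulfide, in M−1 cm−1). The function names, the toy sequence, the absorbance reading and the 25 kDa mass are hypothetical, and the prediction assumes a pure sample free of nucleic acid and absorbing cofactors.

```python
# Contributions to the molar extinction coefficient at 280 nm (M^-1 cm^-1).
TRP, TYR, CYSTINE = 5500, 1490, 125

def extinction_coefficient(sequence, disulfides=0):
    """Predicted extinction coefficient at 280 nm from a one-letter sequence.

    `disulfides` defaults to 0, as expected for a protein kept reduced.
    """
    seq = sequence.upper()
    return TRP * seq.count("W") + TYR * seq.count("Y") + CYSTINE * disulfides

def molar_concentration(a280, epsilon, path_cm=1.0):
    """Beer-Lambert law: concentration in mol/l."""
    if epsilon <= 0:
        raise ValueError("no Trp, Tyr or cystine: A280 is not usable")
    return a280 / (epsilon * path_cm)

def mg_per_ml(a280, epsilon, mass_da, path_cm=1.0):
    """Mass concentration; mol/l times g/mol gives g/l, which equals mg/ml."""
    return molar_concentration(a280, epsilon, path_cm) * mass_da

# Toy sequence with two Trp and five Tyr residues (hypothetical).
epsilon = extinction_coefficient("MWKYAYWGYSYTY")
print(f"epsilon = {epsilon} M^-1 cm^-1")

# Hypothetical reading for a 25 kDa protein with the same epsilon.
conc = mg_per_ml(a280=0.738, epsilon=epsilon, mass_da=25000)
print(f"concentration = {conc:.2f} mg/ml")
```

Proteins with no tryptophan and few tyrosines have a small ε, so their A280 readings are weak and the estimate is less reliable. For such proteins the dye-binding assay is the more practical check.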
Aliquots of the protein to be stored should be placed in thin-walled PCR plastic tubes, frozen in liquid nitrogen and stored at −80 °C. Small aliquots should be frozen to avoid damaging freeze-thaw cycles, and aliquots should be thawed on ice. Concentrated proteins (for example, >1 mg/ml) tend to be more stable to freeze-thaw cycles. Proteins are usually concentrated using centrifuge-driven filter devices with adequate molecular weight size cutoffs. Care should be taken during centrifugation to avoid local over-concentration and irreversible precipitation or aggregation of the protein on the filtration membrane.
It is advisable to explore the stability of the protein to concentration and freeze-thaw cycles before processing the entire batch. The frozen and thawed sample should be compared with protein that was not frozen for biochemical activity, visible precipitation, changes in physical properties (for example, dynamic light scattering or gel filtration profile) or crystallization characteristics. In our collective experience, relatively few proteins are irreversibly inactivated by one freeze-thaw cycle. In those rare instances, the protein can be stored at 4 °C for short periods of time, at −20 °C in high concentrations of glycerol, or as an ammonium sulfate suspension.
In small-scale test expression and solubility trials designed to assess the extent to which a protein partitions to the soluble or insoluble fractions, it is important to ensure that the cells are lysed and fractionated properly. Although this is not technically challenging, we have found that it is very common to fail to achieve complete bacterial lysis, which leads to an underestimation of the proportion of recombinant protein in the soluble fraction. Care should also be taken when removing the soluble fraction after centrifugation; it is relatively easy to contaminate the soluble fraction with insoluble material, which can lead to an overestimate of the amount of recombinant protein in the soluble fraction. As a quality control, it is advisable to inspect the protein profiles of the fractions using SDS gel electrophoresis. Some cellular proteins characteristically resolve into the soluble and insoluble fractions and these serve as excellent internal controls (Supplementary Fig. 1 and 2 online).
The pH of the lysate should be 7.5–8.0 for efficient binding, and the buffer should not contain chelators (EDTA or citrate), high imidazole concentrations (for example, >30 mM for Ni-NTA resins) or DTT. In some instances, it is necessary to reduce the amount of imidazole in the loading buffer to <5 mM. The column must be properly charged with metal ions; when charging columns, the concentrated NiSO4 solution should be buffered and set to pH 7.5. It is also important to remember that imidazole is a base; the final solutions must be adjusted to the correct pH. In some cases the target protein may bind weakly to the IMAC column, so the concentration of imidazole in the wash step should be reduced (for example, to 20 mM).
An incorrect protein may occasionally be expressed and purified, which most commonly results from a simple clone mix-up. In that instance the problem will be detected either by gel electrophoresis or mass spectrometry of the purified protein.
If the recombinant protein is expressed at low levels, it is also relatively common to purify an endogenous E. coli protein that binds to, and elutes from, the IMAC column and that also adventitiously migrates with the predicted mobility of the target protein54. In some cases, this E. coli protein may even appear to be induced after the expression of T7 RNA polymerase. Determining whether you have purified your recombinant protein or an endogenous bacterial protein can readily be accomplished with mass spectrometry, but is more difficult by denaturing gel electrophoresis. A western blot against the affinity tag can sometimes be useful to track the recombinant protein.
If the expression construct is sequenced before the experiment, errors introduced in primer synthesis or PCR will be detected. In practice, PCR-generated sequence errors are so rare that it is often more practical to do the expression trials first, and to sequence the successful expression constructs later. Of course, if none of the constructs express a protein, it is essential to sequence the expression clones and, ultimately, to sequence the clones selected for scale-up and purification.
Copurification of E. coli proteins with the histidine-tagged recombinant protein is very common, especially when the expression level of the recombinant protein is low. Contaminants include proteins that contain multiple histidine residues (for example, SlyD; Table 3), and molecular chaperones that may bind to the resin directly or to the recombinant protein54,55. The affinity resin has limited capacity, so loading near-saturating amounts of the recombinant protein on a column improves purity. Tag cleavage followed by affinity purification is also effective in removing contaminants, as these proteins are unaffected by the protease and bind to the column after reapplication of the cleavage reaction. Samples copurifying with chaperones should be regarded with suspicion because this indicates that the protein may have some unfolded character. In cases where the target protein cannot be separated from the chaperones by additional chromatography, use an alternative expression system, process a different construct of the protein or try working with a closely related ortholog.
If the protein target is contaminated with other proteins, one can perform additional purification steps such as ion-exchange chromatography. Purifying samples contaminated with different post-translationally modified species or proteolytic fragments of the same protein is more challenging, but not necessarily intractable. For example, different phosphorylated states of a protein can sometimes be resolved using ion-exchange chromatography56.
Pure proteins often precipitate out of solution, even at relatively low (<1 mg/ml) concentrations. This behavior is sometimes coupled with sample inhomogeneity, either in the form of contaminating protein or alternate folded states. Precipitation can also occur by aggregation owing to the presence of hydrophobic or hydrophilic patches on the surface of the target protein. In either case, the problem worsens as the protein concentration increases. There are no generic solutions but some potential solutions, which must be explored for each protein, are to: find a more stabilizing buffer through screening using analytical gel filtration or thermal denaturation (see Supplementary Methods), maintain the protein at lower concentration (<0.1–0.5 mg/ml), maintain an adequate reduced state to prevent protein oxidation (>5 mM DTT, refreshed as required), maintain the salt concentration at high levels (ionic strength >500 mM of a monovalent salt), add glycerol to 10%, add arginine in the range of 50–500 mM, add a mild nondenaturing detergent (0.1% β-octylglucoside) or keep the protein at its optimal temperature (determined empirically).
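Because these remedies must be explored empirically for each protein, it can help to enumerate the candidate conditions before setting up a stability screen. The following is a minimal sketch under stated assumptions: the additive levels are illustrative picks at the thresholds mentioned above, the 150 mM NaCl baseline is an assumption not taken from the text, and `screen_conditions` is a hypothetical helper.

```python
from itertools import product

# Candidate levels for each additive. Zero means "additive omitted".
ADDITIVES = {
    "NaCl_mM": [150, 500],            # 150 mM baseline is an assumed value
    "glycerol_pct": [0, 10],
    "arginine_mM": [0, 50, 500],
    "DTT_mM": [0, 5],
    "octylglucoside_pct": [0.0, 0.1],
}


def screen_conditions(additives=ADDITIVES):
    """Return every combination of additive levels as a list of dicts."""
    names = list(additives)
    levels = (additives[name] for name in names)
    return [dict(zip(names, combo)) for combo in product(*levels)]


if __name__ == "__main__":
    conditions = screen_conditions()
    print(len(conditions), "conditions")
    print(conditions[0])
```

A full factorial grid grows quickly, so in practice one might test each additive singly against the baseline first and combine only those that help.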
In even the best of circumstances, it is unusual to generate a soluble version of any given protein on the first attempt. As such, it is important to have a series of alternative approaches. Here we provide various suggestions in the order in which we would usually apply them.
Adjustment of the expression conditions seldom results in radical changes but, as some optimization can be done quite easily, it is worth the effort. The first step is to lower the temperature to slow down protein production. Different types of media can also be tested; rich media, such as Terrific Broth, 2×YT or ZYP5052 (auto-induction), often support good expression. Changing the E. coli strain can also improve expression of a soluble protein51.
As described above, it is important to test the expression of a range of constructs to identify those that express a soluble derivative. We suggest expressing as many as 10 constructs in the initial attempts. If this proves unsuccessful, then it may be advisable to explore additional constructs, particularly if one has knowledge that a structurally related protein can be expressed in soluble form.
Our consensus strategy is to append an N-terminal histidine tag to each construct. If the histidine-tagged recombinant protein does not express or is insoluble, then the probability that it will be expressed in an active form with another N-terminal fusion partner is reduced considerably. Our advice, therefore, is not to iteratively append different N-terminal fusions but to first explore a C-terminal fusion to the histidine tag instead. Some proteins that are completely insoluble with an N-terminal histidine tag can be expressed in soluble form with a C-terminal histidine tag57.
Although we do not advise extensive sampling of other N-terminal fusions, this strategy can sometimes lead to production of soluble, stable fusion protein. If the aim is to study the function of the target protein, and the fusion protein is an acceptable reagent, then it may be an appropriate strategy. However, this approach has its caveats. In the absence of a robust and quantitative functional assay, one reasonably uses solubility as a proxy for function. However, proteins that are soluble only with a larger tag can be ‘dragged’ into solution by the tag, and revert to an insoluble form if the fusion partner is removed38–40. This indicates that the integrity of the recombinant protein as a fusion protein may be suspect. For example, wildtype GFP is mostly insoluble when expressed in E. coli at 37 °C but is largely expressed in the soluble fraction as an MBP fusion58. Nonetheless, bacterial colonies expressing the MBP-GFP fusions display only weak fluorescence, suggesting that the GFP is nonfunctional (G.S. Waldo; unpublished data). Accordingly, before any functional studies, considerable attention should be paid to whether a target protein appears to be soluble only because it is a passenger on a larger tag.
Many proteins are obligate components of multiprotein assemblies and these often require an interacting protein for correct folding and stability21,59,60. Such proteins, and those with unstructured polypeptide chain segments, often cannot be expressed in E. coli in soluble form, but it has proven possible to improve the properties of these proteins by coexpressing the cognate interacting protein61–63. This strategy is only starting to be used in the large-scale projects, in those cases when entire families of interacting proteins are being studied.
Many proteins can be stabilized by the binding of a small molecule, a principle that has found widespread application in generic screening for protein ligands64,65. This property can be exploited to increase the proportion of recombinant protein expressed in soluble form or to stabilize a protein during purification. If a sufficiently soluble, cell-permeable and avid ligand is available, one can use it to stabilize newly synthesized proteins and promote solubility66,67. This concept, too, has not yet been explored systematically.
If bacterial expression is unsuccessful to this point, other hosts should be considered. Common eukaryotic alternatives are the baculovirus expression system in insect cells68, the yeasts Pichia pastoris69 and Saccharomyces cerevisiae70, human cells71, or cell-free systems using prokaryotic or eukaryotic extracts72–76. These cell-free systems, which have been used extensively to generate thousands of purified proteins for structural studies77–79, can be used to produce proteins that are toxic to E. coli79 and can use PCR-amplified linear DNA fragments, without cloning into a vector, for screening and optimization.
All these other expression systems are reasonably simple to use, but they are somewhat more time-consuming to work with than are bacteria and require equipment less commonly found in a typical laboratory.
Proper in vivo folding of a recombinant protein can be promoted by coexpression of molecular chaperones, which are typically produced from cotransformed plasmids carrying several chaperones with synergistic effects, such as the pG-Tf2 vector80—a combination of GroEL-GroES81 and trigger factor82. In our hands, chaperones have been used successfully only in isolated cases, and we know of no study of considerable size that has demonstrated broad efficacy.
A commonly tried but only episodically successful protocol to rescue insoluble protein is to denature the protein and try to refold it in vitro. The method can be successful83,84, particularly for extracellular proteins. However, even the most robust protocols only refold a small fraction of the input protein, and it is difficult to purify the refolded fraction. The best procedures use an activity assay to monitor refolding, and affinity reagents that select any refolded, active protein. We would advise using refolding as a last resort for intracellular proteins.
The methods and strategies for protein expression and purification have been reviewed for the expert many times in excellent, comprehensive ways. Here we attempted to provide a resource for those entering the field, reflecting the experiences of our groups in the application of the various methods to large numbers of proteins. We understand there are many possible routes to obtain high-quality protein and acknowledge that the methods described above should be considered as a starting point that can be embellished once sufficient expertise has been obtained. Detailed protocols for the methods described in this review can be found in the Supplementary Methods.
The Structural Genomics Consortium is a registered charity (number 1097737) that receives funds from the Canadian Institutes for Health Research, the Canadian Foundation for Innovation, Genome Canada through the Ontario Genomics Institute, GlaxoSmithKline, Karolinska Institutet, the Knut and Alice Wallenberg Foundation, the Ontario Innovation Trust, the Ontario Ministry for Research and Innovation, Merck & Co., Inc., the Novartis Research Foundation, the Swedish Agency for Innovation Systems, the Swedish Foundation for Strategic Research and the Wellcome Trust. The New York Structural GenomiX Research Center for Structural Genomics is supported by the US National Institute of General Medical Sciences (U54 GM074945). Work at the MDC was supported by the German Federal Ministry for Education and Research (BMBF) through the Leitprojektverbund Proteinstrukturfabrik and the German National Genome Network (NGFN; FKZ 01GR0471, 01GR0472), and by the Fonds der Chemischen Industrie. The Protein Sample Production Facility is funded by the Helmholtz Association of German Research Centres. The China Structural Genomics Consortium is supported by the National 863 Hi-Tech Research and Development Program of China. The Israel Structural Proteomics Center is supported by The Israel Ministry of Science, Culture and Sport, the Divadol Foundation, the Neuman Foundation, and the European Commission Sixth Framework Research and Technological Development Programme ‘SPINE2-Complexes’ Project under contract 031220. The RIKEN Structural Genomics/Proteomics Initiative was supported by the National Project on Protein Structural and Functional Analyses, Ministry of Education, Culture, Sports, Science and Technology of Japan. The Joint Center for Structural Genomics is supported by the US National Institutes of Health (NIH) Protein Structure Initiative grant U54 GM074898 from the NIGMS. The Northeast Structural Genomics Consortium is supported by the NIH NIGMS (U54-GM074958).
The Midwest Center for Structural Genomics is supported by the NIH (GM074942) and by the US Department of Energy, Office of Biological and Environmental Research (DE-AC02-06CH11357). The Oxford Protein Production Facility is funded by the UK Medical Research Council and Biotechnology and Biological Sciences Research Council. SPINE2-Complexes is funded by the European Commission (contract 031220) under the Framework 6 RTD Programme and is coordinated from the Division of Structural Biology, Wellcome Trust Centre for Human Genetics, Oxford, UK. The Berkeley Structural Genomics Center is supported by the NIH (GM62412). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIGMS or the NIH.
Note: Supplementary information is available on the Nature Methods website.
The authors are:
Susanne Gräslund1, Pär Nordlund1, Johan Weigelt1, B Martin Hallberg1,24, James Bray2, Opher Gileadi2, Stefan Knapp2, Udo Oppermann2, Cheryl Arrowsmith3, Raymond Hui3, Jinrong Ming3, Sirano dhe-Paganon3, Hee-won Park3, Alexei Savchenko3, Adelinda Yee3, Aled Edwards3, Renaud Vincentelli4, Christian Cambillau4, Rosalind Kim5, Sung-Hou Kim5, Zihe Rao6, Yunyu Shi7, Thomas C Terwilliger8, Chang-Yub Kim8, Li-Wei Hung8, Geoffrey S Waldo8, Yoav Peleg9, Shira Albeck9, Tamar Unger9, Orly Dym9, Jaime Prilusky9, Joel L Sussman9, Ray C Stevens10, Scott A Lesley10,11, Ian A Wilson10,11, Andrzej Joachimiak12, Frank Collart12, Irina Dementieva12, Mark I Donnelly12, William H Eschenfeldt12, Youngchang Kim12, Lucy Stols12, Ruying Wu12, Min Zhou12, Stephen K Burley13, J Spencer Emtage13, J Michael Sauder13, Devon Thompson13, Kevin Bain13, John Luz13, Tarun Gheyi13, Fred Zhang13, Shane Atwell13, Steven C Almo14, Jeffrey B Bonanno14, Andras Fiser14, Sivasubramanian Swaminathan15, F William Studier15, Mark R Chance16, Andrej Sali17, Thomas B Acton18, Rong Xiao18, Li Zhao18, Li Chung Ma18, John F Hunt19, Liang Tong19, Kellie Cunningham18, Masayori Inouye18, Stephen Anderson18, Heleema Janjua18, Ritu Shastry18, Chi Kent Ho18, Dongyan Wang18, Huang Wang18, Mei Jiang18, Gaetano T Montelione18, David I Stuart20,23, Raymond J Owens20,23, Susan Daenke20,23, Anja Schütz21, Udo Heinemann21, Shigeyuki Yokoyama22, Konrad Büssow21,24, Kristin C Gunsalus18,24