Acta Crystallogr Sect F Struct Biol Cryst Commun. 2012 April 1; 68(Pt 4): 366–376.
Published online 2012 March 31. doi:  10.1107/S1744309112008421
PMCID: PMC3325800

Detection and analysis of unusual features in the structural model and structure-factor data of a birch pollen allergen

Abstract

Physically improbable features in the model of the birch pollen structure Bet v 1d (PDB entry 3k78) are faithfully reproduced in electron density generated with the deposited structure factors, but these structure factors themselves exhibit properties that are characteristic of data calculated from a simple model and are inconsistent with the data and error model obtained through experimental measurements. The refinement of the 3k78 model against these structure factors leads to an isomorphous structure different from the deposited model with an implausibly small R value (0.019). The abnormal refinement is compared with normal refinement of an isomorphous variant structure of Bet v 1l (PDB entry 1fm4). A variety of analytical tools, including the application of Diederichs plots, Rσ plots and bulk-solvent analysis are discussed as promising aids in validation. The examination of the Bet v 1d structure also cautions against the practice of indicating poorly defined protein chain residues through zero occupancies. The recommendation to preserve diffraction images is amplified.

Keywords: protein structure, Bet v 1 birch pollen allergen, Diederichs plot, validation, bulk-solvent correction, refinement statistics, intensity statistics

1. Introduction  

During a routine search of the public PDB_REDO database (Joosten et al., 2011) for a crystal structure model of birch pollen protein Bet v 1, a significant discrepancy between the originally reported R values (R free = 0.298, R work = 0.274) and the conservatively re-refined structure of PDB entry 3k78 (Bet v 1d) was detected (0.177, 0.126). These R values are unexpectedly low for a 2.8 Å structure. At the same time, the electron-density map provided by the Uppsala Electron Density Server, EDS (Kleywegt et al., 2004), publicly accessible through the PDBe (Velankar et al., 2010), shows numerous side chains that do not fit the experimental electron density. The EDS service also reported a negative bulk-solvent contribution B factor and a negligibly small bulk-solvent contribution scale factor, which is abnormal for an experimentally determined protein structure (Fokine & Urzhumtsev, 2002). Given the fact that the R values calculated by PDB_REDO from the data without refinement (0.265, 0.275; a new R free set was calculated by PDB_REDO) agreed reasonably well with the values reported in the PDB header (0.298, 0.273), an accidental swap of experimentally observed structure factors F(obs) against the final calculated structure factors F(calc) when generating the deposited structure-factor file can be excluded (in that case also the reproduced R values without refinement would be improbably low). In view of these discrepancies it seemed sensible to re-examine the 3k78 model and the associated deposited diffraction data.

The crystal structure model of birch pollen hypoallergen Bet v 1d (Zaborsky et al., 2010), PDB code 3k78, was reported as solved by molecular replacement (MR) from the nearly sequence identical model of the hypoallergenic isoform Bet v 1l (Marković-Housley et al., 2003), PDB entry 1fm4. The model structures are isomorphous (P2₁) with cell constants identical within experimental error. 1fm4 itself was derived by MR from the C222₁ structure model of the clinically important inhalant major allergen, Bet v 1a (Gajhede et al., 1996; PDB entry 1bv1). A sequence alignment including additional information relevant to the following discussion is provided in Fig. 1.

Figure 1
Sequence alignment of Bet v 1 allergens. The yellow codes indicate sequence differences between search model 1fm4 and 3k78, while the red highlights indicate nine residues that contain zero occupancy atoms in both models, 1fm4 and 3k78 ...

The 3k78 model was refined against structure factors with 2.8 Å resolution, and 1fm4 was refined at 2.0 Å. Both structures appear unremarkable (in a technical sense, no insult to biological relevance intended), and the refinement statistics and protocols reported in the PDB entries are appropriate for the resolution. However, on closer inspection, both the model and the structure-factor data of 3k78 exhibit highly unlikely, physically improbable (if not impossible) features. For reference, the results of the 3k78 analysis and re-refinement are compared with those obtained for the isomorphous 1fm4 structure of good and reproducible quality. This comparison may provide useful reference for the aspiring crystallographer and can serve as teaching material.

2. Structure models and re-refinement  

The two models were originally refined using different programs, CNS 1.0 (Brünger et al., 1998), and REFMAC5 (Murshudov et al., 1997, 2011; Winn et al., 2001), with different refinement protocols. To aid comparison, a common isotropic B-factor refinement protocol with REFMAC was used in both cases, with parameters adjusted appropriate to each refinement.

The mmCIF structure-factor files and PDB coordinate files were downloaded from the PDBe (Velankar et al., 2010). Structure-factor files were converted into mtz files using the programs of the CCP4 suite (Winn, 2003; Winn et al., 2011) through the CCP4i user interface (Potterton et al., 2003). The original R free data sets were kept (except in an additional refinement of 3k78 for graphing purposes discussed in §3). Original maximum-likelihood maps were computed via REFMAC (zero cycles) with automated weighting from original coordinates and structure factors, and in case of 3k78 also the TLS parameters were read in from the deposited coordinate file. The procedures for analysis of the structure-factor data are provided in §3.

The common REFMAC protocol included isotropic individual B factors, a flat bulk-solvent model (Jiang & Brünger, 1994) and riding H atoms. The REFMAC X-ray matrix weight (Murshudov et al., 2011) and B-factor restraint weights were manually adjusted by monitoring the negative cross-validation log-likelihood (−LLfree) minimum at convergence (Tickle, 2007).

2.1. Coordinates and model 1fm4  

The coordinate file of the Bet v 1l search model, 1fm4, reveals no unusual features. The PDB file contains residues 2–160 of the sequence, but the residue numbers in the coordinate file are decremented by 1 compared to the aligned sequences in Fig. 1. As specified in REMARK 480, occupancies for the surface exposed, terminal side-chain atoms of Lys28, Lys65, Lys80, Lys103, Lys129, Glu131, Gln132, Lys134 and Lys137 are set to zero (§4, Fig. 8). Zero side-chain occupancies usually indicate that the side chains were poorly defined in electron density owing to displacement such as disorder or multiple conformations, and instead of accepting the correspondingly high displacement parameters or B factors from the refinement, the occupancies of such atoms are manually set to zero. While still common practice, this is not necessarily the best way to indicate the limited knowledge of their actual position (cf. discussion in §4).

2.2. Re-refinement of 1fm4  

Progress in the methodology of macromolecular refinement has led to steady improvements of the programs, and major efforts to re-refine already deposited PDB models have been undertaken in the PDB_REDO effort (Joosten et al., 2011). In this work, the purpose of re-refining the already good 1fm4 structure is not to generate a better model (which ultimately would also require some minor rebuilding) but to provide a benchmark for the applied procedure and an example of the characteristics of a well refined model in order to appreciate the abnormal refinement of 3k78.

1fm4 was already well refined with CNS1.0 about a decade ago. During the multiple weight adjustment runs REFMAC reached stable convergence after about 30 cycles, with a resolution-typical X-ray matrix weight of 0.2 and restraint weight σs for B-factor main-chain 1–2, 1–3 neighbors and side-chain 1–2, 1–3 neighbors adjusted to 3, 5, 7 and 9 Å², which is reasonable given the empirical values (Tronrud, 1996). The re-refined REFMAC model differs very little from the original model. The overall coordinate r.m.s.d. between models on all atoms is 0.247 Å and on Cα is 0.078 Å, which is well below the historic value for 100% sequence identity expected from the Chothia and Lesk function (Chothia & Lesk, 1986). No significant geometry improvements resulted during re-refinement, and both 1fm4 and its re-refined model are of good quality. No attempts at model rebuilding were made, which probably could close the slightly increased R–R free gap (Tickle et al., 1998b, 2000) compared with the original refinement. A subset of refinement statistics relevant to the structure comparison is compiled in Table 1. Considering the different programs (CNS1.0 versus REFMAC5.6), the differences in protocol, as well as different X-ray and restraint weight optimization, this result is quite reassuring and attests to the reproducibility of crystallographic refinement.

Table 1
Selected refinement statistics

The B factors of the previously ‘unoccupied’ side-chain atoms with reset occupancy refined as expected to high B factors, and the inspection of the electron density of these residues in COOT (Emsley et al., 2010) shows the corresponding and increasing weakening of density along the side-chain terminals (§4, Fig. 9). Apart from polishing the model ‘ad tedium’ (the term originally being coined by Phil Evans), the well refined 1fm4 model remains fully valid even under different refinement protocols executed nearly a decade later. As stated above, setting the occupancies of side-chain atoms of residues with weak density to zero seems to be unnecessary and could probably be avoided.

2.3. Coordinates and model 3k78  

Although the 3k78 Bet v 1d model has five backbone torsion angle outliers and numerous severe geometry deviations in the residues with zero occupancy atoms, it is otherwise unremarkable. The coordinate file of 3k78 contains residues 3–159 of the sequence, with the residue numbers matching the sequence alignment in Fig. 1 (i.e. incremented from 1fm4 by 1). However, for the residues containing zero occupancy atoms (Asn29, Lys66, Lys81, Lys104, Lys130, Glu132, Gln133, Lys135 and Lys138) an interesting pattern emerges: the zero occupancies are systematically shifted in atom number to lower values, i.e. it is not the terminal side-chain atoms that are unoccupied, but the zero occupancies move towards the Cβ, and even to the (in the PDB file but not physically) adjacent backbone O atoms of the respective residue, while the terminal atoms of the residues become occupied again (§4, Fig. 8). This pattern is physically highly improbable, but no explanation for this selection of zero occupancy atoms has been reported. These physically improbable model features do, however, lead to some interesting features in the electron density of the original refinement (§4, Fig. 9). The substantial bond distance deviations of most of the residues with zero occupancy atoms are listed in §4, Fig. 10. The remaining deviations can be found in the 3k78 PDB header REMARK 500 records or may be generated with RUN500 from CCP4i.

2.4. Original refinement of 3k78  

The model was originally refined using the REFMAC hybrid TLS–isotropic B-factor refinement (Painter & Merritt, 2006; Murshudov et al., 2011) with a single TLS group. Given the 2.8 Å resolution, hybrid TLS refinement would not be unusual or unreasonable, although a rationale for the choice of protocol, parameterization, and analysis of the (small) TLS contributions is absent (Zaborsky et al., 2010). Original density maps were calculated from unchanged deposited data and coordinates via a zero cycle refinement run in REFMAC (including the published TLS groups and matrices). The resulting R values (0.304, 0.269) were in reasonable agreement with those reported in the PDB header (0.298, 0.273) and by PDB_REDO (0.265, 0.275).

When the original coordinate file is loaded into COOT (Emsley et al., 2010), difference density peaks > 5σ clearly indicate that several residues such as Ile8, Gln37, Glu43, Gly52, Lys56, Glu61, Arg71, Asp110, Glu128, Tyr151 and His155 should be modeled with different conformations (Fig. 2), in agreement with the findings of the EDS service (Kleywegt et al., 2004) which can be readily accessed via the PDB validation links. While such modeling errors are not unusual, they can easily be corrected. There was no support for the claim of unidentified density in the core of the molecule made in the 3k78 publication (Zaborsky et al., 2010). Instead, two chemically plausible water molecules included in the model can be discerned in the electron density. Given the relatively high R values and poor geometry of the side chains with zero occupancy atoms in the published model, rebuilding and re-refinement of 3k78 appeared promising.

Figure 2
Electron density of original 3k78 model. 2mFo − DFc electron density contoured at 0.8σ (blue), 5σ mFo − DFc difference density (positive light green, negative red). The left panel shows the misplaced residues in the ...

2.5. Isotropic B-factor refinement of 3k78  

The original 3k78 coordinates were used without rebuilding (only the zero occupancies were reset to 0.01) for isotropic B-factor refinement. Initially a resolution-appropriate low X-ray matrix weight of 0.1 was used to keep the geometry tight and repair the originally distorted zero-occupancy residues. The same B-factor restraint weights as for 1fm4 (3/5/7/9 Å²) were used for 30 cycles. The refinement did not reach convergence, but the R values already dropped unexpectedly quickly to 0.131 and 0.068. Inspection of the model geometry showed that the model overall had in fact improved, and maps showed that the misplaced residues Ile8, Gln37, Glu43, Gly52, Lys56, Glu61, Arg71, Asp110, Glu128, Tyr151 and His155 all had assumed correct positions practically identical to those in 1fm4 with good geometry in the remarkably noiseless density map. Nine water atoms from 1fm4 that also occupied density in the 3k78 map were added to the new model by a simple cut and paste.

At that point of the refinement the R values had already reached values typical for atomic resolution structures. Given the negative bulk-solvent B factor of −10 Å² and small bulk-solvent scale factor of 0.026 e Å⁻³, no sensible bulk-solvent scattering contribution seemed to be present, and the assumption of calculated structure factors was made. As a consequence, (a) the bulk-solvent correction was turned off, (b) no riding H atoms were included, (c) X-ray matrix weights were increased to 0.6, (d) B-factor restraint weights were loosened up to their physically reasonable limit (5/7/9/11 Å²) as established by empirical values (Tronrud, 1996).

The refinement, with its atypical protocol for any experimental protein structure, reached stable convergence at R values of 0.040 and 0.019, with stable geometry and practically the same target r.m.s.d. values as 1fm4 (Table 1). The resulting density maps were practically noiseless, with the only remaining significant difference density features in the vicinity of the residues with unoccupied side-chain atoms. According to PROCHECK (Laskowski, 2001) or RUN500, the entire model had excellent geometry quality. Tedium was declared and no manual rebuilding of the side chains with unoccupied atoms was attempted.

At this point it was clearly established that (a) the deposited structure factors are calculated structure factors, (b) the resulting re-refined model resembles in most details the mutated search model, and (c) the original model has not, or not properly, been refined against these structure factors (or had been altered from a model essentially similar to the re-refined model and after the structure factors had been calculated).

3. Analysis of structure factors  

Given the highly improbable refinement results inconsistent with experimental data at 2.8 Å resolution, a closer examination of the deposited structure-factor data was undertaken.

3.1. Intensity statistics and R-value analysis  

The data for 1fm4 and for 3k78 were collected in-house on rotating anode sources and recorded on imaging plate detectors, with reported redundancies of 3.3 and 2.1 respectively, and should be comparable. In absence of unmerged intensity data, a SHELX (Sheldrick, 2008) format data file was generated from the mtz structure-factor amplitudes, read into XPREP (George Sheldrick, Bruker AXS) with HKL3 format option, and converted to intensities following the basic, error-propagation-based F to I conversion (see e.g. Rupp, 2009, p. 328), i.e. $I = F^2$ and $\sigma(I) = 2F\,\sigma(F)$.
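The amplitude-to-intensity conversion can be illustrated with a minimal sketch. It assumes the standard first-order error propagation for I = F², which gives σ(I) = 2Fσ(F); the function name and the sample amplitudes are hypothetical and are not taken from either data set or from XPREP.

```python
def amplitudes_to_intensities(f_values, sigf_values):
    """Convert F, sigma(F) to I = F^2 and sigma(I) = 2 * F * sigma(F)."""
    intensities = [f * f for f in f_values]
    sigmas = [2.0 * f * s for f, s in zip(f_values, sigf_values)]
    return intensities, sigmas


# Hypothetical amplitudes and standard uncertainties (illustration only)
f_demo = [10.0, 50.0, 200.0]
sigf_demo = [1.0, 1.5, 2.0]

i_demo, sigi_demo = amplitudes_to_intensities(f_demo, sigf_demo)
for f, i, s in zip(f_demo, i_demo, sigi_demo):
    print(f"F = {f:6.1f}  I = {i:8.1f}  sigma(I) = {s:6.1f}  I/sigma(I) = {i / s:6.1f}")
```

Under this propagation I/σ(I) equals F/(2σ(F)), so the signal-to-noise ratio of the converted intensities is fixed entirely by the deposited amplitudes and their uncertainties.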

While the mean I, mean I/σ(I), and Rσ (Schneider & Sheldrick, 2002) values for 1fm4 are typical, the 3k78 data show highly unusual features (Table 2, Supplementary Table 3b, Fig. 3). The value of Rσ for validation is based on the fact that it allows computation and assessment of an a posteriori R merge-like data-quality indicator when unmerged data or images for proper reprocessing are not available owing to the unfortunate absence of a formal obligation to deposit unmerged intensity data or diffraction images. $R_\sigma = \sum \sigma(I) / \sum I$ tends to be somewhat lower than the corresponding linear R merge. For a discussion of the various merging R values see Diederichs & Karplus (1997); Weiss (2001); Rupp (2009); and Einspahr & Weiss (2012).
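As a hedged illustration of how Rσ can be obtained from merged data alone, the sketch below uses the definition Rσ = Σσ(I)/ΣI (Schneider & Sheldrick, 2002), overall and in resolution shells. The function names, shell limits and reflection values are hypothetical.

```python
def r_sigma(intensities, sigmas):
    """R_sigma = sum(sigma(I)) / sum(I) over the given reflections."""
    total_i = sum(intensities)
    if total_i <= 0:
        raise ValueError("sum of intensities must be positive")
    return sum(sigmas) / total_i


def r_sigma_by_shell(reflections, shell_edges):
    """reflections: (d_spacing, I, sigma_I); shell_edges: descending d limits in Angstrom."""
    results = []
    for d_max, d_min in zip(shell_edges[:-1], shell_edges[1:]):
        shell = [(i, s) for d, i, s in reflections if d_min <= d < d_max]
        if shell:  # skip empty shells
            value = r_sigma([i for i, _ in shell], [s for _, s in shell])
            results.append((d_max, d_min, value))
    return results


# Hypothetical merged reflections: (d spacing, I, sigma(I))
demo = [(8.0, 900.0, 30.0), (5.0, 400.0, 20.0), (3.5, 100.0, 10.0), (2.9, 25.0, 5.0)]

overall = r_sigma([i for _, i, _ in demo], [s for _, _, s in demo])
shells = r_sigma_by_shell(demo, [10.0, 4.0, 2.8])
print(f"overall R_sigma = {overall:.4f}")
for d_max, d_min, value in shells:
    print(f"{d_max:5.1f}-{d_min:4.1f} A  R_sigma = {value:.4f}")
```

For experimental data the shell values would be expected to rise towards high resolution as the intensities weaken.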

Figure 3
Mean I/σ(I) and Rσ versus resolution for 1fm4 and 3k78. The left column shows what can be considered representative statistics for experimental diffraction data (1fm4). The I/σ(I) versus resolution graphs generally reproduce the ...
Table 2
Comparison of key intensity statistics of 1fm4 versus 3k78

3.2. Diederichs plots  

The improbably low Rσ values in 3k78 data are caused by a discrepancy between the intensities and their exceptionally low standard uncertainties. In addition to Poisson-statistics-derived counting errors, multiple other sources of instrumental errors limit the achievable signal to noise ratio, that is, I/σ(I). This has been investigated in detail (Diederichs, 2010), and Diederichs notes that even with good crystals the I/σ(I) ratio of the strongest (unmerged) observations is rarely above 30 even in the lowest resolution shell. It is obvious then, that ‘counting statistics are not the limiting factor, as individual reflections may well have many more than 10 000 counts, which would allow I/σ(I) ratios of more than 100 and low-resolution R factors of better than 1%’ (Diederichs, 2010). The paper also provides multiple plots of I/σ(I) versus log(I) which show distinct plateaux at around I/σ(I) values of about 20 to 30.

In absence of original unmerged intensity data and to account for possible effects of redundancy, the 1fm4 data with a reported overall redundancy of 3.3 and of 3k78 with a redundancy of 2.1 were compared with the aid of Diederichs plots (Fig. 4). 1fm4 shows the behavior expected for a normal data set, while 3k78 shows extremely high I/σ(I) values and completely atypical behavior, and the data are apparently unlimited by any instrument measurement errors.
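The contrast between the two behaviors can be sketched numerically. The example below prepares the points of a Diederichs-type plot, I/σ(I) against log(I), for two simulated error models: pure counting statistics, and counting statistics plus a proportional instrument term. The proportional term and its limiting ratio of 25 are simplifying assumptions chosen to mimic the plateau of about 20 to 30 described above; they are not the error model of any particular data-processing program.

```python
import math


def diederichs_points(intensities, sigmas):
    """Return (log10(I), I/sigma(I)) pairs for reflections with I > 0 and sigma > 0."""
    return [(math.log10(i), i / s) for i, s in zip(intensities, sigmas) if i > 0 and s > 0]


def sigma_with_instrument_error(intensity, limiting_ratio):
    """Counting variance I plus an assumed proportional term (I / limiting_ratio)^2."""
    return math.sqrt(intensity + (intensity / limiting_ratio) ** 2)


intensities = [10.0 ** k for k in range(1, 7)]  # 10 to 1 000 000 counts
limited = diederichs_points(
    intensities, [sigma_with_instrument_error(i, 25.0) for i in intensities]
)
counting_only = diederichs_points(intensities, [math.sqrt(i) for i in intensities])

print("log10(I)  I/sigma (instrument-limited)  I/sigma (counting only)")
for (log_i, ratio_lim), (_, ratio_cnt) in zip(limited, counting_only):
    print(f"{log_i:8.1f}  {ratio_lim:28.1f}  {ratio_cnt:23.1f}")
```

In this sketch the instrument-limited curve levels off below the assumed limiting ratio, whereas the counting-only curve keeps growing with the square root of I, which is the qualitative difference between the two panels of a plot such as Fig. 4.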

Figure 4
Diederichs plots for 1fm4 and 3k78. The left panel depicts the graph of I/σ(I) versus log(I) for each unique reflection in the 1fm4 data set. It can be clearly seen that the sigmoid shape of the distribution levels off at around 20 to 30 I/σ( ...

The resulting improbably high signal-to-noise ratios in turn indicate that these standard uncertainties are not based on any experimental variances. Some analysis of a possible origin can be provided by examining a non-logarithmic version of the Diederichs plot. A simple power law fit of the deposited data reveals that the signal-to-noise ratio I/σ(I) is essentially proportional to the square root of I, which is expected if the σ(I) is computed from $I^{1/2}$. An error model closely reproducing the deposited standard uncertainties can be obtained by generating a random error from the absolute inverse cumulative normal distribution around mean zero with a σ of 3.0 via the Excel NORMINV function, and forming the square root of the product of this random error with I. From these I/σ(I) values (Fig. 5), F and σ(F) follow again by basic error propagation, with an atypical σ(F) distribution very similar to the deposited standard uncertainties. Spreadsheets including the calculations and additional graphs are included in the supplementary material.
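The error model described above can be imitated outside a spreadsheet. The following sketch is one reading of that description, not the supplementary spreadsheet itself: `statistics.NormalDist.inv_cdf` stands in for the Excel NORMINV function, σ(I) is taken as the square root of the product of the absolute random error and I, and the exponent of I/σ(I) versus I is then recovered by a log-log least-squares fit. The intensity range, random seed and sample size are arbitrary assumptions.

```python
import math
import random
from statistics import NormalDist


def simulated_sigma_i(intensity, rng, spread=3.0):
    """sigma(I) = sqrt(e * I), with e = |inverse normal CDF(p; mean 0, sd spread)|."""
    p = rng.uniform(1e-9, 1.0 - 1e-9)  # keep p strictly inside (0, 1)
    e = abs(NormalDist(0.0, spread).inv_cdf(p))
    return math.sqrt(max(e, 1e-12) * intensity)  # guard against e == 0


def loglog_slope(xs, ys):
    """Least-squares slope of ln(y) against ln(x), i.e. the power-law exponent."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    return sxy / sxx


rng = random.Random(0)  # fixed seed so the sketch is reproducible
intensities = [10.0 ** rng.uniform(1.0, 5.0) for _ in range(5000)]
ratios = [i / simulated_sigma_i(i, rng) for i in intensities]

slope = loglog_slope(intensities, ratios)
print(f"fitted exponent of I/sigma(I) versus I: {slope:.3f}")
```

Because the random factor is independent of I, the fitted exponent should come out close to 0.5, the square-root dependence noted in the text.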

Figure 5
Model of the experimental uncertainties. The left panel depicts the graph of I/σ(I) versus (I) for the 1fm4 data set (i.e., a subsection of a non-log Diederichs plot). The distribution follows the $I^{1/2}$ versus I parabola (a.k.a. power law), indicating ...

3.3. Bulk-solvent content analysis  

Protein crystals contain large fractions of disordered solvent, whose bulk-solvent scattering contributions suppress the low-resolution intensities in an experimentally collected protein diffraction data set. The low-resolution structure factors calculated without bulk-solvent contributions should be significantly higher than the observed structure factors, while at the same time the R values for a refinement of a not bulk-solvent-corrected structure should be much higher than for a properly bulk-solvent-corrected structure. Representative graphs and a review of bulk-solvent scattering models can be found in Fokine & Urzhumtsev (2002) and in basic textbooks (e.g. Rupp, 2009).
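To give a feeling for the magnitudes involved, the sketch below evaluates the resolution-dependent weight of the solvent term in a flat bulk-solvent model, assumed here in its common form k_sol · exp(−B_sol/(4d²)) applied to the solvent-mask structure factors. The 'typical' pair (0.35 e Å⁻³, 46 Å²) is an assumption based on commonly quoted mean values; the second pair is the one reported for 3k78 in §2.5. The function name is hypothetical.

```python
import math


def solvent_weight(d_spacing, k_sol, b_sol):
    """Weight of the mask term in a flat bulk-solvent model: k_sol * exp(-B_sol / (4 d^2))."""
    return k_sol * math.exp(-b_sol / (4.0 * d_spacing ** 2))


typical = (0.35, 46.0)         # assumed typical k_sol (e/A^3) and B_sol (A^2)
entry_3k78 = (0.026, -10.0)    # values reported for 3k78 in section 2.5

print("   d (A)   typical    3k78")
for d in (20.0, 10.0, 5.0, 2.8):
    w_typ = solvent_weight(d, *typical)
    w_3k78 = solvent_weight(d, *entry_3k78)
    print(f"{d:8.1f}  {w_typ:8.3f}  {w_3k78:6.3f}")
```

With the assumed typical parameters the solvent term carries substantial weight at low resolution and fades towards high resolution. With the 3k78 parameters it is more than ten times smaller at 10 Å and, because of the negative B factor, grows with resolution, which has no physical interpretation.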

The original cross-validation data set contained only 4.8% of the data (162 reflections), and in the two lowest resolution shells the original 3k78 data contained no or only one cross-validation reflection, respectively. For the overall data range, the uncertainty in R free (Kleywegt & Brünger, 1996; Tickle et al., 1998a) is still acceptable with the low number of cross-validation reflections, but for plotting in shells the R free count is too low to be of practical value. For plotting, new a posteriori R free data (Brünger, 1997) were obtained from new cross-validation data sets with 10% random selection against which the coordinate-perturbed starting model from the first 3k78 isotropic refinement was refined. Even with this suboptimal cross-validation procedure, the isotropic B-factor refinements reproduced the same R values of around 0.04/0.02. The R free versus resolution plots for 3k78 were still noisy but show the same trend as plots from the original cross-validation set, and these data were used in the following analysis.

Structure factors and R values were calculated by REFMAC with and without bulk-solvent correction from the respective re-refined models of 1fm4 and 3k78. The R free versus resolution plots as well as F(calc) and F(obs) versus resolution show expected behavior for 1fm4 consistent with bulk-solvent scattering contributions (Fig. 6). The same plots for 3k78 indicate absence of bulk-solvent scattering contributions in the structure factors, consistent with the negative bulk-solvent correction and trivially small bulk-solvent scale factor reported by REFMAC and the EDS report. The R free plot for 3k78 shows the same lack of the strong increase in low resolution R value that would be expected for the refinement in the absence of a bulk-solvent correction and resembles the findings for the fabricated C3b structure (Janssen et al., 2007). Given identical F(obs) and F(calc) without bulk-solvent contribution, logarithmic intensity ratio data plots (not shown) again replicate the situation demonstrated for the C3b structure.
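The expected effect on the low-resolution R value can be made concrete with a small sketch of the linear R value, R = Σ|F(obs) − k·F(calc)| / ΣF(obs). The amplitudes are hypothetical and only mimic the situation in which structure factors calculated without a solvent contribution overestimate the observed low-resolution amplitudes; they are not taken from 1fm4 or 3k78.

```python
def r_value(f_obs, f_calc, scale=1.0):
    """Linear R value: sum(|Fobs - scale * Fcalc|) / sum(Fobs)."""
    return sum(abs(o - scale * c) for o, c in zip(f_obs, f_calc)) / sum(f_obs)


# Hypothetical amplitudes in one low-resolution shell (illustration only)
f_obs_low = [100.0, 80.0]
f_calc_no_solvent = [150.0, 120.0]    # overestimated without solvent contribution
f_calc_with_solvent = [105.0, 78.0]   # after an assumed bulk-solvent correction

r_without = r_value(f_obs_low, f_calc_no_solvent)
r_with = r_value(f_obs_low, f_calc_with_solvent)
print(f"R without bulk solvent: {r_without:.3f}")
print(f"R with bulk solvent:    {r_with:.3f}")
```

For experimental data the two values differ strongly in the lowest resolution shells; when they barely differ, as found for 3k78, the deposited amplitudes behave like data calculated from a model without solvent.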

Figure 6
Bulk-solvent contribution analysis for 1fm4 and 3k78. The left panels depict the expected, nearly textbook-like behavior of a normal crystal structure like 1fm4. The top row shows the resolution-dependent behavior of R free when the bulk-solvent correction ...

For the purpose of validation, bulk-solvent parameters need to be calculated reliably from the original data. The EDS data at present suffer from some divergences, leading to a multimodal distribution probably caused by certain threshold or limit values for the bulk-solvent parameters. A consistent calculation with the flat bulk-solvent contribution model (Afonine et al., 2005; Afonine, 2012) using phenix.refine (Adams et al., 2010) provides ~40 000 valid bulk-solvent contribution B-factor–scale-factor pairs. The probability distribution function represented in Fig. 7 is consistent with the earlier published smaller set of data (Fokine & Urzhumtsev, 2002). Entry 3k78, the fabricated entry 2hr0 (Janssen et al., 2007), and two entries that are now updated (1n0q and 1n0r) but contained erroneously deposited calculated structure factors (Mosavi et al., 2002), could be clearly identified as outliers given the distribution in Fig. 7.

Figure 7
Probability distribution function of bulk-solvent correction parameters. The plot shows the distribution of bulk-solvent parameter pairs (scale factor and B factor) calculated from 40 000 PDB entries where valid parameters could be refined using ...

4. Improbable model features caused by zero occupancies  

The pattern that the zero occupancy atoms of 3k78 residues (Asn29, Lys66, Lys81, Lys104, Lys130, Glu132, Gln133, Lys135 and Lys138) display seems to be caused by a shift of zero occupancies to atoms with atom numbers decremented consistently by 2. This shift causes the backbone O atoms of the respective residue to become unoccupied, while the terminal atoms of the residues become occupied again (Fig. 8 and Supplementary Table 4a). Such errors could be introduced during the preparation of molecular replacement models. In case of experimental structure factors, the electron-density map will indicate the error by positive difference density peaks in place of the atoms missing in the model. In case of 3k78, however, the atom absences propagate into the electron density.
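The effect of such a shift can be reproduced on a toy example. The sketch below builds fixed-column PDB ATOM records for a single hypothetical lysine (placeholder coordinates and B factors, standard N, CA, C, O, CB, ... atom order), sets the occupancies of the four terminal side-chain atoms to zero, and then moves the zero occupancies to atom serial numbers decremented by 2. The helper names are hypothetical; only the column positions follow the PDB format.

```python
LYS_ATOMS = ["N", "CA", "C", "O", "CB", "CG", "CD", "CE", "NZ"]


def atom_line(serial, name, resseq, occupancy):
    """Format one fixed-column PDB ATOM record with placeholder coordinates and B factor."""
    padded = f" {name:<3s}"  # atom names of fewer than four characters start in column 14
    return (f"ATOM  {serial:5d} {padded} LYS A{resseq:4d}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{occupancy:6.2f}{30.0:6.2f}")


def zero_occupancy_atoms(lines):
    """Return (serial, atom name) for ATOM/HETATM records whose occupancy is 0.00."""
    hits = []
    for line in lines:
        if line.startswith(("ATOM", "HETATM")) and float(line[54:60]) == 0.0:
            hits.append((int(line[6:11]), line[12:16].strip()))
    return hits


# Reference residue: terminal side-chain atoms unoccupied
terminal = {"CG", "CD", "CE", "NZ"}
reference = [atom_line(n + 1, name, 130, 0.0 if name in terminal else 1.0)
             for n, name in enumerate(LYS_ATOMS)]

# Same residue with the zero occupancies moved to serial numbers decremented by 2
shifted_serials = {serial - 2 for serial, _ in zero_occupancy_atoms(reference)}
shifted = [atom_line(n + 1, name, 130, 0.0 if (n + 1) in shifted_serials else 1.0)
           for n, name in enumerate(LYS_ATOMS)]

reference_names = [name for _, name in zero_occupancy_atoms(reference)]
shifted_names = [name for _, name in zero_occupancy_atoms(shifted)]
print("zero occupancy (reference):", reference_names)
print("zero occupancy (shifted):  ", shifted_names)
```

In this toy residue the shift moves the zero occupancies onto the backbone O and Cβ atoms while the terminal Cε and Nζ atoms become occupied again, which is the kind of pattern described for 3k78.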

Figure 8
Zero occupancy atoms in 1fm4 and 3k78. Condensed REMARK 480 from PDB headers. The atoms in 3k78 (right-hand columns) are shifted towards lower atom numbers compared to 1fm4, causing the zero occupancies to progress towards the main chain including the ...

Quite unexpected is that in the original 3k78 maps (§2.4) neither 2mFo − DFc density, down to near-noise levels below 0.5σ, nor difference density in the mFo − DFc maps is visible for the unoccupied atoms, including the backbone O atoms in Lys130, Glu132 and Gln133 (Fig. 9). The weak difference density for Lys135 probably results from incorrect placement. Given the reported typical main-chain B factors (~30–35 Å²) of the adjacent, covalently connected backbone atoms, this behavior is very unusual and improbable. Following the lysine side chains towards the solvent, there is again clear density for the solvent-exposed Cε and Nζ atoms of the lysine residues, but they are untethered by hydrogen bonds or other contacts. These observations are characteristic of data calculated from a model with zero occupancy atoms.

Figure 9
Normal and pathological side-chain density. 2mFo − DFc electron density contoured at 0.8σ. The left panel shows the progressive weakening of electron density owing to displacement of the side-chain atoms, after re-refinement ...

Setting occupancies of protein atoms that are poorly defined or absent in electron density to zero has very little effect on the overall model quality or refinement itself: zero occupancy as well as a very high B factor both lead to respectively zero or negligible scattering contributions, and either will have an insignificant effect on the rest of the model. Inspection of the electron density of the side-chain atoms of residues with reset occupancy in the re-refined 1fm4 model illustrates the fact that such atoms simply refine to high B factors and display correspondingly weak electron density (Fig. 9). Nevertheless, it should be kept in mind that for many cases of local disorder, large isotropic displacement (B) factors are not a physically correct description either (Merritt, 2012). A number of other inconsistencies and problems however can be introduced by zero occupancy atoms in the chain of a protein model.

  • (i) Despite the fact that these unoccupied atoms are not included in the refinement, they do remain in the model but may not be included in the calculation of the r.m.s. deviation from geometry restraint target values listed in the PDB header. Table 1 lists such a discrepancy for 3k78.
  • (ii) An additional problem caused by the zero occupancies is that geometry validation programs may be misled. For example, WHAT_CHECK (Hooft et al., 1996) properly warns of zero occupancy atoms but does not compute their geometry deviations, leaving the corresponding errors unlisted. Fig. 10 demonstrates this scenario for entry 3k78. MolProbity (Davis et al., 2007) also excludes atoms with occupancies below 0.02 and also does not report side-chain bond distance and angle violations (J. Richardson, personal communication). However, the PDB validation does include zero occupancy atoms in the preparation of geometry violation statistics for REMARK 480 and 500 (available as RUN500 from the CCP4i interface).
    Figure 10
    WHAT_CHECK report of bond distance violations for 3k78. The last column contains the deviation from known r.m.s. values, expressed in σ levels. Setting atoms to zero occupancies can lead to missing them during model validation and correction. ...
  • (iii) Not all display programs recognise zero occupancies, while at the same time the B factors of those atoms can be set to an arbitrary, non-representative (often low) value which again may be misinterpreted, or missed in B-factor analysis.
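The statement that a very high B factor and a zero occupancy both lead to negligible scattering contributions can be checked with a short calculation. The sketch assumes the usual isotropic attenuation, in which an atom's contribution to a reflection at resolution d is scaled by occupancy · exp(−B/(4d²)); the function name and the chosen B factors are illustrative assumptions.

```python
import math


def relative_contribution(occupancy, b_factor, d_spacing):
    """Occupancy times the isotropic attenuation exp(-B / (4 d^2)) at resolution d."""
    return occupancy * math.exp(-b_factor / (4.0 * d_spacing ** 2))


d = 2.0  # resolution of the 1fm4 data in Angstrom
for label, occ, b in (("ordered atom", 1.0, 30.0),
                      ("disordered atom, high B", 1.0, 100.0),
                      ("zero-occupancy atom", 0.0, 30.0)):
    print(f"{label:26s} occ = {occ:4.2f}  B = {b:5.1f}  "
          f"relative contribution at {d:.1f} A = {relative_contribution(occ, b, d):.4f}")
```

At 2.0 Å an atom with B = 100 Å² contributes only about 0.2% of its unattenuated scattering, so for the rest of the model it is nearly equivalent to an unoccupied atom, while remaining visible to geometry validation and B-factor analysis.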

5. Conclusions  

The findings surfacing during model refinement in §2 and amplified during the structure factor analysis in §3 and the feature propagation discussed in §4 provide consistent and very convincing evidence that (a) the structure-factor data deposited for 3k78 are calculated structure factors, (b) the resulting re-refined model resembles in most details the mutated search model, and (c) the original model has not, or not properly, been refined against these structure factors (or had been altered from a model essentially similar to the re-refined model and after the structure factors had been calculated). Not having been refined against the deposited structure factors, the 3k78 model at present lacks an experimental basis. The findings leading to the above conclusions are summarized below.

  • (i) The deposited structure factors do not contain any bulk-solvent contribution.
  • (ii) The noise level of the data is abysmally small and nearly constant over the entire resolution range, consistent with a truncated calculated data set with inappropriate error model.
  • (iii) The Diederichs plots show signal-to-noise ratios almost orders of magnitude higher than expected for real data, indicating the absence of instrumentation errors in the calculated structure factors and in the error model.
  • (iv) The structure factors deposited for the PDB entry 3k78 are in fact calculated structure factors, and their standard uncertainties are not based on experimental errors.
  • (v) Because the original refinement against these structure factors gives the same R values as reported or calculated by PDB_REDO and in this work, a simple error of swapping the F(obs) and F(calc) columns during data deposition can be excluded.
  • (vi) The refinement statistics reported in the PDB header are inconsistent with actual refinement against the structure-factor data.
  • (vii) The model refines against the deposited 2.8 Å data to near-zero R values without bulk-solvent correction, without H atoms and with atypical X-ray matrix weights, which is compatible only with calculated structure factors.
  • (viii) The model obtained by re-refinement does not correspond to the deposited model, but is in its details closer to the molecular replacement starting model.
  • (ix) The non-physical zero occupancy residues in the model are faithfully reproduced in the electron density calculated from the deposited structure-factor data, which is inconsistent with experimental data obtained from a real protein structure.
  • (x) Numerous residues of the original model are not located in their electron density, but return to the exact position of the density when refined. This is consistent with these parts of the re-refined model being manipulated after the structure factors were generated from it.
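
Point (ii) can be checked on any deposited amplitude file by tabulating the signal-to-noise ratio in resolution shells. The sketch below is illustrative only: it assumes that "noise level" is read as the mean F/σ(F) per shell, uses equal-count shells, and introduces the hypothetical names `shell_signal_to_noise` and `relative_spread`. It sets no threshold for what counts as "nearly constant", because the text does not define one. For experimental data the ratio is normally expected to fall towards high resolution.

```python
import math


def shell_signal_to_noise(reflections, n_shells=5):
    """Mean F/sigma(F) in equal-count resolution shells.

    reflections: iterable of (d_spacing_in_angstrom, F, sigma_F).
    Returns a list of (d_max, d_min, mean_F_over_sigma), ordered from
    low resolution (large d) to high resolution (small d).
    """
    usable = sorted((r for r in reflections if r[2] > 0), key=lambda r: -r[0])
    if not usable:
        return []
    size = math.ceil(len(usable) / n_shells)
    shells = []
    for start in range(0, len(usable), size):
        chunk = usable[start:start + size]
        mean_snr = sum(f / s for _, f, s in chunk) / len(chunk)
        shells.append((chunk[0][0], chunk[-1][0], mean_snr))
    return shells


def relative_spread(values):
    """(max - min) / mean, a crude measure of how flat a profile is."""
    mean = sum(values) / len(values)
    return (max(values) - min(values)) / mean


# Synthetic illustration: identical F and sigma at every resolution.
flat = [(10.0 - 0.1 * i, 100.0, 1.0) for i in range(50)]
profile = shell_signal_to_noise(flat)
print(relative_spread([snr for _, _, snr in profile]))
```

A spread close to zero, combined with very large ratios in every shell, is the pattern described in points (ii) and (iii); deciding whether a given value is suspicious still requires comparison with data sets known to be experimental.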

Each of these points alone is reason for concern, and when combined and evaluated against prior expectations, they leave no doubt that model and data of 3k78 are incompatible and that the deposited structure factors are not based on actual experiments, and their standard uncertainties are not based on experimental errors.

Following basic scientific epistemology, strong and convincing evidence would have to be provided to overcome these doubts (Rupp, 2010). In the case of an error during deposition, this should be trivial to achieve, and database integrity could be easily restored. At the least, an experimental data set which refines to the deposited structure, or unmerged intensity data reprocessed from the original images, should be supplied. Most convincingly and irrefutably, the presentation of actual diffraction images which produce data representing the deposited model would establish the facts.

6. A few recommendations  

Considerable efforts by the PDB validation task force (Read, 2011) will make it much less likely that poorly refined models, models inconsistent with data, or implausible data will enter the public databases. Nevertheless, it remains a fact that, irrespective of the cause of the problem, in the case of 3k78 a calculated data set that is also incompatible with the associated coordinate entry has been successfully deposited. The example of 3k78 provides a few additional suggestions that might be useful not just for a posteriori validation during deposition but also, in particular, for the aspiring crystallographer during structure refinement.

  • (i) Diffraction image deposition and archival. The need for preserving diffraction images for scientific reasons was officially suggested by the IUCr in 2008 (Baker et al., 2008) and a standing IUCr committee on data deposition was formed in 2011. Although matters of policy and technical issues remain to be resolved, there is little doubt that image deposition is a timely and beneficial practice for scientific reasons. As an additional side-effect, image deposition allowing reprocessing would immediately resolve any questions of data provenance. Successful redeposition of the correct observed structure factors of entries 1n0q and 1n0r (Mosavi et al., 2002), reprocessed from original diffraction images collected a decade ago, clearly demonstrates the value of proper image data archiving.
  • (ii) Bulk-solvent correction. It would be useful if all refinement programs consistently reported the bulk-solvent B factor and also the bulk-solvent scale factor in the REMARK 3 section of the PDB header. Implausible values could then be readily detected and corrective action taken already during refinement. The bulk-solvent scale factor actually becomes a more useful measure than the bulk-solvent B factor, particularly at the spurious solvent contents refined from calculated structure factors.
  • (iii) Setting the occupancy of protein chain atoms to zero as an indication of positional uncertainty is physically not correct. Accepting high B factors (which are not necessarily a correct physical description of substantial disorder either) causes fewer problems, such as geometry validation programs not including unoccupied atoms in the validation statistics. Isolated backbone zero occupancies are physically not meaningful and should be correspondingly flagged as a serious problem. Side-chain atoms may be absent owing to radiation damage, and in such cases the use of zero occupancies as an indicator could arguably be justified.
  • (iv) The Diederichs plot (§3.2) seems to be a valuable tool in spotting anomalies in diffraction data, particularly as far as the signal-to-noise ratios, i.e. I/σ(I), and the instrumentation error model are concerned. Potential for abuse by fitting calculated error models to the sigmoid distribution does exist.
  • (v) Rσ (§3.1) can serve as a useful a posteriori measure for the plausibility of the error model and signal-to-noise levels in the absence of any merging R values.
  • (vi) A posteriori, the PDB_REDO database can be examined for improbably high discrepancies between the originally reported R values and the conservatively re-refined structure of a PDB entry.
  • (vii) In the absence of image deposition, and as an option requiring no special effort, more refinement data could be deposited. At least the F(calc) set could be submitted in addition to F(obs) to allow easy detection of simple column swapping or other possible deposition mistakes. Even better, the Fourier coefficients for the final electron-density map should be deposited, because this map ultimately represents what the crystallographer was interpreting during model building. EDS can only reconstruct maps from what it is provided with, which presently are only the deposited structure-factor amplitudes and the model coordinates.
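
The column-swap check suggested in point (vii) amounts to computing a conventional R value between two amplitude columns. The following is a minimal sketch under stated assumptions: it uses the standard definition R = Σ||F(obs)| − k|F(calc)|| / Σ|F(obs)| with a single linear scale factor k = Σ|F(obs)| / Σ|F(calc)|, which is only one of several scaling conventions and ignores resolution-dependent scaling and bulk-solvent terms. The function name `r_value` and the example amplitudes are hypothetical.

```python
def r_value(f_obs, f_calc):
    """Conventional R value between two amplitude columns.

    Uses one overall linear scale factor k = sum|F_obs| / sum|F_calc|
    (an assumption of this sketch; refinement programs scale more carefully).
    """
    if not f_obs or len(f_obs) != len(f_calc):
        raise ValueError("amplitude columns must be non-empty and of equal length")
    sum_obs = sum(abs(o) for o in f_obs)
    sum_calc = sum(abs(c) for c in f_calc)
    scale = sum_obs / sum_calc
    residual = sum(abs(abs(o) - scale * abs(c)) for o, c in zip(f_obs, f_calc))
    return residual / sum_obs


# Made-up amplitudes for illustration only.
deposited_obs = [100.0, 200.0, 300.0]
model_calc = [110.0, 190.0, 300.0]
print(round(r_value(deposited_obs, model_calc), 4))
```

If a deposited F(obs) column reproduced the deposited F(calc) column to an R value near zero, a swapped or duplicated column would be the first thing to rule out; the value at which this becomes suspicious is a judgment call that the sketch does not attempt to encode.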

Finally, despite all the diagnostics and validation tools available during model building, refinement, and ultimately upon PDB deposition, one needs to remember that not the PDB but the individual crystallographer bears the final, and sometimes far-reaching (Petsko, 2007), responsibility for the correctness of the deposited model.

Acknowledgments

I wish to anonymously acknowledge several colleagues who provided critical comments and detailed information about the refinement and data analysis programs used in this work. Ed Pozharski extracted raw data from the EDS database. P. Afonine computed bulk-solvent contributions with an improved bulk-solvent parameter implementation in phenix.refine. Reviewers have pointed out a number of didactical and presentational improvements to the manuscript. The REFMAC command script, the input files, and the results for the isotropic B-factor refinement of 3k78 as well as the XPREP data analysis and bulk-solvent data are deposited as supplementary materials. The hyperlink to PDB_REDO of 3k78 is http://www.cmbi.ru.nl/pdb_redo/k7/3k78/index.html, for the EDS report http://eds.bmc.uu.se/cgi-bin/eds/uusfs?pdbCode=3k78, and the electron density can be loaded via the EDS link to the ASTEX Viewer at http://eds.bmc.uu.se/cgi-bin/eds/eds_astex.pl?infile=3k78&centre=A61.

Footnotes

1 Supplementary materials have been deposited in the IUCr electronic archive (Reference: WD5176).

References

  • Adams, P. D. et al. (2010). Acta Cryst. D66, 213–221.
  • Afonine, P. V., Grosse-Kunstleve, R. W. & Adams, P. D. (2005). Acta Cryst. D61, 850–855.
  • Afonine, P. V., Grosse-Kunstleve, R. W., Echols, N., Headd, J. J., Moriarty, N. W., Mustyakimov, M., Terwilliger, T. C., Urzhumtsev, A., Zwart, P. H. & Adams, P. D. (2012). Acta Cryst. D68, 352–367.
  • Baker, E. N., Dauter, Z., Guss, M. & Einspahr, H. (2008). Acta Cryst. D64, 337–338.
  • Brünger, A. T. (1997). Methods Enzymol. 277, 366–396.
  • Brünger, A. T., Adams, P. D., Clore, G. M., DeLano, W. L., Gros, P., Grosse-Kunstleve, R. W., Jiang, J.-S., Kuszewski, J., Nilges, M., Pannu, N. S., Read, R. J., Rice, L. M., Simonson, T. & Warren, G. L. (1998). Acta Cryst. D54, 905–921.
  • Chothia, C. & Lesk, A. M. (1986). EMBO J. 5, 823–826.
  • Davis, I. W., Leaver-Fay, A., Chen, V. B., Block, J. N., Kapral, G. J., Wang, X., Murray, L. W., Arendall, W. B. III, Snoeyink, J., Richardson, J. S. & Richardson, D. C. (2007). Nucleic Acids Res. 35, W375–W383.
  • Diederichs, K. (2010). Acta Cryst. D66, 733–740.
  • Diederichs, K. & Karplus, P. A. (1997). Nature Struct. Biol. 4, 269–275.
  • Einspahr, H. M. & Weiss, M. S. (2012). International Tables for Crystallography, Vol. F, 2nd ed., edited by E. Arnold, D. M. Himmel & M. G. Rossmann, pp. 64–74. Chichester: Wiley.
  • Emsley, P., Lohkamp, B., Scott, W. G. & Cowtan, K. (2010). Acta Cryst. D66, 486–501.
  • Fokine, A. & Urzhumtsev, A. (2002). Acta Cryst. D58, 1387–1392.
  • Gajhede, M., Osmark, P., Poulsen, F. M., Ipsen, H., Larsen, J. N., Joost van Neerven, R. J., Schou, C., Løwenstein, H. & Spangfort, M. D. (1996). Nature Struct. Biol. 3, 1040–1045.
  • Hooft, R. W., Vriend, G., Sander, C. & Abola, E. E. (1996). Nature (London), 381, 272.
  • Janssen, B. J., Read, R. J., Brünger, A. T. & Gros, P. (2007). Nature (London), 448, E1–E2.
  • Jiang, J.-S. & Brünger, A. T. (1994). J. Mol. Biol. 243, 100–115.
  • Joosten, R. P., te Beek, T. A., Krieger, E., Hekkelman, M. L., Hooft, R. W., Schneider, R., Sander, C. & Vriend, G. (2011). Nucleic Acids Res. 39, D411–D419.
  • Kleywegt, G. J. & Brünger, A. T. (1996). Structure, 4, 897–904.
  • Kleywegt, G. J., Harris, M. R., Zou, J., Taylor, T. C., Wählby, A. & Jones, T. A. (2004). Acta Cryst. D60, 2240–2249.
  • Larkin, M. A., Blackshields, G., Brown, N. P., Chenna, R., McGettigan, P. A., McWilliam, H., Valentin, F., Wallace, I. M., Wilm, A., Lopez, R., Thompson, J. D., Gibson, T. J. & Higgins, D. G. (2007). Bioinformatics, 23, 2947–2948.
  • Laskowski, R. A. (2001). Nucleic Acids Res. 29, 221–222.
  • Marković-Housley, Z., Degano, M., Lamba, D., von Roepenack-Lahaye, E., Clemens, S., Susani, M., Ferreira, F., Scheiner, O. & Breiteneder, H. (2003). J. Mol. Biol. 325, 123–133.
  • McRee, D. E. (1999). J. Struct. Biol. 125, 156–165.
  • Merritt, E. A. (2012). Acta Cryst. D68, 468–477.
  • Merritt, E. A. & Bacon, D. J. (1997). Methods Enzymol. 277, 505–524.
  • Mosavi, L. K., Minor, D. L. & Peng, Z. Y. (2002). Proc. Natl Acad. Sci. USA, 99, 16029–16034.
  • Murshudov, G. N., Skubák, P., Lebedev, A. A., Pannu, N. S., Steiner, R. A., Nicholls, R. A., Winn, M. D., Long, F. & Vagin, A. A. (2011). Acta Cryst. D67, 355–367.
  • Murshudov, G. N., Vagin, A. A. & Dodson, E. J. (1997). Acta Cryst. D53, 240–255.
  • Painter, J. & Merritt, E. A. (2006). Acta Cryst. D62, 439–450.
  • Petsko, G. A. (2007). Genome Biol. 8, 103.
  • Potterton, E., Briggs, P., Turkenburg, M. & Dodson, E. (2003). Acta Cryst. D59, 1131–1137.
  • Read, R. J. et al. (2011). Structure, 19, 1395–1412.
  • Rupp, B. (2009). Biomolecular Crystallography: Principles, Practice, and Application to Structural Biology, 1st ed. New York: Garland Science.
  • Rupp, B. (2010). J. Appl. Cryst. 43, 1242–1249.
  • Schneider, T. R. & Sheldrick, G. M. (2002). Acta Cryst. D58, 1772–1779.
  • Sheldrick, G. M. (2008). Acta Cryst. A64, 112–122.
  • Tickle, I. J. (2007). Acta Cryst. D63, 1274–1281.
  • Tickle, I. J., Laskowski, R. A. & Moss, D. S. (1998a). Acta Cryst. D54, 243–252.
  • Tickle, I. J., Laskowski, R. A. & Moss, D. S. (1998b). Acta Cryst. D54, 547–557.
  • Tickle, I. J., Laskowski, R. A. & Moss, D. S. (2000). Acta Cryst. D56, 442–450.
  • Tronrud, D. E. (1996). J. Appl. Cryst. 29, 100–104.
  • Velankar, S. et al. (2010). Nucleic Acids Res. 38, D308–D317.
  • Weiss, M. S. (2001). J. Appl. Cryst. 34, 130–135.
  • Winn, M. D. (2003). J. Synchrotron Rad. 10, 23–25.
  • Winn, M. D., Isupov, M. N. & Murshudov, G. N. (2001). Acta Cryst. D57, 122–133.
  • Winn, M. D. et al. (2011). Acta Cryst. D67, 235–242.
  • Zaborsky, N., Brunner, M., Wallner, M., Himly, M., Karl, T., Schwarzenbacher, R., Ferreira, F. & Achatz, G. (2010). J. Immunol. 184, 725–735.

Articles from Acta Crystallographica Section F: Structural Biology and Crystallization Communications are provided here courtesy of International Union of Crystallography