Physically improbable features in the model of the birch pollen structure Bet v 1d (PDB entry 3k78) are faithfully reproduced in electron density generated with the deposited structure factors, but these structure factors themselves exhibit properties that are characteristic of data calculated from a simple model and are inconsistent with the data and error model obtained through experimental measurements. The refinement of the 3k78 model against these structure factors leads to an isomorphous structure different from the deposited model with an implausibly small R value (0.019). The abnormal refinement is compared with normal refinement of an isomorphous variant structure of Bet v 1l (PDB entry 1fm4). A variety of analytical tools, including the application of Diederichs plots, Rσ plots and bulk-solvent analysis are discussed as promising aids in validation. The examination of the Bet v 1d structure also cautions against the practice of indicating poorly defined protein chain residues through zero occupancies. The recommendation to preserve diffraction images is amplified.
During a routine search of the public PDB_REDO database (Joosten et al., 2011) for a crystal structure model of birch pollen protein Bet v 1, a significant discrepancy was detected between the originally reported R values of PDB entry 3k78 (Bet v 1d; R free = 0.298, R work = 0.273) and those of the conservatively re-refined structure (0.177, 0.126). The re-refined R values are unexpectedly low for a 2.8 Å structure. At the same time, the electron-density map provided by the Uppsala Electron Density Server, EDS (Kleywegt et al., 2004), publicly accessible through the PDBe (Velankar et al., 2010), shows numerous side chains that do not fit the experimental electron density. The EDS service also reported a negative bulk-solvent contribution B factor and a negligibly small bulk-solvent contribution scale factor, which is abnormal for an experimentally determined protein structure (Fokine & Urzhumtsev, 2002). Because the R values calculated by PDB_REDO from the data without refinement (0.265, 0.275; a new R free set was calculated by PDB_REDO) agreed reasonably well with the values reported in the PDB header (0.298, 0.273), an accidental swap of the experimentally observed structure factors F(obs) with the final calculated structure factors F(calc) when generating the deposited structure-factor file can be excluded (in that case the R values reproduced without refinement would also be improbably low). In view of these discrepancies it seemed sensible to re-examine the 3k78 model and the associated deposited diffraction data.
The crystal structure model of birch pollen hypoallergen Bet v 1d (Zaborsky et al., 2010 ), PDB code 3k78, was reported as solved by molecular replacement (MR) from the nearly sequence identical model of the hypoallergenic isoform Bet v 1l (Marković-Housley et al., 2003 ), PDB entry 1fm4. The model structures are isomorphous (P21) with cell constants identical within experimental error. 1fm4 itself was derived by MR from the C2221 structure model of the clinically important inhalant major allergen, Bet v 1a (Gajhede et al., 1996 ; PDB entry 1bv1). A sequence alignment including additional information relevant to the following discussion is provided in Fig. 1 .
The 3k78 model was refined against structure factors with 2.8 Å resolution, and 1fm4 was refined at 2.0 Å. Both structures appear unremarkable (in a technical sense, no insult to biological relevance intended), and the refinement statistics and protocols reported in the PDB entries are appropriate for the resolution. However, on closer inspection, both the model and the structure-factor data of 3k78 exhibit highly unlikely, physically improbable (if not impossible) features. For reference, the results of the 3k78 analysis and re-refinement are compared with those obtained for the isomorphous 1fm4 structure of good and reproducible quality. This comparison may provide useful reference for the aspiring crystallographer and can serve as teaching material.
The two models were originally refined using different programs, CNS 1.0 (Brünger et al., 1998 ), and REFMAC5 (Murshudov et al., 1997 , 2011 ; Winn et al., 2001 ), with different refinement protocols. To aid comparison, a common isotropic B-factor refinement protocol with REFMAC was used in both cases, with parameters adjusted appropriate to each refinement.
The mmCIF structure-factor files and PDB coordinate files were downloaded from the PDBe (Velankar et al., 2010 ). Structure-factor files were converted into mtz files using the programs of the CCP4 suite (Winn, 2003 ; Winn et al., 2011 ) through the CCP4i user interface (Potterton et al., 2003 ). The original R free data sets were kept (except in an additional refinement of 3k78 for graphing purposes discussed in §3). Original maximum-likelihood maps were computed via REFMAC (zero cycles) with automated weighting from original coordinates and structure factors, and in case of 3k78 also the TLS parameters were read in from the deposited coordinate file. The procedures for analysis of the structure-factor data are provided in §3.
The common REFMAC protocol used isotropic individual B factors, a flat bulk-solvent model (Jiang & Brünger, 1994) and riding H atoms. The REFMAC X-ray matrix weight (Murshudov et al., 2011) and B-factor restraint weights were manually adjusted by monitoring the negative cross-validation log-likelihood (−LLfree) minimum at convergence (Tickle, 2007).
The coordinate file of the Bet v 1l search model, 1fm4, reveals no unusual features. The PDB file contains residues 2–160 of the sequence, but the residue numbers in the coordinate file are decremented by 1 compared to the aligned sequences in Fig. 1. As specified in REMARK 480, occupancies for the surface-exposed, terminal side-chain atoms of Lys28, Lys65, Lys80, Lys103, Lys129, Glu131, Gln132, Lys134 and Lys137 are set to zero (§4, Fig. 8). Zero side-chain occupancies usually indicate that the side chains were poorly defined in electron density owing to displacement such as disorder or multiple conformations: instead of accepting the correspondingly high displacement parameters or B factors from the refinement, the occupancies of such atoms are manually set to zero. While still common practice, this is not necessarily the best way to indicate the limited knowledge of their actual position (cf. discussion in §4).
Progress in the methodology of macromolecular refinement has led to steady improvements of the programs, and major efforts to re-refine already deposited PDB models have been undertaken in the PDB_REDO effort (Joosten et al., 2011 ). In this work, the purpose of re-refining the already good 1fm4 structure is not to generate a better model (which ultimately would also require some minor rebuilding) but to provide a benchmark for the applied procedure and an example of the characteristics of a well refined model in order to appreciate the abnormal refinement of 3k78.
1fm4 was already well refined with CNS 1.0 about a decade ago. During the multiple weight adjustment runs REFMAC reached stable convergence after about 30 cycles, with a resolution-typical X-ray matrix weight of 0.2 and restraint weight σs for B-factor main-chain 1–2, 1–3 neighbors and side-chain 1–2, 1–3 neighbors adjusted to 3, 5, 7 and 9 Å², which is reasonable given the empirical values (Tronrud, 1996). The re-refined REFMAC model differs very little from the original model. The overall coordinate r.m.s.d. between models on all atoms is 0.247 Å and on Cα is 0.078 Å, which is well below the historic value for 100% sequence identity expected from the Chothia and Lesk function (Chothia & Lesk, 1986). No significant geometry improvements resulted during re-refinement, and both 1fm4 and its re-refined model are of good quality. No attempts at model rebuilding were made, which probably could close the slightly increased R–R free gap (Tickle et al., 1998b, 2000) compared with the original refinement. A subset of refinement statistics relevant to the structure comparison is compiled in Table 1. Considering the different programs (CNS 1.0 versus REFMAC 5.6), the differences in protocol, as well as different X-ray and restraint weight optimization, this result is quite reassuring and attests to the reproducibility of crystallographic refinement.
The B factors of the previously ‘unoccupied’ side-chain atoms with reset occupancy refined as expected to high B factors, and the inspection of the electron density of these residues in COOT (Emsley et al., 2010 ) shows the corresponding and increasing weakening of density along the side-chain terminals (§4, Fig. 9). Apart from polishing the model ‘ad tedium’ (the term originally being coined by Phil Evans), the well refined 1fm4 model remains fully valid even under different refinement protocols executed nearly a decade later. As stated above, setting the occupancies of side-chain atoms of residues with weak density to zero seems to be unnecessary and could probably be avoided.
Although the 3k78 Bet v 1d model has five backbone torsion angle outliers and numerous severe geometry deviations in the residues with zero occupancy atoms, it is otherwise unremarkable. The coordinate file of 3k78 contains residues 3–159 of the sequence, with the residue numbers matching the sequence alignment in Fig. 1 (i.e. incremented from 1fm4 by 1). However, for the residues containing zero occupancy atoms (Asn29, Lys66, Lys81, Lys104, Lys130, Glu132, Gln133, Lys135 and Lys138) an interesting pattern emerges: the zero occupancies are systematically shifted in atom number to lower values, i.e. it is not the terminal side-chain atoms that are unoccupied, but the zero occupancies move towards the Cβ, and even to the (in the PDB file but not physically) adjacent backbone O atoms of the respective residue, while the terminal atoms of the residues become occupied again (§4, Fig. 8). This pattern is physically highly improbable, but no explanation for this selection of zero occupancy atoms has been reported. These physically improbable model features do, however, lead to some interesting features in the electron density of the original refinement (§4, Fig. 9). The substantial bond distance deviations of most of the residues with zero occupancy atoms are listed in §4, Fig. 10. The remaining deviations can be found in the 3k78 PDB header REMARK 500 records or may be generated with RUN500 from CCP4i.
The model was originally refined using the REFMAC hybrid TLS–isotropic B-factor refinement (Painter & Merritt 2006 ; Murshudov et al., 2011 ) with a single TLS group. Given the 2.8 Å resolution, hybrid TLS refinement would not be unusual or unreasonable, although a rationale for the choice of protocol, parameterization, and analysis of the (small) TLS contributions is absent (Zaborsky et al., 2010 ). Original density maps were calculated from unchanged deposited data and coordinates via a zero cycle refinement run in REFMAC (including the published TLS groups and matrices). The resulting R values (0.304, 0.269) were in reasonable agreement with those reported in the PDB header (0.298, 0.273) and by PDB_REDO (0.265, 0.275).
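For readers less familiar with the statistic, the conventional R value compared here can be sketched as follows. This is a minimal illustration and not REFMAC's implementation: the function name `r_value`, the simple overall scale factor and the amplitudes are all assumptions made for the example. Applied to the working reflections the same expression gives R work, and applied to the cross-validation reflections it gives R free.

```python
def r_value(f_obs, f_calc):
    """Conventional R = sum(|Fo - k*Fc|) / sum(Fo) with one overall scale k."""
    # Assumed scaling for this sketch: k puts sum(k*Fc) on the scale of sum(Fo).
    k = sum(f_obs) / sum(f_calc)
    numerator = sum(abs(fo - k * fc) for fo, fc in zip(f_obs, f_calc))
    return numerator / sum(f_obs)


# Made-up amplitudes, for illustration only.
f_obs = [100.0, 80.0, 60.0, 40.0]
f_calc = [110.0, 70.0, 65.0, 35.0]
print(round(r_value(f_obs, f_calc), 3))
```

If the "observed" amplitudes are themselves calculated from a model, refinement can drive this ratio towards zero, which is the behavior examined in the following paragraphs.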
When the original coordinate file is loaded into COOT (Emsley et al., 2010 ), difference density peaks > 5σ clearly indicate that several residues such as Ile8, Gln37, Glu43, Gly52, Lys56, Glu61, Arg71, Asp110, Glu128, Tyr151 and His155 should be modeled with different conformations (Fig. 2 ), in agreement with the findings of the EDS service (Kleywegt et al., 2004 ) which can be readily accessed via the PDB validation links. While such modeling errors are not unusual, they can easily be corrected. There was no support for the claim of unidentified density in the core of the molecule made in the 3k78 publication (Zaborsky et al., 2010 ). Instead, two chemically plausible water molecules included in the model can be discerned in the electron density. Given the relatively high R values and poor geometry of the side chains with zero occupancy atoms in the published model, rebuilding and re-refinement of 3k78 appeared promising.
The original 3k78 coordinates were used without rebuilding (only the zero occupancies were reset to 0.01) for isotropic B-factor refinement. Initially a resolution-appropriate low X-ray matrix weight of 0.1 was used to keep the geometry tight and repair the originally distorted zero-occupancy residues. The same B-factor restraint weights as for 1fm4 (3/5/7/9 Å²) were used for 30 cycles. The refinement did not reach convergence, but the R values already dropped unexpectedly quickly to 0.131 and 0.068. Inspection of the model geometry showed that the model overall had in fact improved, and maps showed that the misplaced residues Ile8, Gln37, Glu43, Gly52, Lys56, Glu61, Arg71, Asp110, Glu128, Tyr151 and His155 all had assumed correct positions practically identical to those in 1fm4 with good geometry in the remarkably noiseless density map. Nine water molecules from 1fm4 that also occupied density in the 3k78 map were added to the new model by a simple cut and paste.
At that point of the refinement the R values had already reached values typical for atomic resolution structures. Given the negative bulk-solvent B factor of −10 Å² and small bulk-solvent scale factor of 0.026 e⁻ Å⁻³, no sensible bulk-solvent scattering contribution seemed to be present, and the assumption of calculated structure factors was made. As a consequence, (a) the bulk-solvent correction was turned off, (b) no riding H atoms were included, (c) X-ray matrix weights were increased to 0.6, (d) B-factor restraint weights were loosened up to their physically reasonable limit (5/7/9/11 Å²) as established by empirical values (Tronrud, 1996).
The refinement, with its atypical protocol for any experimental protein structure, reached stable convergence at R values of 0.040 and 0.019, with stable geometry and practically the same target r.m.s.d. values as 1fm4 (Table 1 ). The resulting density maps were practically noiseless, with the only remaining significant difference density features in the vicinity of the residues with unoccupied side-chain atoms. According to PROCHECK (Laskowski, 2001 ) or RUN500, the entire model had excellent geometry quality. Tedium was declared and no manual rebuilding of the side chains with unoccupied atoms was attempted.
At this point it was clearly established that (a) the deposited structure factors are calculated structure factors, (b) the resulting re-refined model resembles in most details the mutated search model, (c) that the original model has not, or not properly, been refined against these structure factors (or had been altered from a model essentially similar to the re-refined model and after the structure factors had been calculated).
Given the highly improbable refinement results inconsistent with experimental data at 2.8 Å resolution, a closer examination of the deposited structure-factor data was undertaken.
The data for 1fm4 and for 3k78 were collected in-house on rotating anode sources and recorded on imaging plate detectors, with reported redundancies of 3.3 and 2.1 respectively, and should be comparable. In the absence of unmerged intensity data, a SHELX (Sheldrick, 2008) format data file was generated from the mtz structure-factor amplitudes, read into XPREP (George Sheldrick, Bruker AXS) with the HKL3 format option, and converted to intensities following the basic, error-propagation-based F to I conversion (see e.g. Rupp, 2009, p. 328), i.e. $I = F^2$ and $\sigma(I) = 2F\sigma(F)$.
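The basic conversion referred to here, I = F² with first-order error propagation σ(I) = 2Fσ(F), can be sketched in a few lines. This is only an illustration of the arithmetic, not the XPREP code; the function name `amplitudes_to_intensities` is hypothetical.

```python
def amplitudes_to_intensities(f, sig_f):
    """Convert F, sigma(F) lists to I, sigma(I) by first-order error propagation."""
    intensities = [x * x for x in f]                 # I = F^2
    sig_i = [2.0 * x * s for x, s in zip(f, sig_f)]  # sigma(I) = 2 * F * sigma(F)
    return intensities, sig_i


# Example: F = 10 with sigma(F) = 0.5 gives I = 100 with sigma(I) = 10.
print(amplitudes_to_intensities([10.0], [0.5]))
```

One consequence of this propagation is that I/σ(I) = F/[2σ(F)], so the signal-to-noise ratio on intensities is half of that on amplitudes.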
While the mean I, mean I/σ(I), and Rσ (Schneider & Sheldrick, 2002) values for 1fm4 are typical, the 3k78 data show highly unusual features (Table 2, Supplementary Table 3b¹, Fig. 3). The value of Rσ for validation lies in the fact that it allows an a posteriori R merge-like data-quality indicator to be computed and assessed when unmerged data or images for proper reprocessing are not available, owing to the unfortunate absence of a formal obligation to deposit unmerged intensity data or diffraction images. Rσ tends to be somewhat lower than the corresponding linear R merge. For a discussion of the various merging R values see Diederichs & Karplus (1997); Weiss (2001); Rupp (2009); and Einspahr & Weiss (2012).
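As a hedged sketch of how such a statistic can be tabulated from merged data, the following assumes the definition commonly used in the SHELX programs, Rσ = Σσ(I)/ΣI, evaluated in resolution shells. The function names, shell limits and reflection values are invented for illustration and are not taken from the deposited data.

```python
def r_sigma(intensities, sigmas):
    """R_sigma = sum(sigma(I)) / sum(I) over a set of merged reflections."""
    return sum(sigmas) / sum(intensities)


def r_sigma_by_shell(reflections, d_edges):
    """reflections: (d_spacing, I, sigma_I) tuples; d_edges: descending shell limits in Å."""
    shells = []
    for d_max, d_min in zip(d_edges[:-1], d_edges[1:]):
        selected = [(i, s) for d, i, s in reflections if d_min <= d < d_max]
        if selected:  # skip empty shells
            i_vals, s_vals = zip(*selected)
            shells.append((d_max, d_min, r_sigma(i_vals, s_vals)))
    return shells


# Invented reflections: (d in Å, I, sigma(I)).
reflections = [(10.0, 1000.0, 20.0), (5.0, 500.0, 25.0),
               (3.0, 100.0, 20.0), (2.9, 50.0, 15.0)]
for d_max, d_min, value in r_sigma_by_shell(reflections, [20.0, 4.0, 2.8]):
    print(f"{d_max:5.1f}-{d_min:4.1f} Å  R_sigma = {value:.3f}")
```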
The improbably low Rσ values in 3k78 data are caused by a discrepancy between the intensities and their exceptionally low standard uncertainties. In addition to Poisson-statistics-derived counting errors, multiple other sources of instrumental errors limit the achievable signal to noise ratio, that is, I/σ(I). This has been investigated in detail (Diederichs, 2010 ), and Diederichs notes that even with good crystals the I/σ(I) ratio of the strongest (unmerged) observations is rarely above 30 even in the lowest resolution shell. It is obvious then, that ‘counting statistics are not the limiting factor, as individual reflections may well have many more than 10 000 counts, which would allow I/σ(I) ratios of more than 100 and low-resolution R factors of better than 1%’ (Diederichs, 2010 ). The paper also provides multiple plots of I/σ(I) versus log(I) which show distinct plateaux at around I/σ(I) values of about 20 to 30.
In the absence of original unmerged intensity data and to account for possible effects of redundancy, the 1fm4 data with a reported overall redundancy of 3.3 and the 3k78 data with a redundancy of 2.1 were compared with the aid of Diederichs plots (Fig. 4). 1fm4 shows the behavior expected for a normal data set, while 3k78 shows extremely high I/σ(I) values and completely atypical behavior, and is apparently unlimited by any instrument measurement errors.
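The points of such a plot, I/σ(I) against log(I), are simple to generate, and a crude screen for implausibly precise reflections can be added. The sketch below is an illustration under stated assumptions, not Diederichs' procedure: the ceiling of about 30 for unmerged observations is the figure quoted above, the √(redundancy) allowance for merging is an assumption based on simple averaging statistics, and the function names and data are hypothetical.

```python
import math


def diederichs_points(intensities, sigmas):
    """Return (log10(I), I/sigma(I)) pairs for reflections with positive I and sigma."""
    return [(math.log10(i), i / s)
            for i, s in zip(intensities, sigmas) if i > 0 and s > 0]


def fraction_above_ceiling(points, unmerged_ceiling=30.0, redundancy=1.0):
    """Fraction of reflections whose I/sigma(I) exceeds the assumed merged ceiling."""
    ceiling = unmerged_ceiling * math.sqrt(redundancy)  # assumed sqrt(n) gain on merging
    return sum(1 for _, ratio in points if ratio > ceiling) / len(points)


# Invented data: the third reflection has an implausibly small sigma(I).
points = diederichs_points([100.0, 10000.0, 1.0e6], [10.0, 400.0, 1000.0])
print(fraction_above_ceiling(points, redundancy=2.1))
```

For experimental data the plotted ratios level off at a plateau for strong reflections; a large fraction of reflections far above any reasonable ceiling is the kind of behavior described for 3k78.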
The resulting improbably high signal-to-noise ratios in turn indicate that these standard uncertainties are not based on any experimental variances. Some analysis of a possible origin can be provided by examining a non-logarithmic version of the Diederichs plot. A simple power-law fit of the deposited data reveals that the signal-to-noise ratio I/σ(I) is essentially proportional to the square root of I, which is expected if σ(I) is computed from $I^{1/2}$. An error model closely reproducing the deposited standard uncertainties can be obtained by generating a random error from the absolute inverse cumulative normal distribution around mean zero with a σ of 3.0 via the Excel NORMINV function, and forming the square root of the product of this random error with I. From these I/σ(I) values (Fig. 5), F and σ(F) follow again by basic error propagation, with an atypical σ(F) distribution very similar to the deposited standard uncertainties. Spreadsheets including the calculations and additional graphs are included in the supplementary material.
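The error model described in this paragraph can be imitated outside a spreadsheet. The sketch below is an approximation under stated assumptions: `random.gauss(0, 3.0)` stands in for NORMINV applied to a uniform random number (both draw from a normal distribution with mean 0 and σ 3.0), the intensities are invented and spread log-uniformly, and the function names are hypothetical. It then checks that a log-log fit of I/σ(I) against I has a slope near 0.5, i.e. the square-root dependence noted above.

```python
import math
import random


def simulated_sigma_i(intensity, rng, spread=3.0):
    """sigma(I) = sqrt(|e| * I), with e drawn from a normal distribution N(0, spread)."""
    e = abs(rng.gauss(0.0, spread)) or 1e-12  # guard against an exact zero draw
    return math.sqrt(e * intensity)


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x), i.e. the power-law exponent."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    return sxy / sxx


rng = random.Random(0)  # fixed seed so the sketch is repeatable
intensities = [10 ** rng.uniform(0.0, 6.0) for _ in range(5000)]  # invented data
ratios = [i / simulated_sigma_i(i, rng) for i in intensities]
slope = loglog_slope(intensities, ratios)
print(f"fitted exponent: {slope:.2f}")
```

With this model I/σ(I) = (I/|e|)^(1/2), so the ratio grows without limit as I increases instead of reaching the plateau seen in experimental data.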
Protein crystals contain large fractions of disordered solvent, whose bulk-solvent scattering contributions suppress the low-resolution intensities in an experimentally collected protein diffraction data set. The low-resolution structure factors calculated without bulk-solvent contributions should be significantly higher than the observed structure factors, while at the same time the R values for a refinement of a structure without bulk-solvent correction should be much higher than for a properly bulk-solvent-corrected structure. Representative graphs and a review of bulk-solvent scattering models can be found in Fokine & Urzhumtsev (2002) and in basic textbooks (e.g. Rupp, 2009).
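To make the role of the two bulk-solvent parameters concrete, the sketch below evaluates the resolution-dependent factor of the flat (mask-based) bulk-solvent model in its usual textbook form, F(model) = F(protein) + k_sol·exp(−B_sol·s²/4)·F(mask) with s = 1/d. The exact parameterization inside REFMAC may differ, so this is an assumption. The 'typical' pair (0.35 e⁻ Å⁻³, 46 Å²) is assumed here as roughly representative of experimental structures; the second pair (0.026 e⁻ Å⁻³, −10 Å²) is the one reported for 3k78 in this work.

```python
import math


def solvent_factor(k_sol, b_sol, d_spacing):
    """k_sol * exp(-B_sol * s^2 / 4) with s = 1/d (flat bulk-solvent model, assumed form)."""
    s_squared = 1.0 / (d_spacing * d_spacing)
    return k_sol * math.exp(-b_sol * s_squared / 4.0)


for label, k_sol, b_sol in [("typical (assumed)", 0.35, 46.0),
                            ("3k78 as reported", 0.026, -10.0)]:
    low = solvent_factor(k_sol, b_sol, 20.0)   # low resolution, d = 20 Å
    high = solvent_factor(k_sol, b_sol, 2.8)   # resolution limit, d = 2.8 Å
    print(f"{label:18s} d=20 Å: {low:.3f}   d=2.8 Å: {high:.3f}")
```

For the typical parameters the solvent term is strongest at low resolution and fades towards high resolution, as expected physically. With a negative B factor the factor instead grows with resolution, and with a scale factor near zero the term is negligible everywhere.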
The original cross-validation data set contained only 4.8% of the data (162 reflections), and in the two lowest resolution shells the original 3k78 data contained no or only one cross-validation reflection, respectively. For the overall data range, the uncertainty in R free (Kleywegt & Brünger, 1996; Tickle et al., 1998a) is still acceptable with the low number of cross-validation reflections, but for plotting in shells the R free count is too low to be of practical value. For plotting, new a posteriori R free data (Brünger, 1997) were obtained from new cross-validation data sets with 10% random selection against which the coordinate-perturbed starting model from the first 3k78 isotropic refinement was refined. Even with this suboptimal cross-validation procedure, the isotropic B-factor refinements reproduced the same R values of around 0.04/0.02. The R free versus resolution plots for 3k78 were still noisy but show the same trend as plots from the original cross-validation set, and these data were used in the following analysis.
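The effect of the small test-set size can be illustrated with a frequently quoted rule of thumb, σ(R free) ≈ R free/√N for N cross-validation reflections. This approximation is an assumption of the sketch and is not necessarily the estimate used in the cited work; the function name `rfree_uncertainty` is hypothetical.

```python
import math


def rfree_uncertainty(r_free, n_free):
    """Rule-of-thumb standard uncertainty of R free: R_free / sqrt(N) (assumed form)."""
    if n_free <= 0:
        raise ValueError("at least one cross-validation reflection is required")
    return r_free / math.sqrt(n_free)


overall = rfree_uncertainty(0.298, 162)  # whole deposited test set
single = rfree_uncertainty(0.298, 1)     # a shell holding one test reflection
print(f"overall: {overall:.3f}   single-reflection shell: {single:.3f}")
```

Under this approximation the overall uncertainty is a few percent of R free, whereas a shell with a single test reflection has an uncertainty as large as the value itself, and a shell with none cannot be evaluated at all.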
Structure factors and R values were calculated by REFMAC with and without bulk-solvent correction from the respective re-refined models of 1fm4 and 3k78. The R free versus resolution plots as well as F(calc) and F(obs) versus resolution show expected behavior for 1fm4 consistent with bulk-solvent scattering contributions (Fig. 6 ). The same plots for 3k78 indicate absence of bulk-solvent scattering contributions in the structure factors, consistent with the negative bulk-solvent correction and trivially small bulk-solvent scale factor reported by REFMAC and the EDS report. The R free plot for 3k78 shows the same lack of the strong increase in low resolution R value that would be expected for the refinement in the absence of a bulk-solvent correction and resembles the findings for the fabricated C3b structure (Janssen et al., 2007 ). Given identical F(obs) and F(calc) without bulk-solvent contribution, logarithmic intensity ratio data plots (not shown) again replicate the situation demonstrated for the C3b structure.
For the purpose of validation, bulk-solvent parameters need to be calculated reliably from the original data. The EDS data at present suffer from some divergences, leading to a multimodal distribution probably caused by certain threshold or limit values for the bulk-solvent parameters. A consistent calculation using the flat bulk-solvent contribution (Afonine et al., 2005 ; Afonine 2012 ) model using phenix.refine (Adams et al., 2010 ) provides ~40 000 valid bulk-solvent contribution B-factor–scale-factor pairs. The probability distribution function represented in Fig. 7 is consistent with the earlier published smaller set of data (Fokine & Urzhumtsev, 2002 ). Entry 3k78, the fabricated entry 2hr0 (Janssen et al., 2007 ), and two entries that are now updated (1n0q and 1n0r) but contained erroneously deposited calculated structure factors (Mosavi et al., 2002 ), could be clearly identified as outliers given the distribution in Fig. 7 .
The pattern that the zero occupancy atoms of 3k78 residues (Asn29, Lys66, Lys81, Lys104, Lys130, Glu132, Gln133, Lys135 and Lys138) display seems to be caused by a shift of zero occupancies to atoms with atom numbers decremented consistently by 2. This shift causes the backbone O atoms of the respective residue to become unoccupied, while the terminal atoms of the residues become occupied again (Fig. 8 and Supplementary Table 4a). Such errors could be introduced during the preparation of molecular replacement models. In case of experimental structure factors, the electron-density map will indicate the error by positive difference density peaks in place of the atoms missing in the model. In case of 3k78, however, the atom absences propagate into the electron density.
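The consequence of such an index shift is easy to reproduce for a single residue. The sketch below uses the standard PDB atom order of lysine and assumes, purely for illustration, that the side-chain atoms from Cγ onwards were intended to carry zero occupancy; the actual atom selections in 1fm4 and 3k78 are those shown in Fig. 8 and the supplementary table, and the function name is hypothetical.

```python
LYS_ATOMS = ["N", "CA", "C", "O", "CB", "CG", "CD", "CE", "NZ"]  # standard PDB order


def shift_zero_occupancies(occupancies, shift=-2):
    """Move every zero-occupancy flag by `shift` positions in atom order."""
    shifted = [1.0] * len(occupancies)
    for index, occupancy in enumerate(occupancies):
        target = index + shift
        if occupancy == 0.0 and 0 <= target < len(occupancies):
            shifted[target] = 0.0
    return shifted


# Hypothetical intended selection: CG, CD, CE and NZ unoccupied.
intended = [1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
shifted = shift_zero_occupancies(intended)
zero_atoms = [atom for atom, occ in zip(LYS_ATOMS, shifted) if occ == 0.0]
print(zero_atoms)
```

In this toy case the backbone O and the Cβ atom lose their occupancy while the terminal Cε and Nζ atoms regain it, which is the pattern described in the text.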
It is quite unexpected that in the original 3k78 maps (§2.4) neither 2mFo − DFc density down to near-noise levels below 0.5σ nor difference density in the mFo − DFc maps is visible for the unoccupied atoms, including the backbone O atoms in Lys130, Glu132 and Gln133 (Fig. 9). The weak difference density for Lys135 probably results from incorrect placement. Given the reported typical main-chain B factors (~30–35 Å²) of the adjacent, covalently connected backbone atoms, this behavior is very unusual and improbable. Following the lysine side chains towards the solvent, there is again clear density for the solvent-exposed Cε and Nζ atoms of the lysine residues, but they are untethered by hydrogen bonds or other contacts. These observations are characteristic of data calculated from a model with zero occupancy atoms.
Setting occupancies of protein atoms that are poorly defined or absent in electron density to zero has very little effect on the overall model quality or refinement itself: zero occupancy and a very high B factor lead to zero and negligible scattering contributions, respectively, and either will have an insignificant effect on the rest of the model. Inspection of the electron density of the side-chain atoms of residues with reset occupancy in the re-refined 1fm4 model illustrates the fact that such atoms simply refine to high B factors and display correspondingly weak electron density (Fig. 9). Nevertheless, it should be kept in mind that for many cases of local disorder, large isotropic displacement (B) factors are not a physically correct description either (Merritt, 2012). A number of other inconsistencies and problems can, however, be introduced by zero occupancy atoms in the chain of a protein model.
The findings surfacing during model refinement in §2 and amplified during the structure factor analysis in §3 and the feature propagation discussed in §4 provide consistent and very convincing evidence that (a) the structure-factor data deposited for 3k78 are calculated structure factors, (b) the resulting re-refined model resembles in most details the mutated search model, (c) that the original model has not, or not properly, been refined against these structure factors (or had been altered from a model essentially similar to the re-refined model and after the structure factors had been calculated). Being not refined against the deposited structure factors, the 3k78 model at present at least lacks experimental basis. The findings leading to the above conclusions are summarized below.
Each of these points alone is reason for concern, and when combined and evaluated against prior expectations, they leave no doubt that model and data of 3k78 are incompatible and that the deposited structure factors are not based on actual experiments, and their standard uncertainties are not based on experimental errors.
Following basic scientific epistemology, strong and convincing evidence would have to be provided to overcome these doubts (Rupp, 2010). In case of an error during deposition, this should be trivial to achieve, and database integrity could be easily restored. At the least, an experimental data set which refines to the deposited structure, or unmerged intensity data reprocessed from the original images, should be supplied. Most convincingly and irrefutably, the presentation of actual diffraction images which produce data representing the deposited model would establish the facts.
Considerable efforts by the PDB validation task force (Read, 2011 ) will make it much less likely that poorly refined models, models inconsistent with data, or implausible data will enter the public databases. Nevertheless, it remains a fact that – irrespective of the cause of the problem – in the case of 3k78 a calculated data set also incompatible with the associated coordinate entry has been successfully deposited. The example of 3k78 provides a few additional suggestions that might be useful not just for a posteriori validation during deposition but also particularly for the aspiring crystallographer during structure refinement.
Finally, despite all the diagnostics and validation tools available during model building, refinement, and ultimately upon PDB deposition, one needs to recollect that not the PDB but the individual crystallographer bears the final – and sometimes far reaching (Petsko, 2007 ) – responsibility for the correctness of the deposited model.
I wish to anonymously acknowledge several colleagues who provided critical comments and detailed information about the refinement and data analysis programs used in this work. Ed Pozharski extracted raw data from the EDS database. P. Afonine computed bulk-solvent contributions with an improved bulk-solvent parameter implementation in phenix.refine. Reviewers have pointed out a number of didactical and presentational improvements to the manuscript. The REFMAC command script, the input files, and the results for the isotropic B-factor refinement of 3k78 as well as the XPREP data analysis and bulk-solvent data are deposited as supplementary materials. The hyperlink to PDB_REDO of 3k78 is http://www.cmbi.ru.nl/pdb_redo/k7/3k78/index.html, for the EDS report http://eds.bmc.uu.se/cgi-bin/eds/uusfs?pdbCode=3k78, and the electron density can be loaded via the EDS link to the ASTEX Viewer at http://eds.bmc.uu.se/cgi-bin/eds/eds_astex.pl?infile=3k78&centre=A61.
¹ Supplementary materials have been deposited in the IUCr electronic archive (Reference: WD5176).