We conducted a statistical analysis of intermolecular contacts in the crystal structures of 821 unambiguously monomeric proteins determined to a resolution of 2.5 Å or better. Our primary objective was to evaluate whether protein crystal contacts are isotropic (i.e. stochastic) in nature, as concluded by earlier studies (Janin & Rodier, 1995; Carugo & Argos, 1997), or anisotropic, as suggested by more recent computational and experimental studies (Derewenda & Vekilov, 2006; Pellicane et al.). Isotropic, or random, interactions imply that the surfaces involved are not distinguishable from randomly selected solvent-exposed surfaces of the protein in terms of amino-acid composition. Assuming that the probability of an amino acid being involved in an interaction is directly proportional to its ASA, one can derive a contact-propensity scale using the logarithm of the ratio of the amino acid’s area-based frequency in the interface (defined as the buried ASA) to its area-based frequency in the total molecular ASA (Bahadur et al.). Such an area-based composition scale was used to show that large crystal contacts, i.e. those generated by twofold rotational symmetry, are slightly enriched in aliphatic and aromatic residues (LIVMFYW) and somewhat depleted of lysine and acidic residues (KED) (Bahadur et al.), although the analysis was not extended to all crystal contacts. However, the area-based approach is based on the premise that the solvent-exposed protein surface, i.e. the surface accessible to a water probe, is equivalent to the surface capable of making contact with another molecule. This is not completely true because proteins are characterized by irregular surface landscapes which consist on average of 36% knobs (or protrusions) and 62% clefts (or crevices), with the knobs containing residues that are 30% more likely to be involved in contacts than those in clefts (Albou et al.). Our results also clearly show (Fig. 2) that the contact frequency for residues with small accessible surface area is lower per Å² than for more exposed residues.
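As an illustration only, the area-based propensity scale discussed above can be sketched in a few lines of Python. The function name `area_based_propensity`, the input layout (residue type, total ASA, buried ASA) and the toy numbers are assumptions made for this sketch; the source does not state the base of the logarithm, so the natural logarithm is used here.

```python
import math
from collections import defaultdict

def area_based_propensity(residues):
    """Return {residue type: ln(f_interface / f_surface)}.

    `residues` is an iterable of (aa, total_asa, buried_asa) tuples in Å^2.
    Hypothetical helper: the natural log is an assumption, since the text
    only says "the logarithm of the ratio".
    """
    surface = defaultdict(float)    # summed ASA per residue type
    interface = defaultdict(float)  # summed buried ASA per residue type
    for aa, total_asa, buried_asa in residues:
        surface[aa] += total_asa
        interface[aa] += buried_asa

    total_surface = sum(surface.values())
    total_interface = sum(interface.values())
    if total_surface == 0 or total_interface == 0:
        return {}

    scale = {}
    for aa in surface:
        f_surface = surface[aa] / total_surface        # area-based frequency on the surface
        f_interface = interface[aa] / total_interface  # area-based frequency in the interface
        if f_surface > 0 and f_interface > 0:          # skip types with an undefined log
            scale[aa] = math.log(f_interface / f_surface)
    return scale

# Invented toy data, not taken from the study.
toy = [("LEU", 40.0, 20.0), ("LYS", 120.0, 20.0), ("GLY", 40.0, 10.0)]
scale = area_based_propensity(toy)
```

A positive value marks a residue type that is over-represented in the interface relative to its share of the total ASA, and a negative value marks one that is depleted. The calculation treats every Å² of accessible surface as equally contact-capable, which is exactly the premise questioned in the paragraph above.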
Notwithstanding, the area-based calculations would still be correct if the amino-acid compositions of knobs and clefts were the same. However, we show explicitly that this is not the case: amino acids with small average ASA (i.e. those in clefts) are predominantly small and hydrophobic, while those that are exposed (i.e. within knobs) are large charged residues (Fig. 3). Similar results have recently been reported using alpha-shape representations (Albou et al.).
To conclude, the area-based approach underestimates the propensity of small hydrophobic residues for inclusion in crystal contacts. For any given range of ASA, smaller and hydrophobic amino acids actually have a higher relative propensity for involvement in crystal contacts than large charged residues.
In principle, the discrepancy between the solvent-accessible surface and the contact-capable surface can be resolved in area-based calculations by using a more stringent lower ASA cutoff, as high as 30% (Negi & Braun, 2007). In this way, many of the residues/atoms in clefts are excluded from the protein surface. Such an approach selects an effective contact-capable surface that is more hydrophilic than the water-accessible surface, but it also generates ambiguity when residues that are not classified as exposed are actually physically incorporated into contacts.
It should be noted that the area-based approach is still applicable if either the selection pressure is very strong (as is the case in evolved biological interfaces) or if frequencies are compared between states which have been selected using the same criteria (such as a comparison of the buried surface area in biological and crystal interfaces). Given that stable biological interfaces developed in response to evolutionary pressure, while crystal contacts are formed by surfaces with no functional significance, the magnitudes of selection are dramatically different. Consequently, area-based composition analysis allows distinction between them and the observed quantitative differences are valid (Zhu et al.). However, this methodology is not sensitive enough for comparisons of crystal contacts and random surface patches, where the differences are more subtle.
Here, we propose an alternative approach, based on logistic regression, which has significant advantages over the area-based methodology. It does not require the choice of an arbitrary threshold of solvent ASA to define the contact-capable surface, it does not assume a linear dependency of contact frequency on ASA and it allows us to rationalize the propensities in terms of physicochemical properties such as charge, side-chain entropy and polarity.
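To make the idea concrete, here is a minimal sketch of a logistic regression in which the binary outcome is whether a residue takes part in a crystal contact. The predictors (a scaled ASA and an indicator for lysine), the gradient-descent fitting routine and the toy data are all assumptions chosen for illustration; this section does not specify the actual predictors or fitting procedure used in the study.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.5, epochs=3000):
    """Fit weights by batch gradient descent on the mean log-loss.

    Returns a list of weights, one per feature, with the intercept last.
    Hypothetical helper written for this sketch only.
    """
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            z = w[-1] + sum(wj * xj for wj, xj in zip(w, xi))
            err = sigmoid(z) - yi          # derivative of the log-loss w.r.t. z
            for j in range(k):
                grad[j] += err * xi[j]
            grad[-1] += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

def predict(w, x):
    # Predicted probability that a residue with features x is in a contact.
    return sigmoid(w[-1] + sum(wj * xj for wj, xj in zip(w, x)))

# Invented toy data: features are [ASA in units of 100 Å^2, is_lysine].
asa_values = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2]
X = [[a, 0.0] for a in asa_values] + [[a, 1.0] for a in asa_values]
y = [0, 0, 1, 1, 1, 1] + [0, 0, 0, 0, 1, 1]   # 1 = residue is in a contact

weights = fit_logistic(X, y)
w_asa, w_lys, intercept = weights
```

In a model of this kind the log-odds of contact, not the contact frequency itself, are linear in the predictors, so no ASA threshold has to be chosen and the residue-type coefficient compares amino acids at equal exposure. In practice one would use an established statistics package and one coefficient per amino-acid type rather than this hand-written fit.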
Firstly, we derive a crystal contact-propensity scale for all 20 amino acids and we show that Gly and small hydrophobic residues top the list, with Glu and Lys having the lowest rank. Thus, crystal contacts are systematically depleted of residues with high side-chain entropy. This observation is consistent with the notion of anisotropic nonrandom protein–protein interactions in solution during crystallization, i.e. patch–patch interactions (Pellicane et al.). Furthermore, we also show that side-chain entropy rather than polarity is the key determining factor in these interactions. However, polarity appears to play a dominant role in the actual packing of larger contacts, so that apolar amino acids are systematically located towards the core of the contact.
Our results lend strong theoretical support to the concept of rational crystal engineering and specifically the surface-entropy reduction (SER) strategy (Derewenda, 2004; Derewenda & Vekilov, 2006). The approach was originally suggested based on the simple theoretical premise that loss of conformational degrees of freedom by large side chains incorporated into crystal contacts constitutes a potentially critical impediment to protein crystallization (Longenecker et al.; Mateja et al.). Subsequently, we designed and implemented a server that identifies suitable mutation sites based on amino-acid sequence information alone (Goldschmidt et al.). Using the SER strategy, a significant number of proteins have been successfully crystallized and their structures solved, both in our group (Derewenda et al.; Devedjiev et al.) and in other laboratories (Levinson et al.; Yip et al.). The conclusions of this paper may help to further refine the surface-engineering strategies.
After this paper was submitted, a study analyzing the physical properties that control protein crystallization and based on large-scale experimental data was published by the Northeast Structural Genomics Consortium (Price et al.). The authors analyzed a sequence database of 679 proteins, of which 157 were crystallized, and used logistic regression to identify protein-sequence features that affect the binary outcome of the crystallization effort. The study concluded that surface entropy dominates all other effects, so that the fractional content of amino acids in the target sequence can be used as predictive parameters for crystallization. The approach used in that work is different from ours in that it attempts to derive the propensity of proteins to form well diffracting crystals from global sequence features by comparison of crystallizable and noncrystallizable proteins, whereas our analysis focuses exclusively on surface properties in proteins of known structure. Nevertheless, the fact that different computational approaches lead to virtually identical conclusions is most encouraging.