Many proteins are oligomeric under physiological conditions due to the association of identical subunits. Homooligomerization may be part of allosteric regulation [1], or contribute to conformational and thermal stability [2] and to higher binding affinity for other molecules. Homodimeric proteins have been found to interact with a larger number of other proteins than monomeric proteins do [3]. Multimerization is particularly common in enzymes, transcription factors, and signal transduction proteins [4]. The major driving forces for protein multimerization are shape and charge complementarity between the associating subunits, brought about by a combination of hydrophobic and polar interactions [5, 6]. Some proteins oligomerize by domain swapping, in which a segment of one monomer is replaced by the identical segment from the other subunit and vice versa [7, 8]. Many proteins have different predominant oligomeric states under different physiologically relevant conditions, and these states may have important functional differences. Homodimerization may arise in evolution because identical interfaces have a stronger tendency to self-associate than dissimilar interfaces do. Heterodimers of proteins in the same superfamily may then evolve from such homodimers [9].
Some human diseases are caused by inherited missense mutations that act in part by altering oligomeric association. For instance, infantile cortical hyperostosis (Caffey disease) is a genetic disorder caused by a missense mutation in exon 41 of the gene encoding the α1(I) chain of type I collagen, producing abnormal disulfide-bonded dimeric α1(I) chains [10]. Myofibrillar myopathy (MFM) is a human disease of muscle weakening; one causative mutation lies in the region of the filamin C gene that encodes the dimerization domain, where it disrupts the secondary structure and prevents proper dimerization [11]. Cu,Zn superoxide dismutase (CuZnSOD) is an efficient enzyme that catalyzes the conversion of superoxide to oxygen and hydrogen peroxide [12]. Familial amyotrophic lateral sclerosis (Lou Gehrig’s disease) is associated with mutations in CuZnSOD [13, 14]. Some of these mutations destabilize the SOD dimer, causing abnormal aggregation that may be lethal to cells.
Experimental means for determining the size of an oligomer include analytical ultracentrifugation [15] and gel filtration [16]. These methods separate proteins and protein complexes based on their size or mass, from which the oligomerization state may be inferred. However, knowing the size of a protein oligomer does not provide information on the interacting surfaces within the oligomer or on its overall structure. Separation of oligomers combined with cross-linking and mass spectrometry can be used to determine protein segments that may lie in the binding interfaces between monomers [17]. Fluorescence resonance energy transfer (FRET) experiments can identify donor-acceptor pairs of residues that must be near each other in a protein complex, and can thereby indicate which of several dimers in an X-ray crystal structure is likely to be physiologically relevant [18]. NMR can also provide detailed information on the structure and dynamics of protein oligomers in solution. However, the size of proteins that can be studied easily by NMR is limited.
For most proteins, information on the size and, in particular, the structure of oligomeric assemblies comes from X-ray crystallography. For many proteins the size and actual structure of multimers are controversial or unknown, and are based only on what is observed in crystal structures, sometimes even a single crystal structure. Both the Protein Data Bank (PDB) [19] and the European Bioinformatics Institute (EBI) [20, 21] provide information on “biological units” or assemblies, which are the assumed biologically relevant oligomeric structures found within crystals. The PDB’s biological units are based on what the authors of structures themselves believe to be the biologically relevant structure, while those of the recently developed PISA server [21] from the EBI are based on the analysis of interfaces and the predicted stability of complexes observed in single crystal structures. PQS (the Protein Quaternary Structure server) [20] contains both manual and automated identifications of biological units (E. Krissinel, personal communication). The PDB and PQS usually have one biological unit size for each PDB entry, while PISA contains multiple oligomeric structures of different sizes for many PDB entries, based on chemical thermodynamics calculations of complex stability. The recently developed PIQSI database provides manually annotated sizes of biological units from the literature for PDB entries [22].
Many databases and analyses have used PDB and PQS biological units to examine the interfaces between protein domains. For instance, PIBASE [23] provides a list of structures for a query of two SCOP superfamily or family designations [24], and provides access to coordinates for each pairwise interaction. Interactions in PIBASE are derived from two sources: the author-approved files provided by the PDB (e.g., pdb1ylv.ent), which generally contain the asymmetric unit of the crystal structure and many non-physiological interactions, and hypothetical biological units as proposed by the authors of PQS (e.g., 1ylv.mmol). The emphasis is on characterizing pairwise interfaces in terms of surface area and polar/nonpolar content. PSIMAP/PSIBASE [25] also performs binary searches for two SCOP-defined domains and finds all structures containing interactions between the query domains. Other databases such as SNAPPI-DB [26] and iPFAM [28] also use SCOP, PFAM, PDB and PQS to define atomic interactions among protein domains. Databases of this sort are used for statistical analysis of residue contacts across interfaces in order to develop methods for predicting or scoring interfaces [5, 29, 30, 31, 32, 33]. However, if the data in the PDB and PQS are incorrect, these analyses are called into question, in both their training data and their testing data. Homology modeling based on known multimer structures also depends on accurate multimer structures, and incorrect biological inferences can be made when the assumed quaternary structure of the template is incorrect.
We recently compared the biological units in the PDB and PQS for all crystallographic entries in the PDB, and found that they agree on only 83% of entries [34]. The PDB has a higher tendency than PQS to have biological units that are identical with the asymmetric unit of the same structure, perhaps indicating that many authors make the unwarranted assumption that the asymmetric and biological units are the same. We also found that the PDB and PQS have inconsistent assignments of biological units for proteins in multiple PDB entries that all have the same crystal form. This occurs in the PDB for 12% of entries and in PQS for about 18% of entries. The PDB’s assignments may be more consistent merely because a single research group may solve multiple structures within the same crystal form and assign similar biological units to all of them. When the PDB and PQS agree on the size of a multimer for a single PDB entry, they disagree on the orientation and interface between interacting monomers in less than 2% of cases. The PDB and PQS may have different interfaces across a family of closely related or identical proteins [34].
A number of studies have attempted to differentiate between biological and crystallization-induced contacts. Ponstingl et al. compiled a set of 96 monomers and 76 homodimers in the PDB by reference to the published literature [35], and compared the ability of buried surface area and pair interaction scores to predict biological contacts in crystals. This dataset has subsequently been used by others as a benchmark for methods that attempt to determine biological assemblies from single crystals [21, 22]. Bahadur et al. [36, 37] assembled a set of interfaces consisting of 70 heterodimeric structures, 122 homodimeric structures, and 188 crystal packing interfaces with surface area greater than 800 Å², and examined the physical properties of the different interface classes. Shoemaker et al. [38] looked for common interfaces in different crystals of identical and homologous proteins, so-called “conserved binding modes,” in order to identify likely biologically relevant structures.
In this paper, we thoroughly examine the interfaces in crystals of single proteins and their homologues. We attempt to answer several questions. First, when are two crystals of the same or similar proteins really the same crystal form, and when are they not? Surprisingly, we find that PDB entries with the same space group, asymmetric unit size, and quite similar unit cell dimensions are occasionally different crystal forms, as judged by the interfaces and monomer-monomer orientations that exist within the crystal lattice. Conversely, two crystals in different space groups may be quite similar in terms of all or nearly all of the interfaces within each crystal. This occurs when one crystal contains a subset of the symmetry operators of the other and a larger asymmetric unit, and also when one is a small distortion of the other such that the space group is different. This analysis helps to sort PDB entries within a family into truly different crystal forms.
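The idea of judging crystal-form similarity by shared interfaces, rather than by lattice parameters, can be illustrated with a small sketch. It is not the procedure used in this study: it assumes that interfaces have already been matched across crystals and given common labels, and it uses a plain Jaccard index as a stand-in similarity measure. The function name `interface_jaccard` and the labels `I1` to `I5` are hypothetical.

```python
def interface_jaccard(interfaces_a, interfaces_b):
    """Jaccard index of two collections of interface labels.

    Assumes equivalent interfaces in the two crystals already carry the
    same label; how that matching is done is outside this sketch.
    """
    a, b = set(interfaces_a), set(interfaces_b)
    if not a and not b:
        return 0.0  # no interfaces at all: nothing to compare
    return len(a & b) / len(a | b)


# Hypothetical crystals that share three of their four interfaces each.
crystal_1 = ["I1", "I2", "I3", "I4"]
crystal_2 = ["I1", "I2", "I3", "I5"]

similarity = interface_jaccard(crystal_1, crystal_2)   # 3 shared / 5 distinct
identical = interface_jaccard(crystal_1, crystal_1)    # same interface set
```

A value near 1 would correspond to crystals that share all or nearly all interfaces, whatever their nominal space groups. Any cutoff applied to such a score would be a further assumption.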
Second, we examine the hypothesis used by many crystallographers to infer biological interactions: observation of the same interface in different crystal forms of a protein (or of members of the same family) suggests that the interface may be biologically relevant. We compare all interfaces in the available crystal forms in each family and determine those shared by two or more crystal forms. We determine the number of crystal forms containing the interface, M, and compare it to the total number of different crystal forms in the same family, N. We then evaluate the usefulness of these numbers against prior benchmarks on oligomeric interactions as well as against NMR structures. When M is greater than 4 or 5, and especially when M is close to or equal to N, the observed interfaces are likely to be part of biologically relevant assemblies. We find 36 families in which all N out of N crystal forms contain a particular interface, where N ≥ 10. These interfaces are very likely to be physiological. We also find that monomers in a benchmark set comprising both the Ponstingl and Bahadur sets tend to have M ≪ N.
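The counting of M and N described above can be sketched as follows. This is a minimal illustration rather than the pipeline used in this study: the function name `count_crystal_forms`, the crystal-form labels (`CF1` to `CF4`), and the interface labels are hypothetical, and the sketch assumes that PDB entries have already been grouped into crystal forms and that equivalent interfaces have been given a common label.

```python
from collections import defaultdict


def count_crystal_forms(observations):
    """Return (N, {interface: M}) for one protein family.

    observations: iterable of (crystal_form, interface) pairs, where
    interface is None for a crystal form with no shared interface.
    N is the number of distinct crystal forms; M is the number of
    crystal forms in which a given interface is observed.
    """
    forms = set()
    forms_with_interface = defaultdict(set)
    for crystal_form, interface in observations:
        forms.add(crystal_form)
        if interface is not None:
            forms_with_interface[interface].add(crystal_form)
    m_counts = {iface: len(cfs) for iface, cfs in forms_with_interface.items()}
    return len(forms), m_counts


# Hypothetical family with four crystal forms.
observations = [
    ("CF1", "A"), ("CF1", "B"),
    ("CF2", "A"),
    ("CF3", "A"), ("CF3", "C"),
    ("CF4", None),
]
n_total, m_counts = count_crystal_forms(observations)
```

In this toy family, interface `A` is seen in three of four crystal forms, while `B` and `C` are each seen in only one.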
Third, we examine the usefulness of evolutionary information in evaluating interfaces that appear in more than one crystal form. Different crystal forms of identical proteins often contain common interfaces, but these usually appear in only 2 or 3 such forms and are not shared by homologous proteins. That is, they form only under the non-physiological conditions of crystallization, including high protein concentration, unusual pH, and the presence of non-physiological ligands. This has previously been observed for T4 lysozyme, which has been studied in many crystal forms [39]. When an interface is shared in two different crystal forms by divergent proteins, the interface is very likely to be biologically important. We also find that in large families, some interfaces are restricted to one branch of the family, indicating the evolution of an interface in one branch and/or its loss in another. This highlights the importance of solving structures of related proteins.
Finally, we compare interfaces common to multiple crystal forms with the annotations found in the PDB, PQS, and PISA. As the number of crystal forms that contain a given interface increases, it becomes increasingly likely that the available annotations agree that the interface is part of a biologically relevant assembly. PISA is found to be the most reliable in identifying interfaces for which the evidence, in terms of the number of crystal forms containing the interface, seems very high. PISA is therefore the best source of biological assembly information when only one or two crystal forms are currently available.
This study is closest to the work of Shoemaker et al. [38] but with some important differences. First, we examine the interfaces across PDB entries of homologous proteins to determine whether or not they are the same crystal form, regardless of similarities and differences in space group, asymmetric unit size, and unit cell dimensions and angles. Shoemaker et al. separate crystal forms only by space group and/or differences in cell dimensions greater than 2%. We find that this is inadequate to classify crystals as similar or different. Second, we evaluate the usefulness of the number of different crystal forms and of the evolutionary relationships of shared interfaces, neither of which is considered by Shoemaker et al. Finally, we provide in supplemental data coordinate files of the shared interfaces that may be useful for further research as training or testing data.
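For contrast, the lattice-only criterion attributed above to Shoemaker et al. (same space group and cell dimensions within 2%) might look like the sketch below. The function name `same_form_by_lattice`, the definition of the relative difference (taken against the larger of the two cell lengths), and the example cells are assumptions made for illustration, not details from that work.

```python
def same_form_by_lattice(space_group_a, cell_a, space_group_b, cell_b,
                         tolerance=0.02):
    """Lattice-only test for 'same crystal form'.

    cell_a, cell_b: (a, b, c) unit cell lengths in angstroms.
    Returns True when the space groups match and every cell length
    differs by at most `tolerance` (relative to the larger value).
    """
    if space_group_a != space_group_b:
        return False
    return all(abs(x - y) / max(x, y) <= tolerance
               for x, y in zip(cell_a, cell_b))


# Hypothetical cells: the second is within 2% of the first, the third is not.
reference = ("P 21 21 21", (50.0, 60.0, 70.0))
close = ("P 21 21 21", (50.5, 60.4, 70.9))
distant = ("P 21 21 21", (50.0, 60.0, 73.0))

close_match = same_form_by_lattice(*reference, *close)
distant_match = same_form_by_lattice(*reference, *distant)
other_group = same_form_by_lattice(*reference, "C 1 2 1", (50.0, 60.0, 70.0))
```

A test of this kind never inspects the interfaces themselves, which is why it can both merge crystals that pack differently and split crystals that pack alike.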