We identified two major, high-level challenges in achieving our ambition to capture crystallization data: how to describe our attempts to produce protein crystals; and how to measure the outcomes of these attempts.
A description of our trials in some unambiguous, reproducible and universally understood manner in principle requires nothing more than a set of standards and a way of ensuring compliance with them. Yet even this purely logistical, scientifically non-controversial task is very challenging, as there are currently no defined nomenclatures for describing a crystallization experiment, neither for the chemicals used nor for the physical parameters, let alone for the protein sample itself. Take, for example, the non-protein component of a ‘standard’ (vapour diffusion or microbatch) crystallization experiment; this has been called ‘the precipitant’, ‘the reservoir’, ‘the cocktail’, ‘the condition’, ‘the well solution’ or (for the hopeful) ‘the crystallant’, amongst others. It has been reported that in the free-form data field for crystallization in the PDB (REMARK 280), the chemical ‘ammonium sulfate’ is represented by approximately 100 different strings (Peat et al.).
The problem of capturing outcomes objectively appears more challenging still, as it requires scientific effort rather than merely establishing conventions. At least the push into high-throughput crystallization means that many recent experiments do have a measured outcome, in the form of one or more images associated with the experiment. However, the image still has to be translated objectively into a form that can be used for quantitative analysis. This process will have to be automated to obtain not only complete but also consistent results: manual scoring of the same experiments is only about 70% consistent when using a seven-class system (Walker et al.).

The simplest outcome is the binary crystal/no-crystal classification, which can provide meaningful information. At the other extreme is the classification of outcomes related to protein solubility and the phase diagram, e.g. crystal, clear, precipitate, phase separation, skin etc. (Luft, Wolfley et al.), which provides significantly more information. Such a detailed classification scheme requires correspondingly longer analysis times, and the data will be less accurate than with a simple binary classification: it becomes increasingly difficult to differentiate between similar-looking outcomes as the granularity of the classifications becomes finer. Research in one of our centres (at HWI) has led to an automated classifier which is now comparable to humans at identifying single categories such as clear or precipitate, and also combinations of phase, skin, precipitate etc., but is not as precise or accurate when identifying crystals (Kotseruba et al.). Efforts to automate the classification of crystallization experimental outcomes have been ongoing for over a decade (Pan et al.; Cumbaa & Jurisica, 2005; Walker et al.). In designing our ontology we must keep in mind the reliability of the measurement and its associated data. We have to capture not only the outcome but how that outcome was determined. In this manner we can account for different visual mechanisms (multiple types of microscope and magnification) and classification schemes. These will be aided by using other parts of the light spectrum, e.g. ultraviolet, and even in situ methods.
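The relationship between the fine-grained and binary schemes can be made concrete: a multi-tier vocabulary lets detailed records be collapsed downward so that they remain comparable with coarser ones. The class names below loosely follow the scheme quoted in the text; the mapping itself is illustrative:

```python
from enum import Enum

# Illustrative outcome classes, loosely following the fine-grained scheme
# described in the text (crystal, clear, precipitate, phase separation, skin).
class Outcome(Enum):
    CRYSTAL = "crystal"
    CLEAR = "clear"
    PRECIPITATE = "precipitate"
    PHASE_SEPARATION = "phase separation"
    SKIN = "skin"

def to_binary(outcome: Outcome) -> bool:
    """Collapse a fine-grained class onto the binary crystal/no-crystal scheme."""
    return outcome is Outcome.CRYSTAL
```

The reverse mapping is of course lossy, which is exactly why the finer tiers are worth recording when the analysis time can be afforded.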
One of the results of the meeting was a commitment to develop a vocabulary to capture the complete crystallization data available to us. This vocabulary has several requirements. Overall, it must be able to capture any crystallization experiment: not only those from the well defined protocols of large-scale crystallization centres, but also anything set up in less industrialized laboratories where scientists focus on individual projects. Regardless of where a crystallization experiment is performed, the information that needs to be captured is the same: we want to know about the sample, the experiment and the outcome, essentially the information captured in any good laboratory notebook. However, it is useful to consider this information in the context of what makes a difference to the experiment, as seemingly small changes in a protocol can often have a dramatic impact on the experiment’s outcome and reproducibility. Let us consider each of these three categories in detail.
The sample can be described by a name and the sequence of the protein, or proteins, that comprise it. Important protein properties may include sequence, molecular weight and isoelectric point (Slabinski et al.). The sample has other properties associated with it: even a minimal sample consisting of only one protein in water has an associated concentration, unit of concentration and a history (e.g. ‘snap frozen and thawed just prior to setup’). Preparation details of the sample may also be important, e.g. ‘retention time on a column’, ‘purity’ and ‘polydispersity’.
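A minimal sample record of this kind might be sketched as follows; the field names are illustrative, not part of any published schema:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a minimal sample record, covering the properties
# listed in the text: identity, concentration, history and preparation details.
@dataclass
class ProteinSample:
    name: str
    sequences: list[str]               # one entry per protein in the sample
    concentration: float
    concentration_unit: str
    history: str = ""                  # e.g. 'snap frozen and thawed just prior to setup'
    preparation: dict[str, str] = field(default_factory=dict)  # e.g. purity, polydispersity

sample = ProteinSample(
    name="lysozyme",
    sequences=["KVFGRCELAA..."],       # truncated sequence, for illustration only
    concentration=10.0,
    concentration_unit="mg/ml",
    history="snap frozen and thawed just prior to setup",
    preparation={"purity": ">95% by SDS-PAGE"},
)
```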
The experimental setup, even for something as common as a hanging-drop experiment (Benvenuti & Mangani, 2007), is also very hard to describe precisely. Assuming ‘hanging drop’, we need to know that it is a type of ‘vapour diffusion’, and thus we should capture the chemicals used, drop volumes, reservoir volume, initial concentrations, predicted final concentrations, the time course, surface areas, geometry, material and incubation temperature, amongst other things. Indeed, even the time between drop mixing and sealing (in vapour diffusion), or the time course of temperature and dehydration, can be critical.
Outcomes, the results of our experiments, are a morass into which we rarely delve with any enthusiasm: the sheer number of experiments which we describe inaccurately, or do not describe at all, attests to this. To a large part this is a result of our fixation on, almost glorification of, crystals as the only useful result (Chayen & Saridakis, 2008). The non-crystal results can point toward an optimization direction, although one may have to work harder to determine what that direction is. There is a major difficulty in describing these non-crystalline outcomes. When does a precipitate become an amorphous or a crystalline precipitate? Is that drop clear, or is there evidence of a light precipitate? Even then, we should note that we are looking at results, not reasons. Is that clear drop clear because it is under-saturated? Is it clear because it is metastable? Or does it appear clear because the perfect crystal contained within matches the refractive index of the surrounding liquid and we simply cannot see it? We should capture not only the outcome but also how that outcome was determined, to add a level of confidence to the classification. Was the classification strictly an evaluation through a low-magnification binocular microscope, or were spectroscopic, UV fluorescence, light-scattering, dye-based or other physico-chemical means employed for validation? One of the potential benefits of such rigour would be the development of metrics that allow us to abandon non-productive experiments early.
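Pairing an outcome with the means of its determination could look like the following; the record structure and field names are hypothetical:

```python
from dataclasses import dataclass

# Hypothetical record pairing an outcome label with how it was determined,
# so that the classification carries a level of confidence, as argued above.
@dataclass
class ObservedOutcome:
    label: str           # e.g. 'clear', 'light precipitate', 'crystal'
    method: str          # e.g. 'binocular microscope', 'UV fluorescence'
    confidence: float    # 0..1, assigned by the observer or classifier

obs = ObservedOutcome(label="clear", method="UV fluorescence", confidence=0.9)
```

A ‘clear’ score backed by UV fluorescence says something quite different from the same score read through a low-magnification microscope, and a schema like this keeps that distinction.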
This emphasises that our vocabulary has to be comprehensive; it has to have multiple tiers to capture and integrate basic information recorded in one laboratory with more detailed information from another; and it has to be descriptive, precise and uniform.
A number of other disciplines have already faced these challenges, leading to the development of computational analysis techniques built on ontologies (first seen as the New Latin ontologia, ‘the study of that which is’; see, for example, Soldatova et al.). An ontology can be described as a structured formalization of knowledge that reconciles different descriptions of similar things (Musen, 2007). Ontology development deals with questions concerning the entities of interest, and how they can be grouped, related to each other and subdivided according to similarities and differences. By developing a common ontology, multiple different sets of data can be related to each other via a common descriptive language. Given the ontology as a basis, tools and methods of analysis developed for one set of data can be shared and directly applied to data from other groups.
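The grouping and subdivision described here amounts, at its simplest, to an is-a hierarchy. A minimal sketch, with term names chosen purely for illustration:

```python
# A minimal is-a hierarchy: an ontology lets records from different labs be
# related through shared parent terms. The terms here are illustrative.
IS_A = {
    "hanging drop": "vapour diffusion",
    "sitting drop": "vapour diffusion",
    "vapour diffusion": "crystallization experiment",
    "microbatch": "crystallization experiment",
}

def ancestors(term: str) -> list[str]:
    """Walk the is-a chain upward from a term to the root."""
    chain = []
    while term in IS_A:
        term = IS_A[term]
        chain.append(term)
    return chain
```

With such a hierarchy, an analysis written against ‘vapour diffusion’ applies automatically to hanging-drop and sitting-drop records alike, which is precisely the sharing of tools the text describes.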
The field of crystallography is not new to ontology development. Under the auspices of the International Union of Crystallography, a data exchange format was developed for small-molecule single-crystal diffraction experiments: the Crystallographic Information File (CIF) (Hall et al.). An extension to this for macromolecules (mmCIF) followed (Bourne et al.). This includes some terms for describing a successful crystal-growth experiment, but fewer for describing the unsuccessful majority of outcomes in a crystallization experiment. In developing a more detailed crystallization ontology, we will be building on the current mmCIF with the aim of developing a means to capture, and be able to analyse, all crystallization screening experiments. To do so we have to comprehensively define the ‘things’ that it needs to represent. This includes both physical objects, e.g. in the experiment example, ‘ammonium sulfate solution’, and the properties associated with the object, e.g. ‘concentration’, ‘contains NH4+ ions’, ‘is volatile’, ‘has 2:1 stoichiometry of cations to anions’. The power of the ontology approach comes from the ability to use these descriptions and the links between them (e.g. ‘all solutions containing the cation NH4+ are somewhat similar’) as the basis both for describing our experiments and for understanding better the relationships between experimental conditions and outcomes.
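The property links just mentioned can be put to work directly. Under the assumption of a simple chemical-to-cation mapping (the entries below are illustrative), conditions sharing a property can be grouped for analysis:

```python
# Sketch of ontology-style property links: chemicals that share a property
# (here, the cation they contain) can be grouped across experiments.
CONTAINS_CATION = {
    "ammonium sulfate": "NH4+",
    "ammonium chloride": "NH4+",
    "sodium chloride": "Na+",
}

def similar_by_cation(chemical: str) -> set[str]:
    """Return all known chemicals sharing this chemical's cation."""
    cation = CONTAINS_CATION.get(chemical)
    if cation is None:
        return set()
    return {c for c, cat in CONTAINS_CATION.items() if cat == cation}
```

A query like this is the mechanical form of the statement ‘all solutions containing the cation NH4+ are somewhat similar’: it turns a shared property into a shared group of experimental conditions.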