GelML models the process of gel electrophoresis applied in the context of a proteomics experiment, after sample preparation and prior to image analysis or protein identification. The model supports the description of the protocols for electrophoresis, protein detection (either directly on the gel matrix or indirectly, e.g. by Western blotting) and image acquisition from gel matrices. GelML is intended to be used in a modular way together with existing formats. It does not contain explicit models for sample processing or preparation prior to applying a sample to a gel matrix, since such information can be captured in the core FuGE model, which is imported along with the GelML schema. GelML does not provide detailed support for describing the analysis of digitised images derived from gel matrices (see Discussion), although limited support is provided for capturing locations identified on gel images and related quantitative information. In addition, GelML does not describe the process of protein identification, for example by mass spectrometry, for which standard formats already exist (mzML [31] and mzIdentML, as detailed on the PSI website, http://psidev.info/).
The GelML model can be broken down into various sub-sections, each representing a particular stage in a gel electrophoresis experiment: the gel materials and, optionally, the manufacture of the gel; one-dimensional gel electrophoresis (1-DE); two-dimensional gel electrophoresis (2-DE); “non-standard” methods of gel electrophoresis that do not fit the traditional structure of 1-DE or 2-DE, such as 3-dimensional geometry gel electrophoresis; sample loading; electrophoresis; protein detection; image acquisition; and the excision of locations on gels.
GelML makes use of several structures of FuGE: models of protocols or procedures (Protocol), the running of the protocol and runtime parameters or readings (ProtocolApplication), all physical/biological materials (Material) and data files (Data). An overview of different parts of GelML is given in Figure 1 for a 2-DE example; similar workflows can also be constructed for a 1-DE or DIGE experiment. The backbone of a typical file is a series of ProtocolApplications (standard rectangles in Figure 1) that map inputs to outputs. The inputs and outputs of each ProtocolApplication can only be types of Material or Data (rounded rectangles). This structure allows some flexibility with regard to how workflows are constructed if non-standard procedures have been carried out. Each ProtocolApplication must reference a corresponding standard protocol, defined within the file. Each protocol consists of the main text of the protocol, parameters and equipment or software details. Consequently, if the same protocol is run many times, it only has to be recorded once in the file. Figure 1 is annotated with several key details captured at each stage that are required by MIAPE GE.
Figure 1 A graphical representation of example components from a GelML file, and certain key details that should be captured in each section. Standard rectangles indicate ProtocolApplications, rounded rectangles indicate Materials or Data.
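The workflow structure described above can be illustrated with a minimal Python sketch. It is not part of GelML or FuGE: the class layout, the `trace_back` helper and all identifiers ("PA1", "strip-1", and so on) are assumptions made for illustration. It shows ProtocolApplications whose inputs and outputs are restricted to Material or Data, each referencing a Protocol that is recorded once, and how such a chain could be followed backwards from an output.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Protocol:
    identifier: str
    text: str  # main protocol text, recorded once and referenced many times


@dataclass
class Item:
    """An input or output of a ProtocolApplication (hypothetical helper class)."""
    identifier: str
    kind: str  # must be "Material" or "Data"


@dataclass
class ProtocolApplication:
    identifier: str
    protocol: Protocol  # every application references a protocol
    inputs: List[Item] = field(default_factory=list)
    outputs: List[Item] = field(default_factory=list)

    def __post_init__(self):
        # Inputs and outputs may only be Material or Data.
        for item in self.inputs + self.outputs:
            if item.kind not in ("Material", "Data"):
                raise ValueError(f"{item.identifier}: unsupported kind {item.kind!r}")


def trace_back(target_id: str, applications: List[ProtocolApplication]) -> List[str]:
    """Return identifiers of the applications upstream of target_id, nearest first."""
    producer = {out.identifier: app for app in applications for out in app.outputs}
    chain, queue, seen = [], [target_id], set()
    while queue:
        app = producer.get(queue.pop(0))
        if app is None or app.identifier in seen:
            continue
        seen.add(app.identifier)
        chain.append(app.identifier)
        queue.extend(item.identifier for item in app.inputs)
    return chain


# Illustrative 2-DE workflow; all identifiers are made up.
first_dim = Protocol("P1", "First-dimension protocol text")
second_dim = Protocol("P2", "Second-dimension protocol text")
scanning = Protocol("P3", "Image acquisition protocol text")

sample = Item("sample-1", "Material")
strip = Item("strip-1", "Material")
gel = Item("gel2d-1", "Material")
image = Item("image-1", "Data")

applications = [
    ProtocolApplication("PA1", first_dim, [sample], [strip]),
    ProtocolApplication("PA2", second_dim, [strip], [gel]),
    ProtocolApplication("PA3", scanning, [gel], [image]),
]
history = trace_back("image-1", applications)
```

Because each step only declares what it consumes and produces, a non-standard procedure could be represented by adding or reordering applications without changing the classes.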
In the rest of this section, a brief summary is given of several components of the model from the point of view of a “standard” 2-DE experiment, illustrating how these components could represent a MIAPE GE compliant data set.
The MIAPE GE document requires that users report a description of the gel matrix, the physical dimensions, the concentration of acrylamide and the crosslinking agent. GelML has a model to support these details, outlined in Figure 2A as a representative example of GelML (detailed diagrams of other key model components can be found in the supplementary figures). There is an additional model (not shown) which allows the protocol for the gel manufacture to be recorded if the gel was not purchased pre-cast, which is also required by MIAPE GE. The Gel element has attributes for specifying the separation dimension and the batch number. Associations to other elements can be used to capture the dimensions of the gel(s), the ratio of acrylamide to a crosslinker (such as bisacrylamide), the overall percentage of acrylamide, the model number and identifiers for any lanes within the gel. All of these characteristics can affect the quality of the resulting protein separation and estimates of protein quantities, so it is important that such details are stored in a structured format. A separate element represents a 1-D or 2-D gel after electrophoresis has been performed (Gel1D, Gel2D) and can be used to specify the range of physicochemical separation performed, such as molecular weight or pH. Example instances of the XML format are shown in Figure 2B.
Figure 2 A. The model in XSD of the gel material prior to electrophoresis (Gel) and following electrophoresis (ElectrophoresedGel and its sub-elements Gel2D, Gel1D and OtherGel). B. Examples in XML of one instance of Gel and Gel2D; the relationships between Gel and Gel2D are captured ...
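As a rough illustration of why structured capture of the gel description is useful, the following Python sketch checks a record for the four items named above (gel matrix description, physical dimensions, acrylamide concentration and crosslinking agent). The `GelDescription` class, its field names and the example values are assumptions made for this sketch; they do not reproduce the GelML schema or the full MIAPE GE checklist.

```python
from dataclasses import dataclass, fields
from typing import List, Optional, Tuple


@dataclass
class GelDescription:
    # Hypothetical field names; not taken from the GelML schema.
    matrix_description: Optional[str] = None
    dimensions_mm: Optional[Tuple[float, float, float]] = None  # width, height, thickness
    acrylamide_percent: Optional[float] = None
    crosslinker: Optional[str] = None


def missing_items(gel: GelDescription) -> List[str]:
    """List the fields that have not been filled in, in declaration order."""
    return [f.name for f in fields(gel) if getattr(gel, f.name) is None]


# Example values are invented for illustration.
partial = GelDescription(matrix_description="polyacrylamide", acrylamide_percent=12.0)
gaps = missing_items(partial)
```

A submission tool could run a check of this kind before export and prompt the user for the items that remain empty.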
MIAPE GE requests that users report the electrophoresis protocols employed, allowing, for example, database users to apply protocols in their own labs. The protocol, as represented in GelML, consists of the main protocol text and references to buffer details and equipment, such as gel tanks (Supplementary Figure 1). Earlier iterations of GelML modelled electrophoresis protocols by breaking down each step of the protocol into individual parameters, with values and units (rather than plain text). However, there are currently no software packages able to export these protocols directly from electrophoresis control software, and our experience of testing implementations has shown that users are generally not willing to complete complex forms manually at such a high level of granularity. For this reason, the current model captures the protocol steps as plain text.
Proteins are detected or visualised on a gel by either a direct method, such as staining, or an indirect method in which they are transferred to another medium, as in a Western blot. The choice of an appropriate detection agent, such as silver, Coomassie blue or fluorescent stains (for example, those used in DIGE), depends on the concentration and abundance of the sample. It is also influenced by the information required in the post-gel processing steps, such as mass spectrometry. The overall details of the procedure are captured as plain text in GelML. The protocol references a controlled vocabulary term for the name of the detection agent (which would allow a database to be queried for this property) and the quantity of the agent as a volume, mass or concentration (Supplementary Figure 2). The model can also capture indirect detection procedures, such as Western blots, in which proteins are first transferred to a new medium (e.g. a nitrocellulose membrane).
Gel image acquisition
The protocol for acquiring a digitised image can be captured as plain text in GelML with a set of parameters, including a specification of how scanner calibration was performed (Supplementary Figure 3). The model also captures the make and model of the scanner. The input to the application of the protocol is the gel on which proteins were detected by a direct process, or the medium on which indirect detection was performed (not shown). The output of the ProtocolApplication is the image itself, with attributes for capturing the image dimensions, the bit depth, resolution and file format (information required by MIAPE GE). In a DIGE experiment, several instances of the ProtocolApplication are created, each producing one Image, to capture the procedure of scanning at several different wavelengths.
Spot or band excision
In a typical 1-DE or 2-DE experiment, following protein detection (and image acquisition), individual spots or bands are excised and passed on to mass spectrometry for protein identification. The PSI format for mass spectrometry data, mzML [31], can specify a reference to an input sample. In a gel-based experiment, the ProtocolApplication for excision produces a series of samples (ExcisedSample) with unique identifiers which could be referenced within mzML. This link would allow a mass spectrum to be traced back to the complete history of the gel from which it was extracted, including the associated protocols.
GelML contains a model for linking the samples back to the corresponding locations on images, and for capturing a protocol describing how excision is performed (Supplementary Figure 4). Locations on a gel, such as spots or bands, can be captured in several different ways depending on how the images have been analysed, such as pairs of X/Y coordinates, or circular or rectangular locations. If spot locations have a complex shape, as produced by image analysis software, the location can be specified by an ordered chain of X/Y boundary points (see the specification document for more detail). Gel locations can be annotated with additional measurements, which could be used to store quantitative values derived from image analysis, such as spot density or volume.
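To make the boundary-point representation concrete, the sketch below treats a spot as an ordered chain of X/Y points and derives two simple quantities from it. The function names are assumptions, and the shoelace and ray-casting methods used here are standard geometry chosen for illustration; they are not prescribed by GelML.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def polygon_area(boundary: Sequence[Point]) -> float:
    """Area enclosed by an ordered chain of boundary points (shoelace formula)."""
    if len(boundary) < 3:
        raise ValueError("a boundary needs at least three points")
    closed: List[Point] = list(boundary[1:]) + [boundary[0]]
    total = 0.0
    for (x1, y1), (x2, y2) in zip(boundary, closed):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def contains(boundary: Sequence[Point], point: Point) -> bool:
    """Ray-casting test: is the point inside the boundary?"""
    x, y = point
    inside = False
    n = len(boundary)
    for i in range(n):
        x1, y1 = boundary[i]
        x2, y2 = boundary[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Illustrative spot outline in pixel coordinates (values are made up).
spot = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]
area = polygon_area(spot)
picked_inside = contains(spot, (2.0, 1.0))
picked_outside = contains(spot, (5.0, 1.0))
```

A containment test of this kind could, for example, be used to check that an excision coordinate falls within the spot it is linked to.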
The PSI-Gel workgroup has developed the controlled vocabulary sepCV, which contains terms specific to the methods and techniques of protein separation using gel electrophoresis. It covers gel manufacture and preparation, running conditions, protein detection techniques and imaging methods. Several key parts of GelML require CV terms to be sourced from sepCV, such as the protein detection agent and the type of crosslinker in the gel. The description of the starting sample requires the use of CV terms to capture its important characteristics, as defined by the investigators, for example sourced from an organism-specific ontology within the OBO Foundry [27]. The Unit Ontology, which is also part of the OBO Foundry, should be used with GelML to standardise the naming of units. The use of CV terms is controlled by a mapping file that specifies exactly which CV terms are allowed within each part of the schema. The usage can then be checked using the PSI’s semantic validation technology [32], for which a test implementation has been created by the OpenMS developers (details at http://www.psidev.info/validator/).
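The idea of a mapping that restricts which CV terms may appear in each part of the schema can be sketched as a simple lookup. This is only an illustration of the principle: the schema locations, the term accessions (prefixed "EX:") and the `check_cv_usage` function are invented for the example, and the real mapping file format and PSI semantic validator are not reproduced here.

```python
from typing import Dict, List, Set, Tuple


def check_cv_usage(
    mapping: Dict[str, Set[str]],
    annotations: List[Tuple[str, str]],
) -> List[str]:
    """Report annotations whose CV term is not allowed at their schema location."""
    problems = []
    for location, term in annotations:
        allowed = mapping.get(location)
        if allowed is None:
            problems.append(f"{location}: no CV rule defined")
        elif term not in allowed:
            problems.append(f"{location}: term {term} not allowed")
    return problems


# Hypothetical mapping: schema location -> permitted term accessions.
mapping = {
    "DetectionAgent": {"EX:0001", "EX:0002"},
    "Crosslinker": {"EX:0100"},
}

# Hypothetical annotations extracted from a file: (schema location, term used).
annotations = [
    ("DetectionAgent", "EX:0001"),
    ("Crosslinker", "EX:0001"),
    ("ScannerModel", "EX:0200"),
]
problems = check_cv_usage(mapping, annotations)
```

Keeping the rules in a separate mapping rather than in the schema itself means the permitted terms can be revised without changing the schema.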
Implementations of GelML
The first implementations of GelML within database systems have recently been developed. The ProteoRed consortium has developed the MIAPE generator tool, which automates the process of collecting methods and data sets for proteomics in compliance with the MIAPE guidelines [9]. The tool guides users through each stage of an experimental process, capturing key details as specified in each MIAPE module. The sepCV and unitCV vocabularies have been implemented to ensure that consistent method descriptions and units are provided throughout. At the end of the process, the user can verify that their submission is MIAPE GE compliant. Other users can browse the MIAPE database, and have the opportunity to download descriptions of methods in “Report format” (as PDF). A tool has been developed for mapping the internal ProteoRed format to GelML, using a Java Webstart application (Figure 3). The ProteoRed database covers protein separation and electrophoresis protocols in much greater detail than the EBI PRIDE database format, while PRIDE provides a central repository for protein identifications based on mass spectrometry. Thus, data from gel-based proteomics workflows can be accommodated by a dual submission: the methodology description, gel images and image features are submitted to ProteoRed, and the protein identifications are stored in PRIDE. A mechanism has been created for linking the two submissions by unique identifiers. Users of the system can therefore also download linked files in PRIDE XML and GelML format for local analysis.
Figure 3 Screenshots from the ProteoRed MIAPE Generator Website, showing the pipeline for generating GelML files from an existing MIAPE experiment: i) The user is guided through the data input process and is provided with the option to export to GelML; ii) a validator ...
A second beta implementation is under development at the Swiss Institute of Bioinformatics, in which MIAPE-compliant submissions to the World-2DPAGE Repository can be created using the MIAPEGelDB interface [33]. An example file can be viewed at http://miapegeldb.expasy.org/experiment/2/gel/102/as_xml/. Since the GelML model is based on FuGE, FuGE-based software can be adapted relatively simply to provide implementations for GelML. A toolkit has been developed that provides one such mechanism, comprising a software application to facilitate the collection, storage and browsing of FuGE-compliant models and FuGE extensions such as GelML [34]. A mapping has also been created from GelML as part of the ISA-TAB project, allowing the XML to be rendered in a tab-based format for simpler visualisation (http://isatab.sourceforge.net/examples.html). The ISA-TAB mechanism is also used in the Bioinvestigation Index (http://www.ebi.ac.uk/bioinvindex) project to submit data to ‘omics databases hosted at the European Bioinformatics Institute.
The PSI Protein Separation (PSI-PS) work group has an active team of developers working on software implementations. The group is committed to providing on-going documentation and help guides for GelML, and will provide support for other groups implementing GelML through the group’s mailing list (see the workgroup home page, http://www.psidev.info/index.php?q=node/83).