AGML was developed through close interaction with bioinformaticians and experimentalists to create a common data format that is open, accessible, and inclusive of all aspects of 2-DE experiments [13]. The AGML format accommodates a description that spans from the start of the experiment to the final identification step; as a consequence, all of the data is placed within the experimental context. The resulting definition, AGML 2.0, allows users to establish both the provenance and relevance of a 2-DE experiment, thereby enabling the development of effective search and analysis tools.
Specialized databases that focus on 2-DE data exist throughout the world [20], with SWISS-2DPAGE being a major database [23]. Other research efforts have been directed toward comprehensive representation of proteomics experiments, such as PEDRo [14] and HUP-ML [24]. AGML was developed as a pragmatic representation of the 2-DE-centric subdomain and can be used to interoperate with those much larger and more encompassing representations.
Despite being the workhorse of proteomic studies, the gel-centric approach has had weak bioinformatics support because stable gel-centric data standards and formats are lacking. To assist with tool development and data dissemination, a public database of AGML-formatted entries, AGML Central [25], was developed. Its web interface comes with a visualization plug-in and a portal for retrieval and submission by external applications. For example, MATLAB® (The MathWorks, Inc., Natick, MA, USA) GUI functions are available on the tools page for transferring data directly to and from AGML Central. The ability to map the AGML XML format to a MATLAB® 'struct' (agml) enables statisticians and bioinformaticians to create algorithms based on it. This MATLAB® struct can hold the information extracted from an AGML XML document; hence, users of this format can, by extension, use any algorithm developed for the struct 'agml'. This feature allows the development of AGML Central-based pipelines and analysis tools. Additionally, the results of analyses using other methods can be submitted to AGML Central and appended to the corresponding entry. Although AGML Central is a public repository, all submitted data is private by default. The owner of the data can decide to make it public by providing selective access. There are currently 26 entries, of which only 2 have been made public. However, 14 of the entries have been designated for collaboration. It is our hope that, as collaborations are completed, all data will be made public.
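The idea of mapping an XML document onto a nested, struct-like object can be illustrated with a minimal sketch. It is written in Python rather than MATLAB®, and it is not the AGML Central implementation. Only the element names <mi2dg> and <reals> come from the AGML description; the root element, all child elements, the attribute, and the values are hypothetical and chosen purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical AGML-like fragment. Only <mi2dg> and <reals> are named in the
# text; the root, the children, the "id" attribute and all values are invented.
SAMPLE = """
<agml>
  <mi2dg>
    <sample>kidney tissue</sample>
    <stain>silver</stain>
  </mi2dg>
  <reals>
    <spot id="1"><x>102.5</x><y>311.0</y><volume>5400</volume></spot>
    <spot id="2"><x>240.1</x><y>87.3</y><volume>1250</volume></spot>
  </reals>
</agml>
"""


def element_to_struct(elem):
    """Recursively convert an XML element into nested dicts, lists and strings."""
    children = list(elem)
    text = (elem.text or "").strip()
    # A leaf without attributes is represented by its text alone.
    if not children and not elem.attrib:
        return text
    node = dict(elem.attrib)  # attributes become fields of the struct
    for child in children:
        value = element_to_struct(child)
        if child.tag in node:
            # A repeated tag is collected into a list, like a struct array.
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(value)
        else:
            node[child.tag] = value
    if text:
        node["text"] = text
    return node


agml = element_to_struct(ET.fromstring(SAMPLE))

# Any algorithm written against the struct works for every document that
# follows the same layout, for example collecting all spot volumes.
volumes = [float(spot["volume"]) for spot in agml["reals"]["spot"]]
```

Once the document is held in such a structure, analysis code depends only on the field names and not on XML parsing, which is the property the 'agml' struct is described as providing. This sketch does not handle a child element that shares its name with an attribute of its parent.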
The AGML concept and its implementation facilitate the management of proteomic data coming from diverse labs using different instruments and protocols, and enable the creation of much-needed public 2-DE databases [26]. The AGML format therefore provides a wider community of developers (through the accompanying open source project) and a larger audience of users (such as bioinformaticians and statisticians) with a way to access information generated by 2-DE experiments, enabling them to develop comprehensive data mining algorithms for exploratory and confirmatory data analysis. For example, Oates et al. [27] used the AGML Central infrastructure to manage, integrate, and analyze 2-DE data to identify biomarkers that differentiate the two most common causes of acute renal failure. They used AGML Central to disseminate both their protocol and proteomic data in the AGML format to their bioinformatics collaborators. They then used the AGML data structure in the native MATLAB® format, which is provided by AGML Central, for exploratory analysis of their proteomic data. AGML Central gave the collaborators access to all of the information at any time, streamlining their collaborative effort to get results faster. Additionally, a nascent controlled vocabulary exists for AGML; see Minimum Ontology for 2DE Gel Electrophoresis [28] for more information on this effort. Completion of this work will give the AGML format true agility and the ability to work with semantic web technologies.
The standards being created by HUPO-PSI-GEL for 2-D gel electrophoresis data markup, gelML and GelInfoML [7], and two analogous MIAPE modules (GE and GI; [29]), aim to encompass more details and to be a comprehensive data standard for 2-D gel electrophoresis as a whole. gelML and GelInfoML are both based on the Functional Genomics Experiment [9] modeling framework. While these efforts will ultimately result in community-based data standards, AGML was created to answer this need more pragmatically. Since its inception in 2004 [13], AGML has focused on getting the data to tool makers. In the process, AGML has acquired many of the features that are being proposed. Briefly, the <mi2dg> elements can be considered analogous to GelInfoML, and the <reals> elements analogous to gelML. Once stable gelML and GelInfoML standards are published, it will be made possible to translate AGML documents to these standards, thereby making the data available to any tools developed for the HUPO-PSI-GEL standards.
Overcoming barriers to data flow is a central theme on the route toward Systems Biology, and this is especially true for rapidly expanding methodologies such as those developed for proteomics research. Rapid growth of the field has seen the emergence of high-throughput instruments from different vendors that use many different proprietary data standards; the resulting lack of data interoperability limits data integration. This problem is underscored by the formation of the Interoperable Informatics Infrastructure Consortium, whose major goal is to eliminate barriers to application interoperability, data integration, and eventually knowledge sharing [30]. Additionally, work undertaken by HUPO-PSI to advance proteomics data standards also points to the need for data interoperability in proteomics [31