The Tissue Microarray Data Exchange Specification (TMA DES) is an eXtensible Markup Language (XML) specification for encoding TMA experiment data in a machine-readable format that is also human readable. TMA DES defines Common Data Elements (CDEs) that form a basic vocabulary for describing TMA data. TMA data are routinely subjected to univariate and multivariate statistical analysis to determine differences or similarities between pathologically distinct groups of tumors for one or more markers or between markers for different groups. Such statistical analysis tests include the t-test, ANOVA, Chi-square, Mann-Whitney U, and Kruskal-Wallis tests. All these generate output that needs to be recorded and stored with TMA data.
Materials and Methods:
We propose extending the TMA DES to include syntactic and semantic definitions of CDEs for describing the results of statistical analyses performed upon TMA DES data. This paper describes these CDEs and illustrates how they can be added to the TMA DES. We created a Document Type Definition (DTD) file defining the syntax for these CDEs, and a set of ISO 11179 entries providing semantic definitions for them. We describe how we wrote a program in R that read TMA DES data from an XML file, performed statistical analyses on that data, and created a new XML file containing both the original XML data and CDEs representing the results of our analyses. This XML file was submitted to XML parsers in order to confirm that it conformed to the syntax defined in our extended DTD file. TMA DES XML files with deliberately introduced errors were also parsed in order to verify that our new DTD file could perform error checking. Finally, we also validated an existing TMA DES XML file against our DTD file in order to demonstrate the backward compatibility of our DTD.
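The read-analyze-annotate workflow described above can be sketched as follows. This is a minimal illustration in Python rather than the R program the authors wrote, and every element name in it (`tma`, `core`, `score`, `statistical_analysis`, `test_name`, `test_statistic`) is a hypothetical placeholder, not a CDE from TMA DES or from the proposed extension. It computes a Welch t statistic for two groups and appends the result to the original XML tree.

```python
import math
import statistics
import xml.etree.ElementTree as ET

# Hypothetical TMA-like input; element names are illustrative only and are
# NOT the CDEs defined by TMA DES or by the proposed extension.
SAMPLE = """<tma>
  <core group="A"><score>2.0</score></core>
  <core group="A"><score>3.0</score></core>
  <core group="A"><score>4.0</score></core>
  <core group="B"><score>5.0</score></core>
  <core group="B"><score>6.0</score></core>
  <core group="B"><score>7.0</score></core>
</tma>"""


def welch_t(a, b):
    """Return Welch's t statistic for two independent samples."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


def annotate(xml_text):
    """Parse the input, run the test, and append the result to the tree."""
    root = ET.fromstring(xml_text)

    # Collect scores per group from the original data.
    groups = {}
    for core in root.iter("core"):
        groups.setdefault(core.get("group"), []).append(
            float(core.findtext("score"))
        )

    t = welch_t(groups["A"], groups["B"])

    # Append a (hypothetical) result element; the original data is untouched.
    result = ET.SubElement(root, "statistical_analysis")
    ET.SubElement(result, "test_name").text = "Welch t-test"
    ET.SubElement(result, "test_statistic").text = f"{t:.4f}"
    return root


if __name__ == "__main__":
    print(ET.tostring(annotate(SAMPLE), encoding="unicode"))
```

Keeping the original elements and adding the results as a sibling element mirrors the idea of a single file carrying both the data and its analysis output. Validation against a DTD is not shown, since the Python standard library does not provide a validating parser.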
Our experiments demonstrated the encoding of analysis results using our proposed CDEs. We used XML parsers to confirm that these XML data were syntactically correct and conformed to the rules specified in our extended TMA DES DTD. We also demonstrated that this extended DTD was capable of being used to successfully perform error checking, and was backward compatible with pre-existing TMA DES data which did not use our new CDEs.
The TMA DES allows Tissue Microarray data to be shared. A variety of statistical tests are used to analyze such data. We have proposed a set of CDEs as an extension to the TMA DES which can be used to annotate TMA DES data with the results of statistical analyses performed on that data. We performed experiments which demonstrated the usage of TMA DES data containing our proposed CDEs.
CDEs; DTD; statistical analysis; tissue microarray; TMA Data Exchange Specification; XML
Despite the complete determination of the genome sequences of a huge number of bacteria, their proteomes remain relatively poorly defined. Besides new methods to increase the number of identified proteins, new database applications are necessary to store and present the results of large-scale proteomics experiments.
In the present study, a database concept has been developed to address these issues and to offer complete information via a web interface. In our concept, the Oracle based data repository system SQL-LIMS plays the central role in the proteomics workflow and was applied to the proteomes of Mycobacterium tuberculosis, Helicobacter pylori, Salmonella typhimurium and protein complexes such as 20S proteasome. Technical operations of our proteomics labs were used as the standard for SQL-LIMS template creation. By means of a Java based data parser, post-processed data of different approaches, such as LC/ESI-MS, MALDI-MS and 2-D gel electrophoresis (2-DE), were stored in SQL-LIMS. A minimum set of the proteomics data was transferred to our public 2D-PAGE database using a Java based interface (Data Transfer Tool) in line with the requirements of the PEDRo standardization. Furthermore, the stored proteomics data could be extracted from SQL-LIMS via XML.
The Oracle based data repository system SQL-LIMS played the central role in the proteomics workflow concept. Technical operations of our proteomics labs were used as standards for SQL-LIMS templates. Using a Java based parser, post-processed data of different approaches such as LC/ESI-MS, MALDI-MS and 1-DE and 2-DE were stored in SQL-LIMS. Thus, unique data formats of different instruments were unified and stored in SQL-LIMS tables. Moreover, a unique submission identifier allowed fast access to all experimental data. This was the main advantage compared to multi-software solutions, especially if personnel fluctuations are high. Moreover, large scale and high-throughput experiments must be managed in a comprehensive repository system such as SQL-LIMS, to query results in a systematic manner. On the other hand, these database systems are expensive and require at least one full time administrator and specialized lab manager. Moreover, the high technical dynamics in proteomics may make it difficult to accommodate new data formats. To summarize, SQL-LIMS met the requirements of proteomics data handling especially in skilled processes such as gel-electrophoresis or mass spectrometry and fulfilled the PSI standardization criteria. The data transfer into a public domain via the Data Transfer Tool (DTT) facilitated validation of proteomics data. Additionally, evaluation of mass spectra by post-processing using MS-Screener improved the reliability of mass analysis and prevented storage of junk data.
Although two-dimensional gel electrophoresis (2-DE) is an effective and widely used method to screen the proteome, its data standardization has still not matured to the level of microarray genomics data or mass spectrometry approaches. The trend toward identifying encompassing data standards has been expanding from genomics to transcriptomics, and more recently to proteomics. The relative success of genomic and transcriptomic data standardization has enabled the development of central repositories such as GenBank and Gene Expression Omnibus. An equivalent 2-DE-centric data structure would similarly have to include a balance among raw data, basic feature detection results, sufficiency in the description of the experimental context and methods, and an overall structure that facilitates a diversity of usages, from central repositories to local data representation in LIMS.
Results & Conclusion
Achieving such a balance can only be accomplished through several iterations involving bioinformaticians, bench molecular biologists, and the manufacturers of the equipment and commercial software from which the data is primarily generated. Such an encompassing data structure is described here, developed as the mature successor to the well established and broadly used earlier version. A public repository, AGML Central, is configured with a suite of tools for the conversion from a variety of popular formats, web-based visualization, and interoperation with other tools and repositories, and is particularly mass-spectrometry oriented with I/O for annotation and data analysis.
This work aims to demonstrate an experimental Electronic Patient Record (EPR) system built from technologies that can support viewing of medical imaging exams and graphically rich clinical reporting tools, while conforming to the newly emerging XML standard for digital documents. In particular, we aim to promote rapid prototyping of new reports by clinical specialists.
We demonstrate the InfoDOM experimental EPR system that is currently being adapted for test-bed use in three hospitals in Cagliari, Italy. For this we are working with specialists in neurology, radiology, and epilepsy.
Early indications are that the rapid prototyping of reports afforded by our EPR system can assist communication between clinical specialists and our system developers. We are now experimenting with new technologies that may provide services to the kind of XML EPR client described here.
Evolutionary trees are central to a wide range of biological studies. In many of these studies, tree nodes and branches need to be associated (or annotated) with various attributes. For example, in studies concerned with organismal relationships, tree nodes are associated with taxonomic names, whereas tree branches have lengths and oftentimes support values. Gene trees used in comparative genomics or phylogenomics are usually annotated with taxonomic information, genome-related data, such as gene names and functional annotations, as well as events such as gene duplications, speciations, or exon shufflings, combined with information related to the evolutionary tree itself. The data standards currently used for evolutionary trees have limited capacities to incorporate such annotations of different data types.
We developed an XML language, named phyloXML, for describing evolutionary trees, as well as various associated data items. PhyloXML provides elements for commonly used items, such as branch lengths, support values, taxonomic names, and gene names and identifiers. By using "property" elements, phyloXML can be adapted to novel and unforeseen use cases. We also developed various software tools for reading, writing, conversion, and visualization of phyloXML formatted data.
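As a rough illustration of how such annotated trees might be consumed, the sketch below reads a small phyloXML-style document with Python's standard `xml.etree.ElementTree` module and lists the named clades with their branch lengths. The sample document is hand-written for this example and has not been validated against the phyloXML XSD; the `example:note` property reference and the helper name `named_clades` are assumptions made for illustration, not part of the authors' tools.

```python
import xml.etree.ElementTree as ET

PHYLOXML_NS = "http://www.phyloxml.org"
NS = {"phy": PHYLOXML_NS}

# Hand-written sample for illustration; not validated against the XSD.
DOC = """<phyloxml xmlns="http://www.phyloxml.org">
  <phylogeny rooted="true">
    <clade>
      <clade>
        <name>A</name>
        <branch_length>0.102</branch_length>
        <taxonomy><scientific_name>Escherichia coli</scientific_name></taxonomy>
      </clade>
      <clade>
        <name>B</name>
        <branch_length>0.23</branch_length>
        <confidence type="bootstrap">89</confidence>
        <property ref="example:note" datatype="xsd:string"
                  applies_to="clade">custom value</property>
      </clade>
    </clade>
  </phylogeny>
</phyloxml>"""


def named_clades(xml_text):
    """Return (name, branch_length) pairs for every clade that has a name."""
    root = ET.fromstring(xml_text)
    rows = []
    for clade in root.iter(f"{{{PHYLOXML_NS}}}clade"):
        # Only direct children are inspected, so an unnamed parent clade
        # does not pick up the name of one of its descendants.
        name = clade.findtext("phy:name", namespaces=NS)
        if name is None:
            continue
        length = clade.findtext("phy:branch_length", namespaces=NS)
        rows.append((name, float(length) if length is not None else None))
    return rows


if __name__ == "__main__":
    for name, length in named_clades(DOC):
        print(name, length)
```

The `property` element in the second clade shows the kind of extension point the text refers to: data that has no dedicated element is carried with an explicit reference and datatype.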
PhyloXML is an XML language defined by a complete schema in XSD that allows storing and exchanging the structures of evolutionary trees as well as associated data. More information about phyloXML itself, the XSD schema, as well as tools implementing and supporting phyloXML, is available at .
Proteomics continues to play a critical role in post-genomic science as continued advances in mass spectrometry and analytical chemistry support the separation and identification of increasing numbers of peptides and proteins from their characteristic mass spectra. In order to facilitate the sharing of this data, various standard formats have been, and continue to be, developed. Still not fully mature however, these are not yet able to cope with the increasing number of quantitative proteomic technologies that are being developed.
We propose an extension to the PRIDE and mzData XML schemas to accommodate the concept of multiple samples per experiment, and in addition, capture the intensities of the iTRAQ™ reporter ions in the entry. A simple Java client has been developed to capture and convert the raw data from common spectral file formats, which also uses a third-party open source tool for the generation of iTRAQ™ reported intensities from Mascot output, into a valid PRIDE XML entry.
We describe an extension to the PRIDE and mzData schemas to enable the capture of quantitative data. Currently this is limited to iTRAQ™ data but is readily extensible for other quantitative proteomic technologies. Furthermore, a software tool has been developed which enables conversion from various mass spectrum file formats and corresponding Mascot peptide identifications to PRIDE-formatted XML. The tool represents a simple approach to preparing quantitative and qualitative data for submission to repositories such as PRIDE, which is necessary to facilitate data deposition and sharing in public domain databases. The software is freely available from .
The Tissue Microarray Data Exchange Specification (TMA DES) is an XML specification for encoding TMA experiment data. While TMA DES data is encoded in XML, the files that describe its syntax, structure, and semantics are not. The DTD format is used to describe the syntax and structure of TMA DES, and the ISO 11179 format is used to define the semantics of TMA DES. However, XML Schema can be used in place of DTDs, and another XML encoded format, RDF, can be used in place of ISO 11179. Encoding all TMA DES data and metadata in XML would simplify the development and usage of programs which validate and parse TMA DES data. XML Schema has advantages over DTDs such as support for data types, and a more powerful means of specifying constraints on data values. An advantage of RDF encoded in XML over ISO 11179 is that XML defines rules for encoding data, whereas ISO 11179 does not.
Materials and Methods:
We created an XML Schema version of the TMA DES DTD. We wrote a program that converted ISO 11179 definitions to RDF encoded in XML, and used it to convert the TMA DES ISO 11179 definitions to RDF.
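The conversion step can be sketched in a few lines. The example below is an assumption-laden illustration, not the authors' converter: it takes data element definitions already parsed into Python dictionaries (the `name` and `definition` keys, the `example_element` entry, and the `http://example.org/tma-des#` base URI are all hypothetical) and emits RDF/XML using the standard `rdf:Description`, `rdfs:label`, and `rdfs:comment` terms.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"

# Use conventional prefixes in the serialized output.
ET.register_namespace("rdf", RDF)
ET.register_namespace("rdfs", RDFS)


def cde_to_rdf(cdes, base_uri):
    """Serialize a list of {'name', 'definition'} dicts as RDF/XML text."""
    root = ET.Element(f"{{{RDF}}}RDF")
    for cde in cdes:
        # One rdf:Description per data element, identified by a URI.
        description = ET.SubElement(
            root,
            f"{{{RDF}}}Description",
            {f"{{{RDF}}}about": base_uri + cde["name"]},
        )
        ET.SubElement(description, f"{{{RDFS}}}label").text = cde["name"]
        ET.SubElement(description, f"{{{RDFS}}}comment").text = cde["definition"]
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical definition; not taken from the TMA DES ISO 11179 entries.
    sample = [{"name": "example_element",
               "definition": "A placeholder data element definition."}]
    print(cde_to_rdf(sample, "http://example.org/tma-des#"))
```

Because the output is ordinary XML, the same parser that reads the data files can read the semantic definitions, which is the simplification the text argues for.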
We validated a sample TMA DES XML file that was supplied with the publication that originally specified TMA DES using our XML Schema. We successfully validated the RDF produced by our ISO 11179 converter with the W3C RDF validation service.
All TMA DES data could be encoded using XML, which simplifies its processing. XML Schema allows datatypes and valid value ranges to be specified for CDEs, which enables a wider range of error checking to be performed using XML Schemas than could be performed using DTDs.
CDEs; DTD; statistical analysis; tissue microarray; TMA DES; XML
Proteomics inherently deals with huge amounts of data. Current mass spectrometers acquire hundreds of thousands of spectra within a single project. Thus, data management and data analysis are a challenge. We have developed a software platform (Proteinscape) that stores all relevant proteomics data efficiently and allows fast access and correlation analysis within proteomics projects.
The software is based on a relational database system using Web-based server-client architecture with intra- and Internet access.
Proteinscape stores relevant data from all steps of proteomics projects—study design, sample treatment, separation techniques (e.g., gel electrophoresis or liquid chromatography), protein digestion, mass spectrometry, and protein database search results. Gel spot data can be imported directly from several 2DE-gel image analysis software packages as well as spot-picking robots. Spectra (MS and MS/MS) are imported automatically during acquisition from MALDI and ESI mass spectrometers.
Many algorithms for automated spectra and search result processing are integrated. PMF spectra are calibrated and filtered for contaminant and polymer peaks (Score-booster). A single non-redundant protein list—containing only proteins that can be distinguished by the MS/MS data—can be generated from MS/MS search results (ProteinExtractor). This algorithm can combine data from different search algorithms or different experiments (MALDI/ESI, or acquisition repetitions) into a single protein list.
Navigation within the database is possible either by using the hierarchy of project, sample, protein/peptide separation, spectrum, and identification results, or by using a gel viewer plug-in. Available features include zooming, annotations (protein, spot name, etc.), export of the annotated image, and links to spot, spectrum, and protein data.
Proteinscape includes sophisticated query tools that allow data retrieval for typical questions in proteome projects. Here we present the benefits and power of the software, drawing on 6 years of continuous use in over 70 proteome projects managed in house.
In scientific research, integration and synthesis require a common understanding of where data come from, how much they can be trusted, and what they may be used for. To make such an understanding computer-accessible requires standards for exchanging richly annotated data. The challenges of conveying reusable data are particularly acute in regard to evolutionary comparative analysis, which comprises an ever-expanding list of data types, methods, research aims, and subdisciplines. To facilitate interoperability in evolutionary comparative analysis, we present NeXML, an XML standard (inspired by the current standard, NEXUS) that supports exchange of richly annotated comparative data. NeXML defines syntax for operational taxonomic units, character-state matrices, and phylogenetic trees and networks. Documents can be validated unambiguously. Importantly, any data element can be annotated, to an arbitrary degree of richness, using a system that is both flexible and rigorous. We describe how the use of NeXML by the TreeBASE and Phenoscape projects satisfies user needs that cannot be satisfied with other available file formats. By relying on XML Schema Definition, the design of NeXML facilitates the development and deployment of software for processing, transforming, and querying documents. The adoption of NeXML for practical use is facilitated by the availability of (1) an online manual with code samples and a reference to all defined elements and attributes, (2) programming toolkits in most of the languages used commonly in evolutionary informatics, and (3) input–output support in several widely used software applications. An active, open, community-based development process enables future revision and expansion of NeXML.
Data standards; evolutionary informatics; interoperability; phyloinformatics; semantic web; syntax format
To identify the determinants of successful antiretroviral (ARV) therapy, researchers study the virological responses to treatment-change episodes (TCEs) accompanied by baseline plasma HIV-1 RNA levels, CD4+ T lymphocyte counts, and genotypic resistance data. Such studies, however, often differ in their inclusion and virological response criteria making direct comparisons of study results problematic. Moreover, the absence of a standard method for representing the data comprising a TCE makes it difficult to apply uniform criteria in the analysis of published studies of TCEs.
To facilitate data sharing for TCE analyses, we developed an XML (Extensible Markup Language) Schema that represents the temporal relationship between plasma HIV-1 RNA levels, CD4 counts and genotypic drug resistance data surrounding an ARV treatment change. To demonstrate the adaptability of the TCE XML Schema to different clinical environments, we collaborated with four clinics to create a public repository of about 1,500 TCEs. Despite the nascent state of this TCE XML Repository, we were able to perform an analysis that generated a novel hypothesis pertaining to the optimal use of second-line therapies in resource-limited settings. We also developed an online program (TCE Finder) for searching the TCE XML Repository and another program (TCE Viewer) for generating a graphical depiction of a TCE from a TCE XML Schema document.
The TCE Suite of applications – the XML Schema, Viewer, Finder, and Repository – addresses several major needs in the analysis of the predictors of virological response to ARV therapy. The TCE XML Schema and Viewer facilitate sharing data comprising a TCE. The TCE Repository, the only publicly available collection of TCEs, and the TCE Finder can be used for testing the predictive value of genotypic resistance interpretation systems and potentially for generating and testing novel hypotheses pertaining to the optimal use of salvage ARV therapy.
Human immunodeficiency virus; Antiretroviral treatment; Drug resistance; Clinical outcomes; XML schema; Database
The Human Proteome Organization (HUPO) Proteomics Standard Initiative has been tasked with developing file formats for storing raw data (mzML) and the results of spectral processing (protein identification and quantification) from proteomics experiments (mzIdentML). In order to fully characterize complex experiments, special data types have been designed. Standardized file formats will promote visualization, validation and dissemination of data independent of the vendor-specific binary data storage files. Innovative programmatic solutions for robust and efficient data access to standardized file formats will contribute to more rapid wide-scale acceptance of these file formats by the proteomics community.
In this work, we compare algorithms for accessing spectral data in the mzML file format. Because mzML is an XML format, mzML files allow efficient parsing of data structures when using XML-specific class types. These classes provide only sequential access to files. However, random access to spectral data is needed in many algorithmic applications for processing proteomics datasets. Here, we demonstrate the implementation of memory streams to convert sequential access into random access. Our application preserves the elegant XML parsing capabilities. Benchmarking file access times in sequential and random access modes shows that, while random access is more time efficient for a small number of spectra, sequential access becomes more efficient when retrieving a large number of spectra. We also provide comparisons to other file accessing methods from academia and industry.
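One common way to obtain random access over a sequential XML file is to build a byte-offset index once and then seek directly to individual spectra. The sketch below illustrates that general idea in Python; it is not the memory-stream implementation described above, and the sample bytes are a drastically simplified, mzML-like fragment (no namespace, no binary arrays) written only for this example. The function names `build_index` and `read_spectrum` are hypothetical.

```python
import io
import re
import xml.etree.ElementTree as ET

# Simplified mzML-like content for illustration; real files carry a
# namespace, binary data arrays, and many more parameters.
DATA = b"""<mzML><run><spectrumList count="3">
<spectrum index="0" id="scan=1"><cvParam name="ms level" value="1"/></spectrum>
<spectrum index="1" id="scan=2"><cvParam name="ms level" value="2"/></spectrum>
<spectrum index="2" id="scan=3"><cvParam name="ms level" value="2"/></spectrum>
</spectrumList></run></mzML>"""

SPECTRUM_OPEN = re.compile(rb'<spectrum\s[^>]*\bid="([^"]+)"')
SPECTRUM_CLOSE = b"</spectrum>"


def build_index(stream):
    """Scan once and map each spectrum id to its (byte offset, length).

    The whole stream is read into memory here for brevity; a production
    version would scan in chunks or reuse an offset index stored in the file.
    """
    stream.seek(0)
    data = stream.read()
    index = {}
    for match in SPECTRUM_OPEN.finditer(data):
        start = match.start()
        end = data.index(SPECTRUM_CLOSE, start) + len(SPECTRUM_CLOSE)
        index[match.group(1).decode()] = (start, end - start)
    return index


def read_spectrum(stream, index, spectrum_id):
    """Seek to one spectrum and parse only that fragment with an XML parser."""
    offset, length = index[spectrum_id]
    stream.seek(offset)
    return ET.fromstring(stream.read(length))


if __name__ == "__main__":
    stream = io.BytesIO(DATA)
    index = build_index(stream)
    spectrum = read_spectrum(stream, index, "scan=3")
    print(spectrum.get("index"), spectrum.find("cvParam").get("value"))
```

The trade-off matches the benchmark summary: the index costs one full pass, which pays off when only a few spectra are needed, whereas a single sequential parse avoids repeated seeks when most of the file is read anyway.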
mzML; XML; Sequential file access; Random file access; Proteomics datasets
Crop wild relatives are wild species that are closely related to crops. They are valuable as potential gene donors for crop improvement and may help to ensure food security for the future. However, they are becoming increasingly threatened in the wild and are inadequately conserved, both in situ and ex situ. Information about the conservation status and utilisation potential of crop wild relatives is diverse and dispersed, and no single agreed standard exists for representing such information; yet, this information is vital to ensure these species are effectively conserved and utilised. The European Community-funded project, European Crop Wild Relative Diversity Assessment and Conservation Forum, determined the minimum information requirements for the conservation and utilisation of crop wild relatives and created the Crop Wild Relative Information System, incorporating an eXtensible Markup Language (XML) schema to aid data sharing and exchange.
Crop Wild Relative Markup Language (CWRML) was developed to represent the data necessary for crop wild relative conservation and ensure that they can be effectively utilised for crop improvement. The schema partitions data into taxon-, site-, and population-specific elements, to allow for integration with other more general conservation biology schemata which may emerge as accepted standards in the future. These elements are composed of sub-elements, which are structured in order to facilitate the use of the schema in a variety of crop wild relative conservation and use contexts. Pre-existing standards for data representation in conservation biology were reviewed and incorporated into the schema as restrictions on element data contents, where appropriate.
CWRML provides a flexible data communication format for representing in situ and ex situ conservation status of individual taxa as well as their utilisation potential. The development of the schema highlights a number of instances where additional standards-development may be valuable, particularly with regard to the representation of population-specific data and utilisation potential. As crop wild relatives are intrinsically no different to other wild plant species there is potential for the inclusion of CWRML data elements in the emerging standards for representation of biodiversity data.
The original PRIDE Converter tool greatly simplified the process of submitting mass spectrometry (MS)-based proteomics data to the PRIDE database. However, after much user feedback, it was noted that the tool had some limitations and could not handle several user requirements that were now becoming commonplace. This prompted us to design and implement a whole new suite of tools that would build on the successes of the original PRIDE Converter and allow users to generate submission-ready, well-annotated PRIDE XML files. The PRIDE Converter 2 tool suite allows users to convert search result files into PRIDE XML (the format needed for performing submissions to the PRIDE database), generate mzTab skeleton files that can be used as a basis to submit quantitative and gel-based MS data, and post-process PRIDE XML files by filtering out contaminants and empty spectra, or by merging several PRIDE XML files together. All the tools have both a graphical user interface that provides a dialog-based, user-friendly way to convert and prepare files for submission, as well as a command-line interface that can be used to integrate the tools into existing or novel pipelines, for batch processing and power users. The PRIDE Converter 2 tool suite will thus become a cornerstone in the submission process to PRIDE and, by extension, to the ProteomeXchange consortium of MS-proteomics data repositories.
Objective: The Digital Imaging and Communications in Medicine (DICOM) Structured Reporting (SR) standard improves the expressiveness, precision, and comparability of documentation about diagnostic images and waveforms. It supports the interchange of clinical reports in which critical features shown by images and waveforms can be denoted unambiguously by the observer, indexed, and retrieved selectively by subsequent reviewers. It is essential to provide access to clinical reports across the health care enterprise by using technologies that facilitate information exchange and processing by computers as well as provide support for robust and semantically rich standards, such as DICOM. This is supported by the current trend in the healthcare industry towards the use of Extensible Markup Language (XML) technologies for storage and exchange of medical information. The objective of the work reported here is to develop XML Schema for representing DICOM SR as XML documents.
Design: We briefly describe the document type definition (DTD) for XML and its limitations, followed by XML Schema (the intended replacement for DTD) and its features. A framework for generating XML Schema for representing DICOM SR in XML is presented next.
Measurements: None applicable.
Results: A schema instance based on an SR example in the DICOM specification was created and validated against the schema. The schema is being used extensively in producing reports on Philips Medical Systems ultrasound equipment.
Conclusion: With the framework described it is feasible to generate XML Schema using the existing DICOM SR specification. It can also be applied to generate XML Schemas for other DICOM information objects.
XML is ubiquitously used as an information exchange platform for web-based applications in healthcare, life sciences, and many other domains. Proliferating XML data are now managed through the latest native XML database technologies. XML data sources conforming to common XML schemas can be shared and integrated with syntactic interoperability. Semantic interoperability can be achieved through semantic annotations of data models using common data elements linked to concepts from ontologies. In this paper, we present a framework and software system to support the development of semantically interoperable XML based data sources that can be shared through a Grid infrastructure. We also present our work on supporting semantically validated XML data through semantic annotations for XML Schema, semantic validation and semantic authoring of XML data. We demonstrate the use of the system for a biomedical database of medical image annotations and markups.
Biomedical Data Management; XML Database; Data Integration; Semantic Interoperability
Two years after the World Wide Web Consortium (W3C) published the first specification of the eXtensible Markup Language (XML), there exist some concrete tools and applications to work with XML-based data. In particular, new generation Web browsers offer great opportunities to develop new kinds of medical, web-based applications. There are several data-exchange formats in medicine, which have been established in recent years: HL-7, DICOM, EDIFACT and, in the case of Germany, xDT. While communication and information exchange become increasingly important, the development of appropriate and necessary interfaces causes problems, rising costs and effort. It has also been recognised that it is difficult to define a standardised interchange format for one of the major future developments in medical telematics: the electronic patient record (EPR) and its availability on the Internet. Whereas XML, especially in an industrial environment, is celebrated as a generic standard and a solution for all problems concerning e-commerce, in a medical context only a few applications have been developed. Nevertheless, the medical environment is an appropriate area for building XML applications: as information and communication management becomes increasingly important in medical businesses, the role of the Internet changes quickly from an information to a communication medium. The first XML based applications in healthcare show us the advantage of a future engagement of the healthcare industry in XML: such applications are open, easy to extend and cost-effective. Additionally, XML is much more than a simple new data interchange format: many proposals for data query (XQL), data presentation (XSL) and other extensions have been submitted to the W3C and partly realised in medical applications.
XML; Standards; Internet; Intranet; Electronic Patient Record
Tissue MicroArrays (TMAs) are a high throughput technology for rapid analysis of protein expression across hundreds of patient samples. Often, data relating to TMAs is specific to the clinical trial or experiment it is being used for, and not interoperable. The Tissue Microarray Data Exchange Specification (TMA DES) is a set of eXtensible Markup Language (XML)-based protocols for storing and sharing digitized Tissue Microarray data. XML data are enclosed by named tags which serve as identifiers. These tag names can be Common Data Elements (CDEs), which have a predefined meaning or semantics. By using this specification in a laboratory setting with increasing demands for digital pathology integration, we found that the data structure lacked the ability to cope with digital slide imaging with respect to web-enabled digital pathology systems and advanced scoring techniques.
Materials and Methods:
By employing user centric design, and observing behavior in relation to TMA scoring and associated data, the TMA DES format was extended to accommodate the current limitations. This was done with specific focus on developing a generic tool for handling any given scoring system, and utilizing data for multiple observations and observers.
DTDs were created to validate the extensions of the TMA DES protocol, and a test set of data containing scores for 6,708 TMA core images was generated. The XML was then read into an image processing algorithm to utilize the digital pathology data extensions, and scoring results were easily stored alongside the existing multiple pathologist scores.
By extending the TMA DES format to include digital pathology data and customizable scoring systems for TMAs, the new system facilitates the collaboration between pathologists and organizations, and can be used in automatic or manual data analysis. This allows complying systems to effectively communicate complex and varied scoring data.
CDEs; DTD; tissue microarray; TMA DES; virtual pathology; XML
OmicsHub Proteomics integrates all the steps of a mass spectrometry experiment in a single platform, reducing time and data management complexity. The proteomics data automation and data management/analysis provided by OmicsHub Proteomics address the typical problems your lab members face on a daily basis and simplify tasks such as multiple search engine support, pathways integration, or custom report generation for external customers. OmicsHub has been designed as a central data management system to collect, analyze, and annotate proteomics experimental data, enabling users to automate tasks. OmicsHub Proteomics helps laboratories to easily meet proteomics standards such as PRIDE or FuGE and works with controlled vocabulary experiment annotation. The software enables your lab members to take greater advantage of the unique capabilities of the Mascot and Phenyx search engines for protein identification. Multiple searches can be launched at once, allowing peak list data from several spots or chromatograms to be sent concurrently to Mascot/Phenyx. OmicsHub Proteomics works for both LC and gel workflows. The system allows users to store and compare proteomics data generated from different mass spectrometry instruments in a single platform instead of having specific software for each of them. It is a web application that installs on a single server and needs only a web browser for access. All experimental actions are user-stamped and date-stamped, allowing audit tracking of every action performed in OmicsHub. The main features of OmicsHub Proteomics include protein identification, biological annotation, report customization, the PRIDE standard, pathways integration, grouping of protein results to remove redundancy, peak filtering, and FDR cutoff for decoy databases. OmicsHub Proteomics is flexible enough to allow parsers for new file formats to be added easily, and fits your budget with a very competitively priced perpetual license.
Many proteomics initiatives require integration of all information with uniform criteria, from collection of samples and data display to publication of experimental results. Integrating and exchanging these data, which differ in format and structure, poses a great challenge. XML technology shows promise in handling this task because of its simplicity and flexibility. Nasopharyngeal carcinoma (NPC) is one of the most common cancers in southern China and Southeast Asia, with marked geographic and racial differences in incidence. Although several cancer proteome databases now exist, there is still no NPC proteome database.
The raw NPC proteome experiment data were captured in a single XML document with the Human Proteome Markup Language (HUP-ML) editor and imported into the native XML database Xindice. The 2D/MS repository of the NPC proteome was constructed with Apache, PHP and Xindice to provide access to the database via the Internet. On our website, two methods, keyword query and click query, are provided for accessing the entries of the NPC proteome database.
Our 2D/MS repository can be used to share the raw NPC proteomics data that are generated from gel-based proteomics experiments. The database, as well as the PHP source codes for constructing users' own proteome repository, can be accessed at .
Proteomics is a research area that has greatly benefited from the development of new and improved analysis technologies, and large amounts of data have been generated by proteomic analysis as a consequence. Previously, the storage, management and analysis of these data were done manually. This is, however, incompatible with the volume of data generated by modern proteomic analysis. Several attempts have been made to automate the tasks of data analysis and management. In this work we propose PRODIS (Proteomics Database Integrated System), a system for proteomic experimental data management. The proposed system enables efficient management of the proteomic experimentation workflow, simplifies control of experiments and associated data, and establishes links between similar experiments through the experiment tracking function.
PRODIS is fully web based, which simplifies data upload and gives the system the flexibility necessary for use in complex projects. Data from Liquid Chromatography, 2D-PAGE and Mass Spectrometry experiments can be stored in the system. Moreover, it is simple to use: researchers can insert experimental data directly as experiments are performed, without the need to configure the system or change their experiment routine. PRODIS has a number of important features, including a password-protected system in which each screen for data upload and retrieval is validated; users have different levels of clearance, which allow the execution of tasks according to the user clearance level. The system allows the upload and parsing of files, and the storage and display of experiment results and images, in the main formats used in proteomics laboratories. For chromatography, the chromatograms and lists of peaks resulting from separation are stored. For 2D-PAGE, images of gels and the files resulting from the analysis are stored, containing information on the positions of spots as well as their intensity, volume and other values. For Mass Spectrometry, PRODIS provides a function for completing the mapping plate that allows the user to correlate the positions in plates with the samples separated by 2D-PAGE. Furthermore, PRODIS allows the tracking of experiments from the first stage until the final step of identification, enabling efficient management of the complete experimental process.
The construction of data management systems for Proteomics data importing and storing is a relevant subject. PRODIS is a system complementary to other proteomics tools that combines a powerful storage engine (the relational database) and a friendly access interface, aiming to assist Proteomics research directly at data handling and storage.
Researchers who use MEDLINE for text mining, information extraction, or natural language processing may benefit from having a copy of MEDLINE that they can manage locally. The National Library of Medicine (NLM) distributes MEDLINE in eXtensible Markup Language (XML)-formatted text files, but it is difficult to query MEDLINE in that format. We have developed software tools to parse the MEDLINE data files and load their contents into a relational database. Although the task is conceptually straightforward, the size and scope of MEDLINE make the task nontrivial. Given the increasing importance of text analysis in biology and medicine, we believe a local installation of MEDLINE will provide helpful computing infrastructure for researchers.
We developed three software packages that parse and load MEDLINE, and ran each package to install separate instances of the MEDLINE database. For each installation, we collected data on loading time and disk-space utilization to provide examples of the process in different settings. Settings differed in terms of commercial database-management system (IBM DB2 or Oracle 9i), processor (Intel or Sun), programming language of installation software (Java or Perl), and methods employed in different versions of the software. The loading times for the three installations were 76 hours, 196 hours, and 132 hours, and disk-space utilization was 46.3 GB, 37.7 GB, and 31.6 GB, respectively. Loading times varied due to a variety of differences among the systems. Loading time also depended on whether data were written to intermediate files or not, and on whether input files were processed in sequence or in parallel. Disk-space utilization depended on the number of MEDLINE files processed, amount of indexing, and whether abstracts were stored as character large objects or truncated.
Relational database (RDBMS) technology supports indexing and querying of very large datasets, and can accommodate a locally stored version of MEDLINE. RDBMS systems support a wide range of queries and facilitate certain tasks that are not directly supported by the application programming interface to PubMed. Because there is variation in hardware, software, and network infrastructures across sites, we cannot predict the exact time required for a user to load MEDLINE, but our results suggest that performance of the software is reasonable. Our database schemas and conversion software are publicly available at .
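The parse-and-load step described above can be illustrated with a minimal sketch. It is not the authors' software: it assumes Python's standard library and SQLite in place of the Java/Perl loaders and the DB2/Oracle systems named in the abstract, it reads only three fields (PMID, ArticleTitle, AbstractText), and the table name `citation` and the sample records are made up for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Tiny inline sample shaped like a MEDLINE citation set (illustrative data only).
SAMPLE_XML = """<MedlineCitationSet>
  <MedlineCitation>
    <PMID>1</PMID>
    <Article>
      <ArticleTitle>First example title</ArticleTitle>
      <Abstract><AbstractText>First example abstract.</AbstractText></Abstract>
    </Article>
  </MedlineCitation>
  <MedlineCitation>
    <PMID>2</PMID>
    <Article>
      <ArticleTitle>Second example title</ArticleTitle>
    </Article>
  </MedlineCitation>
</MedlineCitationSet>"""


def parse_citations(xml_text):
    """Yield (pmid, title, abstract) tuples from a citation-set XML string."""
    root = ET.fromstring(xml_text)
    for citation in root.iter("MedlineCitation"):
        pmid = int(citation.findtext("PMID"))
        title = citation.findtext("Article/ArticleTitle", default="")
        # findtext returns None when the record has no abstract.
        abstract = citation.findtext("Article/Abstract/AbstractText")
        yield pmid, title, abstract


def load_citations(conn, xml_text):
    """Create the (hypothetical) citation table if needed and load all records."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS citation ("
        "pmid INTEGER PRIMARY KEY, title TEXT NOT NULL, abstract TEXT)"
    )
    rows = list(parse_citations(xml_text))
    with conn:  # commit all inserts as one transaction
        conn.executemany("INSERT OR REPLACE INTO citation VALUES (?, ?, ?)", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = load_citations(conn, SAMPLE_XML)

# A query of the kind a local relational copy makes easy.
titles_without_abstract = [
    row[0] for row in conn.execute("SELECT title FROM citation WHERE abstract IS NULL")
]
```

For files of realistic size, a streaming parser such as `xml.etree.ElementTree.iterparse` would likely be preferable to reading a whole file into memory, and batching the inserts is one of the choices that affects loading time.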
The global analysis of proteins is now feasible due to improvements in techniques such as two-dimensional gel electrophoresis (2-DE), mass spectrometry, yeast two-hybrid systems and the development of bioinformatics applications. The experiments form the basis of proteomics, and present significant challenges in data analysis, storage and querying. We argue that a standard format for proteome data is required to enable the storage, exchange and subsequent re-analysis of large datasets. We describe the criteria that must be met for the development of a standard for proteomics. We have developed a model to represent data from 2-DE experiments, including difference gel electrophoresis along with image analysis and statistical analysis across multiple gels. This part of proteomics analysis is not represented in current proposals for proteomics standards. We are working with the Proteomics Standards Initiative to develop a model encompassing biological sample origin, experimental protocols, a number of separation techniques and mass spectrometry. The standard format will facilitate the development of central repositories of data, enabling results to be verified or re-analysed, and the correlation of results produced by different research groups using a variety of laboratory techniques.
Motivation: The world-wide community of life scientists has access to a large number of public bioinformatics databases and tools, which are developed and deployed using diverse technologies and designs. More and more of these resources offer a programmatic web-service interface. However, efficient use of the resources is hampered by the lack of widely used, standard data-exchange formats for the basic, everyday bioinformatics data types.
Results: BioXSD has been developed as a candidate for a standard, canonical exchange format for basic bioinformatics data. BioXSD is represented by a dedicated XML Schema and defines syntax for biological sequences, sequence annotations, alignments and references to resources. We have adapted a set of web services to use BioXSD as the input and output format, and implemented a test-case workflow. This demonstrates that the approach is feasible and provides smooth interoperability. Semantics for BioXSD are provided by annotation with the EDAM ontology. We discuss in a separate section how BioXSD relates to other initiatives and approaches, including existing standards and the Semantic Web.
Availability: The BioXSD 1.0 XML Schema is freely available at http://www.bioxsd.org/BioXSD-1.0.xsd under the Creative Commons BY-ND 3.0 license. The http://bioxsd.org web page offers documentation, examples of data in BioXSD format, example workflows with source codes in common programming languages, an updated list of compatible web services and tools and a repository of feature requests from the community.
Molecular interaction information is a key resource in modern biomedical research. Publicly available data have previously been provided in a broad array of diverse formats, making these data very difficult to access. The publication and wide implementation of the Human Proteome Organisation Proteomics Standards Initiative Molecular Interactions (HUPO PSI-MI) format in 2004 was a major step towards the establishment of a single, unified format in which molecular interactions should be presented, but it focused purely on protein-protein interactions.
The HUPO-PSI has further developed the PSI-MI XML schema to enable the description of interactions between a wider range of molecular types, for example nucleic acids, chemical entities, and molecular complexes. Extensive details about each supported molecular interaction can now be captured, including the biological role of each molecule within that interaction, detailed description of interacting domains, and the kinetic parameters of the interaction. The format is supported by data management and analysis tools and has been adopted by major interaction data providers. Additionally, a simpler, tab-delimited format MITAB2.5 has been developed for the benefit of users who require only minimal information in an easy to access configuration.
The PSI-MI XML2.5 and MITAB2.5 formats have been jointly developed by interaction data producers and providers from both the academic and commercial sector, and are already widely implemented and well supported by an active development community. PSI-MI XML2.5 enables the description of highly detailed molecular interaction data and facilitates data exchange between databases and users without loss of information. MITAB2.5 is a simpler format appropriate for fast Perl parsing or loading into Microsoft Excel.
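To illustrate why a tab-delimited format is easy to parse, the following is a minimal sketch in Python (the abstract mentions Perl; the idea is the same). The 15 column names reflect our reading of the MITAB2.5 column order and should be checked against the PSI-MI documentation; the conventions assumed here are "|" between multiple values and "-" for an empty field. The identifiers in the sample row are made up.

```python
import csv
import io

# Column labels are our own shorthand; the order is an assumption to verify
# against the MITAB2.5 specification.
MITAB25_COLUMNS = [
    "id_a", "id_b", "alt_ids_a", "alt_ids_b", "aliases_a", "aliases_b",
    "detection_methods", "first_authors", "publication_ids",
    "taxid_a", "taxid_b", "interaction_types", "source_databases",
    "interaction_ids", "confidence_scores",
]


def split_multi(field):
    """Split a multi-valued field on '|'; '-' marks an empty field.

    Simplification: a '|' inside a quoted value is not handled.
    """
    return [] if field == "-" else field.split("|")


def parse_mitab(text):
    """Return one dict per interaction row, keyed by MITAB25_COLUMNS."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    records = []
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and header/comment lines
        if len(row) < len(MITAB25_COLUMNS):
            raise ValueError(f"expected {len(MITAB25_COLUMNS)} columns, got {len(row)}")
        records.append(dict(zip(MITAB25_COLUMNS, row)))
    return records


# One made-up interaction row, built field by field so the tabs are explicit.
sample_row = "\t".join([
    "uniprotkb:P00001", "uniprotkb:P00002", "-", "-",
    "uniprotkb:geneA(gene name)", "uniprotkb:geneB(gene name)",
    "-", "-", "pubmed:0000000|pubmed:0000001",
    "taxid:9606", "taxid:9606", "-", "-", "-", "-",
])
records = parse_mitab("#header line\n" + sample_row + "\n")
publications = split_multi(records[0]["publication_ids"])
```

Each row becomes a flat dictionary, which is also the shape that loads directly into a spreadsheet; the richer nesting of PSI-MI XML2.5 is what this format trades away.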
In today’s proteomics research, various techniques, instruments and bioinformatics tools are necessary to manage the large amount of heterogeneous data, with automatic quality control, to produce reliable and comparable results. A data-processing pipeline is therefore mandatory for data validation and comparison in a data-warehousing system. The proteome bioinformatics platform ProteinScape has been proven to cover these needs. The reprocessing of HUPO BPP participants’ MS data was done within ProteinScape. The reprocessed information was transferred into the global data repository PRIDE.
ProteinScape as a data-warehousing system covers two main aspects: archiving relevant data of the proteomics workflow and information extraction functionality (protein identification, quantification and generation of biological knowledge). As a strategy for automatic data validation, different protein search engines are integrated. Result analysis is performed using a decoy database search strategy, which allows the measurement of the false-positive identification rate. Peptide identifications across different workflows, different MS techniques, and different search engines are merged to obtain a quality-controlled protein list.
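The decoy-based measurement of the false-positive rate mentioned above can be sketched as follows. This is a generic target-decoy estimate (decoy hits divided by target hits above a score threshold), not necessarily the calculation ProteinScape performs; the function name, the score scale and the sample hits are assumptions made for illustration.

```python
def estimate_fdr(scored_hits, threshold):
    """Estimate the false discovery rate among hits scoring >= threshold.

    scored_hits: iterable of (score, is_decoy) pairs, where is_decoy is True
    for matches to the decoy database. Assumes a decoy hit is about as likely
    as a false target hit, so decoys / targets approximates the FDR.
    """
    targets = 0
    decoys = 0
    for score, is_decoy in scored_hits:
        if score < threshold:
            continue
        if is_decoy:
            decoys += 1
        else:
            targets += 1
    if targets == 0:
        return 0.0  # no accepted target hits, so nothing to be wrong about
    return decoys / targets


# Made-up identification scores: (score, matched a decoy sequence?)
hits = [
    (95, False), (90, False), (85, True), (80, False),
    (70, False), (60, True), (50, False),
]

fdr_at_65 = estimate_fdr(hits, 65)  # 1 decoy / 4 targets
fdr_at_90 = estimate_fdr(hits, 90)  # 0 decoys / 2 targets
```

Raising the threshold until the estimate drops below a chosen cutoff is the usual way such an estimate is turned into a filter; other estimators exist (for example, for concatenated target-decoy searches) and give somewhat different values.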
The proteomics identifications database (PRIDE), as a public data repository, is an archiving system where data are finally stored and no longer changed by further processing steps. Data submission to PRIDE is open to proteomics laboratories generating protein and peptide identifications. An export tool has been developed for transferring all relevant HUPO BPP data from ProteinScape into PRIDE using the PRIDE.xml format.
The EU-funded ProDac project will coordinate the development of software tools covering international standards for the representation of proteomics data. The implementation of data submission pipelines and systematic data collection in public standards–compliant repositories will cover all aspects, from the generation of MS data in each laboratory to the conversion of all the annotating information and identifications to a standardized format. Such datasets can be used in the course of publishing in scientific journals.