We use two principal proteomics workflows from the CPB as exemplars to describe the design and implementation of SemPoD:
1. The first is the affinity-purification mass-spectrometry (AP-MS) workflow, which enables the identification of specific protein complexes and thus of proteins that are associated with one another.
2. The second is the shotgun expression proteomics workflow, which identifies and quantifies proteins in an unbiased manner from cells or tissues of interest.
Together, these two workflows account for approximately 50% of all experiments performed in the CPB and have been used in approximately 20 separate projects, generating over 3 Terabytes (TB) of data.
SemPoD was developed using an agile software engineering methodology for rapid and iterative development in close consultation with the users. The agile approach was combined with the Ruby-on-Rails web development framework, which uses a Model-View-Controller (MVC) architecture pattern. The MVC pattern strictly separates the application logic from the user interface, which allows SemPoD to adapt to the changing requirements of translational research studies while keeping a consistent query environment (Figure illustrates the SemPoD architecture).
SemPoD leverages the SysPro ontology as the core resource to support various query functionalities, including "smart filtering" for reducing user effort in composing complex query patterns.
The systems biology provenance (SysPro) ontology
At present, the provenance metadata associated with the different stages of the proteomics workflow at CPB is not collected in a systematic manner. Often, the provenance metadata is stored as hand-written notes in a lab book and is not immediately available for query and analysis of the proteomics dataset. Further, any modification in the experiment protocols or related experiment metadata information makes it difficult to correlate or integrate data from previous runs with new datasets. The use of a variety of terms to describe provenance increases terminological heterogeneity across different projects and makes it difficult to effectively integrate datasets.
Hence, the SysPro ontology was developed to model experiment metadata by re-using and extending existing minimum information reporting guidelines defined by the 'omics community. Several "minimum information" reporting frameworks have been developed and are now part of the Minimum Information for Biological and Biomedical Investigations (MIBBI) project [9], which facilitates the collection and representation of experiment metadata in a variety of scientific domains. The minimum information required for reporting a molecular interaction experiment (MIMIx) framework [10] is part of the MIBBI project and extends the minimum information about a proteomics experiment (MIAPE) framework [11] with additional metadata terms describing interaction information that are used in the experiment workflows at the CPB. Concepts and terms already described in MIMIx, for example "interaction detection method" and "co-immunoprecipitation", were used as initial concepts in the construction of the SysPro ontology. Further, terms specific to the proteomics workflows were added to SysPro to reflect the provenance modeling requirements of the CPB by extending the World Wide Web Consortium (W3C) PROV ontology (PROV-O) [12].
PROV-O is a reference ontology being created by the W3C provenance working group to facilitate provenance interoperability with a set of common provenance-specific classes and relationships. The PROV-O terms can be extended by various domain-specific applications, such as SemPoD [12]. PROV-O consists of three primary classes in the prov namespace (http://www.w3.org/ns/prov), namely (1) prov:Activity, which models processes occurring over a period of time, (2) prov:Entity, which models resources that are described in provenance assertions, and (3) prov:Agent, which represents a specific type of prov:Entity or prov:Activity that is responsible for actions associated with a prov:Activity. The PROV-O classes are linked together with named relationships, such as prov:used and prov:wasAttributedTo, which allow effective modeling of provenance assertions, for example that a cell culture used an "endogenous" bait type. The SysPro ontology extends the PROV-O classes and relationships to model provenance metadata associated with the AP-MS and shotgun expression proteomics workflows. Figure illustrates the class hierarchy and "instance" values of the class "BaitType" in the SysPro ontology.
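The extension pattern described above can be sketched as a handful of subject-predicate-object statements. This is only an illustration under stated assumptions: every identifier with the "syspro:" prefix (including the "tagged" bait type and the "cell_culture_1" run) is a hypothetical placeholder rather than a term taken from the SysPro ontology, and a plain in-memory set stands in for a real RDF store.

```python
# Minimal in-memory triple store. All "syspro:" names are hypothetical
# placeholders, not terms copied from the published SysPro ontology.

class TripleStore:
    """Holds (subject, predicate, object) statements as plain strings."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add((subject, predicate, obj))

    def objects(self, subject, predicate):
        """Return all objects linked from `subject` via `predicate`, sorted."""
        return sorted(o for s, p, o in self._triples
                      if s == subject and p == predicate)

    def instances_of(self, cls):
        """Return all subjects declared as instances of `cls`, sorted."""
        return sorted(s for s, p, o in self._triples
                      if p == "rdf:type" and o == cls)


store = TripleStore()

# Domain classes extend the PROV-O primary classes.
store.add("syspro:CellCulture", "rdfs:subClassOf", "prov:Activity")
store.add("syspro:BaitType", "rdfs:subClassOf", "prov:Entity")

# Instance values of the BaitType class ("tagged" is an invented example).
store.add("syspro:endogenous", "rdf:type", "syspro:BaitType")
store.add("syspro:tagged", "rdf:type", "syspro:BaitType")

# A provenance assertion: one cell culture run used an endogenous bait type.
store.add("syspro:cell_culture_1", "rdf:type", "syspro:CellCulture")
store.add("syspro:cell_culture_1", "prov:used", "syspro:endogenous")

print(store.instances_of("syspro:BaitType"))
print(store.objects("syspro:cell_culture_1", "prov:used"))
```

Keeping the domain classes as subclasses of the PROV-O classes means that a generic provenance query (for example, "what did this activity use?") works without knowing anything about proteomics.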
The SysPro ontology also facilitates cross-linking of 'omics data with a variety of related genomics and clinical datasets, which are annotated with domain ontologies [13]. A rapidly increasing number of biomedical domains, such as genetics, infectious diseases, and cancer, have created ontologies to model their domain information. These domain ontologies have significantly enhanced the use of standardized terminology across these communities. The most notable example is the Gene Ontology (GO), which is widely used to annotate gene-related information consistently across a variety of applications [14].
To allow experiment data generated in the CPB to be linked to external datasets at UniProt (for protein data) and GeneDB, inter-ontology mappings between SysPro, GO, and the Protein Ontology (PRO) [15] can be created semi-automatically, enabling SemPoD to support queries across both internal and external datasets. Currently, SemPoD uses mappings between the SysPro ontology and the underlying proteomics databases for query translation and execution. Figure illustrates the mapping process between the CPB proteomics database and the SysPro ontology. The SysPro ontology allows SemPoD not only to adapt the functionality of the query environment according to user input, but also to improve the performance of the SemPoD query modules.
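One way to picture query translation through such mappings is a lookup table from ontology classes to database columns. The sketch below is not the SemPoD implementation: the table name, column names, and sample rows are all invented for illustration, and an in-memory SQLite database stands in for the CPB proteomics database.

```python
import sqlite3

# Hypothetical mapping from ontology class labels to (table, column) pairs.
CLASS_TO_COLUMN = {
    "Cell line": ("experiment", "cell_line"),
    "Bait gene": ("experiment", "bait_gene"),
}


def translate(class_label, value):
    """Translate one ontology-level condition into parameterized SQL."""
    table, column = CLASS_TO_COLUMN[class_label]
    sql = f"SELECT dataset FROM {table} WHERE {column} = ?"
    return sql, (value,)


# Toy database with invented rows, used only to show the translated query running.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE experiment (dataset TEXT, cell_line TEXT, bait_gene TEXT)"
)
conn.executemany(
    "INSERT INTO experiment VALUES (?, ?, ?)",
    [
        ("run_01", "HCT116", "EPHB2"),
        ("run_02", "HCT116", "CTNNB1"),
        ("run_03", "Embryonic stem", "CTNNB1"),
    ],
)

sql, params = translate("Bait gene", "CTNNB1")
rows = [r[0] for r in conn.execute(sql + " ORDER BY dataset", params)]
print(rows)
```

Because table and column names come only from the fixed mapping and user-supplied values are passed as parameters, the user never has to know the database schema and the values cannot alter the SQL text.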
The SemPoD query environment
SemPoD consists of four main components, namely (1) the SysPro ontology browser, (2) the integrated query builder, (3) the result explorer, and (4) the query manager (Figure ).
SemPoD ontology browser and query builder
The SemPoD query builder component (Figure ) is an intuitive and flexible interface that allows researchers to directly browse the SysPro ontology class hierarchy and select appropriate terms to interactively compose expressive queries. Once the user selects a SysPro ontology class, the query composer automatically populates the "drop-down" menu corresponding to the class, which allows the user to easily select a specific value. For example, if a user selects the class "Cell line", the corresponding drop-down menu is populated with its "instance" values (Embryonic stem, Epiblast stem cell or HCT116) as illustrated in Figure . Further, users can compose complex query patterns by linking query terms with binary logical connectives ("and", "or").
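As a rough illustration of how terms joined by "and" and "or" form a query pattern, the sketch below represents a pattern as a small expression tree and evaluates it against metadata records. The helper names (term, all_of, any_of, matches) and the sample records are assumptions made for this example, and the connectives are written to accept any number of operands for convenience.

```python
def term(class_label, value):
    """Leaf of a query pattern: an ontology class paired with one value."""
    return ("term", class_label, value)


def all_of(*patterns):
    """Join sub-patterns with the "and" connective."""
    return ("and", patterns)


def any_of(*patterns):
    """Join sub-patterns with the "or" connective."""
    return ("or", patterns)


def matches(pattern, record):
    """Evaluate a query pattern against one metadata record (a dict)."""
    kind = pattern[0]
    if kind == "term":
        _, class_label, value = pattern
        return record.get(class_label) == value
    _, children = pattern
    combine = all if kind == "and" else any
    return combine(matches(child, record) for child in children)


# Invented metadata records, for illustration only.
records = [
    {"dataset": "run_01", "Cell line": "HCT116", "Bait gene": "EPHB2"},
    {"dataset": "run_02", "Cell line": "HCT116", "Bait gene": "CTNNB1"},
    {"dataset": "run_03", "Cell line": "Embryonic stem", "Bait gene": "CTNNB1"},
]

# "Bait gene is EPHB2" or ("Bait gene is CTNNB1" and "Cell line is Embryonic stem")
pattern = any_of(
    term("Bait gene", "EPHB2"),
    all_of(term("Bait gene", "CTNNB1"), term("Cell line", "Embryonic stem")),
)

hits = [r["dataset"] for r in records if matches(pattern, r)]
print(hits)
```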
The SemPoD query builder uses the SysPro ontology to support an advanced feature called "smart filtering" that dynamically updates the query interface in response to previous user selections. Figure illustrates this feature with the selection of two classes, namely "Cell line" and "Bait gene", and the corresponding drop-down menus that are automatically populated with instance values of the classes defined in the SysPro ontology. The "smart filtering" approach allows users to quickly compose large query patterns by significantly reducing the time needed to search for and locate appropriate values in the query builder interface.
Further, the "smart filtering" feature leverages instance-level relationships defined in the SysPro ontology, which link only specific instance values with each other. For example, the "EPHB2" instance of the class "Bait gene" is associated only with "HCT116", which is an instance of the class "Cell line". Hence, when the user changes their selection of "bait gene symbol" from "CTNNB1" to "EPHB2", the corresponding instance value for "Cell line" is automatically updated to "HCT116" (Figure ). As discussed in the previous section, the SysPro ontology re-uses the PROV-O relationships to link both classes and instances, reflecting domain-specific information in systems molecular biology. Figure illustrates the use of "prov:hadRole" to link the "Bait gene" and "Cell line" classes and their instances.
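The behavior described above can be sketched as a lookup over instance-level links. This is a simplified illustration, not the SemPoD code: the link between "EPHB2" and "HCT116" comes from the text, whereas the links listed for "CTNNB1" and the function names are assumptions added so that the example has something to filter.

```python
# Instance-level links between "Bait gene" and "Cell line" values.
# Only the EPHB2/HCT116 link is stated in the text; the CTNNB1 links
# are assumed here purely for illustration.
LINKS = {
    ("EPHB2", "HCT116"),
    ("CTNNB1", "HCT116"),
    ("CTNNB1", "Embryonic stem"),
}


def cell_lines_for(bait_gene):
    """Return the cell lines linked to the given bait gene, sorted."""
    return sorted(cell for gene, cell in LINKS if gene == bait_gene)


def update_selection(bait_gene, current_cell_line):
    """Recompute the "Cell line" menu after the bait gene selection changes.

    Returns (options, selected): the filtered menu entries and the value
    that should now be selected (None if the user has to choose again).
    """
    options = cell_lines_for(bait_gene)
    if current_cell_line in options:
        selected = current_cell_line   # previous choice is still valid
    elif len(options) == 1:
        selected = options[0]          # only one candidate, select it
    else:
        selected = None                # ambiguous, leave unselected
    return options, selected


# Switching the bait gene from CTNNB1 to EPHB2 narrows the menu to HCT116.
print(update_selection("EPHB2", "Embryonic stem"))
```

Filtering on instance-level links rather than on class membership alone is what keeps the menus from offering combinations that have no matching experiment.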
SemPoD result explorer
The user can explore the results of their queries in the SemPoD result explorer (Figure ), which lists the project datasets that match the experiment metadata criteria used in the query pattern. In addition, once login credentials have been verified, the result explorer links directly to the underlying LabKey proteomics data browser [16], which is used in the CPB to store the results. This seamless interface with LabKey allows SemPoD to build on existing data management platforms that are already in use by many 'omics centers, without having to re-implement features that are already present.
SemPoD query manager
The user can also save their queries using the 'Save Query' button in the query builder interface (Figure ). A query name and description can be given to identify the query for later use. Figure shows a screenshot of the query manager with a list of all saved queries. A user can select a specific query from the query list, view the query pattern, and re-execute the query if needed. The ability to store commonly used query patterns that can be retrieved later and shared with other researchers is an important feature of SemPoD and has received positive feedback from users at the CPB.
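A minimal sketch of the save, list, and re-execute cycle is shown below. It assumes that a query pattern can be stored as a JSON-serializable list; the class names, method names, and the stand-in execute function are hypothetical and do not describe how SemPoD stores queries.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class SavedQuery:
    name: str
    description: str
    pattern: list  # JSON-serializable query pattern (assumed representation)


class QueryManager:
    """Keeps named query patterns so they can be listed, shared, and re-run."""

    def __init__(self):
        self._queries = {}

    def save(self, name, description, pattern):
        self._queries[name] = SavedQuery(name, description, pattern)

    def list_names(self):
        return sorted(self._queries)

    def get(self, name):
        return self._queries[name]

    def export(self, name):
        """Serialize a saved query so it can be shared with another researcher."""
        return json.dumps(asdict(self._queries[name]), sort_keys=True)

    def rerun(self, name, execute):
        """Re-execute a saved query with the caller-supplied `execute` function."""
        return execute(self._queries[name].pattern)


manager = QueryManager()
manager.save(
    "ephb2_hct116",
    "EPHB2 baits in the HCT116 cell line",
    [["Bait gene", "EPHB2"], ["Cell line", "HCT116"]],
)

print(manager.list_names())
# A stand-in executor that only counts the terms in the pattern.
print(manager.rerun("ephb2_hct116", lambda pattern: len(pattern)))
```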