There are two major requirements on software design within a rapidly changing research environment: the ability to develop and integrate software rapidly, and the assurance that the resulting systems are adaptable to new and unpredictable needs. The methodologies and technologies used within research environments must be able to satisfy these requirements. Methodologies that depend on formalized specifications for processes and data structures are not appropriate for research software infrastructure. An understanding of current informatics technologies, and of their suitability, is essential to ensure that research-driven software projects can be delivered on schedule. Over the last few years there have been many advances in enterprise computing, which mean that data management technologies can now be used effectively in rapidly evolving research environments. These advances mean that, through an appropriate choice of technologies, it is now possible to deliver, within a minimal timeframe, a distributed system which is robust, standardized, loosely coupled and interoperable. Enterprise components can now be rapidly configured to deliver a range of rich functionality (e.g. content management, messaging, state based processing, dynamic discovery, high level orchestration). However, inappropriate technology choices can result in an architecture which is inflexible and difficult to maintain.
Technologies that require a static structure (e.g. schema definitions) are constrained in their usage as they require "top-down" design. This means that traditional application server based technologies may not be appropriate for many aspects of research, as they are largely designed for tasks where: there exists reasonably stable information which can be structured in a DBMS or similar; business logic is required to operate closely on the data; or integration logic or data transformation is required. Alternatively, less structured data management approaches, such as content repositories, can be suitable within research environments. Content repositories serve a different purpose to application servers, but can be used to serve out similar types of information using the same protocols. Content repository technologies can be more appropriate (than an application server) as they provide a high level of flexibility when dealing with unstructured content (e.g. experiment information).
We are not suggesting that more formalized representations of data do not have their place in research environments. Many data sources can be represented as static snapshots, and when it comes to presenting the research results, the use of more standard software development practices and tools is essential. These static approaches become unsuitable in situations where software is being developed to directly support on-going research projects, with unpredictable life cycles and unforeseen requirements.
Layered content management
To manage data arising from ongoing research experiments we adopted an approach using distributed content repositories. Content repositories allow for the development of a formalized structure that can be associated directly with resources (see Figure 2).
Figure 2 Content management systems are designed to handle generic data items. These items are placed within a hierarchical structure. Each session with a content repository is managed, so that changes can be made in a transactional manner. A separate indexing …
The content can be organized into a (typically) hierarchical structure. The nodes in the hierarchy can have an arbitrary complexity (depending upon the requirements) and also have extensible property sets associated with them. The content system also provides for the usual array of horizontal services (e.g. data querying, browsing, management, transactions, concurrency control, security).
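As an illustration of these ideas, the following minimal sketch builds a small hierarchy against a JCR repository and attaches properties to a node. It assumes an in-process Apache Jackrabbit instance, and the node and property names (experiments, instrument, capturedBy, capturedOn) are invented for the example rather than taken from our production configuration.

```java
import java.util.Calendar;
import javax.jcr.Node;
import javax.jcr.Repository;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;
import org.apache.jackrabbit.core.TransientRepository;

public class ContentHierarchySketch {
    public static void main(String[] args) throws Exception {
        // An in-process Jackrabbit repository; credentials are illustrative.
        Repository repository = new TransientRepository();
        Session session = repository.login(
                new SimpleCredentials("admin", "admin".toCharArray()));
        try {
            // Content is organized as a hierarchy of nodes of arbitrary depth.
            Node root = session.getRootNode();
            Node experiment = root.addNode("experiments").addNode("imaging-run-001");

            // Extensible property sets can be attached to any node.
            experiment.setProperty("instrument", "confocal-microscope");
            experiment.setProperty("capturedBy", "j.bloggs");
            experiment.setProperty("capturedOn", Calendar.getInstance());

            // Changes made in a session are applied transactionally on save.
            session.save();
        } finally {
            session.logout();
        }
    }
}
```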
We extended the Java Content Repository (JCR) standard (using the Apache Jackrabbit implementation [9]) to provide a distributed data management system that consists of a series of interlinked content management systems (CMS). Such a federated content repository solution offers three major advantages over a standard database/application server solution:
• Easy to adapt. Within a research environment it is generally difficult to pin down requirements. The requirements change over time, and new functionality is often required at short notice. Content repository systems have a high degree of flexibility, as the content graph can be extended to meet new requirements, and the annotations can be dynamically updated. This means that when new requirements become apparent, the structure can be changed or modified easily, without having to rewrite a schema or change an object layer implementation. It is interesting to note that other disciplines, in particular the design community, have found that such flexibility is invaluable in enabling creativity [10].
• Easy to understand. Any data management solution will involve a high level of complexity, especially in a distributed research environment. This complexity, however, does not mean that the principles of how the system operates cannot be easily understood. Allowing the end-users to build up a good mental model of how the system works removes many obstacles to adoption. By portraying the content management system as "an intelligent file system", positive transfer of knowledge can occur, so that the system is natural and intuitive to use.
• Easy to access. Within a research environment there is little time to be spared for learning (largely transient) informatics systems. This fact, coupled with the low level of formal software training of most researchers (whether computationally inclined or not), means that convenient access to the data is of paramount importance. Content management systems in general, and the system discussed in this paper in particular, have little complexity in terms of object models and data access protocols, as the storage structure is simply a hierarchy and a large number of access protocols can be supported (e.g. direct file I/O, RMI, WebDAV, REST), as sketched below.
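Because standard protocols can be layered over the repositories, client code does not need to understand JCR at all. The sketch below, for example, retrieves a node representation with a plain HTTP GET; the host name and path are hypothetical and only illustrate the style of access.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class HttpAccessSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical HTTP URI addressing an item in one of the content repositories.
        URL url = new URL("http://repository.example.org/instrumentation/imaging-run-001");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Accept", "application/xml");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // representation of the node returned by the repository
            }
        } finally {
            conn.disconnect();
        }
    }
}
```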
Unfortunately, a single content repository system does not offer the level of flexibility that is required. Research groups interact with different facets of research information in diverse ways and with a plethora of goals. This diversity means that the data must be organized in different ways, for example: when capturing the data it may be appropriate to organize information by group and date; when processing the information it can be organized by runs and tool information; and when exploring the information it can be organized by biological significance (e.g. by gene, by strain, by condition).
To allow for this diversity (and adaptability) we deployed a series of loosely coupled distributed content repositories. The resulting distributed content management system (Figure 3) allows different people to interact with the systems directly. To identify resources we have previously [8] used URNs (encoded as LSIDs). We have since migrated to using HTTP URIs, as we feel that HTTP URIs have advantages over URNs for linking between data sources: URIs (like URNs) can support human readable hierarchical naming schemes; Representational State Transfer (REST) operations can be easily appended to URIs; they do not require additional overhead for resolution; and URIs are better aligned with current community efforts.
Figure 3 Schematic diagram showing how loose coupling between content repositories has been used to provide a distributed content management system. The data management system allows for the ad-hoc linking of items between repositories, using URIs. This allows …
The main aim of the distributed content repository system is to enable researchers, and associated research tools, to easily store, manipulate and retrieve the information they want in the form they desire. One repository can span multiple installations, and multiple repositories can reside within one instance. The system we use is designed around a three-layer architecture:
• Instrumentation Layer. This layer is used to capture experimental information. The layer models information in a way that makes storage of the information simpler, so that laboratory scientists can easily add and annotate information that is captured from a variety of instruments. Typically each type of experiment has a distinct (and differently structured) repository.
• Conceptual Layer. This layer is designed to provide a means to generically interact with the information through the use of high level abstract operations. These operations include the aggregation and retrieval of information, and do not necessitate an understanding of the actual information content. Information which is required for manipulating the data is provided as metadata through properties attached to the data files.
• Organizational Layer. This layer provides a project (or researcher) based view on the information, and is therefore designed to have a "biological focus". Typically the content is organized by factors such as disease, organism or molecule. Each researcher, or research group, can individually organize and annotate the data to suit their own requirements.
It is interesting to note that "three layer" architectures commonly occur in many software architectures [11]. This pattern consistently has a lower level that directly reflects the underlying data storage, a middle layer which provides an abstract conceptualization of the data, and an upper level which presents the data in a convenient manner to external applications (see Figures 4 and 5).
Figure 4 The ANSI-SPARC three layer database architecture was proposed in 1975, and is still used in modern RDBMS. The three proposed layers were a physical schema which defined how the data is actually stored (inode information), a conceptual schema which represented …
Figure 5 Design of a system within an application server typically involves three layers: an infrastructure layer which stores the data; a domain layer which is populated with vertical object models; and an application layer which models how an application can …
The instrumentation layer is designed to be the interface point with high throughput equipment and laboratory instrumentation. Data is captured by laboratory scientists via equipment and is stored in a repository that is structured for their convenience. In this way the repository is closely aligned to the experiment and experimenter.
The repositories are mainly used to automatically capture information from high throughput experiments. We are currently using instrument repositories to capture information for high throughput imaging, proteomic and genomic experiments. The instrumentation repositories generally have relatively simple structures, as their purpose is to facilitate data collection. For example, when capturing information for genomics, the data is organized in the repository in a fairly flat hierarchy using a simple name and time stamp mechanism.
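A capture step of this kind might look like the following sketch, which imports the files produced by an instrument run into a flat repository hierarchy keyed by file name and time stamp. The repository path (genomics) and the use of the standard nt:file/nt:resource node types are assumptions made for the example.

```java
import java.io.File;
import java.io.FileInputStream;
import java.util.Calendar;
import javax.jcr.Node;
import javax.jcr.Session;

public class InstrumentCaptureSketch {
    /**
     * Imports the output files of an instrument run into a flat
     * instrumentation repository (layout and names are illustrative).
     */
    public static void importRun(Session session, File runDirectory) throws Exception {
        Node instrumentRoot = session.getRootNode().getNode("genomics");
        for (File dataFile : runDirectory.listFiles()) {
            // One node per captured file, named after the file itself.
            Node item = instrumentRoot.addNode(dataFile.getName(), "nt:file");
            Node content = item.addNode("jcr:content", "nt:resource");
            content.setProperty("jcr:mimeType", "application/octet-stream");
            content.setProperty("jcr:data", new FileInputStream(dataFile));
            content.setProperty("jcr:lastModified", Calendar.getInstance());
            // Controlled-vocabulary metadata would be attached here as further properties.
        }
        session.save(); // the whole run is committed as one transactional change set
    }
}
```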
We specify a limited set of properties (based upon an ontology with an associated controlled vocabulary) which are used to describe the metadata associated with the experiment. This controlled metadata is used in conjunction with any ad-hoc or free text annotations which may be added at a later time by the experimenter. Where possible the metadata is gathered directly from the instrumentation (e.g. by parsing data files generated by microscopy control equipment such as IPLAB); otherwise, we interface with bespoke sample tracking systems (e.g. for genomics we bridge with the SLIMArray sample tracking software [12]). The ontologies used are based upon community standards (e.g. OME [13], MAGE [14]). Access to the repositories is provided through two main mechanisms:
• URI based retrieval – all referenceable items within the repository can be retrieved using an HTTP URI (see the sketch after this list). This interface was built both to provide convenient access to the data and also to allow for the construction of semantic web based applications on top of a JCR instance.
• Standardized access components for retrieving, searching and publishing information to a JCR instance. We have used these components in a variety of applications, including: within a servlet based web application; within standalone applications; and as part of a GenePattern [15] tool set.
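To indicate how the URI based retrieval interface can be realised, the hedged sketch below maps the path portion of an incoming HTTP request onto a JCR path and writes out the properties of the corresponding node. The servlet is simplified (the session would normally be obtained per request from the repository) and is an illustration rather than the production component.

```java
import java.io.IOException;
import javax.jcr.Node;
import javax.jcr.Property;
import javax.jcr.PropertyIterator;
import javax.jcr.Session;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class NodeRetrievalServlet extends HttpServlet {
    private Session session; // obtained from the repository at start-up (omitted here)

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        try {
            // Interpret the path portion of the HTTP URI as a JCR path.
            Node node = session.getNode(req.getPathInfo());
            resp.setContentType("text/plain");
            PropertyIterator properties = node.getProperties();
            while (properties.hasNext()) {
                Property p = properties.nextProperty();
                if (!p.isMultiple()) {
                    resp.getWriter().println(p.getName() + " = " + p.getString());
                }
            }
        } catch (Exception e) {
            resp.sendError(HttpServletResponse.SC_NOT_FOUND);
        }
    }
}
```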
The conceptual layer is designed to aid in tasks that are fundamental in many different research investigations. The layer provides a high level of abstraction for tasks associated with data processing (the "analysis system") and data manipulation (the "aggregation system"). These systems provide common (horizontal) functionality, and are designed to be used by analysis and data processing subsystems.
The analysis system is used to manage, audit and store the results of data analysis/processing pipelines. Each pipeline run is associated with a node in the analysis content repository. Each node defines a structure that accommodates the required inputs, the outputs and logs produced by the process, and the current status (and history) of the execution. In this way, each processing pipeline is loosely coupled with the analysis system, which provides a searchable storage area for input/output files and a logging service. Other software (e.g. administration/monitoring applications) can query the repository to find the status of any distributed process that has been registered. A process that uses the analysis system typically has the same life cycle: when a processing pipeline starts it queries the content repository to find the required input node (the querying is based upon the properties that have been set on the node); status information is sent to the status node (e.g. state transitions, failures, logging); and output data is stored in the output nodes. Due to this loose coupling and simplicity we have used the analysis system to support numerous data processing pipelines, in particular in processing genomics data (see Figures 6 and 7).
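The life cycle described above might be realised along the following lines; the node names (status, output), the property names (runId, state, resultMatrix) and the query are assumptions made to keep the sketch short.

```java
import javax.jcr.Node;
import javax.jcr.Session;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;

public class AnalysisRunSketch {
    /** Illustrative life cycle of one pipeline run against the analysis repository. */
    public static void run(Session session, String runId) throws Exception {
        // 1. Locate the input node using the properties set when the job was submitted.
        QueryManager qm = session.getWorkspace().getQueryManager();
        Query query = qm.createQuery(
                "SELECT * FROM [nt:unstructured] WHERE [runId] = '" + runId + "'",
                Query.JCR_SQL2);
        Node input = query.execute().getNodes().nextNode(); // assumes a single matching node

        // 2. Record state transitions on the status node as the job progresses.
        Node run = input.getParent();
        run.getNode("status").setProperty("state", "RUNNING");
        session.save();

        // ... execute the pipeline using the parameters held on the input node ...

        // 3. Store the results under the output node and mark the run as complete.
        run.getNode("output").setProperty("resultMatrix", "normalised-expression.tsv");
        run.getNode("status").setProperty("state", "COMPLETED");
        session.save();
    }
}
```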
Figure 6 Example view on a gene expression array normalization and analysis pipeline web application. This web application is used to normalize sets of gene expression experiments together so that they can be compared. The researcher can use a variety of search …
Figure 7 Administration and inspection tools can be built against the analysis repository, which show information about specific analysis that was undertaken. These tools simply display the information about completed analysis runs. They can be searched, and the …
The genomics normalization pipeline (Figure 6) uses the analysis system directly. As the analysis system uses a standard pattern of nodes (input, output and status hierarchies), any analysis pipeline can make use of it. The genomics normalization pipeline simply retrieves the required input parameters, executes the job (and updates the status), and then writes the results. By using a pattern of nodes, applications can be loosely coupled so that they can be developed independently of each other with few dependencies, e.g. a job submission application can write to the input nodes (Figure 6), and a separate processing application can receive notifications that are triggered through changes to the status nodes (Figure 7). The loose coupling provides a high level of reliability and allows for rapid application development. The content system also provides a means to ensure reliable auditing of each analysis run. The basic functionality for reading/writing information is provided through standardized modules.
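Notifications of this kind can be implemented with standard JCR observation. The sketch below registers a listener on a hypothetical /analysis/status hierarchy; the path and the reaction to an event are illustrative.

```java
import javax.jcr.Session;
import javax.jcr.observation.Event;
import javax.jcr.observation.EventIterator;
import javax.jcr.observation.EventListener;
import javax.jcr.observation.ObservationManager;

public class StatusListenerSketch {
    /** Registers a listener that is notified when properties change under the status hierarchy. */
    public static void register(Session session) throws Exception {
        ObservationManager om = session.getWorkspace().getObservationManager();
        EventListener listener = new EventListener() {
            public void onEvent(EventIterator events) {
                while (events.hasNext()) {
                    Event event = events.nextEvent();
                    try {
                        // e.g. wake a processing application or refresh a monitoring view
                        System.out.println("Status change at " + event.getPath());
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            }
        };
        om.addEventListener(listener,
                Event.PROPERTY_ADDED | Event.PROPERTY_CHANGED,
                "/analysis/status", // only watch the status hierarchy
                true, null, null, false);
    }
}
```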
One of the aims of the computing advances over the last few years has been run time aggregation. This approach is epitomized by the semantic web, where people can mash up information from a variety of data sources into a single graph. We provide a run time aggregation system specifically for the aggregation of data from different analyses.
When an analysis run has finished, the observing aggregation system can trigger an indexing operation on the results. The indexing is controlled using the properties that are associated with the output of the analysis run (typically properties describing the rows or columns that are to be indexed). As all data within any of the content systems is indexed, it is possible to retrieve data items using free text or direct querying. If, for example, an application (or user) wishes to aggregate all data pertaining to a specific gene, it can query the aggregation subsystem (and other content repositories) so that the aggregated information retrieved consists of both nodes (representing both specific data and any associated metadata) and specific rows (or columns) from data files. In this way, it is possible to extract relevant information from a variety of data sources upon demand. For example, the end result of the normalization pipeline shown in Figure 6 is a large combined replicate data matrix of how mouse gene expression levels change over a myriad of conditions; the aggregation system can be used to automatically extract subsets of this information directly so that they can be presented in a uniform manner (see Figure 8).
Figure 8 The above example Web Application uses a portlet engine to display information from a variety of repositories (and other data sources). The aggregation system can be used to pull together information from different sources. It can be used directly to …
The Web Portal (Figure 8) uses the standardized service to retrieve information from a number of repositories. The service can query multiple repositories, so that the portal makes a request to retrieve URIs for specific types of information that match a specific gene id. This information is then displayed in each of the portlets. The aggregation system allows individual parts of data files (e.g. a specific row in a matrix that corresponds to a particular gene/probe) to be uniquely identified by URIs.
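A gene-centred request of this kind could be served along the following lines, in which a repository is queried and HTTP URIs for the matching items are returned to the portal. The use of a JCR-SQL2 full text query and the baseUri parameter are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import javax.jcr.Node;
import javax.jcr.NodeIterator;
import javax.jcr.Session;
import javax.jcr.query.Query;

public class GeneAggregationSketch {
    /** Returns HTTP URIs for indexed items that mention the given gene identifier. */
    public static List<String> itemsForGene(Session session, String geneId, String baseUri)
            throws Exception {
        Query query = session.getWorkspace().getQueryManager().createQuery(
                "SELECT * FROM [nt:unstructured] AS item WHERE CONTAINS(item.*, $gene)",
                Query.JCR_SQL2);
        query.bindValue("gene", session.getValueFactory().createValue(geneId));

        List<String> uris = new ArrayList<String>();
        for (NodeIterator hits = query.execute().getNodes(); hits.hasNext();) {
            Node hit = hits.nextNode();
            uris.add(baseUri + hit.getPath()); // each referenceable item is exposed as an HTTP URI
        }
        return uris;
    }
}
```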
The organizational layer allows for the construction of a science-oriented façade on top of the other repositories. This façade provides a scientific project based structure, which allows for the custom organization of data and for project specific annotations to be added. This layer provides more than a materialized view of the other layers, as extra information and data relevant to the specific scientific project can be added, searched and retrieved. As all "referenceable" items within any of the repositories are exposed as HTTP URIs, these can be used to provide links between distributed repositories. As all items are uniquely addressable, a generic browsing application can be built (see Figure 9) which provides a view on the interlinked repositories. The browsing application can be used to annotate items so that they can be searched, to organize distributed items, and to link items from different repositories.
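A linking and annotation operation in the organizational layer might look like the following sketch, in which the HTTP URI of an item held in another repository is stored as a property on a project node, together with a free text annotation; the node and property names are illustrative.

```java
import javax.jcr.Node;
import javax.jcr.Session;

public class ProjectLinkSketch {
    /** Links a distributed item into a project hierarchy and annotates it. */
    public static void linkItem(Session session, String projectPath,
                                String itemUri, String annotation) throws Exception {
        Node project = session.getNode(projectPath);
        Node link = project.addNode("linked-item");
        link.setProperty("target", itemUri);        // HTTP URI of the item in another repository
        link.setProperty("annotation", annotation); // project specific, free text annotation
        session.save();
    }
}
```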
Figure 9 Purpose built Web Application for browsing distributed repositories. Although there are a variety of tools available for browsing single site (and single workspace) repositories, the above application allows for the transparent linking and browsing of …
Repository level operations are provided through HTTP encoding, while HTTP URI based operations on individual objects are provided through a REST API.
Figure 10 Proteomics data browsing web application. This is an example of a web application that has been constructed on top of the organizational layer, the application allows for browsing/sorting by annotations and linking of experiments relating to a specific …
The organizational repositories represent a domain-oriented view on the information. The information can be securely retrieved using a variety of protocols (see Figure 11). The flexible access to the data means that: scientists can browse, arrange and annotate the information using a hierarchy that best suits their needs; and domain-centric applications can be rapidly developed using these structures as interfaces to the data.
Figure 11 The use of communication standards allows for a high degree of flexibility in delivery of information. Data stored in files can be either imported or linked into the content repositories, and then annotated. RMI interfaces expose the basic read/write …