The SeqAdapt default download consists of a preconfigured bundle of services. Each of these services can be interchanged with alternative implementations. Additionally, the underlying system makes it possible to develop custom applications and to rapidly integrate new analysis tools (see the Integrating New Analyses section).
The software infrastructure comprises three main services: a sample tracking system, a data management system and a process management system (see Figure ). The sample tracking system's functionality includes sample submission, annotation with controlled vocabularies and file management. The data management system uses Addama to organize the data and to trigger new analyses. The process management system (the Addama Robot) is a lightweight and generic system for executing processing pipelines and persisting inputs and outputs. The Addama Robot allows analyses to be run directly on a dedicated machine that has been configured for a specific analysis (e.g. one that has all the data files, dependencies and resources the analysis requires). Typically, analyses are run either on a server machine or, if the analysis is still under in-house development, on the developer's own computer. The robot is responsible for monitoring jobs and for transparently transporting required data between the repository and the analysis environment.
The data management system uses standard open technologies, including content repositories, which generically store all experimental information; information indexing services, which provide search across all stored data and metadata; and service registries, which allow run-time discovery of content repositories and associated services. Addama also provides an abstraction so that a set of interlinked content repositories can be accessed through a single web application layer. This layer is exposed as a JSON-based RESTful web service. The process management system coordinates jobs using the Java Message Service (JMS). With this configuration, the system can scale from a single computer to a distributed set of execution agents on multiple servers, all listening to a JMS message queue.
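The scaling behaviour described above follows the familiar competing-consumers pattern: any number of execution agents take jobs from one shared queue. The sketch below illustrates only that pattern. It uses Python's in-process queue and threads as a stand-in for a JMS queue and remote agents, and the function name, job fields and handler are hypothetical rather than part of Addama.

```python
import queue
import threading


def run_agents(jobs, n_agents, handler):
    """Process jobs with n_agents competing consumers; return outcomes by job id.

    queue.Queue stands in for the JMS message queue and each thread stands in
    for an execution agent (an illustrative assumption, not the Addama code).
    """
    job_queue = queue.Queue()
    results = {}
    lock = threading.Lock()

    def agent():
        while True:
            job = job_queue.get()
            if job is None:  # sentinel: no more work for this agent
                job_queue.task_done()
                return
            try:
                outcome = handler(job)
            except Exception as exc:  # record the failure, keep the agent alive
                outcome = exc
            with lock:
                results[job["id"]] = outcome
            job_queue.task_done()

    threads = [threading.Thread(target=agent) for _ in range(n_agents)]
    for t in threads:
        t.start()
    for job in jobs:
        job_queue.put(job)
    for _ in threads:
        job_queue.put(None)  # one sentinel per agent
    job_queue.join()
    for t in threads:
        t.join()
    return results
```

Calling run_agents with n_agents set to 1 or to a larger number yields the same results for the same jobs; only the degree of parallelism changes, which is the property that lets such a design grow from one machine to many.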
All components of the SeqAdapt system are loosely coupled to allow for easy replacement with alternative systems. These alternative systems could include different sample tracking systems (e.g. Sequencescape [16]), different persistence stores (e.g. RDBMS) and different analysis tools (e.g. RNA-Seq tools).
This system allows for rapid integration of scientific algorithms using the standardized Addama framework. The integration system is designed to be flexible, and allows any command-line analysis tool to be "plugged in". The framework is suited to developing small-scale analyses as well as to large-scale processing that requires distribution across many machines. Further, all components needed for this system are provided in an easy-to-install package.
The default download for SeqAdapt has been set up for ChIP-Seq analysis, and will process data from the Illumina Genome Analyzer II using the MACS algorithm (see the Availability section for download details). The default install allows the user to submit an analysis, monitor its progress, and then view the result files.
Integrating New Analyses
As discussed above, the enterprise system can be extended in a number of ways. Because the system is based upon distributed, loosely coupled services, any service can be replaced with another that offers the same interface. Each of the major components is built against a technology standard (e.g. the REST web services serve JSON, and the repository conforms to the JCR specification).
New analysis pipelines can be added by writing a mapping module and then registering the analysis with Addama. A benefit of the Addama system is that a script can execute in a preexisting development environment, eliminating the time-consuming task of replicating software installations on a processing server. When an algorithm has reached a mature state and is widely used, the system scales up by installing the execution agent on many servers. A key benefit of this system is that it manages all of the non-scientific functionality needed in this type of processing. This freedom from writing boilerplate infrastructure code allows the computational biologist to focus on developing the needed scientific software.
The Addama Robot (see Figure ) allows these tools to be integrated rapidly. Integration requires that the developer has a rudimentary understanding of Addama and understands how the specific analysis tool works, in particular its data formats.
By way of example, we have integrated the ERANGE [17] RNA-Seq analysis tool. Once ERANGE was installed, the integration work was completed in less than a day. Any analysis script that can be run from the command line can be integrated in the same manner. The steps involved in such integration are:
1. Define input location. This is done by providing a command-line executable wrapper script. This script will define all of the inputs to the ERANGE analysis and execute it. It will read the inputs from a local JSON file downloaded from the Addama service by the Robot.
2. Control outputs by configuring the wrapper script to write all results to an "/outputs" directory. This directory will be in the same location as the script; its creation is also handled by the Robot. Similarly, any log information (e.g. errors, debugging messages) should be written to a "/log" directory.
3. Register the script by configuring the Addama Robot. The robot uses a properties file that defines the wrapper script that is to be executed and a local path where each run will be output. Update these properties to reflect the location of the ERANGE script and the directory where it writes inputs, outputs and log messages.
4. Enable user submissions. To make submitting simple for the user, an optional web application may be developed. This application will take the expected inputs and send them to the Addama system via the REST interface. This same page can also be used to query Addama for the results of the Robot analysis and display those for the user as well.
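Steps 1 and 2 can be illustrated with a minimal wrapper sketch. It is not the ERANGE wrapper used here. The file name inputs.json, the output file names, the run_dir argument and the convention of passing each input as a "--key value" pair are all assumptions made for illustration.

```python
import json
import subprocess
import sys
from pathlib import Path


def run_analysis(run_dir, command):
    """Run one analysis inside run_dir and return the process exit code.

    Assumed (hypothetical) layout: run_dir/inputs.json holds the inputs
    delivered by the Robot; results go to run_dir/outputs, logs to run_dir/log.
    """
    run_dir = Path(run_dir)
    outputs = run_dir / "outputs"
    logs = run_dir / "log"
    # The Robot is described as creating these directories; creating them
    # here as well keeps the sketch runnable on its own.
    outputs.mkdir(parents=True, exist_ok=True)
    logs.mkdir(parents=True, exist_ok=True)

    # Step 1: read the inputs from the local JSON file.
    inputs = json.loads((run_dir / "inputs.json").read_text())

    # Pass each input as a "--key value" pair (illustrative convention only).
    args = list(command)
    for key, value in sorted(inputs.items()):
        args += [f"--{key}", str(value)]

    # Step 2: results go to outputs/, diagnostics go to log/.
    with open(outputs / "result.txt", "w") as out, \
            open(logs / "stderr.log", "w") as err:
        completed = subprocess.run(args, stdout=out, stderr=err, cwd=run_dir)
    return completed.returncode


if __name__ == "__main__" and len(sys.argv) > 2:
    # Usage: wrapper.py RUN_DIR COMMAND [ARGS...]
    sys.exit(run_analysis(sys.argv[1], sys.argv[2:]))
```

Keeping the wrapper this thin means the analysis tool itself needs no knowledge of Addama; only the wrapper has to know where inputs arrive and where results are expected.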
The robot automates the tasks that are required to integrate the analysis with the enterprise system. When the analysis is triggered, the robot is responsible for delivering the inputs to the analysis, starting the analysis and monitoring the outputs. When the analysis completes, the outputs and any associated logs are loaded back into Addama.
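The robot's side of this exchange can be sketched in the same spirit: deliver the inputs, run the analysis, then collect outputs and logs. This is an illustration under stated assumptions, not the Addama Robot implementation. A plain dictionary stands in for the Addama repository, and the names run_job, inputs.json and run.log are hypothetical.

```python
import json
import subprocess
import tempfile
from pathlib import Path


def run_job(inputs, command, store):
    """Deliver inputs, run the analysis, and load outputs and logs into store.

    `store` is a plain dict standing in for the repository; the real Robot
    talks to Addama, whose API is not reproduced here.
    """
    with tempfile.TemporaryDirectory() as tmp:
        run_dir = Path(tmp)
        (run_dir / "outputs").mkdir()
        (run_dir / "log").mkdir()

        # 1. Deliver the inputs to the analysis environment.
        (run_dir / "inputs.json").write_text(json.dumps(inputs))

        # 2. Start the analysis and wait for it to finish.
        with open(run_dir / "log" / "run.log", "w") as log:
            proc = subprocess.run(
                command, cwd=run_dir, stdout=log, stderr=subprocess.STDOUT
            )

        # 3. Load outputs and logs back into the repository stand-in.
        for sub in ("outputs", "log"):
            for path in sorted((run_dir / sub).iterdir()):
                store[f"{sub}/{path.name}"] = path.read_text()
        store["status"] = "completed" if proc.returncode == 0 else "failed"
    return store
```

The analysis command sees only a working directory with an inputs file and two target directories, which is what allows any command-line tool to be driven in this way.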
Walkthrough
A walk-through showing the default workflow for SeqAdapt is given in Figures ,, and . This walk-through shows how the system can be used to capture information about a ChIP-Seq experiment, store the results and then analyze the reads using MACS.