Current problems in the biosciences typically involve several domains of research. They require a scientist to work with different and diverse sets of data. The reconstruction of a metabolic network from sequencing data, for example, employs many of the data types found along the axis of the central dogma, including reconstruction of genome sequences, gene prediction, determination of encoded protein families, and from there to the substrates of enzymes, which then form the metabolic network. In order to work with such a processing pipeline, a scientist has to copy/paste and often transform the data between several bioinformatics web portals by hand. The manual approach involves repetitive tasks and cannot be considered effective or scalable.
In particular, the processing and analysis of small molecules comprises tasks such as filtering, transformation, curation, or migration of chemical data; information retrieval with substructures, reactions, or pharmacophores; and the analysis of molecular data with statistics, clustering, or machine learning to support chemical diversity requirements or to generate quantitative structure-activity/property relationships (QSAR/QSPR models). These processing and analysis procedures are themselves of increasing importance for research areas such as metabolomics and drug discovery. The power and flexibility of the corresponding computational tools thus become essential success factors for the whole research process.
The workflow paradigm addresses the above issues by supplying sets of elementary workers (activities) that can be flexibly assembled in a graphical manner, so that complex procedures can be performed effectively without the need for specific code development or software programming skills. Scientific workflows allow the combination of a wide spectrum of algorithms and resources in a single workspace [1-3]. Earlier problems with iterations over large data sets [4] are completely resolved in version 2.0 due to new implementations in Taverna. Taverna 2 allows control structures such as "while" loops or "if-then-else" constructs. Termination criteria for loops may now be evaluated by listening to a state port [5]. In addition, the user interface of the Taverna 2 workbench has clearly improved: the design and manipulation of workflows in a graphical workflow editor is now supported, and features like copy/paste and undo/redo simplify workflow creation and maintenance [6].
The CDK-Taverna project aims at building a free open-source cheminformatics pipelining solution through the combination of different open-source projects such as Taverna [7], the Chemistry Development Kit (CDK) [8, 9], and the Waikato Environment for Knowledge Analysis (WEKA) [10]. A first integrated version 1.0 of CDK-Taverna was recently released to the public [4]. Version 2.0 was developed to extend the usability and power of CDK-Taverna for different molecular research purposes.
Implementation
The CDK-Taverna 2.0 plug-in makes use of the Taverna plug-in manager for its installation. The manager fetches all necessary information about the plug-in from an XML file located at http://www.ts-concepts.de/cdk-taverna2/plugin/. This file contains the name of the plug-in, its version, the repository location, and the required Taverna version. Once the URL has been submitted to the plug-in manager, the manager automatically downloads all necessary dependencies from the web. After a subsequent restart the plug-in is enabled and the workers are visible in the services. The plug-in uses Taverna version 2.2.1 [6], CDK version 1.3.8 [11], and WEKA version 3.6.4 [12]. Like its predecessor, it uses the Maven 2 build system [13] as well as the Taverna workbench for automated dependency management.
CDK-Taverna 2.0 worker implementation
The CDK-Taverna 2.0 plug-in is designed to be easily extensible: new workers are created by simply inheriting from the single abstract class org.openscience.cdk.applications.taverna.AbstractCDKActivity (which is the analogue of the CDKLocalWorker interface of CDK-Taverna version 1.0). The class is located in the cdk-taverna-2-activity module. It provides all necessary data for the underlying worker registration mechanism, which frees the software developer from handling these tasks manually. The methods that need to be overridden in order to implement a worker are:
• public void addInputPorts(), public void addOutputPorts(): Specify the ports for passing data between workers.
• public String getActivityName(), public String getFolderName(): Return name and folder of a worker.
• public void work(): Entry point for the worker's central algorithm that performs its core function.
• public String getDescription(): Provides descriptive text that explains a worker's function.
• public HashMap<String, Object> getAdditionalProperties(): Specifies additional properties like file extensions, the number of concurrent threads to use, etc.
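The method set listed above can be illustrated with a minimal, self-contained sketch. It does not use the real CDK-Taverna classes: SketchActivity is a hypothetical stand-in for AbstractCDKActivity that only mirrors the listed method names, and the port maps, the run() driver, the property key and the NonEmptyStringFilterActivity worker are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Small demo entry point for the sketch.
final class WorkerSketchDemo {
    public static void main(String[] args) {
        Map<String, Object> data = new HashMap<String, Object>();
        data.put(NonEmptyStringFilterActivity.IN, Arrays.asList("CCO", "", "  ", "c1ccccc1"));
        SketchActivity worker = new NonEmptyStringFilterActivity();
        System.out.println(worker.getActivityName() + ": " + worker.run(data));
    }
}

// Hypothetical stand-in for AbstractCDKActivity (NOT the real class).
// Only the abstract method names are taken from the text above.
abstract class SketchActivity {
    protected final Map<String, Class<?>> inputPorts = new LinkedHashMap<String, Class<?>>();
    protected final Map<String, Class<?>> outputPorts = new LinkedHashMap<String, Class<?>>();
    protected final Map<String, Object> inputs = new HashMap<String, Object>();
    protected final Map<String, Object> outputs = new HashMap<String, Object>();

    public abstract void addInputPorts();
    public abstract void addOutputPorts();
    public abstract String getActivityName();
    public abstract String getFolderName();
    public abstract void work();
    public abstract String getDescription();
    public abstract HashMap<String, Object> getAdditionalProperties();

    // Assumed, simplified driver: declare ports, bind the inputs, run the worker.
    public Map<String, Object> run(Map<String, Object> data) {
        inputPorts.clear();
        outputPorts.clear();
        inputs.clear();
        outputs.clear();
        addInputPorts();
        addOutputPorts();
        for (String port : inputPorts.keySet()) {
            if (!data.containsKey(port)) {
                throw new IllegalArgumentException("Missing input for port: " + port);
            }
            inputs.put(port, data.get(port));
        }
        work();
        return new HashMap<String, Object>(outputs);
    }
}

// Hypothetical example worker: drops null, empty and blank strings from a list.
final class NonEmptyStringFilterActivity extends SketchActivity {
    static final String IN = "Strings";
    static final String OUT = "Filtered Strings";

    @Override public void addInputPorts() { inputPorts.put(IN, List.class); }
    @Override public void addOutputPorts() { outputPorts.put(OUT, List.class); }
    @Override public String getActivityName() { return "Non-Empty String Filter"; }
    @Override public String getFolderName() { return "Sketch Examples"; }
    @Override public String getDescription() { return "Removes empty or blank strings from a list."; }

    @Override
    public HashMap<String, Object> getAdditionalProperties() {
        HashMap<String, Object> properties = new HashMap<String, Object>();
        properties.put("NUMBER_OF_THREADS", 1); // hypothetical property key
        return properties;
    }

    @Override
    @SuppressWarnings("unchecked")
    public void work() {
        List<String> in = (List<String>) inputs.get(IN);
        List<String> kept = new ArrayList<String>();
        for (String s : in) {
            if (s != null && !s.trim().isEmpty()) {
                kept.add(s);
            }
        }
        outputs.put(OUT, kept);
    }
}
```

In this sketch the worker only declares its ports, metadata and core algorithm, while the base class carries the shared plumbing. This is the division of labour the text describes, although the actual plumbing in CDK-Taverna is provided by Taverna and differs from the simplified run() method shown here.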
Finally, a new worker has to be registered to be available in the Taverna workbench. For this purpose Taverna offers the class net.sf.taverna.t2.spi.SPIRegistry.SPIRegistry to register Service Provider Interfaces (SPI). The new worker's fully qualified class name (including its package) has to be added to the file org.openscience.cdk.applications.taverna.AbstractCDKActivity, which contains the names and packages of all available workers. This file is located at cdk-taverna-2-activity-ui/src/main/resources/META-INF/services.
Besides the basic implementation it is possible to define a configuration panel for a worker which allows the specification of parameters. A configuration panel has to inherit from the abstract class org.openscience.cdk.applications.taverna.ActivityConfigurationPanel. The GUI element itself has to be defined in the constructor of the class and may contain any Java Swing element. The following methods are the backbone of a configuration panel:
• public boolean checkValues(): Validates all GUI values.
• public boolean isConfigurationChanged(): After the validity check this method is used to compare the current worker settings with the GUI settings to detect changes.
• public void noteConfiguration(): The properties of the worker are saved in a bean structure. This method updates the configuration bean object with the changes.
• public void refreshConfiguration(): Updates the GUI values themselves.
• public CDKActivityConfigurationBean getConfiguration(): Access to the configuration bean.
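How these five methods might interact can be sketched without the real classes. In the following example SketchConfigPanel and SketchConfigBean are hypothetical stand-ins for ActivityConfigurationPanel and CDKActivityConfigurationBean, the apply() sequence is an assumption about the call order, and refreshConfiguration() is assumed to load the GUI values from the bean. The GUI field is simulated by a plain string so that the sketch runs without a display; a real panel would read the value from a Swing component created in its constructor.

```java
// Small demo entry point for the sketch.
final class ConfigPanelSketchDemo {
    public static void main(String[] args) {
        SketchConfigBean bean = new SketchConfigBean();
        ThreadCountPanel panel = new ThreadCountPanel(bean);
        panel.threadField = "4"; // simulates the user typing into the GUI
        boolean stored = panel.apply();
        System.out.println("stored=" + stored + ", threads=" + bean.getThreads());
    }
}

// Hypothetical stand-in for CDKActivityConfigurationBean (NOT the real class).
final class SketchConfigBean {
    private int threads = 1;
    int getThreads() { return threads; }
    void setThreads(int threads) { this.threads = threads; }
}

// Hypothetical stand-in for ActivityConfigurationPanel; only the five
// abstract method names are taken from the text above.
abstract class SketchConfigPanel {
    public abstract boolean checkValues();
    public abstract boolean isConfigurationChanged();
    public abstract void noteConfiguration();
    public abstract void refreshConfiguration();
    public abstract SketchConfigBean getConfiguration();

    // Assumed call order: validate first, then detect changes, then store.
    // Returns true only if the bean was actually updated.
    public final boolean apply() {
        if (!checkValues()) {
            return false;
        }
        if (!isConfigurationChanged()) {
            return false;
        }
        noteConfiguration();
        return true;
    }
}

// Hypothetical panel that edits a single "number of threads" setting.
final class ThreadCountPanel extends SketchConfigPanel {
    private final SketchConfigBean bean;
    String threadField; // stands in for the text of a Swing text field

    ThreadCountPanel(SketchConfigBean bean) {
        this.bean = bean;
        refreshConfiguration(); // initialise the simulated GUI from the bean
    }

    @Override
    public boolean checkValues() {
        try {
            return Integer.parseInt(threadField.trim()) > 0;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    @Override
    public boolean isConfigurationChanged() {
        // Only called after checkValues() succeeded, so parsing is safe here.
        return Integer.parseInt(threadField.trim()) != bean.getThreads();
    }

    @Override
    public void noteConfiguration() {
        bean.setThreads(Integer.parseInt(threadField.trim()));
    }

    @Override
    public void refreshConfiguration() {
        threadField = String.valueOf(bean.getThreads());
    }

    @Override
    public SketchConfigBean getConfiguration() {
        return bean;
    }
}
```

Keeping validation, change detection and storage in separate methods means an invalid or unchanged entry never touches the bean, which is the purpose of performing the validity check before the comparison.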
Requirements
CDK-Taverna 2.0 supports 64-bit computing when run on a 64-bit Java virtual machine. The CDK-Taverna 2.0 plug-in is written in Java and requires Java 6 or higher. The latest Java version is available at http://www.java.com/de/download/. The CDK-Taverna 2.0 plug-in was developed and tested on Microsoft Windows 7 as well as Linux and Mac OS X (32 and 64-bit).
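Whether a given Java installation meets these requirements can be checked with a few lines of code. The following is a minimal sketch that is not part of CDK-Taverna: the class and method names are hypothetical, and the 64-bit test is only a heuristic based on the standard os.arch system property.

```java
// Hypothetical helper (not part of CDK-Taverna) that inspects the running JVM.
final class RuntimeRequirementCheck {

    // Extracts the major Java version from strings such as "1.6.0_45", "11.0.2" or "17".
    static int majorVersion(String javaVersion) {
        String[] parts = javaVersion.split("[._\\-+]");
        int first = Integer.parseInt(parts[0]);
        // Versions before Java 9 are reported as "1.x", so the second field is the major version.
        return (first == 1 && parts.length > 1) ? Integer.parseInt(parts[1]) : first;
    }

    // Heuristic: architecture names such as "amd64", "x86_64" or "aarch64" contain "64".
    static boolean is64BitArch(String osArch) {
        return osArch != null && osArch.contains("64");
    }

    public static void main(String[] args) {
        String version = System.getProperty("java.version");
        String arch = System.getProperty("os.arch");
        System.out.println("Java 6 or higher: " + (majorVersion(version) >= 6));
        System.out.println("64-bit architecture reported: " + is64BitArch(arch));
    }
}
```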