The advent of the post-genomic era in biology has led to a dramatic increase in the amount of multi-dimensional, quantitative data that must be analysed by the bioinformatician. This is especially true in the case of genome-scale analyses of the transcriptome, proteome and metabolome, particularly when such measurements have been made in parallel using high-throughput technologies involving microarray and mass spectrometry techniques [1]. Analyses of these data rely on the performance of in silico experiments, involving the inductive detection of patterns in the data to which some phenotypic significance can be attributed [3]. Such analyses usually rely on statistical testing, and on linking the results of these tests with information stored in biological databases in order to summarise and develop conclusions. For example, the analysis of gene expression data generated from microarray experiments consists of a number of steps. The process begins with the normalisation and standardisation of transcript data, followed by statistical evaluation and, finally, interpretation of the statistical results via the annotation of genes with information relating to their biological function [4].
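The steps described above can be sketched in outline. The following Python fragment is a minimal illustration only: the expression values, the significance threshold and the annotation table are all invented for the example, and in practice these steps would be carried out with dedicated statistical environments such as R.

```python
import math
from statistics import mean, stdev

# Invented example data: raw intensities for two genes across four arrays,
# with the first two arrays from condition 1 and the last two from condition 2.
raw = {
    "gene_a": [120.0, 135.0, 980.0, 1010.0],
    "gene_b": [240.0, 230.0, 250.0, 235.0],
}

# Step 1: normalisation -- here a simple log2 transform of the intensities.
logged = {g: [math.log2(v) for v in vals] for g, vals in raw.items()}

# Step 2: standardisation -- scale each gene to zero mean and unit variance.
def standardise(values):
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

standardised = {g: standardise(vals) for g, vals in logged.items()}

# Step 3: statistical evaluation -- a two-sample t statistic comparing the
# two conditions for each gene.
def t_statistic(values, split=2):
    a, b = values[:split], values[split:]
    se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    return (mean(a) - mean(b)) / se

scores = {g: t_statistic(vals) for g, vals in logged.items()}

# Step 4: interpretation -- annotate genes whose score exceeds an (arbitrary)
# threshold, using an invented annotation table standing in for a database.
annotation = {"gene_a": "stress response", "gene_b": "housekeeping"}
hits = {g: annotation[g] for g, s in scores.items() if abs(s) > 10}
```

Each dictionary corresponds to one stage of the pipeline, so the intermediate results remain inspectable between steps, mirroring the staged nature of the analysis described above.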
There are a number of issues associated with the use of computational tools in the analysis of quantitative data. Firstly, learning how to use such tools for statistical analyses can require significant time and effort. This is especially true for mathematical tools such as MATLAB [5] and R [6], which require prior knowledge of their programming languages and built-in functions in order to implement statistical algorithms. Secondly, there is the overhead of transferring data between computational resources during each step of a data analysis pipeline, which is made more difficult by the inconsistent nature of the user interfaces to these tools. For example, a user may access R from the command line, whilst online sequence databases are queried through a web browser. Piping the output of one resource to another therefore requires intermediate staging of the data so that they may be passed manually amongst multiple tools [7]. Thirdly, the interoperability of computational tools can be awkward due to the heterogeneity of data in bioinformatics. The output data provided by a database service may be incompatible as input to the next analysis service, both in terms of structure and semantics. In these cases, data have to be reconciled by a transformation step in order for them to be consumable by the next service.
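As a concrete illustration of such a reconciliation step, the following Python fragment converts tab-separated records, as one service might return them, into the FASTA format expected by a downstream sequence analysis tool. Both the record layout and the example identifiers are invented for the purpose of illustration.

```python
def tsv_to_fasta(tsv_text):
    """Reconcile tab-separated 'identifier<TAB>sequence' records into FASTA."""
    fasta_lines = []
    for line in tsv_text.strip().splitlines():
        identifier, sequence = line.split("\t")
        fasta_lines.append(">" + identifier)   # FASTA header line
        fasta_lines.append(sequence)           # sequence data line
    return "\n".join(fasta_lines) + "\n"

# Output of a hypothetical database query service ...
tsv = "AB00001\tATGGCGTAA\nAB00002\tATGTTTGCATGA\n"
# ... transformed into input suitable for the next analysis service.
fasta = tsv_to_fasta(tsv)
```

In a workflow system, a transformation of this kind typically appears as an explicit "shim" task placed between the two incompatible services.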
In silico experiments on bioinformatics data may be realised as workflows consisting of a pre-defined series of tasks that are related to one another by the flow of data between them. Such workflows can be constructed and enacted using applications such as Kepler [8], Triana [9] and Pegasus [10], which automatically direct the flow of data between the information repositories and computational tools responsible for performing the tasks within an in silico experiment. These workflow systems enable the use of distributed resources which have been deployed using web services, a distributed computing architecture that uses existing Internet communication and data exchange standards to support interoperable application-to-application interaction over a network [11]. Web service-enabled resources provide a web-based application programming interface (API) that is published in a machine-processable format such as the Web Services Description Language (WSDL) [12]. Interaction of client applications with a web service is independent of the computing platform used to host the service. Other systems interact with the web service in a manner prescribed by its interface, using messages which may be enclosed in a SOAP envelope and are typically conveyed over the web in the form of XML.
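The shape of such a SOAP message can be sketched with standard XML tooling. The following Python fragment constructs a minimal SOAP envelope; the service namespace, operation name and parameter are hypothetical and stand in for whatever the service's WSDL description actually prescribes.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
# Hypothetical service namespace and operation, invented for the example.
SVC_NS = "http://example.org/service"

ET.register_namespace("soap", SOAP_NS)
ET.register_namespace("svc", SVC_NS)

# A SOAP message is an Envelope containing a Body, which in turn carries
# the operation to invoke and its parameters.
envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
operation = ET.SubElement(body, f"{{{SVC_NS}}}getExperiment")
param = ET.SubElement(operation, f"{{{SVC_NS}}}experimentName")
param.text = "example_experiment"

# The XML text that would be posted to the service endpoint over HTTP.
message = ET.tostring(envelope, encoding="unicode")
```

A workflow system performs this construction automatically from the WSDL description, so the user never handles the envelope directly.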
The myGrid project has developed a workflow system called Taverna [13], which is capable of invoking several types of local and online tools to perform the various tasks of a constructed workflow [14]. Different processor implementations are used to invoke applications depending on the invocation mechanism, including web services described in WSDL documents as well as those deployed using the Soaplab [15] and BioMoby [16] frameworks. Workflows consisting of these and other types of processors are composed in the Scufl workflow language using the Taverna workbench, typically by an expert user of analysis and data services [14]. In this paper, we report on how the Taverna workflow system can be used for the statistical analysis of quantitative, post-genomic data. Using an example from the transcriptomics domain, we present a workflow which retrieves data from the maxdLoad2 microarray database using customised maxdBrowse web services [17]. The workflow then performs statistical analysis of the gene expression data using R to generate a list of differentially-expressed genes, followed by the annotation of these genes with information stored in biological databases. Furthermore, we show how extra functionality can be incorporated into Taverna using a plugin mechanism developed for its new software architecture, thereby enabling it to be tailored for use in different scientific domains, including transcriptomics.