The growth in size, types and subtypes, and overall number of large scale ‘omics datasets has led to a large increase in the diversity, complexity, and rate of change of bioinformatics analyses. In turn, these complex and evolving needs have led to an increasing emphasis on development of analysis pipelines. Here, we developed a series of workflows that addressed microarray data set analysis. Microarray analysis pipelines are complex, subject to alteration with the development of new analytical tools, and also growing in complexity due to the development of new types of resources for output analysis, such as gene pathway systems [
38], and new demands for integration with other sources of data (e.g. gene expression data with metabolomic data - [
39]). We found that the Kepler system has functional and appropriate capabilities for microarray data analysis pipeline creation. The Kepler workflow model, and in particular its graphical implementation, provides significant advantages for pipeline creation by promoting conceptual and visual clarity. Finally, Kepler workflow programming allows a variety of overall design approaches; we create microarray data analysis workflows using several types of programming approaches.
Microarray data analysis will become more complex in the future. First, there are increasing current needs for microarray analysis involving integration of data from different gene expression platforms (e.g. integrating data from Illumina and Affymetrix gene expression systems) and different designs (e.g. exon microarrays; [
3]). Analyses involving different platforms will require much more extensive analysis pipelines. Also, there are increasing needs for comparison of gene expression data with ChIP-chip or ChIP-seq (chromatin immunoprecipitation followed by sequencing) and comparative genomic hybridization (CGH; to determine copy number variation in genomic segments) datasets, which again will require more extensive pipelines [
3]. As analyses become increasingly complex, it will become critical to use workflow systems to allow modifiable, extensible, and easily comprehensible analysis pipelines. We suggest that the workflows produced in this report represent a critical set of components for these future, more extensive pipelines.
Kepler features such as (1) actors with clear inputs and outputs and (2) the graphical nature of workflow development and representation offered significant advantages to us as compared to traditional shell scripting. For example, simple visual inspection quickly reveals that the gene expression workflow (Figure ) includes a “correspondence analysis”, which is an optional step in gene expression workflows. The approach for eliminating this step is also clear: simple removal of the actor. Similarly, if the method for determining differentially expressed genes (DEGs) should be changed, the workflow visually indicates the correct actor to alter - the ‘DEG selection’ actor must be modified. Most importantly, we found that Kepler offered special advantages for bioinformatics pipelines involving a large number of potential parameters: even a brief glance at the gene expression workflow (Figure ) or the primer design workflow (Figure ) shows which parameters are usually changed and which parameters are usually kept constant. This is enabled by the nature of the Kepler canvas using positioning, text font, text size, and text color to organize parameters into clear visual groups.
Kepler enables a large number of potential software design strategies. This seems to be an advantage in that there is not necessarily a single best design approach for bioinformatics pipelines. For example, an actor could be implemented as a large block of Python code or, alternatively, as a composite actor embodying a workflow. Both implementations of this actor could look identical from a usage standpoint. Overall, the Kepler model favors actors passing data to each other. However, in some cases (e.g. Figure ), we may wish to have the actors simply compose a script to be run externally. Kepler can easily support these different software design paradigms. This means that existing shell scripts and pipelines can often be easily moved into the Kepler system. Hence, a set of useful Kepler workflows and reusable actors can be quickly produced.
Runtime workflow errors are indicated in Kepler with a red box around the actor that encounters the error. This has the advantage of the indicating clearly which stage of the workflow needs to be investigated. However, the actual error output that is passed through Kepler is phrased in programming terms, which will be difficult to interpret for the end user. This problem is shared with many bioinformatics pipelines that rely on simple shell scripting and errors in R/Bioconductor can also be difficult to understand. This problem can be addressed by having the experienced bioinformatician add error-catching tests to the pipelines to produce informative error outputs.
Kepler offers advantages and disadvantages compared to other bioinformatics pipeline development approaches. In the much used “shell-scripting” approach, a series of external programs are invoked. Shell scripting has the advantage of being a well-known and easily accessible approach that can produce very rapid development, in particular for small tasks [
23]. However, shell scripts for complex tasks can require careful study to understand. Furthermore, the mixture of programs and parameters at the point of invocation makes changing parameters confusing. Finally, packaging of shell scripts for distribution involves finding each subprogram and dependency. In contrast, Kepler workflows are often single files because the R, Python, Java etc. code is included in the workflow file (but external programs must be included as separate files). Similarly, developing pipelines within a single programming framework (e.g. BioPerl or R/BioConductor) is attractive for the knowledgeable programmer, but can be very difficult for the outside user to follow, extend, or modify. A second major approach is the use of web-based pipelines. These pipelines often indicate the analysis steps and parameters clearly (e.g. [
10]), feature user interfaces (webpages) that are already known to the user, and can feature attractive, informative outputs that document the parameters that were used. However, these pipelines are usually inaccessible for modification and extension. Also, upload of large datasets to the servers can be slow and problematic.
A third option is the use of other workflow systems (e.g. Taverna: [
14]; jORCA: [
17]). Workflow systems generally share many properties (for review, see [
12,
15]). Hence, selection of a system may rely on specific features of the system. Taverna has been used for many bioinformatics pipelines with a strong emphasis on use of web services [
25]. However, microarray analysis currently features relatively large primary data files and the file sizes are increasing in many cases (e.g. NimbleGen ChIP-chip experiments now are available in 2.1 million probes/array formats). Moving large files from internet site to site, as is necessary for Web services, is usually slow and can be error-prone, producing issues with speed of execution and effectiveness of using web-service oriented frameworks in this exact application area. Hence, we strongly prefer microarray workflows featuring local resources (data and programs). In addition, the published Taverna workflow system for Affymetrix microarrays [
20] uses a more complex Rclient/server model rather than the simple local invocation of R in our workflows. Our use of R is the basic use introduced in R courses and books (e.g. [
34]). Therefore, we believe that our approach is conceptually simpler, has a simpler implementation, and is more familiar to the majority of bioinformaticians, making this workflow easier to comprehend. Also, Kepler displays the parameters and parameter values simply and clearly on the actual canvas with the workflow, and changing these values is very simple (click on value on screen, change value, new value is shown on screen). We strongly prefer this clear display of parameters, which is not present in Taverna workflows. Finally, Kepler has a well-developed provenance module that has great promise for bioinformatics workflows.
Galaxy [
11] is an increasingly popular platform for bioinformatics analyses. At this time, there are no gene expression microarray analysis capabilities in Galaxy. To bring new tools into Galaxy, a description of the tool is required to be produced. In contrast, external tools (e.g. local standalone programs) can be simply and directly invoked within Kepler actors without any need for descriptions of allowable inputs, outputs, or other formats. Furthermore, the current Galaxy canvas does not display parameters with values, unlike the Kepler canvas. Therefore, Kepler offers significant advantages in that parameter values are directly displayed and easily changed and also it is much quicker to integrate existing tools into Kepler. Workflows in this report often invoke external programs and provide many examples of this approach. Finally, Galaxy is based on a web model instead of being a simple standalone program like Kepler. Using Galaxy from a remote location can produce significant inconvenience of uploading data to a server, and a local installation of Galaxy will still use a web interface, a more complicated arrangement than the simple, standalone program that is Kepler.
Kepler has some disadvantages. Like other scientific workflow systems, using Kepler requires a time investment in learning the Kepler environment and the Kepler approach to actor creation. We expect that our production of a series of working bioinformatics pipelines will enable this process. Second, we found that the theoretically best approach for Kepler development - passing data as structures (e.g. arrays or objects) - led to some problems with execution time (e.g. Affymetrix gene expression workflow). We were able to develop a straightforward solution to this issue (creating a script then executing). Finally, there are far fewer prepackaged bioinformatics resources in Kepler as compared to Taverna, but the bioKepler project [
40] should change this situation.
Making Kepler workflows more widely accessible, e.g., through popular workflow repositories [
25], will allow pipeline developers to share workflows more easily, learn from each other, and thus make pipeline development increasingly straightforward