The approach to semantics-based service composition that we present in this paper builds upon the Bio-jETI [7] framework for model-based, graphical design, execution and management of bioinformatics analysis processes. It has been used in a number of different bioinformatics projects [29] and is continuously evolving as new service libraries and service and software technologies become established. Technically, Bio-jETI uses the jABC modeling framework [33] as an intuitive, graphical user interface and the jETI electronic tool integration platform [35] for dealing with remote services. Using the jABC technology, process models, called Service Logic Graphs (SLGs), are constructed graphically by placing process building blocks, called Service Independent Building Blocks (SIBs), on a canvas and connecting them according to the flow of control. Figure 2 shows a screenshot of the graphical user interface of the jABC. SLGs are directly executable by an interpreter component, and they can be compiled into a variety of target languages via the GeneSys code generation framework [37]. As Figure 3 (bottom) illustrates, GeneSys provides the means for transforming SLGs into native, stand-alone program code (e.g., Java, C++) as well as into other workflow languages (e.g., BPEL).
Figure 2 Bio-jETI GUI. The jABC framework, which provides the graphical user interface for Bio-jETI, supports the orchestration of processes from heterogeneous services. Workflow models are constructed graphically by placing process building blocks from a library ...
Figure 3 Relationship between SLTL and workflow languages. SLTL is designed to specify linear workflows on an abstract level. In conjunction with a set of services and adequate semantic information about the domain, it serves as input for the synthesis algorithm, ...
Workflow development in Bio-jETI is already supported by several plugins of the jABC framework, for instance providing functionality for component validation or step-wise execution of the process model for debugging purposes. Now we are going to exploit further jABC technology, such as model checking and workflow synthesis, in order to enable Bio-jETI to support the development of processes in terms of service semantics.
Model checking [38] can be used for reasoning about properties of process models. This can help to detect problems like undefined data identifiers, missing computations, or type mismatches. Solving these problems might require the introduction of further computational steps, for instance a series of conversion services in case of a data type mismatch. Our approach is to automate the creation of such process parts via a workflow synthesis methodology [40] that automatically creates (linear) workflows from high-level, logical specifications. Figure 3 (top) illustrates the relationship between our specification language SLTL (Semantic Linear Time Logic) and the actual Bio-jETI workflow models, the SLGs: provided with a logical specification of the process and semantically annotated services, the workflow synthesis algorithm generates linear sequences of services, which can be further edited and combined into complex process models on the SLG level.
For the study that we present in this paper we used a SIB collection offering various remote and local services. Examples of the remote services it contains are the data retrieval services provided by the EBI (European Bioinformatics Institute) [44], sequence analysis algorithms offered by BiBiServ (the Bielefeld Bioinformatics Server) [46], web services hosted by the DDBJ (DNA Data Bank of Japan) [47], and some tools of the EMBOSS suite [48]. On the local side, there are specialized components such as a visualizer for phylogenetic trees [49] and more generic ones like SIBs that realize user interaction or functionality for file management. The table of exemplary services lists the fragment of the library that is relevant for our examples.
Exemplary set of services. Fragment of a component library that we used in the examples. The table lists the names of the building blocks (SIBs) along with function descriptions and selected service predicates.
In the jABC, the SIBs are displayed to the user in a taxonomic view, classified according to their position in the file system (by default) or to any other useful criterion, like the provider or the kind of service. The SIBs have user-level documentation, explaining what the underlying tool or algorithm does, that is derived directly from the provider's service descriptions. In addition, the SIBs provide information about their input and output types via a specific interface. This is already an integral part of the semantic information that helps to systematically survey large SIB libraries, and it is used by our process synthesis and model checking methods. It is also possible to add arbitrary annotations to the SIB instances and thereby provide further (semantic) information that is taken into account by our formal methodologies.
The knowledge base that is needed for the process synthesis consists, furthermore, of service and type taxonomies that classify the services and types, respectively. Taxonomies are simple ontologies that relate entities in terms of is-a and has-a relations. These classifications provide sufficient information for our synthesis methodologies.
We assume simple taxonomies for our examples, which have the generic OWL type Thing at the root. Going downwards, classifications are introduced, for instance refining the generic type into integers and strings, where the latter is further distinguished into alignments, trees, sequences, tool outputs, and so on. Figure 4 shows the service taxonomy for the services that we use in our examples, edited in the OntEd ontology editor plugin of the jABC. The corresponding type taxonomy classifying the involved data types is given in Figure 5.
Service taxonomy. Service taxonomy for the services that we use in our examples, edited in OntEd, the ontology editing plugin of the jABC.
Type taxonomy. Type taxonomy classifying the data types involved in our examples.
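A taxonomy of the kind described above can be held as a simple child-to-parent map and queried for subsumption. The following Python fragment is only a minimal sketch under that assumption; the concrete type names and their placement (in particular the leaf `FASTA`) are hypothetical and do not reproduce the taxonomies shown in the figures.

```python
# Hypothetical is-a taxonomy: each type maps to its direct parent.
# "Thing" is the root and therefore has no entry of its own.
PARENT = {
    "Integer": "Thing",
    "String": "Thing",
    "Alignment": "String",
    "Tree": "String",
    "Sequence": "String",
    "ToolOutput": "String",
    "FASTA": "Sequence",  # assumed leaf, for illustration only
}


def ancestors(type_name):
    """Return the chain of types from type_name up to the root, inclusive."""
    chain = [type_name]
    while type_name in PARENT:
        type_name = PARENT[type_name]
        chain.append(type_name)
    return chain


def is_a(type_name, general):
    """True if type_name equals general or is a (transitive) subtype of it."""
    return general in ancestors(type_name)
```

With such a structure, a request for "some sequence" can be matched against every type for which `is_a(t, "Sequence")` holds, which is the kind of generalization and refinement the taxonomies are meant to support.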
The basic input and output information for the services is defined in terms of the data types contained in the type taxonomy. The table of exemplary types lists the data types that are relevant for our examples. The services are characterized by input-output pairs of types, where the input or the output may be empty (as is the case, e.g., for ShowInputDialog and Archaeopteryx, respectively). Services may also provide multiple possible transformations and thus achieve polymorphism. For instance, BiBiServ's ClustalW can process sequences in FASTA or in SequenceML format, and produces FASTA or AlignmentML output, accordingly.
Exemplary set of types. The set of data types that was used in the example processes.
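The characterization of services by input-output pairs can be illustrated with a small data structure. This is a sketch, not the interface that the SIBs actually implement: the class `Service`, the method `outputs_for`, and the type names `Sequence` and `Tree` are assumptions made for the example, while the ClustalW pairs follow the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    # Each pair is (input_type, output_type); None stands for "empty".
    transformations: tuple

    def outputs_for(self, input_type):
        """Return the output types this service offers for input_type."""
        return [out for inp, out in self.transformations if inp == input_type]


# Polymorphic service: one pair per supported transformation.
CLUSTALW = Service(
    "ClustalW",
    (("FASTA", "FASTA"), ("SequenceML", "AlignmentML")),
)

# Services with an empty input or an empty output (type names assumed).
SHOW_INPUT_DIALOG = Service("ShowInputDialog", ((None, "Sequence"),))
ARCHAEOPTERYX = Service("Archaeopteryx", (("Tree", None),))
```

Listing each transformation as its own pair keeps polymorphic services and single-purpose services in the same representation, so a search over services does not need to treat them differently.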
Example 1: a simple phylogenetic analysis workflow
When developing bioinformatics analysis workflows, users often have a clear idea of the inputs and final results, while their conception of the process that actually produces the desired outputs is only vague. Figure 6 (upper left) shows a stub for a workflow: the start SIB (left) is an input dialog for a nucleic or amino acid sequence, followed by a SIB that runs a BLAST query with the entered sequence in order to find homologous sequences. The workflow ends by invoking Archaeopteryx to display a phylogenetic tree (right). The configuration of the SIBs is sound at the component level, as the Local Checker plugin (producing the small overlay icons at the top left) confirms. However, the model as a whole is not configured correctly: the input type required by Archaeopteryx, some phylogenetic tree format, is not produced earlier in the process. This is detected by our model checker GEAR (indicated by a red overlay icon with a white cross in the top right corner of the SIBs), which checks a temporal formula covering the following constraint (please refer to the Methods section for details on the model checking procedure):
An experienced bioinformatician might be aware of the problem immediately, owing to familiarity with the tools involved. This is, however, only a small workflow. Automatic, semantically supported detection of misconfigurations and modeling errors unfolds its full potential when processes become more complex and it is no longer feasible for the in silico researcher to dive into the documentation of all services or to explore their behaviour by trial-and-error executions.
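The kind of error detected in the stub, a service using data that no earlier step has produced, can be illustrated for the special case of a purely linear workflow. The sketch below is not the GEAR model checker, which evaluates temporal formulas on SLGs; it is a simple forward pass, and the function name and the type labels (`sequence`, `blast_result`, `tree`) are hypothetical.

```python
def find_undefined_uses(workflow):
    """Report (step, type) pairs where a step needs a type not yet produced.

    workflow is a list of (name, required_types, produced_types) tuples,
    given in execution order.
    """
    available = set()
    errors = []
    for name, required, produced in workflow:
        for data_type in required:
            if data_type not in available:
                errors.append((name, data_type))
        # Outputs become available only to the steps that follow.
        available.update(produced)
    return errors


# The erroneous stub of example 1, with assumed type labels.
stub = [
    ("ShowInputDialog", [], ["sequence"]),
    ("BLAST", ["sequence"], ["blast_result"]),
    ("Archaeopteryx", ["tree"], []),
]
```

For this stub, `find_undefined_uses(stub)` reports that Archaeopteryx requires a `tree` that was never produced. Workflows with branches and loops need a check over all paths, which is where model checking comes in.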
Once the problem has been detected, there are different ways to fix it. One can look for replacements for one of the involved SIBs that compute essentially the same results, but provide them in a data format that fits into the surrounding process. Another approach, assuming that the user has chosen these services for good reason, is to search for a sequence of additional services that resolve the mismatch and insert them into the process. Such data mediation sub-workflows are usually linear. They can consist of type conversions that simply adapt the involved data, or also of real computational services when the match cannot be realized so easily.
As a means for resolving the violation of property *, the example process model stub implies a process specification adequate as input for our workflow synthesis algorithm (please refer to the Methods section for details). In a high-level formulation, it reads:
Utilizing the semantically annotated SIB collection and domain information from above, and computing the shortest service combination that satisfies the specification, our synthesis algorithm proposes the following simple four-step workflow for the above query (bottom left in Figure 6):
Figure 6 Example 1. A simple phylogenetic analysis process. The upper left shows an erroneous stub for a simple phylogenetic analysis process; it lacks a sequence of services leading from a BLAST result to a phylogenetic tree. Below is the appropriate sequence ...
1. Extract the IDs of the hits from the BLAST result (using a regular expression).
2. Turn the matches into a comma-separated list.
3. Call DBFetch (fetching the corresponding sequences from a database).
4. Run emma (computing a multiple sequence alignment and phylogenetic tree).
The generated sequence of SIBs can now be inserted into the process stub and all parameters configured appropriately. As Figure 6 (right) shows, neither local checking nor model checking reveals any errors any more. The process is now ready for execution. Figure 7 illustrates the corresponding runtime behaviour: the workflow starts by asking the user for a query sequence, then performs a similarity search, data retrieval and sequence analysis before it finally displays the resulting phylogenetic tree.
Figure 7 Execution of example 1. Execution of the simple phylogenetic analysis process. The execution begins with an interactive step, where a dialog is displayed in which the query sequence is entered (top). After some non-interactive steps, the finally available ...
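The search for a shortest service combination in example 1 can be approximated by a breadth-first search over data types. This is a deliberately simplified sketch and not the synthesis algorithm itself, which works from SLTL specifications and taxonomies (see the Methods section). Each service is reduced to a single input-output pair, and all service and type names except DBFetch and emma are hypothetical labels for the four steps listed above.

```python
from collections import deque

# Hypothetical service signatures: name -> (input type, output type).
SERVICES = {
    "ExtractHitIDs": ("blast_result", "id_matches"),
    "MatchesToCSV": ("id_matches", "id_list"),
    "DBFetch": ("id_list", "sequences"),
    "emma": ("sequences", "tree"),
}


def synthesize(start, goal, services):
    """Breadth-first search for a shortest service chain from start to goal.

    Returns a list of service names, or None if no chain exists.
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        current, path = queue.popleft()
        if current == goal:
            return path
        for name, (inp, out) in services.items():
            if inp == current and out not in seen:
                seen.add(out)
                queue.append((out, path + [name]))
    return None
```

Because breadth-first search explores chains in order of increasing length, the first chain that reaches the goal type is also a shortest one. A result of `None` corresponds to the case in which the service collection cannot bridge the gap.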
Example 2: Blast-ClustalW workflow
A simple phylogenetic analysis like the one in the previous example is a frequently recurring element of complex in silico experiments. In many cases, however, a customized, more specific processing of intermediate results is required, as in the Blast-ClustalW workflow [50], which is one of the DDBJ's sample workflows for the Web API for bioinformatics [51]. It is the archetype for our second example.
The Blast-ClustalW workflow [50] has the same inputs and outputs as the simple phylogenetic workflow from example 1: it finds homologous sequences for an input DNA sequence via BLAST and computes a hypothesis about the phylogenetic relationship of the obtained sequences (using ClustalW). The proposed analysis procedure consists of four major computation steps (the blue rectangles in Figure 8), where steps 2 and 3 have to be repeated for each Blast hit that is taken into account (not evident from the figure):
Figure 8 Blast-ClustalW workflow. Blast-ClustalW workflow as sketched by the DDBJ (following ).
1. Call the Blast web service to search the DDBJ database for homologues of a nucleic acid sequence. The input is a 16S RNA sequence in FASTA format, the output lists the database IDs of the similar sequences and basic information about the local alignment, e.g. its range within the sequences.
2. Call the GetEntry web service with a database ID from the Blast output to retrieve the corresponding database entry.
3. Extract accession number, organism name and sequence from the database entry. Trim the sequence to the relevant region using the start and end positions of the local alignment that are available from the BLAST result.
4. Call the ClustalW web service to compute a global alignment and a phylogenetic tree for the prepared sequences.
Due to the loop that is required for repeating steps 2 and 3 a certain number of times, this process cannot be created completely by our current synthesis algorithm, which is restricted to producing linear sequences of services. It is, however, possible to define a sparse process model manually, containing the looping behaviour and other crucial parts, and to subsequently fill in linear parts of the process automatically.
Figure 9 (top) shows an advanced, but still incomplete model of the Blast-ClustalW workflow. As in example 1, the process begins by displaying a dialog for entering the query sequence (start SIB, top left). The result of the subsequent Blast web service invocation is split into the separate results (SIB get blast hits). Before the loop is entered, a maximum is set for the number of hits to be considered in the analysis. The loop's body is executed for this defined maximum number of hits. The current hit is split into its separate elements, e.g. accession number, score, and the start and end positions, within the whole sequence, of the local alignment that was produced by BLAST. The accession number is used to check whether the sequence corresponding to the current hit has already been added to the analysis, in order to avoid duplicate sequences. If a duplicate is detected, the maximum number of hits is incremented, so that another hit can be taken into account. Otherwise, the corresponding entry is fetched from the database using the DDBJ's GetEntry web service (SIB getFASTA_DDBJEntry). The SIBs extract organism and extract sequence are then applied to extract the corresponding information from the DDBJ entry by means of a regular expression. The sequence is formatted, i.e. whitespace is removed, and the start and end positions that are known from the BLAST result are used to cut out the subsequence that actually contributed to the local alignment during the BLAST search. The prepared sequence is then added to the analysis (SIB append sequence). Note that in contrast to the original representation of Figure 8, we see here the structure and the data-driven loops of the actual workflow. Finally, the resulting phylogenetic tree is displayed by Archaeopteryx.
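The control flow of the loop described above can be paraphrased in a few lines of Python. This is an illustrative sketch only: in Bio-jETI the loop is modeled graphically with SIBs, whereas here the function `collect_sequences`, the callback `fetch_entry`, the dictionary keys, and the 1-based inclusive alignment coordinates are all assumptions. The regular-expression extraction of organism and sequence is assumed to have been done inside `fetch_entry`.

```python
import re


def collect_sequences(hits, fetch_entry, max_hits):
    """Collect trimmed sequences for up to max_hits distinct accessions.

    hits: list of dicts with keys "accession", "start", "end".
    fetch_entry: callable returning a dict with "organism" and "sequence".
    """
    seen = set()
    collected = []
    limit = max_hits
    for index, hit in enumerate(hits):
        if index >= limit:
            break
        accession = hit["accession"]
        if accession in seen:
            limit += 1  # duplicate: allow one more hit to be considered
            continue
        seen.add(accession)
        entry = fetch_entry(accession)
        # Format the sequence by removing all whitespace.
        sequence = re.sub(r"\s+", "", entry["sequence"])
        # Cut out the region of the local alignment (1-based, inclusive).
        trimmed = sequence[hit["start"] - 1:hit["end"]]
        collected.append((accession, entry["organism"], trimmed))
    return collected
```

Raising the limit on a duplicate, rather than filtering the hit list beforehand, mirrors the description above, in which the maximum number of hits is incremented so that another hit can be taken into account.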
Figure 9 Example 2. The more complex Blast-ClustalW workflow. The model checking detects three errors for the original process (top). To bridge the gap between the available sequences and the required tree, the emma web service can be inserted, computing a multiple ...
At this stage of the process, the local checking of the components detects no errors, but the model checker reveals problems (overlay icons at the top right): as in the previous example, the SIB Archaeopteryx uses a variable tree, which has not been defined before. Moreover, the SIBs extract organism and extract sequence use a variable ddbjentry, which is defined with an incompatible type. Details on the model checking procedure can be found in the Methods section.
To resolve the first problem, we proceed as in example 1, by providing the synthesis algorithm with a temporal formula that asks for a sequence of services that takes a set of sequences as input (which is the last intermediate result that is computed before Archaeopteryx in the process) and produces a phylogenetic tree (the input that Archaeopteryx expects). As Figure 9 (center) shows, a single call to emma is one of the (shortest) sequences that fulfil this request.
The second problem is the presence of a type ddbjfasta where the type ddbjentry is expected. To solve this mismatch, we ask our synthesis algorithm for a way to derive the latter from the former. It returns an empty result (see Figure 9, center), which means that our SIB collection cannot provide an appropriate sequence of services. We therefore exclude the type ddbjfasta and the SIB getFASTA_DDBJEntry, by which it is produced, and instead use the type ddbjaccession, which was defined last before it, as the starting point for the synthesis. The answer is a service sequence consisting of the SIB getDDBJEntry (center), which we can now substitute for the improper data retrieval SIB from above.
The bottom of Figure 9 shows the completely assembled process. We do not demonstrate its execution behaviour, as it is very similar to that of example 1.
Discussion and perspectives
By means of two examples, the previous sections demonstrated the local checking, model checking and workflow synthesis methodology that is currently available in the jABC framework and thus part of Bio-jETI. The Local Checker plugin provides domain-independent functionality and is already conveniently integrated in the framework. We are now working on a user-friendly integration of the domain-specific model checking and synthesis techniques, especially with regard to the bioinformatics application domain. This ongoing work spans three dimensions, which are discussed in the following sections: domain modeling, model checking, and model synthesis.
Domain modeling
This dimension is at the heart of making information technology available to biologists, as it enables them to express their problems in their own terms, on the basis of adequately designed ontologies. It raises the question of where the domain knowledge should ideally come from. It is, of course, possible for each user to define custom service and type taxonomies, allowing for exactly the generalization and refinement that is required for the special case. However, as the tools and algorithms that are used are mostly third-party services, it is desirable to automatically retrieve domain information from a public knowledge repository as well. Therefore we plan to incorporate knowledge from different publicly available ontologies, like BioMoby [17] and SSWAP [20], and to integrate it into the service and type taxonomies for use by our synthesis methodology.
It is, of course, also necessary that the services themselves are equipped with meta-information in terms of these ontologies. Again, we are looking at BioMoby with interest: numerous institutions have registered their web services at Moby Central, describing functionality and data types in pre-defined structures using a common terminology. Although BioMoby does not yet use standardized description formalisms like SAWSDL, it is already clear that there is semantic information available that we can use as predicates for automatic service classification.
Furthermore, it will be interesting to consider the incorporation of more content-oriented ontologies like the Gene Ontology [22] or the OBO (Open Biomedical Ontologies) [23] into our process development framework. This would allow the software to support process development not only on a technical level, but also in terms of the underlying biological and experimental questions. Additional sources of information, like the provenance ontologies of [52], could also easily be exploited by our synthesis and verification methods.
Model checking
This dimension is meant to systematically and automatically provide biologists with the required IT knowledge in a seamless way, similar to a spell checker that hints at orthographical mistakes, perhaps already proposing a correction. Immediate concrete examples of detectable issues are (cf. the examples presented earlier):
• Missing resources: a process step is missing, so that a required resource is not fetched/produced.
• Mismatching data types: a certain service is not able to work on the data format provided by its predecessor.
However, this is only a first step. Based on adequate domain modeling, made explicit via ontologies/taxonomies, model checking can capture semantic properties to guarantee not only the executability of the biological analysis process but also a good deal of its purpose, and rules of best practice, like:
• All experimental data will eventually be stored in the project repository.
• Unexpected analysis results will always lead to an alert.
• Chargeable services will not be called before permission is given by the user.
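To make the flavour of such rules concrete, the three properties above can be written down in standard linear temporal logic. This is one possible formalization, not the formulas used by our model checker: the atomic propositions are hypothetical, G reads "always", F reads "eventually", and W is the weak until operator.

```latex
\begin{align*}
  &\mathbf{G}\,(\mathit{experimentalData} \rightarrow \mathbf{F}\,\mathit{storedInRepository}) \\
  &\mathbf{G}\,(\mathit{unexpectedResult} \rightarrow \mathbf{F}\,\mathit{alert}) \\
  &(\neg\,\mathit{chargeableCall})\ \mathbf{W}\ \mathit{permissionGiven}
\end{align*}
```

The first two formulas are liveness properties (something good eventually happens), whereas the third is a safety property: no chargeable call occurs unless permission has been given beforehand, and the formula also holds if permission is never given and no chargeable service is ever called.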
On a more technical side, model checking also allows us to apply the mature process analysis methodology that has been established in programming language compilers over the last decades [53] and has been shown to be realizable via model checking [54]. By providing a predefined set of desirable process properties to the model checker, we plan to achieve a thorough monitoring of safety and liveness properties within the framework. Similar to the built-in code checks that most Integrated (Software) Development Environments provide, this would help Bio-jETI users to avoid the most common mistakes at process design time. In addition, the list of verified properties is extendable by the user, and can thus be easily adapted to specific requirements of the application domain.
Model synthesis
This dimension can be seen as a step beyond model checking: the biologist does not have to care about data types at all, because the synthesis automatically makes the match by inserting the required transformation programs. This is similar to a spell checker that automatically corrects the text, thus freeing the writer from dealing with orthography altogether. (In our model-based framework, things are well-founded, without the uncertainties of natural language. Please do not be put off by this example because of annoying experiences with spell checkers!)
The potential of this technology goes even further: ultimately, biologists will be able to specify their requests in a very sparse way, e.g. by just giving the essential cornerstones, and the synthesis will complete this request into a runnable process. In our text writing analogy, this might look like a mechanism that automatically generates syntactically and intentionally correct text from text fragments according to predefined rules that capture syntax and intention. For instance, the fragments "ten cars", "1000 Euro for shipping", "19% value added tax", "four days" and "Mercedes" may be sufficient to synthesize a letter in which a logistics company offers its services to Mercedes according to a specific request.
Back to biology, the fragments "DNA sequences", "phylogenetic tree", and "visualization" may automatically lead to a process that fetches EBI sequence data, sends them in adequate form to a tool that is able to produce a phylogenetic tree, and then transfers the result to an adequate viewer. Typically there are many processes that solve such a request. Thus our synthesis algorithm offers the choice between producing a default solution according to a predefined heuristic and proposing a set of alternative solutions for the biologist to select from.
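The choice between a default solution and a set of alternatives can be sketched as follows. The code is a toy illustration under strong assumptions: services are reduced to single input-output pairs, all service and type names are hypothetical, and "shortest chain first, then alphabetical order" merely stands in for whatever heuristic is actually configured.

```python
# Hypothetical service signatures: name -> (input type, output type).
SERVICES = {
    "fetch_sequences": ("sequence_ids", "sequences"),
    "align_and_build_tree": ("sequences", "tree"),
    "align": ("sequences", "alignment"),
    "tree_from_alignment": ("alignment", "tree"),
    "view_tree": ("tree", "visualization"),
}


def enumerate_solutions(start, goal, services, max_length):
    """Depth-first enumeration of all service chains of at most max_length steps."""
    solutions = []

    def extend(current, path):
        if current == goal:
            solutions.append(list(path))
            return
        if len(path) == max_length:
            return
        for name, (inp, out) in services.items():
            # Each service is used at most once per chain to avoid cycles.
            if inp == current and name not in path:
                path.append(name)
                extend(out, path)
                path.pop()

    extend(start, [])
    return solutions


def default_solution(solutions):
    """Assumed heuristic: prefer the shortest chain, then alphabetical order."""
    return min(solutions, key=lambda s: (len(s), s)) if solutions else None
```

Calling `enumerate_solutions("sequence_ids", "visualization", SERVICES, 5)` yields both the three-step and the four-step chain for the biologist to choose from, while `default_solution` picks the three-step chain.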