Technological changes and new developments in computer science and IT occur even faster than in the rapidly changing domains of genomics, proteomics etc. Recently, several new technologies and trends such as Web 2.0, Service Oriented Architectures (SOA) and other web-related technologies, e.g. Ajax, have been introduced. Since many bioinformatics tools and biological databases are deployed through and depend on the internet, these new technologies appear to be of considerable importance for users as well as developers of tools. Frequently, it is not clear which technology to use, since a given technology might soon be outdated or might not yet be supported by other service providers. This can lead to confusion, although web service technologies are supposed to provide better service interoperability by standardising protocols and message exchange patterns. The term web service was originally coined as a specific W3C standard [1]; however, more recently it has been used to refer to any method of programmatic access over the underlying technologies of the Web (and indeed to refer to some methods that do not in fact use any web technology). In bioinformatics, the term ‘Web service’ has often been used for services returning web pages, but in the remainder of this article we will use it to refer to the programmatic interface exclusively.
Building web-accessible interfaces to bioinformatics resources using Common Gateway Interface (CGI) scripts or servlets is now common practice. Although building web sites that are scalable, reliable and user friendly can still be a challenge, thousands of bioinformatics sites provide human-readable content via such means. In other words, end users can point their web browsers to such sites to obtain data or launch applications such as sequence search and analysis. Another important step is to make resources available not just for manual interaction through a web browser, but also for programmatic access from client programs and scripts.
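To illustrate what programmatic access means in practice, the following minimal Python sketch shows the two tasks a client program typically performs: constructing a request for a sequence record and parsing the machine-readable response. The base URL, the `format` parameter, the function names and the identifier `SEQ1` are assumptions made purely for illustration and do not describe any existing service; the response is a canned FASTA string, so no network access is needed.

```python
from urllib.parse import urlencode

# Hypothetical base URL, used for illustration only; no real service is implied.
BASE_URL = "https://example.org/sequences"


def build_request_url(identifier, fmt="fasta"):
    """Return the URL a client would request for one sequence record."""
    return f"{BASE_URL}/{identifier}?{urlencode({'format': fmt})}"


def parse_fasta(text):
    """Parse FASTA text into a dict mapping record identifier to sequence."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The identifier is the first whitespace-delimited token of the header.
            header = line[1:].split()
            current_id = header[0] if header else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines are concatenated until the next header line.
            records[current_id] += line
    return records


# Canned response standing in for what the hypothetical service might return.
example_response = ">SEQ1 made-up example protein\nMKTAYIAKQR\nQISFVKSHFS\n"

url = build_request_url("SEQ1")
sequences = parse_fasta(example_response)
```

A browser user would read the same record as a rendered page, whereas a program works with the structured result (here a dictionary) and can pass it directly to the next analysis step.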
Following the trend of web services and SOAs in general, this article addresses the following questions:
- What web service technologies are commonly used to support sequence annotation? We answer this question by limiting ourselves to a selected but representative list of tools and services.
- What are the specific requirements of sequence annotation and which technologies address them?
- What are possible usage scenarios and best practices?
- How can data integration be addressed given the usage of web services?
All authors of this article are involved in the design, implementation and/or deployment of web services in the context of sequence analysis. They met at a workshop in Geneva [2] during spring 2007 and continued the debate through e-mail discussions until early 2008. Some, but not all, of the authors are also members of the EMBRACE consortium [3]; hence the opinions expressed here are not necessarily those of EMBRACE. Although the authors did not reach full agreement with respect to technology choices, this article summarises the key concepts and challenges on which they all agree. Many of the ongoing discussions in the IT community are driven by opinions and interests rather than pure facts; we have attempted to avoid this pitfall.
In general, we focus on the design and technology choices that are necessary when providing a web services-based interface on top of a certain application logic (bioinformatics tool) or database. A possible way to proceed is depicted in the accompanying figure, which also provides the logical structure of this article. Given a certain use case that is implemented by an application program (the application logic), a service provider can decide to offer this service over the internet via a ‘conventional’ web site to allow users to access the service via web browsers. One can also provide a programmatic interface to the service; this is the main focus of this article. The actual problem domain is characterised (further details in ‘characteristics of protein sequence data’ section), and existing technology needs to be reviewed [‘W3C web services (SOAP-based web services)’ and ‘REST services’ sections] and checked for applicability to the domain (‘web services and the relation to biological properties’ section). It is then advisable to follow certain best practice approaches (‘best practices’ section) to allow services to be compatible and interoperable with each other. Additionally, the integration and exchange of data provided and produced by different web services is another important topic that requires considerable effort. We will discuss possible syntactic and semantic data integration approaches in ‘data integration’ section. In general, the steps depicted in the figure should be applied whenever a new service is designed. Ideally, data and service integration issues should already be considered at the time the public interface of the service is designed, in order to avoid unnecessary data conversion steps once a service has been deployed.
In order to establish a context, we focus on the use case of biological sequence analysis and annotation, which requires access to different data sources and tools. This is a representative domain requiring programmatic access at different levels in the overall workflow of sequence annotation. We begin by looking at how UniProtKB/Swiss-Prot is used by biologists and annotated by curators [4]. We then describe the characteristics of data and tools that are relevant to sequence analysis and annotation and follow all the steps outlined in the accompanying figure. For each of the steps we give recommendations that can be helpful to other service providers and users who engage with web services and SOA.