Biomedical applications have become increasingly complex, and they often require large-scale high-performance computing resources with a large number of processors and memory. The complexity of application deployment and the advances in cluster, grid and cloud computing require new modes of support for biomedical research. Scientific Software as a Service (sSaaS) enables scalable and transparent access to biomedical applications through simple standards-based Web interfaces. Towards this end, we built a production web server (http://ws.nbcr.net) in August 2007 to support the bioinformatics application called MEME. The server has grown since to include docking analysis with AutoDock and AutoDock Vina, electrostatic calculations using PDB2PQR and APBS, and off-target analysis using SMAP. All the applications on the servers are powered by Opal, a toolkit that allows users to wrap scientific applications easily as web services without any modification to the scientific codes, by writing simple XML configuration files. Opal allows both web forms-based access and programmatic access of all our applications. The Opal toolkit currently supports SOAP-based Web service access to a number of popular applications from the National Biomedical Computation Resource (NBCR) and affiliated collaborative and service projects. In addition, Opal’s programmatic access capability allows our applications to be accessed through many workflow tools, including Vision, Kepler, Nimrod/K and VisTrails. From mid-August 2007 to the end of 2009, we have successfully executed 239 814 jobs. The number of successfully executed jobs more than doubled from 205 to 411 per day between 2008 and 2009. The Opal-enabled service model is useful for a wide range of applications. It provides for interoperation with other applications with Web Service interfaces, and allows application developers to focus on the scientific tool and workflow development. Web server availability: http://ws.nbcr.net.
Biomedical data have become increasingly complex, and the applications that analyze them often require large-scale high-performance computing resources with a large number of processors and memory. The recently launched 1000 Genomes Project (http://www.1000genomes.org) promises to increase greatly the effectiveness of genome-wide association studies (GWAS), and the potential for pharmacogenomic discoveries and personalized medicine (1). Efforts to define ‘human druggable genome’ to identify novel drugs or repurpose existing drugs have resulted in new computational tools and databases that enable access to real time or precomputed analysis results using distributed computing resources (2). The National Biomedical Computation Resource (NBCR, http://www.nbcr.net) provides biomedical computing resources with a focus on multi-scale biomedical research, dealing with scales at the molecular level for Computer Aided Drug Discovery (CADD), subcellular level for calcium signaling and organ level for patient specific modeling. The complexity of application deployment and the advances in distributed computing require new modes of support for biomedical research. The sSaaS enables scalable and transparent access to biomedical applications running remotely on a desktop, cluster, grid or cloud computing resources through open standards-based Web services. The availability of these application-specific services enables the adoption of workflow tools to handle complex analytical procedures. The NBCR web server http://ws.nbcr.net contains a registry of several classes of biomedical application services, including docking and virtual screening using Autodock 4 (3) and Autodock Vina (4); electrostatics analysis using PDB2PQR (5,6) and APBS (7); off-target analysis using SMAP (8,9–11); and motif discovery using the MEME suite (12). 
While each application may be used in many different scenarios, they are provided by NBCR within the context of CADD, and more application services dealing with other aspects of multiscale modeling will become available over time. These applications are exposed as Web services using the open source Opal toolkit (13,14), which enables users to access applications either through web interfaces automatically generated in the Opal Dashboard or programmatically through the SOAP protocol, with clients in the Perl, Python or Java programming languages. The availability, stability and scalability provided by Opal or similar Web services allow the same or alternative applications to be offered by different providers, resulting in better cyberinfrastructure for biomedical research.
The rest of the paper is organized as follows: the next section describes the usage scenario and the available applications; the subsequent section discusses programmatic access to the Web services, using Python as an example; the following section covers advanced use cases, including workflow composition; thereafter, the Opal toolkit and its key new features are presented; a discussion of this work and related work follows; and the last section presents the conclusions and future work.
Some of the hurdles in making scientific software accessible have been the installation, maintenance and upgrades, let alone the hardware and environmental cost of maintaining a machine room. Through our experience providing MEME as a biomedical community service, we have developed the Opal toolkit (14), which enables scientific application developers to set up web services with both web form access and programmatic access easily without needing to change any of their own codes. We have since expanded the number of applications provided through the Opal toolkit, and built advanced workflows that leverage these distributed and scalable Web services within the context of CADD, to help increase the translational impact of the computational tools from NBCR and other related activities.
The Opal Dashboard (Figure 1A) (15) at http://ws.nbcr.net provides a listing of all the Web services, including applications for docking analysis, electrostatics calculations, off-target analysis and motif search/discovery. The ‘List of Applications’ tab shows the list of applications. The ‘Search’ textbox allows a user to use keywords to search for applications. Each service has a service name, a basic description of the application, a tutorial link and a programmatic access (Web service) URL. The service name links to an automatically generated simple or a customized web page. The former accepts command line arguments for an application, whereas the latter provides the command line arguments in a web form (Figure 1B). The tutorial link contains instructions for learning and sample input and output for testing. The Web service URL is used only for programmatic access of the application.
The application services are provided to facilitate various aspects of CADD. A summary is provided in Table 1, along with the references for each application. While there may be alternative applications for each purpose, due to space constraints we only discuss similar applications from the literature or online services in the ‘Discussion’ section. A typical usage scenario for these services is as follows: a user has a particular drug target receptor in mind, e.g. the influenza neuraminidase (NA), and would like to search the NCI diversity set (NCIDS http://dtp.nci.nih.gov/branches/dscb/div2_explanation.html) library for possible hits (16). The partial charges for the receptor may be added using PDB2PQR, or with the Prepare Receptor service (part of MGLTools, currently at v1.5.4) using default parameters. The user may then use AutoDockTools (also part of MGLTools) to specify the grid box center, spacing and number of grid points to generate a custom grid parameter file (GPF), or provide a reference GPF template file with the desired grid box information. As the atom types created in the GPF depend upon the ligand or library of ligands, the user would use the Prepare GPF for Library service available at http://kryptonite.nbcr.net/opal2/dashboard, where a user may upload a library already prepared for use with AutoDock4, have it prepared with the Prepare Ligand service, or choose a library already available on the server. The NCIDS and ZINC (17) libraries are among the example libraries available. Once AutoGrid is complete, the AutoDock Virtual Screening service may be used for the virtual screening. If the user is interested in finding PDB structures that have a binding site similar to that of the NA protein, then he or she may use the SMAP service to perform an off-target analysis.
If the user is interested in identifying conserved motifs in the NA protein for all pandemic N1 proteins, the MEME suite of programs may be used for such purposes, followed by mutational analysis, including electrostatic analysis to examine the effect of these mutations.
The number of jobs run for the various applications on our web server has grown. Figure 2 compares the numbers of jobs successfully executed in 2008 and 2009. Usage of our applications increased, and the total number of jobs executed in 2009 was approximately double that of 2008: 75 059 jobs were successfully executed in 2008, and 150 041 in 2009.
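The per-day figures quoted in the abstract can be checked against these yearly totals. The short calculation below is only a sanity check on the published numbers; the variable names are ours, and it assumes each year is counted in full calendar days (2008 being a leap year).

```python
import calendar

# Yearly totals of successfully executed jobs, as reported in the text.
jobs_per_year = {2008: 75059, 2009: 150041}
# Total from mid-August 2007 to the end of 2009, as reported in the abstract.
total_through_2009 = 239814


def days_in_year(year):
    """Return 366 for leap years and 365 otherwise."""
    return 366 if calendar.isleap(year) else 365


# Average number of jobs per calendar day in each year.
jobs_per_day = {
    year: count / days_in_year(year) for year, count in jobs_per_year.items()
}

# Ratio of the yearly totals, and the jobs left over for the 2007 period.
yearly_growth = jobs_per_year[2009] / jobs_per_year[2008]
jobs_2007 = total_through_2009 - sum(jobs_per_year.values())

for year in sorted(jobs_per_day):
    print(f"{year}: {jobs_per_day[year]:.1f} jobs per day")
print(f"2009/2008 ratio of yearly totals: {yearly_growth:.3f}")
print(f"Jobs from mid-August to end of 2007: {jobs_2007}")
```

Rounded to whole jobs, the daily averages agree with the 205 and 411 jobs per day given in the abstract, and the yearly totals differ by a factor just under two.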
The Opal toolkit consists of the Opal server and the Opal client (Figure 3). The Opal server is responsible for managing and providing the web services, while the Opal client accesses web services on the Opal server. The setup of the Opal server is discussed in the ‘Opal server management and configuration’ section, and is only of interest to users who want to provide Opal services for their own applications. The Opal client enables easy programmatic access to any Opal service. A generic Opal client, currently available in Python and Java, supports user actions such as launching a job, querying job status and getting job outputs, and works with any Opal Web service. An Opal client in Perl is provided as an example for MEME, and a generic Perl client will be made available soon.
Figure 3 illustrates how the Python client may be used to access the PDB2PQR service (Table 2). More Opal client documentation is available at http://www.nbcr.net/pub/wiki/index.php?title=Opal_Client.
For example, a user can launch a PDB2PQR job with the following command to convert ‘1a1p.pdb’ to ‘output.pqr’, assuming ‘1a1p.pdb’ is already in the user’s local directory:
python GenericServiceClient.py \
-r launchJob \
-a "--ff=amber 1a1p.pdb output.pqr" \
The output for this command will include a job ID, e.g. app1234567890, along with a job base output URL. With the job ID the user can query the job status as in the command below:
python GenericServiceClient.py \
-r queryStatus \
A notable feature of Opal Web services is that the job output URL may be used as the input URL for compatible Web services. This is discussed in the next section for workflow composition.
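The launch and status requests shown above can also be driven from a script. The following is a minimal sketch under stated assumptions, not part of the Opal distribution: only the `-r` and `-a` flags are taken from the text, so any further arguments the client needs (such as the service URL or the job ID) must be supplied by the caller through the hypothetical `extra_args` parameter. The `query_status` callable and the status names `DONE` and `FAILED` are likewise placeholders, not Opal's actual status values.

```python
import time


def build_client_command(request, app_args=None, extra_args=()):
    """Assemble an argument list for GenericServiceClient.py.

    Only the -r and -a flags appear in the text; everything else must be
    passed explicitly by the caller in extra_args (hypothetical parameter).
    """
    command = ["python", "GenericServiceClient.py", "-r", request]
    if app_args is not None:
        command += ["-a", app_args]
    command += list(extra_args)
    return command


def wait_for_job(query_status, job_id, done_states=("DONE", "FAILED"),
                 interval_seconds=5.0, max_polls=120, sleep=time.sleep):
    """Poll query_status(job_id) until it returns a terminal state.

    query_status is a caller-supplied function (hypothetical) that returns
    the current status of the job as a string. The state names used here
    are placeholders for this sketch.
    """
    for _ in range(max_polls):
        status = query_status(job_id)
        if status in done_states:
            return status
        sleep(interval_seconds)
    raise TimeoutError(f"job {job_id} not finished after {max_polls} polls")


if __name__ == "__main__":
    # The argument list can be handed to subprocess.run() once complete.
    print(build_client_command("launchJob", "--ff=amber 1a1p.pdb output.pqr"))
```

Passing the status query in as a function keeps the polling logic independent of how the status is obtained, whether by running the command-line client or by calling the Web service directly.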
As indicated in the section on the usage scenario, CADD processes are very complex and involve many steps, some automatic and others requiring manual intervention. A number of workflow tools, both commercial and open source, have been developed to support the construction and execution of complex workflows. Some well-known examples include Pipeline Pilot (http://accelrys.com/products/pipeline-pilot/), Taverna (20,21), Kepler (22), VisTrails (23) and Vision (4). All these tools allow a user to run complex experiments automatically, after careful parameter tuning for the individual steps. Often, an experiment step is represented in these workflow tools as a node, which has input and output ports. Two nodes can be connected together so that the outputs of the first node become inputs of the second node and may trigger the execution of the second node. For example, a user often thinks of an AutoDock VS experiment as three major steps: prepare the receptor, compute the grids and run the virtual screening. The completion of a ‘prepare receptor’ node triggers the execution of the ‘compute grids’ node, which in turn triggers the execution of the ‘AutoDock virtual screening’ node.
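The node-and-port model described above can be illustrated with a few lines of Python. This is a generic sketch of the idea, not the API of Vision, Kepler or any other workflow tool mentioned here; the `Node` class and the placeholder actions are hypothetical and stand in for the real services.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Node:
    """A workflow step whose output is passed to every connected node."""
    name: str
    action: Callable[[str], str]
    downstream: List["Node"] = field(default_factory=list)

    def connect(self, other: "Node") -> "Node":
        """Wire this node's output port to the input port of another node."""
        self.downstream.append(other)
        return other

    def run(self, value: str, trace: List[Tuple[str, str]]) -> None:
        """Run this step, record it, then trigger the connected nodes."""
        result = self.action(value)
        trace.append((self.name, result))
        for node in self.downstream:
            node.run(result, trace)


# Placeholder actions: each one just labels its input so the data flow
# between steps stays visible in the trace.
prepare = Node("prepare receptor", lambda x: f"prepared({x})")
grids = Node("compute grids", lambda x: f"grids({x})")
screen = Node("AutoDock virtual screening", lambda x: f"screened({x})")

prepare.connect(grids).connect(screen)

trace: List[Tuple[str, str]] = []
prepare.run("receptor.pdb", trace)
for step_name, output in trace:
    print(f"{step_name}: {output}")
```

Running the first node is enough to execute the whole chain, because the completion of each step triggers the nodes connected to its output.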
Many of these Opal web services are easily accessible in Vision (4) using the Vision Web service module. Vision provides a visual programming environment, and is easily integrated with the Python Molecular Viewer (PMV) and AutoDockTools (ADT) to automate much of the setup, visualization and analysis of AutoDock experiments. The Vision Web service module leverages the automatic interface generation feature to provide users with the same user interface as the web form, as configured on the server side.
The workflow illustrated in Figure 4 accesses the Prepare Receptor Web services from http://ws.nbcr.net, and Prepare GPF, Autogrid, and Autodock Virtual Screening Web services from http://kryptonite.nbcr.net. The Opal output URL is often used as the input URL for the next step of the workflow. These workflow units are now packaged into the MGLTools and allow advanced users to build customized workflows.
For users who wish to use preexisting workflows, the AutoDock VS experiment may be accessed using Vision networks developed as part of a prototype of the NBCR CADD pipeline (http://nbcr.net/pub/wiki/index.php?title=CADD_Pipeline). In addition, users may take advantage of pre-built workflows for post-analysis and visualization of docking results. For example, a user may use a Vision network to iterate through a virtual screening result directory containing the different docking log files, and play through the different ligand conformations using built-in functionalities of ADT. In this way, more customized analysis routines may be published and shared with many users.
A full description of workflow tools such as Kepler, Nimrod/K (24) or VisTrails is beyond the scope of this paper. We simply point out that a generic Opal client is available as an actor in Kepler, which can be used to call any Opal service. To get the Opal actor in Kepler, the user can type ‘web services’ in the ‘Search Components’ textbox and then drag ‘OpalClient’ to the workflow composition canvas. The user can then click on ‘OpalClient’ to enter a web service URL. After the user clicks on ‘commit’ and then clicks on the ‘OpalClient’ actor again, the updated actor will show the automatically generated Opal web form. Similarly, the Kepler-based Nimrod/K allows dynamic concurrent execution of Kepler’s actors. A tutorial on using Kepler with Nimrod/K and Opal services can be found at http://nbcr.net/pub/wiki/index.php?title=Kepler. As with Vision and Kepler, all Opal services are callable from VisTrails, for which automatic interface generation of the Opal web form will become available in the near future. Documentation on installing the required components for using VisTrails with Opal services, along with tutorials, can be found at http://nbcr.sdsc.edu/pub/wiki/index.php?title=VisTrails. In VisTrails, the user can use the ‘ExecuteOpalJob’ node to launch an Opal job. This node has four inputs: command line arguments, number of processors (for parallel jobs only), list of input files and the web service URL.
The Web services described earlier have been deployed using the Opal toolkit, which enables advanced users to automatically wrap scientific applications running on cluster and Grid resources as Web services, and to provide Web-based and programmatic access to them, without any modification to the scientific codes. Thus, together with the generic clients or workflow tools described above, a user may access or develop workflows that leverage not only the Opal services provided by NBCR, but also those provided by anyone using the Opal toolkit. As the advantages of using the Opal toolkit have been published extensively elsewhere (13–15), we will only highlight a few key points relevant to the management and usage of the Opal services. Additional documentation on installation may be found online at http://opal.nbcr.net.
The Opal Dashboard (15), available with any Opal installation, provides a consolidated point of entry for accessing information about Opal services, and invoking remote scientific applications (Figure 1A). Opal also provides a mechanism to automatically create Web interfaces for job submission using provided application specific metadata. A detailed description of automatic interface generation is presented in ref. 15. Briefly, advanced users can make sophisticated configuration files to generate advanced web forms, which contain an input field for each input parameter, as specified by the command-line arguments of the scientific application. Parameters may be grouped together and check boxes and radio buttons may be added (Figure 1B). This feature is great for making an application quickly available to the intended users through the web, in addition to the programmatic access through SOAP Web service calls. Users may also design their own web pages and make Web service calls to Opal services if desired, as is the case for MEME and PDB2PQR.
The Opal server logs information for each job, including job ID, job URL, start time, activation time, end time, client IP, service name and job result status. Using this information, the Opal Dashboard provides a variety of charts showing usage information, as shown in Figure 5. Note that these charts are completely anonymous: no user information is displayed on the charts. Each Opal job is associated with a unique random ID given only to the user who submitted the job. Immediately after the user submits a job on the web form, the user gets a unique URL that automatically updates the status of the job. More details about Opal state management are available in ref. 13. A new Opal version to be released in May 2010 includes email notification of the job URL and status updates. Another feature, already available, was necessitated by the popularity of our service: Opal 2.2 and higher support an IP address filtering feature that helps prevent Denial of Service attacks. An administrator may enable this feature to deny all job requests from an IP once the number of jobs from that IP in the past hour exceeds a specified limit. IPs may also be exempted from any restrictions.
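The per-IP filtering rule can be pictured as a sliding one-hour window over job submissions. The sketch below illustrates that idea only; it is not Opal's implementation, and the class name, parameters and exact boundary behaviour are assumptions. In particular, it assumes that a request is denied when accepting it would push the IP above the limit, and that denied requests are not counted.

```python
from collections import defaultdict, deque


class HourlyIPFilter:
    """Deny job requests from an IP that exceeded its quota in the past hour."""

    def __init__(self, max_jobs_per_hour, exempt_ips=(), window_seconds=3600):
        self.max_jobs_per_hour = max_jobs_per_hour
        self.exempt_ips = set(exempt_ips)
        self.window_seconds = window_seconds
        # Timestamps (in seconds) of accepted jobs, kept per client IP.
        self._accepted = defaultdict(deque)

    def allow(self, ip, now):
        """Return True if a job submitted from ip at time now is accepted."""
        if ip in self.exempt_ips:
            return True
        recent = self._accepted[ip]
        # Forget submissions that have left the sliding window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_jobs_per_hour:
            return False
        recent.append(now)
        return True


if __name__ == "__main__":
    ip_filter = HourlyIPFilter(max_jobs_per_hour=2, exempt_ips={"10.0.0.1"})
    for second in (0, 10, 20, 3600):
        print(second, ip_filter.allow("192.0.2.7", now=second))
```

Keeping only the timestamps inside the window means the memory used per IP is bounded by the limit itself, and exempted IPs bypass the bookkeeping entirely.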
The Web services described in this paper provide a set of popular computational tools that may be used within the context of CADD, though they are not limited to it. The use of the Opal toolkit may help provide a common platform for rapid deployment of scientific Web services. For example, many existing applications that perform similar functions, such as AlignAce (25) as opposed to MEME in motif discovery; semi-empirical quantum-chemical techniques (26) as opposed to PDB2PQR or AutoDockTools in partial charge calculation; Dock (27) as opposed to AutoDock in docking; and webPIPSA (28) as opposed to APBS in electrostatics analysis, may be coupled in workflows or compared for consensus. Users are encouraged to study the original references for these application services, acquire the appropriate training and validate any predictions with experiments.
As the Web services we provide increase in popularity, there are several ways to address the problem of scalability. While we have implemented a filter based upon the number of requests per hour, we are also preparing an increased allocation of compute resources in response to any increased usage. In addition, we are exploring the use of the commercial provider Amazon EC2 as an on-demand purveyor of the services, through virtual machines that contain preconfigured Opal services on multicore machines or virtual clusters. On the other hand, as more identical or alternative applications become available as Web services with the help of the Opal toolkit, the pressure on any one provider may be relieved. In other words, the Opal Web service interface remains unchanged, yet the resources behind the services can be dynamically allocated from different providers.
While Opal provides a simple toolkit to deploy Web services, other methods are also available (14). For example, EBI has a number of services deployed using the SoapLab toolkit (http://www.ebi.ac.uk/soaplab/), with good support from the Taverna workflow engine. The Protein Data Bank (PDB) provides a plethora of Web services related to structure-based drug design. We anticipate that the proliferation of Web services and emerging standards will allow researchers to focus more on the biological problems than on deploying applications, and will advance the cyberinfrastructure for biomedical research.
Since August 2007, the NBCR web server http://ws.nbcr.net has been continuously providing biomedical scientists with easy access to a large collection of biomedical applications, with about a quarter of a million jobs completed by the end of 2009. The key features of Opal Web services include the availability of sample clients for accessing Opal services, ease of access from workflow tools, and scalability through distributed computing using desktop, cluster, grid and cloud resources.
We plan to add several new features to Opal. Most notably, we plan to provide a REpresentational State Transfer (REST)-based API to invoke Opal services. REST is much more lightweight than SOAP-based implementations, currently used by Opal. Opal users will then be able to invoke Opal services with the help of basic HTTP tools such as curl. Second, we will develop a registry that can aggregate access to Opal services hosted at various locations. Third, we are in the process of improving the automatic interface generation to have better support for Ajax (29), making the user interface more contemporary and easier to use. Additional effort may also be required to ensure fair use of the resources, and to ensure the best possible security and privacy of user data through the development of personalized data services. These improvements will make the Opal toolkit easier to use, and increase the number of Web services available for biomedical research.
More scientific applications will be provided in support of the Relaxed Complex Scheme in CADD (16). The NBCR Summer Institute (http://si.nbcr.net) provides annual training on the tools and services described in this paper, and exposes users to emerging technologies that may benefit biomedical computing in the near future.
Funding for open access charge: National Center for Research Resources (NCRR); National Institutes of Health (NIH); (P41 RR08605 award to NBCR).
Conflict of interest statement. None declared.
We appreciate the constructive suggestions of the editor in improving the manuscript. In addition, we thank Stefano Forli and Alex Perryman for AutoDock4 ligand libraries; John Irwin from UCSF for the ZINC library. We also wish to thank the reviewers of this manuscript for critical comments and suggestions.