This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Supplementary information: Supplementary data are available at Bioinformatics online.
Finding information about a biological entity is a step tightly bound to molecular biology research. Despite ongoing efforts, this task remains tedious and time consuming, and it becomes more challenging as the amount of available information steadily increases. Currently available systems such as Whatizit (Rebholz-Schuhmann et al., 2008) allow the user to paste a section of free text into a web page, which then links the recognized biological entities to existing databases. Our aim is to assist researchers with an easy-to-use interface that provides summary information about the biological entities mentioned in commonly used document types. In this applications note, we present OnTheFly, a tool that automatically tags proteins, genes and chemicals and generates interaction networks from widely used file formats such as PDF, Microsoft Office files and plain text files. In the following sections, we describe the functionality and the architecture of OnTheFly and comment on its performance. We then demonstrate the functionality using a full-text PDF article as an example.
OnTheFly is a service that automatically annotates document files such as Microsoft Word, Excel, PowerPoint, PDF or plain text files. After the files are submitted to the service, the system returns a tagged HTML version of the documents. Gene, protein and chemical names are highlighted, and clicking on a name opens a pop-up window that contains relevant information about the entity. For proteins, the presented information includes domains, sequence, organism and subcellular localization; for chemicals, it includes the formula; for both entity types, protein–chemical and chemical–chemical interactions are shown. This functionality is provided by the Reflect server (http://reflect.ws).
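The tagging step described above can be illustrated with a toy dictionary-based tagger. This is only a minimal sketch under stated assumptions: it is not the Reflect implementation or its API, and the dictionary contents, the placeholder identifiers and the function name `tag_entities` are hypothetical.

```python
import html
import re

# Hypothetical name dictionary: surface form -> (entity type, identifier).
# The identifiers are placeholders, not real database accessions.
DICTIONARY = {
    "p53": ("protein", "EXAMPLE:0001"),
    "BRCA1": ("protein", "EXAMPLE:0002"),
    "aspirin": ("chemical", "EXAMPLE:0003"),
}


def tag_entities(text, dictionary=DICTIONARY):
    """Return HTML in which every dictionary name is wrapped in a span."""
    # Try longer names first so that a long name wins over a shorter prefix.
    names = sorted(dictionary, key=len, reverse=True)
    # Only match names that are not embedded in a longer word.
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(n) for n in names) + r")(?!\w)"
    )
    parts = []
    pos = 0
    for match in pattern.finditer(text):
        # Escape the untagged text between two matches.
        parts.append(html.escape(text[pos:match.start()]))
        kind, ident = dictionary[match.group(1)]
        parts.append(
            '<span class="%s" data-id="%s">%s</span>'
            % (kind, ident, html.escape(match.group(1)))
        )
        pos = match.end()
    parts.append(html.escape(text[pos:]))
    return "".join(parts)
```

In this sketch, the `class` and `data-id` attributes are what a front end could use to highlight a name and look up the details shown in a pop-up window. Matching is case sensitive and purely lexical, which a production tagger would likely need to refine.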
OnTheFly can furthermore generate interaction networks, extracted from the STITCH database (Kuhn et al., 2008), for a set of bioentities (genes, proteins, chemicals). The user can select the organism whose protein aliases are used for tagging and network generation; the default organism is Homo sapiens. The user can also set the size of the network and the number of interactors per recognized entity. Network generation is not restricted to one document but can be applied to a set of documents simultaneously.
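The effect of the two user-defined limits, the network size and the number of interactors per recognized entity, can be sketched as follows. This is an illustration only, not the STITCH or OnTheFly algorithm: the input format (triples of two entity names and a confidence score), the ranking by score and the names `build_network`, `max_interactors` and `max_nodes` are assumptions.

```python
from collections import defaultdict


def build_network(seeds, interactions, max_interactors=5, max_nodes=50):
    """Grow a network around the seed entities.

    interactions is an iterable of (entity_a, entity_b, score) triples.
    Returns (nodes, edges), where edges maps a sorted name pair to a score.
    """
    # Index every interaction by both of its endpoints.
    neighbours = defaultdict(list)
    for a, b, score in interactions:
        neighbours[a].append((score, b))
        neighbours[b].append((score, a))

    # De-duplicate the seeds while keeping their order.
    seeds = list(dict.fromkeys(seeds))[:max_nodes]
    nodes = set(seeds)
    edges = {}
    for seed in seeds:
        # Strongest partners first; ties are broken by name.
        ranked = sorted(neighbours[seed], key=lambda p: (-p[0], p[1]))
        for score, partner in ranked[:max_interactors]:
            if partner not in nodes:
                if len(nodes) >= max_nodes:
                    continue  # network is full: skip new partners
                nodes.add(partner)
            edges[tuple(sorted((seed, partner)))] = score
    return nodes, edges
```

Because the seeds can come from several documents at once, the same function covers the multi-document case: the caller simply passes the union of the entities recognized in all selected files.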
Lists summarizing the identified bioentities are also generated. These lists contain the ID of each bioentity together with its organism and description, and they cover the bioentities found across the set of selected files.
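A summary list of this kind can be sketched as a merge of per-document hits. The input layout (a mapping from file name to `(entity_id, organism, description)` tuples), the extra `files` column and the function name `summarize` are assumptions made for illustration; the identifiers in any example data would be placeholders.

```python
def summarize(tagged_documents):
    """Merge per-document entity hits into one list sorted by ID.

    tagged_documents maps a file name to a list of
    (entity_id, organism, description) tuples.
    """
    summary = {}
    for filename, hits in tagged_documents.items():
        for entity_id, organism, description in hits:
            # Create the row the first time an entity is seen.
            row = summary.setdefault(
                entity_id,
                {
                    "id": entity_id,
                    "organism": organism,
                    "description": description,
                    "files": [],
                },
            )
            # Record each file once, even if the entity occurs repeatedly.
            if filename not in row["files"]:
                row["files"].append(filename)
    return sorted(summary.values(), key=lambda row: row["id"])
```

Each entity appears once in the result, regardless of how many times or in how many files it was recognized.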
The performance of the service can be assessed in a number of ways, such as the quality of the document conversion, the time required to tag a document and the accuracy of the annotation. The file converters used maintain most of the layout of the documents, including column separation, tables and figures. Processing a full-text article of about 15 pages with images and tables typically takes between 15 and 20 s. This time covers the whole process, including the communication with the server.
The name-tagging performance of the Reflect server is comparable to that of other available methods. More information can be found in the FAQ section on the web server.
To demonstrate the functionality of OnTheFly, a full-text article on protein–protein interaction predictions (Pitre et al., 2006), stored locally as a PDF file, was processed. Figure 1A shows a table section of the resulting HTML file with the tagged protein identifiers. Figure 1C shows the corresponding association network of these entities, retrieved automatically from the STITCH database.
OnTheFly uses the client-server architecture shown in Figure 1. The front end is an Applet written in Java 1.5. It can be accessed either directly from the server web page or as a Java desktop application. The Applet technology was chosen because it allows file drag-and-drop functionality and thus maximizes ease of use. The server-side components consist of a set of document converters along with the software modules required to invoke the Reflect service for tagging. The commercially available document converters currently employed (ultrashareware 2008; verypdf 2008) were selected for their ability to maintain the layout of the original document, the availability of a command-line mode and their processing speed. All data exchange uses HTTP.
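The server-side flow described above, in which a converter is selected for the submitted file and the resulting HTML is handed to the tagging service, can be sketched as a small dispatch function. This is a hedged illustration rather than the actual OnTheFly code: the function names, the converter registry keyed by file extension and the stand-in plain-text converter are all hypothetical, and the real service relies on external command-line converters and the Reflect server.

```python
import html
import os


def plain_text_to_html(data):
    """Stand-in converter: wrap plain text in a <pre> element."""
    return "<pre>" + html.escape(data) + "</pre>"


def convert_and_tag(filename, data, converters, tagger):
    """Convert a submitted file to HTML, then pass the HTML to a tagger.

    converters maps a lowercase file extension (e.g. ".txt") to a function
    that returns HTML; tagger is any function from HTML to tagged HTML.
    """
    extension = os.path.splitext(filename)[1].lower()
    try:
        convert = converters[extension]
    except KeyError:
        raise ValueError("unsupported file type: %r" % extension)
    return tagger(convert(data))
```

Keeping the converters in a registry reflects the design described in the text: converters can be swapped or added per file type without touching the tagging step.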
We hope that the OnTheFly service provides a powerful tool for researchers in the life sciences. It is an attractive tool not only for readers of scientific literature but also for data annotators and experimentalists who want to link their in-house documents to the literature and to other biological databases.
Conflict of Interest: none declared.