Searching for accurate and functionally related biological concepts through stacks of journals and articles is time-consuming. Furthermore, establishing the relationships between genes, transcripts, proteins, expression, and molecular structures using the web is often an error-prone process. Scientists need to use different web resources, which have different search engines that are syntactically and semantically incompatible: results are returned in heterogeneous formats, which makes it cumbersome to derive a coherent view of the biological meaning of these data.
The availability of new tools and libraries for the development of search engines and web portals allows us to build a system that enables interoperability between distinct data resources and channels them through a single hub. Scientists can now quickly search for and identify biological entities and their relationships, and navigate easily to expert primary resources.
We present here a high-performance, full-featured text search engine that finds and displays biological entities and their associations (i.e. the relationships between genomic sequences, transcripts, proteins and their function, molecular structures, gene expression profiles, protein–protein interactions, pathways and published scientific and patent literature), in much the same way as scientists search in a library. This web-based search engine is called the ‘EB-eye’ and is built on the free Open Source Apache Lucene Java™
What is the EB-eye?
EB-eye is a catalogue of biological entities, similar to a library catalogue that describes publications, and contains enough information to allow for efficient searching. Unlike indexing warehouses such as Entrez [2], SRS [3] and MRS [4], which provide complete access to the data and allow searching over fields of specialist value that are difficult to search without prior knowledge of their contents, EB-eye focuses on indexing the selected textual content that is most meaningful when searching biological data (e.g. database names and database identifiers, gene names and synonyms, protein names, chemistry identifiers, reaction equations, authors, titles, various types of descriptions and, importantly, cross-references that link entries between distinct databases). This excludes data that require specialist searching, such as sequences, structure coordinates and expression profiles, and ambiguous data, such as numeric counts, for which search tools exist at the primary resources. EB-eye improves the user’s experience by providing a search engine that presents consistent result pages and navigation across all the data resources maintained by EMBL-EBI.
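The idea of indexing only selected textual fields can be sketched with a toy inverted index. This is a minimal illustration in plain Python, not the Lucene-based implementation used by EB-eye; the field list, the record layout and the sample records are all invented for the example.

```python
import re
from collections import defaultdict

# Hypothetical set of text fields worth indexing; the actual EB-eye
# field list is not specified here.
INDEXED_FIELDS = {"id", "name", "synonyms", "description", "authors", "title"}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(entries):
    """Map each token to the set of (database, entry id) pairs containing it."""
    index = defaultdict(set)
    for entry in entries:
        key = (entry["database"], entry["id"])
        for field, value in entry.items():
            if field not in INDEXED_FIELDS:
                continue  # sequences, coordinates, counts etc. are skipped
            values = value if isinstance(value, list) else [value]
            for item in values:
                for token in tokenize(item):
                    index[token].add(key)
    return index


def search(index, query):
    """Return the entries that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    return set.intersection(*(index.get(t, set()) for t in tokens))


# Invented toy records, for illustration only.
entries = [
    {"database": "proteins", "id": "P0001", "name": "Example kinase 1",
     "synonyms": ["EK1"], "description": "Toy serine kinase record",
     "sequence": "MKINASE"},
    {"database": "genes", "id": "G0001", "name": "ek1",
     "description": "Toy gene encoding example kinase 1",
     "sequence": "ATGAAA"},
]
index = build_index(entries)
kinase_hits = search(index, "kinase")
sequence_hits = search(index, "MKINASE")
```

In this sketch a query for `kinase` matches both toy records across the two databases, whereas a query for the raw sequence string finds nothing, because the `sequence` field is deliberately left out of the index.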
EB-eye is not limited to specific data formats. As well as indexing database dumps, such as flat files from EMBL-Bank [5] and UniProt [6], database-specific XML and web content (e.g. HTML and XHTML), EB-eye uses a custom XML dump format, which has been designed for data resources without an export format. These dumps focus on the essential content required to describe a biological concept in the database, and are produced by the data providers.
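To make the notion of a compact XML dump concrete, the following sketch parses a small dump into records ready for indexing. The element and attribute names (`database`, `entry`, `cross_references`, `ref`, `dbname`, `dbkey`) and the sample content are assumptions made for illustration; the actual dump schema is not described in this section.

```python
import xml.etree.ElementTree as ET

# Hypothetical dump layout; all element and attribute names are assumed.
SAMPLE_DUMP = """\
<database>
  <name>ToyDB</name>
  <entries>
    <entry id="T0001">
      <name>Example kinase 1</name>
      <description>Toy record used for illustration only.</description>
      <cross_references>
        <ref dbname="OtherDB" dbkey="O0042"/>
      </cross_references>
    </entry>
  </entries>
</database>
"""


def parse_dump(xml_text):
    """Turn a dump into a list of flat records, one per <entry> element."""
    root = ET.fromstring(xml_text)
    database = root.findtext("name", default="")
    records = []
    for entry in root.iter("entry"):
        records.append({
            "database": database,
            "id": entry.get("id"),
            "name": entry.findtext("name", default=""),
            "description": entry.findtext("description", default=""),
            # Cross-references link this entry to entries in other databases.
            "cross_references": [
                (ref.get("dbname"), ref.get("dbkey"))
                for ref in entry.iter("ref")
            ],
        })
    return records


records = parse_dump(SAMPLE_DUMP)
```

Each record keeps only descriptive text and cross-references, which mirrors the stated aim of the dumps: enough content to describe the concept, with the full data left at the primary resource.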
The EB-eye is composed of a set of modules, each designed to carry out a distinct task. In the following we describe the web interface, the back-end data-management and indexing system and, finally, the SOAP Web Services interface that is used to integrate and/or embed its functionality into analytical pipelines, external applications and web portals.
Modules available in the overall architecture for the search engine back-end.