With the tremendous growth of the World Wide Web, search engines have become key tools for finding documents. Search engines retrieve documents matching a user's keywords from a large index and rank them by various criteria. While such keyword-based search is fast and powerful for retrieving single documents, it is far from the vision of answering a user's questions by "understanding" both the user's query and the answers in the documents, a vision put forward as early as the late 1960s [1].
Consider, for example, a biomedical researcher who might ask questions such as the following: Which model organisms are used to study the Fgf8 protein? Which processes are osteoclasts involved in? What are common histone modifications? Which diseases are associated with wnt signaling? Which functions does Rag C have? Which disease can be linked to fever, anterior mediastinal mass, and central necrosis? What is the role of PrnP in mad cow disease?
The Web holds answers to these questions, but classical keyword-based search is not suited to answering them, since the keywords are required to appear literally in the text. However, documents do contain statements such as "wnt signalling is linked to cancer" or "we studied fgf8 expression in Zebrafish development". If there is background knowledge that cancer is a disease and that zebrafish is a model organism, then the above questions can be answered.
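The role of background knowledge in this argument can be illustrated with a minimal sketch. It assumes that statements have already been extracted from documents as (subject, relation, object) tuples; the relation names, the `IS_A` mapping and the `answer` function are hypothetical and introduced only for illustration, not taken from any of the systems discussed here.

```python
# Statements as they might be extracted from documents.
# The relation names are illustrative assumptions.
STATEMENTS = [
    ("wnt signalling", "linked_to", "cancer"),
    ("fgf8", "studied_in", "zebrafish"),
]

# Background knowledge: which category each term belongs to.
IS_A = {
    "cancer": "disease",
    "zebrafish": "model organism",
}


def answer(subject, wanted_type):
    """Return objects of statements about `subject` whose category is `wanted_type`."""
    return sorted(
        {
            obj
            for subj, _relation, obj in STATEMENTS
            if subj == subject and IS_A.get(obj) == wanted_type
        }
    )
```

With these assumptions, `answer("wnt signalling", "disease")` finds "cancer" although the word "disease" never occurs in the statement itself, which is exactly what a literal keyword match cannot do.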
The use of such knowledge is at the heart of the semantic web, which promotes the use of formal statements and reasoning to deliver advanced services not available on the Web now [2]. To facilitate machine-readability and knowledge processing, the World Wide Web Consortium (W3C) proposed a set of standards, query languages, and the semantic stack. At its base, the stack comprises unique identifiers and XML as a common markup language. On top of XML, it defines the Resource Description Framework (RDF) to capture subject-predicate-object triples. Furthermore, there is the modelling language RDF Schema (RDFS) and the query language SPARQL. At the next level, the basic class definitions and triples of RDF are extended by the Web Ontology Language (OWL), which provides description logic as a modelling language, and by a rule layer [3].
Besides the expressive OWL, markup for vocabularies and metadata emerged, such as the Simple Knowledge Organisation System (SKOS) [4], Dublin Core [5], Friend of a Friend (FOAF) [6] and the Semantically-Interlinked Online Communities Project (SIOC) [7]. Additionally, there are formats to embed semantic annotations within web documents, such as embedded RDF (eRDF), Microformats [8] or Resource Description Framework in attributes (RDFa) [9].
All of the above standards serve the need to formally represent knowledge and to facilitate reasoning over this knowledge. They require explicit statements of knowledge. As a consequence, the amount of such structured data is still small in comparison to the amount of unstructured data. Thus, there are essentially two approaches to supporting semantic search: those that search structured documents and reason over them, and those that search unstructured documents, extract knowledge from them, and reason over the extracted knowledge. The knowledge extraction step of the latter uses combinations of natural language processing, information retrieval, text-mining, and ontologies.
The comparison table summarises a number of semantic search engines, which work on structured and unstructured documents. The former comprise Swoogle [10], Semantic Web Search Engine (SWSE) [11], WikiDB [12], Sindice [13], Watson [14], Falcons [15], and CORESE [16]. They include existing RDF repositories and crawl the internet for formal statements, e.g. OWL files. A search retrieves a list of results with URIs. For SWSE and Falcons, the result is enriched with a description and a filtering mechanism for result types. CORESE uses conceptual graphs for matching a query to its databases. WikiDB differs slightly from the others in that it extracts formal knowledge implicit in the meta tags of Wikipedia pages and converts it into RDF, which can then be queried with SPARQL.
Comparison of semantic search engines
As mentioned, the above systems are limited by the availability of structured documents, a problem addressed by approaches such as Semantic MediaWiki [17] and large efforts such as Freebase [18], which provides an environment to author formal statements. The second class of tools works on unstructured text and therefore does not suffer from this limit. The systems can be distinguished by the document source they work on (Web, Biomedical, Wiki), by the use of background knowledge in the form of ontologies, and by the use of text-mining techniques such as stemming, concept identification, and deep or shallow parsing.
Hakia, START [19], Ask.com, BrainBoost (Answers.com), AnswerBus [20], Cuil [21], Clusty [22], and Carrot [23] are engines that work on Web documents. Hakia, START and AnswerBus use natural language processing to understand documents, while Cuil, Clusty and Carrot cluster search results and aim to label clusters with phrases, which are offered as related queries. Cuil, Clusty and Carrot are not semantic search engines in a strict sense, since these phrases are not part of an ontology or vocabulary. However, they do have the benefit of being generally applicable, and Cuil offers definitions for phrases where available. Ask.com uses its ExpertRank, an algorithm for computing query-specific communities and ranking in real-time, to identify relevant pages [24]. They include structured knowledge to generate answers.
BrainBoost is a meta-search engine. It uses the proprietary AnswerRank algorithm applying machine learning and natural language processing. It ranks answers extracted from the top websites.
Wikipedia is a valuable resource to answer questions and hence some engines are specifically applied to it.
PowerSet, for example, applies natural language processing to Wikipedia in a manner similar to Hakia. QuAliM [25] uses a pattern-based approach for sentence analysis. It also implements semantic type checking for answers and a fallback mechanism that uses web search.
The above tools are intended to be general and, as a result, they generally do not cover the biomedical domain well. When searching, for example, for a protein such as Fgf8, PowerSet and Hakia offer information on the protein but are not able to find zebrafish as a model organism.
Engines such as askMedline, EAGLi [26], GoPubMed [27], ClusterMed, IHOP [28], EBIMed [29], XplorMed [30], Textpresso [31] and Chilibot [32] address this by processing biomedical literature in full text (Textpresso) or abstracts as available in the PubMed literature database. With a focused domain, these engines can use background knowledge. GoPubMed and EBIMed use, for example, the Gene Ontology and the Medical Subject Headings (MeSH); XplorMed filters by eight MeSH categories and extracts topic keyword co-occurrences; Chilibot extracts relations and generates hypotheses; IHOP uses genes and proteins as hyperlinks between sentences and abstracts; EAGLi and askMedline process questions as input for the search.
Finally, besides all of the automated approaches, Google, Yahoo! and Microsoft use humans to answer questions in their services Google Answers, Yahoo! Answers and MSN Live Search QnA.
Closely related to semantic search is semantic hyperlinking as implemented in the Conceptual Open Hypermedia Service (COHSE). COHSE annotates a given web page with concepts and offers services based on the identified concepts [33].
None of the above systems combines the simplicity of keyword search over the vast amount of Web documents with the use of biomedical background knowledge, in the form of biomedical ontologies, to filter large keyword results. Here, we address this by introducing the GoWeb search engine. GoWeb issues queries to Yahoo and indexes the returned snippets semantically with ontology terms. These terms are then offered as filters, so that results can be narrowed by concept. In order to demonstrate the power of this approach in question answering, we evaluate GoWeb on three benchmarks with questions on gene/function, symptom/disease, and protein/disease relationships and compare it to existing solutions.
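The general idea of indexing snippets with ontology terms and filtering by concept can be sketched as follows. This is only an illustration under stated assumptions, not GoWeb's actual implementation: the tiny `ONTOLOGY` dictionary, its synonym lists, and the functions `annotate`, `build_index` and `filter_by_concept` are hypothetical, the concept matching is simple whole-phrase string matching, and the snippets are passed in as plain strings rather than fetched from a search service.

```python
import re
from collections import defaultdict

# Hypothetical mini-ontology: concept label -> synonyms (all lower case).
ONTOLOGY = {
    "Zebrafish": ["zebrafish", "danio rerio"],
    "Neoplasms": ["cancer", "tumor", "tumour"],
    "Wnt signaling pathway": ["wnt signaling", "wnt signalling"],
}


def annotate(snippet):
    """Return the set of concepts whose synonyms occur as whole phrases in the snippet."""
    text = snippet.lower()
    found = set()
    for concept, synonyms in ONTOLOGY.items():
        for synonym in synonyms:
            if re.search(r"\b" + re.escape(synonym) + r"\b", text):
                found.add(concept)
                break  # one matching synonym is enough for this concept
    return found


def build_index(snippets):
    """Map each concept to the positions of the snippets that mention it."""
    index = defaultdict(set)
    for position, snippet in enumerate(snippets):
        for concept in annotate(snippet):
            index[concept].add(position)
    return index


def filter_by_concept(snippets, index, concept):
    """Return the snippets annotated with `concept`, in their original order."""
    return [snippets[i] for i in sorted(index.get(concept, ()))]
```

In this sketch, the keys of the index play the role of the concepts offered to the user, and selecting one of them reduces a keyword result list to the snippets annotated with that concept.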