Ten additional species were added to the database using the existing PhenoGO data extraction pipeline. Gene Ontology annotations for Schizosaccharomyces pombe, Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, Danio rerio, Gallus gallus, Homo sapiens, Bos taurus, Mus musculus, and Rattus norvegicus were downloaded from the current annotations section of the Gene Ontology website at http://geneontology.org. Phenotypic associations were made using a combination of natural language processing and computational terminology methods. The natural language processing approach applied the BioMedLEE NLP engine [25] to a list of PubMed abstracts to derive annotated lists of genes, their related GO terms, and phenotypic associations. Additional mappings were derived from the existing MeSH annotations found in the abstracts. The resulting output was then processed with the PhenOS system, yielding the final gene-GO-phenotype entries. The method is described in detail in [29].
Diseases were annotated by extending the original processing pipeline, which was designed for the annotation of cellular and anatomical contexts. First, the two paths of the encoding pipeline were modified to handle phenotypic contexts associated with diseases and clinical findings. Disease- and clinical finding-related semantic types from the UMLS were introduced into the BioMedLEE knowledge base to supplement the NLP-driven encoding of disease phenotypes, while disease-associated MeSH headings were added to the system to enable direct extraction of these annotations. To ensure consistency, disease- and clinical finding-associated MeSH headings and UMLS terms were chosen using the same semantic type filtering rules. Additionally, grammar rules from the MedLEE system specific to the recognition of diseases and clinical findings were added to the BioMedLEE ruleset to enable the encoding of the new class of contexts [24].
The gene accession number-GO code-phenotype entries resulting from this pipeline are enriched with full-text annotations for terms and names to enhance data readability and searchability. A series of Perl scripts matches gene accession numbers and GO identifiers to their names and descriptions. Data correlating identifier codes and accession numbers are taken from the Gene Ontology description files and the gene description files from UniGene, UniProt, MGI, RGD, SGD, WormBase, and FlyBase.
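The enrichment step amounts to a lookup join between identifier-only entries and description tables. The following is a minimal sketch of that idea in Python rather than Perl; it is not the pipeline's actual code. The `Entry` record, the `enrich` function, the field names, and the two small lookup tables are hypothetical and stand in for the description files named above.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """Identifier-only record, as assumed for this sketch."""
    gene_accession: str
    go_id: str
    phenotype: str


def enrich(entries, gene_names, go_terms, missing="(unknown)"):
    """Attach readable gene names and GO term names to each entry.

    gene_names and go_terms are plain dicts mapping identifiers to names;
    identifiers with no match receive the `missing` placeholder.
    """
    enriched = []
    for e in entries:
        enriched.append({
            "gene_accession": e.gene_accession,
            "gene_name": gene_names.get(e.gene_accession, missing),
            "go_id": e.go_id,
            "go_term": go_terms.get(e.go_id, missing),
            "phenotype": e.phenotype,
        })
    return enriched


# Illustrative lookup tables (assumed, not taken from the database).
gene_names = {"P04637": "TP53"}
go_terms = {"GO:0006915": "apoptotic process"}

entries = [Entry("P04637", "GO:0006915", "carcinoma")]
rows = enrich(entries, gene_names, go_terms)
```

Keeping the lookup tables separate from the entries means a refreshed description file only requires rerunning the join, not the upstream extraction.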
A web portal was developed to provide access and filtering functionality for the database. This portal provides two modes of querying the data. The first is a simple query, which users encounter on the front page of the portal. It allows for a search across all the fields of the database, including PubMed ID, gene accession number, gene name, gene description, GO ID code, GO term name, phenotype or experimental context code, and phenotype or experimental context description. This query mechanism is designed to return a large number of results from the database, essentially corresponding to a logical OR query over all the query terms. An advanced query system is also available to provide more exact results. The advanced query allows for searches based on the same fields as the basic interface; however, it returns only results that pass a number of strict criteria. This equates to a logical AND query between all the search terms specified by the user in specific fields. The interface also makes use of the structured organization of the Gene Ontology, the UMLS, and the Cell Ontology to provide hierarchical query functionality for the GO and context fields. This is done by generating a number of ancestor-descendant tables, which are recursively processed at query time to determine all descendants, or descendants and subclasses, of user-specified context or GO terms.
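The two query modes and the hierarchical expansion can be illustrated with a small in-memory sketch. This is an assumption-laden illustration, not the portal's implementation: the function names, the row layout, the parent-child pair representation of the ancestor-descendant table, and the identifiers `GO:A`, `GO:B`, `GO:C`, and `GO:X` are all hypothetical.

```python
from collections import defaultdict


def build_children(pairs):
    """Index (parent, child) pairs from an ancestor-descendant table."""
    children = defaultdict(set)
    for parent, child in pairs:
        children[parent].add(child)
    return children


def descendants(term, children, seen=None):
    """Recursively collect a term together with all of its descendants."""
    if seen is None:
        seen = set()
    if term in seen:  # guards against revisiting shared sub-branches
        return seen
    seen.add(term)
    for child in children.get(term, ()):
        descendants(child, children, seen)
    return seen


def simple_query(rows, terms):
    """Logical OR: keep a row if any term occurs in any of its fields."""
    terms = [t.lower() for t in terms]
    return [r for r in rows
            if any(t in str(v).lower() for t in terms for v in r.values())]


def advanced_query(rows, criteria, children):
    """Logical AND: every field criterion must hold.

    The 'go_id' criterion is expanded to the term plus its descendants.
    """
    def matches(row):
        for field, value in criteria.items():
            allowed = descendants(value, children) if field == "go_id" else {value}
            if row.get(field) not in allowed:
                return False
        return True
    return [r for r in rows if matches(r)]


# Hypothetical hierarchy: GO:A is the parent of GO:B, which is the parent of GO:C.
children = build_children([("GO:A", "GO:B"), ("GO:B", "GO:C")])
rows = [
    {"gene": "g1", "go_id": "GO:C", "context": "liver"},
    {"gene": "g2", "go_id": "GO:X", "context": "liver"},
]
hits = advanced_query(rows, {"go_id": "GO:A", "context": "liver"}, children)
```

In this sketch, a search on the ancestor `GO:A` retrieves the row annotated with its descendant `GO:C`, while the row annotated with the unrelated `GO:X` is excluded even though its context matches.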
The comprehensive evaluation was completed independently by two reviewers, each of whom reviewed 300 entries from the human and mouse subsets of the database. These entries were randomly retrieved directly from the PhenoGO MySQL database in sets of 100 entries, stratified by context type. Four context types are defined by the BioMedLEE NLP engine: 'cell', covering annotations pertaining to cells and cell types; 'anatomy', covering annotations related to anatomies and morphologies; and 'problem' and 'problemdescr', describing diseases and disorders. Because of their similarity, the context types 'problem' and 'problemdescr' were merged into a general class encompassing both diseases and clinical phenotypes. This class was evaluated using 50 random entries examined independently by two reviewers. Confidence intervals were calculated using the equation for the confidence interval of a proportion.
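The text does not state which form of the proportion interval was used. As a minimal sketch, assuming the normal-approximation (Wald) interval p ± z·sqrt(p(1 − p)/n), the calculation could look as follows; the function name `proportion_ci`, the default z = 1.96 (a 95% level), and the example counts are assumptions for illustration only.

```python
import math


def proportion_ci(successes, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion.

    successes: number of entries judged correct
    n:         number of entries reviewed
    z:         critical value (1.96 is assumed here for a 95% level)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    # Clamp to [0, 1], since the approximation can overshoot near the ends.
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical counts, not results from the evaluation.
low, high = proportion_ci(45, 50)
```

With small samples such as 50 entries, or proportions near 0 or 1, the Wald interval is known to be inaccurate, and a Wilson interval would be a common alternative.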
Our evaluation metrics were structured such that a true positive is scored only when the pipeline both accurately encodes a phenotype and associates it with its corresponding gene-GO pair. Precision was measured by manually evaluating the entries from the random draw and determining the percentage of correct annotations out of the total drawn entries. Recall was evaluated by randomly drawing encoded sentences from the NLP-evaluated literature and computing the fraction that appeared in the encoded dataset.
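The arithmetic behind these two metrics can be sketched as follows. The reviewer judgements themselves were made manually; this sketch only shows how such judgements might be tallied, and the function names, argument layout, and example values are hypothetical.

```python
def is_true_positive(phenotype_correct, association_correct):
    """A true positive requires both a correct phenotype encoding and a
    correct link to the gene-GO pair."""
    return phenotype_correct and association_correct


def precision(judgements):
    """Fraction of drawn entries scored as true positives.

    judgements: list of (phenotype_correct, association_correct) pairs,
    one per manually reviewed entry.
    """
    if not judgements:
        raise ValueError("no entries were reviewed")
    hits = sum(1 for ph, assoc in judgements if is_true_positive(ph, assoc))
    return hits / len(judgements)


def recall(sampled_sentence_ids, encoded_sentence_ids):
    """Fraction of randomly drawn sentences that appear in the encoded dataset."""
    if not sampled_sentence_ids:
        raise ValueError("no sentences were sampled")
    encoded = set(encoded_sentence_ids)
    found = sum(1 for s in sampled_sentence_ids if s in encoded)
    return found / len(sampled_sentence_ids)


# Hypothetical judgements and sentence identifiers, for illustration only.
p = precision([(True, True), (True, False), (True, True), (False, True)])
r = recall(["s1", "s2", "s3", "s4"], ["s1", "s3", "s4", "s9"])
```

Note that an entry with a correctly encoded phenotype but a wrong gene-GO association counts against precision, which makes the metric stricter than one that scores phenotype encoding alone.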