High throughput experiments such as sequencing projects, microarrays and proteomic methods produce huge amounts of data, usually in the form of long lists of genes or proteins. Proper annotation of such lists is crucial for understanding and interpretation of experimental results.
The best annotation quality is still generated by curators who manually review the existing literature. This approach for describing genes and/or their products is utilized by most large projects like FlyBase [1
], UniProt [2
] and others. The best known and most widely used public annotation database is Gene Ontology (GO) [3
], which maintains a controlled vocabulary (called terms) organized into three hierarchies: biological process, molecular function and cellular component. These GO terms are utilized by e.g. the GOA project [4
] that provides Gene Ontology annotations for the UniProt database, International Protein Index (IPI) and other major databases such as Ensembl and NCBI. All GO annotations can also be searched by gene/protein name or GO term using the AmiGO [5
] official browser.
Beside Gene Ontology annotations, manual descriptions are also generated by a number of other projects. Examples include KEGG [6
] which is focused on metabolic pathways, and UniProt which provides, in the comments section, biological knowledge usually directly linked to publications.
The existing manual annotation systems, despite their undoubted importance, are also acknowledged to have serious limitations. One of the main obstacles of using manually curated descriptions is that high coverage is available mostly in very broad categories. In the case of GO, diving deeper into their DAG structure shows the annotations are not of equal quality and a large proportion of them come from automatic computer predictions (i.e using motif databases or simple sequence similarity) [7
]. Broad protein classes (methyltransferases, kinases, phosphatases, etc) are assigned and are automatically inherited if a "new" protein has a domain with a GO label attached. As manually curated annotation databases are notoriously incomplete (see e.g. [7
]) this raises the problem of using more specific GO terms for statistical analysis of gene datasets.
Another hindrance is that many genomes are poorly, or not at all covered by GO terms (or other manually curated annotations), even for such prominent species as Gallus gallus
or the slime mold [8
]. Computational methods have been developed to at least partially overcome the coverage problems mentioned. BLAST2GO [9
] or KAAS [10
] both based on sequence similarity, help "transfer" annotations between orthologious genes/proteins. Another way to circumvent the coverage problem is to use published information about genes/proteins and automatically categorize them based on specific features (e.g. combinations of keywords/terms). Some services that analyze texts in this aspect include GOCAT [11
] which can categorize any text according to its similarity to GO vocabulary, or GOAnnotator [12
] which acts similarly by fetching possible text evidences for electronic GO annotations.
Such approaches generally give quite good results and are currently intensively being developed. Better performance is reached by introducing new aspects in the scoring methods, such as proximity between words in text and weighing words according to the amount of information they carry [13
]. It is possible to compare various computational approaches in a contest organized by the BioCreAtive initiative [14
Unfortunately using annotations based on GO terms has one more serious disadvantage, which is the paucity of the language used. Gene Ontology categorizes language in order to decrease its complexity and to provide a unified way to describe certain biological aspects. Although in some cases this controlled vocabulary brings significant advantages, it also de facto
means that annotators have to discard specific information or features present in publications in order to comply with a GO term [3
]. This becomes a highly visible problem when one tries to differentiate between proteins with similar but not identical functions – (e.g H4 and H2B-Alpha from S. pombe
). Also not all of the concepts are covered by the GO thesaurus (e.g. cellular component-apicoplast).
Nevertheless, GO terms are frequently used to analyze sets of genes to discover over-representation of particular functions or categories in the dataset. Dedicated software such as GeneMerge [15
] or GeneTools [16
] is available for this type of analysis.
We still lack annotation services, which could annotate genes/proteins using sequence homology and multiple sources of information (created both manually and automatically). Popular services like SMART [17
], PFAM [18
] or PROSITE [19
] can be searched using sequence homology but they gather only manual annotations (manually described protein domains and protein motifs). InterPro [20
] from EBI combines most of the different protein signature databases into one sequence analysis service. With IntePro2go [21
] one can additionally associate GO terms with different InterPro entries. The main problem with this system is the contrasting specificity of IntePro signatures: some of them are very frequent in nature and associated with very broad molecular functions, while others are highly specific. Also all these services/databases use manually curated information more or less shared among them.
Taking into account the decreasing costs of high-throughput experiments and the increasing number of organisms studied, there seems to be an expanding gap between available manual annotations and the demand to describe novel genomes, proteomes, etc. A promising way to bridge this gap is to develop and/or improve methods for automatic annotation generation. Such methods should be based both on available manual curation and on automatic information extraction from other sources. Also a recent publication by Jaeger et. al.
] shows other types of data such as conserved networks (including ortholog) can be utilized to create functional annotations based on literature analysis.
Taking the literature into consideration brings advantages to the annotation process: publication abstracts are currently the vastest source of biological data available, which grows almost exponentially and contains information not yet found or described by manual curators. It is also valuable that various statistical approaches can be applied to extract informative and specific textual features from a set of documents [23
]. Additionally, by processing the "raw" texts we omit the language paucity problem and enrich the retrieved knowledge with additional keywords like frequently used synonyms of gene names, chemical compounds, functionally interacting proteins, etc.
There are services implementing this idea. MedBlast [25
] is a simple web-based application which can gather Medline abstracts connected with a given gene. It uses BLAST to find abstracts linked to homologous sequences but can also find abstracts with the gene and organism name derived from annotated protein entries. The output is a set of Medline documents which should be read by the user. A more sophisticated approach was proposed by Attwood and co-workers. METIS [26
] also uses BLAST homology search to gather literature (abstracts), but the collected text is passed into two sentence classification components (set of SVMs and BioIE [27
]). They find and categorize informative sentences, relating them to: structure, function, disease and localization categories. METIS also calculates simple statistical properties, i.e. n-grams of word distribution and frequently occurring words. Unfortunately METIS is unable to annotate more than one protein during one run. It also displays only frequently occurring words, which often appear in the whole Medline, thus are neither very informative nor specific, e.g. cell, protein, etc.
To help automate the annotation of a large set of proteins using literature derived knowledge we created HT-SAS (High Throughput Sequence Annotation Service). It brings homology search and textual feature extraction from literature together and is easily operated through a user friendly web interface. The input for HT-SAS are protein sequences. Homology search is performed using BLAST algorithm [28
] against the UniProt database. HT-SAS utilizes links between UniProt protein entries and Medline to gather publication abstracts. The obtained sets of abstracts are searched for statistically important words using a modified TF-IDF approach. This feature extraction approach is more informative than simple word frequency and provides specific keywords aiding the annotation task. It can often provide significant keywords even for a limited number of abstracts, which is usually the case for poorly described or unknown proteins.
The interactive web interface allows easy access to source publications, homology results, and also Gene Ontology and UniProt manual annotations, if available. The results page is generated in the form of sortable tables for fast result overview and comparison. The URL is: http://htsas.ibb.waw.pl