In this article, we have described a novel function prediction algorithm, ESG, which extracts function annotation from a sequence similarity space extended by iterative database searches. To clarify the characteristics of ESG, we compared its performance with that of two other methods, Top PSI-BLAST and PFP. Top PSI-BLAST represents the typical homology search used in large-scale genome annotation, where annotation is transferred to the query protein from the most significant hit in a search. PFP is a method previously developed by our group that has proved successful in automated function prediction. On the benchmark dataset of 2400 sequences taken from 12 organisms, ESG consistently showed the best funsim score among the three methods. Top PSI-BLAST performed significantly worse than both ESG and PFP.
Here we briefly discuss the differences in design and concept between PFP and ESG. PFP extracts relevant GO annotation even from sequences with insignificant E-values by summing scores that reflect the E-value of each retrieved sequence. FAM further expands PFP's sensitivity to capture related GO terms. ESG, in contrast, restricts the GO terms it predicts by examining how consistently each term appears across a number of searches in the vicinity of the query sequence in the sequence similarity space. Thus, PFP is designed to enhance sensitivity, while ESG has better precision for GO term prediction. Another difference is that ESG assigns a probability to each predicted GO term using a rigorous statistical framework, whereas PFP assigns a custom PFP score to GO terms and computes the P-value from the background distribution of that score.
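The contrast between the two designs can be illustrated with a toy sketch. This is not the published PFP or ESG scoring: the weight function `evalue_weight` (with its `shift` parameter), the averaging in `consistency_scores`, and the example hits are all assumptions made only to show why summing over hits favors sensitivity while requiring agreement across several searches favors precision.

```python
import math
from collections import defaultdict

def evalue_weight(evalue: float, shift: float = 2.0) -> float:
    """Assumed weight for a hit: -log10(E) + shift, floored at zero."""
    return max(0.0, -math.log10(evalue) + shift)

def summed_scores(hits):
    """Sensitivity-oriented: add up the weights of all hits carrying each GO term."""
    scores = defaultdict(float)
    for evalue, terms in hits:
        weight = evalue_weight(evalue)
        for term in set(terms):
            scores[term] += weight
    return dict(scores)

def consistency_scores(searches):
    """Precision-oriented: average, over searches, of each term's share of hit weight."""
    if not searches:
        return {}
    totals = defaultdict(float)
    for hits in searches:
        norm = sum(evalue_weight(evalue) for evalue, _ in hits)
        if norm == 0:
            continue  # this search contributes nothing
        for term, score in summed_scores(hits).items():
            totals[term] += score / norm
    # Terms absent from a search implicitly contribute zero for that search.
    return {term: total / len(searches) for term, total in totals.items()}

# Hypothetical hits: (E-value, GO terms annotated to the hit).
first_search = [
    (1e-30, {"GO:0016787", "GO:0009015"}),
    (1e-3, {"GO:0016787"}),
    (5.0, {"GO:0005515"}),
]
second_search = [
    (1e-20, {"GO:0016787"}),
    (1e-2, {"GO:0016787", "GO:0009015"}),
]

summed = summed_scores(first_search)
consistent = consistency_scores([first_search, second_search])

for term in sorted(consistent, key=consistent.get, reverse=True):
    print(f"{term}  summed={summed.get(term, 0.0):6.2f}  consistency={consistent[term]:.3f}")
```

In this toy setting every term seen in any hit receives a nonzero summed score, whereas a term supported only by a single weak hit in one search ends up with a consistency value close to zero and would fall below a probability-style cutoff.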
The biological implication of the success of ESG and PFP is that functional commonalities exist among proteins that are not traditionally considered homologous and, importantly, that such common function can be captured by making use of very weakly similar sequences in a database search. To exemplify this, we examine the ESG search for a query protein, P76216. P76216 is involved in the biological processes GO:0019544 arginine catabolic process to glutamate, GO:0006525 arginine metabolic process, and GO:0006950 response to stress. It has the molecular functions GO:0009015 N-succinylarginine dihydrolase activity, GO:0005515 protein binding, and GO:0016787 hydrolase activity. ESG predicted all of these terms except GO:0005515 at a probability cutoff of 0.35. This search illustrates that many sequences with an insignificant E-value share annotation with the query and that some of them are recaptured by the second-level searches. Of the 1615 sequences found, 268 have annotations in common with the query. Note that this hit rate is far better than random: a randomly selected set of the same size from the database contains 112 proteins that have at least one annotation in common with the query.
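The comparison with the random set can be made explicit as a short calculation. The counts below are taken from the text; the function and variable names are illustrative only.

```python
def hit_rate(hits: int, total: int) -> float:
    """Fraction of sequences in a set that share annotation with the query."""
    if total <= 0:
        raise ValueError("total must be positive")
    return hits / total

TOTAL_FOUND = 1615    # sequences found by the ESG search for P76216
SHARED_ESG = 268      # of those, sequences sharing annotation with the query
SHARED_RANDOM = 112   # sharing sequences in a random set of the same size

esg_rate = hit_rate(SHARED_ESG, TOTAL_FOUND)
random_rate = hit_rate(SHARED_RANDOM, TOTAL_FOUND)
enrichment = esg_rate / random_rate

print(f"ESG hit rate:    {esg_rate:.1%}")     # 16.6%
print(f"Random hit rate: {random_rate:.1%}")  # 6.9%
print(f"Fold enrichment: {enrichment:.2f}")   # 2.39
```

That is, roughly one in six sequences retrieved by ESG shares annotation with the query, about 2.4 times the rate observed in the random set.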
There is a strong need for accurate automatic function prediction methods as the number of sequenced genomes is rapidly increasing. Various efforts have been made to address this goal, including classification of the large protein sequence space (Kaplan et al.; Loewenstein and Linial, 2008), consideration of protein structures (Yeats et al.), and use of pathway data (Kanehisa et al.). A recent trend is to consider heterogeneous experimental data sources such as microarray and protein–protein interaction data. Although such new data can provide additional functional information, the major sources of functional information still reside in sequence databases. Thus, sequence-based methods should remain at the center of gene function annotation, and they need to be re-examined with a fresh perspective to investigate the complex relationship between sequence and functional similarity. ESG, together with PFP, points to a promising future direction, with strong evidence that weakly similar sequences still hold rich sources of functional information that were previously underestimated.
Funding: National Institutes of Health (GM075004 and GM077905 in parts); National Science Foundation (DMS0604776 and DMS800568 in parts). CP was supported by the Chung-Ang University Research Grants in 2008.
Conflict of Interest: none declared.