|Home | About | Journals | Submit | Contact Us | Français|
Despite advances in the gene annotation process, the functions of a large portion of the gene products remain insufficiently characterized. In addition, the “in silico” prediction of novel Gene Ontology (GO) annotations for partially characterized gene functions or processes is highly dependent on reverse genetic or function genomics approaches.
We propose a novel approach, Information Theory-based Semantic Similarity (ITSS), to automatically predict molecular functions of genes based on Gene Ontology annotations. We have demonstrated using a 10-fold cross-validation that the ITSS algorithm obtains prediction accuracies (Precision 97%, Recall 77%) comparable to other machine learning algorithms when applied to similarly dense annotated portions of the GO datasets. In addition, such method can generate highly accurate predictions in sparsely annotated portions of GO, in which previous algorithm failed to do so. As a result, our technique generates an order of magnitude more gene function predictions than previous methods. Further, this paper presents the first historical rollback validation for the predicted GO annotations, which may represent more realistic conditions for an evaluation than generally used cross-validations type of evaluations. By manually assessing a random sample of 100 predictions conducted in a historical roll-back evaluation, we estimate that a minimum precision of 51% (95% confidence interval: 43%–58%) can be achieved for the human GO Annotation file dated 2003.
The program is available on request. The 97,732 positive predictions of novel gene annotations from the 2005 GO Annotation dataset are available at http://phenos.bsd.uchicago.edu/mphenogo/prediction_result_2005.txt.
During the post-genomic era, annotating gene functions using standardized vocabularies, such as the Gene Ontology (GO), has become a critical task for biologists after massive amounts of genes have been identified and sequenced. GO is organized as a hierarchical structure containing ontological knowledge of biology, which has been manually developed by human experts (Ashburner, et al., 2000). Despite advances in the gene annotation process, the functions of a large portion of the gene products are still insufficiently characterized. For example, though the GO annotations for Homo sapiens genes have increased by 66% from 2003 to 2005, the GO Consortium currently provides annotations for about 16,000 out of about 25,000 human genes, an indication that a large number of molecular functions and biological processes associated with human genes remain to be discovered.
Predicting annotations of gene products generally involves physics-based and knowledge-based approaches. In general, physics-based approaches depend on direct experimental information about genes, while the knowledge-based approaches rely on existing knowledge (e.g., results from physics-based experiments, bio-medical literature, GO Annotation datasets, etc.).
The common basis for the physics-based approaches is that two genes are more likely to share similar annotations if they also possess other analogous aspects of their physical properties. If the annotations of one of the two genes are known, then these can be utilized as predictors for the annotations of the other gene. Several existing data sources may be used with physics-based approaches, including protein sequences (Hennig, et al., 2003; Jensen, et al., 2002; Jones, et al., 2005; Khan, et al., 2003; Vinayagam, et al., 2004), 3D structures of proteins (Laskowski, et al., 2005), protein-protein interactions (Deng, et al., 2003; Deng, et al., 2004; Deng, et al., 2003; Letovsky and Kasif, 2003; Schwikowski, et al., 2000; Vazquez, et al., 2003), and gene expressions (Walker, et al., 1999; Yu, et al., 2005; Zhou, et al., 2005). The physics-based techniques associated with each of these data sources are summarized in the following sections.
Sequence-based approaches often start by determining those genes which have similar sequences to the gene in question. The most frequently used measurement for ascertaining gene sequence similarity is the BLAST (Altschul, et al., 1990) algorithm (Hennig, et al., 2003; Jones, et al., 2005; Khan, et al., 2003; Vinayagam, et al., 2004). When using the best BLAST method, Jones et al. report a precision of 41% and recall of 56%, while Khan et al. report a precision of 65%. Vinayagam et al. state an average precision of 80% when utilizing BLAST in combination with organism-wide cross validations. Additionally, other sequence-derived features, such as local consensus sequence motifs, features associated with post-translational modifications and protein sorting, can be used (Jensen, et al., 2002). After finding similar sequences, subsequent algorithms can be applied to determine the relevant target annotations from a set of the candidate annotations associated with the similar sequences. Such algorithms include simple voting schemes (Khan, et al., 2003), taxonomy-based approaches (Jones, et al., 2005) and machine learning algorithms, such as support vector machine (Vinayagam, et al., 2004) and neural networks (Jensen, et al., 2002).
Prediction of protein functions through the use of 3D structures depends on calculations of local structure similarity between the protein in question and the available 3D templates (Laskowski, et al., 2005). Laskowski et al. report a good accuracy for their method: a value of 0.82 for the area under the Receiver Operating Characteristic (ROC) curve, where the optimal score is 1.
When inferring protein functions from protein-protein interactions, a function is assigned to a protein based on the functional labels of its interacting proteins. There are several methods for determining the final function assignment, including majority rule (Schwikowski, et al., 2000), probabilistic approaches (Letovsky and Kasif, 2003) and global optimization techniques (Vazquez, et al., 2003). Utilizing a majority rule approach, Schwikowski et al. report a precision of 72%, while Letovsky et al. achieve a precision of 85% with a recall of 34% through the application of probabilistic techniques. Vazquez et al. obtained success rates between 0.4 and 0.94.
Gene annotations can also be predicted by finding hypothetical annotations for genes possessing similar expression patterns, assuming that genes may share the same functions if they are expressed in similar patterns (Butte, et al., 2000; Kuramochi and Karypis, 2001; Walker, et al., 1999; Wolfe, et al., 2005; Yu, et al., 2005; Zhou, et al., 2005). In particular, Kuramochi and Karypis report precisions ranging from 25% to 50%, and recalls ranging from 40% to 56% using different computational algorithms.
In contrast to physics-based predictions that require significant laboratory experiments, knowledge-based approaches rely on existing knowledge sources. One such knowledge-based approach is a hybrid technique, which combines multiple physics-based methods. Since it remains difficult to establish a prediction if only a few physical properties of a gene are known, it follows that a predictive method based on a single physics-based measurement would be more limited in its predictive capability than hybrid approaches relying on more than one physical property. Hybrid approaches can improve the accuracy of protein function prediction as compared to a single information source (Chen and Xu, 2004; Joshi, et al., 2004; Kemmeren, et al., 2005; Troyanskaya, et al., 2003). For example, Botstein and Colleagues describe a Bayesian network that incorporates expert knowledge about the predictions accuracies based on various data sources, which could increase the proportion of true positives by 17% (Troyanskaya, et al., 2003). Bayesian approaches have also been used in identifying candidate genes for complex disorders (Bertsch, et al., 2005). Alternatively, Aerts et al. recently reported on the use of order statistics (one of fundamental tools in non-parametric statistics) to combine multiple heterogeneous data sources for prioritizing candidate genes for specific diseases (Aerts, et al., 2006).
Existing biomedical literature, a resource previously untapped by traditional well-curated physics-based databases, can provide additional knowledge. Techniques for extracting knowledge include the use of simple keyword-based mappings between databases (Perez, et al., 2004), statistical analysis of keywords (Andrade and Valencia, 1998), and maximum entropy analysis by Altman and colleagues (Raychaudhuri, et al., 2002). For example, Butte et al. have constructed a comprehensive phenome-genome networks using text mapping techniques (Butte and Kohane, 2006). They extracted phenome-genome relations from the text in the Gene Expression Omibus (GEO) (Wheeler, et al., 2000) and mapped this text to Unified Medical Language System (UMLS) concepts (Lindberg, 1990). Alternatively, others have used Natural Language Processing (NLP) techniques based on language patterns (Chiang and Yu, 2003; Koike, et al., 2005) and grammars (Chen and Friedman, 2004) to detect relationships among concepts present in the biomedical literature. For example, studies conducted by Perez et al. show a resulting recall of 8% and precision of 67% for assigning GO terms for genes using text indexing (Perez, et al., 2004). Higher accuracies have been reported upon with respect to utilizing NLP techniques. For instance, we report 64.0% precision and 77.1% recall (Chen and Friedman, 2004) for automatically extracting phenotypes for specific genes. Two advantages of knowledge-based techniques are that they can (1) increase prediction accuracy for gene-phenotype relations by combining multiple physics-based techniques with existing knowledge sources, and (2) make knowledge embedded in literature more accessible. However, these knowledge-based techniques are limited by their dependence on pre-existing knowledge.
Of particular interest to this study are prediction methods based on GO annotation patterns. In contrast to independently mining the biomedical literature or utilizing heterogeneous datasets resulting from physics-based measurements, the Gene Ontology Annotations provide standardized and integrated gene function annotations, incorporating relevant literature and physics-based measurements, many of which have been manually curated. It is therefore a unique data source for inferring such annotations based on multiple physical properties. Roth and colleagues proposed the first accurate methods for doing so that are solely based on existing GO annotation patterns (King, et al., 2003). They used machine learning algorithms to analyze the GO annotation patterns and predict gene functions (King, et al., 2003). Using known GO annotations as a gold standard, King et al. obtained a precision of 97.7% using decision trees, and a precision of 93.7% and recall of 50% using 10-fold cross validation techniques over Bayesian networks incorporating 170 specifically selected GO terms from the Saccharomyces Genome Database (SGD) (Cherry, et al., 1998). These 170 GO terms were selected from a corpus of 2,261 such concepts in SGD dataset based on certain criteria; the most significant criteria was that selected GO terms had to be referenced in at least 10 annotations. Roth and colleagues also obtained a precision of 87.5% using decision trees, and a precision of 78.7% and recall of 50% using 10-fold cross validation techniques over Bayesian networks composed of 218 specifically selected GO terms from FlyBase (Mitchell, et al., 2003). These 218 GO terms were selected from a total of 3,859 such concepts in the FlyBase dataset; they were annotated with at least 10 genes, and were subject to other restrictions.
In this paper, we describe a novel technique, named Information Theory-based Semantic Similarity (ITSS), for predicting new gene annotations based exclusively on existing GO annotations, and present the results of an evaluation, which show a higher recall than the previously reported methods. Studies have shown that information theory-based semantic similarity between gene GO annotations were highly correlated to the similarity between gene sequences (Lord, et al., 2003; Lord, et al., 2003) and the similarity between gene expression profiles (Wang, et al., 2005; Wang, et al., 2004). Butte and Kohane have mined molecular networks with a related information theoretic method to cluster related genes or Relevance Networks(Butte, et al., 2000). Therefore, in light of the proven capability of methods for predicting gene annotations from gene sequences and expression profiles, it is possible to speculate that the semantic similarity of gene annotations can be used as another potential information source for gene annotation prediction. In this research, we hypothesize that semantic similarity measurements between groups of concepts based on information theory can be used to predict new annotations associated with a gene. The basis of the ITSS approach we propose is a K-Nearest Neighbor algorithm (Duda and Hart, 1973) that uses information theory-based semantic similarity measurement as the metric for assigning new edges (relationships) to a node (concept) in a network. The predictions described in this paper rely on the calculation of two semantic similarity scores: (1) between two genes concepts in the Gene Ontology Annotations, and (2) between two groups of Gene Ontology concepts within the ontology. These calculations are generic because they can calculate a semantic similarity value for any pair of concepts or for any two groups of concepts as long as the concepts are in the same ontology. As a result, our proposed algorithm is also generic.
The significant accomplishments achieved through the use of this innovative technique are that it (1) relies on the ontological knowledge of GO using semantic similarity scores to calculate predictions of novel gene annotations, and (2) provides more predictions than previously evaluated prediction methods. The calculations of semantic similarity between two concepts, or groups of concepts, have been performed extensively in the domain of computer science for information retrieval and natural language processing tasks (Jiang and Conrath, 1999; Lee, et al., 1993; Rada and Bicknell, 1989). In addition, the use of semantic similarity as a metric for k-nearest neighbor machine learning tasks has also been reported upon in the computer science literature (Yuseop, et al., 2001). Recently, calculations of semantic similarity have also been utilized within the biological domain for predicting protein-protein interaction networks (Wu, et al., 2006), and investigating the relations between gene sequences and GO annotations (Lord, et al., 2003; Lord, et al., 2003), and between gene expression profiles (microarrays) and GO annotations (Wang, et al., 2005; Wang, et al., 2004). Previous studies have integrated semantic similarity and k-nearest neighbor methodologies to improve missing value estimations in microarray data (Tuikkala, et al., 2006), and analyze gene expression data in coordination with an ontology-driven clustering method (Wang, et al., 2005). However, to our knowledge, the proposed method is the first use of an information theory-based semantic similarity approach for assigning novel gene functions to known genes directly from the geometry of the network of the Gene Ontology Annotations and the overarching Gene Ontology. We use knowledge from the GO hierarchies to derive predictions with the hopes that by maximizing the number of utilized GO concepts, these predictions will be based upon as much information as possible. The accuracy of the ITSS method has been established for a significantly broader number of GO annotations than previous methods (King, et al., 2003), which were evaluated over a small number of GO terms that were associated with at least ten genes . To our knowledge, no prediction method has been demonstrated to be accurate for GO terms having less than ten associated genes; the vast majority of GO terms utilized in the annotations have less than ten associated genes (e.g., 82.5% of GO terms in Homo sapiens annotations are associated with less than ten genes).
We have developed the Information Theory-based Semantic Similarity (ITSS) algorithm to assign unknown annotations to a gene based on the similarity of its known annotations and those of other genes. In this section, we will first introduce the algorithm used for calculating semantic similarity between any two concepts. Next, we will describe the algorithm for calculating the semantic similarity between any two groups of concepts, and last, we will explain how semantic similarity can be used as a metric in the k-nearest neighbor algorithm for predicting new GO annotations for a gene.
The first algorithm is for calculating the semantic similarity between “any two concepts” in an ontology (e.g. Gene Ontology). For example, the simplified ontology seen in Figure 1a consists of nine different concepts. “Any two concepts” means that the algorithm can be used to calculate the semantic similarity between any two concepts, including identical concepts.
The Gene Ontology (GO) is comprised of three sub-ontologies, “molecular functions”, “cellular components”, and “biological processes”. Because these three sub-ontologies contain disjoint types of entities, they are considered to be different ontologies in our methods. Therefore, the algorithms described in this section will calculate the semantic similarity between any two concepts from the same sub-ontology in GO. If the two concepts are in different sub-ontologies of GO, then semantic similarity is equal to be zero. For example, the semantic similarity can be calculated between the two concepts “oxidoreductase activity” and “peptidase activity”, which are both from the same sub-ontology of GO, “molecular function”.
There are generally three main algorithms, based on information theory, for calculating the semantic similarity between two concepts in an ontology, which were respectively proposed by Resnik (Resnik, 1995), Lin (Lin, 1998) and Jiang (Jiang and Conrath, 1997). More detailed descriptions of these algorithms, and their relevance within the biomedical domain, can be found in a recent article by (Pedersen, et al., 2006). In our study, we used Lin’s algorithm because it returns a normalized value between 0 and 1. Lin’s algorithm for calculating the semantic similarity between concepts a and b is defined as:
To compute the semantic similarity between “protein binding” and “single-stranded DNA binding”, we note that “protein binding” has 561 descendants, “single-stranded DNA binding” has 2 descendants, and the entire “molecular function” hierarchy contains 7063 concepts. Thus p(“protein binding”)=(1+561)/7063=0.0796 and p(“single-stranded DNA binding”)=(1+2)/7063=0.000425. Their minimum “subsumer” is “binding”, which has 961 descendants with p(“binding”)=(1+961)/7063=0.136. Therefore, the semantic similarity according to Lin’s algorithm is 2×(−log0.136)/[−log0.0796−log0.000425]=0.388.
The second algorithm calculates the semantic similarity between “any two groups” of concepts within an ontology based on the similarity between a pair of GO concepts calculated as described in the first step. “Any two groups” of concepts means that the two groups can be obtained in any way as long as they are all in the same ontology. For example, using the ontology seen in Figure 1b, we can arbitrarily select groups of concepts, such as groups I, II, and III. The semantic similarities can be calculated between any two of these arbitrarily defined groups. These groups can also share identical concepts as shown in groups II and III.
In this particular research, we define a group of concepts as those GO concepts that are associated with a single gene. For example, all of the concepts within the “molecular function” sub-ontology that are associated with the gene BRCA1 (breast cancer 1, early onset) compose a group, which contains the concepts “DNA binding”, “protein binding” and “transcription coactivator activity”. All of the concepts within the “molecular function” sub-ontology that are associated with the gene BRCA2 (breast cancer 2, early onset) comprise another group, which contains the concepts “nucleic acid binding”, “protein binding”, and “single-stranded DNA binding”. The semantic similarity between these two groups tells how similar the genes BRCA1 and BRCA2 are in terms of their molecular functions.
Based on the previously described measurement for determining the degree of similarity for a pair of concepts, we used the following “pair-wise” method for calculating the semantic similarity between two groups of concepts within an ontology. The pair-wise algorithm (Jiang and Conrath, 1999) was compared to the “cross-join” algorithm (Wang, et al., 2004), and was found systematically superior in three preliminary studies. Before performing the semantic similarity calculation, the concepts within one group are paired-up with those of another group. This pairing process is illustrated in Figure 2. First, for each concept in group A, the most similar concept is found in group B. Then, for each concept in group B, the most similar concept is found in group A. If two concepts across the groups are reciprocally found to be most similar to one another, these two concepts are considered to be a pair. All of the reciprocal pairs constitute a set P. The “pair-wise” formula for two groups of concepts is:
Based on the metric of semantic similarity between two concepts or two groups of concepts, the ITSS method employs the simple k-nearest neighbor classification algorithm (Duda and Hart, 1973) to predict new annotations for a gene.
The process of k-nearest neighbor is illustrated in the example in Table 1, and detailed below:
To obtain the best predictions, there are two parameters of the ITSS algorithm that must be optimized: (1) number of neighbors (k value), and (2) similarity threshold (t value in Equation 2). The optimal k value was determined by randomly selecting 100 or 500 genes to comprise a testing set and applying different values for k. The optimal threshold value t was determined by using the entire datasets of genes. The values of k and t were judged as optimal when the prediction F values are maximal.
The performance of the different prediction algorithms was assessed by comparing the areas under the resulting Receiver Operating Characteristic (ROC) curves. Detailed descriptions for calculating and comparing the areas under ROC curves can be found in two papers published by Hanley and McNeil (Hanley and McNeil, 1982; Hanley and McNeil, 1983). To summarize, the area A under an ROC curve is calculated using the “trapezoidal rule”. The standard error of the area under an ROC curve is calculated using the following equation:
Where A is the area under the curve, na and nn are the number of positive and negative results, respectively, taken from the gold standard, and Q1 and Q2 are estimated by Q1=A/(2-A) and Q2=2A2/(1+A). The standard error of the difference between two areas A1 and A2 is calculated using the equation:
The z score is equal to |A1-A2|/SE(A1-A2), which indicates how far and in which direction the observation deviates from its distribution's mean expressed in units of its distribution's standard deviation. The multiple comparisons were adjusted using the most conservative approach, a Bonferroni-type correction (Sokal and Rohlf, 1995), for multiple posteriori comparisons with two types of random controls.
We used a Gene Ontology file that contains hierarchical relations organized as three exclusive axes of biological concepts. The version of GO used in the study, dated August 2005, was downloaded from http://www.geneontology.org/GO.downloads.shtml, and contains 9,633 distinct Biological Processes (P), 1,570 distinct Cellular Components (C) and 7,063 distinct Molecular Functions (F), excluding the 1,000 terms annotated as obsolete.
Two GO Annotation files for Homo sapiens, which contain annotations relating human genes to their Biological Processes (Ps), Molecular Functions (Fs) and Cellular Components (Cs) in GO, were used in the study. The older, historical GO annotation file, referred to as GOAh in this paper, is dated March 2003, and was obtained directly from NIH/NLM/NCBI. GOAh contains 51,830 distinct gene-GO entries, including 11,221 distinct human genes and 3,448 unique GO terms. The second GO Annotation file, referred as GOAr, is dated August 2005, and was downloaded from Entrez Gene of NCBI (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene). It contains 86,348 distinct gene-GO entries, including 15,442 distinct human genes and 4,610 distinct GO terms. A detailed comparison between the numbers of distinct GO terms and their annotated genes found in these two files is summarized in Table 2.
In order to evaluate the accuracy of the predictions, we conducted three experiments:
To evaluate the ITSS approach in comparison to other machine learning algorithms that do not consider semantic distances between concepts, we compared the prediction results of the proposed ITSS algorithm to those of the Decision Tree and Bayesian Network studies performed by Roth and colleagues (King, et al., 2003). To obtain fair comparisons, we repeated the experimental methods of King et al. as precisely as possible. King et al. used SGD and FlyBase GOA files dated on January 22, 2002. As these original files were not available, we used SGD and FlyBase GOA files from 2005, and removed entries later than January 22, 2002 according to their PubMed IDs, to produce datasets of relatively similar size to those utilized by King et al. King et al. used a SGD file containing 6,403 genes and a FlyBase file containing 13,500 genes; we calculated datasets containing 6,099 and 11,142 genes, respectively.
We replicated the 10-fold cross-validation methods utilized by King et al. and also followed their procedures to get similar sets of GO terms. For this study, we used 165 GO terms for testing in both the SGD and FlyBase datasets, much smaller than the entire SGD (2,261) or FlyBase (3,859) datasets, because, in accordance with the criteria for selecting GO terms employed by King et al., GO terms associated with less than ten genes were removed. The theoretical component of this experiment is a 10-fold cross-validation, in which known annotations are used as the gold standard. The genes in each GOA file were randomly partitioned into 10 sets of approximately equal size. Each of these 10 sets of genes will be used as testing set, in turn, and the aggregate of the remaining nine sets as the training set. The task was to predict if a gene in the testing set is associated with a certain GO term. The gold standard for these validations was the corresponding GO Annotation files, as utilized in the experiments of King et al. The knowledge of whether the gene is associated with the target term is hidden from the ITSS algorithm, as its primary function is to provide a binary prediction for the term.
If a prediction was positive, i.e. the gene was indeed assigned to the term in the GOA file, then this prediction was considered to be a True Positive (TP). If a prediction was positive but the gene was not assigned to the term in the GOA file, then this prediction was counted as a False Positive (FP). If a prediction was negative but the gene was not assigned to the term in the GOA file, then this prediction was counted as a True Negative (TN). If a prediction was negative and the gene was assigned to the term in the GOA file, then this prediction was considered to be a False Negative (FN). The True-Positive Rate (TPR), equal to TP/(TP+FN), and the False-Positive Rate (FPR), equal to FP/(FP+TN), were then calculated. The prediction results were represented as Receiver Operating Characteristic (ROC) Curves, in which the FPR is plotted on the x-axis and the TPR on the y-axis (Metz, 1978). The cutoff values of the ITSS algorithm were changed to generate the different data points on the curves (see Methods section). Before the validations were executed, we optimized the two parameters of the ITSS algorithm: k value and threshold t. The detailed optimization processes are described in the Methods section.
In order to demonstrate that the predictions are effective, we employed two types of random controls. The first control, Random Algorithm (RA), assigned a random decimal number to the value of calculated semantic similarity used in the ITSS. In the second control, Random Data (RD), the ITSS algorithm was applied to a fictitious GOAr annotation file constructed by randomly shuffling the relationships between genes and their GO annotations (permutation resampling) (Westfall and Young, 1993) to purposely randomize annotation patterns while preserving the total number of occurrences of each annotated GO terms. RA and RD provide an estimate of the maximum number of false positive predictions which can be useful to understand the meaning of the observed uncorroborated predictions in the full study.
After performing the optimization techniques described in the Methods section, we obtained the best values for the parameters of the ITSS algorithm (k = 4, t = 1). The comparisons of ROC curves are shown in Supplement 1. Because in biological predictions a low FPR is usually more desirable than a high TPR, we compared, using the ROC area comparison method of (Hanley and McNeil, 1983), only the areas of those ROC curves where the FPR was below 0.001. In the SGD dataset, as shown in Supplement 1a, the decision trees utilized by King et al. performed slightly better than ITSS algorithm, but the difference was not statistically significant (z=1.538, p=0.124), and the results of the proposed ITSS algorithm were a little better than those associated with the Bayesian Networks utilized by King et al., but again, the difference was not statistically significant (z=0.439, p=0.67). In the FlyBase dataset, as shown in Supplement 1b, the ITSS algorithm produced significantly better predictions than either the decision trees (z=2.34, p=0.019) or Bayesian Networks (z=4.9, p=8.4×10−7) of King et al. Due to the nature of the ITSS method, it cannot operate at a recall of exactly 50%. In the SGD dataset, when recall is 56.7%, the precision is 94.6%, and in the FlyBase dataset, when the recall is 52.9%, the precision of the ITSS algorithm is 94.3%. These are very comparable to the results of the King et al. studies, which, when recall is 50%, show precisions of 97.7% and 93.7%, respectively when using decision trees and Bayesian Networks, respectively, over the SGD dataset. Additionally, for the FlyBase dataset, King et al. present precisions of 87.5% and 79.7% when using decision trees and Bayesian Networks, respectively, with recall equal to 50%.
The experiments of King et al. were conducted using a relatively small dataset comprised of only 170 GO terms and 6,403 genes. The corpus of GO term used with the SGD dataset was reduced using the selection criteria of King et al., the most significant being that the selected GO terms must have at least ten annotations. To investigate the effects of this GO term selection process on prediction, we applied the ITSS algorithm to the overall SGD dataset comprised of 2,261 distinct GO terms and 6,099 genes. We stratified the accuracy of the calculated predictions according to the number of genes associated with each GO term (see Methods section for details), and found that the ITSS algorithm performed well (above 0.6 precision and 0.5 recall) for those GO terms that were associated with three or more genes. Therefore, in both SGD and FlyBase datasets, we also performed 10-fold cross-validations incorporating all GO terms with at least three associated genes. These results are summarized in Table 3. The total number of TP predictions resulting from these experiments was over three times larger than those presented by King et al. Furthermore, when the recall was approximately 50%, the precision of the ITSS algorithm was also about 50%.
For the Homo sapiens dataset, we initially conducted 10-fold cross-validations using both the GOAr and GOAh files. After removing those GO terms marked as “obsolete” and the three ambiguous terms “biological process unknown”, “molecular function unknown”, and “cellular component unknown”, we further limited our dataset to include only those GO terms that had at least three associated genes. As a result, we obtained 2,072 distinct GO terms from the GOAr file,and 1,390 dsitinct GO terms from the GOAh file. Because at least one annotation is necessary as the clue for making the prediction and one as the target GO term, we also limited the datasets used in these validation experiments to include only those genes with at least two associated GO terms. We obtained 13,509 and 11,076 such genes from the GOAr and GOAh files, respectively. Thus, we generated 27,990,648 (2,072 GO terms ×13,509 genes) predictions applying the 10-fold cross-validation methods over the GOAr dataset, among which 77,602 positively corresponded to the gold standard. We also derived 15,395,640 (1,390 GO terms × 11,076 genes) predictions based on the GOAh dataset, among which 44,020 predictions were positive according to the gold standard.
Figure 3 shows the predictions resulting from the application of the ITSS algorithm to the GOAr file during the 10-fold cross-validation experiment as TP-FP curves with a variable parameter t (threshold). The predictions resulting from the 10-fold cross-validation utilizing the GOAh file can be found in Supplement 2. In general, the results from GOAh and GOAr are very similar. In this evaluation, when applying the optimization methods to the ITSS algorithm, as described in Methods section, we obtained the best predictions when k = 4 and t = 1. The impact of parameter t is illustrated in Figure 4. As shown in Figure 3, the ITSS algorithm provides significantly better predictions over the GOAr dataset than either of the two controls (z >217, p<2.2×10−16 when compared to RA, and z>216, p<2.2×10−16 when compared to RD, see Methods, subsection “Statistical Analysis”). Details of the statistical analysis method can be found in Methods section. The maximum precision was 90% when the recall was 36%, and the maximum recall was 74% when the precision was 45%.
To evaluate the ITSS algorithm in situations that are more similar to real–life, in the second evaluation for Homo sapiens dataset we predicted new annotations using the older GO association file GOAh (2003) and then validated the newly predicted annotations using the newer association file GOAr (2005). Similar to the previously described 10-fold cross-validation methods, we excluded from the GOAh file those GO terms marked as “obsolete” and the three ambiguous terms “biological process unknown”, “molecular function unknown”, and “cellular component unknown”. We further limited our testing dataset to include only those GO terms that had at least three associated genes. Ultimately, we obtained 9,589 genes and 1,377 GO terms from the GOAh file. The entire network of annotations from GOAh was used to perform predictions, and the entire network of GOAr was used as a gold standard for this evaluation.
In contrast to the previously described 10-fold cross-validation based on known annotations, this historical validation focused on predicting annotations that were unknown in 2003, but became part of the GO annotation databases between 2003 and 2005. The gold standard annotations are from the more recent Annotation file GOAr. The ROC curve in the historical rollback validation is shown in Figure 5 and the impact of parameter t is illustrated in Figure 6.
By comparing Figures 4 and and6,6, we can see that the impacts of changing parameter t for the two types of validations are in opposite directions. In the 10-fold cross-validation, when the parameter t was decreased from 1 to 0.7, indicating that more concepts with lower semantic similarity are considered when calculating a similarity score between two genes, the predictions decreased as well, reaching minimum when t = 0. On the contrary, in the historical rollback validation, when t is decreased from 1 to 0.7, the F value of the prediction results increased by 12.7%, reaching a peak when t = 0.7 and decrease slightly when t > 0.7. The implication of this phenomenon will be discussed further in Discussion section. We have also analyzed a smaller historical roll back where only 10% of annotations were removed according to their reverse chronology and obtained similar accuracy scores, which were significantly lower than the cross-over evaluation where 10% of the results were removed randomly (results not shown).
To further validate the effectiveness of the ITSS method, an expert manually examined a sample of 100 positive predictions from GOAh that were randomly selected from a corpus of the 2,704 most plausible positive predictions obtained by using the best parameters for the ITSS algorithm: k = 4, t = 0.7, and a cutoff equal to 3. This set of 2,704 positive predictions can be found at http://phenos.bsd.uchicago.edu/mphenogo/prediction_2003_2704.txt. The expert was a senior postdoctoral molecular biology research scientist with more than ten years of laboratory experience.
A summary of the manual assessment results is provided in Table 4. Of the 100 assessed predictions, 51 were considered correct and validated in the scientific literature according to the expert. 49 of the examined predictions were uncorroborated, but not necessarily wrong. Of the corroborated predictions, 17 were found directly in the GOAr file, and 19 others were found to be a parent of the GO concept associated to the gene in the GOAr file. For example, the gene MTERF was predicted to be associated with DNA binding (GO:0003677), which is the direct parent of double-stranded DNA binding (GO:0003690) in the GOAr file, and was judged to be correct by the expert. Thus, accepting direct parents to be correct predictions, as they are related on a high level, improved the measurement of the precision significantly. An additional six predictions could have been determined to be correct by extending the gold standard to include all ancestors of the concepts in GOAr. None of the uncorroborated predictions would have been erroneously assigned a true positive value, as judged by the expert, if all ancestors of the concepts in GOAr file were to be accepted as the gold standard. However, by extending the gold standard to include all descendents of GO terms found in the GOAr file, 20 uncorroborated facts and 8 additional corroborated ones would be generated. There was one prediction (gene TNFSF15 associated with GO:0007267: cell-cell signaling) in which the GO term cell-cell signaling has no hierarchical relations with any terms in the GOAr file. This prediction was validated by the expert based on published literature (Haridas, et al., 1999). Therefore, the final number of correct predictions, as validated by the expert, was 51, yielding a precision of 51%. (95% confidence interval: 43%–58%, n=100). Additional details and bibliographic references can be found at http://phenos.bsd.uchicago.edu/mphenogo/resultRandom100.txt. We also applied the ITSS prediction algorithm to the GOAr file, and generated 97,732 new positive predictions (available at http://phenos.bsd.uchicago.edu/mphenogo/prediction_result_2005.txt).
The comparisons with the previous studies of King et al. (2003) show that the ITSS prediction approach, which is based on k-nearest neighbor and semantic similarity theories, has a comparable or better prediction capability than the best implementations of decision trees or Bayesian Networks when applied to similar datasets. One important difference of the ITSS prediction approach as compared to the alternatives is that it can provide predictions for sparsely annotated gene functions and processes. Although, when the ITSS algorithm was applied to the sparsely annotated GO terms, the precision of the resulting predictions dropped from over 90% to approximately 50% for a constant recall of about 50%. We demonstrated that the ITSS method is capable of generating predictions for previously untapped GO terms, ultimately providing a three-fold increase in the number of TP predictions. This functionality is particularly important because GO terms with less than ten gene annotations, which were excluded from previous prediction studies (King, et al., 2003), occupy over 80% of total number of annotated GO terms that represent biological processes, cellular components, and molecular functions. Any additional valid predictions are likely to yield a higher impact than for those GO terms that already have a significant number of annotations.
When compared to the two controls, the results of both the 10-fold cross-validation and historical validation in the Homo sapiens datasets confirm that the integration of k-nearest neighbor and semantic similarity methodologies with information theory is a valuable technique for predicting new gene annotations. This study provides the first example of the application of a prediction algorithm to Homo sapiens datasets. As expected from the validation experiments over yeast (SGD) and fly (FlyBase) data, the ITSS algorithm performs significantly better than either the random algorithm or random data controls. Depending on the criteria, the precision of the ITSS algorithm can be estimated between 17% and 51%. Moreover, this experiment over the Homo sapiens dataset illustrated that the task of predicting future gene annotations is more difficult than calculating contemporary ones.
Indeed, the largest difference between the prediction results of the 10-fold cross-validations and historical validation is that the 10-fold cross-validations overestimate the accuracy of the ITSS algorithm. However, the historical validation more accurately simulates predictions occurring in real-life. A manual assessment comparing the results of previous prediction methods over the GO Annotations to those resulting from the ITSS algorithm corroborate this point. For example, King et al. observed 38% precision for SGD and 44% precision for FlyBase in their manual revisions (King, et al., 2003). To investigate this discrepancy between the accuracy of the predictions resulting from the historical validation versus those of the 10-fold cross-validations, we compared the validation results of 10-fold cross-validations conducted on both the more recent GOAr and the older GOAh files. However, we did not find any obvious differences in the prediction results between these two files, indicating that the discrepancy observed in the historical validation was not caused by an intrinsic structural problem with the more recent dataset GOAr, which was used as a gold standard. A reasonable explanation for the higher accuracy observed in the 10-fold cross-over designs is that, in genomic research, functionally related genes are more likely to be discovered together. Therefore, the contemporary annotations in a GO Annotation file are more likely to be updated according to units of functionally related gene groups. The functionally related genes in one group often share common annotations, and thus, can be good predictors for each other. Due to its random splitting of testing and training sets, 10-fold cross-validations more likely predict a target gene that is within a specific functional group. The annotations of other genes in the same group will provide a good basis for the predictions of the annotation for the target gene. However, predicting new annotations using historical data is likely to be more difficult because those annotations that can be easily inferred may have already been added to the GOA files, and there exists fewer patterns for predicting new annotations. In this case, other indirect relations between concepts can be measured using semantic similarity techniques, and may provide more useful information for predicting new annotations. In addition, the benchmarks associated with the historical validations are minimal or incomplete estimations. Considering the short period of only two years between the creation of the GOAh and GOAr files, new annotations will be continuously added later. As scientific research and annotation tasks advance, the false positives found in historical validation may evolve into true positives. Therefore, the observed precision and recall rates may both increase with time. The current precision and recall results for the historical validation are essentially conservative minimum estimated values. Thus a combination of cross-validation and historical validation methods will provide a more comprehensive evaluation tool for prediction algorithms. As this is the first validation of its kind over the GO Annotations, it is still unknown if other machine learning approaches, when applied to the GO Annotations, will also experience lower prediction benchmarks in the historical validation than resulting from the 10-fold cross-validation. Since the nature of the prediction task changes based on the validation method, it is reasonable to assume that other machine learning approaches will experience similar discrepancies.
It is worth noting that the impact of applying semantic similarity metrics to these two types of validations is dichotomous. The extension of the threshold t to include non-identical concepts (e.g. t = 0.7) improves the historical validation results by up to 12.7%. This is not the case for cross-validation methods. Actually, the optimal value in 10-fold cross-validation is t = 1, the maximum possible value for t, which means that only two identical concepts are considered to be relevant and the implicit hierarchical knowledge contained in the ontologies are not utilized to infer concept relations. This indicates that the GO Annotation files contain patterns that are good enough for conducting validation studies based on known annotations. However, for validation studies based on historical data, which are closer to realistic conditions, the threshold t must be lowered to include more ontological knowledge in order to relate two different concepts. This demonstrates that superficial patterns based only on identical concepts are not enough for predicting new gene annotations in realistic setting, and the semantic relations between concepts organized in an ontology must be used. Since evaluations of other algorithms that use superficial patterns, which are subsequently validated using historical data have not been reported upon, we cannot perform an explicit comparison between the performance of ITSS and other algorithms in a historical validation. However, in light of the comparable prediction results to other machine learning algorithms in cross-validation, and positive impact of semantic similarity on the performance of the ITSS algorithm, it is quite likely that in situations closely resembling real-life this approach will outperform those based on superficial annotation patterns.
The manual assessment results show that the precision of the ITSS algorithm could be increased further since many predicted annotations are semantically compatible to true knowledge, and will be judged as correct. The expert judged some predictions to be correct based on the semantic knowledge about the predicted GO concepts. For instance, the ITSS method predicted that the MTERF gene has the function DNA binding (GO:0003677) but, in the more recent GO Annotation file, the gene was annotated with the term double-stranded DNA binding (GO:0003690). Therefore, the expert was able to determine the prediction to be correct based on semantic knowledge contained within the Gene Ontology, because DNA binding is an ancestor concept of double-stranded DNA binding. As a result of incorporating such semantic knowledge, the precision of the ITSS algorithm tripled from 17% when validated against the GOAr file to 51% when assessed by the expert. This indicates that with the use of optimal parameters, the ITSS prediction algorithm can obtain a precision of at least 51%, which is greater than the 44% for FlyBase and the 38% for SGD that were reported by King et al. (King, et al., 2003).
The results of the manual validation indicate that the ITSS method may help experts find annotation omissions, and keep the associated “computer executable knowledge” up-to-date. An example of missing annotations is when the corroborating evidence for a particular annotation exists in the literature before actually being added to the corresponding GOA file. Actually, this annotation lag is very common according to our results as illustrated by most of the evidence utilized in the manual assessment being dated before the date of GOAh (2003). For example, the evidence that the gene GP6 (glycoprotein VI) has a “receptor activity” was published in 2000 (Ezumi, et al., 2000). However, the annotation “receptor activity” for GP6 was not yet added to the GO Annotation files as of 2003 (GOAh), but appeared in the GOAr file dated 2005. Therefore, by applying the ITSS algorithm to the GOAh file, the association between “receptor activity” and GP6 was predicted as novel because in 2002 the fact was not annotated in the GOAh file, though the publication was otherwise available since 2000.
The first advantage of the proposed ITSS approach is that, when compared to other machine learning algorithms, the incorporation of semantic similarity and k-nearest neighbor algorithms makes better use of the knowledge embedded in GO, resulting in the generation of better predictions when applied to real-life predictions. Second, the ITSS algorithm provides significantly more predictions over a broader number of GO terms than previously evaluated methods with a reasonable reduction in precision. The third advantage is that no sequence, gene expression or other external information is required to make a prediction using the ITSS algorithm. This is valuable as additional information about a gene is not always available. Lastly, in contrast to other machine learning methods that provide a prediction with no understanding of the predictive process, the proposed similarity-based algorithms are straightforward and readily understandable to humans, and one can easily judge the reliability of a prediction by investigating similar genes. Therefore when a predicted annotation makes sense to biologists, they can decide whether or not to further pursue existing proof of the hypothesis and can also substantiate a hypothesis using the evidence associated with a prediction. There are also some obvious limitations of the ITSS approach, including its dependence on the known curated annotations of a gene in GOA. The method cannot predict new annotations if the gene does not have any known annotations, and in this case, other external information sources would have to be employed. In addition, when GOA contains only one similar gene (cutoff = 0) with the predicted function, the false positive rates are high but this is true for other machine learning approaches as well.
Because this work was conducted to demonstrate the feasibility of using the implicit semantic knowledge in GO as a valuable information source for predicting new gene functions, we employed a straightforward k-nearest neighbor algorithm that utilized semantic distances to prove the effectiveness of the proposed ITSS method. It is possible that a more advanced hybrid data mining approach that uses semantic distance, such as the Bayesian Networks, decision trees, neural networks, support vector machines etc., could better use the semantic knowledge of GO to obtain better predictions. Moreover, the combination of the semantic knowledge of GO with other information sources may also yield better predictions. Therefore, the future direction of our research includes investigating alternate prediction or data mining algorithms that utilize the semantic knowledge of GO and combining other information sources to aid in generating more accurate predictions.
In this study, we developed a high throughput computational approach capable of automatically predicting gene ontology annotations with higher overall accuracy than previous methods for a broader range of GO terms. A significant advantage of the ITSS prediction approach is that it can accurately provide predictions for sparsely annotated gene functions and processes, thus generating an order of magnitude more predictions in GO annotations than previous comparable methods. To our knowledge, this is the first study demonstrating the feasibility of using the similarity-based algorithm for GO terms annotations. In addition, we conducted in depth evaluations and demonstrated the higher level of difficulty involved in predicting future GO annotations in a dataset where most recent annotations were removed (historical roll-back method) as compared to the conventional 10 fold cross-over validation that deprives the dataset from contemporary annotations. This can be interpreted as an indication that, as observed by other researchers(Rzhetsky, et al., 2006), discoveries are communicated in temporal clusters in the literature and thus predicting a single gene annotation missing from an available cluster in a contemporary cross-over validation is easier than predicting every gene when the whole cluster of gene annotations is missing in a historical roll-back validation, The novel prediction method has been shown in a conservative roll-back validations to faithfully recapitulate known “future” biological knowledge artificially removed from the dataset, a situation more comparable to real-life. This method holds promise in facilitating a high throughput approach to generating hypotheses in genomic and biomedical research and it is likely to be applicable to other networks of annotations as well.
The authors would like to thank Donna Maglott for providing the historical GO annotation file, Xialu Li for conducting the expert evaluation, and Tara Borlawsky for her editorial assistance. This study is partially supported by the by the National Library of Medicine Grants 1K22 LM008308-02, R01 LM007659-04, and R01 LM008635-01, and by the National Center for the Multiscale Analysis of Genomic and Cellular Networks (MAGNet) Grant 1U54CA121852-01A.