An abundance of high-throughput laboratory techniques and computational methods has led to a deluge of human genomic and proteomic data (Bieri et al. 2007; Crosby et al. 2007; Eppig et al. 2007; Nash et al. 2007). Even when an individual researcher can assemble all available evidence, they are left with the task of weighing this evidence to infer likely functions for genes of interest.
Fewer than one-third of human genes have a Gene Ontology (GO) annotation based on evidence derived from specific study of that gene (as opposed to prediction; see ), providing little guidance to researchers wanting to investigate sparsely annotated genes. Computational integration of diverse evidence can help in assigning function, and ideally these inferences can reflect the shades of gray in our current knowledge, as opposed to the “black or white” annotation that is most appropriate for archival annotation.
Distinct Ensembl Gene IDs grouped into broad classes depending on status of association with at least one GO term.
Here we integrated an extensive set of Homo sapiens data and inferred quantitative associations between 21,341 human genes (12,925 of which had no existing associations on the basis of direct, i.e., nonpredicted, evidence) and each of 4333 GO terms. Our models exploit both gene features and gene-gene relationships by using both guilt-by-profiling (GBP) and guilt-by-association (GBA) approaches to function prediction (Taşan et al. 2008; Tian et al. 2008). We provide estimates of our models’ accuracy, including a prospective evaluation in which we consider annotations that were made after our training data set was assembled. Literature-based follow-up investigations are performed for a sample of high-confidence novel predictions. In the course of making these predictions, we constructed multiple functional linkage networks (FLNs, in which an edge between two genes indicates some level of shared function; Lee et al. 2004), capturing different categories of biological relationships. We find that FLNs are independently useful, which we illustrate here by identifying candidate glioma-related genes given only “seed” glioma-related genes identified from systematic unbiased genome-wide association (GWA) studies. All gene/term prediction scores from this project, as well as our FLNs, are made freely available to the public via a web-accessible resource (Beaver et al. 2010), which has been adapted here to host quantitative function annotations for H. sapiens.
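
To make the guilt-by-association idea concrete, the following minimal sketch ranks candidate genes in a toy FLN by how strongly they are linked to a set of seed genes. It is an illustration under stated assumptions, not the scoring scheme used in this work: the gene labels (`G1` to `G5`), the edge weights, the helper names `build_fln` and `gba_scores`, and the scoring rule (the fraction of a gene's total edge weight that connects to seeds) are all hypothetical.

```python
from collections import defaultdict


def build_fln(edges):
    """Build an undirected weighted network from (gene_a, gene_b, weight) triples."""
    fln = defaultdict(dict)
    for a, b, w in edges:
        fln[a][b] = w
        fln[b][a] = w
    return fln


def gba_scores(fln, seeds):
    """Score each non-seed gene by the fraction of its edge weight that links to seeds.

    This weighted-neighbor fraction is an assumed, simplified scoring rule.
    """
    seeds = set(seeds)
    scores = {}
    for gene, neighbors in fln.items():
        if gene in seeds:
            continue  # seeds are the query, not candidates
        total = sum(neighbors.values())
        to_seeds = sum(w for n, w in neighbors.items() if n in seeds)
        scores[gene] = to_seeds / total if total > 0 else 0.0
    return scores


# Hypothetical toy network; labels and weights are placeholders, not real data.
edges = [
    ("G1", "G2", 0.9),
    ("G1", "G3", 0.8),
    ("G2", "G3", 0.7),
    ("G3", "G4", 0.2),
    ("G4", "G5", 0.6),
]

fln = build_fln(edges)
scores = gba_scores(fln, seeds=["G1", "G2"])

# Rank candidates from most to least seed-connected.
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
for gene, score in ranked:
    print(f"{gene}\t{score:.2f}")
```

In this toy network, `G3` ranks first because most of its edge weight connects to the two seeds, whereas `G4` and `G5` have no direct seed links and score zero. A rule based only on direct neighbors ignores indirect paths through the network, so it should be read as the simplest possible instance of GBA.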