Prior to the “genomic era,” when the acquisition of DNA sequence involved significant labor and expense, the sequencing of genes was strongly linked to the experimental characterization of their products. Sequencing at that time directly resulted from the need to understand an experimentally determined phenotype or biochemical activity. Now that DNA sequencing has become orders of magnitude faster and less expensive, focus has shifted to sequencing entire genomes. Since biochemistry and genetics have not, by and large, enjoyed the same improvement of scale, public sequence repositories now predominantly contain putative protein sequences for which there is no direct experimental evidence of function. Computational approaches attempt to leverage evidence associated with the ever-smaller fraction of experimentally analyzed proteins to predict function for these putative proteins. Maximizing our understanding of function over the universe of proteins in toto requires not only robust computational methods of inference but also a judicious allocation of experimental resources, focusing on proteins whose experimental characterization will maximize the number and accuracy of follow-on predictions.
COMBREX (COMputational BRidges to EXperiments, http://combrex.bu.edu) is an NIH-funded enterprise that has brought computational and experimental biologists together, with the goal of greatly improving our overall understanding of microbial protein function. Since its inception, it has made significant progress toward the following goals: identifying the minority of proteins that have already been experimentally characterized, serving as a public repository of novel protein function predictions made by diverse methods, producing a clear chain of evidence from experiment to prediction, identifying (“recommending”) those functional predictions whose verification will contribute most to our overall understanding of protein function, and actually funding the experiments to test function. The recommendation system is a proof of concept based on active learning principles. For a given protein, it considers criteria such as the phylogenetic distribution of its protein family, the biological and clinical phenotypes associated with it, the availability of protein structure data, and its sequence distance from experimentally characterized proteins or from the other proteins in its family.
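To make the idea of a recommendation concrete, the following is a minimal illustrative sketch of how criteria of this kind could be combined into a single ranking score. It is not the COMBREX scoring scheme: the `Candidate` fields, the weights, the `max_phyla` cap, and the choice to favor proteins that lie close to an experimentally characterized protein are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    """Hypothetical summary of one uncharacterized protein."""
    protein_id: str
    family_phyla: int                 # phyla in which the protein family occurs
    has_phenotype: bool               # biological or clinical phenotype known
    has_structure: bool               # protein structure data available
    distance_to_characterized: float  # normalized to [0, 1]; 0 = identical

# Illustrative weights (assumed, not taken from COMBREX); they sum to 1.
WEIGHTS = {"breadth": 0.4, "phenotype": 0.2, "structure": 0.1, "closeness": 0.3}

def score(c: Candidate, max_phyla: int = 30) -> float:
    """Weighted sum of the four criteria, each scaled to [0, 1]."""
    breadth = min(c.family_phyla, max_phyla) / max_phyla
    # Assumption: a smaller distance to a characterized protein scores higher.
    closeness = 1.0 - c.distance_to_characterized
    return (WEIGHTS["breadth"] * breadth
            + WEIGHTS["phenotype"] * float(c.has_phenotype)
            + WEIGHTS["structure"] * float(c.has_structure)
            + WEIGHTS["closeness"] * closeness)

def rank(candidates: List[Candidate]) -> List[Candidate]:
    """Return candidates ordered from highest to lowest score."""
    return sorted(candidates, key=score, reverse=True)
```

In this sketch, a protein from a widely distributed family with a known phenotype and an available structure would rank ahead of an isolated protein with no supporting data. An active learning approach would typically go further and estimate how much each experiment is expected to improve follow-on predictions.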
COMBREX comprises several interrelated efforts. First, the project is building a community of researchers (the COMBREX Community) committed to achieving the goals above. Second, the project maintains a web-accessible database (the COMBREX Database) of known and predicted functions for microbial proteins. The database's search features enable biologists to identify predictions whose experimental verification is particularly important. Finally, the project issues small monetary awards (COMBREX grants) to biologists to fund the experimental testing of such predictions. In this paper, we provide a brief review of COMBREX, focusing on its overall design, its computational resources, and the experimental results from the first phase of the project.