In the past decade, many technology platforms have been developed that allow researchers to generate data on a genome scale. For example, profiling technologies have been developed for highly parallel measurements of gene expression, copy number, genotype, and epigenetic state. The data derived from these high-throughput approaches can then be used to generate new hypotheses or inferences of gene function. In contrast to experiments focusing on a specific gene or gene family, these genome-scale experiments typically result in the identification of a list of candidate genes that are relatively unfamiliar to any single researcher. Hit lists identified by these methods can often span many protein classes and signaling pathways. In many cases, these genes may also have little or no previous functional characterization.
Researchers are then faced with the daunting task of prioritizing these candidate genes for detailed functional and mechanistic studies. Dozens of gene annotation resources and model organism databases serve prominent roles in the genetics and genomics communities. Take, for example, the instance where a researcher has identified hundreds or even thousands of differentially expressed genes between a cancer sample and a matched control. In prioritizing this gene list, many researchers would commonly search Entrez Gene [1] and Ensembl [2] as a first stop for many descriptions of critical gene annotation information, including primary sequence data, genome position, associated Gene Ontology terms, gene structure, and genetic variation. Other researchers may then consult the Mouse Genome Database (MGD) [3] and Rat Genome Database (RGD) [4] for annotation focused on these model organisms, including knockout phenotypes and quantitative trait loci. Molecular and cellular biologists may then visit the STRING database for protein interaction data [5]. Other researchers may query reference Gene Atlas expression data using the SymAtlas web site [6]. In addition, there are a wide variety of gene annotation sites targeting more specific communities, including a database describing the targets of the transcription factor CREB [7], the Allen Brain Atlas showing high-resolution expression information by in situ hybridization in the mouse brain [8], and the TargetScan database for microRNA target prediction [9]. Hundreds of such online resources for mammalian gene annotation currently exist [10, 11].
Although the wide breadth of available resources is clearly a benefit to the community, there is no single resource that completely describes everything that a researcher might want to know about a gene's function. Each gene annotation resource presents a particular slice of the available gene annotation, generally corresponding to the developers' view of what their users are interested in. Consequently, many researchers (and in particular researchers who are investigating candidate genes from genome-wide analyses) end up visiting many different sites for each gene of interest in order to get as complete a picture as possible of gene function.
This system is highly inefficient and cumbersome for end users. User interfaces vary dramatically, and researchers must learn and remember how to navigate each site. Each site often accepts a different set of gene identifiers (Entrez Gene, Ensembl, RefSeq, UniGene, and so on), making it difficult for users to find their gene of interest. This problem is even more complicated in cases where the official HUGO Gene Nomenclature Committee (HGNC) gene symbol is not the most commonly used symbol in the literature (for example, TP53 and P53). Finally, new online resources are continually being developed, and staying abreast of these tools and evaluating their utility is a time-consuming and recurring task.
Moreover, this system is highly inefficient for web site developers. For example, every gene portal needs to implement at some level the basic functionality for searching for genes (for example, by symbol, identifier, location, sequence). Every gene portal also must make some effort at resolving synonyms that have been historically used in the literature. And every gene portal must also implement a mechanism for data updates from primary sources. Overall, a relatively large percentage of development effort duplicates existing but essential functionality that is common to all gene portals, and a relatively small percentage of effort is devoted to the innovative data and features of any specific gene portal.
Here, we introduce BioGPS, a gene annotation portal based on a loose federation of existing genetic and genomic resources. BioGPS allows users to easily explore the landscape of gene annotation resources for one or more genes of interest, and it currently focuses on annotation for human, mouse, and rat genes. BioGPS emphasizes two key design features. First, it is based on a simple, unstructured plugin interface that allows for easy community extensibility. Second, it implements a powerful user interface that enables precise user customizability. Together, these two design features enable BioGPS to harness community intelligence toward the goal of efficiently organizing and querying online gene annotation resources.