The explosion in the number of functional genomic datasets generated with tools such as DNA microarrays has created a critical need for resources that facilitate the interpretation of large-scale biological data. SOURCE is a web-based database that brings together information from a broad range of resources, and provides it in a manner particularly useful for genome-scale analyses. SOURCE's GeneReports include aliases, chromosomal location, functional descriptions, Gene Ontology annotations, gene expression data, and links to external databases. We curate published microarray gene expression datasets and allow users to rapidly identify sets of co-regulated genes across a variety of tissues and a large number of conditions using a simple and intuitive interface. SOURCE provides content in both gene- and cDNA clone-centric pages, and thus simplifies analysis of datasets generated using cDNA microarrays. SOURCE is continuously updated and contains the most recent and accurate information available for human, mouse, and rat genes. By allowing dynamic linking to individual gene or clone reports, SOURCE facilitates browsing of large genomic datasets. Finally, SOURCE's batch interface allows rapid extraction of data for thousands of genes or clones at once and thus facilitates statistical analyses such as assessing the enrichment of functional attributes within clusters of genes. SOURCE is available at http://source.stanford.edu.
The recent emergence of high throughput structural and functional genomic technologies has led to the rapid growth of genome-scale datasets. The analysis of such datasets largely depends on rapid access to previously described features of the genes being studied. Today, diverse publicly available resources exist that catalog various attributes of genes, ranging from their mapped coordinates within the genome to the enzymatic function of the proteins they encode. These include Online Mendelian Inheritance in Man (OMIM) (1), SwissProt (2), LocusLink (3), UniGene (3), GenBank (4), PubMed (3), as well as many others. Although these resources are highly informative individually, the collection of available content would have more utility if provided in a unified and centralized context and indexed in a robust manner.
Accordingly, we have developed a publicly available, web-based resource called SOURCE (http://source.stanford.edu). Unifying data from a broad collection of resources, SOURCE is a database providing dynamic content including genomic map position, biological role, and gene expression data. Currently, this content is available for three organisms (Homo sapiens, Mus musculus, Rattus norvegicus), with a number of others slated for addition in the near future. We have designed SOURCE particularly for the analysis of microarray gene expression datasets and have thus emphasized the types of information that are most useful in analyzing and interpreting genome scale gene expression experiments.
SOURCE is structured as a set of relationships between two entities: GeneReports and CloneReports. As the name implies, a GeneReport page captures the collection of features attributable to a given gene and its products, where a gene is defined by a unique UniGene cluster. SOURCE contains GeneReports for both characterized and uncharacterized genes. GeneReports for named genes are titled with HUGO Gene Nomenclature Committee (HGNC) (5) approved conventions for naming genes, as represented within LocusLink, while GeneReports for uncharacterized genes are listed by their UniGene titles. Wherever available, each GeneReport will contain all or a subset of the following categories of data (Fig. 1):
In addition to these data, GeneReports also include representative mRNA accessions with direct links to their NCBI GenBank records. Furthermore, each GeneReport page allows formatting of boolean PubMed literature queries using user-defined search terms and all aliases for the given gene. This allows rapid identification of previously published work relevant to each user's interests.
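The query formatting described above can be illustrated with a minimal sketch. It assumes that aliases are joined with OR and then combined with the user's terms via AND; the function name `build_pubmed_query` and the example alias list are hypothetical and are not taken from the SOURCE code base.

```python
def build_pubmed_query(aliases, user_terms):
    """Combine gene aliases (joined by OR) with user search terms (joined by AND).

    Assumed convention: multi-word terms are wrapped in double quotes so that
    they are treated as phrases.
    """
    def quote(term):
        term = term.strip()
        return f'"{term}"' if " " in term else term

    alias_clause = " OR ".join(quote(a) for a in aliases if a.strip())
    term_clause = " AND ".join(quote(t) for t in user_terms if t.strip())

    if alias_clause and term_clause:
        return f"({alias_clause}) AND {term_clause}"
    return f"({alias_clause})" if alias_clause else term_clause


# Illustrative call with an assumed alias list for one gene.
print(build_pubmed_query(["TOP2A", "TOP2"], ["cell cycle"]))
```

Grouping the aliases in one parenthesized OR clause means a paper matches if it mentions the gene under any of its names, while every user term must still be present.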
SOURCE CloneReports capture data available for all human, mouse, and rat ESTs within dbEST (11) for which a physical clone has been annotated, regardless of association with a UniGene cluster. Each CloneReport contains a subset of the data from the dbEST record(s) of the cDNA clone, including the putative identity of its EST sequences, as well as links to the corresponding GeneReport and dbEST. When multiple EST sequences are available for a given clone, information for both 5′ and 3′ sequencing reads is displayed. Furthermore, CloneReports contain direct hyperlinks to BLAST searches of databases including the non-redundant nucleotide section of GenBank, dbEST, and high-throughput genome sequences.
Since many of the resources on which SOURCE is based (including UniGene, LocusLink, and SwissProt) are frequently updated, the SOURCE database is re-loaded on a weekly basis to ensure that it contains the most up-to-date information. An automated process checks for updates of the various outside databases, downloads these files, and populates database tables accordingly. In this fashion, we ensure that the connections between external databases that are made within SOURCE are as accurate as possible. This means that both the mapping of clones to genes and the functional attributes associated with those genes are dynamic and thus current.
Currently, SOURCE employs Oracle Server Enterprise Edition version 8.1.7 and runs on an eight processor Sun E4500 under SunOS 5.8. Most of SOURCE's analysis and display software was written in Perl. The table structure for SOURCE can be found at http://genome-www.stanford.edu/microarray/doc/external2.pdf.
An integral mission of SOURCE is to curate and consolidate gene expression data from microarray experiments in order to allow researchers easy and intuitive access to this rapidly growing body of information. While many authors of microarray datasets have made their raw data available on their own websites, accessing these one at a time is tedious and hinders rapid analysis. This is particularly important for researchers generating their own microarray datasets, for whom the examination of co-regulated genes under diverse conditions is critical to successful analyses. While several efforts exist for centrally archiving raw microarray data [e.g., Gene Expression Omnibus (12)], these databases do not re-analyze published data nor do they provide them in a format that is readily searchable at the single gene level. For SOURCE, only datasets for which raw data have been made publicly available are considered for inclusion and these are then curated and re-analyzed in order to ensure proper data processing and display. Currently, SOURCE contains 10 human and 2 mouse microarray datasets, generated using either cDNA or Affymetrix microarrays, and totaling greater than four million gene expression measurements.
Figure 2A shows the SOURCE display for the gene expression of DNA topoisomerase II alpha (TOP2A) across the cell cycle of HeLa cells (13). The measurements are displayed as a temporally ordered matrix of gene expression data where rows represent genes (i.e., unique cDNA elements) and columns represent experimental samples. Coloured pixels capture the magnitude of the response for any gene. Shades of red and green represent induction and repression, respectively.
An important component of SOURCE's gene expression interface is the ability to list the most highly correlated genes of a given gene through a simple click on that gene's expression ‘color bar.’ This allows rapid identification of co-regulated groups of genes and facilitates quick access to information that is crucial to the interpretation of new microarray experiments. Figure 2B shows TOP2A's 10 most correlated neighbours in a dataset of normal tissues and cell lines (14). As can be seen, TOP2A expression is highest in transformed cell lines, normal testis and fetal liver. Additionally, many of the neighbours are genes known to be involved in cell proliferation (e.g., CDC2, CCNB2, and MAD2L1), consistent with TOP2A's role in cell cycle progression.
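A neighbour search of this kind can be sketched as follows. The text does not state which similarity measure SOURCE uses, so the Pearson correlation here is an assumption, as are the function names and the toy expression profiles (which are not real measurements).

```python
import math


def pearson(x, y):
    """Pearson correlation over positions where both profiles have a value."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return None
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return None  # a flat profile has no defined correlation
    return sxy / math.sqrt(sxx * syy)


def top_neighbors(query, profiles, k=10):
    """Return the k genes whose profiles correlate best with the query gene."""
    scores = []
    for gene, profile in profiles.items():
        if gene == query:
            continue
        r = pearson(profiles[query], profile)
        if r is not None:
            scores.append((gene, r))
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:k]


# Toy log-ratio profiles across four samples (hypothetical values).
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [1.0, 1.0, 1.0, 1.0],
}
print(top_neighbors("geneA", profiles, k=2))
```

Missing measurements (`None`) are skipped pairwise, which matters for microarray data where individual spots are often flagged and excluded.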
SOURCE also displays in silico generated expression information calculated from EST abundance data. In the absence of useful systematic genome-scale expression data, the EST data provide an accessible source of information that identifies at least some of the sites where a gene is expressed. For example, SNAP25, a synaptic vesicle associated protein specific to neurons (15), is highly overrepresented in EST libraries stemming from central nervous system samples compared to all other tissues (Fig. 2C). Such information is often useful when examining microarray expression data of cellular mixtures, as is the case with tissue and tumor samples.
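One plausible way to turn EST counts into a tissue profile is sketched below. The text does not describe the calculation SOURCE performs, so the normalization (per-tissue EST frequency, rescaled to sum to one) is an assumption, and the counts are invented for illustration.

```python
def est_relative_abundance(gene_counts, library_totals):
    """Estimate a gene's relative expression per tissue from EST counts.

    gene_counts:    {tissue: number of ESTs assigned to the gene}
    library_totals: {tissue: total number of ESTs sequenced from that tissue}
    Returns {tissue: share of the gene's normalized EST frequency}.
    """
    freq = {}
    for tissue, total in library_totals.items():
        if total > 0:
            # Normalize by library size so deeply sequenced tissues do not dominate.
            freq[tissue] = gene_counts.get(tissue, 0) / total
    norm = sum(freq.values())
    if norm == 0:
        return {tissue: 0.0 for tissue in freq}
    return {tissue: f / norm for tissue, f in freq.items()}


# Hypothetical counts for a neuron-specific gene.
profile = est_relative_abundance(
    {"brain": 30, "liver": 1},
    {"brain": 1000, "liver": 1000, "testis": 500},
)
print(profile)
```

Dividing by library size first is the key design choice: raw EST counts mostly reflect how deeply each tissue has been sequenced rather than how strongly the gene is expressed there.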
SOURCE allows users to query individual genes as well as retrieve selected attributes for many genes in batch. When searching for individual genes, users can query the database via a gene's name (whether the official HGNC name or a historical alias), the LocusLink identifier, the current UniGene cluster identifier, the GenBank accession of a sequence associated with the gene through UniGene, or a cDNA clone identifier. The flexibility of this search interface is important, since users may have access to only a few of these attributes for the genes they are studying. In order to increase the likelihood of successful gene name searches, we have assembled the largest collection of gene aliases available on the web by combining synonym data from a large number of sources.
The capacity to access gene-level data through searches using clone identifiers is particularly practical for users of DNA microarrays, as most spotted array platforms employ cDNA clones, each of which may be represented by multiple ESTs. In this fashion, SOURCE can reveal potentially chimeric cDNA clones, which are associated with ESTs that map to multiple UniGene clusters or genes. Currently, no other publicly available database offers this search functionality for accessing both gene- and clone-level data.
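The chimeric-clone check described here amounts to grouping ESTs by clone and flagging clones whose ESTs fall into more than one UniGene cluster. The sketch below assumes a simple list of (clone, EST, cluster) tuples; the identifiers and the function name are hypothetical and do not reflect SOURCE's actual table structure.

```python
from collections import defaultdict


def find_chimeric_clones(est_records):
    """Flag clones whose ESTs map to more than one UniGene cluster.

    est_records: iterable of (clone_id, est_accession, cluster_id) tuples;
    cluster_id is None for ESTs that are not assigned to any cluster.
    Returns {clone_id: sorted list of distinct cluster ids}.
    """
    clusters_by_clone = defaultdict(set)
    for clone_id, _est_accession, cluster_id in est_records:
        if cluster_id is not None:
            clusters_by_clone[clone_id].add(cluster_id)
    return {
        clone: sorted(clusters)
        for clone, clusters in clusters_by_clone.items()
        if len(clusters) > 1
    }


# Hypothetical records: the 5' and 3' reads of clone_1 disagree.
records = [
    ("clone_1", "est_a", "Hs.1"),
    ("clone_1", "est_b", "Hs.2"),
    ("clone_2", "est_c", "Hs.3"),
    ("clone_2", "est_d", "Hs.3"),
    ("clone_3", "est_e", None),
]
print(find_chimeric_clones(records))
```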
SOURCE allows for dynamic linking to both GeneReports and CloneReports. This feature is particularly useful when browsing large data sets. For example, when visualizing datasets with TreeView (16), linking of the gene or clone names to SOURCE allows users to find detailed information about each gene or clone with just a click. Similarly, external websites, such as supplements to published functional genomic datasets (e.g., see http://genome-www.stanford.edu/hostresponse/) are made much more generally useful by linking of each gene or clone name to SOURCE.
One of the most important and unique features of SOURCE is the ability to simultaneously extract data for thousands of genes in batch, thus eliminating the need for laborious cross-referencing of data from external databases. This is particularly useful for functional genomic studies, where it is necessary to continually update information on the genes and clones being examined. For instance, researchers interested in the mapped position or subcellular localization of a list of genes can extract these attributes with ease, and perform statistical analyses such as assessing the enrichment of certain functional attributes within clusters of genes (17,18). Since the data in SOURCE are refreshed weekly, users can also use this utility to regularly update annotations associated with genes or cDNA clones of interest. Input can be via a text file uploaded to the server or by pasting the queries into a text box. Batch SOURCE can be searched by clone identifier, accession number, gene name, gene symbol, UniGene identifier, or LocusLink identifier. Retrieval options include gene name, aliases, LocusLink ID, chromosome location, subcellular localization, representative accessions (protein or mRNA) and Gene Ontology annotations.
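As an illustration of the enrichment analysis mentioned above, annotations retrieved in batch can be tested with a one-sided hypergeometric test. The choice of test is an assumption (the text does not name the statistic used in the cited studies), and the function name and numbers are illustrative only.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, cluster, annotated_in_cluster):
    """One-sided hypergeometric p-value for attribute enrichment in a cluster.

    population:           number of genes on the array
    annotated:            genes in the population carrying the attribute
    cluster:              number of genes in the cluster of interest
    annotated_in_cluster: cluster genes carrying the attribute
    Returns P(X >= annotated_in_cluster) when `cluster` genes are drawn
    without replacement from the population.
    """
    upper = min(annotated, cluster)
    total = comb(population, cluster)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, cluster - k)
        for k in range(annotated_in_cluster, upper + 1)
    )
    return tail / total


# Toy example: 3 of 10 genes carry an annotation, and all 3 fall in a 3-gene cluster.
print(hypergeom_enrichment_p(10, 3, 3, 3))
```

When many attributes are tested across many clusters, the resulting p-values would normally need a multiple-testing correction, which this sketch omits.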
Use of SOURCE has steadily grown over the past two years. Today, thousands of researchers query the system on a daily basis, totaling over 100 000 hits per month. Individual GeneReports make up the majority of accesses, with the gene expression browser and the batch retrieval utility being extremely popular as well. Reciprocal links now exist to and from a number of databases, including SwissProt, GeneCards, and the UCSC Genome Browser.
We plan to continue to add new features to SOURCE, including more gene expression data sets as they are published and other useful resources that we and others develop as the field of functional genomic analysis continues to advance. We are planning on transitioning from a purely UniGene-based mapping of clones to genes, to one based on a combination of UniGene and the genome scaffold. We are also planning on adding additional model organisms and allowing users to navigate orthologies through a simple interface. As genome-scale gene expression datasets continue to amass for these organisms, this will allow SOURCE users to rapidly identify groups of orthologs that are similarly regulated in diverse organisms. Furthermore, we are hoping to provide developers access to SOURCE through data integration tools such as BioMoby (http://www.biomoby.org/) in order to further enhance the ability of researchers to extract and manipulate data in batch. The need for central and publicly available resources which curate biological data will only continue to grow and we feel that SOURCE and resources like it will be critical in enabling biologists to efficiently analyze genome-scale datasets.
We wish to thank members of the Stanford Microarray Database and the Brown and Botstein laboratories for helpful discussions and advice. This work was supported by N.I.H. grant CA85129-04 (P.O.B. and D.B.) and National Institute of General Medical Sciences training grant GM07365 (A.A.A. and M.D.). P.O.B. is an associate investigator of the Howard Hughes Medical Institute.