Utility
To perform an "in-silico" experiment, users can select arrays from the current database (see Table ) residing on the site. Arrays can be included in an "in-silico" experiment only if the user has been granted access to the data or the data are in the Open Access domain. Microarray data from different laboratories can be combined to form "in-silico" experiments. Users can also upload microarray data for further analyses using MIAMExpress; the supported platforms are Affymetrix, CodeLink (these arrays are now manufactured by Applied Microarrays, Inc., Tempe, AZ) and custom arrays (if chip definition files are provided). Once the user has selected arrays for the "in-silico" experiment, a series of quality control steps can be carried out. This should be done to ensure compatibility and overall quality of the arrays [see Additional file 2: pages 31–42 in PhenoGen user manual]. The data from arrays in an "in-silico" experiment can be normalized, filtered and statistically analyzed using several normalization and statistical procedures available on site. At numerous points in this process the user can download the data, raw or normalized, from experiments being performed on site, for use with other statistical packages of their choice.
As with some other databases, PhenoGen offers a range of options for microarray data normalization, filtering and statistical analyses, including corrections for multiple comparisons. Users can compare gene expression profiles in two groups using any one of the available options, or can use one-way or two-way ANOVA models to check for overall differences when comparing more than two groups. We plan to provide tools to carry out clustering (k-means and hierarchical) analyses of microarray data in the near future. Furthermore, once an "in-silico" experiment has been created, and the microarray data normalized, the user can search the database to determine the expression levels of any particular transcript(s). After choosing the correct (created) experiment, the user enters the probeset ID, or gene name or symbol (or any other annotation ID from the most popular genomic databases) [see Additional file 2: pages 60, 76–79 in PhenoGen user manual] and clicks "search", leading to display of expression data for the gene or genes in the chosen experiment. These data can be downloaded.
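The text above does not name the multiple-comparison procedures offered on the site. As an illustration of what such a correction does, the following minimal sketch implements one widely used choice, the Benjamini-Hochberg false discovery rate adjustment; the function name and the example p-values are assumptions made for this sketch and are not taken from PhenoGen.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical per-probeset p-values from a two-group comparison.
    raw = [0.01, 0.04, 0.03, 0.005]
    print(benjamini_hochberg(raw))
```

A probeset would then be retained if its adjusted p-value falls below the false discovery rate chosen by the user.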
In addition to the standard statistical tools for assessing differential gene expression between or among groups, users can analyze the correlation between gene expression levels and phenotype (behavioral, biochemical or physiological). Such correlation analyses have been used in recent studies, by others [13-15] and our group [16], to ascertain "candidate genes" for complex traits with either a panel of recombinant inbred (RI) strains of mice (or rats) or a panel of inbred strains of mice. Users can access data available on site for whole brain gene expression profiles from 20 inbred and 30 RI (BXD RI) strains of mice, or from 27 RI strains from the HXB/BXH panel of rats [17], and compare these gene expression data to phenotypic data obtained with the same strains and species of animals in the user's laboratory, or to other phenotypic data (in the literature) related to brain function. Users can upload phenotype data (as a ".txt" file) for evaluating the correlation of gene expression with the phenotype. An example of such an analysis, correlating whole brain gene expression profiles with the contextual fear conditioning response in a panel of BXD RI strains using tools available on PhenoGen, is given in Additional file 1.
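The correlation analysis described above can be sketched as follows, assuming strain means for both expression and phenotype and using the Pearson coefficient. The choice of coefficient, the data layout, the function names and the strain and probeset labels are all assumptions made for illustration; the text does not specify how PhenoGen computes the correlation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (NaN if undefined)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def correlate_with_phenotype(expression, phenotype, min_strains=3):
    """Correlate each probeset with the phenotype across shared strains.

    expression: {probeset: {strain: mean expression}}
    phenotype:  {strain: mean phenotype value}
    """
    results = {}
    for probeset, by_strain in expression.items():
        # Use only strains present in both data sets.
        shared = sorted(set(by_strain) & set(phenotype))
        if len(shared) < min_strains:
            continue
        xs = [by_strain[s] for s in shared]
        ys = [phenotype[s] for s in shared]
        results[probeset] = pearson(xs, ys)
    return results


if __name__ == "__main__":
    # Hypothetical strain means; labels are placeholders.
    phenotype = {"strain1": 1.0, "strain2": 2.0, "strain3": 3.0}
    expression = {
        "probe_a": {"strain1": 2.0, "strain2": 4.0, "strain3": 6.0},
        "probe_b": {"strain1": 3.0, "strain2": 2.0, "strain3": 1.0},
    }
    print(correlate_with_phenotype(expression, phenotype))
```

Matching on strain name before correlating reflects the requirement that the phenotypic data come from the same strains as the expression data.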
Another distinguishing feature of PhenoGen is its multiple offerings for further data analysis, once a list of differentially expressed or correlated genes is generated on site or uploaded de novo. Complex behavioral traits reflect variations in biochemistry, physiology, and anatomy that are determined by the action and interaction of several or many genes. We [16, 18] and others [19, 20] have indicated that the combined use of gene expression data and QTL (quantitative trait locus) analysis can provide a better understanding of the genetics of complex traits. The availability of techniques of genetic mapping and statistical analysis has allowed association of complex behavioral traits with genomic loci (QTL analysis). In short, QTLs are the genomic regions on the chromosomes that can explain a portion of the genetic variation within a given complex trait. Most complex traits are also significantly susceptible to environmental influences.
A premise of QTL analysis is that the genetic material that contributes to the variance in the trait of interest is located in the area of the genome defined by the QTL(s) for the trait. A number of different factors, such as polymorphism(s) in the coding or regulatory region(s) of gene(s) that result in a change in function and/or a change in expression (mRNA) of the gene(s), may contribute to a QTL. Therefore one can pass two kinds of gene lists through a QTL "filter": genes that are differentially expressed in the brains of animals which differ significantly in the manifestation of the trait, and genes whose expression levels correlate with the magnitude of the trait of interest across multiple strains of animals. In other words, one can ascertain the genomic location of differentially expressed or correlated genes, and determine whether the location of these genes falls into QTLs determined for the trait of interest. Localization of differentially expressed or correlated genes within a QTL for a trait of interest adds significant weight to the supposition that the gene located within the QTL is one contributing to the variance in that trait. The PhenoGen website allows users to access information for gene location in the genomes of mouse, rat and human, to access data (MGI) on QTLs for a number of traits, and to analyze whether the location of genes falls within relevant phenotypic QTLs.
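The QTL "filter" described above amounts to an interval-overlap test between gene locations and QTL regions. The following is a minimal sketch under the assumption that both are given as chromosome, start and end coordinates in base pairs; the `Region` type, the function names and all coordinates are hypothetical and do not describe the site's implementation.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # base pairs, inclusive
    end: int    # base pairs, inclusive


def overlaps(a, b):
    """True when two regions share a chromosome and at least one base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def qtl_filter(gene_locations, qtls):
    """Keep only the genes whose location falls within any of the QTLs."""
    return {
        gene: loc
        for gene, loc in gene_locations.items()
        if any(overlaps(loc, qtl) for qtl in qtls)
    }


if __name__ == "__main__":
    # Hypothetical QTL and gene coordinates.
    qtls = [Region("4", 100_000_000, 130_000_000)]
    genes = {
        "GeneA": Region("4", 110_000_000, 110_050_000),  # inside the QTL
        "GeneB": Region("4", 20_000_000, 20_030_000),    # same chromosome, outside
        "GeneC": Region("9", 110_000_000, 110_050_000),  # different chromosome
    }
    print(sorted(qtl_filter(genes, qtls)))
```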
A major caveat to considering only the genes that are physically located within behavioral or physiological QTLs as candidates for contributing to trait variance is that the expression levels of genes that reside outside the behavioral/physiologic QTLs may be regulated from within those QTLs (an example of trans-regulation). The regulatory factor(s) themselves would not have to be differentially expressed if a polymorphism resides in the target gene's expression regulatory region and affects the function of the regulatory factor. Thus, any gene whose ultimate expression level is dependent on a genetic factor (cis or trans) within a behavioral/physiologic QTL becomes a candidate for contributing variance to the trait of interest.
We and our colleagues have used genomic marker data and information we have gathered on brain gene expression in male BXD RI mice and in HXB RI rats to determine the QTLs for the expression levels of genes in the brain (e-QTLs). These e-QTL data, available on PhenoGen, allow the genomic site of control of expression to be ascertained for a multitude of genes, and allow determination of whether the genes are cis- or trans-regulated.
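One common way to make the cis/trans distinction operational is to compare the position of the e-QTL peak with the position of the gene it controls. The sketch below assumes that rule; the 5 Mb window, the function name and the coordinates are arbitrary illustrative choices, since the text does not state the criterion used for the PhenoGen e-QTL data.

```python
def classify_eqtl(gene_chrom, gene_pos, eqtl_chrom, eqtl_pos,
                  cis_window=5_000_000):
    """Label an e-QTL 'cis' when its peak lies near the gene it controls.

    cis_window is an assumed distance threshold in base pairs.
    """
    same_chrom = gene_chrom == eqtl_chrom
    if same_chrom and abs(gene_pos - eqtl_pos) <= cis_window:
        return "cis"
    return "trans"


if __name__ == "__main__":
    # Hypothetical positions in base pairs.
    print(classify_eqtl("1", 10_000_000, "1", 12_000_000))  # near the gene
    print(classify_eqtl("1", 10_000_000, "7", 10_000_000))  # other chromosome
```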
In essence, two groups of differentially expressed genes could form the list of candidates contributing to a quantitative trait of interest: genes that reside within behavioral/physiologic QTLs and have their expression regulated from within the same QTL (cis-regulated), and genes that reside outside of the behavioral/physiologic QTLs but whose expression is regulated from within a relevant behavioral/physiologic QTL (trans-regulated). It should be noted, however, that polymorphisms in the coding region of a gene can and do alter the function of a gene product, and can also contribute significantly to the trait of interest. Such polymorphisms, even when located in highly significant QTLs, would not be identified by an analysis which relies on the premise that differential expression of a gene contributes to trait variance.
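The candidate-selection logic summarized above can be sketched as follows: a differentially expressed gene is retained when the site controlling its expression (the e-QTL peak) lies within a phenotypic QTL, and it is labeled cis or trans according to whether the gene itself also lies within that QTL. The data layout, function names and coordinates are hypothetical and serve only to illustrate the reasoning.

```python
def in_interval(chrom, pos, qtl):
    """True when a (chromosome, position) point lies inside a QTL interval."""
    q_chrom, q_start, q_end = qtl
    return chrom == q_chrom and q_start <= pos <= q_end


def candidate_genes(genes, phenotype_qtls):
    """Select genes whose expression is controlled from within a QTL.

    genes: {name: {"gene": (chrom, pos), "eqtl": (chrom, pos)}}
    phenotype_qtls: list of (chrom, start, end) tuples
    """
    candidates = {}
    for name, info in genes.items():
        for qtl in phenotype_qtls:
            if not in_interval(*info["eqtl"], qtl):
                continue
            # Gene inside the same QTL: cis; gene elsewhere: trans.
            inside = in_interval(*info["gene"], qtl)
            candidates[name] = "cis" if inside else "trans"
            break
    return candidates


if __name__ == "__main__":
    qtls = [("4", 100, 200)]  # hypothetical phenotypic QTL
    genes = {
        "GeneA": {"gene": ("4", 150), "eqtl": ("4", 160)},
        "GeneB": {"gene": ("9", 50), "eqtl": ("4", 120)},
        "GeneC": {"gene": ("4", 150), "eqtl": ("2", 10)},
    }
    print(candidate_genes(genes, qtls))
```

In this sketch a gene located within the QTL but controlled from elsewhere (GeneC) is not retained, which follows the two categories listed in the paragraph. As the paragraph also notes, candidates arising from coding-region polymorphisms would be missed by any selection of this kind.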
In addition to the "QTL Query tools", the PhenoGen website offers a wide variety of tools to "interpret" a gene list derived on site or up-loaded by the user. Such a list can include a few or hundreds of differentially expressed genes derived from a typical microarray experiment. At PhenoGen, users have access to tools, including annotation (basic and advanced), promoter analysis (to understand transcriptional regulation) and literature searches (including "co-citation" searches) for the entries in a list of differentially expressed genes.
One of the ways to derive a "biological interpretation" of gene expression results is to analyze the biological annotations associated with the genes in a list of "candidates". i-Decoder, the underlying annotation tool used at PhenoGen, translates gene identifiers among many different nomenclatures, including gene symbols, RefSeq IDs, and probe names from both Affymetrix and CodeLink arrays, even when an uploaded gene list contains multiple types of (non-identical) gene identifiers. This is accomplished by maintaining a local database of equivalents between identifiers that are available from the following eight sources: Affymetrix, GE Healthcare (formerly Amersham Biosciences), Ensembl, FlyBase, MGI, NCBI, RGD, and SwissProt [see Additional file 2]. In "Basic Annotation" tables every entry in the gene list is linked to the respective annotation in the Entrez, MGI (or RGD), UniProt, and UCSC databases. A link is also provided to the in situ hybridization images available at the Allen Brain Atlas, which show regional distribution patterns of expression for genes in mouse brain. Another link is provided to the information at MGI about the availability of genetically modified animals (transgenics, null mutants, etc.) for the genes in a gene list. Entries in the list of genes are also linked to information about genetic variations (e.g., single nucleotide polymorphisms, insertions/deletions, etc.) associated with the gene. In "Advanced Annotation" tables, users can select which of the available annotation information they wish to be displayed for the gene list.
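The idea of translating a gene list that mixes identifier types can be sketched with a small table of equivalents, in which any identifier in a group serves as a key to the whole group. This is not i-Decoder's implementation; the function names, the nomenclature labels and every identifier below are placeholders invented for illustration.

```python
def build_index(equivalence_groups):
    """Map every known identifier to the group of equivalents it belongs to."""
    index = {}
    for group in equivalence_groups:
        for identifier in group.values():
            index[identifier] = group
    return index


def translate(gene_list, index, target):
    """Translate each entry to the target nomenclature (None if unknown)."""
    return [index.get(identifier, {}).get(target) for identifier in gene_list]


if __name__ == "__main__":
    # Placeholder identifiers, not real database entries.
    groups = [
        {"symbol": "GeneA", "refseq": "REFSEQ_A", "probe": "probe_A_at"},
        {"symbol": "GeneB", "refseq": "REFSEQ_B", "probe": "probe_B_at"},
    ]
    index = build_index(groups)
    mixed_list = ["probe_A_at", "REFSEQ_B", "not_in_table"]
    print(translate(mixed_list, index, "symbol"))
```

Because the lookup does not depend on the type of the input identifier, a list containing probe names, accession numbers and symbols can be translated in one pass.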
To understand the transcriptional regulation of differentially expressed genes, users can use either oPOSSUM or MEME on the PhenoGen site. oPOSSUM uses human-mouse orthologs in calculating the over-representation of conserved transcription factor binding sites [21]. MEME, on the other hand, explores the occurrences of previously uncharacterized transcriptional motifs [22]. Alternatively, the user can download the upstream sequences of genes of interest from the PhenoGen site and carry out similar analyses using other tools [23].
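To illustrate what "over-representation" of a binding site means, the sketch below computes a generic hypergeometric tail probability: the chance of seeing at least the observed number of genes carrying a site in the gene list, given how common the site is in a background set. This is a textbook calculation and is not claimed to be the scoring used by oPOSSUM, which is described in [21]; the function name and the counts are assumptions.

```python
from math import comb


def overrepresentation_p(background_total, background_with_site,
                         list_total, list_with_site):
    """P(at least list_with_site hits) under the hypergeometric model."""
    upper = min(list_total, background_with_site)
    # Count the ways to draw i genes with the site and the rest without it.
    favourable = sum(
        comb(background_with_site, i)
        * comb(background_total - background_with_site, list_total - i)
        for i in range(list_with_site, upper + 1)
    )
    return favourable / comb(background_total, list_total)


if __name__ == "__main__":
    # Hypothetical counts: 10 background genes, 5 of them carry the site;
    # the gene list has 5 genes and all 5 carry the site.
    print(overrepresentation_p(10, 5, 5, 5))
```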
The literature search option on PhenoGen is an automated literature search that can be tailored to particular area(s) of interest by selecting a set of query terms. The automated literature search looks for articles in PubMed that mention any of the genes, including synonyms, in the gene list generated on site or uploaded by the user, and one or more of the chosen query terms. The results of the search are organized by the user-defined categories and by gene name, and contain direct links to PubMed citations. Also included in the results of a search is a list of articles where two or more of the genes from the gene list are cited in the same article (co-citation results). This allows the user to easily identify established relationships between genes.
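The co-citation step described above can be sketched as an inversion of the search results: from a mapping of each gene to the articles that mention it, build a mapping of each article to the genes it mentions, and keep the articles that mention two or more. The function name and the article labels are placeholders, and how the site retrieves the PubMed hits is not specified here.

```python
from collections import defaultdict


def co_citations(hits_by_gene, min_genes=2):
    """Return {article: sorted gene names} for articles citing >= min_genes."""
    genes_by_article = defaultdict(set)
    for gene, articles in hits_by_gene.items():
        for article in articles:
            genes_by_article[article].add(gene)
    return {
        article: sorted(genes)
        for article, genes in genes_by_article.items()
        if len(genes) >= min_genes
    }


if __name__ == "__main__":
    # Placeholder article labels standing in for PubMed citations.
    hits = {
        "GeneA": {"article_1", "article_2"},
        "GeneB": {"article_2", "article_3"},
        "GeneC": {"article_2"},
    }
    print(co_citations(hits))
```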