Microarrays have been introduced as powerful tools able to screen a large number of genes in an efficient manner. The typical result of a microarray experiment is a number of gene expression profiles, which in turn are used to generate hypotheses and locate effects on many, perhaps unrelated, pathways. This is a typical hypothesis generating experiment. For this purpose, it is best to use comprehensive microarrays that represent as many genes of an organism as possible. Currently, such arrays include tens of thousands of genes. For example, the HGU133 (A+B) set from Affymetrix Inc. contains 44 928 probes that represent 42 676 unique sequences from the GenBank database, corresponding to 28 036 UniGene clusters.
Typically, after conducting a microarray experiment, independent of the platform and the analysis methods used, one selects a set of genes that are found to be differentially expressed. These lists of differentially regulated genes need to be translated into biological processes or molecular functions characterizing the underlying biological phenomenon. This poses a requirement to analyze the genes from a functional point of view. Typically, in order to analyze a set of genes and create their functional profiles, one needs to search the literature and the various online databases. For example, a typical analysis of a set of differentially regulated genes will involve searching NCBI UniGene (1) and LocusLink (3) databases for each of the genes in the list. This is an extremely tedious and error-prone process. Furthermore, carrying out these manual searches in a systematic manner and finding out a simple frequency of a given biological process among the differentially regulated genes may produce misleading results (4).
Onto-Express (OE) (4) is one of the annotation databases integrated in Onto-Tools. OE is a tool designed to mine the available functional annotation data and help the researcher find relevant biological processes (4). Many months of tedious and inexact manual searches are substituted by a few minutes of fully automated analysis. The result of this analysis is a functional profile of the condition studied. In the latest version, this functional profile is accompanied by the computation of significance values for each functional category. Such values allow the user to distinguish between significant biological processes and random events. OE's utility has been demonstrated by analyzing data from a recent breast cancer study.
The input to OE is a list of GenBank accession numbers, Affymetrix probe IDs or UniGene cluster IDs. A functional category can be assigned to a gene based on specific experimental evidence or by theoretical inference (e.g. similarity with a protein having a known function). OE shows explicitly how many genes in a category are supported by experimental evidence (labelled ‘experimented’) and how many are inferred (‘inferred’). Those genes for which this information is not available are labelled ‘non-recorded’. The results are provided in graphical form and emailed to the user on request. OE constructs a functional profile for each of the Gene Ontology (GO) categories: cellular component, biological process and molecular function as well as biochemical function and cellular role, as defined by Proteome (http://www.incyte.com/sequence/proteome). As biological processes can be regulated within a local chromosomal region (e.g. imprinting), an additional profile is constructed for the chromosome location.
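The profile described above can be pictured as a per-category tally split by evidence label. The following is only a minimal sketch of that idea, not OE's implementation: the function name `functional_profile`, the gene names and the toy annotation table are all hypothetical, and a missing evidence value is simply treated as ‘non-recorded’.

```python
from collections import defaultdict

EVIDENCE_LABELS = ("experimented", "inferred", "non-recorded")

def functional_profile(genes, annotations):
    """Count, per functional category, how many input genes carry each evidence label.

    `annotations` maps a gene to a list of (category, evidence) pairs.
    Any evidence value other than 'experimented' or 'inferred' is
    counted as 'non-recorded'.
    """
    profile = defaultdict(lambda: dict.fromkeys(EVIDENCE_LABELS, 0))
    for gene in genes:
        for category, evidence in annotations.get(gene, []):
            label = evidence if evidence in ("experimented", "inferred") else "non-recorded"
            profile[category][label] += 1
    return dict(profile)

# Hypothetical toy annotations (not real GO data).
toy_annotations = {
    "gene1": [("apoptosis", "experimented")],
    "gene2": [("apoptosis", "inferred"), ("cell cycle", "experimented")],
    "gene3": [("apoptosis", None)],
}

toy_profile = functional_profile(["gene1", "gene2", "gene3", "gene4"], toy_annotations)
for category, counts in sorted(toy_profile.items()):
    print(category, counts)
```

In this sketch a gene without any annotation (here `gene4`) contributes to no category at all.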
The probability model best suited to calculate the significance values would use a hypergeometric distribution (4). For a typical microarray experiment, in which the number of genes on the chip, N, is in the thousands and the number of selected genes, K, is small relative to N, the binomial approximates the hypergeometric well and, therefore, the hypergeometric was not implemented. The χ² test was also proposed for similar problems (6). Finally, Fisher's exact test is required when the sample size is small and the χ² test cannot be used. OE provides implementations of the χ² test, Fisher's exact test and the binomial test. The user can select between the binomial and the χ² test. If χ² is chosen, the program automatically calculates the expected values and uses Fisher's exact test when χ² becomes unreliable (expected values <5).
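To make the relationship between these tests concrete, the sketch below computes the upper-tail probability of seeing at least x genes from a category among the K selected genes, under both the hypergeometric and the binomial model, and applies the "expected values <5" rule to choose between χ² and Fisher's exact test. This is an illustration only and not OE's code: the symbol M (number of genes on the chip annotated with the category), the variable x, the function names and the numbers are assumptions introduced here.

```python
from math import comb

def hypergeom_tail(N, M, K, x):
    """P(X >= x) when K genes are drawn without replacement from N genes,
    M of which belong to the category of interest."""
    upper = min(K, M)
    hits = sum(comb(M, i) * comb(N - M, K - i) for i in range(x, upper + 1))
    return hits / comb(N, K)

def binomial_tail(N, M, K, x):
    """P(X >= x) for K independent draws, each hitting the category
    with probability M / N (the binomial approximation)."""
    p = M / N
    return sum(comb(K, i) * p**i * (1 - p) ** (K - i) for i in range(x, K + 1))

def min_expected_count(N, M, K):
    """Smallest expected cell count of the 2x2 table
    (selected / not selected) x (in category / not in category)."""
    return min(r * c / N for r in (K, N - K) for c in (M, N - M))

def choose_test(N, M, K):
    """Fall back to Fisher's exact test when any expected count is below 5."""
    return "fisher" if min_expected_count(N, M, K) < 5 else "chi2"

# Illustrative numbers (assumptions): a 10 000-gene chip, 100 selected genes,
# a category with 200 genes on the chip, 8 of which were selected.
N, M, K, x = 10_000, 200, 100, 8
p_hyper = hypergeom_tail(N, M, K, x)
p_binom = binomial_tail(N, M, K, x)
print(f"hypergeometric: {p_hyper:.3e}  binomial: {p_binom:.3e}")
print("test to use:", choose_test(N, M, K))
```

With K much smaller than N the two tail probabilities should be close, which is the reason the text gives for implementing only the binomial.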
Many microarray users embark upon ‘hypothesis generating experiments’ in which the goal is to find subsets of genes differentially regulated in a given condition. However, another major application of this type of data mining is in experiment design. An alternative to the ‘hypothesis generating experiment’ is the ‘hypothesis driven experiment’, in which one first constructs a hypothesis about the phenomenon under study and then performs directed experiments to test it. However, even specific hypotheses and a small number of pathways may still involve hundreds of genes. This is too many for RT–PCR, western blotting and other gene-specific techniques, so microarray technology remains the preferred approach.
Currently, no two arrays offer exactly the same set of genes. When a hypothesis of a certain mechanism does exist, we argue that one should use the array(s) that best represent the corresponding pathways. This can be accomplished by analyzing the list of genes on all existing arrays and providing information about the pathways and biological mechanisms covered by the genes on each array. If array A contains 10 000 genes but only 80 are related to a given pathway and array B contains only 400 genes but 200 of them are related to the pathway of interest, the experiment may provide more information if performed with array B instead of A. This can also translate into significant cost savings.
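The comparison of arrays A and B can be sketched as a set intersection between each array's gene list and the pathway of interest. The snippet below is a hedged illustration rather than a description of any Onto-Tools component: the gene identifiers are synthetic, the function name `rank_arrays` is hypothetical, and array A is assumed to carry 10 000 genes.

```python
def rank_arrays(arrays, pathway_genes):
    """Return (array name, pathway genes covered, fraction of the array
    devoted to the pathway), sorted by the number of pathway genes covered."""
    rows = []
    for name, genes in arrays.items():
        hits = len(genes & pathway_genes)
        rows.append((name, hits, hits / len(genes)))
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Synthetic identifiers: P* genes belong to the pathway, X* and Y* genes do not.
pathway = {f"P{i}" for i in range(250)}
arrays = {
    # Assumed: 10 000 genes in total, 80 of them in the pathway.
    "A": {f"P{i}" for i in range(80)} | {f"X{i}" for i in range(9_920)},
    # 400 genes in total, 200 of them in the pathway.
    "B": {f"P{i}" for i in range(200)} | {f"Y{i}" for i in range(200)},
}

ranking = rank_arrays(arrays, pathway)
for name, hits, fraction in ranking:
    print(f"array {name}: {hits} pathway genes ({fraction:.1%} of the array)")
```

Sorting by the absolute number of pathway genes is one possible choice; sorting by the fraction of the array devoted to the pathway would give the same order for this example.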
Many commercial microarray manufacturers have realized the need for such focused arrays and have started to offer many of them. Typically, a focused array includes a few hundred genes covering the biological mechanism(s) being studied. However, two microarrays produced by different companies are extremely unlikely to use the same set of genes. In consequence, various pathways will be represented to various degrees on different arrays even if the arrays are all designed to investigate the same biological mechanisms. This is an unavoidable functional bias. Such a bias will be associated with every array that includes less than the full genome of a given organism.
The Onto-Tools (OT) toolkit helps researchers assess the biological bias of various commercial arrays through its Onto-Compare (OC) tool. The Onto-Compare database is populated with data collected from several online databases, as well as the lists of genes (GenBank accession numbers) for each microarray as provided by their manufacturers. From the list of accession numbers, a list of unique UniGene cluster identifiers is prepared for each microarray, and then a list of LocusLink identifiers is created for each microarray from the list of UniGene cluster identifiers in the OC database. Each locus in the LocusLink database is annotated using ontologies from the Gene Ontology Consortium (http://www.geneontology.org) and ontologies from other researchers and companies. The Gene Ontology Consortium provides ontologies for biological processes, molecular functions and cellular components. The data from these databases and gene lists is parsed and entered into the Onto-Compare relational database. After creating a list of locus identifiers for each array, the list is used to generate the following profiles: biochemical functions, biological process, cellular role, cellular component and molecular function. The profiles for each microarray are stored in the database. The list of genes deposited on a microarray is static, but the annotations for those genes keep changing and are updated automatically, as more information becomes available.
In many cases, researchers prefer to print their own arrays. One reason for opting to print a custom array is that, given the complexity of biological research, one may feel that none of the commercially available microarrays represents the targeted pathways and biological processes to the extent needed. Other reasons may be the dramatically reduced price of an in-house solution versus commercial arrays and the ability to adapt the arrays to one's own experimental design and use of controls. In order to design a microarray that constitutes a powerful and effective interrogation tool, a researcher has to choose genes that are representative of key mechanisms, pathways and biological processes. At present, the choice of genes to include on a certain microarray is a very laborious process requiring a high level of expertise. Furthermore, this process is very time consuming, even for experts, since they have to consult many online databases as well as perform an extensive literature review in order to find the set of genes that are involved in specific biological processes of interest. Onto-Design (OD) is a tool developed to assist in this gene selection process.
The OD interface allows the user either to upload a set of functional categories of interest (such as biological processes) or to browse through a graphical representation of a tree representing the Gene Ontology hierarchy. Strictly speaking, the GO hierarchy is a directed acyclic graph (DAG), not a tree. The internal structure of the database represents GO correctly, but the interface is more conveniently presented as a tree. Categories linked through DAG links not contained in the tree are automatically traversed by the system in the appropriate way.
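The practical difference between a tree and a DAG is that a category can be reached through more than one parent, so a traversal has to remember what it has already visited. The sketch below shows one common way to collect every category below a chosen one; it is an assumption about how such a traversal could be done, not a description of OD's internals, and the category names are placeholders rather than real GO terms.

```python
def descendants(children, root):
    """Collect every category reachable from `root` in a DAG given as a
    parent -> list-of-children mapping, visiting shared children only once."""
    seen = set()
    stack = [root]
    while stack:
        term = stack.pop()
        for child in children.get(term, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

# Placeholder hierarchy: 'D' has two parents ('B' and 'C'), so this is a DAG, not a tree.
toy_children = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["E"],
}

below_a = descendants(toy_children, "A")
print(sorted(below_a))
```

A tree-shaped display would list `D` under both `B` and `C`, while the traversal above still counts it once.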
In the annotation world, the same piece of information can be stored and viewed differently across different databases. For instance, more than one Affymetrix probe identifier (ID) can refer to the same GenBank sequence (accession number) and more than one nucleotide sequence from GenBank can be grouped in a single UniGene cluster. The result of OE depends on whether the input list contains Affymetrix probe IDs, GenBank accession numbers or UniGene cluster IDs. To illustrate this, consider an input specified as a list of 10 Affymetrix probe IDs. Assume that the results show that four of the 10 probes are involved in biological process A and the remaining six probes are involved in biological process B. The frequency of process A will therefore be four and that of process B will be six. In order to interpret this, a researcher might need to use the data sheet provided with each Affymetrix array (or the NetAffx web site) to map these probe IDs to accession numbers. Suppose this reveals that the four probe IDs for process A correspond to only two different accession numbers and the six probe IDs for process B correspond to another two different accession numbers. Repeating the OE analysis using accession numbers will show that the frequency of both processes A and B is two. Furthermore, mapping the accession numbers to UniGene cluster IDs shows that all four accession numbers actually come from the same UniGene cluster. Repeating the OE analysis using cluster IDs will show the frequency of both A and B as one.
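The example above can be reproduced with a few lines of code that count distinct identifiers per process at each level. The probe, accession and cluster identifiers below are made up for illustration, and the function name `frequencies` is hypothetical; only the counts follow the example in the text.

```python
from collections import defaultdict

# Made-up identifiers: 10 probes, 4 accession numbers, 1 UniGene cluster.
probe_to_accession = {
    "p1": "acc1", "p2": "acc1", "p3": "acc2", "p4": "acc2",
    "p5": "acc3", "p6": "acc3", "p7": "acc3",
    "p8": "acc4", "p9": "acc4", "p10": "acc4",
}
accession_to_cluster = {"acc1": "u1", "acc2": "u1", "acc3": "u1", "acc4": "u1"}
probe_to_process = {f"p{i}": ("A" if i <= 4 else "B") for i in range(1, 11)}

def frequencies(probes, level):
    """Count distinct identifiers per biological process at the given level
    ('probe', 'accession' or 'cluster')."""
    ids_per_process = defaultdict(set)
    for probe in probes:
        accession = probe_to_accession[probe]
        identifier = {
            "probe": probe,
            "accession": accession,
            "cluster": accession_to_cluster[accession],
        }[level]
        ids_per_process[probe_to_process[probe]].add(identifier)
    return {process: len(ids) for process, ids in ids_per_process.items()}

probes = list(probe_to_accession)
for level in ("probe", "accession", "cluster"):
    print(level, frequencies(probes, level))
```

The same ten probes give frequencies of 4 and 6, then 2 and 2, then 1 and 1, depending only on the identifier type used for counting.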
This example illustrates that the user has to be aware of these relationships between the different forms of the data in order to interpret the results correctly. Furthermore, even if a user is aware of the relationships and knows how to convert them, most existing tools only allow conversions of individual genes. This makes translating hundreds of genes impractical. Onto-Translate (OT) is a tool that allows the user to easily perform such translations of entire sets of genes. A user can input a list of genes specified by Affymetrix probe IDs, GenBank accession numbers or UniGene cluster IDs, indicate the type of the list by clicking the appropriate radio button, and request the translation of the input list into either of the remaining two forms by selecting the appropriate radio button for the output list.
All tools in the Onto-Tools package use a consistent interface. Genes and functional categories can be prepared in advance and submitted to the tools as a text file with one entry per line. The results can generally be emailed back to the user.
As shown with the examples in the Results section, each of the Onto-Tools addresses a specific problem currently faced by microarray users. However, the ensemble of the Onto-Tools is more than the sum of its components. This has been achieved by seamlessly integrating the tools. For instance, it is possible to use a general purpose array such as the Affymetrix HGU133 in order to investigate a given condition by screening a large number of genes. The list of differentially regulated genes can be analyzed with OE in order to identify the functional categories that are relevant in the given condition. The user can inspect OE's results and select a smaller number of highly significant categories (see ref. 4 for a discussion of the significance values associated with the OE analysis). Based on these highly significant categories, the researcher might formulate a hypothesis about the underlying biological phenomenon. At this point the user can seamlessly switch to ‘Onto-Compare’ in order to find and compare existing commercial arrays that might be useful in the testing of this specific hypothesis. If none of the commercially available arrays covers the necessary pathways to a satisfactory degree, the user can then switch to Onto-Design to create their own custom array representing the chosen biological processes.
Results can also be seamlessly passed between Onto-Compare and Onto-Design. For example, suppose the user has compared all available commercial microarrays for apoptosis, was not satisfied with any of them and decided to create their own array. After designing a custom apoptosis array with Onto-Design, the user can switch back to Onto-Compare with a click and compare the newly designed array with any of the existing commercial arrays. The user interfaces of the tools as well as a possible navigational pathway through the various tools are shown in Figure 1. Future work will include an analysis at a specified level of the GO hierarchy.
Figure 1. Onto-Tools (clockwise, from top left): regulated genes are analyzed with Onto-Express to find significantly impacted biological processes. Specific hypotheses can be formulated based on such processes. Onto-Compare can be used to select those commercial ...