In the post-genomic era, biologists face a flood of information, derived mainly from microarray experiments. This wealth of information comes with a major difficulty: the biologically significant findings are often buried in irrelevant data. Several approaches currently address this problem. One approach is to identify categories of genes that are overrepresented in the microarray output. This can be done using the Gene Ontology (GO) project, which describes gene products in terms of their associated biological processes, cellular components and molecular functions [1]. The advantage of this approach is that it is easily automated and can therefore be used for quick screening of large outputs. On the other hand, it confines the analysis to the structure of the GO project and so does not let researchers customize their analysis. A second approach is to search the literature for information about each gene on the list. Although this approach is comprehensive, it has many drawbacks: it is time-consuming; there is no systematic way to integrate the information learned about each gene; one tends to be distracted by seemingly interesting comparisons early in the literature search, so genes at the end of the list do not receive the same weight as genes at the top; and each gene has multiple names and symbols, so it is hard to extract the literature for any particular gene because each author may refer to it differently. A third approach relies on curated databases that gather all the known information pertaining to each gene. This approach is limited by the quality of the curation process. For example, for the yeast Saccharomyces cerevisiae there are excellent curated databases, such as the Yeast Proteome Database [2] and the Saccharomyces Genome Database [3], which contain all the known information about each gene. In other organisms, however, curation is at a less advanced stage, and the information in the curated databases is still partial.
We have developed an analysis tool that combines the advantages of the approaches described above and overcomes some of their disadvantages. Our tool, MILANO (Microarray Literature-based Annotation), automatically searches literature databases to produce a custom annotation of the gene list obtained from a microarray output. It does so by generating dynamic annotations for genes, built from terms provided by the researcher. The program receives as input a list of gene identifiers obtained from any microarray experiment and a set of custom search terms. It expands each gene identifier to its informative synonyms and searches literature databases for co-occurrences of every gene on the list with each of the custom terms. The output is an annotation table giving the number of publications for each gene-term combination (hit-counts). This novel annotation format can be used within a web browser or a spreadsheet program to quickly identify genes in the list that are related to the researcher's terms, and it is easily extended, since every hit-count in the annotation is a hyperlink to the query's results. The main advantage of this method over standard static annotations, such as Gene Ontology (GO) annotations, is that the annotations are generated from terms provided by the researcher and therefore help address the specific scientific question the researcher is pursuing.
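The hit-count table described above can be illustrated with a minimal Python sketch. It is not the MILANO implementation: the synonym table, the toy corpus standing in for a literature database, and the function names `mentions`, `build_query` and `hit_counts` are all assumptions made for illustration.

```python
import re

# Hypothetical synonym table; a real tool would draw these from a gene database.
SYNONYMS = {
    "TP53": ["TP53", "p53"],
    "CDKN1A": ["CDKN1A", "p21", "WAF1"],
}

# Invented sentences standing in for a literature database (illustration only).
CORPUS = [
    "p53 induces apoptosis in response to DNA damage",
    "WAF1 mediates cell cycle arrest downstream of p53",
    "TP53 mutations are frequent in cancer",
    "p21 expression and cell cycle arrest",
]


def mentions(text, phrase):
    """Return True if phrase occurs in text as a whole word or phrase, ignoring case."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


def build_query(names, term):
    """Build a boolean query string: any synonym of the gene AND the custom term."""
    gene_part = " OR ".join('"{}"'.format(name) for name in names)
    return '({}) AND "{}"'.format(gene_part, term)


def hit_counts(genes, terms, synonyms, corpus):
    """Count, for each gene-term pair, the documents mentioning both."""
    table = {}
    for gene in genes:
        # Expand the identifier to its synonyms; fall back to the identifier itself.
        names = synonyms.get(gene, [gene])
        table[gene] = {
            term: sum(
                1
                for doc in corpus
                if any(mentions(doc, name) for name in names) and mentions(doc, term)
            )
            for term in terms
        }
    return table


if __name__ == "__main__":
    terms = ["apoptosis", "cell cycle arrest", "cancer"]
    for gene, row in hit_counts(["TP53", "CDKN1A"], terms, SYNONYMS, CORPUS).items():
        print(gene, row)
```

In this sketch each cell of the table is a plain count. In a tool of the kind described, the string returned by something like `build_query` could also serve as the target of the hyperlink attached to each hit-count.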
The program can search two literature databases, GeneRIF [4] and Medline [5]. GeneRIF contains ~90,000 short summaries of curated articles relevant to known genes. An initial search of the microarray results against the GeneRIF database returns results within minutes that are easy to evaluate, providing immediate insight into the microarray results. This search is followed by a comprehensive Medline search via PubMed, which allows more subtle biological insights to be identified.
To demonstrate the power of this strategy, we analyzed a list of 148 genes affected by over-expression of p53 [6]. Our analysis helped retrieve 11 known p53 targets from the list, which are all the known targets it contains, and identify within the p53-affected genes subsets of putative p53 target genes known to be involved in apoptosis (43 genes), in cell cycle arrest (21 genes), and in cancer (48 genes), as shown in Figure 3. This example demonstrates the usefulness of our tool in narrowing down microarray results to a small list of genes involved in a specific biological activity.
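The narrowing-down step can be sketched as a filter over a hit-count table. The gene names and counts below are invented placeholders, not data from the p53 analysis, and the helper names `genes_matching` and `mean_hits` are assumptions for illustration rather than part of the tool.

```python
# Invented hit-counts for illustration only; not results from the p53 study.
TABLE = {
    "GENE_A": {"p53": 12, "apoptosis": 5, "cell cycle arrest": 0},
    "GENE_B": {"p53": 0, "apoptosis": 3, "cell cycle arrest": 2},
    "GENE_C": {"p53": 4, "apoptosis": 0, "cell cycle arrest": 7},
    "GENE_D": {"p53": 0, "apoptosis": 0, "cell cycle arrest": 0},
}


def genes_matching(table, term, min_hits=1):
    """Return the set of genes with at least min_hits publications for term."""
    return {gene for gene, row in table.items() if row.get(term, 0) >= min_hits}


def mean_hits(table, term):
    """Average number of publications per gene for term, over all genes in the table."""
    if not table:
        return 0.0
    return sum(row.get(term, 0) for row in table.values()) / len(table)


if __name__ == "__main__":
    apoptosis = genes_matching(TABLE, "apoptosis")
    arrest = genes_matching(TABLE, "cell cycle arrest")
    # Set intersection gives genes linked to both terms, as in a Venn diagram overlap.
    print(sorted(apoptosis & arrest))
    # A stricter threshold keeps only genes with more literature support.
    print(sorted(genes_matching(TABLE, "p53", min_hits=3)))
    print(mean_hits(TABLE, "apoptosis"))
```

The `min_hits` threshold is a design choice in this sketch: a value of 1 keeps every gene with any supporting publication, while higher values trade sensitivity for fewer incidental co-occurrences.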
Figure 3. Analysis of a list of genes affected by p53 overproduction. A. The number of genes remaining after filtering the p53-affected genes with terms intended to reveal known p53 targets. B. Average number of articles per gene in the different queries. C. Venn ...