The development of high-throughput methodologies, as epitomized by microarray technologies, has led to the rapid generation of large-scale datasets about RNA transcripts or proteins. Whereas biologists once studied one gene at a time, high-throughput technologies now allow tens of thousands of genes to be analyzed simultaneously. The nature of high-throughput technologies requires that bioinformatics tools focus on ‘gene sets’ instead of ‘single genes’. For example, microarray and proteome technologies are producing sets of genes and proteins that are differentially expressed under certain conditions, or sets of genes and proteins that are co-expressed under varying conditions. Other large-scale genetic studies, such as quantitative trait analysis and large-scale mutagenesis, are also producing sets of interesting genes. Translating the identified gene sets into a better understanding of the underlying biological processes constitutes a huge challenge for today's biologists. Even retrieving the associated functional information for large gene sets can be time-consuming. Further manipulating, visualizing, and statistically analyzing the interrelated data can involve processes that are complex for an average biologist. Without the assistance of appropriate bioinformatics tools, exploring gene sets to discover important patterns is not a trivial task.
Traditional resources for retrieving functional information, such as LocusLink from NCBI (National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov/LocusLink/), typically present information in a one-gene-at-a-time format (1). A newer generation of resources has been created to facilitate batch information retrieval for sets of genes (2). One such example is ENSMART (http://www.ensembl.org/Multi/martview), in which users can search and retrieve genome information for sets of genes in human and several other eukaryotic species (3). ENSMART covers a broad spectrum of functional information pertaining to gene- and protein-specific attributes as well as disease, expression, sequence variation and cross-species attributes. Despite being an excellent batch information retrieval tool, ENSMART does not help biologists efficiently explore the abundant information associated with a gene set.
One way to help biologists explore large gene sets is to organize the genes based on common functional features, such as Gene Ontology (GO) (5) categories or biochemical pathways. Several bioinformatics tools have been developed for organizing sets of genes based on GO (6). Most of these tools have also implemented statistical tests to identify enriched GO categories and to suggest the most important biological areas associated with a given gene set. Although the use of ontological methods to structure biological knowledge is an active area of research and development, the body of biological knowledge associated with any gene set extends far beyond GO. In addition to organizing gene sets within the context of GO, MAPPFinder (11), DAVID (12) and GFINDer (13) provide the option of organizing and visualizing gene sets within the context of KEGG (Kyoto Encyclopedia of Genes and Genomes, http://www.genome.ad.jp/kegg) biochemical pathways (14). DAVID and GFINDer can also organize gene sets based on protein domain information. Other features, such as chromosome location, tissue expression pattern and association in publications, could also be used to organize a gene set. However, these features are not implemented in the current gene set analysis tools.
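To make the idea of an enrichment test concrete, the following is a minimal sketch of one commonly used choice, the one-sided hypergeometric test. It is an illustration only: the text above does not specify which test any particular tool uses, and the function and parameter names here are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_reference, n_category, n_set, n_overlap):
    """One-sided hypergeometric p-value for category enrichment.

    Hypothetical helper for illustration; not taken from any specific tool.

    n_reference: number of genes in the reference (background) set
    n_category:  number of reference genes annotated to the category
    n_set:       number of genes in the gene set of interest
    n_overlap:   number of genes in the gene set annotated to the category

    Returns P(X >= n_overlap), where X is the number of category genes
    obtained when n_set genes are drawn at random, without replacement,
    from the reference set.
    """
    # The overlap cannot exceed either the category size or the set size.
    upper = min(n_category, n_set)
    # Number of ways to draw any gene set of this size from the reference.
    total = comb(n_reference, n_set)
    # Sum the probabilities of observing n_overlap or more category genes.
    tail = sum(
        comb(n_category, k) * comb(n_reference - n_category, n_set - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total
```

A small p-value suggests that the category is over-represented in the gene set relative to the reference. In practice many categories are tested at once, so some form of multiple-testing adjustment would normally be applied; that step is omitted from this sketch.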
Although methods of gene organization help biologists explore large gene sets, they frequently generate complex results with hundreds of categories. Information visualization enables people to deal with the overwhelming amount of information associated with a gene set by taking advantage of our innate visual perception capabilities. Visual methods are useful in displaying data in ways that capitalize upon the particular strengths of human pattern processing abilities (15). Information visualization techniques have been successfully used in many areas of bioinformatics, including molecular structures, expression profiles, genome and sequence annotation, sequence analysis, molecular pathways, ontology, taxonomy and phylogeny (16). Applying information visualization techniques to gene set analysis will not only help in visualizing large amounts of information, but also facilitate data mining by aiding the recognition of patterns and trends.
Besides information retrieval, organization, statistical analysis and visualization, management of large gene sets presents additional challenges for biologists. Bioinformatics tools are needed to create subsets of genes from a gene set based on different criteria, such as GO categories, biochemical pathways or chromosome location ranges. Tools are also needed to perform Boolean operations and generate the unions, intersections and differences between gene sets. Boolean operations could help to reveal the interrelationship among different gene sets.
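The set-management operations described above can be sketched in a few lines. This is an illustrative sketch only: the function names, gene identifiers and coordinates are hypothetical placeholders, and it does not represent the implementation of any particular tool.

```python
def boolean_operations(set_a, set_b):
    """Return the union, intersection and differences of two gene sets."""
    return {
        "union": set_a | set_b,
        "intersection": set_a & set_b,
        "a_minus_b": set_a - set_b,
        "b_minus_a": set_b - set_a,
    }


def subset_by_location(gene_set, locations, chromosome, start, end):
    """Select the genes that lie within a chromosome location range.

    locations maps a gene identifier to a (chromosome, position) pair.
    Genes without a known location are skipped.
    """
    selected = set()
    for gene in gene_set:
        if gene not in locations:
            continue
        gene_chromosome, position = locations[gene]
        if gene_chromosome == chromosome and start <= position <= end:
            selected.add(gene)
    return selected


# Hypothetical gene sets and coordinates, made up for illustration.
set_a = {"gene1", "gene2", "gene3", "gene4"}
set_b = {"gene3", "gene4", "gene5"}
locations = {
    "gene1": ("chr1", 1500),
    "gene2": ("chr1", 90000),
    "gene3": ("chr2", 1200),
    "gene4": ("chr1", 4800),
}

results = boolean_operations(set_a, set_b)
chr1_subset = subset_by_location(set_a, locations, "chr1", 1000, 5000)
```

Here `results["intersection"]` holds the genes shared by both sets, and `chr1_subset` holds the members of `set_a` located between positions 1000 and 5000 on the placeholder chromosome `chr1`. Subsetting by GO category or pathway would follow the same pattern, with the location lookup replaced by an annotation lookup.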
In response to these challenges, we have developed WebGestalt (WEB-based GEne SeT AnaLysis Toolkit), an integrated data mining system for the management, information retrieval, organization, visualization and statistical analysis of large sets of genes.