HHS Public Access. Author manuscript; accepted for publication in a peer-reviewed journal.
 
Open Bioinforma J. Author manuscript; available in PMC 2010 December 1.
Published in final edited form as:
Open Bioinforma J. 2009 January 1; 3: 26–30.
PMCID: PMC2995274
NIHMSID: NIHMS174875

GOAPhAR: An Integrative Discovery Tool for Annotation, Pathway Analysis

Abstract

We have developed the web-based tool GOAPhAR (Gene Ontology, Annotations and Pathways for Array Research), which integrates information from disparate sources regarding gene annotations, protein annotations, identifiers associated with probe sets, functional pathways, protein interactions, Gene Ontology, publicly available microarray datasets and tools for statistically validating clusters in microarray data. Genes of interest can be input as Affymetrix probe identifiers, Genbank identifiers or Unigene identifiers for the human, mouse or rat genomes. Results are provided in a user-friendly interface with hyperlinks to the sources of information.

Keywords: GOAPhAR: Gene ontology, Annotations and pathways for array research

1. Introduction

Microarrays are useful in profiling entire genomes of organisms under specific conditions [1]. The data generated are used to assess relationships among genes and to obtain a detailed understanding of underlying cellular processes. After the data are generated, probe set signals are filtered from noise and background. Normalization techniques [2] are then applied to minimize technical variation, and probe subsets are selected for detailed analysis. Typically, an initial step in analysis is to obtain annotative information for the selected probe sets [3]. Annotation can include structural information such as chromosome location, sequence information, coding regions, or homologs in other genomes; functional information such as biological pathways; and associated publications. Fortunately, extensive annotative information is freely available on the internet. However, because this information is vast and heterogeneous, the resources are scattered among many sites, and it can be a daunting task to locate relevant information from the disparate sources [4].

While some applications are available that integrate annotative information, they are frequently limited with regard to the identifiers they accept and the completeness of the information they provide. Many existing tools present multi-dimensional information as a single view that lacks logical integration, for example by displaying gene annotations, pathways and Gene Ontology on a single page. This makes it difficult for the user to understand and navigate through the results. Some applications require the user to enter one gene identifier at a time and offer no comprehensive batch mode of analysis, which makes annotation tedious.

The objective of this study was to develop a comprehensive tool to extract a wide range of annotative information from microarray data, and to provide this detailed information in a user-friendly environment. Here we present a new web-based application that mines information from various sources, integrates this information, and presents it to the user in a logical and accessible format. The integrated information is classified as ‘Gene Annotations’, ‘Protein Annotations’, ‘Gene Ontology’, ‘Biological Pathways’, ‘Protein Interactions’ and ‘Statistical Validity’. All results are hyperlinked to their sources so that users can browse and extract information of their choice. The tool also links to the results of existing tools that provide additional information, thus giving the user comprehensive information from a single source. Importantly, the tool provides batch mode analysis, so the user can query multiple probe sets simultaneously, and the results can be downloaded as a text file.

2. Methods & Implementation

Definition

GOAPhAR is an acronym for Gene Ontology, Annotations and Pathways for Array Research. The tool provides information regarding gene identifiers, gene locations on chromosomes, gene nomenclature, gene symbols, Gene Ontology, protein identifiers, tertiary protein structures, protein interactions mined from the literature, signaling and metabolic pathways and publicly available microarray datasets. It also provides a means of assessing the statistical validity of clusters derived from microarray data. A schematic diagram of the GOAPhAR workflow is shown in Fig. (1).

Fig. (1)
Schematic diagram depicting the workflow in GOAPhAR

GOAPhAR Databases

The annotations are classified as gene and protein annotations and are extracted for human, mouse and rat from the NETAFFX annotation file [5] available from Affymetrix. Gene Ontology information is extracted from NETAFFX and Gene Ontology Annotation (GOA). The pathway data sources are the Kyoto Encyclopedia of Genes and Genomes (KEGG) [6], the Signaling Pathway Database (SPAD) [7], GenMapp [8] and Panther [10]. Microarray datasets are available from the Gene Expression Omnibus at NCBI. Protein interaction information is extracted from the BIND database, and protein structures from the PDB database. The tool supports the human, mouse and rat genomes with Genbank, Unigene or Affymetrix probe set identifiers. The cluster validation component of GOAPhAR makes use of well-known statistical measures, namely the Dunn's, Davies-Bouldin and silhouette indices [11].

GOAPhAR Web Interface

The web interface was developed using PHP4. Additional scripts for information extraction were written in Perl, C and Java. The curated information is stored in a MySQL database.

GOAPhAR Usage

GOAPhAR is accessible through the web, and a free user account can be obtained by registering on the website. The website has been tested on the IE, Mozilla, Firefox and Safari web browsers. The input to the system is a newline-delimited text file of probe identifiers. The user is asked to upload this file and to specify the genome to which the identifiers belong and the type of identifier. The system consists of the previously mentioned aspects of data analysis, entitled ‘Gene Annotations’, ‘Protein Annotations’, ‘Gene Ontology’, ‘Biological Pathways’, ‘Protein Interactions’ and ‘Statistical Validity’. Once a user uploads the file and selects an option, the system locks the file, and the user can navigate through the entire system without having to upload the file again. The results are displayed in tabular format and can be downloaded as a text file. The data analysis aspects of GOAPhAR and their sources on the World Wide Web are shown in Table 1.

Table 1
Representation of the Data Analysis Aspects of GOAPhAR and their Sources from the World Wide Web
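
The newline-delimited input file described under GOAPhAR Usage can be prepared and checked before upload with a short script. The following Python sketch is illustrative only and is not part of GOAPhAR; the function names are hypothetical, and skipping blank lines and dropping repeated identifiers are assumptions made here rather than documented behaviour of the server.

```python
from pathlib import Path


def parse_identifiers(text):
    """Return the identifiers found in newline-delimited text.

    Assumptions (not taken from GOAPhAR itself): surrounding whitespace is
    ignored, blank lines are skipped, and repeated identifiers are kept only
    once, in order of first appearance.
    """
    seen = set()
    identifiers = []
    for line in text.splitlines():
        identifier = line.strip()
        if identifier and identifier not in seen:
            seen.add(identifier)
            identifiers.append(identifier)
    return identifiers


def read_identifier_file(path):
    """Read a newline-delimited identifier file from disk."""
    return parse_identifiers(Path(path).read_text(encoding="utf-8"))


def write_identifier_file(identifiers, path):
    """Write one identifier per line, ready for upload."""
    Path(path).write_text("\n".join(identifiers) + "\n", encoding="utf-8")
```

A list cleaned in this way contains a single type of identifier for a single genome, since the genome and identifier type are chosen once per upload.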

3. Results

In this section we describe the main analysis modules implemented in GOAPhAR.

Gene Annotations

Gene annotation is a critical feature, as it gives the identification, position and functional characteristics of genes in a genome. GOAPhAR extracts information from Genbank [12], Unigene and NETAFFX [9]. It also provides gene titles, gene symbols, reference transcript identifiers, associated homologs, enzyme commission numbers and chromosome locations as annotation. The identifiers from all of the above-mentioned databases are displayed, thus circumventing the problem of multiple identifiers being used for the same gene in different databases. These identifiers are hyperlinked to their sources, which provide more detailed information. Additionally, links are provided to the popular web-based tools Genecards [13] and iHOP [14], which provide additional annotation. The system can retrieve Unigene identifiers as well as other information related to a specific set of probe IDs.

Gene Ontology

Gene Ontology provides relevant information on the biological processes, molecular functions and cellular components in which gene products are involved [15]. This information is useful for determining additional functions of genes, their relation to other gene products, and for comparative genomics. Gene Ontology information is obtained from the Gene Ontology Consortium, and hyperlinks are provided to the QuickGO [16] and Reactome [17] applications. The user can view the hierarchical structure of Gene Ontology in QuickGO, while Reactome orders the gene products by organism.

Biological Pathways

After examining the annotations and expression levels, the investigator typically selects a set of genes for additional statistical analyses such as principal component analysis and clustering. If the genes are co-regulated, it is of interest to determine whether they share a common biological pathway. GOAPhAR integrates pathway information from the KEGG, Biocarta, GenMapp, Panther, SPAD and CellML databases, thus providing the user with comprehensive information. Some of the information is redundant, as many pathways occur in two or more databases, but this redundancy is also useful because many pathway schemas are incomplete.
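
The idea of checking whether co-regulated genes share a pathway across several databases can be sketched as follows. This Python example is a minimal illustration and not GOAPhAR's implementation: the record layout `(gene, database, pathway name)`, the function name `shared_pathways`, and the matching of pathways by lower-cased name are all assumptions made for the example. Real databases often name the same pathway differently, so name matching is only a rough stand-in for proper pathway identifiers.

```python
from collections import defaultdict


def shared_pathways(records, genes, min_genes=2):
    """Group pathway records for the selected genes.

    records   -- iterable of (gene, database, pathway_name) tuples
                 (hypothetical layout, chosen for this sketch)
    genes     -- identifiers of the genes of interest
    min_genes -- report only pathways containing at least this many of them

    Returns a dict keyed by normalised pathway name. Each value lists the
    selected genes found in that pathway and the databases that report it,
    so redundancy between databases stays visible.
    """
    wanted = set(genes)
    members = defaultdict(set)
    sources = defaultdict(set)
    for gene, database, pathway in records:
        if gene in wanted:
            key = pathway.strip().lower()  # assumed normalisation
            members[key].add(gene)
            sources[key].add(database)
    return {
        key: {"genes": sorted(members[key]), "databases": sorted(sources[key])}
        for key in members
        if len(members[key]) >= min_genes
    }


# Placeholder records; the gene and pathway names are invented for illustration.
example_records = [
    ("gene1", "KEGG", "Pathway X"),
    ("gene2", "Panther", "pathway x"),
    ("gene3", "KEGG", "Pathway Y"),
]
example_result = shared_pathways(example_records, ["gene1", "gene2", "gene3"])
```

Keeping the list of source databases alongside each pathway reflects the point made above: an entry reported by several databases is redundant, but the overlap can compensate for gaps in any single schema.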

Protein Annotations

A protein domain is a structurally and functionally defined protein region. If a protein contains multiple domains, it may be involved in two or more functions. Protein families are subsets of protein domains with related structure and function. This information is obtained from the Structural Classification of Proteins (SCOP) [18] and Protein Families (PFAM) [19] databases, and the tertiary structure is obtained from the Protein Data Bank (PDB) [20]. In addition, GOAPhAR provides the protein reference identifiers from NCBI and protein identifiers from Swissprot [21].

Protein Interactions

There is abundant public literature providing information concerning interactions between proteins [22]. These interactions are experimentally defined or hypothesized, and can be very helpful in assessing the molecular significance of changes in gene expression. PreBind [23] is a literature mining tool that extracts protein-protein associations from PubMed and classifies them into various interaction categories based on the results of pattern matching. GOAPhAR extracts this information from the PreBind database and displays it to the user. The investigator can then access the relevant citations via hyperlink.

Cluster Validity

Clustering is used to identify patterns in a microarray dataset and is often used to find co-regulated genes. There are many algorithms that produce clusters of various granularities. Because many different clusterings are possible, an appropriate one must be selected for further analysis. The Davies-Bouldin, Dunn's and silhouette indices [24] provide good assessments of within-cluster compactness and between-cluster separation. For example, a low Davies-Bouldin index, a high Dunn's index and a silhouette value close to 1 are considered good indications of valid clustering. GOAPhAR implements these indices, allowing the investigator both to identify related genes and to vet their function and annotation using the text files.
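
To make the three indices concrete, the following Python sketch computes them from their standard textbook definitions using Euclidean distance. It is not GOAPhAR's own code, and the function names are chosen here for illustration. It assumes at least two clusters, non-empty clusters given as lists of numeric tuples, distinct cluster centroids, and at least two distinct points overall.

```python
import math
from itertools import combinations


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def dunn_index(clusters):
    """Smallest between-cluster distance divided by largest cluster diameter."""
    min_between = min(
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    max_diameter = max(
        (math.dist(p, q) for c in clusters for p, q in combinations(c, 2)),
        default=0.0,
    )
    return min_between / max_diameter if max_diameter > 0 else math.inf


def davies_bouldin_index(clusters):
    """Mean, over clusters, of the worst scatter-to-separation ratio."""
    centers = [centroid(c) for c in clusters]
    scatter = [
        sum(math.dist(p, m) for p in c) / len(c)
        for c, m in zip(clusters, centers)
    ]
    k = len(clusters)
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k)
            if j != i
        )
        for i in range(k)
    ]
    return sum(worst) / k


def mean_silhouette(clusters):
    """Average silhouette value over all points (0 for singleton clusters)."""
    scores = []
    for i, own in enumerate(clusters):
        for p in own:
            if len(own) == 1:
                scores.append(0.0)
                continue
            # Mean distance to the other members of the point's own cluster.
            a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
            # Smallest mean distance to the members of any other cluster.
            b = min(
                sum(math.dist(p, q) for q in other) / len(other)
                for j, other in enumerate(clusters)
                if j != i
            )
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Two tight, well-separated toy clusters (invented coordinates).
toy_clusters = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]]
```

For the toy clusters, which are compact and far apart, the Dunn's index is large, the Davies-Bouldin index is small and the mean silhouette is close to 1, matching the interpretation given above. `math.dist` requires Python 3.8 or later.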

The usage of GOAPhAR is shown in Fig. (2), wherein the user uploads probe set IDs and retrieves various information related to them, including gene annotations, protein annotations and pathway annotations.

Fig. (2)
Simple usage of the GOAPhAR server, wherein the user provides the system with a list of probe set IDs and retrieves various annotations related to them.

Comparison with Other Tools

There are many web-based applications and desktop software programs that extract information regarding gene identifiers. Two of the web-based applications are Database Referencing of Gene Array Online (DRAGON) [25] and MicroArray Data Review and Annotation System (MADRAS) [26]. While these applications provide much useful information, DRAGON does not provide Gene Ontology information, whereas MADRAS lacks protein annotations. Commercial software such as GeneSpring provides pathway information from only a limited number of databases (i.e. KEGG and GenMapp). None of the above applications provides information on publicly available microarray datasets, protein structures or protein-protein interactions. GOAPhAR overcomes these limitations and provides a structured and detailed analytical framework. GOAPhAR's functionality is currently being expanded to include additional genomes, tools that map expression profiles onto functional pathways, statistical tools for analysis and tools that mine the protein interaction literature.

We are in the process of adding functionality to the GOAPhAR server that would allow the user to group objects (for example, probe set IDs with their respective proteins and their interactions). We are also adding functionality to the output so that the user can split the results.

4. Conclusions

GOAPhAR provides detailed and comprehensive information from microarray data in a user-friendly and structured manner. Investigators can use the information to filter genes or perform detailed analyses on subsets of genes. GOAPhAR can substantially reduce the time required for analyzing data obtained from gene profiling microarray experiments. It is useful in preliminary data analysis for finding gene and protein annotations, as well as for detailed analysis including functional pathways and protein interactions. By providing information from many sources in a single interface, the tool increases the efficiency of microarray data analysis and reduces time and effort. The tool is freely available at http://bioinformatics.kumc.edu/goaphar/

Acknowledgments

This work was supported by the K-INBRE, NIH grant number P20 RR016475 and Kansas IDDRC grant number P30 HD002528.

Footnotes

Publisher's Disclaimer: This is an open access article licensed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted, non-commercial use, distribution and reproduction in any medium, provided the work is properly cited.

References

1. Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981;147:195–197.
2. Lockhart DJ. Expression monitoring by hybridization to high-density oligonucleotide array. Nat Biotechnol. 1996;14(13):1675–80.
3. Bolstad BM. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003;19(2):185–93.
4. Riva A. Comments on selected fundamental aspects of microarray analysis. Comput Biol Chem. 2005;29(5):319–36.
5. Navarange M. MiMiR: a comprehensive solution for storage, annotation and exchange of microarray data. BMC Bioinform. 2005;6:268.
6. Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28(1):27–30.
7. Signaling Pathway Database. http://www.grt.kyushu-u.ac.jp/spad/
8. Doniger SW. MAPPFinder: using gene ontology and genMAPP to create a global gene-expression profile from microarray data. Genome Biol. 2003;4(1):R7.
10. Mi H. The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Res. 2005;33(Database issue):D284–8.
11. Bolshakova N, Azuaje F, Cunningham P. An integrated tool for microarray data clustering and cluster validity assessment. Bioinformatics. 2005;21(4):451–5.
12. Benson DA. GenBank. Nucleic Acids Res. 2006;34(Database issue):D16–20.
13. Safran M. Human Gene-Centric Databases at the Weizmann Institute of Science: GeneCards, UDB, CroW 21 and HORDE. Nucleic Acids Res. 2003;31(1):142–6.
15. Harris MA. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. 2004;32(Database issue):D258–61.
16. Hermida L. MIMAS: an innovative tool for network-based high density oligonucleotide microarray data management and annotation. BMC Bioinform. 2006;7:190.
17. Joshi-Tope G. Reactome: a knowledgebase of biological pathways. Nucleic Acids Res. 2005;33(Database issue):D428–32.
18. Lo Conte L. SCOP: a structural classification of proteins database. Nucleic Acids Res. 2000;28(1):257–9.
19. Finn RD. Pfam: clans, web tools and services. Nucleic Acids Res. 2006;34(Database issue):D247–51.
20. Sussman JL. Protein Data Bank (PDB): database of three-dimensional structural information of biological macromolecules. Acta Crystallogr D Biol Crystallogr. 1998;54(Pt 6 Pt 1):1078–84.
21. Bairoch A, Apweiler R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 2000;28(1):45–8.
22. Bader GD, Betel D, Hogue CW. BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res. 2003;31(1):248–50.
23. Donaldson I. PreBIND and Textomy--mining the biomedical literature for protein-protein interactions using a support vector machine. BMC Bioinform. 2003;4:11.
24. Bolshakova N, Azuaje F. CVE: cluster validation for gene expression data. Bioinformatics. 2003;19(18):2494–5.
25. Bouton CM, Pevsner J. DRAGON View: information visualization for annotated microarray data. Bioinformatics. 2002;18(2):323–4.