Medline is a rich information source, from which links between genes and keywords describing biological processes, pathways, drugs, pathologies and diseases can be extracted. We developed a publicly available tool called CoPub that uses the information in the Medline database for the biological interpretation of microarray data. CoPub allows batch input of multiple human, mouse or rat genes and produces lists of keywords from several biomedical thesauri that are significantly correlated with the set of input genes. These lists link to Medline abstracts in which the co-occurring input genes and correlated keywords are highlighted. Furthermore, CoPub can graphically visualize differentially expressed genes and over-represented keywords in a network, providing detailed insight into the relationships between genes and keywords, and revealing the most influential genes as highly connected hubs. CoPub is freely accessible at http://services.nbic.nl/cgi-bin/copub/CoPub.pl.
Analysis and interpretation of microarray data is not a trivial task. Many public and commercial bioinformatics tools have been developed to help scientists interpret the lists of differentially expressed genes that are the result of microarray experiments. For example, Gene Ontology and Pathway Mapping tools (1–4) allow batch input of genes and produce lists of GO terms or pathways that are significantly correlated with the input set of genes (5–13).
The outcome of these tools is based on well-established relationships between the genes and biological processes in which they participate. However, the primary literature contains much more information about the functions of genes than is captured in structured vocabularies or canonical pathways. To extract this additional information on gene function from literature, we used thesaurus-based keyword matching in Medline abstracts to link human, mouse and rat genes to biomedical concepts describing liver pathologies, pathways, GO terms, diseases, drugs and tissues (Table 1). This approach builds on the assumption that co-occurrence of a gene and a biomedical concept in the same abstract is an indication of a functional link between the gene and the concept.
In this article, we describe a tool named CoPub that calculates keyword over-representation for a set of regulated genes in a similar fashion to general GO term over-representation tools, but where the over-represented keywords for the gene set are retrieved directly from Medline by text mining. Several text mining methods for the analysis of microarray data have been published that annotate clustered sets of regulated genes based on their literature profile (14–17), or on their expression profile, often based on subsets of the total Medline repository (18–21). CoPub uses the entire Medline library to calculate robust statistics for gene-keyword co-occurrence, and is not dependent on pre-clustered gene sets to calculate significance for keyword over-representation. In addition to calculating over-represented keywords, CoPub also shows the results graphically in an interactive network, providing an additional level of insight into the biological mechanisms related to a set of regulated genes.
CoPub has two other features: the Gene search and the BioConcept search. These options identify genes and keywords that share occurrences in Medline abstracts with a gene or keyword of interest, which effectively provides a literature-based annotation for that gene or keyword.
In an earlier study (22), we successfully applied CoPub to evaluate the toxicity of a variety of compounds, which shows that CoPub is a useful additional bioinformatics tool for microarray data analysis. CoPub is freely accessible at http://services.nbic.nl/cgi-bin/copub/CoPub.pl.
Eleven thesauri were generated to search Medline (Table 1). These thesauri describe genes (human, mouse and rat), Gene Ontology terms, diseases, pathways, drugs, tissues and liver pathologies. The keyword thesauri are based on biological items, which represent an instance of a biological concept (e.g. a gene, a pathway), and may contain one or more keywords (e.g. a gene is assigned a full gene name as well as a gene symbol and gene aliases).
The full Medline baseline XML files (1966 to February 2008) were obtained from the NCBI website (http://www.nlm.nih.gov/bsd/licensee/2008_stats/baseline_doc.html) and extracted to small text files containing title, abstract and substances.
Regular expressions were used to search the compiled Medline text files for the presence of all keywords (~250 000) from the biological concept thesauri, as described by Alako et al. (23). Keywords that generated a hit in a Medline abstract were stored, together with the PubMed identifiers (IDs) of the Medline records in which the hit occurred. For every biological item, the hits were made non-redundant (note: multiple keywords of a biological item can occur in the same Medline abstract), resulting in a PubMed ID-biological item list. Gene symbols were curated for ambiguity and gene hits of orthologous genes were combined to make the keyword search more comprehensive.
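The matching and de-duplication step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used for CoPub: the thesaurus entries, the abstracts, the PubMed IDs and the function names are all hypothetical, and matching is done case-insensitively for simplicity, which a curated gene-symbol search would probably not do.

```python
import re
from collections import defaultdict

# Hypothetical toy thesaurus: biological item -> keywords (name, symbol, aliases).
THESAURUS = {
    "gene:TP53": ["tumor protein p53", "TP53", "p53"],
    "disease:hepatitis": ["hepatitis"],
}

# Hypothetical abstracts keyed by made-up PubMed IDs.
ABSTRACTS = {
    101: "TP53 (p53) mutations are frequent in chronic hepatitis.",
    102: "Hepatitis B virus replication in hepatocytes.",
    103: "The tumor protein p53 regulates apoptosis.",
}


def compile_item_pattern(keywords):
    """Build one case-insensitive, word-bounded regex covering all keywords of an item."""
    # Longest alternatives first, so multi-word names are tried before short symbols.
    alternatives = sorted((re.escape(k) for k in keywords), key=len, reverse=True)
    return re.compile(r"(?<!\w)(?:" + "|".join(alternatives) + r")(?!\w)", re.IGNORECASE)


def index_abstracts(thesaurus, abstracts):
    """Return item -> set of PubMed IDs.

    Using a set makes the hits non-redundant: several keywords of the same
    item in one abstract still count as a single hit for that item.
    """
    patterns = {item: compile_item_pattern(kws) for item, kws in thesaurus.items()}
    hits = defaultdict(set)
    for pmid, text in abstracts.items():
        for item, pattern in patterns.items():
            if pattern.search(text):
                hits[item].add(pmid)
    return dict(hits)


item_to_pmids = index_abstracts(THESAURUS, ABSTRACTS)
```

In this toy example, abstract 101 mentions both "TP53" and "p53" but contributes only one hit for the gene item, and "hepatocytes" does not match the keyword "hepatitis" because of the word-boundary checks.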
Co-publication of biological items (e.g. a gene with a pathology term) was retrieved from the database by matching common Medline abstract occurrences. For every biological item pair, two measures were calculated: an R-scaled score, which describes the strength of a co-citation between two keywords given their individual frequencies of occurrence (23), and the literature count, which is the number of co-publications for the pair. Both measures were used to describe the strength of the relationship between two keywords.
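As a rough sketch of how such pair statistics could be derived from a PubMed ID-biological item list, the literature count is simply the size of the intersection of two hit sets. The exact definition of the R-scaled score is given in (23) and is not reproduced here; the log ratio of observed to expected co-occurrence below is only an assumed stand-in for a frequency-corrected association measure, and all names and numbers are hypothetical.

```python
import math

# Hypothetical PubMed ID-biological item list (item -> set of PubMed IDs).
ITEM_TO_PMIDS = {
    "gene:TP53": {101, 103},
    "disease:hepatitis": {101, 102},
}


def co_publication(item_to_pmids, item_a, item_b, total_abstracts):
    """Return (literature_count, log_ratio) for one pair of biological items.

    literature_count: number of abstracts that mention both items.
    log_ratio: log10 of observed vs. expected co-occurrence frequency.
               This is an illustrative stand-in, NOT the R-scaled score of (23).
    """
    pmids_a = item_to_pmids[item_a]
    pmids_b = item_to_pmids[item_b]
    literature_count = len(pmids_a & pmids_b)
    if literature_count == 0:
        return 0, None  # no shared abstracts, so no association score

    p_a = len(pmids_a) / total_abstracts
    p_b = len(pmids_b) / total_abstracts
    p_ab = literature_count / total_abstracts
    return literature_count, math.log10(p_ab / (p_a * p_b))


count, score = co_publication(ITEM_TO_PMIDS, "gene:TP53", "disease:hepatitis", total_abstracts=4)
```

A positive ratio would indicate that the pair co-occurs more often than expected from the individual frequencies; in this toy corpus of four abstracts the pair co-occurs exactly as often as expected.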
To link gene expression data to literature data, mappings of Affymetrix probe set identifiers to Entrez Gene identifiers and orthology information were retrieved from Affymetrix human, mouse, rat GeneChip Genome Array annotation files (http://www.affymetrix.com). Mappings of Ensembl identifiers to Entrez Gene identifiers were retrieved from BioMart (http://www.biomart.org).
Keyword enrichment calculation is performed using the Fisher exact test, in which the association of a given keyword with a set of regulated genes (i.e. co-publications in Medline abstracts) is statistically tested against a background set: the set of unchanged genes on the microarray when Affymetrix probe set identifiers are uploaded, or all other genes in the genome when Entrez Gene or Ensembl identifiers are uploaded. The calculated P-values are corrected using the Benjamini–Hochberg multiple testing correction method. All statistical tests are performed using the R statistical package (http://www.r-project.org).
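The enrichment test can be illustrated with a small, dependency-free Python sketch. CoPub itself performs these tests in R; the version below is an assumption-laden illustration in which the test is taken to be one-sided (over-representation only), which the text does not state, and in which the function names and example counts are hypothetical.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact P-value for the 2x2 table [[a, b], [c, d]].

    a: regulated genes linked to the keyword   b: regulated genes not linked
    c: background genes linked to the keyword  d: background genes not linked
    Returns P(X >= a) under the hypergeometric null distribution.
    """
    regulated = a + b
    linked = a + c
    total = a + b + c + d
    upper = min(regulated, linked)
    tail = sum(
        comb(linked, k) * comb(total - linked, regulated - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(total, regulated)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical example: 3 of 4 regulated genes and 1 of 4 background genes
# share abstracts with the keyword.
p_single = fisher_exact_greater(3, 1, 1, 3)
p_adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

In practice one such table would be built per keyword, and the correction would be applied across all keywords tested for the uploaded gene set.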
The CoPub user-interface offers three analysis methods: the Microarray data analysis, the Gene search and the BioConcept search.
The Microarray data analysis option calculates keyword over-representation for a set of differentially expressed genes, and offers graphical visualization of the analysis results in a literature-based network. Screenshots and workflow of the Microarray data analysis are shown in Figure 1 and described subsequently in more detail.
The Gene search identifies genes and keywords that share co-occurrences in Medline abstracts with a gene of interest. It answers questions such as ‘Which diseases and drugs are strongly connected to the gene p53?’ In a similar manner, the BioConcept search identifies genes and keywords that share co-occurrences in Medline abstracts with a keyword of interest and answers questions such as ‘Which pathways are associated with Alzheimer's disease?’ Screenshots and workflows of the Gene search and the BioConcept search are shown in Figure 2 and described below in more detail.
The user can select one of two analysis modes for microarray data analysis: the keyword enrichment calculator or the matrix generator. For each of the two analysis modes, the user needs to upload a list of gene identifiers (Affymetrix probe set identifiers, Entrez Gene identifiers or Ensembl identifiers), either by copy–paste or as a text file. Following this, the user must specify the correct Affymetrix microarray chip or species and the categories of keywords used for analysis (Figure 1A). Example gene sets are provided.
For the keyword enrichment calculator, thresholds need to be specified for the P-value significance level, the minimal number of co-publications, the minimal R-scaled score and the minimal number of submitted genes that have linkage with the analyzed keyword in literature (i.e. share abstract occurrences). For the threshold settings, sensible defaults are provided.
Both analysis modes can be performed with either ‘species-specific’ or ‘cross-species’ gene information. In the ‘cross-species’ mode, the full gene names and gene symbols of orthologous human, mouse and rat genes are combined. In the ‘species-specific’ mode, by contrast, only the gene names and symbols of the selected species are used for analysis.
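One way to picture the difference between the two modes is as a choice between a single species' hit set and the union of the hit sets of all orthologs. The sketch below assumes that combining names and symbols is equivalent to taking this union; the gene identifiers, PubMed IDs and function name are hypothetical.

```python
# Hypothetical per-species hit sets (gene item -> PubMed IDs).
ITEM_TO_PMIDS = {
    "human:TP53": {101, 103},
    "mouse:Trp53": {103, 204},
    "rat:Tp53": {305},
}

# Hypothetical ortholog groups (group name -> member genes).
ORTHOLOG_GROUPS = {
    "TP53": ["human:TP53", "mouse:Trp53", "rat:Tp53"],
}


def merge_orthologs(item_to_pmids, ortholog_groups):
    """Cross-species mode: pool the abstracts of all orthologs in each group."""
    merged = {}
    for group, members in ortholog_groups.items():
        pmids = set()
        for gene in members:
            pmids |= item_to_pmids.get(gene, set())
        merged[group] = pmids
    return merged


cross_species = merge_orthologs(ITEM_TO_PMIDS, ORTHOLOG_GROUPS)
species_specific = ITEM_TO_PMIDS["human:TP53"]
```

Pooling increases the number of abstracts available per gene, at the cost of mixing findings reported for different species.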
The matrix generator produces a tab-delimited file in which all co-publication information between the uploaded set of genes and the selected keywords is presented in a matrix format. The values in the matrix can be either the absolute number of co-publications or the R-scaled score between a gene and a keyword. This matrix is provided as a flat text file, which can be used for follow-up analysis, such as clustering the genes on the basis of their keyword profile.
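A minimal sketch of such a gene-by-keyword matrix in tab-delimited form is shown below. The exact layout of the file that CoPub produces is not specified in the text, so the header row, the use of 0 for missing pairs, the function name and the counts are all assumptions made for illustration.

```python
import csv
import io


def write_matrix(gene_ids, keywords, values, handle):
    """Write a gene-by-keyword matrix as tab-delimited text.

    values maps (gene, keyword) to a co-publication count or score;
    pairs without any co-publication are written as 0 (an assumption).
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["gene"] + list(keywords))
    for gene in gene_ids:
        writer.writerow([gene] + [values.get((gene, kw), 0) for kw in keywords])


# Made-up co-publication counts, for illustration only.
counts = {
    ("TP53", "apoptosis"): 12,
    ("TP53", "hepatitis"): 3,
    ("CYP1A1", "hepatitis"): 7,
}

buffer = io.StringIO()
write_matrix(["TP53", "CYP1A1"], ["apoptosis", "hepatitis"], counts, buffer)
matrix_text = buffer.getvalue()
```

A file in this shape can be read directly by most clustering or spreadsheet tools, with one row per gene and one column per keyword.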
The keyword enrichment calculator produces a list of keywords, ranked by P-value (Figure 1B). This list already provides a first impression of the biological processes related to the gene set. The user can drill down into these results by clicking on the hyperlinked number of genes that are significantly associated with the analyzed keyword (Figure 1B). This link leads to an overview of the uploaded genes that share co-publications with the analyzed keyword (Figure 1C), and provides access to highlighted Medline abstracts in which they co-occur (Figure 1D).
CoPub can also visualize the keyword enrichment results as a network in SVG format (Figure 1E). In this interactive network, nodes represent over-represented keywords and differentially expressed genes, and edges represent literature links. The nodes and edges link to the relevant Medline abstracts in which co-occurring genes and keywords are highlighted. This allows for quick retrieval of relevant literature and interpretation of the data. Threshold sliders for the R-scaled score and the literature count can be used to reduce the size of the network interactively. Alternatively, the literature network can be re-calculated with new threshold values.
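The effect of the two thresholds on the network can be sketched as a simple edge filter followed by a degree count, which is one way to spot highly connected hub nodes. The edge structure, names and numbers below are hypothetical and are not taken from CoPub.

```python
from collections import Counter
from typing import NamedTuple


class Edge(NamedTuple):
    gene: str
    keyword: str
    r_scaled: float
    literature_count: int


def filter_network(edges, min_r_scaled, min_literature_count):
    """Keep edges that pass both thresholds; return (edges, node degrees)."""
    kept = [
        e for e in edges
        if e.r_scaled >= min_r_scaled and e.literature_count >= min_literature_count
    ]
    degree = Counter()
    for e in kept:
        degree[e.gene] += 1
        degree[e.keyword] += 1
    return kept, degree


# Made-up literature links, for illustration only.
EDGES = [
    Edge("TP53", "apoptosis", 45.0, 120),
    Edge("TP53", "hepatitis", 38.0, 15),
    Edge("CYP1A1", "hepatitis", 31.0, 2),
    Edge("CYP1A1", "apoptosis", 12.0, 40),
]

kept_edges, node_degree = filter_network(EDGES, min_r_scaled=30.0, min_literature_count=5)
```

Raising either threshold removes weakly supported links first, so the nodes that keep a high degree are those with many well-supported literature connections.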
The Gene search requires a single gene name or symbol as input, and the BioConcept search requires a single keyword (Figure 2). For both searches, the user also needs to specify the keyword categories in which literature co-occurrences with the gene or keyword of interest will be matched and retrieved.
In addition, thresholds for the minimal number of co-publications and the minimal R-scaled score between keywords/genes can be specified, for which sensible defaults are provided.
Both the Gene search and the BioConcept search return lists of genes and keywords that are strongly linked with the input gene or keyword of interest.
The results on these pages are all hyperlinked. This enables the user to navigate through various pages that report on how many times genes and keywords co-occurred in literature, and provide access to the relevant Medline abstracts (Figure 2).
We have developed CoPub, a web-based tool for calculating keyword enrichment in sets of regulated genes. It detects keywords that are significantly linked to a set of genes, using robust co-occurrence statistics of genes and keywords in Medline abstracts. In a study in which gene expression profiles induced by 11 distinct hepatotoxicants were analyzed with CoPub, we were able to accurately describe histopathological findings and the mode of toxicity of these compounds (22). This shows that CoPub is a useful additional tool in the toolbox for the analysis of microarray experiments.
We intend to further develop CoPub by updating its Medline content on a regular basis (once every 2 months), and by adding new and improved keyword thesauri. Furthermore, we will broaden the scope of the keyword over-representation analysis by accepting identifiers from other microarray platforms as input. On the output side, we will offer multiple graphical output formats, such as PNG and JPG, as well as a connection to Cytoscape, an open source network analysis tool, allowing for improved downstream analysis of the results generated by CoPub.
The authors thank SARA Computing and Networking Services (Amsterdam, The Netherlands) for updating and maintaining our CoPub database and web server, which was made possible by grants received from the Netherlands Bioinformatics Centre (NBIC) under the BioAssist program and from Organon, part of Schering-Plough Corporation.
Conflict of interest statement. None declared.