We manually compiled a database of gene lists from ChIP-X experiments by extracting the lists from the supporting materials of publications. In this ChIP-X database, each record contains a list of genes potentially regulated by a specific transcription factor under a specific condition. We included only publications that describe ChIP-X experiments profiling human or mouse cells. In addition to manually extracting the gene lists reported by the authors, we generated gene lists directly from the raw data files of each publication whenever these were available. We implemented our own method for indexing, peak calling and gene matching so that all raw ChIP-seq and ChIP-chip data were processed in the same standard way (see Supplementary data for detail).
The manually curated portion of the database as of July 26, 2010 contains 189 933 extracted interactions, from 84 publications, describing the binding of 92 transcription factors to 31 932 target genes. Several publications reported target genes for more than one factor, and several factors were profiled by different groups using different conditions and cell types. The automatically generated portion of the database contains 19 ChIP-seq and 10 ChIP-chip publications with 203 ChIP-seq and 22 ChIP-chip individual experiments. Both parts of the ChIP-X database are expected to keep growing. Because peak height varies considerably across experiments, we also implemented an interactive visualization tool that lets users set the peak height cut-off for a specific experiment dynamically with a slider (Fig. 1).
Fig. 1. Screen-shot from the ChIP-X database web application. Users can interactively adjust the normalized peak height threshold to determine which genes are considered regulated by the transcription factor.
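The effect of the peak height cut-off on the resulting gene list can be illustrated with a minimal sketch. It assumes that each peak has already been matched to a gene and carries a normalized height; the gene names, the heights and the function `target_genes` are hypothetical and are not taken from the database or its implementation.

```python
from collections import defaultdict

# Hypothetical peak records: (gene symbol, normalized peak height).
# Names and values are placeholders for illustration only.
peaks = [
    ("GENE_A", 0.92),
    ("GENE_B", 0.75),
    ("GENE_C", 0.40),
    ("GENE_B", 0.30),
    ("GENE_D", 0.10),
]


def target_genes(peaks, cutoff):
    """Return the sorted genes that have at least one peak at or above cutoff."""
    best = defaultdict(float)  # highest peak seen so far for each gene
    for gene, height in peaks:
        best[gene] = max(best[gene], height)
    return sorted(gene for gene, height in best.items() if height >= cutoff)


# Moving the slider corresponds to re-evaluating the list at a new cutoff.
for cutoff in (0.2, 0.5, 0.8):
    print(cutoff, target_genes(peaks, cutoff))
```

Raising the cut-off shortens the list, which is why a single fixed threshold can yield very different numbers of target genes from one experiment to the next.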
To compare input gene lists across species, human and mouse gene IDs were merged using HomoloGene. The species are nevertheless kept separate in the database, so users can run the analysis on each species separately. The automatically generated lists displayed more variability in the total number of genes called per experiment per transcription factor when a fixed peak height was used in combination with the same indexing and peak calling methods. This suggests that there is intrinsic variability in average peak height and in the number of identified peaks across different ChIP-seq and ChIP-chip experiments.
We used the ChIP-X database to build a web-based interactive software application called ChIP Enrichment Analysis (ChEA) (Fig. 2). With ChEA, users can cut and paste input lists of mammalian gene symbols, typically lists of genes whose expression level changed significantly in genome-wide gene expression profiling studies. The software then computes over-representation of transcription factor targets from the ChIP-X database. To compute statistical enrichment, we implemented the Fisher exact test with Bonferroni correction, where the quantities entering the test are the number of genes in the input list, the number of genes identified in the ChIP-X experiment, the number of genes shared between the two lists and the overall number of targets in the ChIP-X database (~30 000). The program reports a ranked list of ChIP-X experiments that show statistically significant overlap with the input list. Identified genes from the input list, potentially regulated by a specific transcription factor, are also connected and visualized as a network using known protein–protein interactions. To construct the protein–protein interaction network, we used the networks we consolidated for the program Genes2Networks (Berger et al.), as well as all mammalian interactions downloaded from the KEGG pathway database (Kanehisa et al.). Users can also browse the content of the database online. The database containing all interactions extracted manually from the ChIP-X experimental data, as well as the indexed files created from the raw data analysis, can be downloaded from our web site.
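The enrichment computation described above can be sketched as follows. This is a minimal illustration, not the ChEA implementation: it assumes a one-sided test (the upper tail of the hypergeometric distribution, which is the one-sided Fisher exact test for over-representation) and a Bonferroni factor equal to the number of experiments tested, neither of which is specified in the text. The function names, experiment names and toy gene lists are hypothetical.

```python
from math import comb


def fisher_enrichment_p(overlap, n_input, n_targets, n_background):
    """One-sided Fisher exact p-value: probability of seeing at least
    `overlap` shared genes under the hypergeometric null."""
    upper = min(n_input, n_targets)
    total = comb(n_background, n_input)
    tail = sum(
        comb(n_targets, k) * comb(n_background - n_targets, n_input - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def rank_experiments(input_genes, experiments, n_background=30000, alpha=0.05):
    """Rank experiments by p-value and keep those that remain significant
    after Bonferroni correction (assumed factor: number of experiments)."""
    input_set = set(input_genes)
    n_tests = len(experiments)
    results = []
    for name, targets in experiments.items():
        target_set = set(targets)
        shared = input_set & target_set
        p = fisher_enrichment_p(
            len(shared), len(input_set), len(target_set), n_background
        )
        p_adjusted = min(1.0, p * n_tests)
        results.append((name, len(shared), p, p_adjusted))
    results.sort(key=lambda row: row[2])
    return [row for row in results if row[3] <= alpha]


# Toy data with a small background so the numbers are easy to follow.
experiments = {
    "TF_A_exp1": ["G1", "G2", "G3", "G4"],
    "TF_B_exp1": ["G7", "G8", "G9", "G10"],
}
ranked = rank_experiments(["G1", "G2", "G3", "G4"], experiments, n_background=20)
for name, n_shared, p, p_adjusted in ranked:
    print(name, n_shared, p, p_adjusted)
```

In practice `n_background` would be the overall number of targets in the database (~30 000), and the input genes are assumed to belong to that background.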
Fig. 2. Screen-shot from the ChEA program web application. Users can cut and paste input lists of genes in the text box on the left. The system reports a ranked list of transcription factors/experiments (concatenated string that includes the transcription factor ...)
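The step in which identified genes are connected through known protein–protein interactions can be sketched in its simplest form as taking the subnetwork induced by those genes. This is an assumption made for illustration: the text does not state whether only direct interactions are used or whether intermediate proteins are also included. The interaction pairs, gene names and the function `connect_genes` below are hypothetical placeholders, not entries from the consolidated networks.

```python
# Hypothetical undirected protein-protein interactions (placeholder names).
known_interactions = [
    ("GENE_A", "GENE_B"),
    ("GENE_A", "GENE_C"),
    ("GENE_C", "GENE_D"),
    ("GENE_B", "GENE_E"),
]


def connect_genes(genes, interactions):
    """Return the interactions whose two endpoints are both in `genes`,
    as a sorted list of unique, alphabetically ordered pairs."""
    gene_set = set(genes)
    edges = set()
    for a, b in interactions:
        if a != b and a in gene_set and b in gene_set:
            edges.add(tuple(sorted((a, b))))
    return sorted(edges)


# Genes from the input list that overlap the targets of one experiment.
identified = ["GENE_A", "GENE_B", "GENE_D", "GENE_C"]
network = connect_genes(identified, known_interactions)
print(network)
```

GENE_E is not among the identified genes, so its interaction with GENE_B is left out of the resulting network.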