In two recent papers, Duan et al. and Dai and Dai (11) assessed the extent to which certain functional genomic elements colocalize, using P-values derived from a hypergeometric distribution. We have shown here that such hypergeometric P-values are flawed. The assumptions of the hypergeometric distribution are inappropriate in this setting, and consequently hypergeometric P-values computed on random gene sets are far from uniform. We then presented an alternative, resampling-based P-value calculation approach that is suitable for this setting. These resampling-based P-values indicate a complete lack of evidence that the 174 coregulated gene sets studied in Dai and Dai (11) colocalize in the nucleus. However, they do support the hypothesis that centromeres colocalize, and provide some evidence in support of colocalization of other functional genomic elements.
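
To make the idea of a resampling-based P-value concrete, the following is a minimal sketch and not the exact procedure used in this study. It assumes that the interaction data have been reduced to a set of interacting gene pairs, that the test statistic is the number of interacting pairs within a gene set, and that null gene sets are drawn uniformly at random with the same size as the observed set. The names `count_interacting_pairs` and `resampling_pvalue` are hypothetical, and a real analysis may need to constrain the resampling further (for example, by chromosome).

```python
import random
from itertools import combinations


def count_interacting_pairs(interactions, gene_set):
    """Test statistic: number of gene pairs within gene_set that interact.

    `interactions` is a set of frozenset({i, j}) gene pairs; this is an
    assumed representation of a thresholded interaction matrix.
    """
    return sum(frozenset(pair) in interactions
               for pair in combinations(gene_set, 2))


def resampling_pvalue(interactions, gene_set, all_genes,
                      n_resamples=1000, seed=None):
    """Estimate a P-value by comparing gene_set with random gene sets.

    `all_genes` must be a sequence (list or range) of gene identifiers.
    Random sets are drawn uniformly without replacement and have the same
    size as gene_set (an illustrative assumption).
    """
    rng = random.Random(seed)
    observed = count_interacting_pairs(interactions, gene_set)
    k = len(gene_set)
    at_least_as_extreme = 0
    for _ in range(n_resamples):
        random_set = rng.sample(all_genes, k)
        if count_interacting_pairs(interactions, random_set) >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimated P-value is never exactly zero.
    return (1 + at_least_as_extreme) / (1 + n_resamples)
```

Because the null distribution is built from gene sets drawn from the same interaction data, any dependence among pairs that share a gene is present in both the observed and the resampled statistics, which is the property the hypergeometric calculation lacks.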
In the current study, we reassessed the extent to which target genes of 174 TFs, considered by Dai and Dai (11), exhibit colocalization. We did not investigate several other results in that study that were also based on a hypergeometric test: that only one TF shows significant colocalization based on intrachromosomal interactions, that 5 of 158 TFs measured via ChIP-chip show evidence of colocalization of their targets, and that various classes of chromatin regulatory genes (histone modification regulated genes, genes whose promoters exhibit high chromatin remodeler occupancy, genes that show expression changes in response to chromatin remodeler perturbation, genes whose promoters are occupied by nucleosomes, genes containing histone variant H2A.Z, and genes with high trans effects on gene expression divergence) are colocalized. We are not claiming that coregulated gene sets do not colocalize in the nucleus; we are simply stating that there does not appear to be evidence in the Duan et al. data set of colocalization of the 174 gene sets studied by Dai and Dai (11).
The implications of our reanalysis for the claims made in the Duan et al. paper are relatively minor. The primary colocalization claims in that paper (regarding centromeres, telomeres, tRNAs, breakpoints and origins of replication) were based primarily upon the qualitative assessment of a set of receiver operating characteristic curves (Figure 4d of that paper). This analysis was augmented by a set of hypergeometric tests, reported in their Supplementary Figure 11. Our analysis suggests that three of the asterisks in that figure (indicating Bonferroni-adjusted significance of 0.01) were erroneous. These three changes imply weaker statistical support for the colocalization of early-firing origins of replication and chromosomal breakpoints, and no support for the colocalization of telomeres.
We have shown that using a hypergeometric test to assess colocalization of a gene set is invalid, since the gene pairs underlying the hypergeometric test calculation are not independent. Goeman and Buhlmann (13) showed that for a similar reason, it is incorrect to use a hypergeometric test to assess the extent to which genes associated with a particular Gene Ontology term are differentially expressed. The problem of assessing colocalization of gene sets is inherently a difficult one, since it is unclear what one would expect 3D interaction data to look like under the null hypothesis, i.e. in the absence of colocalization. To overcome this difficulty, we have proposed a resampling-based approach for assessing colocalization. This approach suffers from some drawbacks that are shared with the hypergeometric test. Using the terminology of Goeman and Buhlmann (13), both the resampling-based and hypergeometric P-values test a competitive null hypothesis, which posits that genes in a given gene set colocalize no more than the genes not in the gene set. Both are gene sampling methods, and hence do not provide any information about whether a given gene set will colocalize on a new interaction matrix derived from a future experiment. (Indeed, on the basis of a single interaction matrix, one cannot make claims about future experiments.) Instead, these P-values tell us whether or not, if one were to obtain more genes corresponding to a given gene set, one would expect those new genes to colocalize. Though both the hypergeometric and resampling-based P-values are gene sampling methods for testing the competitive null, the resampling-based P-values do not rely on the untenable assumption of independence of gene pairs. Consequently, unlike the hypergeometric P-values, our proposed P-values follow a uniform distribution under the null hypothesis of no colocalization.
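
The contrast between the two kinds of P-values under the null can be illustrated with a small simulation. This is a hedged sketch on synthetic data, not a reproduction of the analysis in this paper. It assumes a toy model in which each gene has its own interaction propensity, so that gene pairs sharing a gene are dependent, and it treats every gene pair as one draw in the hypergeometric calculation. All function names and parameter values are hypothetical.

```python
import math
import random
from bisect import bisect_left
from itertools import combinations


def simulate_interactions(n_genes, rng):
    """Toy data: genes i and j interact with probability w[i] * w[j].

    The per-gene propensities w induce dependence between pairs that
    share a gene (an assumed model, not real interaction data).
    """
    w = [rng.uniform(0.02, 0.8) for _ in range(n_genes)]
    return {frozenset((i, j))
            for i, j in combinations(range(n_genes), 2)
            if rng.random() < w[i] * w[j]}


def pair_count(interactions, gene_set):
    """Number of interacting gene pairs within gene_set."""
    return sum(frozenset(p) in interactions
               for p in combinations(gene_set, 2))


def hypergeom_upper_tail(total, successes, draws):
    """tail[x] = P(X >= x) for X ~ Hypergeometric(total, successes, draws)."""
    denom = math.comb(total, draws)
    tail, running = [0.0] * (draws + 1), 0.0
    for x in range(draws, -1, -1):  # accumulate from the far tail inward
        running += (math.comb(successes, x)
                    * math.comb(total - successes, draws - x) / denom)
        tail[x] = running
    return tail


def null_pvalue_experiment(n_genes=300, set_size=20, n_sets=1000,
                           n_null=2000, seed=0):
    """P-values of both kinds for random gene sets (the null is true)."""
    rng = random.Random(seed)
    interactions = simulate_interactions(n_genes, rng)

    def draw():
        return pair_count(interactions, rng.sample(range(n_genes), set_size))

    null_stats = sorted(draw() for _ in range(n_null))
    test_stats = [draw() for _ in range(n_sets)]

    # Hypergeometric: treats the pairs in a gene set as independent draws
    # from the population of all gene pairs.
    tail = hypergeom_upper_tail(math.comb(n_genes, 2), len(interactions),
                                math.comb(set_size, 2))
    p_hyper = [tail[t] for t in test_stats]

    # Resampling: compare each statistic with those of other random sets.
    p_resamp = [(1 + n_null - bisect_left(null_stats, t)) / (1 + n_null)
                for t in test_stats]
    return p_hyper, p_resamp
```

Calling `null_pvalue_experiment()` and comparing the fraction of each list that falls below 0.05 gives a rough check of uniformity. Under the assumed toy model, one would expect the hypergeometric fraction to exceed the nominal 5% noticeably, while the resampling fraction should stay close to it.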