With a pair-wise calculation of Pearson's correlation coefficient r, co-expressed profiles form discrete local clusters, or mountains, in a correlation topomap [18]. The probability is less than 10⁻¹² that two arbitrary and independently generated data sets of size 48, e.g. the joined UV and IR data set shown in Figure , are correlated with an r-value greater than 0.8 [25]. However, among tens of thousands of gene expression profiles, many that are inconsistently expressed among replicate groups or intra-groups may appear similar by chance and form a correlation local cluster due to stochastic noise. Unlike factor analysis methods, such as ICA, or clustering methods, such as CLICK, K-means or SOM, our approach, called Extracting Patterns and Identifying co-expressed Genes (EPIG), not only calculates the similarity among the profiles, but also evaluates each profile via signal-to-noise (S/N) ratio measurements (Equations 1 and 3). Through a filtering procedure, EPIG removes profiles that do not fit into a pattern. Only the profiles with high S/N ratios and desirable magnitudes of expression change are included in the formation of patterns representing co-expressed genes. With such a profile evaluation strategy, EPIG is able to extract patterns of co-expressed genes without predefined seeding.
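The chance-correlation figure quoted above can be sanity-checked with a short calculation. The sketch below assumes Fisher's z-transformation as the approximation; the text cites [25] for the probability and does not state which method was used, so this is an illustration and not the authors' calculation. The function name `chance_correlation_pvalue` is hypothetical.

```python
import math


def chance_correlation_pvalue(r: float, n: int) -> float:
    """Approximate one-sided probability that two independent samples of
    size n show a Pearson correlation of at least r.

    Assumption: Fisher's z-transformation, under which
    atanh(r) * sqrt(n - 3) is approximately standard normal when the
    true correlation is zero.
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("n must exceed 3 for this approximation")
    z = math.atanh(r) * math.sqrt(n - 3)
    # Upper-tail probability of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Values taken from the text: profile length 48, r-value threshold 0.8.
p = chance_correlation_pvalue(0.8, 48)
print(f"P(r >= 0.8 | n = 48) is approximately {p:.2e}")
```

Under this approximation the probability falls below the 10⁻¹² bound stated in the text. Other null models for r can give somewhat different tail values, so the result should be read as an order-of-magnitude check only.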
In a head-to-head comparison, EPIG competed with CLICK and CAST in the analysis of a simulated data set by 1) extracting all of the designated patterns, 2) accurately categorizing the profiles to their appropriate patterns, and 3) generating patterns of profiles with higher homogeneity and more stability (Table ). However, CLICK clearly outperformed EPIG and CAST in generating clusters/extracted patterns that are more dissimilar to each other (i.e., they have a lower average correlation between clusters/patterns). Furthermore, given the two experimental data sets presented above (one from the public domain), EPIG extracted more patterns of gene expression than CLICK (Tables and ). The patterns extracted by EPIG that were not represented by any of the cluster centroids generated by CLICK contained genes related to key biological responses coupled to the experimental treatments. For example, in the case of UV and IR DNA damage, the patterns extracted by EPIG contained p53 cell cycle control target genes (in Patterns 10 and 11) and many S phase genes (in Pattern 9) of the mitotic cell cycle (Figure ).
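The between-cluster dissimilarity criterion mentioned above (a lower average correlation between clusters/patterns) can be illustrated with a minimal sketch. It assumes each cluster or pattern is summarized by one representative profile and that the score is the plain mean of all pair-wise Pearson r-values; the function names and the toy profiles are hypothetical and are not taken from the study.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("a constant profile has no defined correlation")
    return sxy / math.sqrt(sxx * syy)


def mean_between_pattern_correlation(patterns):
    """Average Pearson r over all distinct pairs of representative profiles."""
    pairs = list(combinations(patterns, 2))
    if not pairs:
        raise ValueError("need at least two patterns")
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


# Toy representative profiles (illustrative values only).
toy_patterns = [
    [0.0, 1.0, 2.0, 3.0],  # rising
    [3.0, 2.0, 1.0, 0.0],  # falling
    [0.0, 2.0, 4.0, 6.0],  # rising, larger amplitude
]
score = mean_between_pattern_correlation(toy_patterns)
print(f"mean between-pattern r = {score:.3f}")
```

A lower score indicates that the extracted patterns are, on average, less similar to one another.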
There are two main thresholds used in EPIG pattern extraction: the local cluster size threshold Mt and the correlation threshold Rt. Rt determines the closeness in similarity that is allowed among the extracted patterns. Depending on the sample size, one may choose Rt such that the most similar patterns still possess clear response differences. For example, in Figure , Patterns 5 and 6 have a correlation r-value of 0.77, yet the two contain genes whose expression patterns display a clear difference in the response to UV-induced DNA damage. In Pattern 5, gene expression was repressed only at 2 h post-UV, while in Pattern 6, gene expression was repressed at both 2 and 6 h post-UV.
Mt is the minimum number of genes in a local cluster needed for a profile candidate to be deemed a pattern. The value of Mt affects the pattern extraction outcome. Higher Mt values may conceal a meaningful pattern that has a small number of co-expressed genes. On the other hand, lower Mt values may lead to the extraction of some patterns that lack biological meaning. To test for an optimal Mt setting, we varied its value from 2 to 19 and performed EPIG analysis on the IR-treated gene expression data. Figure shows that the average Pattern SNR increased as Mt increased, while the number of extracted patterns decreased. This result seems plausible: as Mt increases, more correlation local clusters are filtered out because their cluster sizes are less than Mt, and the fewer extracted patterns then have higher average SNRs. However, at Mt ≥ 6, the SNR shifted upward and the number of extracted patterns shifted downward. This result prompted us to set Mt to 6 for the given data set. To be precise, one should vary these thresholds empirically for a given data set and examine the outcomes. We have done so and have concluded that, across many gene expression data sets analyzed with EPIG, selections of Mt at 6 and Rt at 0.8 have worked reasonably well (data not shown). Incidentally, some genes may have profiles that are not similar to those of any other gene, or may belong to a local cluster with a size less than Mt. These "orphan" genes (singletons) will not be considered as pattern candidates, nor will they be categorized to any extracted pattern. Attention certainly needs to be paid to these orphan genes, as part of the EPIG analysis result, to determine whether they have a unique role in the treatment response.
Optimization of the Mt value. Cluster size threshold Mt (the horizontal axis) versus average of patterns' SNR (A) and number of extracted patterns (B).
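The Mt sweep described above can be sketched as follows. This is a simplified illustration, not the EPIG implementation: it assumes each candidate local cluster is already summarized by a hypothetical `(size, snr)` pair, and it only applies the size filter (clusters smaller than Mt are discarded) before counting the survivors and averaging their SNR. The toy values are invented for the example.

```python
from statistics import mean


def sweep_cluster_size_threshold(clusters, mt_values):
    """For each Mt, keep candidate clusters with size >= Mt and report
    (Mt, number of surviving clusters, mean SNR of survivors or None).

    `clusters` is a list of (size, snr) tuples; this summary format is an
    assumption made for illustration.
    """
    results = []
    for mt in mt_values:
        kept_snr = [snr for size, snr in clusters if size >= mt]
        average = mean(kept_snr) if kept_snr else None
        results.append((mt, len(kept_snr), average))
    return results


# Toy candidate clusters: (number of genes, pattern SNR). Illustrative only.
toy_clusters = [(2, 3.1), (3, 3.4), (5, 4.0), (6, 5.2), (9, 5.8), (20, 6.5)]

# Same range as in the text: Mt from 2 to 19.
sweep = sweep_cluster_size_threshold(toy_clusters, range(2, 20))
for mt, n_patterns, avg_snr in sweep:
    print(f"Mt={mt:2d}  patterns={n_patterns}  mean SNR={avg_snr:.2f}")
```

Plotting the second and third columns against Mt gives the kind of curves summarized in the figure caption, from which a threshold can be chosen empirically for a given data set.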
EPIG is a general method for gene expression analysis when the data consist of profiles with multiple inter-groups and multiple samples within each intra-group. Each intra-group has a specific biologically relevant factor. The inter-groups account for the factor variations. For example, in the IR time-series data set, since the intra-group included both biological and technical replicates, the common response features among different cell lines were identified. On the other hand, if the intra-group included only the technical replicates, then one would reasonably expect to extract patterns representing idiosyncratic responses in individual cell lines. The responses to DNA damage that are common among biological individuals are intriguing because they are conserved, but individual-specific responses are also of interest, as they point to inter-individual variations in response to external perturbations.
The application to the joined IR and UV data set had eight inter-groups, four each (i.e. sham, 2 h, 6 h and 24 h post-treatment) for IR and UV. In this case, similar and dissimilar responses between IR- and UV-induced DNA damage can be clearly observed (Figure ). For example, UV-specific response patterns included genes functioning in transcription regulation, RNA processing, nucleotide binding, and cell growth (Patterns 1 through 8 in Figure and Table ). The over-represented categories of Gene Ontology from the 616 genes in Pattern 8 included purine nucleotide binding, protein modification, ubiquitin cycle, kinase activity, and cell growth. It appears that protein kinases may be generally down-regulated specifically via phosphorylation in response to UV-induced DNA damage [26]. Two early-response patterns, Patterns 10 and 11 in Figure , showed that both UV and IR caused these genes to be up-regulated, but in different ways. Many of the genes in these two patterns have been widely studied and are known to be related to p53-dependent cell cycle control. The two different patterns of response to IR imply that factors other than p53 also influence the expression of p53-target genes. Pattern 14 in Figure showed similar late down-regulation responses to both UV and IR treatments. The 563 genes in Pattern 14 participate in a number of important biological processes, among them mitotic cell cycle, DNA replication, DNA repair, cell cycle checkpoint, and G0-like status transition [20].
In general, the inter-group-related factors are not limited to the time variable only. As a matter of fact, EPIG has been applied to many different data sets, where the variable factors include time, treatments (such as chemicals, radiation, knock-out), doses, organs (such as blood, liver, kidney), or organ sections (such as left or right lobe in liver). As such, EPIG is a robust, flexible and new pattern extraction method which is generally applicable to a variety of microarray data sets.