In this study, we extended the original two-dimensional microarray data to a three-dimensional REV space that incorporates TF binding information from ChIP-chip experiments, expression levels of TFs, and expression levels of regulated genes. To explore this kind of data, we extended previously described two-dimensional clustering approaches and developed an efficient and robust method: the TRI-Clustering algorithm, with a sub-algorithm for automatic threshold detection. We also provided detailed results and an analysis of a tri-cluster and its associated regulatory network.
The bi-clustering concept allows the identification of sets of genes that are condition-specific and may not be found by classical clustering, which operates on all experimental conditions. However, bi-clustering leads only to the identification of co-expressed genes, leaving the task of predicting or explaining the regulatory mechanisms as a further interpretive step. The tri-clustering concept that we propose here provides an explicit representation of the regulatory effects in the TF-gene network and also clearly identifies transitions in the network from condition to condition, which are implicit in the boundaries of the identified tri-clusters.
The meaning of this REV space involves the intersection of three orthogonal planes represented by familiar two-dimensional matrices. The T-G (transcription regulatory factor-target gene) matrix defines a connectivity network in which the potential for a regulatory factor to influence a gene is present under at least some conditions. The links in this connectivity network indicate the potential for a regulatory factor to influence the expression of a gene through transcriptional or post-transcriptional means. For DNA-binding proteins, a link indicates, in the simplest case, a potential DNA binding site in the promoter or other regulatory region of a gene, and the potential for occupancy under some biological conditions relevant to the analysis at hand. For a microRNA, a link might similarly indicate a complementary cis-regulatory region or other target for a hairpin structure. For gene silencing, a link might indicate that DNA methylation of the promoter occurs under some relevant conditions. For chromatin modifications, a link might indicate that methylation or acetylation of a particular histone site may occur, signaling silencing or activation under the relevant conditions. The G-C matrix is the gene expression matrix for the entire universe of genes under the complete set of conditions in which hybridization was performed, while the T-C (transcription factor-condition) matrix represents the condition-specific activity or influence of particular regulatory factors.
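To make the relationship among the three matrices concrete, the following is a minimal sketch of one way they could be combined into a sparse three-dimensional structure. The function name `build_rev`, the choice to store a pair of expression levels per cell, and the toy values are all illustrative assumptions; this is not the representation used by the TRI-Clustering implementation.

```python
def build_rev(tg, tc, gc):
    """Combine three 2-D matrices into a sparse 3-D structure (hypothetical layout).

    tg[t][g] is 1 if factor t has a potential regulatory link to gene g
             (for example, from ChIP-chip binding data), else 0.
    tc[t][c] is the expression level of factor t under condition c.
    gc[g][c] is the expression level of gene g under condition c.

    Returns a dict mapping (t, g, c) to (factor_level, gene_level),
    with entries only for factor-gene pairs that are linked in tg.
    """
    rev = {}
    for t, row in enumerate(tg):
        for g, linked in enumerate(row):
            if not linked:
                continue  # no regulatory evidence: cell stays empty
            for c in range(len(gc[g])):
                rev[(t, g, c)] = (tc[t][c], gc[g][c])
    return rev


# Toy data (made-up values): 2 factors, 3 genes, 2 conditions.
tg = [[1, 0, 1],
      [0, 1, 0]]
tc = [[2.0, 0.1],
      [0.3, 1.5]]
gc = [[1.8, 0.2],
      [0.4, 1.1],
      [2.2, 0.0]]

rev = build_rev(tg, tc, gc)
```

In this sketch, cells exist only where the T-G matrix provides a link, so any subset of the structure inherits explicit regulatory evidence, while the T-C and G-C values supply the condition dimension.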
Unlike bi-clusters extracted from microarray data alone, which carry no explicit information about regulation, a gene in a tri-cluster is always controlled by at least one regulator. At the same time, compared to traditional models inferred from ChIP-chip data, a tri-cluster carries additional information about the specificity of experimental conditions. From an informatics viewpoint, the regulation evidence from ChIP-chip experiments can be regarded as prior knowledge that holds for general conditions, but it does not guarantee that these rules apply under all circumstances. By adding new sources of data that specify activity under different conditions, we can refine the existing information about transcription regulation using machine learning algorithms.
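The property that every gene in a tri-cluster has at least one regulator can be expressed as a simple check against the T-G connectivity matrix. The sketch below assumes a candidate tri-cluster is given as lists of factor and gene indices; the helper name `every_gene_regulated` and the toy matrix are hypothetical and serve only to illustrate the constraint.

```python
def every_gene_regulated(tg, factors, genes):
    """Return True if each gene in `genes` is linked in tg
    to at least one factor in `factors`.

    tg[t][g] is 1 if factor t has a potential regulatory link to gene g.
    """
    return all(any(tg[t][g] for t in factors) for g in genes)


# Toy connectivity matrix (made-up): 2 factors, 3 genes.
tg = [[1, 0, 1],
      [0, 1, 0]]

# Genes 0 and 2 are both linked to factor 0, so the constraint holds.
ok = every_gene_regulated(tg, factors=[0], genes=[0, 2])

# Gene 1 is linked only to factor 1, so a cluster with factor 0 alone fails.
bad = every_gene_regulated(tg, factors=[0], genes=[0, 1])
```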
Vast amounts of detailed gene regulatory information are already available or will become available in the near future, including transcription factor regulatory interactions, transcriptional or post-transcriptional events related to small RNAs, sense-antisense interactions, and epigenetic activities such as DNA methylation, histone modification, and chromatin remodeling. Novel approaches are needed to detect subtle and complex regulatory events among these various factors. One intriguing question is whether REV data in higher-dimensional spaces can reflect all of these kinds of regulatory influence, and whether the related algorithms can be extended to analyze and integrate all such information.