A large and exponentially growing volume of gene expression data from microarrays is now available publicly. Since the quantity of data from around the world dwarfs the output of any individual laboratory, there are opportunities for mining these data that can yield insights that would not be apparent from smaller, less diverse data sets. Consequently, numerous approaches for extracting large networks of relationships from large amounts of public-domain gene expression data have been used. Almost all of this work constructs networks of pairwise relationships between genes, indicating that the genes are co-expressed [1]. Co-expression is a symmetric relationship between a gene pair, because if A is related to B, then B is related to A. Many of these methods are based on showing that the expression of two genes has a coefficient of correlation exceeding some threshold.
We propose a new approach to identify a larger set of relationships between gene pairs across the whole genome using data from thousands of microarray experiments. We first classify the expression level of each gene on each array as 'low' or 'high' relative to an automatically determined threshold that is derived individually for each gene. We then identify all Boolean implications between pairs of genes. An implication is an if-then rule, such as 'if gene A's expression level is high, then gene B's expression level is almost always low', or more concisely, 'A high implies B low', written 'A high ⇒ B low'.
In general, Boolean implications are asymmetric: 'A high ⇒ B high' may hold for the data without 'B high ⇒ A high' holding. However, it is also possible that both of these implications hold, in which case A and B are said to be 'Boolean equivalent'. Boolean equivalence is a symmetric relationship. Equivalent genes are usually strongly correlated as well. A second kind of symmetric relationship occurs when A high ⇒ B low and B low ⇒ A high. In this case, the expression levels of A and B are usually strongly negatively correlated, and genes A and B are said to be 'opposite'. In total, six possible Boolean relationships are identified: two symmetric (equivalent and opposite) and four asymmetric (A low ⇒ B low, A low ⇒ B high, A high ⇒ B low, A high ⇒ B high). Below, 'symmetric relationship' means a Boolean equivalence or opposite relationship; 'asymmetric relationship' means any of the four kinds of implications, when the converse relationship does not hold; and 'relationship' means any of the two symmetric or four asymmetric relationships.
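The six relationships can be illustrated with a minimal Python sketch. It is not the procedure used to build the network: the per-gene thresholds are simply passed in, 'almost always' is approximated by a hypothetical 5% tolerance, and the function names are our own. The sketch labels a gene pair by finding which of the four low/high quadrants is nearly empty.

```python
from collections import Counter
from typing import List, Optional, Sequence, Set, Tuple


def discretize(values: Sequence[float], threshold: float) -> List[bool]:
    """True where the expression level is 'high' (above the gene's threshold)."""
    return [v > threshold for v in values]


def sparse_quadrants(a_high: Sequence[bool], b_high: Sequence[bool],
                     tolerance: float = 0.05) -> Set[Tuple[bool, bool]]:
    """Quadrants (A high?, B high?) that are nearly empty.

    Assumption: a quadrant is sparse when it holds at most `tolerance` of the
    arrays in its A group and at most `tolerance` of those in its B group.
    """
    counts = Counter(zip(a_high, b_high))
    n_a, n_b = Counter(a_high), Counter(b_high)
    sparse = set()
    for qa in (False, True):
        for qb in (False, True):
            if n_a[qa] == 0 or n_b[qb] == 0:
                continue  # a level that never occurs gives no evidence
            if counts[(qa, qb)] <= tolerance * min(n_a[qa], n_b[qb]):
                sparse.add((qa, qb))
    return sparse


def classify(a_high: Sequence[bool], b_high: Sequence[bool],
             tolerance: float = 0.05) -> Optional[str]:
    """Name the relationship between A and B, or None if there is none."""
    sparse = sparse_quadrants(a_high, b_high, tolerance)
    if {(True, False), (False, True)} <= sparse:
        return "equivalent"
    if {(True, True), (False, False)} <= sparse:
        return "opposite"
    names = {
        (True, False): "A high => B high",   # A high with B low is rare
        (True, True): "A high => B low",     # both high is rare
        (False, True): "A low => B low",     # A low with B high is rare
        (False, False): "A low => B high",   # both low is rare
    }
    for quadrant in sparse:
        return names[quadrant]
    return None


# Synthetic expression values for two genes on eight arrays (threshold 3.0 assumed).
a_high = discretize([1.0, 1.2, 0.9, 5.0, 5.5, 1.1, 6.0, 1.3], 3.0)
b_high = discretize([1.0, 6.1, 5.8, 6.0, 5.9, 0.8, 6.3, 1.2], 3.0)
print(classify(a_high, b_high))
```

In this toy data, A is never high where B is low, while B is high on two arrays where A is low, so only the asymmetric implication holds.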
The set of Boolean implications is a labeled directed graph, where the vertices are genes (more precisely, Affymetrix probesets for genes, in our data) and the edges are implications, labeled with the implication type. We call this graph the Boolean implication network. Networks based on symmetric relationships are undirected graphs.
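One possible in-memory representation of such a labeled directed graph is sketched below. The class name, method names, and probeset identifiers are hypothetical placeholders chosen for illustration, not part of our implementation.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set, Tuple

# An edge label records the level of the source gene and the implied level
# of the target gene, e.g. ("high", "low") for "source high => target low".
Label = Tuple[str, str]


class ImplicationNetwork:
    """Directed graph whose vertices are probesets and whose edges are implications."""

    def __init__(self) -> None:
        self._out: DefaultDict[str, DefaultDict[str, Set[Label]]] = defaultdict(
            lambda: defaultdict(set)
        )

    def add(self, src: str, src_level: str, dst: str, dst_level: str) -> None:
        """Record the implication 'src src_level => dst dst_level'."""
        self._out[src][dst].add((src_level, dst_level))

    def implications_from(self, src: str) -> List[Tuple[str, str, str]]:
        """All implications leaving src, as (src_level, dst, dst_level) triples."""
        targets = self._out.get(src, {})
        return [
            (src_level, dst, dst_level)
            for dst in sorted(targets)
            for src_level, dst_level in sorted(targets[dst])
        ]


net = ImplicationNetwork()
net.add("probeset_A", "high", "probeset_B", "high")
net.add("probeset_A", "high", "probeset_C", "low")
print(net.implications_from("probeset_A"))
```

Because edges are directed, recording 'A high ⇒ B high' adds nothing to the edges leaving B; a symmetric relationship would be stored as a pair of edges, one in each direction.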
It is important to understand that a Boolean implication is an empirically observed invariant on the expression levels of two genes and does not necessarily imply any causality. One way to understand the biological significance of a Boolean implication is to consider the sets of arrays where the two genes are expressed at a high level. The asymmetric Boolean implication A high ⇒ B high means that 'the set of arrays where A is high is a subset of the set of arrays where B is high'. For example, this may occur when gene B is specific to a particular cell type, and gene A is specific to a subclass of those cells. Alternatively, this implication can be the result of a regulatory relationship, so A high ⇒ B high could hold because A is one of several transcription factors that increase expression of B, or because B is a transcription factor that increases expression of A only in the presence of one or more cofactors. On the other hand, the asymmetric Boolean implication A high ⇒ B low means that A and B are rarely high on the same array: the genes are 'mutually exclusive'. A possible explanation for this is that A and B are specific to distinct cell types (for example, brain versus prostate), or it could be that A represses B or vice versa.
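The set interpretation can be made concrete with a short sketch. The array names and high/low calls below are invented for illustration, and the checks are exact subset and disjointness tests, whereas implications in real data tolerate a small number of exceptions.

```python
from typing import Dict, Set


def high_set(calls: Dict[str, bool]) -> Set[str]:
    """The set of arrays on which a gene is called 'high'."""
    return {array for array, is_high in calls.items() if is_high}


# Hypothetical high/low calls for three genes on six arrays.
gene_a = {"s1": True, "s2": True, "s3": False, "s4": False, "s5": False, "s6": False}
gene_b = {"s1": True, "s2": True, "s3": True, "s4": True, "s5": False, "s6": False}
gene_c = {"s1": False, "s2": False, "s3": False, "s4": False, "s5": True, "s6": True}

# A high => B high: every array where A is high is also one where B is high.
a_high_implies_b_high = high_set(gene_a) <= high_set(gene_b)

# The converse fails, so the relationship is asymmetric.
b_high_implies_a_high = high_set(gene_b) <= high_set(gene_a)

# A high => C low: A and C are never high on the same array (mutually exclusive).
a_high_implies_c_low = high_set(gene_a).isdisjoint(high_set(gene_c))

print(a_high_implies_b_high, b_high_implies_a_high, a_high_implies_c_low)
```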
Boolean implications capture many relationships that are overlooked by existing methods that scale to large amounts of data, because those methods generally find only symmetric relationships. There may be a highly significant Boolean implication between genes whose expression is only weakly correlated. The relationships in the resulting network are often biologically meaningful. The network identifies Boolean implications that describe known biological phenomena, as well as many new relationships that can serve to generate new hypotheses. Moreover, many previously unidentified relationships are conserved between humans, mice, and fruit flies.
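A small synthetic example, sketched below, shows how an implication can hold without exception while the correlation stays weak. The expression values, the shared threshold, and the array counts are invented for illustration, and the Pearson coefficient is computed directly rather than taken from any particular library.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


LOW, HIGH = 1.0, 8.0  # hypothetical expression levels
THRESHOLD = 4.5       # hypothetical threshold, shared by both genes here

# 100 synthetic arrays: B is high on the first 50, A is high on only 5 of those.
gene_b = [HIGH] * 50 + [LOW] * 50
gene_a = [HIGH] * 5 + [LOW] * 95

# Arrays that contradict 'A high => B high' (A high while B is low).
violations = sum(
    1 for a, b in zip(gene_a, gene_b) if a > THRESHOLD and b <= THRESHOLD
)
r = pearson(gene_a, gene_b)
print(violations, round(r, 3))
```

Here A high ⇒ B high has no exceptions, yet the correlation coefficient is only about 0.23, so a correlation threshold set for strong co-expression would miss the pair.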
A meta-analysis was performed on thousands of publicly available microarray datasets on Affymetrix platforms for humans, mice, and fruit flies. This is the first time Boolean implication networks have been applied to the problem of mining large quantities of microarray data. The remainder of this manuscript explains how the networks are constructed from gene-expression microarray datasets, and describes selected Boolean implications that capture important biological phenomena that would be overlooked in gene expression networks based on co-expression. We also discuss related work.