Gene expression data
For our study we used public datasets from the Stanford Microarray Database (SMD) [19]. This database holds a large amount of current expression data from a single platform (cDNA microarrays), which is an important prerequisite for a well-founded analysis [20]. Datasets were selected by the following criteria:
• At least 20 000 clones per array
• At least 20 arrays per dataset
• Equal sets of measured clones per dataset
• Publication not earlier than 2003.
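The selection criteria above can be sketched as a simple filter. This is only an illustration: the `Dataset` record and its fields are hypothetical, and "equal sets of measured clones" is read here as "every array of a dataset measures the same set of clones", which is an assumption about the criterion.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical record describing one SMD dataset."""
    name: str
    year: int            # publication year
    clone_sets: list     # one set of clone IDs per array


def meets_criteria(ds, min_clones=20_000, min_arrays=20, min_year=2003):
    """Return True if the dataset satisfies all four selection criteria."""
    if not ds.clone_sets or len(ds.clone_sets) < min_arrays:
        return False
    first = ds.clone_sets[0]
    # Assumed reading: all arrays measure exactly the same clones.
    same_clones = all(s == first for s in ds.clone_sets)
    return same_clones and len(first) >= min_clones and ds.year >= min_year
```

The thresholds are keyword arguments so that the same function can be tried on small toy inputs.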
The following datasets were included in our study: Chi et al. [12] (human kidney cells), Higgins et al. [13] (normal kidney tissue), Pathan et al. [15] (infection of blood cells), Zhang et al. [16] (gene transcription during replicative senescence in human fibroblasts and mammary epithelial cells), and Zhao et al. [17] (effects of methylseleninic acid on the transcriptional program of LNCaP (Lymph Node Carcinoma of the Prostate) prostate cancer cells). The number of arrays per dataset ranges from 21 to 42 and the number of measured clones from 31 736 to 43 196. As the expression level we used the binary logarithm of the normalised ratio of the gene signal (channel 2) to the reference signal (channel 1).
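As a minimal sketch of the expression measure, the function below computes the binary logarithm of the channel 2 to channel 1 ratio. It assumes that both intensity arrays are already normalised and have the same shape; marking non-positive intensities as missing is our own choice for the example, not a step described above.

```python
import numpy as np


def log2_ratio(channel2, channel1):
    """log2(channel 2 / channel 1) per spot; NaN where the ratio is undefined."""
    ch2 = np.asarray(channel2, dtype=float)
    ch1 = np.asarray(channel1, dtype=float)
    out = np.full(ch2.shape, np.nan)
    valid = (ch1 > 0) & (ch2 > 0)   # the logarithm needs a positive ratio
    out[valid] = np.log2(ch2[valid] / ch1[valid])
    return out
```

A value of 1 then corresponds to a twofold higher gene signal than reference signal, and a value of -1 to a twofold lower one.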
Protein-protein interaction data
As the protein interaction database we used DIP [21], which lists protein pairs that are known to interact with each other; we chose DIP because it allows the user to select interactions by species of origin (e.g. human). Interaction here means that two amino acid chains were experimentally identified to bind to each other. In September 2004 the database comprised 1369 human protein-protein interaction pairs, including 51 self-interactions. These self-interactions were excluded from the analysis because the corresponding gene expression levels are two identical vectors and therefore always have correlation 1.
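Excluding self-interactions amounts to a one-line filter. The sketch below assumes the interactions are available as a list of identifier pairs; this representation is hypothetical and not the DIP file format.

```python
def drop_self_interactions(pairs):
    """Keep only interactions between two different proteins."""
    return [(a, b) for a, b in pairs if a != b]
```

With the counts given above, 1369 - 51 = 1318 pairs would remain after this step.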
Matching gene and protein identifiers
In order to determine the expression levels of genes encoding interacting proteins, we have to know which proteins are encoded by which genes, i.e. we have to match gene identifiers with protein identifiers. Specifically, we matched UniGene cluster IDs [22] from the expression files of the SMD [19] with the Swiss-Prot accession numbers [23] (e.g. sp:Q07812), PIR accession numbers [24] (e.g. pir:A47538), and NCBI sequence identification numbers [25] (e.g. gi:539664) used in the DIP files. As 'translator' we used a file called 'Hs.data' from the NCBI website [26], which contains the mentioned identifiers and the corresponding UniGene cluster IDs. In order to limit runtime, we refrained from applying sophisticated selection methods [27] where multiple matches occurred and always took the first hit. With this approach, the encoding genes can be determined for 87% of the interacting protein pairs. In cases where this procedure was not successful, we used information from the Harvester website [28]. Combining both methods increases the proportion from 87% to 94%. For many of these protein interactions, however, no expression data of the encoding genes are available. Depending on the number of genes measured in the five expression datasets, the corresponding gene expression levels can be determined for at least 43% (dataset 4) and up to 72% (dataset 3) of the proteins. To evaluate the dependence between the expression levels of two genes encoding interacting proteins, the expression of both genes has to be measured. Disregarding self-interactions, this is the case for at least 10% (dataset 4) and up to 47% (dataset 3) of the 1369 interactions.
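The first-hit matching rule can be illustrated as follows. The sketch assumes that the translator file has already been parsed into (protein identifier, UniGene cluster ID) records in file order; the parsing itself, the function names, and the UniGene IDs used in the usage note are hypothetical.

```python
def build_first_hit_map(records):
    """Map each protein identifier to the first UniGene ID listed for it."""
    mapping = {}
    for protein_id, unigene_id in records:
        # setdefault keeps the first hit and ignores later matches
        mapping.setdefault(protein_id, unigene_id)
    return mapping


def map_pairs(pairs, mapping):
    """Translate protein pairs to gene pairs; collect pairs that cannot be mapped."""
    mapped, unmapped = [], []
    for a, b in pairs:
        if a in mapping and b in mapping:
            mapped.append((mapping[a], mapping[b]))
        else:
            unmapped.append((a, b))
    return mapped, unmapped
```

For example, with records `[("sp:Q07812", "Hs.1"), ("sp:Q07812", "Hs.2")]` (placeholder cluster IDs), the protein is assigned to `Hs.1`. Pairs returned in `unmapped` would be the candidates for a second lookup in another resource.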
Pearson's correlation coefficient
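Let $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ be the $n$ pairs of expression levels of two random variables $X$ (expression of the gene encoding the first protein) and $Y$ (expression of the gene encoding the second protein). We wish to measure the degree to which $X$ and $Y$ are linearly dependent as opposed to being independent. The correlation is then defined by
$$r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \, \sum_{i=1}^{n} (Y_i - \bar{Y})^2}} \qquad (1)$$
where $\bar{X}$ and $\bar{Y}$ denote the sample means of $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_n$.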
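A short sketch of Pearson's correlation coefficient for two expression vectors, written out term by term from the standard definition. Returning NaN for a constant vector is our own convention for the example.

```python
import numpy as np


def pearson(x, y):
    """Pearson's correlation coefficient of two equally long vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()               # deviations from the sample mean
    dy = y - y.mean()
    denom = np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
    if denom == 0:                  # undefined if one vector is constant
        return float("nan")
    return float((dx * dy).sum() / denom)
```

The result agrees with `np.corrcoef(x, y)[0, 1]` whenever both vectors vary.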
Mutual information
Mutual information measures the mutual dependence of two variables based on information theory.
Two random variables, $X$ and $Y$, with probability distributions $p_X(x)$ and $p_Y(y)$ and the joint distribution $p_{XY}(x, y)$ are statistically independent if
$$p_{XY}(x, y) = p_X(x) \cdot p_Y(y). \qquad (2)$$
The mutual information
$$I(X; Y) = \sum_{x} \sum_{y} p_{XY}(x, y) \, \log \frac{p_{XY}(x, y)}{p_X(x) \, p_Y(y)} \qquad (3)$$
quantifies the degree of dependence of $X$ and $Y$ using the distance between the joint distribution and the distribution in case of total independence. The mutual information becomes large if $X$ and $Y$ contain the same information.
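For continuous expression levels the distributions have to be estimated before the mutual information can be computed. The sketch below uses equal-width histogram binning and the base-2 logarithm; both the estimator and the number of bins are assumptions made for illustration, since the estimation procedure is not specified here.

```python
import numpy as np


def mutual_information(x, y, bins=5):
    """Histogram-based estimate of the mutual information (in bits)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                 # joint distribution
    px = pxy.sum(axis=1, keepdims=True)       # marginal of X, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)       # marginal of Y, shape (1, bins)
    indep = px @ py                           # product distribution under independence
    nz = pxy > 0                              # empty cells contribute nothing
    return float((pxy[nz] * np.log2(pxy[nz] / indep[nz])).sum())
```

The estimate is zero when the binned joint distribution equals the product of its marginals and grows as the two vectors share more information. It depends on the chosen binning, so values are only comparable for a fixed `bins`.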
Calculating the degree of dependence between expression levels of genes encoding interacting proteins for all datasets
In our analysis we used expression vectors, each containing the expression levels of one gene across all arrays of a dataset extracted from the SMD [19]. For each dataset and each pair of interacting proteins we determined the correlation between the expression vectors of the two encoding genes. For each interaction, the median of the resulting five correlation coefficients (one per dataset) was calculated. We used a permutation approach (with 10 000 permutations) to compare the empirical correlation and mutual information with the corresponding background distributions. In each permutation step we held the expression levels X of one protein fixed and permuted the interaction partners encoded by genes with expression levels Y. Thus each permutation yielded a new interaction dataset with random protein pairs. For each of these datasets we calculated the correlation values and their median as for the original dataset.
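The permutation scheme can be sketched as follows. The code assumes that each dataset is a dictionary mapping a gene identifier to its expression vector and that every gene of every pair is measured in every dataset; both are simplifications, and all function names are hypothetical.

```python
import numpy as np


def pair_correlations(expr, pairs):
    """Correlation of the two expression vectors for each pair in one dataset."""
    return np.array([np.corrcoef(expr[a], expr[b])[0, 1] for a, b in pairs])


def median_over_datasets(datasets, pairs):
    """Per-pair median of the correlations across all datasets."""
    return np.median([pair_correlations(d, pairs) for d in datasets], axis=0)


def permuted_medians(datasets, pairs, n_perm=100, seed=0):
    """Background distribution: first partners fixed, second partners shuffled."""
    rng = np.random.default_rng(seed)
    firsts = [a for a, _ in pairs]
    seconds = [b for _, b in pairs]
    result = []
    for _ in range(n_perm):
        order = rng.permutation(len(seconds))
        random_pairs = [(a, seconds[j]) for a, j in zip(firsts, order)]
        result.append(median_over_datasets(datasets, random_pairs))
    return np.array(result)   # shape (n_perm, number of pairs)
```

The default `n_perm` is kept small for the example; it would be set to 10 000 to match the number of permutations stated above. Replacing `np.corrcoef` by a mutual information estimate gives the analogous background distribution for mutual information.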
We repeated this procedure using mutual information as measure of dependence. The distributions of correlation and mutual information are shown in figure and in figure , respectively.
Calculating p-values for each dataset
As before, for each dataset we determined the correlation between the expression vectors of genes encoding two interacting proteins. We also calculated these values for the permuted datasets (1000 permutations). To get more specific results, we did not use the median of the correlation or mutual information values here but performed the permutation approach for each dataset separately.
Let $n_{\mathrm{perm}}$ denote the number of permutations and $n_{\mathrm{high}}$ the number of permuted correlation (or mutual information) values that are higher than the value in the original dataset. The estimated p-values are then given by
$$p = \frac{n_{\mathrm{high}}}{n_{\mathrm{perm}}}. \qquad (5)$$
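Equation (5) translates directly into code. In this sketch `observed` is the value of the dependence measure in the original dataset and `permuted` holds one value per permutation; how the per-dataset value is summarised over all interacting pairs is left open here and is an assumption of the caller.

```python
import numpy as np


def permutation_p_value(observed, permuted):
    """Fraction of permuted values strictly higher than the observed value."""
    permuted = np.asarray(permuted, dtype=float)
    n_high = int((permuted > observed).sum())
    n_perm = permuted.size
    return n_high / n_perm
```

With 1000 permutations the smallest non-zero p-value that can be estimated in this way is 0.001.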
The corresponding p-values are shown in Table 2.
Table 2. P-values using correlation and mutual information as measures of dependence. The p-values describe the probability of obtaining a higher correlation (third column) or mutual information (fourth column) than the one observed, assuming that the expression ...
GO analysis in each dataset
To further elucidate dependencies between expression levels in the five datasets, we analysed for each dataset whether genes encoding proteins within different GO-classes representing biological processes have correlated expression levels. To this end, using QuickGO [29], we determined for each GO-class describing a biological process which of the 1369 interacting pairs consist of two proteins that both belong to the respective biological process. We used these sets of interacting protein pairs to find biological processes that include protein pairs encoded by genes with highly correlated expression levels. Using a permutation test, we compared the correlations of protein pairs belonging to a certain GO-class with the correlations of protein pairs not belonging to that GO-class. Analogous to the case without differentiation between GO-classes, we can apply equation (5) again to get a p-value for each GO-class in each dataset (figure , , , , ). We did not perform a correction procedure for multiple testing because the tested GO-classes often include very similar or even identical sets of interactions. What is essential in this analysis is the ranking of the GO-classes.
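One possible form of the GO-class permutation test is sketched below. The test statistic (difference between the mean correlation inside and outside the GO-class) and the shuffling of class labels are assumptions chosen for illustration; the text above does not specify them. Both groups must be non-empty.

```python
import numpy as np


def go_class_p_value(correlations, in_class, n_perm=1000, seed=0):
    """Permutation p-value for higher correlations inside a GO-class."""
    r = np.asarray(correlations, dtype=float)
    mask = np.asarray(in_class, dtype=bool)
    rng = np.random.default_rng(seed)

    def stat(m):
        # assumed statistic: mean inside the class minus mean outside
        return r[m].mean() - r[~m].mean()

    observed = stat(mask)
    # shuffle class membership while keeping the class size fixed
    n_high = sum(stat(rng.permutation(mask)) > observed for _ in range(n_perm))
    return n_high / n_perm
```

Sorting the GO-classes by the resulting p-values gives a ranking of the kind described above, without any correction for multiple testing.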