Overview of SIG method
We present the SIG method based on stochastic process model, to examine genome-wide expression profiles from cancer samples at two stages correlated with progression from stage 1 (early stage) to stage 2 (late stage). The SIG method aims to find gene pairs with significant co-expression change between two stages, i.e. gene pairs have high co-expression in one cancer stage, and have little correlation in the other cancer stage.
There are three key steps in the SIG method:
Step 1: Calculating a correlation ratio (CR) to evaluate gene pair co-expression change. We calculate a ratio of z-transformed correlation that reflects the grade to which a gene pair is differentially co-expressed between two cancer stages.
Step 2: Estimating the significance level of CR. For a gene in interest, we estimate the statistical significance (nominal P-value) of the CR formed by another gene with it. Based on a random walk model in stochastic process, we construct an analytical distribution of CRs formed by all other genes with the gene factored into consideration to work as a background for significance assessment. This analytical distribution preserves gene specificity and cancer progression pertinence to better fit complex correlation structure in gene expression data.
Step 3: Adjusting for multiple hypothesis testing. When an entire database of genes is evaluated, to reduce the family wise error we adjust the estimated significance to account for multiple hypothesis testing by using Westfall-Young permutation, which exactly takes advantage of the dependence structure between genes.
Random walk model
The random walk model is a simple case of stochastic process. The basic idea of a random walk can be described by a drunkard's tottering along a street in a stepwise process: he may walk forward or backward stumblingly with the same probability at each step as generally assumed. This drunkard is known to initially depart from the zero position, and the probability of his being at the position x after time τ passed is modelled as follows
is the probability. After differential approximation, the solution of Equation (4) is approached as:
is the probability density as
is a constant as
. This solution represents the cumulative effect of many random events over the time interval τ
. It is the general premise that for a cancer cell at each mutation step, the co-expression of a gene pair varies randomly with the same chance to increase or decrease. Therefore, it is very similar to the random walk carried out in a stochastic form.
Stochastic process-based analytical distribution of correlation ratio
For a gene pair (A, B), we use Pearson correlation coefficient (CC) denoted as r
to measure their co-expression. r
is calculated in both stage 1 and stage 2. Then we transform r
into a metric z
using Fisher's r
transformation as z = ln[(1 + r
)/(1 - r
)]/2. After the transformation, z
is well known to be normally distributed [36
]. We use the z
-transformed correlation coefficient (z
-CC) to measure the co-expression of a gene pair, and in the following text, we name z
-CC as 'correlation' for simplicity.
For a gene A of interest, let X denote the correlations of all other genes with it in one stage, and Y denote the corresponding correlations in the other stage. Since correlation has been z-transformed, both X and Y are normally distributed:
where x and y are the values of correlation, μX, μY and σX, σY are the expectations and standard deviations of X and Y respectively, which are be estimated using standard methods.
We use the ratio of correlation of one stage relative to the other stage for measuring the co-expression change of a gene pair. The ratio of correlation is denoted as T, and calculated as T = X/Y. The analytical distribution of the ratio of correlation is described as follows:
where t is the value of variant T, fXY(x, y) is the joint probability density of X and Y. The co-expression of a gene pair progresses from an initial value in the early stage to a final value in the late stage, thus X and Y are dependent on each other. The joint probability of X and Y should be calculated using the conditional probability, which measures the probability of a correlation transformed to a final value when its initial value is known. Due to the direction that cancer only progressing from stage 1 to stage 2, there are two cases when computing the conditional probability of X and Y during the calculation of analytical distribution.
1) X belongs to stage 2, and Y belongs to stage 1
In this case, the joint probability of a gene pair's correlation values in two cancer stages is represented as:
) is the conditional probability density for a correlation transforming from value y
in stage 1 into value x
in stage 2. For this point, the conditional probability acts as the keystone in our theory. Since the conditional probability is highly correlated with progression between the two stages. Herein, we combine the differential co-expression study with stochastic process model by using the solution of random walk model to approximate the condition probability of correlations. According to random walk model, for a prostate cancer cell, the correlation of a gene pair has equal chance to increase or decrease, therefore G
) has similar cumulative effect with
as shown in Equation (4), and is approximated in the following formula. Please refer to additional file 3
for the detailed derivation.
where d represents the progression distance from stage 1 to stage 2, and D is a constant as defined in Equation (5). For generality, we set d = 1 and D = 1 for the investigations in this paper. The joint probability in Equation (9) is computed as:
Note that x = ty, then by combining Equations (8) and (11), the analytical distribution function for the ratio of correlation of stage 2 relative to stage 1 is presented as:
, and K
is a normalization constant.
A typical curve of fT
) is plotted in Figure , using the prostate cancer data of HS samples versus HR samples [20
] as example. The profile of fT
) is semblable to that of normal distribution, but fT
) has higher density in the center and lower density in the tails. For a given gene, its analytical distribution of the ratio of correlations formed by all other genes pairing this gene, is treated as a background to assess the significance of the observed values of the ratio of correlations. Based on the analytical distribution, gene pairs with absolutely large values of correlation ratio could be identified, i.e. gene pairs with high co-expression in stage 2 and little correlation in stage 1 are regarded as significantly differentially co-expressed.
Figure 7 Analytical distribution function of z-CC ratio between HS stage and HR stage. (a) The analytical distribution fT of z-CC ratio T = X/Y, X represents z-CCs at the HR stage and Y denotes z-CCs at the HS stage. T represents the gene pair co-expression alteration. (more ...)
It should be noted that there is a limitation in our methodology. If the correlation changes from positive to negative or from negative to positive, the corresponding ratio (referred as "P/N ratio") is negative and most of them are not far away from 0. If P/N ratio is negatively large enough, then the corresponding gene pair could be identified. However, there is still a small part of P/N ratios escaping from being identified, and these gene pairs would be missed accordingly.
2) X belongs to stage 1, and Y belongs to stage 2
In this case, the joint probability for a gene pair's correlation values in two cancer stages is represented as:
where G(y|x) is the conditional probability density for the correlation transforming from an initial value x in stage 1 into a final value y in stage 2. After similar derivations, the analytical distribution function fT(t) for the correlation ratio of stage 1 relative to stage 2 is given as
, and K
' is a normalization constant.
There are some essential differences between Equations (12) and (14), mainly manifested by the expressions of a
', and b
'. Detailed derivations for Equations (12) and (14) are demonstrated in the additional file 3
A typical curve for the analytical distribution fT(t) of correlation ratio in Equation (14) is plotted in Figure , also based on the data set of HS vs. HR. Notice that most correlation ratios (stage 1/stage 2) are very close to zero; whereas in Figure , correlation ratios (stage 2/stage 1) scatter more incompactly from 0. This suggests that gene pairs may have more co-expressions in stage 2 (HR) than in stage 1 (HS); in other words, it implies that the number of gene-gene interactions, either activation or repression, increases remarkably during prostate cancer evolution.
Estimating significance of gene pair differential co-expression
For a gene of interest, since the correlation ratio (CR) is calculated as a quantification of evidence for differential co-expression, a cut-off for identification is needed by using a false discovery rate criterion, or in other words, the significance level. This process involves calculating the null hypothesis distribution of CR that contains little co-expression change. The profile character of the analytical distribution function fT(t) has the properties similar to null hypothesis, and also keeps information about gene specificity. In our methodology, the calculation of null distribution is achieved through an approximation of the analytical distribution of the interested gene. Importantly, this approximation preserves gene specificity and progression character, in order to make gene pair identification without systematic derivation. Therefore, it provides a more biologically reasonable assessment of significance than that obtained by the traditional simulation data.
Therefore, we calculate the nominal P-value of a correlation ratio for significance estimation, based on the analytical distribution fT(t) of the gene in interest. The calculation is carried out by the area integral of distribution function.
By setting a false discovery rate α
to the nominal P-value, in principle, gene pair co-expression change can be determined as whether it is significant or not. However, the incidence of false positives for the total data, referred as 'family wise error rate', would be high if we do not adjust individual nominal P-values. We use the Westfall-Young Permutation [27
] to do the multiple hypothesis testing to correct for false positive occurrence, because this is the only correction for taking advantage of dependence structure between genes, and accounts for gene co-regulation (review [37
]). In the SIG method, for a given gene, Westfall-Young permutation is employed to calculate adjusted P-value for the pair formed by another gene with the given gene. The details of this multiple hypothesis testing are demonstrated in additional file 3
Identifying gene pair with differential co-expression pattern
We propose the following procedure for identifying differential gene pair co-expression patterns in different biological stages. For a gene A, this procedure screens all the other genes B and selects genes that form differentially co-expressed pair with it. For a gene pair (A, B), two analytical distributions
may be not the same. To achieve consistency, a pair of genes will be considered to be detected only if their CR-based adjusted P-values in each analytical distributions are both greater than a threshold value α
Taking together, the procedure of determining a gene pair (A, B) with significant co-expression pattern change is described as follows:
1) Calculate the correlation ratio tAB for gene pair (A, B).
2) Figure out the analytical distributions
3) For gene A chosen, calculate the adjusted P-value p* for tAB, if p* <α, regard gene B as differentially co-expressed with gene A.
4) For gene B, repeat step 3).
5) If gene A and gene B both have differential co-expression with each other, identify them as a significant differentially co-expressed pair.
In this paper, we set P-value threshold α = 0.05.