|Home | About | Journals | Submit | Contact Us | Français|
We propose a distribution-free approach to detect nonlinear relationships by reporting local correlation. The effect of our proposed method is analogous to piece-wise linear approximation although the method does not utilize any linear dependency. The proposed metric, maximum local correlation, was applied to both simulated cases and expression microarray data comparing the rd mouse with age-matched control animals. The rd mouse is an animal model (with a mutation for the gene Pde6b) for photoreceptor degeneration. Using simulated data, we show that maximum local correlation detects nonlinear association, which could not be detected using other correlation measures. In the microarray study, our proposed method detects nonlinear association between the expression levels of different genes, which could not be detected using the conventional linear methods. The simulation dataset, microarray expression data, and the Nonparametric Nonlinear Correlation (NNC) software library, implemented in Matlab, are included as part of the online supplemental materials.
Although nonlinear relationships between the expression levels of genes or gene products are expected (Kitano, 2002a,b) and observed (Rohrer et al., 2004) in biological datasets, there is no commonly used statistic quantifying nonlinear correlation that can find a similarly generic use as Pearson's correlation coefficient for quantifying linear correlation. Many analyses of complex biological phenomena still approach the biological question of correlation using global linear association only. Mutual information has been used to detect nonlinear correlation in microarray expression dataset, implicitly (Kasturi and Acharya, 2005) and explicitly (Steuer et al., 2002; Daub et al., 2004). However, different estimation methods of the mutual information lead to different conclusions. We propose a method to quantify global nonlinear relationship by reporting local correlation. Local here refers to the correlation that exists either at a certain scale or under certain condition(s), and is therefore also referred to as transient correlation. An example is illustrated in Figure 1 that includes partial signals that are positively correlated and partial signals that are negatively correlated. However, neither linear dependency is assumed nor used in our method. That is, the association need not be linear to be detected at either the local or global scale. Our proposed method quantifies correlation by examination of bivariate distances between data points and was inspired by the use of the correlation integral in a fluctuating dynamic system (Grassberger and Procaccia, 1983).
The correlation integral examines the cumulative distribution of distances between data points of a time series. With proper modification, we show that the correlation integral can be used to estimate global patterns and global association. We develop statistics using the correlation integral to estimate local correlation. We only focus on bivariate correlation.
The need for quantification of nonlinear relationships is particularly acute for expression microarray data, where a massive number of variables and a wide range of biological processes involved in a typical experiment represent a particularly convoluted version of the proverbial “search for a needle in a haystack”. Aside from the nonlinearity, several additional challenges (Quackenbush, 2001) are commonly found in high-throughput biological datasets: 1) different datasets have different scales, 2) the scale that is potentially biologically relevant is often unknown at the exploratory stage of the research, 3) high noise levels (Marshall, 2004), and 4) the sampling distribution is generally unknown, very seldom if ever is the data normally distributed. Multi-modality is not uncommon. We propose a generic method to quantify nonlinear correlation by reporting local correlation, with the option of removal of the scaling effects between different measurements, which will 1) detect association at multiple scales, 2) be insensitive to noise, and 3) not rely on distribution assumptions, i.e., a nonparametric method.
In Section 2, the correlation integral is introduced and we define the proposed measures of local correlation and of correlation change. Local correlation measures are used to describe the relationships across experimental units, genes in the case of our motivating application. On the other hand, correlation change allows us to identify experimental units whose relationships differ across two conditions, in our case gene expression in a treatment group versus a control group. In Section 3 we validate our method on simulated cases. Finally, in Section 4, we use the proposed measures to quantify nonlinear association change in microarray expression between a treatment group and a control group in an animal experiment. The treatment group are mice exhibiting photoreceptor degeneration (rd) and the control group are wild type mice (wt). The rd mouse is also referred to as the rd1 mouse in the literature. The generality of the proposed method makes it appropriate to many other types of data, such as those generated by proteomics or metabolomics.
We first describe the proposed summary of bivariate local correlation in words. Formal definition of the proposed method and the notations will follow in the next few paragraphs. First, each variable is transformed such that the marginal distribution is uniform. This is achieved by transforming to ranks (in ascending order) followed by a linear transformation. Let N denote the sample size. The linear transformation is achieved by subtracting the minimum rank (i.e., 1, in the absence of ties) from the ranks and then divided by the difference between maximum rank and minimum rank (i.e., N – 1, in the absence of ties). The rank transformation is the default setting in our implementation. Alternatively, this preprocessing step can be omitted if the raw data are already on comparable and non-arbitrary scales. Next we evaluate the neighbor density, which records the rate of change of the number of observations within a neighborhood of a given radius. We then compare the observed neighbor density with the neighbor density under the null hypothesis of no linear or nonlinear association. The difference defines the proposed measure of local correlation. Finally, we define maximum local correlation to describe overall bivariate nonlinear correlation.
The definition of the proposed method is based on the concept of correlation integrals. Consider a time series, zi, i = 1, . . . , N. The correlation integral I(r) is defined as (Grass-berger and Procaccia, 1983):
The correlation integral quantifies the average cumulative number of neighbors within radius r. The definition remains meaningful also when the data are not a time series.
To develop a measure of association between vectors, x and y, we modify the definition of I(r) as follows. Let zi = (xi, yi), i = 1, . . . , N, denote the observations in the dataset and let |zi – zj| denote Euclidean distance. We define . The observed distances are further linearly transformed between 0 and 1 before quantifying Î. Note that Î has the property of a cumulative distribution function (cdf). It is nondecreasing from 0 to 1 and continuous from the right. The function Î(r) describes the global pattern of neighboring distances.
Our primary interest is the definition of a metric to quantify nonlinear association by reporting local patterns. Therefore the neighbor density D is devised as the derivative of Î:
where ΔÎ(r) denotes change in Î(r). The observed neighbor density is evaluated at the discrete radius r, where r = 0, 1/m, 2/m, . . . , 1, and m is an arbitrary grid size that determines Δr = 1/m. An automatic smoother using cross-validation to choose an optimal window size (Vilela et al., 2007) is applied to smooth . Any traditional smoothing algorithms with proper choice of smoothing window size could alternatively be used here. In our study, the default size m is set as N, the number of observations. The statistic is a discrete approximation of dÎ(r)/dr, which has the formal properties of a probability density function (pdf). Therefore, with a slight abuse of terminology we refer to as a distribution. An example of a correlation integral and a neighbor density is illustrated in Fig. 2.
Intuitively, the distances between data points between two correlated variables would differ from that between two uncorrelated variables. Let denote the estimate of a null distribution, which is composed of two vectors without any bassociation. We define local correlation as the deviation of from that of the null distribution at a given neighboring distance r:
Our approach does not assume any parametric distribution. The flexibility of this method makes it easy to change the null distribution to any distribution of interest. The null hypothesis Ho of no difference between the observed neighbor density and is for r = 0, 1/m, . . . , 1. Let , b = 1, 2, . . . , B, denote permutation replicates (see Appendix A for details) and let be corresponding permutation evaluations of . A two-sided p-value can be evaluated by (Efron and Tibshirani, 1993):
where r = 0, 1/m, . . . , 1. Multiplicity correction could be carried out to adjust for the m tests performed. This could be done through the control of either false discovery rate (FDR) or experiment-wide error. In the following sections, we make use of these two alternative approaches for different aspects of the inferences.
Local correlation is further used for the definition of a summary statistic, maximum local correlation M, to describe overall nonlinear association. Using an idea similar to the Kolmogorov-Smirnov statistic, maximum local correlation is defined, as its name implies, by
The interpretation of as the difference of two pdf's implies that M can be interpreted as distance under the supremum norm of D and Do. In other words, we define the statistic M as the maximum deviation between two underlying neighbor densities. Statistical significance of M is assessed in a permutation procedure similar to (2). See Appendix B for details.
Recall that the motivating application for the development of the local correlation measures and M is an application for microarray data analysis. We will report details of the experiment and the data later, in Section 4. The setup is such that local correlation between responses for gene i and responses for gene j can be interpreted as measuring the relationship between genes i and j. Like many microarray experiments, the data includes measurements under two biologic conditions, wt and rd. The ultimate inference goal is to understand how the two different biologic conditions affect the gene functions. We formalize this inference goal by considering the change in local correlation between the two conditions.
Building upon maximal local correlation M, we propose a statistic δM to measure association change. The motivation is that nonlinear behavior in biology can reflect the fact that the molecular machinery that is underlying biological processes is reconfigured to changes occurring in physiological or diseased states. In the microarray data example, for each pair of genes (ij) we want to test whether there is a change of nonlinear correlation between rd and wt mice. We therefore define a statistic δM to identify changes in maximal local correlation. We will later use it to identify critical genes in the disease development process. The statistic δM is defined as
The results using our proposed methods are compared to those obtained using existing methods, including Pearson's linear correlation (C), Spearman's rank correlation coefficient, and mutual information (MI) as a nonlinear approach. Mutual information is evaluated using the R function mutual_information2() (Daub et al., 2004). The default setting was used for the R function. See Appendix C for a definition of MI. MI is a summary of (nonlinear) association between x and y. We use it for a comparison of various performance summaries in the examples. No details are needed for the upcoming discussion. See Daub et al. (2004) for more details.
The statistical significance for C and MI is estimated using the same permutation procedures as for M to ensure comparability. See (A.1) and (A.2) in Appendix A. Similar to (3) we define
Permuatation tests for δS are defined in (B.1) in Appendix B.
Twelve cases representing linear or different non-linear relationships were considered (Fig. 3). Case 1 is composed of independent vectors x and y. The observations of x and y in Case 2 have a perfect linear relationship. Case 3 is composed of 7 non-decreasing broken straight lines while Case 4 has 8 broken straight lines. Case 5 is a continuous monotonically increasing curve and Case 6 has 3 broken monotonically increasing curves; Case 7 is composed of three segments of sine waves. Case 8 is a mixture distribution of Cases 1 and 4 while Case 9 is a mixture of Cases 1 and 7. The 3 clusters in Case 10 are randomly sampled 100 data points from dataset 1 from Jonnalagadda and Srinivasan (2004). Case 11 is a mixture distribution of Cases 10 and 1, that is, a mixture of 3 clusters along with some background noise. Two-fifths of the data points in Case 11 are randomly sampled from Case 1 while 3/5 of the data are randomly sampled from Case 10. Case 12 includes five clusters with 3/5 of the data points randomly sampled from Case 10 and the remaining 2/5 of additional clusters (Fig. 3). The dataset (Jonnalagadda and Srinivasan, 2004) and the Matlab scripts to generate the simulated cases are included as supplemental materials. We evaluate and M for each case. For each case with stochastic components, i.e., Cases 1 and 8 through 12, we simulated 100 realizations to estimate positive rates of testing for significant maximum local correlation M under repeat simulation. For Case 1 the (false) positive rate is a type I error. For cases 8 through 12 the (true) positive rate is interpreted as statistical power. The results are summarized in Table 1. Maximum local correlation M was computed for the raw data values, without rank transformation, to keep results comparable with other correlation methods (Pearson's correlation, Spearman's correlation, and mutual information, which will be described later).
Fig. 4 shows the local correlations from one simulation for each of the 12 cases. Significant local correlations were found in all simulated cases, except (correctly) for Case 1. Here, significance is determined after a Bonferroni correction .
Positive local correlation means that the observed neighbor density is greater than the neighbor density under the null hypothesis when evaluated at radius r. Negative local correlation means that the observed neighbor density is lower than the neighbor density under the null hypothesis. Significant local correlations are detected at almost all scales (of radius r) in Case 2, which is a boundary condition with perfectly linear relationship. The profile of peaks near 0 and decreases along r. Observed multi-modal for all cases except Case 1 showed significant correlation at multiple scales. Statistically significant local correlation was observed for Case 11 although it contains both, the signals of Case 10 and the noise from Case 1.
Maximal local correlation is a summary statistic to test global nonlinearity while reports local correlation as a function of the neighborhood width of interest r. All cases with only deterministic components, i.e., Cases 2 to 7, had statistically significant maximum local correlation M at significance level α = 0.05 (Table 1). The stochastic component in the remaining Cases 1 and 8 through 12 allows us to evaluate positive rates across repeat experimentation. Except for Case 1, we (correctly) detect significant maximal local correlation for all 100 repeat simulations performed for each of these cases. That is, 100% true positive rates for Cases 8 through 12. Positive rate for these cases is interpreted as power. The (false) positive rate for maximal local correlation for Case 1 is 3% (Table 1). It is interpreted as type I error.
Pearson's (C), Spearman's (Sp), maximum local correlation (M), and mutual information (MI) for the 12 simulated cases are compared in Table 1. The results of Pearson's and Spearman's correlations showed that the majority of cases have significant global linear correlation, i.e., downward or upward trends, except for Case 1. When the global up or downward trend is weak, i.e., Cases 8 and 9, the statistical power for Pearson and Spearman's correlation is lower than that of M and MI. The performance of M and MI are comparable.
We analyze microarray expression data that was collected to identify critical genes in photoreceptor degeneration (Rohrer et al., 2004). The samples were collected from age-matched wild type controls (C57BL/6; abbreviated as wt) and rd mice with rod degeneration at 5 postnatal time points (6, 10, 14, 17, and 21 days of age). Retinas from 4 animals per genotype per time point were pooled, and biological duplicates were obtained. Each probe is treated as an independent unit (although some genes have more than one probe on the array). Gene expression data was filtered based on reproducibility for the original analysis, leaving 181 genes for the analysis.
Correlation change (δM, δC and δMI) was evaluated based on the Euclidean distances of expression profiles between wt and rd mice. We proceeded as follows. Average expression values of the duplicates were calculated for wt and rd mice, respectively. Next, distance matrices, and , were generated for the wt and rd mouse, respectively, with the (i, j) entry denoting the Euclidean distance between the i-th and the j-th genes using the data from the time course (Fig. 5). For each pair of the genes (i, j) we evaluated M, C, and MI between the i-th and j-th columns of the distance matrix, resulting in a total of 16,290 pair-wise correlations pairs, one for wt and one for rd. The correlation is measuring the relationship between how gene i interacts with all the other genes vs. how gene j interacts with all the other genes. Finally, correlation change between wt and rd mice, δS (S = M, C, MI), was evaluated using (4). The values of association change based on Spearman's rank correlation coefficient and Pearson's correlation coefficient are almost identical. Also, these two methods had very similar performance on 12 simulated cases. Therefore, δSpearman is not included in the results for the expression data.
Plotting Pearson's correlation C against M for each of the pair-wise correlations, results in a W shaped relationship (Fig. 6). That is, when Pearson correlation C is low, maximal local correlation M could be high, the portion reflected by the middle of the W. This phenomenon, however, is not vice versa. In addition, two sides of the W indicate that when C is high, the value of M is also high as well. In other words, this W shaped relationship indicates that maximum local correlation can capture local correlation observed in the biological dataset, which can not be detected using Pearson's correlation C. Figure 1 shows an example of correlation between two genes that can only be detected by nonlinear correlation measures (M = 0.93, p = 0.007; MI = 1.15, p = 0.01) but not by linear correlation (C = –0.08, p = 0.88; S = –0.21, p = 0.67). This is due to the fact that signals are partially positively and partially negatively correlated (e.g., Fig. 1). The agreement between M and MI is high across all pairs of genes (C = 0.81, p ≈ 0).
Association changes (δM, δC and δMI) between wt and rd animals were estimated for each pair of genes. Their p-values were estimated using (B.1) in Appendix B and corresponding q-values were estimated (Storey and Tibshirani, 2003). A total of 75 genes with at least one significant association change was detected using δC when controlling FDR at 0.05. In a similar manner, 137 genes with at least one significant association change were detected using δM, and 67 genes were detected using δMI when controlling FDR for each statistic at 0.05. The ranked importance of the candidate genes involved in rod degeneration is based on the frequency of association change with statistical significance. Some association changes are detected by both δM and δC statistics, but not all (Fig. 7a). The linear correlation between δM and δC is weak but significant (C = 0.03, p ≈ 4.4 · 10–5). Out of the selected candidate genes with significant association changes between control and disease states using δM and δC, 62 of the probes were selected by both statistics, 75 of them could only be detected using δM whereas 13 of them were only detected using δC. The overall agreement between δM and δMI, in contrast, is high (C = 0.74, p ≈ 0), especially for large values (Fig. 7b). This is consistent with the earlier observation that M and MI have relatively high concordance. When controlling FDR at 0.05 for each statistic, δM can detect more association change (170 changes) between rd and wt mouse with statistical significance than those detected by δMI (58 changes; Fig. 7b). Fig. 8 shows an example of an association change between genes, and this association change can only be detected using δM (δM = 1.22, p ≈ 0), but not by δC (δC = 0.47, p = 0.32) nor by δMI (δMI = 1.32, p = 0.001). In the wt mouse, the correlation between the two genes is linear, but the association becomes a nonlinear (partially positively and partially negatively) correlation in the rd mouse.
Given the current status of the literature on photoreceptor degeneration, much of the physiology still remains unclear. Knowing this, the results for δ statistics were evaluated through the following means. First, based on the fact that photoreceptor cell death in the rd mouse retina is caused by a mutation in Pde6b, leading to the complete loss of Pde6b mRNA expression, it is expected that a successful method will identify Pde6b as a high-ranked gene; i.e., Pde6b serves as a positive control. Indeed, Pde6b is ranked as the number 1 gene using either of the three statistics (Table 2). Second, the ultimate goal is to identify marker genes that are able to correctly recognize physiological processes (i.e., photoreceptor degeneration) and to identify key regulatory genes involved in this process. Thus, criteria for genes that accurately reflect the biological process might be (a) sharing similar expression profiles with genes involved in photoreceptor development, and (b) the involvement of their gene products in processes such as cellular stress response, cellular homeostasis, or others. These key regulatory genes might encode for transcription factors, growth factors or their receptors, anti-apoptotic genes, etc. To identify genes induced during the period of photoreceptor development, Pde6b was used as the bait gene to identify the 180 neighboring genes whose expression profiles cluster with Pde6b during development (http://cepko.med.harvard.edu/default.asp) (Black-shaw et al., 2001). Twenty-two of these neighboring genes were in the list of genes analyzed in this study. We then compared these genes against the candidates selected by each δ statistic. Among those, δC identified 9 of them, δMI identified 11, whereas δM identified 21 of the 22 possible genes, suggesting that δM was able to identify more photoreceptor-specific genes. The known biological roles of the top ranked genes for each method are summarized in Table 3. Interestingly, δM identified 3 transcription factors known to be involved in neuronal degeneration and cellular stress responses, whereas δC and δMI identified only 1 each, again suggesting that δM could identify more regulatory genes. To further evaluate the performance of the developed statistic, a gene regulatory network was constructed for the 181 analyzed genes using transcription factor binding sites to establish node connectivity (see Appendix D for details). Briefly, the top ranked genes generated by each of the statistics (Table 2) were used as “seed genes” to find a subnetwork of pairwise shortest paths in order to summarize the genes’ relationships. Each set of top ranked genes generated by each statistic (Table 2) was used to establish each of the subnetworks, subnetwork-M, subnetwork-MI, and subnetwork-C, depicted in Fig. 9. Let AM, AMI, and AC denote the sets of “discovered genes”, discovered independently of their expression levels, in the three subnetworks. The underlying assumption is that the discovered genes in AS (S = M, MI, C) may also play important roles as the “seed genes” since they are likely regulated by the same transcription factors. We asked whether these genes in A are also recognized by the δ statistics, and how do they compare to each other. Almost all of the genes in AM, AMI, and AC are also identified as candidate genes by δM (Fig. 9), with a few exceptions, Thrsp, Nefl, and Mt1. Among those, Thrsp and Nefl were not identified by any of statistics. Thus, the resulting “discovered” genes are in high agreement with the listed genes selected by no matter which set of “seeds” we started with. This is not the case for δMI and δC. There is at least one or more genes that can not be identified by δMI and δC in each of the 3 subnetworks.
Is the subnetwork-M (Fig. 9) plausible for further experimental assessment? c-Fos, a transcription factor, is known to be induced in the rd rods prior to degeneration (Rich et al., 1997), but was not essential for rod degeneration in this particular mouse model (Hafezi et al., 1997). Itm2C, a relatively unknown gene, interacts with the beta-site APP-cleaving enzyme 1, a protease important in the pathogenesis of Alzheimer's disease (Wickham et al., 2005). It is plausible that over-expression of this protein might be involved in rod degeneration. Dp1l1 is a protein presumably associated with membrane trafficking (Sato et al., 2005). Loss of Dp1l1 expression may be a reflection of cell loss, or alternatively suggest that disrupted cellular homeostasis in the rd rods leads to a secondary defect in membrane trafficking. Follow-up experiments are required to test the potential roles of these genes in the regulation of rod degeneration.
In Section 3, we demonstrated that our proposed metrics, local correlation (; Fig. 4) and maximum local correlation M (Table 1), can quantify generic nonlinear associations in all simulated cases that have been designed with some degree of association. This is in contrast to Pearson's or Spearman's correlations, which were not able to detect significant association when the data was not linear in nature.
In Section 4, we applied the proposed method to identify genes related to rod degeneration. The data were microarray expression data of healthy and mutant animals. All three δ statistics, δC, δM, and δMI, correctly identified Pde6b, the defective gene in the rd mouse, as the most central gene in the degeneration process, validating our method. Furthermore, δM selected more genes involved in photoreceptor development (the presumed inverse of photoreceptor degeneration), and more plausible key regulatory genes involved in the process of cell death when compared to δC, which selected genes whose expression appears changed as a response to death (Table 3). The discordant results between nonlinear and linear methods (Fig. 7a) as well as the concordant findings between δM and δMI (Fig. 7b) further emphasize the importance of the development of statistics to measure nonlinear association instead of global linear patterns. In addition, δM has higher statistical power, and can detect more local correlation change than δMI when controlling FDR at the same level (Fig. 7b). Our method is not distributional based, and is instead an omnibus approach for detecting nonlinear correlations.
The challenges of nonlinearity and small sample size for the massive amount of data generated by modern high-throughput methods set the stage for the study described here. We have extended the use of correlation integrals to detect nonlinear correlation between any two variables reporting transient association. The proposed method is shown to have higher statistical power when compared with the reference use of mutual information and Pearson's correlation. Being distribution-free makes this tool applicable to a wide variety of problems. Although we only applied this approach to expression data in this study, the method can be applied to many other data, such as those generated by proteomics and metabolomics. In conclusion, the development of novel correlation methods that cope with the characteristic transient nonlinearity of biological dependencies, as assessed by the pairwise comparison study reported here, holds great promise.
The results of our study suggest a number of other avenues for future research. Our current work only focused on bivariate correlation. A natural extension would be the development of multivariate local correlations. Another extension would be to identify local group membership using the significant local correlation at particular scales of interest. Improvement of computational e ciency could also be a topic itself in the future since the computational complexity is O(N2) as it stands currently. More details on the computational complexity could be found in Appendix E. The relationship between linear and nonlinear association changes (Fig. 7) indicates that using both correlations yield more information. This also leaves open the question of the relationship between these statistics or the development of a new metric.
Funding was provided by NIH grants EY-13520, a Core Grant EY-14793, and an unrestricted grant to MUSC from Research to Prevent Blindness, Inc., New York. YAC was also supported by a National Cancer Institute training grant (CA90301). JSA acknowledges funding from the National Heart, Lung and Blood Institute (contract number N01-HV-28181). Doctoral studies of AJR are supported by a NLM training grant. Carroll's research was supported by a grant from the National Cancer Institute (CA57030). BR is a Research to Prevent Blindness Olga Keith Weiss Scholar. The authors appreciate the comments from Dr. Francisco Pinto. We thank Katie Hulse for preparation of the microarray dataset. Finally, we would also like to thank the suggestions from the anonymous reviewers.
The supplemental materials are available in a single zip file, which contains a readme file with a detailed explanation of its contents. The major items are listed as follows.
Appendix: Appendices A-E are included as supplemental materials.
NNC toolbox: The Nonparametric Nonlinear Correlation (NNC) software library, implemented in Matlab. Example Matlab scripts to call the functions are also included.
Data: Simulation dataset, the Matlab scripts to simulate the data, and the microarray expression data used in this study.