
Proc Int Conf Data Eng. Author manuscript; available in PMC 2010 April 22.

Published in final edited form as:

Proc Int Conf Data Eng. 2008 April 25; 24: 130–139.

doi: 10.1109/ICDE.2008.4497421

PMCID: PMC2858406

NIHMSID: NIHMS132003

Department of Computer Science University of North Carolina at Chapel Hill Chapel Hill, NC 27599, USA


Finding latent patterns in high dimensional data is an important research problem with numerous applications. Existing approaches can be summarized into three categories: feature selection, feature transformation (or feature projection) and projected clustering. Being widely used in many applications, these methods aim to capture global patterns and are typically performed in the full feature space. In many emerging biomedical applications, however, scientists are interested in the local latent patterns held by feature subsets, which may be invisible via any global transformation. In this paper, we investigate the problem of finding local linear correlations in high dimensional data. Our goal is to find the latent pattern structures that may exist only in some subspaces. We formalize this problem as finding strongly correlated feature subsets which are supported by a large portion of the data points. Due to the combinatorial nature of the problem and the lack of monotonicity of the correlation measurement, it is prohibitively expensive to exhaustively explore the whole search space. In our algorithm, CARE, we utilize spectrum properties and effective heuristics to prune the search space. Extensive experimental results show that our approach is effective in finding local linear correlations that may not be identified by existing methods.

Many real life applications involve the analysis of high dimensional data. For example, in bio-medical domains, advanced microarray techniques [1], [2] make it possible to monitor the expression levels of hundreds to thousands of genes simultaneously. By mapping each gene to a feature, gene expression data can be represented by points in a high dimensional feature space. To make sense of such high dimensional data, extensive research has been done in finding the latent structure among the large number of features. In general, existing approaches to analyzing high dimensional data can be summarized into three categories [3]: feature selection, feature transformation (or feature projection) and projected clustering.

The goal of feature selection methods [4], [5], [6], [7] is to find a single representative subset of features that are most relevant for the task at hand, such as classification. The selected features generally have low correlation with each other but have strong correlation with the target feature.

Feature transformation methods summarize the dataset by creating linear combinations of features in order to uncover the latent structure. The commonly used feature transformation methods include principal component analysis (PCA) [8], linear discriminant analysis (LDA), and their variants (see [9] for an overview). PCA is one of the most widely used feature transformation methods. It seeks an optimal linear transformation of the original feature space such that most variance in the data is represented by a small number of orthogonal derived features in the transformed space. PCA performs one and the same feature transformation on the entire dataset. It aims to model the global latent structure of the data and hence does not separate the impact of any original features nor identify local latent patterns in some feature subspaces.

Recently proposed projected clustering methods, such as [10], [11], [12], can be viewed as combinations of clustering algorithms and PCA. These methods can be applied to find clusters of data points that may not exist in the axis parallel subspaces but only exist in the projected subspaces. The projected subspaces are usually found by applying the standard PCA in the full dimensional space. Like other clustering methods, projected clustering algorithms find the clusters of points that are spatially close to each other in the projected space. However, a subset of features can be strongly correlated even though the data points do not form any clustering structure.

PCA is an effective way to determine whether a set of features, *F* = {**x**_{i1}, · · · , **x**_{in}}, shows strong correlation [8]. The general idea is as follows. If the features in *F* are indeed strongly correlated, then a few eigenvectors of the covariance matrix with the largest eigenvalues describe most of the variance in the data, and only a small amount of variance is represented by the remaining eigenvectors. The variance along each eigenvector is its corresponding eigenvalue of the covariance matrix *C*_{F} of *F*.

In many real life applications, however, it is desirable to find the *subsets of features* having strong linear correlations. For example, in gene expression data analysis, a group of genes having strong linear correlation is of high interest to biologists, since it helps to infer unknown functions of genes and gives rise to hypotheses regarding the mechanism of the transcriptional regulatory network [1], [2]. We refer to such correlation among a subset of features in the dataset as a *local linear correlation*, in contrast to the global correlation found by the full dimensional feature transformation methods.

For example, Figure 1 shows a dataset consisting of 9 features and 15 data points. Among the 9 features, {**x**_{2}, **x**_{7}, **x**_{9}} have local linear correlation 2**x**_{2} + 6**x**_{7} + 3**x**_{9} = 0 on point set {**p**_{1}, **p**_{2}, · · · , **p**_{9}}, and {**x**_{1}, **x**_{5}, **x**_{6}, **x**_{8}} have local linear correlation **x**_{1} + 3**x**_{5} + 2**x**_{6} + 5**x**_{8} = 0 on point set {**p**_{7}, **p**_{8}, · · · , **p**_{15}}, with i.i.d. Gaussian noise of mean 0 and variance 0.01. The eigenvalues of the example dataset are shown in Figure 2.

Figure 2 tells us that the features in the example dataset are somehow correlated, since the smallest eigenvalues are much smaller than the largest ones.

To get the linear correlation identified by PCA, we can apply the following approach, which has been adopted in [12] to derive quantitative descriptions for projected clusters. As a basic concept of linear algebra, a hyperplane is a subspace of co-dimension 1 [8]. Each vector **a** = [*a*_{1}, *a*_{2}, · · · , *a*_{n}]^{T} determines a hyperplane *a*_{1}**x**_{1} + *a*_{2}**x**_{2} + · · · + *a*_{n}**x**_{n} = 0. In particular, each eigenvector with a small eigenvalue determines a hyperplane on which the data points approximately lie, i.e., a linear correlation.

Using the example dataset, Table I shows the hyperplanes (linear correlations) determined by the 3 eigenvectors with the smallest eigenvalues. Clearly, none of them captures the embedded correlations. This is because PCA does not separate the impact of different feature subsets that are correlated on different subsets of points.
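This step can be sketched in a few lines. The snippet below is a minimal illustration in Python/NumPy (not the paper's Matlab implementation; `correlation_hyperplane` is our name): the coefficient vector of the fitted hyperplane is the eigenvector of the covariance matrix with the smallest eigenvalue.

```python
import numpy as np

def correlation_hyperplane(F):
    """Coefficient vector a of the fitted hyperplane a1*x1 + ... + an*xn = const:
    the eigenvector of the covariance matrix of F with the smallest eigenvalue,
    i.e. the direction of least variance."""
    eigvals, eigvecs = np.linalg.eigh(np.cov(F, rowvar=False))  # ascending order
    return eigvecs[:, 0]

# Data satisfying 2*x + 6*y + 3*z = 0 exactly
rng = np.random.default_rng(1)
x, y = rng.normal(size=200), rng.normal(size=200)
F = np.column_stack([x, y, -(2 * x + 6 * y) / 3])
coef = correlation_hyperplane(F)
print(np.round(coef / coef[0] * 2, 1))   # [2. 6. 3.], up to scale
```

On data that satisfies a single linear correlation, the recovered coefficients match the embedded ones up to scale; on the mixed dataset of Figure 1, as the text notes, the full-data eigenvectors blend the two embedded correlations.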

Recently, many methods [1], [13] have been proposed for finding clusters of features that are pair-wisely correlated. However, a set of features may have strong correlation even though each pair of features is only weakly correlated.

For example, Figure 4 shows 4 genes that are strongly correlated in the mouse gene expression data collected by the biologists in the School of Public Health at UNC. All 4 genes share the same Gene Ontology (GO) [14] annotation *cell part*, and three of them, Myh7, Hist1h2bk, and Arntl, share the same GO annotation *intracellular part*. The linear relationship identified by our algorithm is −0.4(*Nrg*4) + 0.1(*Myh*7) + 0.7(*Hist*1*h*2*bk*) − 0.5(*Arntl*) = 0. As we can see from the figure, all data points lie almost perfectly on the same hyperplane, which shows that the 4 genes are strongly correlated. (In order to visualize this 3-dimensional hyperplane, we combine two features, Nrg4 and Myh7, into a single axis as −0.4(*Nrg*4) + 0.1(*Myh*7) to reduce it to a 2-dimensional hyperplane.) If we project the hyperplane onto the 2-dimensional spaces formed by each pair of genes, we find that none of them shows strong correlation, as depicted in Figures 5(a) to 5(c).

Projected clustering algorithms [10], [11] have been proposed to find the clusters of data points in projected feature spaces. This is driven by the observation that clusters may exist in arbitrarily oriented subspaces. Like other clustering methods, these methods tend to find the clusters of points that are spatially close to each other in the feature space. However, as shown in Figure 4, a subset of features (genes in this example) can still be strongly correlated even if the data points are far away from each other. This property makes such strong correlations invisible to the projected clustering methods. Moreover, to find the projections of original features, projected clustering methods apply PCA in the full dimensional space. Therefore they cannot decouple the local correlations hidden in the high dimensional data.

In order to find the local linear correlations, a straightforward approach is to apply PCA to all possible subsets of features to see if they are strongly correlated. This approach is infeasible due to the large number of possible feature combinations. For example, given a 100-dimensional dataset, the number of feature subsets that need to be checked is 2^{100}.

Real life datasets often contain noise and outliers. Therefore, a feature subset may be correlated only on a subset of the data points. To handle this situation, it is reasonable to allow the algorithm to find the local linear correlations on a large portion of the data points. This makes the problem even harder, since for a fixed subset of features, adding (or deleting) data points can either increase or decrease the correlation among them. More details about the computational challenges of finding local linear correlations can be found in Section III.

In this paper, we investigate the problem of finding local linear correlations in high dimensional data. This problem is formalized as finding *strongly correlated feature subsets*. Such feature subsets show strong linear correlations on a large portion of the data points. We examine the computational challenges of the problem and develop an efficient algorithm, CARE, to solve it.

The goal of feature transformation (or projection) methods, such as PCA, is to find linear combinations of the original features in order to uncover the latent structures hidden in the data. Feature transformation methods can be further divided into supervised and unsupervised methods. Principal component analysis (PCA) is a representative unsupervised projection method. PCA finds the eigenvectors representing the directions of maximal variance in the data by performing singular value decomposition (SVD) on the data matrix [8]. Supervised methods take the target feature into consideration. Existing supervised methods include linear regression analysis (LRA) [15], linear discriminant analysis (LDA) [16], principal component regression (PCR) [17], supervised probabilistic PCA (SPPCA) [18] and many others [9]. LRA and LDA find the linear combinations of the input (predictor) features which best explain the target (dependent) feature. In these methods, the input features are generally assumed to be non-redundant, i.e., linearly independent. If there are correlations among the input features, PCR first applies PCA to the input features; the principal components are then used as predictors in standard LRA. SPPCA extends PCA to incorporate label information. All of these feature transformation methods perform a single transformation on the entire dataset. They do not separate the impact of any original features, nor do they identify local correlations in feature subspaces.

Feature selection methods [4], [5], [6], [7] try to find the subset of features that is most relevant for a given data mining task, such as classification. The selected feature subset usually contains features that have low correlation with each other but strong correlation with the target feature. To find the relevant feature subset, these methods search through various subsets of features and evaluate them according to certain criteria. Feature selection methods can be further divided into two groups based on their evaluation criteria: wrapper and filter. Wrapper models evaluate feature subsets by their predictive accuracy using statistical re-sampling or cross-validation. In filter techniques, the feature subsets are evaluated by their information content, typically statistical dependence or information-theoretic measures. Similar to feature transformation, feature selection finds one feature subset for the entire dataset.

Subspace clustering is based on the observation that clusters of points may exist in different subspaces. Many methods [19], [20], [21] have been developed to find clusters in axis-parallel subspaces. Recently, projected clustering was studied in [10], [11], inspired by the observation that clusters may exist in arbitrarily oriented subspaces. These methods can be treated as combinations of clustering algorithms and PCA. Like other clustering methods, they tend to find clusters of points that are close to each other in the projected space. However, as shown in Figure 4, a subset of features can still be strongly correlated even if the data points are far away from each other. Pattern based bi-clustering algorithms have been studied in [1], [13]. These algorithms find clusters in which the data points share pair-wise linear correlations, which is only a special case of linear correlation.

In this section, we formalize the problem and study its computational challenges.

Let *D* = *A* × *B* be a data matrix consisting of *M* data points in an *N*-dimensional feature space, where the feature set *A* = {**x**_{1}, **x**_{2}, · · · , **x**_{N}} and the point set *B* = {**p**_{1}, **p**_{2}, · · · , **p**_{M}}. Figure 1 shows an example dataset with 15 points and 9 features.

A strongly correlated feature subset is a subset of features that show strong linear correlation in a large portion of data points.

*Definition 1:* Let *F* = {**x**_{i1}, · · · , **x**_{in}} × {**p**_{j1}, · · · , **p**_{jm}} be a submatrix of *D*, where 1 ≤ *i*_{1} < *i*_{2} < · · · < *i*_{n} ≤ *N* and 1 ≤ *j*_{1} < *j*_{2} < · · · < *j*_{m} ≤ *M*. Let λ_{1} ≤ λ_{2} ≤ · · · ≤ λ_{n} be the eigenvalues of the covariance matrix *C*_{F} of *F*. The feature subset {**x**_{i1}, · · · , **x**_{in}} is a *strongly correlated feature subset* if $f(F,k)=\frac{{\Sigma}_{t=1}^{k}{\lambda}_{t}}{{\Sigma}_{t=1}^{n}{\lambda}_{t}}\le \epsilon$ and *m/M* ≥ *δ*, where *k*, *ε* and *δ* are user-specified parameters.

Each eigenvalue λ_{l} is the variance along eigenvector **v**_{l} [8]. The set of eigenvalues {λ_{l}} of matrix *C*_{F} is also called the *spectrum* of *C*_{F}.

Geometrically, each *m* × *n* submatrix of *D* represents an *n*-dimensional space with *m* points in it. This *n*-dimensional space can be partitioned into two subspaces, *S*_{1} and *S*_{2}, which are orthogonal to each other. *S*_{1} is spanned by the *k* eigenvectors with smallest eigenvalues and *S*_{2} is spanned by the remaining *n* − *k* eigenvectors. Intuitively, if the variance in subspace *S*_{1} is small (equivalently the variance in *S*_{2} is large), then the feature subset is strongly correlated. The input parameters *k* and threshold *ε* for the objective function $f(F,k)=\frac{{\Sigma}_{t=1}^{k}{\lambda}_{t}}{{\Sigma}_{t=1}^{n}{\lambda}_{t}}$ are used to control the strength of the correlation among the feature subset. The default value of *k* is 1. The larger the value of *k*, the stronger the linear correlation.
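The objective function above can be sketched in a few lines. The snippet below is a minimal illustration in Python/NumPy rather than the paper's Matlab implementation; `objective` is our name for *f(F, k)*.

```python
import numpy as np

def objective(F, k=1):
    """f(F, k): fraction of total variance captured by the k directions
    of smallest variance (smallest eigenvalues of the covariance matrix).
    A value close to 0 means the features of F are strongly correlated."""
    C = np.cov(F, rowvar=False)               # n x n covariance matrix
    lam = np.sort(np.linalg.eigvalsh(C))      # eigenvalues, ascending
    return lam[:k].sum() / lam.sum()

# Points lying exactly on the plane 2*x2 + 6*x7 + 3*x9 = 0, as in Figure 1
rng = np.random.default_rng(0)
a, b = rng.normal(size=100), rng.normal(size=100)
F = np.column_stack([a, b, -(2 * a + 6 * b) / 3])
print(objective(F, k=1) <= 0.004)   # strongly correlated: True
```

For three independent random features, by contrast, the smallest eigenvalue is comparable to the others and *f* stays far above any reasonable *ε*.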

The reason for requiring *m/M* ≥ *δ* is that a feature subset can be strongly correlated only on a subset of the data points. In our definition, we allow the strongly correlated feature subsets to exist in a large portion of the data points in order to handle this situation. Note that it is possible for a data point to participate in multiple local correlations held by different feature subsets, which makes the local correlations more difficult to detect. Please also note that, for a given strongly correlated feature subset, multiple linear correlations may exist on different subsets of points. In this paper, we focus on the scenario where there exists only one linear correlation for a strongly correlated feature subset.

For example, in the dataset shown in Figure 1, the feature subset of submatrix *F* = {**x**_{2}, **x**_{7}, **x**_{9}} × {**p**_{1}, **p**_{2}, · · · , **p**_{9}} is strongly correlated when *k* = 1, *ε* = 0.004 and *δ* = 60%. The eigenvalues of the covariance matrix *C*_{F}, the input parameters and the value of the objective function are shown in Table II.

In real world applications, it is typical that many local correlations co-exist, each involving a small number of features. Thus, it is reasonable to set a maximum size, *max*_{s}, for the feature subsets to be considered for each local correlation.

Our goal is to find all strongly correlated feature subsets in the database *D*. This problem is computationally challenging. In the following subsection, we study the properties concerning the monotonicity of the objective function with respect to the feature subsets and point subsets separately.

The following theorem concerning the spectrum of a covariance matrix, developed in the matrix theory community, is often called the *interlacing eigenvalues theorem* [22].

*Theorem 3.1:* Let *F* = {**x**_{i1}, · · · , **x**_{in}} × {**p**_{j1}, · · · , **p**_{jm}} and *F′* = {**x**_{i1}, · · · , **x**_{in}, **x**_{i(n+1)}} × {**p**_{j1}, · · · , **p**_{jm}} be two submatrices of *D*. Let *C*_{F} and *C*_{F′} be their covariance matrices, with eigenvalues λ_{1} ≤ · · · ≤ λ_{n} and λ′_{1} ≤ · · · ≤ λ′_{n+1} respectively. Then

$${\lambda}_{1}^{\prime}\le {\lambda}_{1}\le {\lambda}_{2}^{\prime}\le {\lambda}_{2}\le \cdots \le {\lambda}_{n-1}\le {\lambda}_{n}^{\prime}\le {\lambda}_{n}\le {\lambda}_{n+1}^{\prime}.$$

Theorem 3.1 tells us that the spectra of *C*_{F} and *C*_{F′} interlace: when one feature is added, each eigenvalue of the new covariance matrix is bracketed by neighboring eigenvalues of the original one.
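The interlacing relation is easy to verify numerically. The following sketch (Python/NumPy, with an arbitrary random matrix standing in for *D*) checks it for a covariance matrix and the principal submatrix obtained by dropping one feature:

```python
import numpy as np

# Empirical check of Theorem 3.1: the spectrum of C_F interlaces the
# spectrum of C_F' when F' extends F by one feature (same point set).
rng = np.random.default_rng(2)
data = rng.normal(size=(50, 5))                  # 50 points, 5 features

lam = np.sort(np.linalg.eigvalsh(np.cov(data[:, :4], rowvar=False)))   # λ_1..λ_n
lamp = np.sort(np.linalg.eigvalsh(np.cov(data[:, :5], rowvar=False)))  # λ'_1..λ'_{n+1}

# λ'_t <= λ_t <= λ'_{t+1} for every t
print(all(lamp[t] <= lam[t] <= lamp[t + 1] for t in range(len(lam))))  # True
```

The check works because the covariance matrix of the first 4 features is exactly the leading principal submatrix of the covariance matrix of all 5 features.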

By applying the interlacing eigenvalues theorem, we have the following property for the strongly correlated feature subsets.

*Property 3.2:* (Upward closure property of strongly correlated feature subsets) Let *F* = *X* × *P* and *F′* = *X′* × *P* be two submatrices of *D* with $X\subseteq {X}^{\prime}$. If *X* is a strongly correlated feature subset, then *X′* is also a strongly correlated feature subset.

*Proof:* We show the proof for the case where |*X′*| = |*X*| + 1, i.e., *X* is obtained by deleting one feature from *X′*. Let *C*_{F} and *C*_{F′} be the covariance matrices of *F* and *F′*, with eigenvalues λ_{1} ≤ · · · ≤ λ_{n} and λ′_{1} ≤ · · · ≤ λ′_{n+1} respectively. By Theorem 3.1, λ′_{t} ≤ λ_{t} for 1 ≤ *t* ≤ *n*, so ${\Sigma}_{t=1}^{k}{\lambda}_{t}^{\prime}\le {\Sigma}_{t=1}^{k}{\lambda}_{t}$. Moreover, ${\Sigma}_{t=1}^{n+1}{\lambda}_{t}^{\prime}=\text{tr}({C}_{{F}^{\prime}})\ge \text{tr}({C}_{F})={\Sigma}_{t=1}^{n}{\lambda}_{t}$, since the trace increases by the (nonnegative) variance of the added feature. Hence *f(F′, k)* ≤ *f(F, k)* ≤ *ε*, and *X′* is also a strongly correlated feature subset. The general case follows by induction.

The following example shows the monotonicity of the objective function with respect to the feature subsets. Using the dataset shown in Figure 1, let *F*_{1} = *X*_{1} × *P*_{1} = {**x**_{2}, **x**_{7}} × {**p**_{1}, · · · , **p**_{15}}, ${F}_{1}^{\prime}=({X}_{1}\cup \left\{{\mathbf{x}}_{9}\right\})\times {P}_{1}$, and ${F}_{1}^{\prime \prime}=({X}_{1}\cup \{{\mathbf{x}}_{4},{\mathbf{x}}_{9}\})\times {P}_{1}$. The values of the objective function, when *k* = 1, are shown in Table III. It can be seen from the table that the value of the objective function monotonically decreases when adding new features.

Property 3.2 shows that for a fixed set of points, if a subset of features are strongly correlated, then all of its supersets are also strongly correlated. Therefore, in our algorithm, we can focus on finding * the minimum strongly correlated feature subsets*, of which no subset is strongly correlated.

For a fixed feature subset, adding (or deleting) data points may cause the correlation of the features to either increase or decrease. That is, the objective function is non-monotonic with respect to the point subsets. The following property states this fact.

*Property 3.3:* Let *F* = {**x**_{i1}, · · · , **x**_{in}} × {**p**_{j1}, · · · , **p**_{jm}} and *F′* = {**x**_{i1}, · · · , **x**_{in}} × {**p**_{j1}, · · · , **p**_{jm}, **p**_{j(m+1)}} be two submatrices of *D. f(F, k)* can be equal to, or less than, or greater than *f(F′, k)*.

We use the following example to show the non-monotonicity of the objective function with respect to the point subsets. Using the dataset shown in Figure 1, let *F*_{2} = *X*_{2} × *P*_{2} = {**x**_{2}, **x**_{7}, **x**_{9}} × {**p**_{1}, · · · , **p**_{9}, **p**_{11}}, ${F}_{2}^{\prime}={X}_{2}\times ({P}_{2}\cup \left\{{\mathbf{p}}_{10}\right\})$, and ${F}_{2}^{\prime \prime}={X}_{2}\times ({P}_{2}\cup \left\{{\mathbf{p}}_{14}\right\})$. The values of their objective functions, when *k* = 1, are shown in Table IV. It can be seen from the table that the value of the objective function *f* can either increase or decrease when adding more points.

In summary, the value of the objective function will monotonically decrease when adding new features. On the other hand, adding new points can either increase or decrease the value of the objective function.

In this section, we present the algorithm CARE for finding the minimum strongly correlated feature subsets. CARE enumerates combinations of features to generate candidate feature subsets. To examine whether a candidate is a strongly correlated feature subset, CARE adopts a 2-step approach. It first checks whether the feature subset is strongly correlated on all data points. If not, CARE then applies a point deletion heuristic to find an appropriate subset of points on which the current feature subset may become strongly correlated. In Section IV-A, we first discuss the overall procedure of enumerating candidate feature subsets. In Section IV-B, we present the heuristics for choosing the point subsets for the candidates that are not strongly correlated on all data points.

For any submatrix *F* = *X* × {**p**_{1}, · · · , **p**_{M}} of *D*, in order to check whether the feature subset *X* is strongly correlated, we can perform PCA on *F* to see if its objective function value is below the threshold, i.e., if $f(F,k)=\frac{{\Sigma}_{t=1}^{k}{\lambda}_{t}}{{\Sigma}_{t=1}^{n}{\lambda}_{t}}\le \epsilon$.

Starting from feature subsets containing a single feature, CARE adopts depth first search to enumerate combinations of features to generate candidate feature subsets. In the enumeration process, if we find that a candidate feature subset is strongly correlated by evaluating its objective function value, then all its supersets can be pruned according to Property 3.2.
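The enumeration-with-pruning scheme can be sketched as follows. This is a simplified Python/NumPy illustration that checks each candidate only on the full point set (the point-deletion step is omitted); `find_min_correlated` and `objective` are our names, not the paper's.

```python
import numpy as np

def objective(F, k=1):
    """f(F, k) for an m x n submatrix F (see Definition 1)."""
    lam = np.sort(np.linalg.eigvalsh(np.cov(F, rowvar=False)))
    return lam[:k].sum() / lam.sum()

def find_min_correlated(D, k=1, eps=0.006, max_s=4):
    """Depth-first enumeration of candidate feature subsets. Once a
    subset is found to be strongly correlated, its supersets are
    pruned (Property 3.2), so only minimum subsets are reported."""
    N = D.shape[1]
    results = []

    def dfs(subset, start):
        # prune: an already-reported subset is contained in this candidate
        if any(set(r) <= set(subset) for r in results):
            return
        if len(subset) >= 2 and objective(D[:, subset], k) <= eps:
            results.append(tuple(subset))
            return                          # supersets need not be checked
        if len(subset) == max_s:
            return
        for j in range(start, N):
            dfs(subset + [j], j + 1)

    dfs([], 0)
    return results

# 6 features, with x1 + x2 - x3 = 0 embedded exactly (0-based columns 0, 1, 2)
rng = np.random.default_rng(4)
D = rng.normal(size=(100, 6))
D[:, 2] = D[:, 0] + D[:, 1]
print(find_min_correlated(D, eps=1e-6, max_s=3))   # [(0, 1, 2)]
```

Only the embedded triple is reported; its supersets are skipped by the pruning rule, and uncorrelated subsets fail the threshold test.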

Next, we present an upper bound on the value of the objective function, which can help to speed up the evaluation process. The following property shows the relationship between the diagonal entries of a covariance matrix and its eigenvalues [22].

*Property 4.1:* Let *F* be a submatrix of *D* and *C*_{F} be the covariance matrix of *F*. Let *c*_{1} ≤ *c*_{2} ≤ · · · ≤ *c*_{n} be the diagonal entries of *C*_{F} sorted in increasing order, and λ_{1} ≤ · · · ≤ λ_{n} its eigenvalues. Then ${\Sigma}_{t=1}^{k}{\lambda}_{t}\le {\Sigma}_{t=1}^{k}{c}_{t}$ for every 1 ≤ *k* ≤ *n*.

Applying Property 4.1, we can get the following proposition.

*Proposition 4.2:* Let *F* be a submatrix of *D* and *C*_{F} be the covariance matrix of *F*, with diagonal entries sorted in increasing order as *c*_{1} ≤ *c*_{2} ≤ · · · ≤ *c*_{n}. Then $f(F,k)\le \frac{{\Sigma}_{t=1}^{k}{c}_{t}}{{\Sigma}_{t=1}^{n}{c}_{t}}$.

The proof of Proposition 4.2 is straightforward and omitted here. This proposition gives us an upper bound of the objective function value for a given submatrix of *D* that can be computed without performing the eigen-decomposition. For any submatrix *F* = *X* × {**p**_{1}, · · · , **p**_{M}} of *D*, we can first examine the diagonal entries of the covariance matrix *C*_{F} of *F*: if the upper bound is already no greater than *ε*, then *X* must be strongly correlated and the full PCA evaluation can be skipped.
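The bound follows from a standard matrix-theory fact [22]: the sum of the *k* smallest eigenvalues of a covariance matrix is at most the sum of its *k* smallest diagonal entries, while both denominators equal the trace. A small sketch (Python/NumPy; `objective_and_bound` is our name):

```python
import numpy as np

def objective_and_bound(F, k=1):
    """Exact f(F, k), plus the cheap upper bound from the sorted diagonal
    entries of the covariance matrix: the sum of the k smallest eigenvalues
    is at most the sum of the k smallest diagonal entries, and both
    denominators equal the trace, so f(F, k) <= sum_k(diag) / trace."""
    C = np.cov(F, rowvar=False)
    lam = np.sort(np.linalg.eigvalsh(C))
    diag = np.sort(np.diag(C))
    return lam[:k].sum() / lam.sum(), diag[:k].sum() / diag.sum()

# One low-variance feature keeps the cheap bound small
rng = np.random.default_rng(3)
F = rng.normal(size=(100, 5)) * np.array([0.05, 1.0, 1.0, 1.0, 1.0])
f, bound = objective_and_bound(F, k=1)
print(f <= bound)   # True; when the bound is <= eps, the PCA check can be skipped
```

Computing the bound only needs the diagonal of *C*_{F}, which is much cheaper than an eigen-decomposition.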

In the previous subsection, we discussed the procedure of generating candidate feature subsets. A feature subset may be strongly correlated only on a subset of the data points. As discussed in Section III-B.2, the monotonicity property does not hold for the point subsets. Therefore, some heuristic must be used in order to avoid performing PCA on all possible subsets of points for each candidate feature subset. In this subsection, we discuss the heuristics that can be used for choosing the subset of points.

For a given candidate feature subset, if it is not strongly correlated on all data points, we can delete the points successively in the following way.

Suppose that *F* = {**x**_{i1}, · · · , **x**_{in}} × {**p**_{1}, · · · , **p**_{M}} is a submatrix of *D* and *f(F, k)* > *ε*, i.e., the features of *F* are not strongly correlated on all data points. Let *F*_{∖pa} be the submatrix obtained by deleting point **p**_{a} (**p**_{a} ∈ {**p**_{1}, · · · , **p**_{M}}) from *F*. This heuristic deletes the point **p**_{a} such that *f*(*F*_{∖pa}, *k*) has the smallest value compared to deleting any other point. We keep deleting points until the number of points in the submatrix reaches the ratio *m/M* = *δ* or the feature subset of *F* turns out to be strongly correlated on the current point subset.

This is a successive greedy point deletion heuristic. In each iteration, it deletes the point that leads to the largest reduction in the objective function value. This heuristic is time consuming: to delete one point from a submatrix containing *m* points, we need to calculate the objective function value *m* times to find the smallest value.

In this subsection, we discuss the heuristic used by CARE, which avoids calculating the objective function value *m* times when deleting a single point from a submatrix containing *m* points.

Suppose that *F* = {**x**_{i1}, · · · , **x**_{in}} × {**p**_{1}, · · · , **p**_{M}} is a submatrix of *D* and *f(F, k)* > *ε*, i.e., the features of *F* are not strongly correlated on all data points. As discussed in Section III-A, let *S*_{1} be the subspace spanned by the *k* eigenvectors with the smallest eigenvalues and *S*_{2} be the subspace spanned by the remaining *n* − *k* eigenvectors. For each point **p**_{a} ∈ {**p**_{1}, · · · , **p**_{M}}, we calculate two distances: *d*_{a1}, the distance between **p**_{a} and the origin in sub-eigenspace *S*_{1}, and *d*_{a2}, the distance between **p**_{a} and the origin in sub-eigenspace *S*_{2}. Let the distance ratio be *r*_{pa} = *d*_{a1}/*d*_{a2}. We sort the points according to their distance ratios and delete the (1 − *δ*)*M* points that have the largest ratios.

The intuition behind this heuristic is that we try to reduce the variance in subspace *S*_{1} as much as possible while retaining the variance in *S*_{2}.
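The distance-ratio heuristic can be sketched as follows (Python/NumPy rather than the paper's Matlab; `select_points` is our name, and the data construction is a hypothetical example with planted off-plane outliers):

```python
import numpy as np

def select_points(F, k=1, delta=0.5):
    """Distance-based point deletion: project each centered point onto
    S1 (the k eigenvectors with smallest eigenvalues) and S2 (the
    remaining eigenvectors), then keep the ceil(delta * M) points with
    the smallest distance ratio d1 / d2."""
    M = F.shape[0]
    Fc = F - F.mean(axis=0)
    _, vecs = np.linalg.eigh(np.cov(F, rowvar=False))   # ascending eigenvalues
    d1 = np.linalg.norm(Fc @ vecs[:, :k], axis=1)       # distance within S1
    d2 = np.linalg.norm(Fc @ vecs[:, k:], axis=1)       # distance within S2
    keep = np.argsort(d1 / d2)[: int(np.ceil(delta * M))]
    return np.sort(keep)

# 60 points on the plane 2x + 6y + 3z = 0, plus 16 off-plane outliers
rng = np.random.default_rng(5)
a, b = rng.normal(size=60), rng.normal(size=60)
plane = np.column_stack([a, b, -(2 * a + 6 * b) / 3])
n = np.array([2.0, 6.0, 3.0]) / 7.0                     # unit plane normal
c, d = rng.normal(size=16), rng.normal(size=16)
on_plane = 0.3 * np.column_stack([c, d, -(2 * c + 6 * d) / 3])
outliers = on_plane + np.outer(np.repeat([1.0, -1.0], 8), n)
F = np.vstack([plane, outliers])

kept = select_points(F, k=1, delta=0.75)
print(np.all(kept < 60))   # only on-plane points survive: True
```

Points far from the fitted hyperplane have large *d*_{1} but unexceptional *d*_{2}, so they receive the largest ratios and are deleted first.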

Using the running dataset shown in Figure 1, for feature subset {**x**_{2}, **x**_{7}, **x**_{9}}, the deleted points are shown as red stars in Figures 6(a) and 6(b) using the two different heuristics described above. The reestablished linear correlations are 2**x**_{2}+5.9**x**_{7}+3.8**x**_{9} = 0 (successive), and 2**x**_{2}+6.5**x**_{7}+2.9**x**_{9} = 0 (distance-based). Note that the embedded linear correlation is 2**x**_{2} + 6**x**_{7} + 3**x**_{9} = 0. As we can see from the figures, both methods choose almost the same point subsets and correctly reestablish the embedded linear correlation.

The distance-based heuristic is more efficient than the successive approach since it does not have to evaluate the value of the objective function many times for each deleted point.

As a summary of Section IV, CARE adopts the depth-first search strategy to enumerate the candidate feature subsets. If a candidate feature subset is not strongly correlated on all data points, then CARE applies the distance-based point deletion heuristic to find the subset of points on which the candidate feature subset may have stronger correlation. If a candidate turns out to be a strongly correlated feature subset, then all its supersets can be pruned.

To evaluate CARE, we apply it to both synthetic datasets and real life datasets. CARE is implemented in Matlab 7.0.4. The experiments are performed on a 2.4 GHz PC with 1 GB of memory running Windows XP.

To evaluate the effectiveness of CARE, we generate a synthetic dataset of 100 features and 120 points in the following way. The dataset is first populated with randomly generated points for each of the 100 features. Then we embed three local linear correlations into the dataset as described in Table V. For example, on points {**p**_{1}, · · · , **p**_{60}} we create the local linear correlation **x**_{50} − **x**_{20} + 0.5**x**_{60} = 0. Gaussian noise with mean 0 and variance 0.01 is added to the dataset.
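A sketch of how such a dataset can be generated (Python/NumPy; `make_synthetic` is our name, only the first correlation from Table V is shown, and the noise model follows the description above):

```python
import numpy as np

def make_synthetic(M=120, N=100, noise_var=0.01, seed=0):
    """Random M x N dataset with the correlation x50 - x20 + 0.5*x60 = 0
    embedded on points p1..p60 (0-based columns 49, 19, 59), plus
    Gaussian noise with mean 0 and variance noise_var on every entry."""
    rng = np.random.default_rng(seed)
    D = rng.normal(size=(M, N))
    D[:60, 49] = D[:60, 19] - 0.5 * D[:60, 59]   # enforce the correlation
    D += rng.normal(scale=np.sqrt(noise_var), size=(M, N))
    return D

D = make_synthetic()
resid = D[:60, 49] - D[:60, 19] + 0.5 * D[:60, 59]
print(np.abs(resid).mean() < 0.5)   # small residual on p1..p60: True
```

On the remaining points {**p**_{61}, · · · , **p**_{120}} the same linear combination has a residual on the order of the feature variances, so the correlation is genuinely local.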

We first show the comparison of CARE and full dimensional PCA. We perform PCA on the synthetic dataset described above. To present the linear correlation discovered by PCA, we show the resulting hyperplanes determined by the three eigenvectors with the smallest eigenvalues. Each such hyperplane represents a linear correlation of all the features in the dataset. Due to the large number of features, we only show the features with coefficients with absolute values greater than 0.2. The linear correlations reestablished by full dimensional PCA are shown in Table VI. Clearly, these are not the local linear correlations embedded in the dataset.

Table VII shows the local linear correlations reestablished by CARE, with *k* = 1, *ε* = 0.006, *δ* = 50%, and *max _{s}* = 4. As can be seen from the table, CARE correctly identifies the correlations embedded in the dataset.

Figure 7 shows the hyperplane representation of the local linear correlation **x**_{40} − 0.97**x**_{30} + 0.83**x**_{80} − 0.47**x**_{10} = 0, reestablished by CARE. Since this is a 3-dimensional hyperplane in 4-dimensional space, we visualize it as a 2-dimensional hyperplane in 3-dimensional space by creating a new feature (−0.83**x**_{80} + 0.47**x**_{10}). As we can see from the figure, the data points are not clustered on the hyperplane even though the feature subset is strongly correlated. The existing projected clustering algorithms [10], [11], [12] try to find points that are close to each other in the projected space; therefore, they cannot find the strongly correlated feature subset shown in this figure. In Section V-B.3, we further compare CARE with a projected clustering method on a real dataset.

In Figures 8(a) to 8(c), we show the pair-wise correlations between the features of the local linear correlation **x**_{40} − 0.97**x**_{30} + 0.83**x**_{80} − 0.47**x**_{10} = 0. These figures demonstrate that although the feature subset is strongly correlated, the pair-wise correlations of the features may still be very weak. The clustering methods [1], [13] focusing on pair-wise correlations cannot find such local linear correlations.

We run CARE under different parameter settings. Table VIII shows the local linear correlations reestablished by CARE for the embedded correlation **x**_{40} − **x**_{30} + 0.8**x**_{80} − 0.5**x**_{10} = 0. As we can see from the table, CARE is not very sensitive to the parameters. Similar results have also been observed for other embedded correlations.

To evaluate the efficiency of CARE, we generate synthetic datasets as follows. Each synthetic dataset has up to 500K points and 60 features, in which 40 linear correlations are embedded. Gaussian noise with mean 0 and variance 0.01 is added into the dataset. The default dataset for efficiency evaluation contains 5000 points and 60 features if not specified otherwise. The default values for the parameters are: *k* = 1, *ε* = 0.006, *δ* = 50%, and *max _{s}* = 4.

Figures 9(a) to 9(f) show the efficiency evaluation results. Figure 9(a) shows that the running time of CARE is roughly quadratic in the number of features in the dataset. Note that the theoretical worst case is exponential, when the algorithm has to check every subset of the features and data points. Figure 9(b) shows the scalability of CARE with respect to the number of points when the dataset contains 30 features. As shown in the figure, the running time of CARE is linear in the number of data points, due to the distance-based point deletion heuristic. As we can see from the figure, CARE finishes within a reasonable amount of time for large datasets. However, since CARE scales roughly quadratically in the number of features, its actual runtime mostly depends on the number of features in the dataset.

Figure 9(c) shows that the runtime of CARE increases steadily until *ε* reaches a certain threshold. This is because the higher the value of *ε*, the weaker the correlations identified; beyond a certain point, too many weak correlations meet the criteria and are identified. Figure 9(d) shows CARE's runtime when varying *δ*. Figure 9(e) shows CARE's runtime with respect to different *max*_{s} when the datasets contain 20 features.

Figure 9(f) shows the number of patterns evaluated by CARE before and after applying the upper bound of the objective function value discussed in Section IV-A.

We apply CARE to the mouse gene expression data provided by the School of Public Health at UNC. The dataset contains the expression values of 220 genes in 42 mouse strains. CARE finds 8 strongly correlated gene subsets with the parameter setting: *k* = 1, *ε* = 0.002, *δ* = 50%, and *max _{s}* = 4. Due to space limitations, we show 4 of these 8 gene subsets in Table IX with their symbols and the corresponding GO annotations. As shown in the table, the genes in each gene subset have consistent annotations. We also plot the hyperplanes of these strongly correlated gene subsets in 3-dimensional space in Figures 10(a) to 10(d). As we can see from the figures, the data points are sparsely distributed on the hyperplanes, which again demonstrates that CARE can find groups of highly similar genes that cannot be identified by existing projected clustering algorithms.

We apply CARE to the NBA statistics dataset^{5}. This dataset contains the statistics of 28 features for 200 players of the 2006–2007 season. Since the features have different value scales, we normalize each feature by its variance before applying CARE. The parameter setting is: *k* = 2, *ε* = 0.003, *δ* = 50%, and *max _{s}* = 4. We report some interesting local linear correlations found by CARE in Table X.
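The preprocessing step above can be sketched in a few lines of NumPy. The paper says features are normalized "by variance"; this sketch takes the common reading of scaling each column to unit variance (dividing by the standard deviation), which is an assumption on our part.

```python
import numpy as np

def normalize_features(X):
    """Scale each feature (column) of X to unit variance.

    Zero-variance columns are left unchanged to avoid division
    by zero. This mirrors the preprocessing applied to the NBA
    dataset before running CARE.
    """
    std = X.std(axis=0)
    std = np.where(std == 0, 1.0, std)
    return X / std
```

This makes features with very different raw scales (e.g. minutes played vs. shooting percentages) comparable, so no single feature dominates the covariance structure.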

- Correlation 1 says that the total number of rebounds is equal to the sum of defensive and offensive rebounds. This is an obvious correlation that one would expect.
- Correlation 2 means that the number of games played is highly correlated with a player's 2-point shooting percentage and free throw percentage.
- Correlation 3 says that players with a high 3-point shooting percentage tend to make more 3-point field goals in a game.
- Correlation 4 tells us that the total number of field goals made by a player is correlated with his 2-point shooting percentage and the number of 3-point shots he attempted.
- Correlation 5 shows that the number of games started depends on how good a player is at offensive rebounds and free throws.
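A correlation like Correlation 1 can be verified numerically by a least-squares fit. The sketch below uses synthetic rebound counts (the real NBA data is not reproduced here, and the feature names are hypothetical): fitting total rebounds against offensive and defensive rebounds should recover coefficients of exactly 1 with near-zero residual.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200  # 200 players, as in the NBA dataset

# Hypothetical per-player rebound counts.
orb = rng.integers(0, 300, size=n).astype(float)   # offensive rebounds
drb = rng.integers(0, 600, size=n).astype(float)   # defensive rebounds
trb = orb + drb  # total rebounds: the exact relation of Correlation 1

# Fit trb ~ a*orb + b*drb by least squares.
A = np.column_stack([orb, drb])
coef, rss, *_ = np.linalg.lstsq(A, trb, rcond=None)
# coef is close to [1, 1] and the residual sum of squares is ~0,
# confirming the exact linear dependence.
```

The other correlations in Table X are noisy rather than exact, so their fits would leave a small residual bounded by CARE's *ε* criterion.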

We plot Correlation 5 in Figure 11. This correlation holds for 3 different groups of players. The points in circle 1 show that players who are good at neither offensive rebounds nor free throws have few game starts. Circle 2 shows that players good at free throws have many game starts, and circle 3 shows that players good at offensive rebounds have many game starts. The points in circle 1 are close to each other, but the other points are far apart. Therefore, this local linear correlation is invisible to existing projected clustering algorithms.

We further compare CARE with the projected clustering method COPAC [12], which has been demonstrated to be more effective than ORCLUS [10] and 4C [11]. We apply CARE to the wage dataset^{6}, which has also been used in [12]. CARE successfully identifies both linear correlations reported in [12], i.e., *YE* + *YW* − *A* = −6 and *YW* − 1.03*A* = −17.4. Furthermore, CARE identifies two new linear correlations: 4.25*YW* + *W* − 4.5*A* = −80 and 2.4*YE* + 0.34*YW* − *W* = 28.4. These two linear correlations capture the relationship among wage, working experience, age, and education, and are not discovered by COPAC. Figure 12(a) shows the hyperplane of the linear correlation *YE* + *YW* − *A* = −6, which is found by both methods. In this figure, the points in the red circle form a density-connected cluster. Therefore, the projected clustering method can find the correlation by first identifying this cluster. However, as shown in the figure, the correlation is also supported by points outside the cluster. We also plot, in Figure 12(b), the hyperplane of the correlation 4.25*YW* + *W* − 4.5*A* = −80, which is found only by CARE. Clearly, such a correlation cannot be found by projected clustering methods because the points are sparsely distributed on the plane.

In this paper, we investigate the problem of finding local linear correlations in high dimensional datasets. Such local linear correlations may be invisible to global feature transformation methods such as PCA. We formalize this problem as finding feature subsets that are strongly correlated on a large number of data points, and use spectral theory to study the monotonicity properties of the problem. We present CARE, an efficient and effective algorithm for finding such strongly correlated feature subsets. The experimental results show that CARE can find interesting local linear correlations that cannot be identified by existing approaches, such as full-dimensional PCA and projected clustering methods. The experimental results also demonstrate that CARE scales well to large datasets.

The work reported in this paper focuses on the case where a strongly correlated feature subset contains a single linear correlation. An interesting direction for future work is to extend the current approach to find multiple linear correlations within a feature subset. This is more challenging, since finding such correlations requires decoupling both features and points.

^{1}CARE stands for finding loCAl lineaR corrElations.

^{2}In this paper, we assume that the eigenvalues are always arranged in increasing order. Their corresponding eigenvectors are {**v**_{1}, **v**_{2}, · · · , **v**_{n}}.
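As a concrete illustration of this convention (not from the paper): NumPy's symmetric eigensolvers also return eigenvalues in ascending order, so code following the paper's notation can use them directly.

```python
import numpy as np

# np.linalg.eigvalsh returns the eigenvalues of a symmetric
# (or Hermitian) matrix in ascending order, matching the
# convention lambda_1 <= lambda_2 <= ... <= lambda_n.
M = np.array([[2.0, 1.0],
              [1.0, 2.0]])
vals = np.linalg.eigvalsh(M)  # eigenvalues 1 and 3, ascending
```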

^{3}Setting the maximum size of the feature subsets is also used in many other feature selection and feature transformation methods [5], [8].

^{4}This theorem also applies to Hermitian matrix [22]. Here we focus on the covariance matrix, which is semi-positive definite and symmetric.

^{5}http://sports.espn.go.com/nba/teams/stats?team=Bos&year=2007&season=2

1. Eisen M, Spellman P, Brown P, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA. 1998;95:14863–68. [PubMed]

2. Iyer V, et al. The transcriptional program in the response of human fibroblasts to serum. Science. 1999;283:83–87. [PubMed]

3. Parsons L, Haque E, Liu H. Subspace clustering for high dimensional data: a review. SIGKDD Explorations. 2004

4. Blum A, Langley P. Selection of relevant features and examples in machine learning. Artificial Intelligence. 1997;97:245–271.

5. Liu H, Motoda H. Feature selection for knowledge discovery and data mining. Kluwer Academic Publishers; Boston: 1998.

6. Yu L, Liu H. Feature selection for high-dimensional data: a fast correlation-based filter solution. Proceedings of International Conference on Machine Learning; 2003.

7. Zhao Z, Liu H. Searching for interacting features. Proceedings of the 20th International Joint Conference on Artificial Intelligence; 2007.

8. Jolliffe I. Principal component analysis. Springer; New York: 1986.

9. Hastie T, Tibshirani R, Friedman J. The elements of statistical learning. Springer Verlag; 1996.

10. Aggarwal C, Yu P. Finding generalized projected clusters in high dimensional spaces. SIGMOD. 2000

11. Bohm C, Kailing K, Kroger P, Zimek A. Computing clusters of correlation connected objects. SIGMOD. 2004

12. Achtert E, Bohm C, Kriegel H-P, Kroger P, Zimek A. Deriving quantitative models for correlation clusters. KDD. 2006

13. Wang H, Wang W, Yang J, Yu Y. Clustering by pattern similarity in large data sets. SIGMOD. 2002

14. Ashburner M, et al. Gene ontology: tool for the unification of biology. The gene ontology consortium, Nat. Genet. 2000;25:25–29. [PMC free article] [PubMed]

15. Lindman HR. Analysis of variance in complex experimental designs. Wiley-Interscience; 2001.

16. Fukunaga K. Introduction to statistical pattern recognition. Academic Press; San Diego, California: 1990.

17. Mendenhall W, Sincich T. A Second Course in Statistics: Regression Analysis. Prentice Hall; 2002.

18. Yu S, Yu K, Kriegel H-P, Wu M. Supervised probabilistic principal component analysis. KDD. 2006

19. Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD. 1998

20. Aggarwal C, Wolf J, Yu P, Procopiuc C, Park J. Fast algorithms for projected clustering. SIGMOD. 1999

21. Chen C, Fu A, Zhang Y. Entropy-based subspace clustering for mining numerical data. SIGKDD. 1999

22. Horn RA, Johnson CR. Matrix analysis. Cambridge University Press; Cambridge U. K.: 1985.
