
Proc ACM Int Conf Inf Knowl Manag. Author manuscript; available in PMC 2010 August 4.

Published in final edited form as:

Proc ACM Int Conf Inf Knowl Manag. 2008; 2008: 961–970.

PMCID: PMC2916238

NIHMSID: NIHMS131996

Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA

Finding latent patterns in high dimensional data is an important research problem with numerous applications. The most well known approaches for high dimensional data analysis are feature selection and dimensionality reduction. These methods are widely used in many applications; they aim to capture global patterns and are typically performed in the full feature space. In many emerging applications, however, scientists are interested in the local latent patterns held by feature subspaces, which may be invisible via any global transformation.

In this paper, we investigate the problem of finding strong linear and nonlinear correlations hidden in feature subspaces of high dimensional data. We formalize this problem as identifying reducible subspaces in the full dimensional space. Intuitively, a reducible subspace is a feature subspace whose intrinsic dimensionality is smaller than the number of features. We present an effective algorithm, REDUS, for finding the reducible subspaces. Two key components of our algorithm are finding the overall reducible subspace, and uncovering the individual reducible subspaces from the overall reducible subspace. A broad experimental evaluation demonstrates the effectiveness of our algorithm.

Many real life applications deal with high dimensional data. In the biomedical domain, for example, advanced microarray techniques [2, 11, 17] make it possible to monitor the expression levels of hundreds to thousands of genes simultaneously. By mapping each gene to a feature, gene expression data can be treated as data points distributed in a very high dimensional feature space. To make sense of such high dimensional data, extensive research has been done on finding the latent structures hidden in the large number of features. Two well known approaches to analyzing high dimensional data are feature selection and dimensionality reduction.

The goal of feature selection methods [6, 22, 31, 33] is to find a single representative subset of features that are most relevant for the data mining task at hand, such as classification. The selected features generally have low correlation with each other but have strong correlation with the target feature.

Dimensionality reduction [4, 8, 18, 27, 29] is widely used as a key component of many approaches in analyzing high dimensional data. The insight behind dimensionality reduction methods is that a high dimensional dataset may exhibit interesting patterns on a lower dimensional subspace due to correlations among the features. Though very successful in finding the low dimensional structures embedded in a high dimensional space, these methods are usually performed in the full feature space. They aim to model the global latent structure of the data and do not separate the impact of any original features nor identify latent patterns hidden in some feature subspaces. Please refer to Section 2 for a more detailed discussion of the related work.

In many emerging applications, the datasets usually consist of thousands to hundreds of thousands of features. In such high dimensional datasets, some feature subsets may be strongly correlated, while others may not have any correlation at all. In these applications, it is more desirable to find the correlations that are hidden in feature subspaces. For example, in gene expression data analysis, a group of genes having strong correlation is of high interest to biologists since it helps to infer unknown functions of genes [11] and gives rise to hypotheses regarding the mechanism of the transcriptional regulatory network [17]. We refer to such correlation among a subset of features as a *local correlation*, in comparison with the global correlation found by full dimensional feature reduction methods. Since such local correlations only exist in some subspaces of the full dimensional space, they are invisible to the full dimensional reduction methods. In [32], an algorithm is proposed to find local linear correlations in high dimensional data. However, in real applications, a feature subspace can be either linearly or nonlinearly correlated. The problem of finding both linear and nonlinear correlations in feature subspaces remains open.

For example, Figure 1 shows a dataset consisting of 12 features, {*f*_{1}, *f*_{2}, · · · , *f*_{12}}, and 1000 data points. Embedded in the full dimensional space, feature subspaces {*f*_{1}, *f*_{2}, *f*_{3}} and {*f*_{4}, *f*_{5}, *f*_{6}} are nonlinearly correlated, and {*f*_{7}, *f*_{8}, *f*_{9}} is linearly correlated. Features {*f*_{10}, *f*_{11}, *f*_{12}} contain random noise.

Applying dimensionality reduction methods to the full dimensional space cannot uncover these local correlations hidden in the full feature space. For example, Figure 2(a) shows the result of applying Principal Component Analysis (PCA) [18] to the full dimensional space of the example dataset shown in Figure 1. In this figure, we plot the point distribution on the first 3 principal components found by PCA. Clearly, we cannot find any pattern that is similar to the patterns embedded in the dataset. Similarly, Figure 2(b) shows the result of applying ISOMAP [29] to reduce the dimensionality of the dataset down to 3. No desired pattern is found in this low dimensional structure either.

How can we identify these local correlations hidden in the full dimensional space?

This question is two-fold. First, we need to identify the strongly correlated feature subspaces, i.e., a subset of features that are strongly correlated and actually have low dimensional structures. Then, after these locally correlated feature subsets are found, we can apply the existing dimensionality reduction methods to identify the low dimensional structures embedded in them.

Many methods have been proposed to address the second aspect of the question, i.e., given a correlated feature space, finding the low dimensional embedding in it. The first aspect of the question, however, is largely untouched. In this paper, we investigate the first aspect of the question, i.e., identifying the strongly correlated feature subspaces.

**(1)** In this paper, we investigate the problem of finding correlations hidden in the feature subspaces of high dimensional data. The correlations can be either linear or nonlinear. To our best knowledge, our work is the first attempt to find local linear and nonlinear correlations hidden in feature subspaces.

Many methods for modelling correlations can be found in the literature, such as mutual information [10], Pearson correlation [26], and rank correlation [19]. However, the commonly used measurements are for correlations between two features. In high dimensional data, the correlation may involve a large number of features, i.e., a high order correlation. Note that a strong high order correlation does not necessarily imply that there are strong correlations between feature pairs. For example, the 3 features, {*f*_{1}, *f*_{2}, *f*_{3}} in Figure 1, are strongly correlated, since the 3-dimensional Swiss roll structure is actually on a 2-dimensional manifold. Figures 3(a) to 3(c) show the projections of the Swiss roll onto the spaces of two features. As we can see from the figures, there are no clear pairwise correlations between any two features.
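To make this concrete, here is a small synthetic illustration (our own construction, not the paper's Swiss roll data): three features where one is an exact function of the other two, so the subspace has intrinsic dimensionality 2, yet every pairwise Pearson correlation is near zero.

```python
import numpy as np

rng = np.random.default_rng(42)
f1, f2 = rng.uniform(-1, 1, (2, 5000))
f3 = f1 * f2  # f3 is an exact function of f1 and f2: intrinsic dimensionality 2, not 3

corr = np.corrcoef(np.vstack([f1, f2, f3]))
max_pairwise = np.abs(corr[np.triu_indices(3, k=1)]).max()
print(max_pairwise)  # near 0: no strong pairwise correlation despite the exact dependency
```

A pairwise measure such as Pearson correlation therefore gives no hint that {f1, f2, f3} is a strongly correlated subspace, which is exactly why a high order notion like intrinsic dimensionality is needed.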

We adopt the concept of intrinsic dimensionality [13] to model such high order correlations. We formalize this problem as finding *reducible subspaces* in the full dimensional space. Informally, a feature subspace is reducible if its intrinsic dimensionality is smaller than the number of features. Various intrinsic dimensionality estimators have been developed [9, 14, 21]. Our problem formalization does not depend on any particular method for estimating the intrinsic dimensionality. We show two necessary properties that any estimator should satisfy in order to be a generalization of the well-defined concepts in the linear case.

**(2)** We develop an effective algorithm REDUS^{1} to find the reducible subspaces in the dataset. REDUS consists of the following two steps.

It first finds the union of all reducible subspaces, i.e., the *overall reducible subspace*. The second step is to uncover the individual reducible subspaces in the overall reducible subspace. The key component of this step is to examine if a feature is strongly correlated with a feature subspace. We develop a method utilizing point distributions to distinguish the features that are strongly correlated with a feature subspace and those that are not. Our method achieves similar accuracy to that of directly using intrinsic dimensionality estimators, but with much less computational cost.

Extensive experiments on synthetic and real life datasets demonstrate the effectiveness of REDUS.

Feature selection methods [6, 22, 31, 33] try to find a subset of features that are most relevant for a given data mining task, such as classification. In order to find the relevant feature subset, these methods search through various subsets of features and evaluate these subsets according to some criteria. Feature selection methods can be further divided into two groups according to their evaluation criteria: wrapper and filter. Wrapper models evaluate feature subsets by their predictive accuracy using statistical re-sampling or cross-validation. In filter techniques, the feature subsets are evaluated by their information content, using statistical dependence or information-theoretic measures.

Feature selection finds one feature subset for the entire dataset. The selected feature subset usually contains the features that have low correlation with each other but have strong correlation with the target feature. The correlation measurements are usually defined for feature pairs, such as mutual information and Pearson correlation. Our work, on the other hand, is to find the subsets of features, in which the features are strongly correlated. Moreover, the correlations are not limited to feature pairs.

Dimensionality reduction methods can be categorized into linear methods, such as Multi-Dimensional Scaling (MDS) [8] and Principal Component Analysis (PCA)[18], and non-linear methods, such as Local Linear Embedding (LLE) [27], ISOMAP [29], and Laplacian eigenmaps [4]. For high dimensional datasets, if there exist low dimensional subspaces or manifolds embedded in the full dimensional spaces, these methods are successful in identifying these low dimensional embeddings.

Dimensionality reduction methods are usually applied on the full dimensional space to capture the independent components among all the features. They are not designed to address the problem of identifying correlation in feature subspaces. It is reasonable to apply them to the feature spaces that are indeed correlated. However, in very high dimensional datasets, different feature subspaces may have different correlations, and some feature subspace may not have any correlation at all. In this case, dimensionality reduction methods should be applied after such strongly correlated feature subspaces have been identified.

The goal of correlation clustering methods is to find clusters hidden in projected feature spaces [1, 7, 30]. They can be viewed as combinations of clustering methods and dimensionality reduction methods. Both ORCLUS [1] and 4C [7] can be treated as PCA-lized clustering methods. To find the low dimensional clusters, ORCLUS and 4C apply PCA on subsets of data points in the full dimensional space and merge the point subsets having similar orientations. Therefore, they do not touch the problem of finding reducible subspaces. Instead, they implicitly assume the full dimensional space is reducible for certain subsets of data points. CURLER [30] finds clusters having non-linear correlations in subspaces. It first applies EM clustering algorithm to find a large number of micro-clusters, and then merges clusters with large overlaps. Since in its first step, micro-clusters are formed by applying EM to the full dimensional space, CURLER also does not address the problem of finding the reducible subspaces.

Due to correlations among features, a high dimensional dataset may lie in a subspace with dimensionality smaller than the number of features [9, 14, 21]. The intrinsic dimensionality can be treated as the minimum number of free variables required to define the data without any significant information loss [13]. For example, as shown in Figure 1, in the 3-dimensional space of {*f*_{1}, *f*_{2}, *f*_{3}}, the data points lie on a Swiss roll, which is actually a 2-dimensional manifold. Therefore, its intrinsic dimensionality is 2.

The concept of intrinsic dimensionality has many applications in the database and data mining communities, such as clustering [3, 15], outlier detection [24], nearest neighbor queries [23], and spatial query selectivity estimation [5, 12]. Different definitions of intrinsic dimensionality can be found in the literature. For example, in linear cases, matrix rank [16] and PCA [18] can be used to estimate intrinsic dimensionality. For nonlinear cases, estimators such as *box counting dimension, information dimension*, and *correlation dimension* have been developed. These intrinsic dimensionality estimators are sometimes collectively referred to as *fractal dimension*. Please see [25, 28] for good coverage of the topics of intrinsic dimensionality estimation and its applications.

In [32], the CARE algorithm has been proposed for finding local linear correlations. Adopting a similar criterion used in PCA, the strongly correlated feature subsets are formalized as feature subspaces having small residual variances. However, this work only focuses on linear correlations. The problem of finding nonlinear local correlations remains uninvestigated.

In this section, we utilize intrinsic dimensionality to formalize the problem of finding strongly correlated feature subspaces.

Suppose that the dataset Ω consists of *N* data points and *M* features. Let Ω_{P} = {*p*_{1}, *p*_{2}, · · · , *p _{N}*} denote the point set, and Ω_{F} = {*f*_{1}, *f*_{2}, · · · , *f _{M}*} denote the feature set.

Intrinsic dimensionality provides a natural way to examine whether a feature is correlated with some feature subspace: if a feature *f*_{a} ∈ Ω_{F} is strongly correlated with a feature subspace *V* ⊆ Ω_{F}, then adding *f*_{a} to *V* should not cause much change in the intrinsic dimensionality of *V*. The following definition formalizes this intuition.

Definition 3.1. (Strong Correlation)

*A feature subspace V* ⊆ Ω_{F} *and a feature f*_{a} ∈ Ω_{F} *have strong correlation, if*

$$\Delta ID(V,{f}_{a})=ID(V\cup \{{f}_{a}\})-ID(V)\le \varepsilon .$$

In this definition, ε is a user specified threshold. Smaller ε value implies stronger correlation, and larger ε value implies weaker correlation. If *V* and *f*_{a} have strong correlation, we also say that they are strongly correlated.

Definition 3.2. (Redundancy)

*Let* $V=\{{f}_{{v}_{1}},{f}_{{v}_{2}},\cdots ,{f}_{{v}_{m}}\}\subseteq {\Omega}_{F}$. *f _{vi}* ∈ *V is a redundant feature of V, if*

$$\Delta ID(\{{f}_{{v}_{1}},\cdots ,{f}_{{v}_{i-1}},{f}_{{v}_{i+1}},\cdots ,{f}_{{v}_{m}}\},{f}_{{v}_{i}})\le \varepsilon .$$

We say *V* is a *redundant* feature subspace if it has at least one redundant feature. Otherwise, *V* is a *non-redundant* feature subspace.

Note that in Definitions 3.1 and 3.2, *ID*(*V*) does not depend on a particular intrinsic dimensionality estimator. Any existing estimator can be applied when calculating *ID*(*V*). Moreover, we do not require that the intrinsic dimensionality estimator reflects the exact dimensionality of the dataset. However, in general, a good intrinsic dimensionality estimator should satisfy two basic properties.

First, if a feature is redundant in some feature subspace, then it is also redundant in the supersets of the feature subspace. We formalize this intuition as the following property.

Property 3.3. *For V* ⊆ *U* ⊆ Ω_{F}, *if* Δ*ID*(*V, f _{a}*) ≤ ε, *then* Δ*ID*(*U, f _{a}*) ≤ ε.

This is a reasonable requirement, since if *f*_{a} is strongly correlated with *V* ⊆ *U*, then adding *f*_{a} to *U* will not greatly alter its intrinsic dimensionality.

From this property, it is easy to see that, if feature subspace *U* is non-redundant, then all of its subsets are non-redundant, which is clearly a desirable property for the feature subspaces.

Corollary 3.4. *If U* ⊆ Ω_{F} *is non-redundant, then* ∀ *V* ⊆ *U*, *V is also non-redundant*.

The following property extends the concept of a basis [20] in a linear space to nonlinear spaces using intrinsic dimensionality. In a linear space, suppose that *V* and *U* contain the same number of vectors, and the vectors in *V* and *U* are all linearly independent. If the vectors of *U* are in the subspace spanned by the vectors of *V*, then the vectors in *V* and the vectors in *U* span the same subspace. (The span of a set of vectors consists of all linear combinations of the vectors.) Similarly, in Property 3.5, for two non-redundant feature subspaces *V* and *U*, we require that if the features in *U* are strongly correlated with *V*, then *U* and *V* are strongly correlated with the same subset of features.

Property 3.5. *Let* $V=\{{f}_{{v}_{1}},{f}_{{v}_{2}},\cdots ,{f}_{{v}_{m}}\}\subseteq {\Omega}_{F}$ *and* $U=\{{f}_{{u}_{1}},{f}_{{u}_{2}},\cdots ,{f}_{{u}_{m}}\}\subseteq {\Omega}_{F}$ *be two non-redundant feature subspaces. If* ∀ *f _{ui}* ∈ *U*, Δ*ID*(*V, f _{ui}*) ≤ ε, *then for any f _{a}* ∈ Ω_{F}, Δ*ID*(*V, f _{a}*) ≤ ε *if and only if* Δ*ID*(*U, f _{a}*) ≤ ε.

Intuitively, if a feature subspace *Y* (*Y* ⊆ Ω_{F}) is redundant, then *Y* should be reducible to some subspace, say *V* (*V* ⊂ *Y*). Concerning the possible choices of *V*, we are most interested in the smallest one that *Y* can be reduced to, since it represents the intrinsic dimensionality of *Y*. We now give the formal definitions of reducible subspace and its core space.

Definition 3.6. (Reducible Subspace and Core Space) *Y* ⊆ Ω_{F} *is a reducible subspace if there exists a non-redundant subspace V* (*V* ⊂ *Y*), *such that*

- ∀ *f*_{a} ∈ *Y*, Δ*ID*(*V, f*_{a}) ≤ ε, *and*
- ∀ *U* ⊆ *Y* (|*U*| ≤ |*V*|), *U is non-redundant.*

We say *V* is the *core space* of *Y*, and *Y* is *reducible to V*.

Criterion (1) in Definition 3.6 says that all features in *Y* are strongly correlated with the core space *V*. The meaning of criterion (2) is that the core space is the smallest non-redundant subspace of *Y* with which all other features of *Y* are strongly correlated.

Among all reducible subspaces, we are most interested in the maximum ones. A maximum reducible subspace is a reducible subspace that includes all features that are strongly correlated with its core space.

Definition 3.7. (Maximum Reducible Subspace)

*Y* ⊆ Ω_{F} *is a maximum reducible subspace if*

- *Y is a reducible subspace, and*
- ∀ *f*_{b} ∈ Ω_{F}, *if f*_{b} ∉ *Y, then* Δ*ID*(*V, f*_{b}) > ε, *where V is the core space of Y*.

Let {*Y*_{1}, *Y*_{2}, · · · , *Y _{S}*} be the set of all maximum reducible subspaces in the dataset. The union of the maximum reducible subspaces $OR={\cup}_{i=1}^{S}{Y}_{i}$ is referred to as the *overall reducible subspace*.

Note that Definition 3.7 works for the general case where a feature can be in different maximum reducible subspaces. In this paper, we focus on the special case where maximum reducible subspaces are non-overlapping, i.e., each feature can be in *at most one* maximum reducible feature subspace.

To find the maximum reducible subspaces in the dataset, REDUS adopts a two-step approach. The first step is to find the overall reducible subspace. Then, from the overall reducible subspace, it identifies the individual maximum reducible subspaces. In the next section, we present the algorithm for finding the overall reducible subspace. In Section 5, we discuss the method for finding individual maximum reducible subspaces.

In this section, we present the algorithm for finding the overall reducible subspace. We first give a short introduction to the intrinsic dimensionality estimator. Then we present the algorithm for finding the overall reducible subspace and the proof of its correctness.

To find the overall reducible subspace in the dataset, we adopt correlation dimension [25, 28], which can measure both linear and nonlinear intrinsic dimensionality, as our intrinsic dimensionality estimator since it is computationally more efficient than other estimators while its quality of estimation is similar to others. In practice, we observe that correlation dimension satisfies Properties 3.3 and 3.5, although we do not provide the proof here. In what follows, we give a brief introduction of correlation dimension.

Let *Y* be a feature subspace of the dataset, i.e., *Y* ⊆ Ω_{F}. Suppose that the number of points *N* in the dataset approaches infinity. Let *dis*(*p _{i}, p_{j}, Y*) represent the distance between two data points *p _{i}* and *p _{j}* in subspace *Y*, and let *B _{Y}*(*p _{i}, r*) denote the set of points within distance *r* of *p _{i}*:

$${B}_{Y}({p}_{i},r)=\{{p}_{j}\mid {p}_{j}\in {\Omega}_{P},dis({p}_{i},{p}_{j},Y)\le r\}.$$

The average fraction of pairs of data points within distance *r* is

$${C}_{Y}(r)=\underset{N\to \infty}{\text{lim}}\frac{1}{{N}^{2}}\sum _{{p}_{i}\in {\Omega}_{P}}\mid {B}_{Y}({p}_{i},r)\mid .$$

The *correlation dimension of Y* is then defined as

$$ID\left(Y\right)=\underset{r,{r}^{\prime}\to 0}{\text{lim}}\frac{\text{log}[{C}_{Y}\left(r\right)/{C}_{Y}\left({r}^{\prime}\right)]}{\text{log}[r/{r}^{\prime}]}.$$

In practice, *N* is a finite number. *C _{Y}*(*r*) is estimated using $\frac{1}{{N}^{2}}{\sum}_{{p}_{i}\in {\Omega}_{P}}\mid {B}_{Y}({p}_{i},r)\mid$. The correlation dimension is the growth rate of the function *C _{Y}*(*r*) in log-log scale, i.e., the slope of log *C _{Y}*(*r*) versus log *r*.

The intuition behind the correlation dimension is the following. For points that are arranged on a line, one expects to find twice as many points when doubling the radius. For points scattered on a 2-dimensional plane, when doubling the radius, we expect the number of points to increase quadratically. Generalizing this idea to *m*-dimensional space, we have *C _{Y}*(*r*) ∝ *r ^{m}*, i.e., the correlation dimension of *Y* is *m*.
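As a concrete illustration, the correlation dimension can be estimated from a finite sample by computing the correlation sum at two radii and taking the log-log slope. This is a minimal sketch of that idea; the radii (0.1, 0.2), sample sizes, and test manifolds are our illustrative choices, not values from the paper.

```python
import numpy as np

def correlation_sum(points, r):
    """C_Y(r): fraction of distinct point pairs within distance r in subspace Y."""
    n = len(points)
    d = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(-1))
    return ((d <= r).sum() - n) / (n * (n - 1))  # exclude self-pairs

def correlation_dimension(points, r1, r2):
    """Estimate ID(Y) as the slope of log C_Y(r) between radii r1 and r2."""
    c1, c2 = correlation_sum(points, r1), correlation_sum(points, r2)
    return np.log(c1 / c2) / np.log(r1 / r2)

rng = np.random.default_rng(0)
t = rng.uniform(0, 1, 1000)
line3d = np.column_stack([t, 2 * t, -t])   # 1-dimensional line embedded in 3-D
u, v = rng.uniform(0, 1, (2, 1000))
plane3d = np.column_stack([u, v, u + v])   # 2-dimensional plane embedded in 3-D

id_line = correlation_dimension(line3d, 0.1, 0.2)
id_plane = correlation_dimension(plane3d, 0.1, 0.2)
print(id_line, id_plane)  # close to 1 and close to 2, respectively
```

Finite-sample estimates are slightly biased downward by boundary effects, which is why a thresholded ΔID test is used rather than requiring exact integer dimensionalities.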

The following theorem sets the foundation for the efficient algorithm to find the overall reducible subspace.

Theorem 4.1. *Suppose that Y* ⊆ Ω_{F} *is a maximum reducible subspace and V* ⊂ *Y is its core space. We have* ∀ *U* ⊆ *Y* (|*U*| = |*V*|), *U is also a core space of Y*.

Proof. We need to show that *U* satisfies the criteria in Definition 3.7. Let *V* = {*f*_{v1}, *f*_{v2}, · · · , *f*_{vm}} and *U* = {*f*_{u1}, *f*_{u2}, · · · , *f*_{um}}.

Since *U* ⊆ *Y* and |*U*| = |*V*|, by criterion (2) of Definition 3.6, *U* is non-redundant, and by criterion (1), for every *f _{ui}* ∈ *U*, Δ*ID*(*V, f _{ui}*) ≤ ε. Since *U* and *V* are both non-redundant, by Property 3.5, *U* and *V* are strongly correlated with the same features; hence, for every *f _{a}* ∈ *Y*, Δ*ID*(*U, f _{a}*) ≤ ε. Criterion (2) holds for *U* trivially, since it depends only on *Y* and |*U*| = |*V*|.

Therefore, *U* is also a core space of *Y*. □

Theorem 4.1 tells us that any subset *U* ⊆ *Y* of size |*V*| is also a core space of *Y*.

Suppose that {*Y*_{1}, *Y*_{2}, · · · , *Y*_{S}} is the set of all maximum reducible subspaces in the dataset and the overall reducible subspace is $OR={\cup}_{i=1}^{S}{Y}_{i}$. To find *OR*, we can apply the following method. For every *f*_{a} ∈ Ω_{F}, let *RF _{fa}* = {*f*_{b} | *f*_{b} ∈ Ω_{F}, *f*_{b} ≠ *f*_{a}} be the set of remaining features, and check whether Δ*ID*(*RF _{fa}, f_{a}*) ≤ ε.

Corollary 4.2. *OR* = {*f*_{a}|Δ*ID*(*RF _{fa}, f_{a}*) ≤ ε}.

Proof. Let *f _{y}* be an arbitrary feature in the overall reducible subspace, i.e., *f _{y}* ∈ *Y _{i}* ⊆ *OR* for some *i*. From Theorem 4.1, there exists a core space *V _{i}* ⊆ *Y _{i}* with *f _{y}* ∉ *V _{i}*, such that Δ*ID*(*V _{i}, f _{y}*) ≤ ε. Since *V _{i}* ⊆ *RF _{fy}*, by Property 3.3, Δ*ID*(*RF _{fy}, f _{y}*) ≤ ε.

Similarly, if *f _{y}* ∉ *OR*, then *f _{y}* is not strongly correlated with any core space, and hence Δ*ID*(*RF _{fy}, f _{y}*) > ε.

Therefore, we have *OR* = {*f _{y}* | Δ*ID*(*RF _{fy}, f _{y}*) ≤ ε}. □

The algorithm for finding the overall reducible subspace is shown in Algorithm 1, Lines 1 to 7. Note that the procedure for finding the overall reducible subspace is linear in the number of features in the dataset.
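The feature-by-feature check underlying Corollary 4.2 can be sketched as follows. This is a simplified illustration, not the paper's Algorithm 1: the toy dataset, radii, seed, and ε = 0.4 are our own choices, and the correlation dimension estimator is the finite-sample log-log slope.

```python
import numpy as np

def corr_dim(points, r1=0.1, r2=0.2):
    """Correlation dimension estimate: log-log slope of the correlation sum."""
    n = len(points)
    d = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(-1))
    c1 = ((d <= r1).sum() - n) / (n * (n - 1))  # exclude self-pairs
    c2 = ((d <= r2).sum() - n) / (n * (n - 1))
    return np.log(c1 / c2) / np.log(r1 / r2)

def overall_reducible_subspace(X, eps):
    """OR = { f_a : delta-ID(RF_fa, f_a) <= eps }, where RF_fa is all other features.
    Since RF_fa plus {f_a} is the full space, delta-ID = ID(full) - ID(RF_fa)."""
    M = X.shape[1]
    id_full = corr_dim(X)
    return [a for a in range(M)
            if id_full - corr_dim(np.delete(X, a, axis=1)) <= eps]

rng = np.random.default_rng(1)
x, y, w = rng.uniform(0, 1, (3, 1200))
f3 = (x + y) % 1.0                   # deterministic (nonlinear) function of f1 and f2
X = np.column_stack([x, y, f3, w])   # features 0-2 are reducible; feature 3 is noise
print(overall_reducible_subspace(X, eps=0.4))  # should print [0, 1, 2]
```

Removing a strongly correlated feature leaves the intrinsic dimensionality almost unchanged (small ΔID), while removing the noise feature drops it by roughly one, so the noise feature is excluded from OR.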

In this section, we present the second component of REDUS, i.e., identifying the maximum reducible subspaces from the overall reducible subspace found in the previous section.

From Definition 3.7 and Theorem 4.1, we have the following property concerning the reducible subspaces.

Corollary 5.1. *Let Y _{i}* ⊆ *OR be a maximum reducible subspace, and V _{i}* ⊆ *Y _{i} be any core space of Y _{i}. We have*

$${Y}_{i}=\{{f}_{a}\mid \Delta ID({V}_{i},{f}_{a})\le \varepsilon ,{f}_{a}\in OR\}.$$

Therefore, to find the individual maximum reducible subspaces *Y _{i}* ⊆ *OR* (1 ≤ *i* ≤ *S*), we can use any core space *V _{i}* ⊆ *Y _{i}* to find the other features in *Y _{i}*. More specifically, candidate core spaces are enumerated from the features in *OR* in a bottom-up fashion, starting from candidate core spaces of size 1.

Corollary 5.2. *Any candidate core space is non-redundant*.

Proof. It is easy to see that any candidate core space of size 1 is non-redundant. Now, assume that all candidate core spaces of size *d* − 1 are non-redundant; we show that all candidate core spaces of size *d* are non-redundant. We prove this by contradiction.

Let *V* = {*f*_{v1}, *f*_{v2}, · · · , *f _{vd}*} be an arbitrary candidate core space of size *d*, and suppose that *V* is redundant, i.e., some *f _{vi}* ∈ *V* is a redundant feature of *V*. Then the size-(*d* − 1) subspace *V* \ {*f _{vi}*} is strongly correlated with *f _{vi}*. Since all candidate core spaces of size *d* − 1 are non-redundant, *V* \ {*f _{vi}*} would have been used as a core space and *f _{vi}* removed before any size-*d* candidate containing it was generated, which contradicts the assumption that *V* is a candidate core space. □

Corollary 5.3. *Let C be a candidate core space. If there exists f _{a}* ∈ *OR* \ *C such that* Δ*ID*(*C, f _{a}*) ≤ ε, *then Y* = {*f _{y}* | Δ*ID*(*C, f _{y}*) ≤ ε, *f _{y}* ∈ *OR*} ∪ *C is a reducible subspace with core space C*.

Proof. Let *Y* = {*f _{y}* | Δ*ID*(*C, f _{y}*) ≤ ε, *f _{y}* ∈ *OR*} ∪ *C*. By construction, every feature in *Y* is strongly correlated with *C*, and by Corollary 5.2, *C* is non-redundant. Hence *Y* satisfies the criteria of Definition 3.6 with core space *C*. □

In this method, for each candidate core space *C*, we need to calculate *ID*(*C*) and *ID*(*C* ∪ {*f _{a}*}) for every *f _{a}* ∈ *OR* \ *C*. This requires a large number of expensive intrinsic dimensionality estimations.

After finding the overall reducible subspace *OR*, we can apply the following heuristic to examine if a feature is strongly correlated with a feature subspace. The intuition behind our heuristic is similar to the one behind the correlation dimension.

Assume that the number of data points *N* in the dataset approaches infinity, and the features in the dataset are normalized so that the points are distributed from 0 to 1 in each dimension. Let *p _{s}* ∈ Ω_{P} be a randomly selected point and *C* a candidate core space of size *d*. Consider the points whose projections onto *C* fall in a *d*-dimensional cube of small side length centered at *p _{s}*. If a feature *f _{a}* is strongly correlated with *C*, then these points also concentrate in a small range on *f _{a}*; otherwise, they spread over the whole range of *f _{a}*.

Figures 4(a) and 4(b) show two examples in 2-dimensional spaces. In both examples, *d* = 1 and *C* = {*f _{a}*}. In Figure 4(a), feature *f _{b}* is strongly correlated with *f _{a}*: the points falling in a small interval on *f _{a}* also fall in a small interval on *f _{b}*. In Figure 4(b), *f _{b}* is not correlated with *f _{a}*: the points falling in a small interval on *f _{a}* spread over the whole range of *f _{b}*.

Therefore, for each candidate core space, we can check if a feature is correlated with it in the following way. We randomly sample *n* points *P* = {*p*_{s1}, *p*_{s2}, · · · , *p*_{sn}} from the dataset. Suppose that *C* = {*f*_{c1}, *f*_{c2}, · · · , *f*_{cd}} is the current candidate core space. For each feature *f _{a}* ∈ *OR* \ *C* and each sampled point *p _{si}*, we find the points falling in the *d*-dimensional cube *Q _{siC}* of side length *l* centered at *p _{si}* on *C*, and count how many of them also fall within side length *l* of *p _{si}* on *f _{a}*, i.e., in the (*d* + 1)-dimensional cube *Q _{siC}′*. If, averaged over the sampled points, the fraction of points in *Q _{siC}* that are also in *Q _{siC}′* is at least *τ*, then *f _{a}* is considered strongly correlated with *C*.

Concerning the choice of *l*, we can apply the following reasoning. If we let $l={\left(\frac{1}{N}\right)}^{{\scriptstyle \frac{1}{d+1}}}$, then the expected number of points in the cube *Q _{siC}′* is 1 if the points are uniformly distributed in the (*d* + 1)-dimensional space, i.e., if *f _{a}* is not correlated with *C*. In contrast, if *f _{a}* is strongly correlated with *C*, most of the points in *Q _{siC}* also fall in *Q _{siC}′*.

The second step of REDUS is shown in Algorithm 1 from Line 8 to Line 18. Note that in the worst case, the algorithm needs to enumerate all possible feature subspaces. However, in practice, the algorithm is very efficient since once an individual reducible subspace is found, all its features are removed. Only the remaining features need to be further examined.
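The point-distribution heuristic can be sketched as below. Since the exact counting rule is only outlined above, this is one plausible reading under stated assumptions: features normalized to [0, 1], cube side l = (1/N)^(1/(d+1)), and the function name, toy data, and seeds are our own.

```python
import numpy as np

def strongly_correlated(X, core, fa, n_samples=100, tau=0.9, seed=0):
    """Point-distribution heuristic: for points falling in a small cube on the
    candidate core space `core`, does feature fa also stay within a small range?
    Assumes every feature of X is normalized to [0, 1]."""
    rng = np.random.default_rng(seed)
    N, d = len(X), len(core)
    l = (1.0 / N) ** (1.0 / (d + 1))  # expected 1 point per (d+1)-dim cube under uniformity
    fractions = []
    for s in rng.choice(N, n_samples, replace=False):
        # points inside the d-dimensional cube of side l around X[s] on the core space
        in_cube = np.all(np.abs(X[:, core] - X[s, core]) <= l / 2, axis=1)
        in_cube[s] = False  # exclude the sampled point itself
        if not in_cube.any():
            continue
        # fraction of those points whose value on fa also stays within l of X[s, fa]
        fractions.append((np.abs(X[in_cube, fa] - X[s, fa]) <= l / 2).mean())
    return bool(np.mean(fractions) >= tau)

rng = np.random.default_rng(7)
f0 = rng.uniform(0, 1, 2000)
noise = rng.uniform(0, 1, 2000)
X = np.column_stack([f0, f0, noise])  # feature 1 duplicates feature 0; feature 2 is noise

print(strongly_correlated(X, core=[0], fa=1))  # True: f1 tracks the core space
print(strongly_correlated(X, core=[0], fa=2))  # False: f2 spreads over its full range
```

Each check costs only distance comparisons over sampled neighborhoods, which is why it is much cheaper than re-estimating the correlation dimension for every candidate subspace.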

To evaluate REDUS, we apply it to both synthetic datasets and real datasets. REDUS is implemented in Matlab 7.0.4. The experiments are performed on a 2.4 GHz PC with 1 GB memory running Windows XP.

As shown in Algorithm 1, REDUS generally requires three input parameters: ε, *n*, and *τ*. In the first step of finding the overall reducible subspace, ε is the threshold used to filter out irrelevant features. Since features strongly correlated with some core space can only change the intrinsic dimensionality by a small amount, the value of ε should be close to 0. In our experience, a good starting point is 0.1. After finding the reducible subspaces, the user can apply standard dimensionality reduction methods to check whether they are really correlated, and then adjust the ε value accordingly to find stronger or weaker correlations in the subspaces. In all our experiments, we set ε between 0.002 and 0.25. In the second step, *n* is the point sampling size and *τ* is the threshold for determining whether a feature is strongly correlated with a candidate core space. In our experiments, *n* is set to 10% of the total number of data points in the dataset, and *τ* is set to 90%.

To evaluate the effectiveness of REDUS, we generate two synthetic datasets.

The first synthetic dataset is as shown in Figure 1. There are 12 features, {*f*_{1}, *f*_{2}, · · · , *f*_{12}}, and 1000 data points in the dataset. 3 reducible subspaces: a 2-dimensional Swiss roll, a 1-dimensional helix-shaped line, and a 2-dimensional plane, are embedded in different 3-dimensional spaces respectively. The overall reducible subspace is {*f*_{1}, *f*_{2}, · · · , *f*_{9}}. Let *c _{i}* (1 ≤

In the first step, with ε = 0.25, REDUS successfully uncovers the overall reducible space. The parameter setting for the second step is *τ* = 90%, and point sampling size 10%. We run REDUS 10 times. In all 10 runs, REDUS successfully identifies the individual maximum reducible subspaces from the overall reducible subspace.

We generate another larger synthetic dataset as follows. There are 50 features, {*f*_{1}, *f*_{2}, · · · , *f*_{50}} and 1000 data points in the dataset. There are 3 reducible subspaces: *Y*_{1} = {*f*_{1}, *f*_{2}, · · · , *f*_{10}} reducible to a 2-dimensional space, *Y*_{2} = {*f*_{11}, *f*_{12}, · · · , *f*_{20}} reducible to a 1-dimensional space, and *Y*_{3} = {*f*_{21}, *f*_{22}, · · · , *f*_{30}} reducible to a 2-dimensional space. The remaining features contain random noises. Figures 5(a) and 5(b) show two examples of the embedded correlations in 3-dimensional subspaces. Figure 5(a) plots the point distribution on feature subspace {*f*_{1}, *f*_{2}, *f*_{9}} of *Y*_{1}, and Figure 5(b) plots the point distribution on feature subspace {*f*_{11}, *f*_{12}, *f*_{13}} of *Y*_{2}.

We apply REDUS to this synthetic dataset using various parameter settings. Table 1 shows the accuracy of finding the overall reducible subspace when ε takes different values. The recall is defined as *TP*/(*TP* + *FN*), and the precision is defined as *TP*/(*TP* + *FP*), where *TP*, *FP*, and *FN* represent the numbers of true positives, false positives, and false negatives, respectively. As we can see, REDUS is very accurate and robust to ε.
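For reference, these measures can be computed directly from the found and true feature sets. The helper below is trivial, and the example sets are made up for illustration (they are not results from Table 1).

```python
def recall_precision(found, truth):
    """Recall = TP/(TP+FN), precision = TP/(TP+FP) over feature sets."""
    tp = len(found & truth)   # correctly reported features
    fn = len(truth - found)   # missed features
    fp = len(found - truth)   # spurious features
    return tp / (tp + fn), tp / (tp + fp)

# e.g., 9 truly reducible features, 8 of them found, plus 1 spurious feature
recall, precision = recall_precision(found={1, 2, 3, 4, 5, 6, 7, 8, 31},
                                     truth=set(range(1, 10)))
print(recall, precision)  # 8/9 and 8/9, i.e., about 0.889 each
```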

Tables 2(a) and 2(b) show the identified maximum reducible subspaces when varying *τ* and *n*. Table 2(a) shows results under different settings of *τ*. The point sampling size *n* in this table is the default value, i.e., 10% of the total number of data points. Although changing *τ* may cause some mis-classified features, Table 2(a) still shows that REDUS achieves reasonably high accuracy under different settings of *τ*. The reason for different decompositions of maximum reducible subspaces is that features in different maximum reducible subspaces may still have moderate correlations. If these correlated features are identified, they will be removed from the overall reducible subspace, since REDUS focuses on the case where the maximum reducible subspaces are non-overlapping. If we allow each feature to be in different maximum reducible subspaces, the findings under different *τ* should be more similar to each other. Finding overlapping reducible subspaces is computationally more demanding, and is an interesting problem worth further exploration.

Accuracy of identifying the maximum reducible subspaces from the overall reducible subspace when varying *τ* and *n*

Table 2(b) shows the identified maximum reducible subspaces when varying the number of the sampled points *n*. *τ* is set to be 0.92 in this table. As shown in the table, REDUS is not sensitive to the size of the sampled data points.

To evaluate the efficiency and scalability of REDUS, we apply it to synthetic dataset 2. The default dataset for efficiency evaluation contains 1000 points and 50 features if not specified otherwise. The default values for the parameters are the same as before.

Figure 6(a) shows the runtime of finding the overall reducible subspace when varying the number of data points. The runtime scales roughly quadratically. This is because, when computing the correlation dimensions, we need to calculate all pairwise distances between the data points, which is clearly quadratic in the number of points.

Figure 6(b) shows that the runtime of finding the overall reducible subspace is linear in the number of features, since REDUS scans each feature only once to examine whether it is strongly correlated with the subspace of the remaining features. This linear scalability is desirable for datasets containing a large number of features.
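
A minimal sketch of this one-pass scan, using numerical matrix rank as a linear stand-in for the intrinsic dimensionality estimator (REDUS itself uses the correlation dimension, which also captures nonlinear correlations); the function names and toy data are ours, not from the REDUS implementation:

```python
import numpy as np

def linear_intrinsic_dim(X, tol=1e-8):
    # numerical rank as a linear stand-in for intrinsic dimensionality
    return np.linalg.matrix_rank(X, tol=tol)

def overall_reducible_subspace(X, intrinsic_dim=linear_intrinsic_dim):
    """One pass over the columns of X (points x features): feature j
    joins the overall reducible subspace if removing it leaves the
    intrinsic dimensionality unchanged, i.e. j is strongly correlated
    with the subspace of the remaining features."""
    d_full = intrinsic_dim(X)
    reducible = []
    for j in range(X.shape[1]):
        rest = np.delete(X, j, axis=1)
        if intrinsic_dim(rest) == d_full:   # feature j adds no dimension
            reducible.append(j)
    return reducible

rng = np.random.default_rng(0)
a, b = rng.random(100), rng.random(100)
X = np.column_stack([a, b, a + b, rng.random(100)])  # feature 2 = f(0, 1)
print(overall_reducible_subspace(X))  # → [0, 1, 2]
```

Each of the first three features is determined by the other two, so all three belong to the overall reducible subspace, while the independent fourth feature does not.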

Figures 7(a) and 7(b) compare the runtime of identifying the individual maximum reducible subspaces from the overall reducible subspace using the correlation dimension as the intrinsic dimensionality estimator versus using the point distribution heuristic. Since calculating the intrinsic dimensionality is relatively expensive, the former often cannot finish in a reasonable amount of time. The point distribution heuristic, on the other hand, is much more efficient and scales linearly in both the number of points and the number of features.

We apply REDUS to the NBA statistics dataset, available at http://sports.espn.go.com/nba/teams/stats?team=Bos&year=2007&season=2. It contains 28 features for 200 players of the 2006-2007 season. Since the features have different value scales, we normalize each feature so that its values lie between 0 and 1. We report two interesting correlations found in the dataset in Figures 8(a) and 8(b). Note that the features shown in the figures are mean-centered.
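
The normalization described here is ordinary per-feature min-max rescaling; a small sketch (the guard for constant features is our addition, not stated in the paper):

```python
import numpy as np

def min_max_normalize(X):
    """Rescale each feature (column) of X to [0, 1], as done for the
    NBA statistics before running REDUS."""
    mins = X.min(axis=0)
    spans = X.max(axis=0) - mins
    spans[spans == 0] = 1.0          # leave constant features at 0
    return (X - mins) / spans

X = np.array([[10., 200.], [20., 400.], [30., 300.]])
print(min_max_normalize(X))  # rows: [0, 0], [0.5, 1], [1, 0.5]
```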

The correlation shown in Figure 8(a) indicates that the subspace of three features, defensive rebounds (DEF), offensive rebounds (OFF), and total rebounds (TOT), is strongly correlated and reducible to a 2-dimensional space. This is an obvious correlation that one would expect, since TOT = OFF + DEF. As shown in the figure, the points lie on a 2-dimensional plane in the 3-dimensional subspace.

Figure 8(b) shows a nonlinear correlation identified by REDUS. The subspace of three features, field goals made (FGM), field goals attempted (FGA), and field goal percentage (FG%), is strongly correlated and reducible to a 2-dimensional space. Clearly, the data points in this 3-dimensional subspace are distributed on a 2-dimensional manifold.
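
One way to read this manifold is through the definitional relation FG% = FGM / FGA: the third feature is a deterministic but nonlinear function of the other two. The sketch below, on hypothetical shooting numbers rather than the actual NBA data, shows that such a subspace is linearly full rank (so a global linear method like PCA cannot compress it) even though it is intrinsically 2-dimensional:

```python
import numpy as np

# Hypothetical per-player shooting numbers; the third feature is a
# deterministic (nonlinear) function of the first two, so the three
# features span only a 2-dimensional manifold.
rng = np.random.default_rng(1)
fga = rng.uniform(100, 1500, size=200)          # field goals attempted
fgm = fga * rng.uniform(0.35, 0.55, size=200)   # field goals made
fg_pct = fgm / fga                              # FG% = FGM / FGA

X = np.column_stack([fgm, fga, fg_pct])
# the mean-centered points are linearly full rank in 3-d space ...
print(np.linalg.matrix_rank(X - X.mean(axis=0)))   # → 3
# ... yet FG% is exactly recovered from the other two features
print(np.allclose(X[:, 2], X[:, 0] / X[:, 1]))     # → True
```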

The wage dataset from the 1985 Current Population Survey, available at http://lib.stat.cmu.edu/datasets/CPS_85_Wages, consists of 534 data points with 11 features. The numerical features are age, years of education, years of work experience, and wage. We apply REDUS to this dataset to find strongly correlated feature subsets.

REDUS identifies one correlation among the features age, years of education, and wage. Figure 9 shows the distribution of the data points in this 3-dimensional feature subspace. From the figure, we can see that wage is clearly a linear function of age and years of education; therefore, this 3-dimensional space is reducible to the 2-dimensional plane embedded in it.
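
A least-squares plane fit is one way to verify such a linear reduction. The sketch below uses synthetic stand-in data with an assumed linear relation; the coefficients are illustrative, not those of the actual CPS data:

```python
import numpy as np

# Synthetic stand-in: wage as an assumed linear function of age and
# years of education (coefficients are illustrative only).
rng = np.random.default_rng(2)
age = rng.uniform(18, 65, 534)
educ = rng.uniform(8, 18, 534)
wage = 0.1 * age + 0.8 * educ + 1.5

# fit the plane wage = c0 * age + c1 * educ + c2 by least squares
A = np.column_stack([age, educ, np.ones_like(age)])
coef, *_ = np.linalg.lstsq(A, wage, rcond=None)
print(np.round(coef, 3))   # → [0.1 0.8 1.5]
```

A small residual from such a fit confirms that the 3-dimensional subspace reduces to the 2-dimensional plane.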

We apply REDUS to the breast cancer dataset, available at the UCI Machine Learning Archive. The dataset contains 569 data points and 30 features. The features are statistics of radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension, computed from a digitized image of a fine needle aspirate of a breast mass; they describe characteristics of the cell nuclei present in the image.

Figure 10 shows one of the nonlinear correlations identified by REDUS, among three features: the mean radius, the largest radius, and the mean texture. From the figure, we can see that these three features are strongly correlated and reducible to a 1-dimensional space.

In this paper, we investigate the problem of finding strongly correlated feature subspaces in high dimensional datasets. The correlations can be linear or nonlinear, and may be invisible to global feature transformation methods such as PCA and ISOMAP. Utilizing the concept of intrinsic dimensionality, we formalize this problem as the discovery of maximum reducible subspaces in the dataset, and present an effective algorithm, REDUS, to find them. The experimental results show that REDUS can effectively and efficiently find these interesting local correlations.

Our work reported in this paper focuses on the case where the maximum reducible subspaces are non-overlapping. One interesting direction for future work is to extend it to the general case where a feature can belong to multiple maximum reducible subspaces. Another is finding feature subspaces that are strongly correlated on a subset of the data points, a more general problem with wider applications.

This research was partially supported by EPA grant STAR-RD832720, NSF grants IIS-0448392, IIS-0534580, and a Microsoft New Faculty Fellowship.

^{1}REDUS stands for REDUcible Subspaces.

1. Aggarwal C, Yu P. Finding generalized projected clusters in high dimensional spaces. SIGMOD. 2000

2. Alizadeh A, et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature. 2000;403:503–11. [PubMed]

3. Barbara D, Chen P. Using the fractal dimension to cluster datasets. KDD. 2000

4. Belkin M, Niyogi P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation. 2003

5. Belussi A, Faloutsos C. Self-spacial join selectivity estimation using fractal concepts. ACM Transactions on Information Systems. 1998;16(2):161–201.

6. Blum A, Langley P. Selection of relevant features and examples in machine learning. Artificial Intelligence. 1997;97:245–271.

7. Bohm C, Kailing K, Kroger P, Zimek A. Computing clusters of correlation connected objects. SIGMOD. 2004

8. Borg I, Groenen P. Modern multidimensional scaling. Springer; New York: 1997.

9. Camastra F, Vinciarelli A. Estimating intrinsic dimension of data with a fractal-based approach. IEEE Trans. on Pattern Analysis and Machine Intelligence. 2002;24(10):1404–1407.

10. Cover TM, Thomas JA. The Elements of Information Theory. Wiley & Sons; New York: 1991.

11. Eisen M, Spellman P, Brown P, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA. 1998;95:14863–68. [PubMed]

12. Faloutsos C, Kamel I. Beyond uniformity and independence: analysis of r-trees using the concept of fractal dimension. PODS. 1994

13. Fukunaga K. Intrinsic dimensionality extraction. In: Krishnaiah PR, Kanal LN, editors. Classification, Pattern Recognition and Reduction of Dimensionality, Volume 2 of Handbook of Statistics. North-Holland; Amsterdam: 1982. pp. 347–360.

14. Fukunaga K, Olsen DR. An algorithm for finding intrinsic dimensionality of data. IEEE Transactions on Computers. 1976;20(2):165–171.

15. Gionis A, Hinneburg A, Papadimitriou S, Tsaparas P. Dimension induced clustering. KDD. 2005

16. Golub G, Van Loan C. Matrix computations. Johns Hopkins University Press; Baltimore, Maryland: 1996.

17. Iyer V, et al. The transcriptional program in the response of human fibroblasts to serum. Science. 1999;283:83–87. [PubMed]

18. Jolliffe I. Principal component analysis. Springer; New York: 1986.

19. Kendall M, Gibbons JD. Rank Correlation Methods. Oxford University Press; New York: 1990.

20. Lay DC. Linear Algebra and Its Applications. Addison Wesley; 2005.

21. Levina E, Bickel PJ. Maximum likelihood estimation of intrinsic dimension. Advances in Neural Information Processing Systems. 2005

22. Liu H, Motoda H. Feature selection for knowledge discovery and data mining. Kluwer Academic Publishers; Boston: 1998.

23. Pagel B-U, Korn F, Faloutsos C. Deflating the dimensionality curse using multiple fractal dimensions. ICDE. 2000

24. Papadimitriou S, Kitawaga H, Gibbons PB, Faloutsos C. Loci: Fast outlier detection using the local correlation integral. ICDE. 2003

25. Rasband SN. Chaotic Dynamics of Nonlinear Systems. Wiley-Interscience; 1990.

26. Reynolds HT. The analysis of cross-classifications. The Free Press; New York: 1977.

27. Roweis S, Saul L. Nonlinear dimensionality reduction by locally linear embedding. Science. 2000;290(5500):2323–2326. [PubMed]

28. Schroeder M. Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise. W. H. Freeman; New York: 1991.

29. Tenenbaum JB, de Silva V, Langford JC. A global geometric framework for nonlinear dimensionality reduction. Science. 2000;290(5500):2319–2323. [PubMed]

30. Tung AKH, Xin X, Ooi BC. Curler: Finding and visualizing nonlinear correlation. SIGMOD. 2005

31. Yu L, Liu H. Feature selection for high-dimensional data: a fast correlation-based filter solution. ICML. 2003

32. Zhang X, Pan F, Wang W. Care: Finding local linear correlations in high dimensional data. ICDE. 2008 [PMC free article] [PubMed]

33. Zhao Z, Liu H. Searching for interacting features. IJCAI. 2007
