Proc ACM Int Conf Inf Knowl Manag. Author manuscript; available in PMC 2010 August 4.
Published in final edited form as:
Proc ACM Int Conf Inf Knowl Manag. 2008; 2008: 961–970.
PMCID: PMC2916238
NIHMSID: NIHMS131996

REDUS: Finding Reducible Subspaces in High Dimensional Data

Abstract

Finding latent patterns in high dimensional data is an important research problem with numerous applications. The most well known approaches for high dimensional data analysis are feature selection and dimensionality reduction. Being widely used in many applications, these methods aim to capture global patterns and are typically performed in the full feature space. In many emerging applications, however, scientists are interested in the local latent patterns held by feature subspaces, which may be invisible via any global transformation.

In this paper, we investigate the problem of finding strong linear and nonlinear correlations hidden in feature subspaces of high dimensional data. We formalize this problem as identifying reducible subspaces in the full dimensional space. Intuitively, a reducible subspace is a feature subspace whose intrinsic dimensionality is smaller than the number of features. We present an effective algorithm, REDUS, for finding the reducible subspaces. Two key components of our algorithm are finding the overall reducible subspace, and uncovering the individual reducible subspaces from the overall reducible subspace. A broad experimental evaluation demonstrates the effectiveness of our algorithm.

Keywords: Reducible Subspace, High Dimensional Data

1. INTRODUCTION

Many real life applications deal with high dimensional data. In the biomedical domain, for example, advanced microarray techniques [2, 11, 17] make it possible to monitor the expression levels of hundreds to thousands of genes simultaneously. By mapping each gene to a feature, gene expression data can be treated as data points distributed in a very high dimensional feature space. To make sense of such high dimensional data, extensive research has been done in finding the latent structures hidden in the large number of features. Two well known approaches in analyzing high dimensional data are feature selection and dimensionality reduction.

The goal of feature selection methods [6, 22, 31, 33] is to find a single representative subset of features that are most relevant for the data mining task at hand, such as classification. The selected features generally have low correlation with each other but have strong correlation with the target feature.

Dimensionality reduction [4, 8, 18, 27, 29] is widely used as a key component of many approaches in analyzing high dimensional data. The insight behind dimensionality reduction methods is that a high dimensional dataset may exhibit interesting patterns on a lower dimensional subspace due to correlations among the features. Though very successful in finding the low dimensional structures embedded in a high dimensional space, these methods are usually performed in the full feature space. They aim to model the global latent structure of the data and do not separate the impact of any original features nor identify latent patterns hidden in some feature subspaces. Please refer to Section 2 for a more detailed discussion of the related work.

1.1 Motivating Example

In many emerging applications, the datasets usually consist of thousands to hundreds of thousands of features. In such high dimensional datasets, some feature subsets may be strongly correlated, while others may not have any correlation at all. In these applications, it is more desirable to find the correlations that are hidden in feature subspaces. For example, in gene expression data analysis, a group of genes having strong correlation is of high interest to biologists, since it helps to infer unknown functions of genes [11] and gives rise to hypotheses regarding the mechanism of the transcriptional regulatory network [17]. We refer to such correlation among a subset of features as a local correlation, in comparison with the global correlations found by full dimensional feature reduction methods. Since such local correlations only exist in some subspaces of the full dimensional space, they are invisible to the full dimensional reduction methods. In [32], an algorithm is proposed to find local linear correlations in high dimensional data. However, in real applications, a feature subspace can be either linearly or nonlinearly correlated. The problem of finding both linear and nonlinear correlations in feature subspaces remains open.

For example, Figure 1 shows a dataset consisting of 12 features, {f1, f2, · · · , f12}, and 1000 data points. Embedded in the full dimensional space, the feature subspaces {f1, f2, f3} and {f4, f5, f6} are nonlinearly correlated, and {f7, f8, f9} is linearly correlated. Features {f10, f11, f12} contain random noise.

Figure 1
An example dataset

Applying dimensionality reduction methods to the full dimensional space cannot uncover the local correlations hidden in feature subspaces. For example, Figure 2(a) shows the result of applying Principal Component Analysis (PCA) [18] to the full dimensional space of the example dataset shown in Figure 1. In this figure, we plot the point distribution on the first 3 principal components found by PCA. Clearly, we cannot find any pattern that is similar to the patterns embedded in the dataset. Similarly, Figure 2(b) shows the result of applying ISOMAP [29] to reduce the dimensionality of the dataset down to 3. No desired pattern is found in this low dimensional structure either.

Figure 2
Applying dimensionality reduction methods to the full dimensional space of the example dataset

How can we identify these local correlations hidden in the full dimensional space?

This question is two-fold. First, we need to identify the strongly correlated feature subspaces, i.e., a subset of features that are strongly correlated and actually have low dimensional structures. Then, after these locally correlated feature subsets are found, we can apply the existing dimensionality reduction methods to identify the low dimensional structures embedded in them.

Many methods have been proposed to address the second aspect of the question, i.e., given a correlated feature space, finding the low dimensional embedding in it. The first aspect of the question, however, is largely untouched. In this paper, we investigate the first aspect of the question, i.e., identifying the strongly correlated feature subspaces.

1.2 Challenges and Contributions

(1) In this paper, we investigate the problem of finding correlations hidden in the feature subspaces of high dimensional data. The correlations can be either linear or nonlinear. To the best of our knowledge, our work is the first attempt to find local linear and nonlinear correlations hidden in feature subspaces.

Many methods for modelling correlations can be found in the literature, such as mutual information [10], Pearson correlation [26], and rank correlation [19]. However, these commonly used measures capture correlations between two features. In high dimensional data, a correlation may involve a large number of features, i.e., it may be a high order correlation. Note that a strong high order correlation does not necessarily imply strong correlations between feature pairs. For example, the 3 features {f1, f2, f3} in Figure 1 are strongly correlated, since the 3-dimensional Swiss roll structure actually lies on a 2-dimensional manifold. Figures 3(a) to 3(c) show the projections of the Swiss roll onto the spaces of two features. As we can see from the figures, there are no clear pairwise correlations between any two features.

Figure 3
Pairwise correlations of the Swiss roll in the example dataset

We adopt the concept of intrinsic dimensionality [13] to model such high order correlations. We formalize the problem as finding reducible subspaces in the full dimensional space. Informally, a feature subspace is reducible if its intrinsic dimensionality is smaller than the number of its features. Various intrinsic dimensionality estimators have been developed [9, 14, 21]. Our problem formalization does not depend on any particular method for estimating the intrinsic dimensionality. We show two necessary properties that any estimator should satisfy in order to be a generalization of the well-defined concepts in the linear case.

(2) We develop an effective algorithm REDUS1 to find the reducible subspaces in the dataset. REDUS consists of the following two steps.

It first finds the union of all reducible subspaces, i.e., the overall reducible subspace. The second step is to uncover the individual reducible subspaces in the overall reducible subspace. The key component of this step is to examine whether a feature is strongly correlated with a feature subspace. We develop a method utilizing point distributions to distinguish the features that are strongly correlated with a feature subspace from those that are not. Our method achieves accuracy similar to that of directly using intrinsic dimensionality estimators, but at much lower computational cost.

Extensive experiments on synthetic and real life datasets demonstrate the effectiveness of REDUS.

2. RELATED WORK

Feature Selection

Feature selection methods [6, 22, 31, 33] try to find a subset of features that are most relevant for a certain data mining task, such as classification. In order to find the relevant feature subset, these methods search through various subsets of features and evaluate these subsets according to certain criteria. Feature selection methods can be further divided into two groups according to their evaluation criteria: wrapper and filter. Wrapper models evaluate feature subsets by their predictive accuracy using statistical re-sampling or cross-validation. In filter techniques, the feature subsets are evaluated by their information content, using statistical dependence or information-theoretic measures.

Feature selection finds one feature subset for the entire dataset. The selected feature subset usually contains the features that have low correlation with each other but have strong correlation with the target feature. The correlation measurements are usually defined for feature pairs, such as mutual information and Pearson correlation. Our work, on the other hand, is to find the subsets of features, in which the features are strongly correlated. Moreover, the correlations are not limited to feature pairs.

Dimensionality reduction

Dimensionality reduction methods can be categorized into linear methods, such as Multi-Dimensional Scaling (MDS) [8] and Principal Component Analysis (PCA)[18], and non-linear methods, such as Local Linear Embedding (LLE) [27], ISOMAP [29], and Laplacian eigenmaps [4]. For high dimensional datasets, if there exist low dimensional subspaces or manifolds embedded in the full dimensional spaces, these methods are successful in identifying these low dimensional embeddings.

Dimensionality reduction methods are usually applied on the full dimensional space to capture the independent components among all the features. They are not designed to address the problem of identifying correlation in feature subspaces. It is reasonable to apply them to the feature spaces that are indeed correlated. However, in very high dimensional datasets, different feature subspaces may have different correlations, and some feature subspace may not have any correlation at all. In this case, dimensionality reduction methods should be applied after such strongly correlated feature subspaces have been identified.

Correlation Clustering

The goal of correlation clustering methods is to find clusters hidden in projected feature spaces [1, 7, 30]. They can be viewed as combinations of clustering methods and dimensionality reduction methods. Both ORCLUS [1] and 4C [7] can be treated as PCA-lized clustering methods. To find the low dimensional clusters, ORCLUS and 4C apply PCA on subsets of data points in the full dimensional space and merge the point subsets having similar orientations. Therefore, they do not touch the problem of finding reducible subspaces. Instead, they implicitly assume the full dimensional space is reducible for certain subsets of data points. CURLER [30] finds clusters having non-linear correlations in subspaces. It first applies EM clustering algorithm to find a large number of micro-clusters, and then merges clusters with large overlaps. Since in its first step, micro-clusters are formed by applying EM to the full dimensional space, CURLER also does not address the problem of finding the reducible subspaces.

Intrinsic Dimensionality

Due to correlations among features, a high dimensional dataset may lie in a subspace with dimensionality smaller than the number of features [9, 14, 21]. The intrinsic dimensionality can be treated as the minimum number of free variables required to define the data without any significant information loss [13]. For example, as shown in Figure 1, in the 3-dimensional space of {f1, f2, f3}, the data points lie on a Swiss roll, which is actually a 2-dimensional manifold. Therefore, its intrinsic dimensionality is 2.

The concept of intrinsic dimensionality has many applications in the database and data mining communities, such as clustering [3, 15], outlier detection [24], nearest neighbor queries [23], and spatial query selectivity estimation [5, 12]. Different definitions of intrinsic dimensionality can be found in the literature. For example, in linear cases, matrix rank [16] and PCA [18] can be used to estimate intrinsic dimensionality. For nonlinear cases, estimators such as box counting dimension, information dimension, and correlation dimension have been developed. These intrinsic dimensionality estimators are sometimes collectively referred to as fractal dimension. Please see [25, 28] for good coverage of the topics of intrinsic dimensionality estimation and its applications.

Local Linear Correlation

In [32], the CARE algorithm has been proposed for finding local linear correlations. Adopting a similar criterion used in PCA, the strongly correlated feature subsets are formalized as feature subspaces having small residual variances. However, this work only focuses on linear correlations. The problem of finding nonlinear local correlations remains uninvestigated.

3. PROBLEM FORMALIZATION

In this section, we utilize intrinsic dimensionality to formalize the problem of finding strongly correlated feature subspaces.

Suppose that the dataset Ω consists of N data points and M features. Let ΩP = {p1, p2, · · · , pN} denote the point set, and ΩF = {f1, f2, · · · , fM} denote the feature set in Ω, respectively. We use ID(V) to represent the intrinsic dimensionality of a feature subspace V ⊆ ΩF.

Intrinsic dimensionality provides a natural way to examine whether a feature is correlated with some feature subspace: if a feature fa ∈ ΩF is strongly correlated with a feature subspace V ⊆ ΩF, then adding fa to V should not cause much change in the intrinsic dimensionality of V. The following definition formalizes this intuition.

Definition 3.1. (Strong Correlation)

A feature subspace V ⊆ ΩF and a feature fa ∈ ΩF have strong correlation, if

ΔID(V, fa) = ID(V ∪ {fa}) − ID(V) ≤ ε.

In this definition, ε is a user specified threshold. Smaller ε value implies stronger correlation, and larger ε value implies weaker correlation. If V and fa have strong correlation, we also say that they are strongly correlated.
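For concreteness, the check in Definition 3.1 can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: `intrinsic_dim` stands for any intrinsic dimensionality estimator (e.g., the correlation dimension introduced in Section 4.1), and `data` is assumed to be an N × M NumPy array with feature subspaces given as lists of column indices.

```python
def delta_id(data, V, fa, intrinsic_dim):
    """ΔID(V, fa) = ID(V ∪ {fa}) − ID(V) for column indices V and a column fa."""
    V = sorted(V)
    return intrinsic_dim(data[:, V + [fa]]) - intrinsic_dim(data[:, V])

def strongly_correlated(data, V, fa, intrinsic_dim, eps):
    """V and fa have strong correlation if ΔID(V, fa) <= eps (Definition 3.1)."""
    return delta_id(data, V, fa, intrinsic_dim) <= eps
```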

Definition 3.2. (Redundancy)

Let V = {fv1, fv2, · · · , fvm} ⊆ ΩF. A feature fvi ∈ V is a redundant feature of V, if fvi has strong correlation with the feature subspace consisting of the remaining features of V, i.e.,

ΔID({fv1, · · · , fvi−1, fvi+1, · · · , fvm}, fvi) ≤ ε.

We say V is a redundant feature subspace if it has at least one redundant feature. Otherwise, V is a non-redundant feature subspace.

Note that in Definitions 3.1 and 3.2, ID(V) does not depend on a particular intrinsic dimensionality estimator. Any existing estimator can be applied when calculating ID(V). Moreover, we do not require that the intrinsic dimensionality estimator reflects the exact dimensionality of the dataset. However, in general, a good intrinsic dimensionality estimator should satisfy two basic properties.

First, if a feature is redundant in some feature subspace, then it is also redundant in the supersets of the feature subspace. We formalize this intuition as the following property.

Property 3.3. For V ⊆ ΩF, if ΔID(V, fa) ≤ ε, then ∀U (V ⊆ U ⊆ ΩF), ΔID(U, fa) ≤ ε.

This is a reasonable requirement, since if fa is strongly correlated with V ⊆ U, then adding fa to U will not greatly alter its intrinsic dimensionality.

From this property, it is easy to see that, if feature subspace U is non-redundant, then all of its subsets are non-redundant, which is clearly a desirable property for the feature subspaces.

Corollary 3.4. If U ⊆ ΩF is non-redundant, then ∀V ⊆ U, V is also non-redundant.

The following property extends the concept of a basis [20] in a linear space to nonlinear spaces using intrinsic dimensionality. In a linear space, suppose that V and U contain the same number of vectors, and the vectors in V and U are all linearly independent. If the vectors of U are in the subspace spanned by the vectors of V, then the vectors in V and the vectors in U span the same subspace. (The span of a set of vectors consists of all linear combinations of the vectors.) Similarly, in Property 3.5, for two non-redundant feature subspaces V and U, we require that if the features in U are strongly correlated with V, then U and V are strongly correlated with the same subset of features.

Property 3.5. Let V = {fv1, fv2, · · · , fvm} ⊆ ΩF and U = {fu1, fu2, · · · , fum} ⊆ ΩF be two non-redundant feature subspaces. If ∀fui ∈ U, ΔID(V, fui) ≤ ε, then ∀fa ∈ ΩF, ΔID(U, fa) ≤ ε iff ΔID(V, fa) ≤ ε.

Intuitively, if a feature subspace Y (Y ⊆ ΩF) is redundant, then Y should be reducible to some subspace, say V (V ⊂ Y). Concerning the possible choices of V, we are most interested in the smallest one that Y can be reduced to, since it represents the intrinsic dimensionality of Y. We now give the formal definitions of reducible subspace and its core space.

Definition 3.6. (Reducible Subspace and Core Space) Y ⊆ ΩF is a reducible subspace if there exists a non-redundant subspace V (V ⊂ Y), such that

  1. ∀fa ∈ Y, ΔID(V, fa) ≤ ε, and
  2. ∀U ⊆ Y (|U| ≤ |V|), U is non-redundant.

We say V is the core space of Y, and Y is reducible to V.

Criterion (1) in Definition 3.6 says that all features in Y are strongly correlated with the core space V. The meaning of criterion (2) is that the core space is the smallest non-redundant subspace of Y with which all other features of Y are strongly correlated.

Among all reducible subspaces, we are most interested in the maximum ones. A maximum reducible subspace is a reducible subspace that includes all features that are strongly correlated with its core space.

Definition 3.7. (Maximum Reducible Subspace)

Y ⊆ ΩF is a maximum reducible subspace if

  1. Y is a reducible subspace, and
  2. ∀fb ∈ ΩF, if fb ∉ Y, then ΔID(V, fb) > ε, where V is the core space of Y.

Let {Y1, Y2, · · · , YS} be the set of all maximum reducible subspaces in the dataset. The union of the maximum reducible subspaces, OR = Y1 ∪ Y2 ∪ · · · ∪ YS, is referred to as the overall reducible subspace.

Note that Definition 3.7 works for the general case where a feature can be in different maximum reducible subspaces. In this paper, we focus on the special case where maximum reducible subspaces are non-overlapping, i.e., each feature can be in at most one maximum reducible feature subspace.

To find the maximum reducible subspaces in the dataset, REDUS adopts a two-step approach. The first step is to find the overall reducible subspace. Then, from the overall reducible subspace, it identifies the individual maximum reducible subspaces. In the next section, we present the algorithm for finding the overall reducible subspace. In Section 5, we discuss the method for finding individual maximum reducible subspaces.

4. OVERALL REDUCIBLE SUBSPACE

In this section, we present the algorithm for finding the overall reducible subspace. We first give a short introduction to the intrinsic dimensionality estimator. Then we present the algorithm for finding the overall reducible subspace and the proof of its correctness.

4.1 Intrinsic Dimensionality Estimator

To find the overall reducible subspace in the dataset, we adopt correlation dimension [25, 28], which can measure both linear and nonlinear intrinsic dimensionality, as our intrinsic dimensionality estimator since it is computationally more efficient than other estimators while its quality of estimation is similar to others. In practice, we observe that correlation dimension satisfies Properties 3.3 and 3.5, although we do not provide the proof here. In what follows, we give a brief introduction of correlation dimension.

Let Y be a feature subspace of the dataset, i.e., Y ⊆ ΩF. Suppose that the number of points N in the dataset approaches infinity. Let dis(pi, pj, Y) represent the distance between two data points pi and pj in feature subspace Y. Let BY(pi, r) be the subset of points contained in a ball of radius r centered at point pi in subspace Y, i.e.,

BY(pi, r) = {pj | pj ∈ ΩP, dis(pi, pj, Y) ≤ r}.

The average fraction of pairs of data points within distance r is

CY(r) = lim_{N→∞} (1/N²) Σ_{pi ∈ ΩP} |BY(pi, r)|.

The correlation dimension of Y is then defined as

ID(Y) = lim_{r, r′→0} log[CY(r)/CY(r′)] / log[r/r′].

In practice, N is a finite number, and CY(r) is estimated using (1/N²) Σ_{pi ∈ ΩP} |BY(pi, r)|. Since log[CY(r)/CY(r′)] / log[r/r′] = (log CY(r) − log CY(r′)) / (log r − log r′), the correlation dimension is the growth rate of the function CY(r) in log-log scale, and it is estimated as the slope of the line that best fits the function in the least squares sense.

The intuition behind the correlation dimension is the following. For points arranged on a line, one expects to find twice as many points when doubling the radius. For points scattered on a 2-dimensional plane, when doubling the radius, we expect the number of points to increase quadratically. Generalizing this idea to m-dimensional space, we have CY(r)/CY(r′) = (r/r′)^m. Therefore, the intrinsic dimensionality of feature subspace Y can simply be treated as the growth rate of the function CY(r) in log-log scale.
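As an illustration, a least-squares estimate of the correlation dimension can be sketched as follows. This is a simplified sketch rather than the authors' implementation; in particular, the choice of the range of radii is an assumed heuristic.

```python
import numpy as np
from scipy.spatial.distance import pdist

def correlation_dimension(data, n_radii=10):
    """Estimate the correlation dimension of an N x d array `data`:
    the slope of log C(r) versus log r, where C(r) is the fraction
    of point pairs within distance r."""
    dists = pdist(data)                      # all pairwise distances
    dists = dists[dists > 0]
    # assumed heuristic: radii spread between a small and the median pairwise distance
    radii = np.logspace(np.log10(np.percentile(dists, 1)),
                        np.log10(np.median(dists)), n_radii)
    C = np.array([np.mean(dists <= r) for r in radii])
    slope, _ = np.polyfit(np.log(radii), np.log(C), 1)   # least squares fit in log-log scale
    return slope
```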

4.2 Finding Overall Reducible Subspace

The following theorem sets the foundation for the efficient algorithm to find the overall reducible subspace.

Theorem 4.1. Suppose that Y ⊆ ΩF is a maximum reducible subspace and V ⊂ Y is its core space. Then ∀U ⊂ Y (|U| = |V|), U is also a core space of Y.

Proof. We need to show that U satisfies the criteria in Definition 3.7. Let V = {fv1, fv2, · · · , fvm} and U = {fu1, fu2, · · · , fum}.

Since U ⊂ Y and |U| = |V|, from the definition of reducible subspace, U is non-redundant, and for every fui ∈ U, ΔID(V, fui) ≤ ε. For every fa ∈ Y, we have ΔID(V, fa) ≤ ε; thus, from Property 3.5, we have ΔID(U, fa) ≤ ε. Similarly, for every fb ∉ Y, ΔID(V, fb) > ε, and thus ΔID(U, fb) > ε.

Therefore, U is also a core space of Y. □

Theorem 4.1 tells us that any subset U ⊂ Y of size |V| is also a core space of Y.

Suppose that {Y1, Y2, · · · , YS} is the set of all maximum reducible subspaces in the dataset and the overall reducible subspace is OR = Y1 ∪ Y2 ∪ · · · ∪ YS. To find OR, we can apply the following method. For every fa ∈ ΩF, let RFfa = {fb | fb ∈ ΩF, b ≠ a} be the set of remaining features in the dataset. We calculate ΔID(RFfa, fa). The overall reducible subspace is OR = {fa | ΔID(RFfa, fa) ≤ ε}. We now prove the correctness of this method.

Corollary 4.2. OR = {fa | ΔID(RFfa, fa) ≤ ε}.

Proof. Let fy be an arbitrary feature in the overall reducible subspace, i.e., fy ∈ Yi ⊆ OR for some i. From Theorem 4.1, there exists a core space Vi ⊂ Yi with fy ∉ Vi. Thus ΔID(Vi, fy) ≤ ε. Since fy ∉ Vi, we have Vi ⊆ RFfy. From Property 3.3, we have ΔID(RFfy, fy) ≤ ε.

Similarly, if fy ∉ OR, then ΔID(RFfy, fy) > ε.

Therefore, we have OR = {fy | ΔID(RFfy, fy) ≤ ε}. □

The algorithm for finding the overall reducible subspace is shown in Algorithm 1, Lines 1 to 7. Note that the procedure for finding the overall reducible subspace is linear in the number of features in the dataset.

Algorithm 1: The REDUS algorithm
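A minimal sketch of this first step, following Corollary 4.2, is given below. It is not the pseudo-code of Algorithm 1 itself; it assumes the `correlation_dimension` estimator sketched in Section 4.1 (or any other estimator) passed in as `id_est`, and features represented as column indices of a NumPy array.

```python
def overall_reducible_subspace(data, eps, id_est):
    """OR = {fa | ΔID(RF_fa, fa) <= eps}, where RF_fa = ΩF \ {fa}.

    Since RF_fa ∪ {fa} = ΩF, ΔID(RF_fa, fa) = ID(ΩF) − ID(RF_fa), so the
    full-space ID is computed once and each feature needs one extra estimate.
    """
    M = data.shape[1]
    full_id = id_est(data)                              # ID(ΩF)
    OR = []
    for a in range(M):
        rest = [b for b in range(M) if b != a]          # RF_fa
        if full_id - id_est(data[:, rest]) <= eps:      # ΔID(RF_fa, fa)
            OR.append(a)
    return OR
```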

5. MAXIMUM REDUCIBLE SUBSPACE

In this section, we present the second component of REDUS, i.e., identifying the maximum reducible subspaces from the overall reducible subspace found in the previous section.

5.1 Intrinsic Dimensionality Based Method

From Definition 3.7 and Theorem 4.1, we have the following property concerning the reducible subspaces.

Corollary 5.1. Let Yi ⊆ OR be a maximum reducible subspace, and Vi ⊂ Yi be any core space of Yi. We have

Yi = {fa | ΔID(Vi, fa) ≤ ε, fa ∈ OR}.

Therefore, to find the individual maximum reducible subspaces Yi ⊆ OR (1 ≤ i ≤ S), we can use any core space Vi ⊂ Yi to find the other features in Yi. More specifically, a candidate core space of size d is a feature subset C ⊆ OR (|C| = d). From size d = 1 to |OR|, for each candidate core space C, let T = {fa | ΔID(C, fa) ≤ ε, fa ∈ OR, fa ∉ C}. If T ≠ ∅, then Y = C ∪ T is a maximum reducible subspace with core space C of size d. The overall reducible subspace OR is then updated by removing the features in Y. Note that |OR| decreases whenever some maximum reducible subspace is identified. We now prove the correctness of this method.
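As an illustration of the enumeration just described (not the authors' implementation), a rough sketch under the same assumptions as the earlier snippets is shown below; in the paper, the repeated intrinsic dimensionality estimates are replaced by the cheaper point-distribution heuristic of Section 5.2.

```python
from itertools import combinations

def maximum_reducible_subspaces(data, OR, eps, id_est):
    """Identify maximum reducible subspaces from the overall reducible
    subspace OR by enumerating candidate core spaces of growing size d."""
    OR = list(OR)
    found = []
    d = 1
    while d < len(OR):
        for C in map(list, combinations(OR, d)):
            id_C = id_est(data[:, C])
            T = [fa for fa in OR if fa not in C
                 and id_est(data[:, C + [fa]]) - id_C <= eps]   # ΔID(C, fa) <= eps
            if T:                                  # C is a core space of Y = C ∪ T
                found.append(sorted(set(C) | set(T)))
                OR = [f for f in OR if f not in C and f not in T]
                break                              # continue on the reduced OR
        else:
            d += 1                                 # no size-d core space found
    return found
```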

Corollary 5.2. Any candidate core space is non-redundant.

Proof. It is easy to see that any candidate core space of size 1 is non-redundant. Now, assuming that all candidate core spaces of size d − 1 are non-redundant, we show that all candidate core spaces of size d are non-redundant. We prove this by contradiction.

Let V = {fv1, fv2, · · · , fvd} be an arbitrary candidate core space of size d. Without loss of generality, assume that fvd is the redundant feature in V. Let V′ = {fv1, fv2, · · · , fvd−1}. We have ΔID(V′, fvd) ≤ ε. Since |V′| = d − 1, V′ is non-redundant according to the assumption. Moreover, we have T = {fa | ΔID(V′, fa) ≤ ε, fa ∈ OR, fa ∉ V′} ≠ ∅, since fvd ∈ T. Therefore, fvd would have been removed from OR before the size of the candidate core spaces reaches d. This contradicts the assumption that fvd is in the candidate core space V. Therefore, any candidate core space is non-redundant. □

Corollary 5.3. Let C be a candidate core space. If ∃fa ∈ OR such that ΔID(C, fa) ≤ ε, then C is a true core space of some maximum reducible subspace in OR.

Proof. Let Y = {fy | ΔID(C, fy) ≤ ε, fy ∈ OR}. Following the process of finding OR, we know that Y includes all and only the features in ΩF that are strongly correlated with C. Thus C ⊂ Y, and C satisfies Criterion (1) in Definition 3.6 and Criterion (2) in Definition 3.7. Moreover, according to Corollary 5.2, C is non-redundant. Hence C also satisfies Criterion (2) of Definition 3.6. Thus Y is a maximum reducible subspace with core space C. □

In this method, for each candidate core space C, we need to calculate ID(C) and ID(C ∪ {fa}) for every fa ∈ OR in order to obtain the value of ΔID(C, fa). However, the intrinsic dimensionality calculation is computationally expensive. Since intrinsic dimensionality estimation is inherently approximate, we propose in the following section a method utilizing the point distribution in feature subspaces to determine whether a feature is strongly correlated with a core space.

5.2 Point Distribution Based Method

After finding the overall reducible subspace OR, we can apply the following heuristic to examine if a feature is strongly correlated with a feature subspace. The intuition behind our heuristic is similar to the one behind the correlation dimension.

Assume that the number of data points N in the dataset approaches infinity, and that the features in the dataset are normalized so that the points are distributed between 0 and 1 in each dimension. Let ps ∈ ΩP be an arbitrary point in the dataset, and let l (0 < l < 1) be an interval length. Let ξsy represent the interval of length l on feature fy centered at ps. The expected number of points within the interval ξsy is lN. For d features C = {fc1, fc2, · · · , fcd}, let QsC be the d-dimensional hypercube formed by the intervals ξsci (fci ∈ C). If the d features in C are totally uncorrelated, then the expected number of points in QsC is l^d N. Let fm be another feature in the dataset, and let C′ = {fc1, fc2, · · · , fcd, fm}. If fm is determined by {fc1, fc2, · · · , fcd}, i.e., fm is strongly correlated with C, then C′ has intrinsic dimensionality d. The expected number of points in the d-dimensional hypercube QsC, which is embedded in the (d + 1)-dimensional space of C′, is still l^d N. If, on the other hand, fm is uncorrelated with any feature subspace of {fc1, fc2, · · · , fcd}, then C′ has intrinsic dimensionality d + 1, and the expected number of points in the (d + 1)-dimensional hypercube QsC′ is l^(d+1) N. The difference between the numbers of points in the cubes in these two cases is l^d (1 − l) N.

Figure 4(a) and 4(b) show two examples on 2-dimensional spaces. In both examples, d = 1 and C = {fa}. In Figure 4(a), feature fb is strongly correlated with fa. Feature fc is uncorrelated with fa, as shown in Figure 4(b). The randomly sampled point ps is at the center of the cubes Qs{fa,fb} and Qs{fa,fc}. The point density in cube Qs{fa,fb} is clearly much higher than the point density in cube Qs{fa,fc} due to the strong correlation between fa and fb.

Figure 4
Point distributions in correlated feature subspace and uncorrelated feature subspace

Therefore, for each candidate core space, we can check whether a feature is correlated with it in the following way. We randomly sample n points P = {ps1, ps2, · · · , psn} from the dataset. Suppose that C = {fc1, fc2, · · · , fcd} is the current candidate core space. For a feature fa ∈ OR (fa ∉ C), let C′ = {fc1, fc2, · · · , fcd, fa}. Let δsiC′ represent the number of points in the cube QsiC′, and let P′ = {psi | δsiC′ > l^(d+1) N} be the subset of the sampled points whose cubes contain more points than expected if fa were uncorrelated with C. We say fa is strongly correlated with C if |P′|/|P| ≥ τ, where τ is a threshold close to 1.

Concerning the choice of l, we can apply the following reasoning. If we let l = (1/N)^(1/(d+1)), then the expected number of points in the cube QsiC′ is 1 if fa is uncorrelated with C. If fa is correlated with C, then the expected number of points in the cube QsiC′ is greater than 1. In this way, we can set l according to the size of the candidate core space.

The second step of REDUS is shown in Algorithm 1, Lines 8 to 18. Note that in the worst case, the algorithm needs to enumerate all possible feature subspaces. In practice, however, the algorithm is very efficient since, once an individual reducible subspace is found, all its features are removed and only the remaining features need to be further examined.
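A minimal sketch of the point-distribution check described above is given below. It is an illustration rather than the pseudo-code of Algorithm 1; the sampling and counting details are assumptions, and `data` is assumed to be normalized to [0, 1] in each feature, as required in this section.

```python
import numpy as np

def correlated_by_distribution(data, C, fa, tau=0.9, n_frac=0.1, seed=0):
    """Decide whether feature `fa` is strongly correlated with the
    candidate core space `C` (column indices) using point distributions."""
    rng = np.random.default_rng(seed)
    N, d = data.shape[0], len(C)
    l = (1.0 / N) ** (1.0 / (d + 1))          # side length l = (1/N)^(1/(d+1))
    expected = (l ** (d + 1)) * N             # points expected in the (d+1)-dim cube if uncorrelated (= 1)
    cols = list(C) + [fa]
    samples = rng.choice(N, size=max(1, int(n_frac * N)), replace=False)
    hits = 0
    for s in samples:
        center = data[s, cols]
        in_cube = np.all(np.abs(data[:, cols] - center) <= l / 2.0, axis=1)
        if in_cube.sum() > expected:          # denser than the uncorrelated expectation
            hits += 1
    return hits / len(samples) >= tau         # fraction of dense cubes at least tau
```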

6. EXPERIMENTS

To evaluate REDUS, we apply it to both synthetic and real datasets. REDUS is implemented in Matlab 7.0.4. The experiments are performed on a 2.4 GHz PC with 1 GB of memory running Windows XP.

6.1 Parameter Setting

As shown in Algorithm 1, REDUS generally requires three input parameters: ε, n, and τ. In the first step of finding the overall reducible subspace, ε is the threshold used to filter out irrelevant features. Since features strongly correlated with some core space change the intrinsic dimensionality only by a small amount, the value of ε should be close to 0. In our experience, a good starting point is 0.1. After finding the reducible subspaces, the user can apply standard dimensionality reduction methods to check whether they are really correlated, and then adjust the value of ε accordingly to find stronger or weaker correlations in the subspaces. In all our experiments, we set ε between 0.002 and 0.25. In the second step, n is the point sampling size and τ is the threshold used to determine whether a feature is strongly correlated with a candidate core space. In our experiments, n is set to 10% of the total number of data points in the dataset, and τ is set to 90%.

6.2 Synthetic Datasets

6.2.1 Effectiveness Evaluation

To evaluate the effectiveness of REDUS, we generate two synthetic datasets.

Synthetic dataset 1

The first synthetic dataset is shown in Figure 1. There are 12 features, {f1, f2, · · · , f12}, and 1000 data points in the dataset. Three reducible subspaces, a 2-dimensional Swiss roll, a 1-dimensional helix-shaped line, and a 2-dimensional plane, are embedded in different 3-dimensional spaces. The overall reducible subspace is {f1, f2, · · · , f9}. Let ci (1 ≤ i ≤ 4) represent constants and rj (1 ≤ j ≤ 3) represent random vectors. The generating function of the Swiss roll is: t = (3π/2)(1 + 2r1), s = 21r2, f1 = t cos(t), f2 = s, f3 = t sin(t). The roll is then rotated 45° counterclockwise in the feature space {f2, f3}. The helix-shaped line is generated by: f4 = c1r3, f5 = c2 sin(r3), f6 = c2 cos(r3). The 2-dimensional plane is generated by f9 = c3f7 + c4f8. The remaining 3 features, {f10, f11, f12}, are random vectors containing noise.
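For reproducibility, the generating process described above can be sketched as follows. The constants c1–c4, the range of r3, and the random seed are not specified in the paper, so the values below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)               # seed is an arbitrary choice
N = 1000
r1, r2 = rng.random(N), rng.random(N)
r3 = 4 * np.pi * rng.random(N)               # range of r3 assumed; not specified in the paper
c1, c2, c3, c4 = 1.0, 2.0, 0.5, 0.5          # arbitrary constants

# Swiss roll on {f1, f2, f3}: t = (3π/2)(1 + 2 r1), s = 21 r2
t, s = 1.5 * np.pi * (1 + 2 * r1), 21 * r2
f1, f2, f3 = t * np.cos(t), s, t * np.sin(t)
# rotate 45 degrees counterclockwise in the {f2, f3} plane
f2, f3 = (f2 * np.cos(np.pi / 4) - f3 * np.sin(np.pi / 4),
          f2 * np.sin(np.pi / 4) + f3 * np.cos(np.pi / 4))

# helix-shaped line on {f4, f5, f6}
f4, f5, f6 = c1 * r3, c2 * np.sin(r3), c2 * np.cos(r3)

# 2-dimensional plane on {f7, f8, f9}
f7, f8 = rng.random(N), rng.random(N)
f9 = c3 * f7 + c4 * f8

# random noise on {f10, f11, f12}
f10, f11, f12 = rng.random(N), rng.random(N), rng.random(N)

dataset = np.column_stack([f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11, f12])
```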

In the first step, with ε = 0.25, REDUS successfully uncovers the overall reducible subspace. The parameter setting for the second step is τ = 90% and a point sampling size of 10%. We run REDUS 10 times; in all 10 runs, REDUS successfully identifies the individual maximum reducible subspaces from the overall reducible subspace.

Synthetic dataset 2

We generate another, larger synthetic dataset as follows. There are 50 features, {f1, f2, · · · , f50}, and 1000 data points in the dataset. There are 3 reducible subspaces: Y1 = {f1, f2, · · · , f10} reducible to a 2-dimensional space, Y2 = {f11, f12, · · · , f20} reducible to a 1-dimensional space, and Y3 = {f21, f22, · · · , f30} reducible to a 2-dimensional space. The remaining features contain random noise. Figures 5(a) and 5(b) show two examples of the embedded correlations in 3-dimensional subspaces. Figure 5(a) plots the point distribution on the feature subspace {f1, f2, f9} of Y1, and Figure 5(b) plots the point distribution on the feature subspace {f11, f12, f13} of Y2.

Figure 5
Examples of embedded correlations in synthetic dataset 2

We apply REDUS to this synthetic dataset using various parameter settings. Table 1 shows the accuracy of finding the overall reducible subspace for different values of ε. The recall is defined as TP/(TP + FN), and the precision is defined as TP/(TP + FP), where TP, FP, and FN represent the numbers of true positives, false positives, and false negatives, respectively. As we can see, REDUS is very accurate and robust to ε.

Table 1
Accuracy of finding the overall reducible subspace when varying ε

Tables 2(a) and 2(b) show the identified maximum reducible subspaces when varying τ and n. Table 2(a) shows results under different settings of τ. The point sampling size n in this table is the default value, i.e., 10% of the total number of data points. Although changing τ may cause some mis-classified features, Table 2(a) still shows that REDUS achieves reasonably high accuracy under different settings of τ. The reason for different decompositions of maximum reducible subspaces is that features in different maximum reducible subspaces may still have moderate correlations. If these correlated features are identified, they will be removed from the overall reducible subspace, since REDUS focuses on the case where the maximum reducible subspaces are non-overlapping. If we allow each feature to be in different maximum reducible subspaces, the findings under different τ should be more similar to each other. Finding overlapping reducible subspaces is computationally more demanding, and is an interesting problem worth further exploration.

Table 2
Accuracy of identifying the maximum reducible subspaces from the overall reducible subspace when varying τ and n

Table 2(b) shows the identified maximum reducible subspaces when varying the number of sampled points n. τ is set to 0.92 in this table. As shown in the table, REDUS is not sensitive to the number of sampled data points.

6.2.2 Efficiency Evaluation

To evaluate the efficiency and scalability of REDUS, we apply it to synthetic dataset 2. The default dataset for efficiency evaluation contains 1000 points and 50 features unless specified otherwise. The default values for the parameters are the same as before.

Figure 6(a) shows the runtime of finding the overall reducible subspace when varying the number of data points. The runtime scales roughly quadratically. This is because computing the correlation dimension requires calculating all pairwise distances between the data points, which is quadratic in the number of points.

Figure 6
Efficiency evaluation of finding the overall reducible subspace

Figure 6(b) shows that the runtime of finding the overall reducible subspace is linear in the number of features. This is because REDUS scans every feature only once to examine whether it is strongly correlated with the subspace of the remaining features. This linear scalability is desirable for datasets containing a large number of features.

Figures 7(a) and 7(b) show the runtime comparison between using the correlation dimension as the intrinsic dimensionality estimator and using the point distribution heuristic to identify the individual maximum reducible subspaces from the overall reducible subspace. Since the calculation of intrinsic dimensionality is relatively expensive, the former often cannot finish in a reasonable amount of time. The point distribution heuristic, on the other hand, is much more efficient and scales linearly with the number of points and features in the dataset.

Figure 7
Efficiency evaluation of identifying maximum reducible subspaces from the overall reducible subspace

6.3 Real Life Datasets

6.3.1 NBA dataset

We apply REDUS to the NBA statistics dataset. The dataset can be downloaded from http://sports.espn.go.com/nba/teams/stats?team=Bos&year=2007&season=2. It contains the statistics of 28 features for 200 players from the 2006-2007 season. Since the features have different value scales, we normalize each feature so that the points are distributed between 0 and 1. We report two interesting correlations found in the dataset in Figures 8(a) and 8(b). Note that the features shown in the figures are mean-centered.

Figure 8
Correlations identified in the NBA dataset

The correlation shown in Figure 8(a) is that the feature subspace of three features, defensive rebounds (DEF), offensive rebounds (OFF), and the total number of rebounds (TOT), is strongly correlated and reducible to a 2-dimensional space. This is an obvious correlation that one would expect. As shown in the figure, the points lie on a 2-dimensional plane in the 3-dimensional subspace.

Figure 8(b) shows a nonlinear correlation identified by REDUS. The feature subspace of three features, field goals made (FGM), field goals attempted (FGA), and field goal percentage (FG%), is strongly correlated and reducible to a 2-dimensional space. Clearly, the data points in this 3-dimensional space are distributed on a 2-dimensional manifold.

6.3.2 Wage dataset

The wage dataset from the 1985 Current Population Survey consists of 11 features for 534 data points. The dataset is available at http://lib.stat.cmu.edu/datasets/CPS_85_Wages. The numerical features are age, years of education, years of work experience, and wage. We apply REDUS to this dataset to find the strongly correlated feature subsets.

REDUS identifies one correlation, among the features age, years of education, and wage. Figure 9 shows the distribution of the data points in this 3-dimensional feature subspace. From this figure, we can see that wage is clearly a linear function of age and years of education. Therefore, this 3-dimensional space is reducible to the 2-dimensional plane embedded in it.

Figure 9
A linear correlation in the wage dataset

6.3.3 Breast cancer dataset

We apply REDUS to the breast cancer dataset, which is available at the UCI Machine Learning Archive. There are 569 data points and 30 features in this dataset. The features include statistics of radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. These features are computed from a digitized image of a fine needle aspirate of a breast mass and describe characteristics of the cell nuclei present in the image.

Figure 10 shows one of the nonlinear correlations identified by REDUS. The three features are the mean radius, the largest radius, and the mean texture. From the figure, we can see that these three features are strongly correlated and reducible to a 1-dimensional space.

Figure 10
A correlation in the breast cancer dataset

7. CONCLUSION

In this paper, we investigate the problem of finding strongly correlated feature subspaces in high dimensional datasets. The correlations can be linear or nonlinear. Such correlations hidden in feature subspaces may be invisible to global feature transformation methods, such as PCA and ISOMAP. Utilizing the concept of intrinsic dimensionality, we formalize this problem as the discovery of maximum reducible subspaces in the dataset. An effective algorithm, REDUS, is presented to find the maximum reducible subspaces. The experimental results show that REDUS can effectively and efficiently find these interesting local correlations.

Our work in this paper focuses on the case where the maximum reducible subspaces are non-overlapping. For future work, one interesting direction is to extend the current work to the general case where a feature can be in multiple maximum reducible subspaces. Another interesting direction is to find feature subspaces that are strongly correlated on a subset of data points. This is a more general problem and has wider applications.

8. ACKNOWLEDGMENTS

This research was partially supported by EPA grant STAR-RD832720, NSF grants IIS-0448392, IIS-0534580, and a Microsoft New Faculty Fellowship.

Footnotes


1REDUS stands for REDUcible Subspaces.

9. REFERENCES

1. Aggarwal C, Yu P. Finding generalized projected clusters in high dimensional spaces. SIGMOD. 2000
2. Alizadeh A, et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature. 2000;403:503–11. [PubMed]
3. Barbara D, Chen P. Using the fractal dimension to cluster datasets. KDD. 2000
4. Belkin M, Niyogi P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation. 2003
5. Belussi A, Faloutsos C. Self-spacial join selectivity estimation using fractal concepts. ACM Transactions on Information Systems. 1998;16(2):161–201.
6. Blum A, Langley P. Selection of relevant features and examples in machine learning. Artificial Intelligence. 1997;97:245–271.
7. Bohm C, Kailing K, Kroger P, Zimek A. Computing clusters of correlation connected objects. SIGMOD. 2004
8. Borg I, Groenen P. Modern multidimensional scaling. Springer; New York: 1997.
9. Camastra F, Vinciarelli A. Estimating intrinsic dimension of data with a fractal-based approach. IEEE Trans. on Pattern Analysis and Machine Intelligence. 2002;24(10):1404–1407.
10. Cover TM, Thomas JA. The Elements of Information Theory. Wiley & Sons; New York: 1991.
11. Eisen M, Spellman P, Brown P, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA. 1998;95:14863–68. [PubMed]
12. Faloutsos C, Kamel I. Beyond uniformity and independence: analysis of r-trees using the concept of fractal dimension. PODS. 1994
13. Fukunaga K. Intrinsic dimensionality extraction. In: Krishnaiah PR, Kanal LN, editors. Classification, Pattern Recognition and Reduction of Dimensionality, Volume 2 of Handbook of Statistics. Amsterdam; North Holland: 1982. pp. 347–360.
14. Fukunaga K, Olsen DR. An algorithm for finding intrinsic dimensionality of data. IEEE Transactions on Computers. 1976;20(2):165–171.
15. Gionis A, Hinneburg A, Papadimitriou S, Tsaparas P. Dimension induced clustering. KDD. 2005
16. Golub G, Van Loan C. Matrix computations. Johns Hopkins University Press; Baltimore, Maryland: 1996.
17. Iyer V, et al. The transcriptional program in the response of human fibroblasts to serum. Science. 1999;283:83–87. [PubMed]
18. Jolliffe I. Principal component analysis. Springer; New York: 1986.
19. Kendall M, Gibbons JD. Rank Correlation Methods. Oxford University Press; New York: 1990.
20. Lay DC. Linear Algebra and Its Applications. Addison Wesley; 2005.
21. Levina E, Bickel PJ. Maximum likelihood estimation of intrinsic dimension. Advances in Neural Information Processing Systems. 2005
22. Liu H, Motoda H. Feature selection for knowledge discovery and data mining. Kluwer Academic Publishers; Boston: 1998.
23. Pagel B-U, Korn F, Faloutsos C. Deflating the dimensionality curse using multiple fractal dimensions. ICDE. 2000
24. Papadimitriou S, Kitawaga H, Gibbons PB, Faloutsos C. Loci: Fast outlier detection using the local correlation integral. ICDE. 2003
25. Rasband SN. Chaotic Dynamics of Nonlinear Systems. Wiley-Interscience; 1990.
26. Reynolds HT. The analysis of cross-classifications. The Free Press; New York: 1977.
27. Roweis S, Saul L. Nonlinear dimensionality reduction by locally linear embedding. Science. 2000;290(5500):2323–2326. [PubMed]
28. Schroeder M. Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise. W. H. Freeman; New York: 1991.
29. Tenenbaum JB, de Silva V, Langford JC. A global geometric framework for nonlinear dimensionality reduction. Science. 2000;290(5500):2319–2323. [PubMed]
30. Tung AKH, Xin X, Ooi BC. Curler: Finding and visualizing nonlinear correlation. SIGMOD. 2005
31. Yu L, Liu H. Feature selection for high-dimensional data: a fast correlation-based filter solution. ICML. 2003
32. Zhang X, Pan F, Wang W. Care: Finding local linear correlations in high dimensional data. ICDE. 2008 [PMC free article] [PubMed]
33. Zhao Z, Liu H. Searching for interacting features. IJCAI. 2007