
Proc Int Conf Data Eng. Author manuscript; available in PMC 2010 August 4.

Published in final edited form as:

Proc Int Conf Data Eng. 2008 April 7; 2008: 160–168.

doi: 10.1109/ICDE.2008.4497424

PMCID: PMC2916184

NIHMSID: NIHMS132004


Subspace clustering has attracted great attention due to its capability of finding salient patterns in high dimensional data. Order preserving subspace clusters have proven important in high-throughput gene expression analysis, since functionally related genes are often co-expressed under a set of experimental conditions, and such co-expression patterns can be represented by consistent orderings of attributes. Existing order preserving cluster models require that all objects in a cluster have an identical attribute order, without deviation. However, real data are noisy due to measurement limitations and experimental variability, which prevents these strict models from revealing true clusters corrupted by noise. In this paper, we study the problem of revealing order preserving clusters in the presence of noise. We propose a noise-tolerant model called the approximate order preserving cluster (AOPC). Instead of requiring that all objects in a cluster have an identical attribute order, we require that (1) at least a certain fraction of the objects have an identical attribute order; and (2) the remaining objects in the cluster may deviate from this consensus order by up to a certain fraction of attributes. We also propose an algorithm to mine AOPCs. Experiments on gene expression data demonstrate the efficiency and effectiveness of our algorithm.

In recent years, the advent of high throughput data generation techniques has increased not only the number of objects collected in databases, but also the number of attributes describing these objects. The resultant datasets are often referred to as high dimensional. Clustering high dimensional data with traditional algorithms suffers from the fact that many attributes may be irrelevant and can thus mask clusters located in some subspaces. Subspace clustering algorithms have been proposed to solve this problem: they search for clusters in subspaces formed by relevant attributes [1]. Among the various subspace clustering models, one was designed to mine sets of objects which show an identical attribute order, called the *order preserving cluster* (OPC) model [6]. We give a formal description of this model in the next section. This model originally attracted researchers’ interest because of its utility in gene expression data analysis. Based on the understanding of cellular processes, it is generally believed that some subsets of genes may be co-expressed under certain experimental conditions, but behave independently under other conditions. Finding such local expression patterns exhibited under relevant conditions is one important contribution of the OPC algorithm and may be the key to uncovering significant but previously unknown genetic pathways.

However, noise is ubiquitous in real data due to technical errors, missing values, variable experimental conditions, etc. The underlying OPCs may be broken into small ones by noise and cannot be captured by any strict model, given its vulnerability to noise. Figure 1 shows an example. The dataset contains four objects {*w*, *x*, *y*, *z*} with attributes {*a*, *b*, *c*, *d*, *e*}. Originally, all objects follow the same attribute order: their values on *a*, *b*, *c*, *d* and *e* are in increasing order. Thus {*w*, *x*, *y*, *z*} is an OPC on {*a*, *b*, *c*, *d*, *e*}. However, after some noise is introduced into this dataset, *w* and *z* deviate from their original attribute order. As a result, {*w*, *x*, *y*, *z*} is no longer an OPC on {*a*, *b*, *c*, *d*, *e*}. Instead, it is broken into several smaller, overlapping clusters, such as {*x*, *y*, *z*} on {*a*, *b*, *c*, *d*}, {*w*, *x*, *y*} on {*a*, *b*, *d*, *e*} and {*x*, *y*} on {*a*, *b*, *c*, *d*, *e*}. From this example we can see that the true cluster cannot be captured by the strict model in the presence of noise.

An example illustrating how an OPC is disrupted by noise: objects *w* and *z* are excluded from the cluster due to noise.

Mining subspace OPCs in the presence of noise is very challenging for the following reasons. First, the search space is often huge due to the curse of dimensionality. For a dataset with *n* attributes, there are 2^{*n*} candidate subspaces in total. For OPC mining, the complexity is much higher, since in addition to identifying subspaces, we also need to distinguish different orders. For a subspace with *m* attributes there are *m*! possible orders, so the total number of candidate attribute orders over all subspaces is

$$\sum _{m=1}^{n}\left(\begin{array}{c}n\\ m\end{array}\right)\times m!$$

(1)

For each attribute order, there could be a potential OPC associated with it. This means that even for a dataset with only 10 attributes, there are nearly 10^{7} orders to search. Second, when taking noise into account, the problem becomes even harder: noise introduces ambiguity, and objects in the same cluster may no longer have identical attribute orders. Thus, given a cluster candidate, identifying its consensus attribute order becomes a new challenge that was not an issue in previous models. Third, some useful properties, such as anti-monotonicity, no longer hold when tolerating noise. All of the above facts pose significant challenges for OPC mining in the presence of noise.
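The size of this order search space (Eq. 1) is easy to compute directly. A quick sketch, not part of the paper's algorithm, confirms how fast it grows:

```python
from math import comb, factorial

def candidate_orders(n: int) -> int:
    """Total number of attribute orders over all non-empty subspaces
    of an n-attribute dataset (Eq. 1): sum over m of C(n, m) * m!."""
    return sum(comb(n, m) * factorial(m) for m in range(1, n + 1))

# For n = 10 attributes the search space is already in the millions:
print(candidate_orders(10))  # 9864100
```

Each term counts the subspaces of size *m* times the *m*! ways to order them, which is why the space grows super-exponentially in *n*.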

To the best of our knowledge, no previous work has addressed the problem of subspace OPC mining with noise tolerance. In this paper, we study this problem and propose a new model. Experimental results demonstrate that the new model can capture OPCs contaminated by noise and thus is more robust to noise than the previous strict model. We also propose an algorithm to mine clusters under the new model. Although we deal with a much more challenging problem, our algorithm efficiently discovers significant clusters that cannot be found under the previous strict model.

The remainder of this paper is organized as follows. Section II introduces the notation and terminology we use throughout the paper. Section III gives a brief review of related work. In Section IV, we propose our new model and provide experimental evidence of its noise tolerance. Section V presents the algorithm: we first propose a basic mining algorithm and then discuss several optimization techniques. Section VI reports a series of experiments evaluating both the efficiency and the effectiveness of our algorithm. We conclude the paper in Section VII.

In this section, we present the terminology, notation and assumptions of this paper, and formally define the OPC mining problem. The notation we use is listed below:

| Notation | Meaning |
| --- | --- |
| *D* | A set of objects |
| *A* | The set of all attributes of objects in *D* |
| (*C*, *T*) | A subset of the input dataset, *C* ⊆ *D*, *T* ⊆ *A* |
| *x*, *y*, … | Individual objects in *D* |
| *a*, *b*, … | Individual attributes in *A* |
| *d_{xa}* | Value of object *x* on attribute *a* |

The (attribute) *order* of an object *x* on a subset of attributes *T* is the permutation of the attributes in *T* induced by the values of *x* on these attributes. That is, the order of *x* on *T*, denoted by *o _{xT}*, is *abc*… if and only if

$${d}_{xa}<{d}_{xb}<{d}_{xc}<\dots $$

(2)

If the order of *x* on *T* is *o _{xT}*, we also say that *x* *supports* *o _{xT}*.
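As a concrete illustration of Eq. (2), the order of an object is simply its attributes sorted by value. A minimal sketch, with made-up attribute names and values:

```python
def attribute_order(values: dict, T: list) -> tuple:
    """Return the order o_xT: the attributes in T sorted by the
    object's values on them (ascending), per Eq. (2)."""
    return tuple(sorted(T, key=lambda attr: values[attr]))

# An object in the spirit of Figure 1 (values are illustrative):
w = {"a": 0.1, "b": 0.4, "c": 0.2, "d": 0.9}
print(attribute_order(w, ["a", "b", "c", "d"]))  # ('a', 'c', 'b', 'd')
```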

Clustering is the process of grouping *similar* objects. Various similarity measures have been employed; some are based on distances while others are based on patterns. The OPC model belongs to the second category. In this section we review previous work on pattern based clustering. Since none of it considers the presence of noise, we call these *strict models*. A common characteristic of the strict models is that all objects in a cluster must follow the same pattern. Besides the strict models, we also review work on approximate frequent itemset mining, since its approach to handling noise is related to our problem.

The earliest work on pattern based clustering is the bi-cluster model proposed by Cheng *et al*. [3]. This model measures the coherence between genes and experimental conditions in a sub-matrix of a DNA array. Later, Wang *et al*. proposed the δ-pCluster model [4], which aims to discover clusters of objects that show shifting or scaling patterns. With powerful pruning strategies, δ-pClusters can be mined efficiently. However, this algorithm can group objects exhibiting either a pure shifting or a pure scaling pattern, but not both at the same time. Xu *et al*. proposed the Reg-Cluster model [5] to relax this limitation: through a linear transformation, it captures shifting and scaling patterns simultaneously. The problem remains that, for most real world applications, requiring exact shifting or scaling is too restrictive. To include more diverse patterns, we often need to lower the required pattern significance, which in turn can result in undesirable inconsistency inside a cluster.

The concept of OPC was first proposed by Ben-Dor *et al*. [2], where it was called an *order preserving sub-matrix* (OPSM). Each OPSM represents a subset of genes identically ordered among a subset of experimental conditions in a gene microarray dataset. Since this problem is NP-hard, they proposed a probabilistic model to mine an OPSM from a random matrix. The local patterns found by this algorithm appear to be significant. A drawback is that only one cluster can be found at a time, and the result is very sensitive to input parameters and initial seeds. To find multiple OPSMs at the same time, Liu *et al*. proposed a deterministic algorithm to mine all OPCs [6]. Using an efficient pruning strategy and an auxiliary data structure called the OPC-tree, this algorithm searches the full order space and thus can find all orders exhibited by a subset of objects along a subset of attributes. The OPC model is more flexible than models capturing only specific patterns such as shifting and scaling, and thus can be applied more widely. However, due to the noisy nature of real data, it still fails to discover some significant clusters, since it requires all objects in a cluster to have an identical order and thus excludes objects that originally belonged to the cluster but were contaminated by noise. We will show in the next section that the noise tolerance of the OPC model is very weak.

The task of frequent itemset mining is to mine a sub-matrix of ‘1’s containing a sufficiently large set of rows (transactions) in a binary matrix representation of the input dataset. This problem also suffers from the presence of noise, which may corrupt true frequent itemsets. Liu *et al*. proposed a noise-tolerant model for this problem called the *approximate frequent itemset* (AFI) [7]. The AFI criteria place restrictions on the fraction of noise in both the rows and columns of a sub-matrix, which ensures a relatively uniform distribution of noise in any discovered pattern. AFI is proven effective in revealing significant underlying frequent itemsets. However, it deals only with binary data. For continuous data, discovering noise-tolerant clusters is much more difficult. In this paper, we study noise tolerance in continuous data.

Due to the noisy nature of real data, it is often too optimistic to expect all objects in a cluster to have the same order; co-expression patterns in gene expression data are one such example. To find more significant clusters, a more flexible model which can tolerate noise is needed. In this section, we propose a new model called the *approximate order preserving cluster* (AOPC). The general idea is that the members of an AOPC should follow similar (not necessarily identical) orders, while enough members support one order as the *consensus* order of the cluster. The novelty of this model is that it allows relaxation in a systematic way: instead of requiring all objects to have an identical order, it allows a group of objects with similar orders to form a cluster. The formal definition is given in DEFINITION 4.1. In this definition, *LCS*(*a*, *b*) denotes the *longest common subsequence* of two sequences *a* and *b* [8], and we use |·| to denote the size (or length) of a set (or sequence). *δ _{c}* and *δ _{s}* are user-specified thresholds in (0, 1].

Given a dataset D and its attribute set A, let (C, T) be a subset of the dataset, where C ⊆ D and T ⊆ A. If o is an order of the attributes in T, then C is an **approximate order preserving cluster** with order o if and only if it satisfies the following two criteria:

- (consistency criterion) For each object x in C, |LCS(o_{xT}, o)| ≥ |o| × δ_{c}, where 0 < δ_{c} ≤ 1.
- (supporting criterion) There exist at least |C| × δ_{s} objects in C which support o, where 0 < δ_{s} ≤ 1.
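The two criteria of DEFINITION 4.1 can be checked directly. The following sketch uses a textbook dynamic-programming LCS and an illustrative four-object toy cluster (not data from the paper):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def is_aopc(orders, o, delta_c, delta_s):
    """Check the two criteria of DEFINITION 4.1. `orders` maps each
    object in C to its order o_xT on the subspace T; `o` is the
    candidate consensus order of the cluster."""
    # Consistency: every object's order shares >= |o| * delta_c with o.
    if any(lcs_len(oxT, o) < len(o) * delta_c for oxT in orders.values()):
        return False
    # Supporting: at least |C| * delta_s objects follow o exactly.
    support = sum(1 for oxT in orders.values() if tuple(oxT) == tuple(o))
    return support >= len(orders) * delta_s

# Toy cluster in the spirit of Figure 1 (orders are illustrative):
orders = {"w": "abdce", "x": "abcde", "y": "abcde", "z": "bacde"}
print(is_aopc(orders, "abcde", delta_c=0.8, delta_s=0.5))  # True
```

Here *w* and *z* each share an LCS of length 4 with the consensus order (meeting the 0.8 threshold on 5 attributes), and *x*, *y* support it exactly (meeting the 0.5 supporting fraction).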

From the above definition, we know that the OPC model is actually a special case of the AOPC model where *δ _{s}* = 1. Thus an OPC is an AOPC as well. Moreover, we define the *core* of an AOPC as follows.

Let (C, T) be an AOPC with order o. Its **core** is the subset of C consisting of all objects in C which support o.

In this section, we present some experimental results to demonstrate that the AOPC model is more robust to noise than the OPC model.

The objective of this experiment is to compare the noise tolerance of the OPC model and the AOPC model. The experiment is designed as follows. First, we generate an OPC in which all objects follow an identical order; the value of each data entry satisfies a normal distribution with mean zero and variance one. Second, we add noise to each data entry. The noise also follows a normal distribution, with mean zero and a controlled variance *σ*. We set *σ* to zero initially and increase it gradually by 10^{−7} at each step. At each *σ* value, we test whether the resulting data still satisfies the OPC and AOPC models. At the beginning, since there is no noise in the dataset, it satisfies both models. As *σ* increases, the data becomes messier, and there exists some *σ* at which the data no longer satisfies a given model; we call this value the *turning point*. We want to find the turning points of the OPC and AOPC models respectively and compare their magnitudes. In our experiment, we generated 1000 clusters of various sizes in total. The parameters *δ _{s}* and *δ _{c}* of the AOPC model were held fixed across all runs.
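The turning-point experiment can be sketched as follows. This illustrative version uses a much coarser step size than the paper's 10^{−7}, synthetic sizes of our own choosing, and checks only the strict OPC model:

```python
import random

def make_opc(n_objects, n_attrs, seed=0):
    """Synthetic OPC: each object's N(0, 1) values are sorted so every
    object follows the same (increasing) attribute order."""
    rng = random.Random(seed)
    return [sorted(rng.gauss(0, 1) for _ in range(n_attrs))
            for _ in range(n_objects)]

def is_strict_opc(rows):
    """True iff all objects share the identical increasing order."""
    return all(all(r[j] < r[j + 1] for j in range(len(r) - 1)) for r in rows)

def turning_point(rows, step=1e-3, seed=1):
    """Smallest noise standard deviation (in `step` increments) at
    which the perturbed data stops satisfying the strict OPC model."""
    rng = random.Random(seed)
    sigma = 0.0
    while True:
        sigma += step
        noisy = [[v + rng.gauss(0, sigma) for v in r] for r in rows]
        if not is_strict_opc(noisy):
            return sigma

rows = make_opc(n_objects=20, n_attrs=8)
print(turning_point(rows))  # strict OPC breaks at a small noise level
```

Running the same sweep against an AOPC check instead of `is_strict_opc` would locate the AOPC turning point for comparison.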

In this section, we propose an algorithm to mine AOPCs from a given dataset. Our algorithm has two phases: Phase 1 finds all valid OPCs in the dataset, and Phase 2 mines AOPCs from the result of Phase 1. The general idea is inspired by the observation made in the introduction: as shown in Figure 1, the presence of noise breaks original clusters into smaller, overlapping ones. A natural approach is to reverse this process, i.e. first find all small OPCs and then merge them to recover the true clusters.

Given a dataset *D* and user-specified minimum size thresholds (*s _{min}* among them), Phase 1 mines all maximal OPCs in *D* using the OPC algorithm of [6].

We made some modifications to the original algorithm so that it works well with Phase 2. The major changes are as follows. First, the original algorithm has a pre-processing phase which groups attributes with similar values together. This process not only introduces extra computation but also risks missing some OPCs. To find more OPCs as a good foundation for Phase 2, we remove this pre-processing phase. Second, the original algorithm returns a set of OPCs without any order. However, as we explain later, organizing the OPCs so that similar OPCs are adjacent to each other is important for the efficiency of Phase 2. Since the OPC algorithm traverses the order space in a depth-first manner, OPCs generated in consecutive steps are likely to have similar orders and thus be similar to each other. Figure 3 illustrates this with an example showing the search process over a set of three attributes {*a*, *b*, *c*}. In the depth-first search, similar attribute orders such as *ab*, *abc*, *ac* and *acb* are traversed in consecutive steps. As the total number of attributes increases, this property becomes more prominent. To take advantage of this temporal locality of similar OPCs generated during the traversal, we use a FIFO queue to store the OPCs in the order in which they are generated. Due to space limitations, we cannot cover every detail here; for more details of the OPC algorithm, please refer to [6].

In Section IV.B, we showed that an OPC corrupted by noise can be modelled by the AOPC model. However, generating all AOPCs is NP-hard, so we propose an efficient greedy algorithm that generates significant AOPCs through a recursive merging process. The time complexity of this algorithm is polynomial in the number of OPCs found in Phase 1. Based on this basic algorithm, we also propose an enhanced version with a hierarchical merge scheme which is much faster. Experiments demonstrate that the AOPCs generated by the enhanced algorithm are as significant as those found by the basic algorithm.

The basic process of Phase 2 is to merge smaller AOPCs (initially OPCs) into bigger ones. To merge two AOPCs, we first take the union of their object sets and then construct a common super-order of their orders as the consensus order of the new AOPC. Among all common super-orders, the one with the highest support is selected. The merge result is a valid AOPC only if both the consistency and supporting criteria of DEFINITION 4.1 are satisfied. To verify them, we need to (1) compute the LCS between the super-order and the order of every object in the union, to check whether the length of the LCS is at least a *δ _{c}* fraction of the length of the super-order; and (2) confirm that the super-order is supported by at least a *δ _{s}* fraction of the objects in the union.
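Constructing a common super-order is the shortest-common-supersequence problem, which can be solved with the same LCS table. The sketch below returns one shortest super-order; the actual algorithm selects, among all common super-orders, the one with the highest support:

```python
def common_super_order(o1, o2):
    """One shortest common super-order of two orders, recovered by
    backtracking through the LCS dynamic-programming table."""
    m, n = len(o1), len(o2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = (dp[i-1][j-1] + 1 if o1[i-1] == o2[j-1]
                        else max(dp[i-1][j], dp[i][j-1]))
    out, i, j = [], m, n
    while i > 0 and j > 0:
        if o1[i-1] == o2[j-1]:          # shared attribute: emit once
            out.append(o1[i-1]); i -= 1; j -= 1
        elif dp[i-1][j] >= dp[i][j-1]:  # keep o1's attribute
            out.append(o1[i-1]); i -= 1
        else:                           # keep o2's attribute
            out.append(o2[j-1]); j -= 1
    out.extend(reversed(o1[:i]))
    out.extend(reversed(o2[:j]))
    return "".join(reversed(out))

# Orders from two overlapping OPCs (illustrative):
print(common_super_order("abd", "abc"))  # abcd
```

Both input orders are subsequences of the result, so any object following the super-order follows both original orders.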

In this section, we present a technique called *prefiltering*, which can exclude most AOPC pairs that cannot be merged, in time *O*(*s _{min}*) per AOPC pair. It is much faster than performing the full test discussed above.

If C is an AOPC, then all objects that follow o(C) are in core(C).

Suppose that *C* is generated from *C _{n}* and

A necessary condition under which two given AOPCs C_{1} and C_{2} can be merged is:

$$\frac{\mid \mathit{core}({C}_{1})\cap \mathit{core}({C}_{2})\mid}{\mid {C}_{1}\cup {C}_{2}\mid}>{\delta}_{s}$$

(3)

Suppose that *C*_{1} and *C*_{2} can be merged, and *C* is the AOPC after merge. For any object in *core*(*C*), it follows *o*(*C*). Since *o*(*C*) is a super-order of both *o*(*C*_{1}) and *o*(*C*_{2}), this object follows *o*(*C*_{1}) and *o*(*C*_{2}) as well. From LEMMA 5.1, we know that this object is in *core*(*C*_{1}) and *core*(C_{2}) at the same time. Thus,

$$\mid \mathit{core}(C)\mid \le \mid \mathit{core}({C}_{1})\cap \mathit{core}({C}_{2})\mid $$

(4)

During the merge, the object set of *C* is generated by taking the union of the object sets of *C*_{1} and *C*_{2}. Thus,

$$\mid C\mid =\mid {C}_{1}\cup {C}_{2}\mid $$

(5)

Finally, according to the definition of AOPC, we have

$$\frac{\mid \mathit{core}(C)\mid}{\mid C\mid}>{\delta}_{s}$$

(6)

Therefore, Eq. (4)–Eq. (6) together imply that Eq. (3) holds.

Based on THEOREM 5.2, we propose a linear-time test called FILTER_TEST for each AOPC pair, shown in Figure 4. Each AOPC pair takes this test before the merge routine; pairs failing it are not merged. Since this test requires only an intersection and a union of two sets, its time complexity is linear in the size of the AOPCs. In addition to FILTER_TEST, COROLLARY 5.3 suggests that we should exclude an AOPC from future merge attempts if its FILTER_TEST with all other AOPCs fails. With FILTER_TEST, the algorithm speeds up substantially, as we will show in the experiment section.
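FILTER_TEST itself reduces to one set intersection and one union. A minimal sketch of Eq. (3), with illustrative object sets:

```python
def filter_test(core1: set, objs1: set, core2: set, objs2: set,
                delta_s: float) -> bool:
    """Necessary condition of THEOREM 5.2 (Eq. 3): the overlap of the
    two cores, relative to the merged object set, must still exceed
    delta_s. A False result means the pair can never merge validly."""
    return len(core1 & core2) / len(objs1 | objs2) > delta_s

# Two candidate AOPCs whose cores mostly overlap (illustrative):
c1, c2 = {"w", "x", "y"}, {"x", "y", "z"}
print(filter_test(c1, c1, c2, c2, delta_s=0.4))  # True: 2/4 > 0.4
```

Because the test never computes an LCS, it costs far less than the full consistency/supporting check it guards.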

During the merge process, if the FILTER_TEST for an AOPC C with every other AOPC fails, C need not be considered for future merges.

From THEOREM 5.2 we know that the core of a newly merged AOPC is a subset of the intersection of the cores of the two AOPCs before the merge. After each subsequent merge, the size of the core decreases while the size of the AOPC increases. Thus, the ratio on the left-hand side of Eq. (3) decreases monotonically after each merge. Once it falls below *δ _{s}* in the FILTER_TEST between *C* and every other AOPC, no future merge involving *C* can satisfy Eq. (3), so *C* can be safely excluded.

We propose a greedy algorithm consisting of a series of iterations; at each iteration, the algorithm selects and merges the best pair of AOPCs. If two AOPCs *C*_{1} and *C*_{2} pass both the FILTER_TEST and the full test, we compute their *adhesion* (AD value) as follows:

$$AD({C}_{1},{C}_{2})=\frac{\mid \mathit{core}(C)\mid}{\mid C\mid}\times \frac{\mid \mathit{LCS}(o({C}_{1}),o({C}_{2}))\mid}{\mid o(C)\mid}$$

(7)

where *C* denotes the AOPC that would result if *C*_{1} and *C*_{2} were merged. The intuition is that the more objects support the new super-order, and the more similar *o*(*C*_{1}) and *o*(*C*_{2}) are, the more likely it is that *C*_{1} and *C*_{2} should be merged.
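Eq. (7) can be computed directly once the candidate super-order is known. A sketch with illustrative sizes and orders:

```python
def lcs_len(a, b):
    """Longest-common-subsequence length (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def adhesion(core_size, merged_size, o1, o2, super_order):
    """AD value of Eq. (7): the support ratio of the merged AOPC times
    the similarity of the two consensus orders, relative to the length
    of the new super-order."""
    return (core_size / merged_size) * (lcs_len(o1, o2) / len(super_order))

# Illustrative merge: 6 objects in the union, 4 supporting the new
# super-order "abcde" built from consensus orders "abcd" and "abce".
print(adhesion(4, 6, "abcd", "abce", "abcde"))  # ≈ 0.4
```

Both factors lie in (0, 1], so pairs with large, well-supported overlaps rise to the top of the greedy selection.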

Initially, the algorithm starts from the OPCs found in Phase 1. It prefilters most AOPC pairs that cannot be merged; for the remaining pairs, it computes the adhesion of those that pass the full test. Among them, the AOPC pair with the highest adhesion is merged into a new AOPC. At each subsequent iteration, these operations need only be performed between the newly generated AOPC and the other existing AOPCs. This process iterates until no new AOPC can be generated. The pseudo code is shown in Figure 5.

To analyse the complexity of this algorithm, assume that *N* OPCs in total are found in Phase 1. Before the first merge, the algorithm performs an operation on each AOPC pair, for *O*(*N*^{2}) operations in total. Each operation is either a prefiltering test or a full test (the full test is needed only if the pair passes the prefiltering test). After that, at each iteration, operations need only be performed between the newly generated AOPC and the other existing AOPCs, so there are *O*(*N*) operations per iteration. To find the largest *AD* value efficiently, we keep all *AD* values in a priority queue, whose push and pop operations cost *O*(lg*N*). Since each merge decreases the number of AOPCs by one, there are at most *O*(*N*) iterations. So the complexity of this algorithm is *O*(*N*^{2}*m*+*N*^{2}(lg*N*+*m*)) = *O*(*N*^{2}(lg*N*+*m*)), where *m* denotes the *average* cost of a single operation on an AOPC pair. If a pair can be prefiltered, this cost is only *O*(*s _{min}*); otherwise it is the much higher cost of a full test.

A possible way to speed up the algorithm is a hierarchical merge scheme. In such a scheme, we first partition the OPCs generated by Phase 1 into 2^{*n*} groups at the initial level, Level *n*, and merge AOPCs within each group. When no further merge is possible, adjacent groups are combined pairwise to form the 2^{*n*−1} groups of Level *n*−1, and merging continues within each new group. The process repeats until a single group remains at Level 0.

The time complexity of this scheme is at most 2*p*^{2*n*+2}/(2*p*^{2}−1) times the cost of the basic algorithm, where *n* is the number of hierarchical levels and *p* is the *average* fraction of AOPCs remaining after the merge in each group, i.e. the ratio between the number of AOPCs after and before the merge. The analysis details can be found in the appendix. This result suggests that a smaller *p* value makes the hierarchical merging scheme more effective. It is therefore desirable to group AOPCs that are likely to be merged, so that fewer AOPCs are carried to the next level. Thus, when initially partitioning the OPCs at Level *n*, we want to put similar OPCs in the same group. As discussed in Section V.A, the search order and FIFO queue employed in Phase 1 naturally support this objective. The pseudo code of the hierarchical merge algorithm is shown in Figure 7.
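The hierarchical scheme can be sketched as follows. Here `merge_group` stands in for the basic greedy merge of Section V.B, and the grouping/pairing logic is our own simplified reading of Figure 7:

```python
def hierarchical_merge(clusters, merge_group, levels):
    """Hierarchical merge scheme: start from 2**levels groups, merge
    within each group, then pair up adjacent groups and repeat until a
    single group remains. `merge_group` is any routine that merges a
    list of AOPCs and returns the survivors (e.g. the basic greedy
    algorithm of Section V.B)."""
    k = 2 ** levels
    size = max(1, -(-len(clusters) // k))  # ceil-divide into k groups
    groups = [clusters[i:i + size] for i in range(0, len(clusters), size)]
    while len(groups) > 1:
        groups = [merge_group(g) for g in groups]
        # Pair up adjacent groups for the next level; similar OPCs are
        # adjacent thanks to the FIFO order produced by Phase 1.
        groups = [groups[i] + (groups[i + 1] if i + 1 < len(groups) else [])
                  for i in range(0, len(groups), 2)]
    return merge_group(groups[0])

# With a trivial "merge" that deduplicates, the scheme shrinks each
# group before its contents reach the next level (illustrative input):
opcs = [1, 1, 2, 2, 3, 3, 4, 4]
print(hierarchical_merge(opcs, lambda g: sorted(set(g)), levels=2))
# [1, 2, 3, 4]
```

Because each group is shrunk before it is combined with its neighbour, the quadratic merge cost is paid mostly on small groups, which is where the 2*p*^{2*n*+2}/(2*p*^{2}−1) bound comes from.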

In this section, we study the performance of our algorithm through a series of experiments. To make the results realistic and convincing, we conduct all experiments on a real gene expression dataset: the yeast cell cycle data from [9]. Each row of this dataset records the expression levels of a gene across 18 time points; there are 799 genes in total. All experiments were run on a 3.4GHz Dell PC with 2GB of memory.

First, we study the influence of the input parameters on the mining results and provide some guidance on how to choose them properly. Four parameters in total need to be specified in our algorithm: in Phase 1, *s _{min}* and a companion size threshold determine the minimum size of the OPCs mined; in Phase 2, *δ _{c}* and *δ _{s}* control the degree of noise tolerance.

The experiments in this section are run with *δ _{c}* = 0.6 unless otherwise noted.

Figure 10 gives the percentage of AOPC pairs that are prefiltered under different (*δ _{c}*, *δ _{s}*) settings.

Finally, we evaluate the quality of the AOPCs found by our algorithm. We compare their significance with that of the OPCs found in Phase 1, since these are the result under the strict model. As mentioned in Section VI.A, we use *s _{min}* = 60 for this comparison.

The Distributions of Clusters Found by Three Methods in Different Intervals of Number of Strongly Associated Gene Categories. (#) behind Each Method Indicates the Total Number of Clusters Found by It

We also trace the merge process of one AOPC and compare its biological significance with that of all six OPCs from which it was merged. The result is summarized in Table IV. The AOPC has stronger associations with 5–7 categories (highlighted in bold) than each OPC individually. This demonstrates that the AOPC found by our algorithm is biologically more significant than any single OPC it is merged from, and thus that the merging process is an effective way to discover more significant clusters.

In this paper, we study the problem of subspace OPC mining with noise tolerance; to the best of our knowledge, no previous work has addressed this challenging problem. We propose a new model called the approximate order preserving cluster (AOPC) and show that it is more robust to noise than the OPC model. In addition to its robustness, the model has several useful properties that allow AOPCs to be mined with an efficient greedy algorithm. We propose a prefiltering technique which quickly excludes most AOPC pairs that cannot be merged, as well as a hierarchical merging scheme to further improve the execution time. Experiments on real gene expression data demonstrate that these techniques effectively speed up the mining process and that the AOPCs are biologically more significant than the OPCs. Finally, although our experiments are run on gene expression data, the method presented in this paper can be applied widely beyond gene expression analysis.

Suppose Phase 2 starts from *N* OPCs found in Phase 1 and the time used by the basic algorithm is *T _{b}*. As discussed in Section V.B.3, the running time of the basic algorithm can be written as *T _{b}* = *c*×*N*^{2}lg*N* for some constant *c* that absorbs the average per-operation cost.

At Level *n*, there are 2^{*n*} groups, each containing *N*/2^{*n*} OPCs. The total time *T _{n}* spent merging within the groups at Level *n* is therefore:

$$\begin{array}{l}{T}_{n}={2}^{n}\times c\times {(N/{2}^{n})}^{2}lg(N/{2}^{n})\\ =c\times {N}^{2}(lgN-lg{2}^{n})/{2}^{n}\\ <c\times {N}^{2}lgN/{2}^{n}={T}_{b}/{2}^{n}\end{array}$$

(a.1)

When the merge at Level *n* terminates, each group has *p*×*N*/2^{*n*} AOPCs. The algorithm then proceeds to Level *n*−1, where adjacent groups are combined pairwise into 2^{*n*−1} groups, each containing *p*×*N*/2^{*n*−1} AOPCs. By the same argument as above:

$${T}_{n-1}<{2}^{n-1}\times {p}^{2}\times {T}_{b}/{({2}^{n-1})}^{2}={p}^{2}\times {T}_{b}/{2}^{n-1}$$

(a.2)

For Level *n-k*, we have:

$${T}_{n-k}<{p}^{2k}\times {T}_{b}/{2}^{n-k},\phantom{\rule{0.38889em}{0ex}}k=0,1,\dots ,n$$

(a.3)

Taking the sum of *T _{n−k}* over all levels *k* = 0, 1, …, *n*, the total time *T _{h}* of the hierarchical scheme is bounded by:

$$\begin{array}{l}{T}_{h}=\sum _{k=0}^{n}{T}_{k}<\sum _{k=0}^{n}{p}^{2k}\times \frac{{T}_{b}}{{2}^{n-k}}\\ =\frac{{T}_{b}}{{2}^{n}}\times \sum _{k=0}^{n}{(2{p}^{2})}^{k}\\ =\frac{{T}_{b}}{{2}^{n}}\times \frac{{(2{p}^{2})}^{n+1}-1}{2{p}^{2}-1}<{T}_{b}\times \frac{2{p}^{2n+2}}{2{p}^{2}-1}\end{array}$$

(a.4)

Since 0 < *p* < 1, the bound in Eq. (a.4) decreases as *n* increases. Eq. (a.4) thus suggests that more hierarchical levels and a smaller *p* make the advantage of the hierarchical merging scheme more prominent.

^{1}That is, the OPC is not a subset of any other OPC.

1. Parsons L, Haque E, Liu H. Subspace clustering for high dimensional data: a review. ACM SIGKDD Explorations Newsletter. 2004;6(1):90–105.

2. Ben-Dor A, Chor B, Karp RM, Yakhini Z. Discovering local structure in gene expression data: the order-preserving submatrix problem. Journal of Computational Biology. 2003;10(3/4):373–384. [PubMed]

3. Cheng Y, Church G. Biclustering of expression data. Proceedings of the 8th International Conference on Intelligent System for Molecular Biology; 2000.

4. Wang H, Wang W, Yang J, Yu P. Clustering by pattern similarity in large data sets. Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD); 2002. pp. 394–405.

5. Xu X, Liu Y, Tung KH, Wang W. Mining shifting-and-scaling co-regulation patterns on gene expression profiles. Proceedings of the 22nd IEEE International Conference on Data Engineering (ICDE); 2006. pp. 89–98.

6. Liu JZ, Wang W. OP-Cluster: Clustering by tendency in high dimensional space. Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM); 2003. pp. 187–194.

7. Liu JZ, Paulsen S, Sun X, Wang W, Nobel A, Prins J. Mining approximate frequent itemsets in the presence of noise: algorithm and analysis. Proceedings of the 6th SIAM Conference on Data Mining (SDM); 2006. pp. 405–416.

8. Bergroth L, Hakonen H, Raita T. A survey of longest common subsequence algorithms. SPIRE. 2000:39–48.

9. Spellman, et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell. 1998;9:3273–3297.

10. Fisher RA. On the interpretation of χ^{2} from contingency tables, and the calculation of P. Journal of the Royal Statistical Society. 1922;85(1):87–94.

11. Gene ontology consortium. www.geneontology.org.
