DNA arrays provide simultaneous readouts for the expression levels of thousands of genes in samples (
DeRisi et al., 1996). Innovative techniques to efficiently and effectively analyze these rapidly growing data are required, which will have a significant impact on the field of bioinformatics.
The raw array images are transformed into gene expression matrices where the rows usually represent genes and the columns represent samples. It is meaningful to cluster both genes and samples in gene expression data (
Brazma and Vilo, 2000). Co-expressed genes can be grouped based on their expression patterns (
Eisen et al., 1998) and in such
gene-based clustering, the genes are treated as the objects, while the The samples can be partitioned into homogeneous groups and each group may correspond to a particular macroscopic phenotype, such as the present or absent clinical syndromes or cancer types (
Golub et al., 1999). Thus,
sample-based clustering regards the samples as the objects and the genes as the attributes. To group samples to reveal their macroscopic phenotypes is regarded as the process of empirical sample pattern detection.
In typical array datasets, the volume of genes and the number of samples are very different, e.g. 10
1–10
2 samples versus 10
3–10
4 genes. Gene- and sample-based methods therefore face very different challenges. Techniques that are effective for gene-based clustering, e.g. CAST (
Ben-Dor et al., 1999), MST (
Xu et al., 2002), HCS (
Hartuv and Shamir, 2000), and CLICK (
Shamir and Sharan, 2000), are not necessarily adequate for analyzing samples.
The existing methods of grouping samples fall into two major categories: supervised analysis and unsupervised analysis. The supervised approach assumes that phenotype information is attached to the samples and that biological samples are labeled, e.g. as being diseased versus normal. The major supervised analysis methods include the neighborhood analysis (
Golub et al., 1999), the support vector machine (
Brown et al., 2000), the tree harvesting method (
Hastie et al., 2001), the decision tree method (
Zhang et al., 2001), the genetic algorithm (
Li et al., 2001), the artificial neural networks (
Khan et al., 2001), a variety of statistical approaches (
Jiang et al., 2001;
Thomas et al., 2001) and rank-based methods (
Park et al., 2001). In these methods, a subset of samples is used as the training set to select a small percentage of informative genes (around 50–200) which manifest the phenotype distinction of the training samples: finally, the whole set of samples is classified based on the selected informative genes.
We will focus on unsupervised sample pattern detection which assumes no phenotype information being assigned to any sample. Since the initial biological identification of sample phenotypes has been slow, typically evolving through years of hypothesis-driven research, automatically discovering sample pattern presents a significant contribution in array data analysis (
Golub et al., 1999). Unsupervised sample pattern detection is much more difficult than supervised manner because the training set of samples, which can be utilized as a reference to guide informative gene selection, is not available. Many mature statistical methods such as
t-test,
Z-score (
Thomas et al., 2001), and Markov filter (
Xing and Karp, 2001) cannot be applied without the phenotypes of samples being known in advance. Thus identifying informative genes and empirical partition of samples become very challenging problems.
In this paper, we tackle the problem of unsupervised sample pattern detection by developing a novel analysis model called empirical pattern detection (ESPD) which includes a series of statistics-based metrics and iterative adjustment. We claim the following contributions.
- A formalized problem statement of ESPD of sparse high-dimensional datasets is proposed. Major differences from traditional clustering or recent subspace clustering problems are elaborated.
- A series of statistics-based metrics incorporated in unsupervised empirical pattern discovery are introduced. These metrics delineate local pattern qualities to coordinate between sample pattern discovery and informative genes selection.
- An iterative adjustment algorithm is presented to approach the optimal solution. The method dynamically manipulates the relationship between samples and genes while conducting an iterative adjustment to approximate the informative space and the empirical pattern simultaneously.
- An extensive experimental evaluation over real datasets is presented. It shows that our method is both effective and efficient and outperforms the existing methods.
The remainder of this paper is organized as follows. Section 2 gives the problem description and the pattern quality metrics while the algorithm is presented in Section 3. Experimental results appear in Section 4. The related work and concluding remarks are in Section 5.