Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC2607484

Formats

Article sections

Authors

Related links

Bioinformatics. Author manuscript; available in PMC 2008 December 23.

Published in final edited form as:

PMCID: PMC2607484

NIHMSID: NIHMS75131

The publisher's final edited version of this article is available free at Bioinformatics

See other articles in PMC that cite the published article.

DNA arrays provide a broad snapshot of the state of the cell by measuring the expression levels of thousands of genes simultaneously. Visualization techniques can enable the exploration and detection of patterns and relationships in a complex data set by presenting the data in a graphical format in which the key characteristics become more apparent. The dimensionality and size of array data sets however present significant challenges to visualization. The purpose of this study is to present an interactive approach for visualizing variations in gene expression profiles and to assess its usefulness for classifying samples.

The first Fourier harmonic projection was used to map multi-dimensional gene expression data to two dimensions in an implementation called VizStruct. The visualization method was tested using the differentially expressed genes identified in eight separate gene expression data sets. The samples were classified using the oblique decision tree (OC1) algorithm to provide a procedure for visualization-driven classification. The classifiers were evaluated by the holdout and the cross-validation techniques. The proposed method was found to achieve high accuracy.

Detailed mathematical derivation of all mapping properties as well as figures in color can be found as supplementary on the web page http://www.cse.buffalo.edu/DBGROUP/bioinformatics/supplementary/vizstruct. All programs were written in Java and Matlab and software code is available by request from the first author.

DNA arrays allow simultaneous measurements of the expression levels of thousands of genes to be made from a single sample. The method has been used to address a range of biomedical research problems including the identification of malignant tissues, the delineation of drug responses, and the discovery of disease subtypes (Alizadeh *et al*., 2000; Alon *et al*., 1999; Golub *et al*., 1999; Hedenfalk *et al*., 2001; Khan *et al*., 2001; Virtaneva *et al*., 2001).

Visualization can facilitate the discovery of structures, features, patterns and relationships in data and exploratory visualization is an important component in hypothesis generation. The most prominent visualization-enhanced analysis tool for gene expression data is TreeView (Eisen *et al*., 1998), which provides a user-friendly computational and graphical environment for assessing the results from hierarchical clustering. The graphical presentations from TreeView include a dendrogram to reflect the distance relationships between clusters and a heat plot to convey gene expression changes visually between samples. The heat plot can be viewed as a variation of the parallel co-ordinates plot in which color is used to convey dimension values.

Multi-dimensional scaling (MDS) is an important class of visualization techniques that use optimization to preserve distance relationships between points in the multi-dimensional space in the two-dimensional mapping required for effective visualization. Ewing and Cherry (2001) used Sammon’s mapping, a variant of MDS, as a visualization approach for gene expression data sets. More recently, Hautaniemi *et al*. (2003) have presented a heat map-based strategy for visualizing the U-matrix from self-organizing maps (SOM).

Here, we present an alternative mapping for multi-dimensional data that is based on the first harmonic of the discrete Fourier transform (DFT). The mapping, which is analogous to radial co-ordinate visualization techniques such as Radviz (Hoffman *et al*., 1997), has interesting properties and preserves certain key characteristics of a variety of data sets, including those from gene expression profiling experiments. An interactive visualization tool, VizStruct, was implemented to perform visualization and sample classification.

The mapping converts multi-dimensional data into two-dimensions for visualization. An array-derived profile for *N* genes results in a *N*-dimensional data point containing real valued numerical data. For mapping, we view each data point as a discrete-time real signal on *N* evenly distributed time points (Cadzow and Landingham, 1985) and each *N*-point signal (data point) is represented as an indexed sequence of *N* real numbers denoted by **x**[*n*]. Each term of **x**[*n*] is denoted by *x*[*n*].

The two-dimensional mapping of a *N*-dimensional data point is defined by real and imaginary components of:

(1)

where **W*** _{N}* =

On closer inspection, the mapping of Equation (1) is seen to be identical to the first harmonic of the DFT (Morrison, 1994). The DFT is defined by:

(2)

Each harmonic * _{k}* in the DFT is a measure of the

Equivalently, the complex numbers representing the individual terms of FFHP in Equation (1) can be expressed in terms of magnitude *r* and phase *θ* to provide a useful geometric interpretation of the mapping. For a data point with *N* dimensions, the complex exponential divides a unit circle centered at the origin of the complex plane into *N* equally spaced angles. The value of the first dimension is projected on the radial line corresponding to *θ* = 0 and, similarly, the value of the *k*-th dimension is projected on to the radial line corresponding to the *θ* = 2*π*(1 − *k*)/*N* radians. The overall two-dimensional FFHP mapping is the complex sum of all *N* projections from a data point. Figure 1 illustrates the geometric interpretation for a point containing eight dimensions.

Geometric interpretation of the FFHP. A normalized eight-dimensional data point is shown on the right by the stem plot. The twiddle factor divides the unit circle centered at the origin into eight equal angles and each dimension of the data point is projected **...**

The FFHP has several useful properties that preserve the correlation between dimensions in the multi-dimensional data point. A detailed mathematical description of the properties is available as supplementary material.

The FFHP approach is implemented as a visualization tool called VizStruct, which is written in Java and Matlab. VizStruct is capable of mapping multi-dimensional data and can perform binary classification and multi-class classification.

The dimension tour is a feature of VizStruct that allows the user to interact with the data via dynamic animations. It is analogous to the grand tour (Cook *et al*., 1995) and the user interacts with the visualization by changing the dimension parameter *w _{n}*, for each dimension in the following variation of the FFHP mapping:

(3)

The default value for each *w _{n}* is 0.5 and the individual parameters can be changed over the range −1 to +1 either manually or systematically using the program. Because no two-dimensional mapping can capture all the interesting properties of the original multi-dimensional space, two points that are close in the visualization can theoretically be far apart in the multi-dimensional space. The dimension tour, which creates animations that explore dimension parameter space, can reveal structures in the multi-dimensional input that were hidden due to overlap with other points in the visualization.

Figure 2 illustrates the capabilities of the dimension tour using a synthetic data set containing 25 000 points in three dimensions. At the default settings of dimensional parameters (Fig. 2A), five clusters are apparent. However, during the course of the animations (Fig. 2B and C), the multi-layered structures of the original five clusters become increasingly apparent.

The approach was tested using Fisher’s *iris* data set and several gene-array data sets. The *iris* data set is widely used as a benchmark for classification and clustering (Fisher, 1936). The *multiple sclerosis* (MS) data set consists of array-derived gene expression profiles that were provided by our collaborators in the Department of Pharmaceutical Sciences and in the Department of Neurology. The data set contains two pair-wise group comparisons of interest. The first data subset, ‘MS versus Controls’, contains array data from 15 MS patients and 15 age- and sex-matched controls while the second subset is referred to as ‘MS–IFN’ because it contains array data from 14 MS patients prior to and 24 h after interferon (IFN)-*β* treatment. The remaining gene-array data sets were obtained from the publicly available sources listed in Table 1.

Informative genes are genes whose expression levels are significantly different across different sample classes. Neighborhood analysis (Golub *et al*., 1999) was used to create subsets of informative genes for the *MS versus Controls*, *MS–IFN* and *Leukemia-A* data sets. The Significance Analysis of Microarrays (SAM) algorithm (Tusher *et al*., 2001) was used for identifying subsets of informative genes for the remaining data sets.

The oblique decision tree approach, which constructs straight lines of arbitrary slope to separate classes of data, was used to construct classifiers in the visualizations (Murthy *et al*., 1993). The Oblique Classifier 1 (OC1) algorithm (Murthy *et al*., 1994), which combines hill climbing with two forms of randomization, was used to identify good oblique splits at each node.

The accuracy of the classifier was evaluated using the holdout and leave-one-out cross-validation techniques. The holdout technique was used for the larger data sets, and involves constructing the classifiers using a subset of training samples and assessing the accuracy using a mutually exclusive subset of the test samples. The leave-one-out method was used for the smaller data sets. The classifiers were constructed using all the data points except one, which was withheld and used for testing. The step of classifier construction and testing were repeated for all the data points in the data set.

The accuracy of the classifier predictions was assessed using overall error rate (the ratio of the number of errors to the total number of test samples) and the *κ*-coefficient. The *κ*-coefficient (Cohen, 1960) is defined as:

(4)

where *p*_{0} is the proportion of cases in which the raters agree and *p _{c}* is the proportion of cases for which agreement is expected only by chance on the assumption that the marginal distributions of ratings are independent.

The *iris* data set, which has been widely employed to assess discriminant and cluster analysis algorithms, was used as a benchmark data set to obtain a preliminary assessment of the usefulness of the FFHP approach prior to analysis of the more challenging, high-dimensional array data sets. The *iris* data set contains 150 observations of four attributes (sepal length and width, petal length and width) from three iris species (50 observations each of iris setosa, iris versicolor, iris virginica). Figure 3A and B show the data set for two-dimensional parameter settings. The three clusters are clearly apparent from the visualization and segregate by species. The iris setosa cluster (plus symbols) is clearly separated from the cluster formed by iris virginica (circles) and iris versicolor (stars). The separation between virginica and versicolor improves in Figure 3B compared with Figure 3A.

Visualization of the *iris* data set in VizStruct. The iris setosa cluster (plus symbols) are clearly separated from the cluster formed by iris virginica (circles) and iris versciolor (stars) in all two panels. The dimension parameters for (**A**) and (**B**) were **...**

These findings demonstrated the feasibility of using the FFHP mapping for array-derived gene expression profiles.

Figure 4A presents the visualization results from the *MS–IFN* data set, which contains gene expression profiles from 14 MS patients before and 24 h after IFN treatment while Figure 4B presents the results from the *MS versus Controls* data set which compares 15 MS patients to age- and sex-matched controls. Subsets of 88 informative genes were identified from the 4132 genes in each data set. The informative genes were mapped using VizStruct; oblique classifiers then were constructed and evaluated using the leave-one-out method. The pre-treatment group (plus symbols) in Figure 4A segregated to upper half of the visualization field while the post-treatment group (circle) occupied the lower half. The results for MS patients (plus symbols) and controls (circles) in Figure 4B were generally similar. The pre-treatment and post-treatment groups in Figure 4A were clearly separated while the separation between MS patients and controls in Figure 4B less distinct. Table 2 shows that the accuracy of the prediction was 100% (the *κ*-coefficient was 1.00) for the *MS–IFN* data set and 90% (the *κ*-coefficient was 0.80) for the *MS versus Controls* data set.

Two-class classifications. (**A**) shows the results from mapping the *MS–IFN* data set. The pre-treatment samples are shown in plus symbols, the post-treatment samples are in circles and a test sample is shown in a filled circle highlighted with arrow. **...**

The *leukemia-A* data set (Golub *et al*., 1999) consists of expression values of 7129 genes from 72 patients. Fifty informative genes were identified using the neighborhood analysis approach of the original report. The training set consists of 27 acute lymphoblastic leukemia (ALL) and 11 acute myeloid leukemia (AML) patients’ samples (plus symbols and circles in Fig. 4C). An oblique classifier was constructed based on visualization of the training set, and the holdout method was used for classifier evaluation. The testing set contained 20 ALL and 14 AML samples (triangles and stars in Fig. 4C). The overall accuracy of the classifier was 88% and the *κ*-coefficient was 0.75.

The visualization of *colon cancer* data set (Alon *et al*., 1999) consists of gene expression profiles for 2000 genes from 22 normal samples and 39 cancerous colon tissues. The result from mapping 50 informative genes is summarized in Figure 4D. The upper part of the visualization field was dominated by normal colon tissue while cancerous colon tissues occupied the lower half. The overall accuracy was 89% and the *κ*-coefficient was 0.75.

Figure 4E shows the results from mapping the 100 informative gene subset of the *lymphoma* data set (Alizadeh *et al*., 2000). The germinal center lymphomas were relatively well separated from the activated B cell-like lymphomas. The robustness of the separation contributed to 100% (the *κ*-coefficient was 1.00) accuracy upon classification.

The performance of VizStruct for multi-class array-derived data sets was examined in the next series of experiments. The *leukemia-B* data set (Virtaneva *et al*., 2001) contains three classes and a 50-gene informative subset was mapped using VizStruct (Fig. 5A). The normal CD34+ samples segregated to the northwestern region of the visualization and were well separated from the AML samples. The samples from cytogenetically normal AML (AML–CN) patients and trisomy 8 AML (AML+8) patients segregated, albeit less distinctly. This was reflected in the lower accuracy of 70% and the *κ*-coefficient was 0.55.

Multi-class classifications. (**A**) shows the results from mapping the *leukemia-B* data set. The plus symbols are samples from AML patients with trisomy 8 patients (AML + 8), the circles are samples of AML patients with normal cytogenetics (AML–CN) **...**

The FFHP mapping of 50-gene informative subset of the *BRCA* data set (Hedenfalk *et al*., 2001) resulted in a clear separation of the BRCA1, BRCA2 and sporadic classes (Fig. 5B) and the accuracy was 100% (the *κ*-coefficient was 1.00).

For the four-class *SRBCT* data set (Khan *et al*., 2001), a 100-gene informative subset was mapped (Fig. 5C). The four classes of small round blue cell tumors segregated to different regions of the visualization field. The accuracy of the oblique classifiers was of 95% and the *κ*-coefficient was 0.93 (Table 2).

As a rule of thumb, *κ* values of 0.41–0.60 indicate moderate agreement, values in the range 0.61–0.80 indicate substantial agreement and values in the range 0.81–1.0 indicate almost perfect agreement (Landis and Koch, 1977). The summary in Table 2 indicates that for *MS–IFN*, *MS–Control*, *lymphoma*, *BRCA* and *SRBCT* data sets, almost perfect agreement was obtained using VizStruct. For the *leukemia-A* and *colon cancer* data sets substantial agreement was obtained. However, only moderate agreement was achieved in for the *Leukemia-B* data set. Taken together, these experimental results highlight the effectiveness of the FFHP mapping for visualization-driven sample space classification of both binary and multi-class gene-array data sets.

Multi-dimensional scaling is a competing approach for visualizing multi-dimensional data. We compared FFHP to Sammon’s mapping (Sammon, 1969), variant of MDS that optimizes the following stress function *ε*:

(5)

where
is the distance between points *i* and *j* in the *N*-dimensional space and *d _{ij}* is the distance between

Figure 6A shows the Sammon’s mapping of the *iris* data set. A comparison of Figure 6A with Figure 3A reveals the extensive similarities between FFHP and Sammon’s mapping. For example, the iris setosa cluster is clearly separated from the virginica and the versicolor cluster and the versicolor cluster is closer to setosa than the viriginia. The relative locations of individual samples are also remarkably similar (data not shown).

Sammon’s mapping. (**A**) shows the results from Sammon’s mapping of the *iris* data set (iris setosa: plus symbols; iris versicolor: circles; and iris virginica: stars). (**B**) shows results from Sammon’s mapping of the informative genes **...**

Figure 6B is the Sammon’s mapping of the *SRBCT* data set. A comparison of Figure 6B with Figure 5C reveals extensive similarities; e.g. the relative locations of various tumor clusters are quite similar and the Burkitt’s lymphoma cluster is distal from the rhabdomyosarcomas.

Although, the FFHP uses an approach to mapping multidimensional data that is distinctively different from Sammon’s mapping, it yields results that are consistently comparable.

In this paper, we have explored the use of the FFHP mapping for visualizing and classifying multi-dimensional array data sets. The accuracy, which ranges from 70 to 100% with a *κ*-coefficient range of 0.75–1.0, demonstrated its promise for sample classification in gene-array data sets. The FFHP mapping results in two-dimensional visualizations that are identical to those of radial co-ordinate visualization techniques, e.g. Radviz (Hoffman *et al*., 1997). However, in place of the vector notation and the *spring paradigm* of Radviz, we have used a complex number notation. This substantive reformulation of the mapping provides valuable theoretical insights and allows several important properties of the mapping, including its relationship to the DFT, to be easily derived.

We assessed the robustness of the visualization approach by assessing its sample classification accuracy in the face of large perturbations in input data. The *leukemia-A* and *SR–BCT* data sets were perturbed by switching the training and test inputs. The accuracy for the *leukemia-A* data set with switching was 88% and it remained the same as that in Table 2. For the *SRBCT* data set, the accuracy after switching was 73%; we attribute the reduced accuracy to the small training sample size *(n* = 3*)* in the Burkitt’s lymphomas (BL) group. Using this operational approach, it became apparent that the robustness of the approach is dependent on several factors, including the intrinsic separation of the classes in the input array data, the sample size of the training set and the classifier algorithm. Generally, data sets with large relative separation between classes—large inter-class effect sizes—are least sensitive to perturbations of various types, e.g. changes in the position and angles of classifier lines, or small decreases in the training set size. Another factor is the training sample size: the statistical uncertainty with small samples sizes is large and accuracy diminishes once the classifiers overlap the prediction interval of the data. When the training sets become larger, the lines and their splits can be more accurately positioned to separate the classes. Classifiers of arbitrary shape may be more robust compared with the oblique classifiers when the distribution of points in the various classes in the two-dimensional mapping is not defined by a polygonal space partitioning. In a separate report (Zhang *et al*., 2003), we compared oblique classifiers to SOM (Kohonen, 1995): In *colon cancer* data set, the oblique classifier misclassified seven samples (Table 2) while a 4 × 3 unit SOM misclassified eight samples. Furthermore, the same seven samples misclassified by the oblique classifier were also misclassified by the SOM. The VizStruct approach readily admits more complex clustering and classification strategies but oblique classifiers appear effective and parsimonious and their simplicity may appeal to end-users in the life sciences.

Parallel co-ordinates and MDS represent competing approaches to FFHP for visualizing multi-dimensional data sets. Parallel co-ordinates visualization has obvious drawbacks: it becomes increasingly unreadable when the data size gets larger. The FFHP mapping was shown to yield results that were similar to those from Sammon’s mapping, a variant of MDS. Despite providing results that are mathematically optimal in some sense, MDS is not ideal for gene-array data because: (1) it provides a single final result and the user cannot intervene interactively during visualization and (2) the incremental addition of any single point requires a complete repetition of the optimization procedure and possibly extensive reorganization of all the previously mapped points to new locations. Secondarily, from a computational complexity standpoint, the parallel co-ordinates method requires relatively little computational effort because each dimension is plotted directly, while MDS requires time-consuming optimization procedures of time complexity *O*(*N*_{2}) or greater. The computational complexity of FFHP (*N* log *N*) is intermediate.

The FFHP is sensitive to dimension order and VizStruct has the capability to reorder dimensions so that class separation is enhanced. To determine whether the canonical reordering could cause misleading pseudo-classes to appear during visualization, a random data set containing 50 data points each with 100 dimensions was simulated; 20 points were arbitrarily assigned to one class and the remainder to another class followed by canonical dimension reordering. The visualization of the reordered data set did not suggest the presence of pseudo-classes.

The performance of VizStruct suggests that visualization may be capable of a larger and richer role than is currently appreciated. However, with multi-dimensional gene expression data, one also has to be always mindful of the curse of dimensionality (Bellman, 1961) and rigorously confirm any experimental findings in the two-dimensional mapping with appropriate quantitative techniques.

This work was supported by grants from the National Science Foundation and the grant RG3258A2 from the National Multiple Sclerosis Society.

- Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature. 2000;403:503–511. [PubMed]
- Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide array. Proc Natl Acad Sci USA. 1999;96:6745–6750. [PubMed]
- Bellman R. Adaptive Control Processes: A Guided Tour. Princeton University Press; Princeton, NJ: 1961.
- Cadzow JA, Landingham HF. Signals, Systems, and Transforms. Prentice-Hall, Inc; Englewood Cliffs, NJ: 1985.
- Cohen J. A coefficient of agreement for nominal scales. Ed & Psych Meas. 1960;20:37–46.
- Cook C, Buja A, Cabrera J, et al. Grand tour and projection pursuit. J Comput Graph Stat. 1995;2:225–250.
- Cooley JW, Tukey JW. An Algorithm for the machine calculation of complex Fourier series. Math Comput. 1965;19:297–301.
- Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA. 1998;95:14863–14868. [PubMed]
- Ewing RM, Cherry JM. Visualization of expression clusters using Sammon’s non-linear mapping. Bioinformatics. 2001;17:658–659. [PubMed]
- Fisher RA. The use of multiple measurements on taxonomic problems. Ann Eugenics. 1936;7:179–188.
- Golub TR, Slonim DK, Tamayo P, Huard C, Gassenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999;286:531–537. [PubMed]
- Hautaniemi S, Yli-Harja O, Astola J, et al. Analysis and visualization of gene expression microarray data in human cancer using self-organizing maps. Machine Learning: Special Issue on Methods in Functional Genomics. 2003 (To appear)
- Hedenfalk I, Duggan D, Chen Y, Radmacher M, Bittner M, Simon R, Meltzer P, Gusterson B, Esteller M, Kallioniemi OP, et al. Gene-expression profiles in hereditary breast cancer. N Eng J Med. 2001;344:539–548. [PubMed]
- Hoffman PE, Grinstein GG, Marx K, Grosse I, Stanley E. IEEE Visualization’97. Phoenix, AZ: 1997. DNA visual and analytic data mining; pp. 437–441.
- Khan J, Wei JS, Ringnér M, Saal LH, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu CR, Peterson C, et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med. 2001;7:673–679. [PMC free article] [PubMed]
- Kohonen T. Self-Organizing Maps, Springer Series in Information Sciences. Vol. 30. Springer; Berlin, Heidelberg, New York: 1995.
- Landis J, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159–174. [PubMed]
- Morrison N. Introduction to Fourier Analysis. John Wiley & Sons, Inc; New York, NY: 1994.
- Murthy SK, Kasif S, Salzberg S, Beigel R. National Conference on Artificial Intelligence. Washington, DC: 1993. OC1: a randomized induction of oblique decision trees; pp. 322–327.
- Murthy SK, Kasif S, Salzberg S. A system for induction of oblique decision trees. J Art Intell Res. 1994;2:1–32.
- Sammon JW. A nonlinear mapping for data structure analysis. IEEE Trans Comp. 1969;C-18:401–409.
- Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA. 2001;98:5116–5121. [PubMed]
- Virtaneva K, Wright F, Tanner S, et al. Expression profiling reveals fundamental biological differences in acute myeloid leukemia with isolated trisomy 8 and normal cytogenetic. Proc Natl Acad Sci USA. 2001;98:1124–1129. [PubMed]
- Zhang L, Tang C, Song YQ, Zhang A, Ramanathan M. VizCluster and its application on clustering gene expression data. Int J Distributed and Parallel Databases. 2003;13:79–97.

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's Canada Institute for Scientific and Technical Information in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |