In this paper, we have explored the use of the FFHP mapping for visualizing and classifying multi-dimensional array data sets. The classification accuracy, which ranged from 70 to 100% with a κ-coefficient range of 0.75–1.0, demonstrated its promise for sample classification in gene-array data sets. The FFHP mapping results in two-dimensional visualizations that are identical to those of radial co-ordinate visualization techniques, e.g. Radviz (Hoffman et al., 1997). However, in place of the vector notation and the spring paradigm of Radviz, we have used a complex number notation. This substantive reformulation of the mapping provides valuable theoretical insights and allows several important properties of the mapping, including its relationship to the DFT, to be easily derived.
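The correspondence between the spring (vector) form and the complex form can be illustrated with a minimal Python sketch. It assumes equally spaced anchors on the unit circle and a Radviz-style normalization by the sum of the inputs; the function names are hypothetical, and the sign of the exponent follows the usual DFT convention, so the two forms agree up to a reflection about the real axis.

```python
import cmath
import math


def radviz_vector(x):
    """Spring form: weighted average of anchors placed on the unit circle."""
    n = len(x)
    total = sum(x)
    px = sum(v * math.cos(2 * math.pi * j / n) for j, v in enumerate(x)) / total
    py = sum(v * math.sin(2 * math.pi * j / n) for j, v in enumerate(x)) / total
    return px, py


def first_harmonic(x):
    """Complex form: the k = 1 DFT coefficient, sum of x[j] * exp(-2*pi*i*j/n)."""
    n = len(x)
    return sum(v * cmath.exp(-2j * cmath.pi * j / n) for j, v in enumerate(x))


def normalized_first_harmonic(x):
    """First harmonic divided by the zeroth harmonic (the plain sum of x)."""
    return first_harmonic(x) / sum(x)


x = [1.0, 2.0, 3.0, 4.0]
px, py = radviz_vector(x)
z = normalized_first_harmonic(x)
# The complex conjugate of z gives the same point as the vector form.
print((px, py), (z.real, -z.imag))
```

Writing the mapping as a single complex sum is what makes the link to the DFT immediate: the numerator is the first harmonic and the denominator is the zeroth harmonic of the input vector.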
We assessed the robustness of the visualization approach by measuring its sample classification accuracy in the face of large perturbations in the input data. The leukemia-A data sets were perturbed by switching the training and test inputs. The accuracy for the leukemia-A data set with switching was 88% and it remained the same as that in . For the SRBCT data set, the accuracy after switching was 73%; we attribute the reduced accuracy to the small training sample size (n
in the Burkitt’s lymphomas (BL) group. Using this operational approach, it became apparent that the robustness of the approach depends on several factors, including the intrinsic separation of the classes in the input array data, the sample size of the training set and the classifier algorithm. Generally, data sets with large relative separation between classes (that is, large inter-class effect sizes) are least sensitive to perturbations of various types, e.g. changes in the position and angles of classifier lines, or small decreases in the training set size. Another factor is the training sample size: the statistical uncertainty with small sample sizes is large, and accuracy diminishes once the classifiers overlap the prediction interval of the data. When the training sets become larger, the lines and their splits can be more accurately positioned to separate the classes. Classifiers of arbitrary shape may be more robust than oblique classifiers when the distribution of points in the various classes in the two-dimensional mapping is not defined by a polygonal space partitioning. In a separate report (Zhang et al., 2003), we compared oblique classifiers to SOM (Kohonen, 1995): in the colon cancer data set, the oblique classifier misclassified seven samples () while a 4 × 3 unit SOM misclassified eight samples. Furthermore, the same seven samples misclassified by the oblique classifier were also misclassified by the SOM. The VizStruct approach readily admits more complex clustering and classification strategies, but oblique classifiers appear effective and parsimonious, and their simplicity may appeal to end-users in the life sciences.
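To make the notion of an oblique classifier concrete, the sketch below classifies points in the two-dimensional mapping by the side of a single line on which they fall, then scores the result by accuracy and by Cohen's κ. This is an illustration only: the points, labels and line coefficients are invented for the example, a single split stands in for what may be several lines in practice, and we assume that the κ-coefficient reported here is Cohen's κ.

```python
def side_of_line(point, a, b, c):
    """Return +1 or -1 according to the side of a*x + b*y + c = 0."""
    x, y = point
    return 1 if a * x + b * y + c >= 0 else -1


def accuracy(predicted, actual):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(p == t for p, t in zip(predicted, actual)) / len(actual)


def cohen_kappa(predicted, actual):
    """Agreement corrected for chance; undefined when chance agreement is 1."""
    n = len(actual)
    labels = set(predicted) | set(actual)
    p_observed = accuracy(predicted, actual)
    p_chance = sum(
        (predicted.count(k) / n) * (actual.count(k) / n) for k in labels
    )
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical mapped samples and class labels (not data from this study).
points = [(0.2, 0.5), (0.4, 0.1), (-0.3, 0.2), (-0.1, -0.4), (0.05, 0.3)]
actual = [1, 1, -1, -1, -1]

# Oblique split x + y = 0; the coefficients are illustrative assumptions.
predicted = [side_of_line(p, a=1.0, b=1.0, c=0.0) for p in points]
acc = accuracy(predicted, actual)
kappa = cohen_kappa(predicted, actual)
print(predicted, acc, kappa)
```

Switching the training and test inputs, as in the robustness assessment, would amount to choosing the line coefficients from the former test samples and scoring on the former training samples.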
Parallel co-ordinates and MDS represent competing approaches to FFHP for visualizing multi-dimensional data sets. Parallel co-ordinates visualization has an obvious drawback: it becomes increasingly unreadable as the data size grows. The FFHP mapping was shown to yield results that were similar to those from Sammon’s mapping, a variant of MDS. Despite providing results that are mathematically optimal in some sense, MDS is not ideal for gene-array data because: (1) it provides a single final result and the user cannot intervene interactively during visualization and (2) the incremental addition of any single point requires a complete repetition of the optimization procedure and possibly extensive reorganization of all the previously mapped points to new locations. Secondarily, from a computational complexity standpoint, the parallel co-ordinates method requires relatively little computational effort because each dimension is plotted directly, while MDS requires time-consuming optimization procedures of time complexity O(N²) or greater. The computational complexity of FFHP, O(N log N), is intermediate.
The FFHP is sensitive to dimension order, and VizStruct has the capability to reorder dimensions so that class separation is enhanced. To determine whether the canonical reordering could cause misleading pseudo-classes to appear during visualization, a random data set containing 50 data points, each with 100 dimensions, was simulated; 20 points were arbitrarily assigned to one class and the remainder to another class, and canonical dimension reordering was then applied. The visualization of the reordered data set did not suggest the presence of pseudo-classes.
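The sensitivity to dimension order can be seen in a small example. The sketch below assumes that the mapping is the first DFT harmonic of the input vector; the function name and the four-dimensional profiles are hypothetical. It does not implement the canonical reordering itself, only the effect of presenting the same values in two different orders.

```python
import cmath


def first_harmonic(x):
    """First DFT harmonic: sum of x[j] * exp(-2*pi*i*j/n)."""
    n = len(x)
    return sum(v * cmath.exp(-2j * cmath.pi * j / n) for j, v in enumerate(x))


# The same four values in two different dimension orders.
alternating = [5.0, 1.0, 5.0, 1.0]
grouped = [5.0, 5.0, 1.0, 1.0]

z_alternating = first_harmonic(alternating)
z_grouped = first_harmonic(grouped)

# Opposite anchors cancel in the alternating order, so the point falls at
# the origin; grouping the large values moves it well away from the centre.
print(abs(z_alternating), abs(z_grouped))
```

Because reordering alone can move a point this far, a check on random data such as the one described above is a reasonable safeguard against reading structure into an arrangement of dimensions.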
The performance of VizStruct suggests that visualization may be capable of a larger and richer role than is currently appreciated. However, with multi-dimensional gene expression data, one must also remain mindful of the curse of dimensionality (Bellman, 1961) and rigorously confirm any experimental findings in the two-dimensional mapping with appropriate quantitative techniques.