Clustering microarray gene expression data has been useful for grouping genes that are co-expressed. Many clustering algorithms can handle either numeric or categorical data, but not both. Clustering microarray gene expression data along with biological information about the samples has proven advantageous for grouping genes and samples that share biological relevance. For instance, a novel clustering approach called heritable clustering incorporates epigenetic data (genes monitored for hypermethylation according to a binary [0,1] status) and phenotypic data (clinical measurements encoded as ordinal categorical variables) to group tumor samples well enough for the discovery of informative pathways that adhere to strict heritability in breast cancer. 15
Other clustering methods have also incorporated either clinical data or histopathological observations about the samples into the grouping process, using linear models whose regression coefficients represent the strength of the association, or correlation with principal components of the microarray data. 16,17
In addition, a recently published clustering method partially integrates clinical measurements with microarray data through separate Bayesian networks that are joined by a single phenotype variable. 18
However, the extension of these types of clustering algorithms to the full integration and optimized analysis of high-dimensional gene expression data together with clinical data as continuous measurements and phenotypic data as categorical values has not been investigated. The simulated annealing (SA)-Modk-prototypes algorithm presented in this paper extends earlier work (Bushel et al. 2) to permit grouping of biological samples based on microarray gene expression data and classes of known phenotypic variables in a more formal and optimized fashion. The method uses simulated annealing to optimize an objective function composed of the sum of the squared Euclidean distances for the numeric microarray and clinical chemistry data and simple matching for the categorical histopathology values in order to measure the dissimilarity of samples. Separate weighting terms are used for the microarray, clinical chemistry and histopathology measurements to control the influence of each data domain on the clustering of the samples.
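The dissimilarity measure described above can be illustrated with a minimal Python sketch. It assumes each sample and each cluster prototype is a dictionary holding two numeric vectors and one list of category labels; the key names (`expr`, `chem`, `histo`), the function names and the weight values are hypothetical and are not taken from the published implementation.

```python
import numpy as np


def mixed_dissimilarity(sample, prototype, weights):
    """Weighted dissimilarity between one sample and one cluster prototype.

    `sample` and `prototype` are dicts with the illustrative keys
    "expr" and "chem" (numeric vectors) and "histo" (category labels).
    `weights` maps the same keys to non-negative floats.
    """
    # Squared Euclidean distance for the two numeric data domains.
    d_expr = np.sum((np.asarray(sample["expr"], dtype=float)
                     - np.asarray(prototype["expr"], dtype=float)) ** 2)
    d_chem = np.sum((np.asarray(sample["chem"], dtype=float)
                     - np.asarray(prototype["chem"], dtype=float)) ** 2)
    # Simple matching for the categorical domain: count the mismatches.
    d_histo = sum(a != b for a, b in zip(sample["histo"], prototype["histo"]))
    return float(weights["expr"] * d_expr
                 + weights["chem"] * d_chem
                 + weights["histo"] * d_histo)


def objective(samples, prototypes, labels, weights):
    """Sum of each sample's dissimilarity to the prototype of its cluster."""
    return sum(mixed_dissimilarity(s, prototypes[k], weights)
               for s, k in zip(samples, labels))
```

Raising one weight relative to the others makes the corresponding data domain count for more when samples are assigned to clusters, which is the role the weighting terms play in the text.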
A cluster’s prototype, formed from the mean of the values for numeric features and the mode of the categorical values of all the samples in the group, is representative of the phenotype of the cluster members. The advantage of SA-Modk-prototypes clustering is that the phenotypic prototypes are derived from optimally formed clusters of the biological samples ( and ).
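As a rough illustration of how such a prototype could be formed, the sketch below takes the column-wise mean of the numeric features and the column-wise mode of the categorical features for the samples in one cluster. The function name and the row-per-sample input layout are assumptions made for this example, not details from the paper.

```python
from collections import Counter

import numpy as np


def cluster_prototype(numeric_rows, categorical_rows):
    """Return (mean vector, list of modes) for the samples of one cluster.

    `numeric_rows` holds one row of numeric values per sample and
    `categorical_rows` holds one row of category labels per sample.
    """
    # Mean of every numeric feature across the cluster members.
    numeric_part = np.mean(np.asarray(numeric_rows, dtype=float), axis=0)
    # Mode of every categorical feature; on a tie, Counter keeps the
    # label that was encountered first.
    categorical_part = [Counter(column).most_common(1)[0][0]
                        for column in zip(*categorical_rows)]
    return numeric_part, categorical_part
```

For example, three samples with necrosis calls of mild, mild and none would yield a categorical prototype of mild for that feature.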
From the clustering of samples based on gene expression, clinical and pathology data derived from rats exposed to acetaminophen, phenotypic prototypes were obtained from three clusters of the biological samples, which showed signs of no, mild and moderate centrilobular necrosis of the rat liver. The clinical chemistry portion of the phenotypic prototype clearly indicates that the ALT and AST enzyme levels are elevated (indicative of liver injury) in the cluster of samples with the moderate centrilobular necrosis phenotype (). Furthermore, the genes in the phenotypic prototypes whose expression ratio values contribute the most to discerning the three clusters of samples, partitioned by the severity of centrilobular necrosis of the rat liver (, and ), include genes related to proliferation, hyperbilirubinemia, injury and hemorrhaging of the liver. Pathway analysis revealed that MAP kinase (MAPK) signalling and linoleic acid metabolism were significant biological processes containing genes that are influenced by acetaminophen exposures manifesting centrilobular necrosis (Table 5). Linoleic acid is a polyunsaturated fatty acid that the liver converts to arachidonic acid, a primary target for lipid peroxidation. The role of lipid peroxidation in acetaminophen-induced toxicity has been controversial for some time. 19–21
Differential expression of network focus genes according to the centrilobular necrosis phenotype.
Only 24 genes were found to be in common between the lists of genes identified by Modk-prototypes clustering and by SA-Modk-prototypes clustering as discerning between levels of necrosis of the liver in rats from acetaminophen treatment. The difference in the number and the identity of the genes that statistically distinguish between the levels of centrilobular necrosis is likely due to the cluster assignment of the samples when grouped using the two approaches. The Modk-prototypes clustering algorithm searches for clusters formed closest to the global minimum of the objective function but is not guaranteed to find the optimal clustering solution. On the other hand, the SA-Modk-prototypes clustering algorithm uses simulated annealing of the objective function to escape local optima in search of the global optimum. This is advantageous for effectively linking the phenotype of samples to groups of genes. This process of phenotypic anchoring has been described previously and approached with ad hoc methods that link the cause of a disease or response with the observed effect. 22–25
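The way simulated annealing escapes local optima can be sketched generically as follows. This is an illustration under stated assumptions rather than the published procedure: the move (reassigning one randomly chosen sample), the geometric cooling schedule, the parameter defaults and the name `anneal_labels` are all hypothetical, and `cost` stands in for any objective function that scores a list of cluster labels.

```python
import math
import random


def anneal_labels(n_samples, n_clusters, cost, t_start=1.0, t_end=1e-3,
                  cooling=0.95, moves_per_temp=50, seed=0):
    """Search for low-cost cluster labels with simulated annealing.

    `cost` is any callable that maps a list of cluster labels to a float.
    Returns the best labelling visited and its cost.
    """
    rng = random.Random(seed)
    labels = [rng.randrange(n_clusters) for _ in range(n_samples)]
    current = cost(labels)
    best_labels, best = labels[:], current
    t = t_start
    while t > t_end:
        for _ in range(moves_per_temp):
            # Propose moving one randomly chosen sample to a random cluster.
            i = rng.randrange(n_samples)
            old = labels[i]
            labels[i] = rng.randrange(n_clusters)
            candidate = cost(labels)
            delta = candidate - current
            # Metropolis rule: always accept an improvement; accept a worse
            # labelling with probability exp(-delta / t), which is what
            # lets the search climb out of a local optimum.
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current = candidate
                if current < best:
                    best_labels, best = labels[:], current
            else:
                labels[i] = old  # reject the move and restore the label
        t *= cooling  # geometric cooling (an assumed schedule)
    return best_labels, best
```

At high temperature most moves are accepted, so the search explores widely; as the temperature falls the acceptance rule becomes nearly greedy and the search settles into a low-cost labelling.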
Dugas et al. 26 approached phenotypic anchoring of gene expression data to characteristics of samples more formally by using multidimensional clustering (mdclust). However, the method can only match a single phenotype variable to a set of clustered samples. Li and Hong 27 proposed the use of the Rasch model (an item-response theory approach) for relating gene expression profiles to phenotypes described by latent factors. That method, however, can suffer from a loss of information due to the discretization of the microarray data. The SA-Modk-prototypes method described in this paper goes further in that it allows a set of end-point measurements, either categorical or continuous, to be coupled to groups of samples that share gene expression patterns and phenotypic characteristics without transforming the data to discrete values.