Diabetes affects hundreds of millions of people worldwide, contributing to cardiovascular disease, blindness, amputation, kidney failure and many other conditions. Obesity and impaired insulin sensitivity are among the major factors responsible for the development of type 2 diabetes (DM2). Skeletal muscle and white adipose tissue are believed to play a major role in insulin resistance [1]. However, long-term studies indicate that even major factors such as insulin resistance are not sufficient to fully predict the onset of disease [3]. Recently, a series of papers connected insulin sensitivity and type 2 diabetes to the expression of a group of oxidative phosphorylation genes that are co-regulated by the peroxisome proliferator-activated receptor γ coactivator 1α (PGC-1α). These experiments, as well as others [6], suggest that mitochondrial dysfunction plays a role in the genesis of DM2 and have fuelled discussions about energy metabolism as a primary factor in insulin resistance.
Microarray expression profiling allows researchers to monitor the expression levels of thousands of genes in a single analysis. Classification of samples by such molecular signatures allows for improved stratification of patients, rational application of treatment, and better risk assessment. Importantly, these techniques often uncover previously unanticipated pathways and identify new targets for therapy. These experimental and computational approaches were first developed and applied in numerous cancer-related research projects [8]. Applying the same techniques to classify skeletal muscle samples from DM2 and non-diabetic patients has proved far more difficult. The differences in gene expression between DM2 and non-diabetic patients are modest, and analytic noise masks the underlying informative changes. Two different approaches have been suggested to counter this challenge. Mootha et al. developed Gene Set Enrichment Analysis (GSEA). In the absence of significantly over- or under-expressed individual genes, they identified groups of genes that discriminate between DM2 and normal samples, with the groups defined by function, gene ontology (GO) annotation, chromosomal location and other factors. Joining genes from common functional groups is effectively the same as using multiple replicates, and it dramatically increases the power of the experiment. Similarly, Patti et al. found no single gene differentially expressed between diabetic and non-diabetic muscle samples after correction for multiple comparisons. They also used extensive functional annotation to identify genes differentially expressed between DM2 and non-diabetic patients. Even though the differential expression of these individual genes did not reach statistical significance, classification by occurrence of GO terms [9] revealed disparate expression of genes involved in energy metabolism between DM2 and normal samples. Taken together, both papers implicate genes involved in energy metabolism as the major contributors to the DM2 status of the patients. These findings are logical from a biological standpoint and build upon prior data [10].
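The "multiple replicates" argument above can be illustrated with a short calculation. The sketch below is not GSEA itself, which uses a rank-based running-sum statistic; it swaps in Stouffer's method for combining z-scores, chosen only because it makes the gain in power easy to see. The function name `stouffer_z` and the example numbers are hypothetical, and the method assumes that per-gene scores are independent, which co-regulated genes generally are not.

```python
import math


def stouffer_z(z_scores):
    """Combine per-gene z-scores into one gene-set z-score (Stouffer's method).

    Assumes the per-gene scores are independent. Genes in one pathway are
    usually correlated, so treat the result as an optimistic estimate.
    """
    k = len(z_scores)
    if k == 0:
        raise ValueError("gene set is empty")
    # The sum of k independent unit-variance scores has variance k,
    # so dividing by sqrt(k) restores unit variance under the null.
    return sum(z_scores) / math.sqrt(k)


# Hypothetical gene set: 25 genes, each shifted by a modest z = 0.8,
# none of which would be called significant on its own.
gene_z = [0.8] * 25
set_z = stouffer_z(gene_z)
print(f"largest single-gene z: {max(gene_z):.1f}, gene-set z: {set_z:.1f}")
```

Under the independence assumption, a consistent shift of size z across k genes yields a set-level score of z·√k, which is why many individually unremarkable genes can jointly produce a clear signal.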
The analytical strategies employed by Mootha et al. and Patti et al. presume that there are two distinct categories (DM2 and not DM2) and that these clinical categories manifest themselves through gene expression patterns in skeletal muscle. Instinctively, we place diabetic and non-diabetic patients in two different categories. However, the onset of diabetes depends on other factors such as lipotoxicity [12], a failure of leptin signaling [15], and abnormalities in hypothalamic function [16], to name a few. These and many other factors can mitigate the effect of gene expression in skeletal muscle with regard to the onset of diabetes. Given the complexity of the disease, the very existence of distinct categories such as diabetics and non-diabetics cannot be taken for granted when analyzing gene expression data.
The approach described herein does not assign patients' transcriptomes (interrogated by the microarray experiment) to a diagnostic category (DM2 vs. impaired glucose tolerance (IGT) vs. normal glucose tolerance (NGT) or normal). Instead, we rely on "natural classification" to identify groups of samples within the data that are similar to each other by their transcriptome. The concept of natural classification is well established in computer analysis of biological data [17]. In the first step of our analysis, the goal is to identify natural categories (clusters) in the expression data. To accomplish this, we applied a high-dimensional unsupervised cluster analysis algorithm derived from FOREL (see Methods). This algorithm performs a "class discovery" type of clustering [19] without pre-selecting a small set of genes to reduce dimensionality. The output consists of a set of finished clusters, which can be further analyzed and hyper-clustered in order to establish the relationships between natural classes. The second step of the analysis is to relate the observations in the data set to the clinical characteristics and to identify the underlying discriminant genes implicated in the formation of specific clusters. Our strategy is based entirely on the observation of similarities within the data and avoids speculative assumptions about gene function. The only assumption made by a "natural classification" strategy is that the most common gene expression patterns associated with the development of diabetes will be found many times, provided that a sufficiently large number of samples is included and that the microarray technique accurately reflects the underlying molecular mechanisms.
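For readers unfamiliar with FOREL, the following is a minimal sketch of the basic textbook procedure, not the high-dimensional variant used in this study (which is described in Methods). It assumes Euclidean distance, a fixed sphere radius, and the first unassigned sample as the starting point of each cluster; the function name `forel` and its parameters are illustrative choices.

```python
import math


def forel(points, radius, max_iter=100):
    """Cluster points with a basic FOREL procedure.

    points: sequence of equal-length numeric tuples (one per sample).
    radius: sphere radius (assumed fixed here).
    Returns a list of clusters, each a list of indices into `points`.
    """
    remaining = list(range(len(points)))
    clusters = []
    while remaining:
        # Seed the sphere at the first unassigned sample (arbitrary choice).
        center = list(points[remaining[0]])
        members = []
        for _ in range(max_iter):
            inside = [i for i in remaining
                      if math.dist(points[i], center) <= radius]
            # Stop when the sphere's contents no longer change.
            if not inside or inside == members:
                break
            members = inside
            # Move the sphere's center to the centroid of its members.
            center = [sum(points[i][d] for i in members) / len(members)
                      for d in range(len(center))]
        clusters.append(members)
        assigned = set(members)
        remaining = [i for i in remaining if i not in assigned]
    return clusters


# Toy example: two well-separated groups of three 2-D "samples".
samples = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(forel(samples, radius=2.0))
```

No class labels and no number of clusters are supplied; the radius alone determines how finely the data are partitioned, which is what makes the procedure a "class discovery" method. In real expression data each point would have one coordinate per gene rather than two.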