Consider a dataset consisting of $g$ genes and $n$ samples, where the samples are drawn from $m$ classes or experimental groups, and which is partitioned into a set of $p$ non-overlapping clusters of genes using complete-linkage clustering (or any other clustering method). We assume that group $k$ has $n_k$ replicate samples for $k = 1, \ldots, m$, so that the total number of samples in the dataset is $n = \sum_{k=1}^{m} n_k$. The number of genes in a single cluster $c$ is denoted by $g_c$, and we assume that every gene appears in one of the $p$ clusters, so that $g = \sum_{c=1}^{p} g_c$.
Let $Y_{ijk,c}$ represent the expression value of the $i$-th gene in replicate sample $j$ of group $k$, where $i = 1, \ldots, g_c$ indexes genes; $j = 1, \ldots, n_k$ indexes replicates; $k = 1, \ldots, m$ indexes groups; and $c = 1, \ldots, p$ indexes clusters.
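The mean expression profile for cluster $c$ is given by the $n$-dimensional vector
$$\bar{\mathbf{Y}}_c = \left(\bar{Y}_{\cdot 11,c}, \ldots, \bar{Y}_{\cdot n_1 1,c}, \ldots, \bar{Y}_{\cdot 1m,c}, \ldots, \bar{Y}_{\cdot n_m m,c}\right), \tag{1}$$
where $\bar{Y}_{\cdot jk,c}$ represents the mean expression value for replicate sample $j$ from group $k$, averaged over the $g_c$ genes in cluster $c$. Here the dot notation indicates the index over which the summation takes place. For example,
$$\bar{Y}_{\cdot jk,c} = \frac{1}{g_c} \sum_{i=1}^{g_c} Y_{ijk,c} \tag{2}$$
represents the value averaged over all genes (as indexed by $i$) from $i = 1$ up to the number of genes $g_c$ for a particular cluster $c$.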
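The gene-wise averaging described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name `cluster_mean_profile`, the row-per-gene list layout and the toy values are all assumptions made for the example.

```python
def cluster_mean_profile(expr, gene_ids):
    """Average the expression of the genes in one cluster, sample by sample.

    expr     : list of rows, one row per gene, one column per sample
               (hypothetical layout chosen for this sketch)
    gene_ids : row indices of the genes assigned to the cluster
    Returns a list with one mean value per sample (length n).
    """
    if not gene_ids:
        raise ValueError("a cluster must contain at least one gene")
    g_c = len(gene_ids)
    n_samples = len(expr[gene_ids[0]])
    # For each sample (column), sum over the cluster's genes and divide by g_c.
    return [sum(expr[i][s] for i in gene_ids) / g_c for s in range(n_samples)]


# Toy data: 4 genes x 4 samples; columns are two replicates of group 1
# followed by two replicates of group 2.
expr = [
    [1.0, 3.0, 5.0, 7.0],
    [3.0, 5.0, 7.0, 9.0],
    [2.0, 2.0, 2.0, 2.0],
    [4.0, 4.0, 4.0, 4.0],
]
profile = cluster_mean_profile(expr, [0, 1])
print(profile)  # [2.0, 4.0, 6.0, 8.0]
```

Each entry of the returned list corresponds to one replicate sample, so the list has the same length $n$ as the mean expression profile of the cluster.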
For each of the $p$ clusters, we then fit an analysis of variance (ANOVA) model to the mean expression profile of that cluster, which estimates the degree of dependency between the mean expression profile $\bar{\mathbf{Y}}_c$ and a covariate that denotes group membership; in other words, we are able to quantify how much of the variability in $\bar{\mathbf{Y}}_c$ can be explained by group membership alone. We call a cluster ‘informative’ if its mean expression distinguishes the different biological classes or groups, as defined by a statistically significant model fit.
Formally, we fit a one-way fixed-effects ANOVA model to the mean expression profile $\bar{\mathbf{Y}}_c$ [as defined in (2)] of each cluster $c$ using a single factor that denotes each sample's group effect through the model parameter $\mu_{k,c}$ for $k = 1, \ldots, m$ groups. The linear model underlying the ANOVA is
$$\bar{Y}_{\cdot jk,c} = \mu_c + \mu_{k,c} + \varepsilon_{jk,c}, \tag{3}$$
where $\mu_c$ represents the overall mean, $\mu_{k,c}$ measures the effect of group $k$ and $\varepsilon_{jk,c}$ represents the random normal residual error term.
The null hypothesis, $H_0(c): \mu_{1,c} = \mu_{2,c} = \cdots = \mu_{m,c}$, states that all group means are equal, while the alternative hypothesis, $H_1(c)$, states that not all $\mu_{k,c}$ are equal or, equivalently, that at least two groups have different mean expression values.
The mean expression for group $k$ in cluster $c$ is given by
$$\bar{Y}_{\cdot\cdot k,c} = \frac{1}{n_k} \sum_{j=1}^{n_k} \bar{Y}_{\cdot jk,c} = \frac{1}{n_k\, g_c} \sum_{j=1}^{n_k} \sum_{i=1}^{g_c} Y_{ijk,c}, \tag{4}$$
which is simply the expression averaged over the genes in cluster $c$, then averaged over the $n_k$ replicates. This is reflected by the double dot notation, which indicates the two indices over which the summations occur: one over the gene index $i$ [from 1 to $g_c$, as shown in (2)] and the second over the replicate index $j$ (from 1 to $n_k$ for the $k$-th group).
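The overall mean value represents the average of all $m$ group means in cluster $c$,
$$\bar{Y}_{\cdot\cdot\cdot,c} = \frac{1}{m} \sum_{k=1}^{m} \bar{Y}_{\cdot\cdot k,c} = \frac{1}{m} \sum_{k=1}^{m} \frac{1}{n_k\, g_c} \sum_{j=1}^{n_k} \sum_{i=1}^{g_c} Y_{ijk,c}, \tag{5}$$
where the triple dots indicate summations over the gene index $i$ (from 1 to $g_c$ genes for cluster $c$), the replicate index $j$ (from 1 to $n_k$ replicates for the $k$-th group) and the group index $k$ (from 1 to $m$ groups).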
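From the fitted model (3) for cluster $c$, we obtain the $\mathrm{MSS}_c$ statistic (also known as the mean treatments sum of squares), which captures the amount of variation attributed to group-specific effects:
$$\mathrm{MSS}_c = \frac{1}{m-1} \sum_{k=1}^{m} n_k \left(\bar{Y}_{\cdot\cdot k,c} - \bar{Y}_{\cdot\cdot\cdot,c}\right)^2, \tag{6}$$
and the $\mathrm{RSS}_c$ statistic (the residual sum of squares, also known as the mean error sum of squares), which represents the residual variation remaining after group-specific effects have been accounted for:
$$\mathrm{RSS}_c = \frac{1}{n-m} \sum_{k=1}^{m} \sum_{j=1}^{n_k} \left(\bar{Y}_{\cdot jk,c} - \bar{Y}_{\cdot\cdot k,c}\right)^2. \tag{7}$$
Here $\bar{Y}_{\cdot jk,c}$, $\bar{Y}_{\cdot\cdot k,c}$ and $\bar{Y}_{\cdot\cdot\cdot,c}$ denote the sample-level, group-level and overall means for cluster $c$, respectively.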
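As an illustration of the two cluster-level statistics, the following sketch computes them for a single mean expression profile. It assumes the usual one-way ANOVA mean squares, with $m - 1$ and $n - m$ degrees of freedom, and takes the overall mean to be the average of the group means as described in the text (for equal group sizes this coincides with the mean over all samples). The function name `mss_rss` and the toy values are hypothetical.

```python
def mss_rss(profile, groups):
    """Return (MSS_c, RSS_c) for one cluster's mean expression profile.

    profile : one mean expression value per sample
    groups  : group label of each sample, in the same order as profile
    Assumes standard one-way ANOVA mean squares (an assumption of this sketch).
    """
    labels = sorted(set(groups))
    m, n = len(labels), len(profile)
    if m < 2 or n <= m:
        raise ValueError("need at least two groups and more samples than groups")
    # Split the profile by group and compute each group mean.
    by_group = {k: [y for y, g in zip(profile, groups) if g == k] for k in labels}
    group_mean = {k: sum(v) / len(v) for k, v in by_group.items()}
    # Overall mean taken as the average of the m group means.
    overall = sum(group_mean.values()) / m
    # Between-group variation, weighted by the number of replicates per group.
    mss = sum(len(by_group[k]) * (group_mean[k] - overall) ** 2 for k in labels) / (m - 1)
    # Within-group variation around each group's own mean.
    rss = sum((y - group_mean[k]) ** 2 for k in labels for y in by_group[k]) / (n - m)
    return mss, rss


profile = [2.0, 4.0, 6.0, 8.0]
groups = ["A", "A", "B", "B"]
mss, rss = mss_rss(profile, groups)
print(mss, rss)  # 16.0 2.0
```

In this toy profile the group means (3 and 7) are well separated, so the between-group statistic is several times larger than the residual one; a profile such as `[1.0, 3.0, 1.0, 3.0]` with the same labels gives a between-group statistic of zero.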
An informative cluster $c$ will yield a large $\mathrm{MSS}_c$ statistic relative to the $\mathrm{RSS}_c$ statistic. When genuine group structure exists in the data, the group means adopt distinctly different values. In this situation, the sum of squares in the $\mathrm{MSS}_c$ statistic is large, whereas the sum of squares in the $\mathrm{RSS}_c$ statistic shrinks toward zero.
In calculating the $\mathrm{RSS}_c$ statistic, we always compare elements of the mean expression profile with their respective group means, so the statistic is not influenced by the presence of group structure in the data. The $\mathrm{MSS}_c$ statistic, on the other hand, compares the group means directly with the overall mean (which ignores any group structure); therefore, the more distinct the group means become, the larger the sum of squared deviations from the overall mean and the larger the $\mathrm{MSS}_c$ statistic.
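We define two final measures that collectively represent how informative the overall cluster analysis is. These measures are obtained by averaging the cluster-specific $\mathrm{MSS}_c$ and $\mathrm{RSS}_c$ statistics over the $p$ clusters found in the dataset:
$$\mathrm{MSS}(p) = \frac{1}{p} \sum_{c=1}^{p} \mathrm{MSS}_c \tag{8}$$
and
$$\mathrm{RSS}(p) = \frac{1}{p} \sum_{c=1}^{p} \mathrm{RSS}_c. \tag{9}$$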
Given that a single informative cluster $c$ will be associated with a large $\mathrm{MSS}_c$ and a small $\mathrm{RSS}_c$ value, by extension the overall $\mathrm{MSS}(p)$ and $\mathrm{RSS}(p)$ values will be large and small, respectively, for an informative set of $p$ clusters. The $\mathrm{MSS}(p)$ statistic most directly captures the size of the group-specific effect across the $p$ clusters, and we therefore define the informativeness metric to be the $\mathrm{MSS}(p)$ statistic defined in (8).
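The averaging step that produces the two collective measures is simple enough to show directly. The sketch below assumes the cluster-level statistics have already been computed and are supplied as a list of pairs; the function name `collective_measures` and the numbers are illustrative only.

```python
def collective_measures(per_cluster):
    """Average cluster-level (MSS_c, RSS_c) pairs into (MSS(p), RSS(p)).

    per_cluster : list of (mss_c, rss_c) tuples, one per cluster
                  (hypothetical input format for this sketch)
    """
    p = len(per_cluster)
    if p == 0:
        raise ValueError("at least one cluster is required")
    mss_p = sum(mss_c for mss_c, _ in per_cluster) / p
    rss_p = sum(rss_c for _, rss_c in per_cluster) / p
    return mss_p, rss_p


# Three hypothetical clusters: two with clear group effects, one without.
stats = [(16.0, 2.0), (4.0, 1.0), (1.0, 3.0)]
print(collective_measures(stats))  # (7.0, 2.0)
```

The first element of the returned pair plays the role of the informativeness metric, and the second is the accompanying residual measure.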
In standard ANOVA analysis, it is more common to focus on the ratio of the MSS and RSS statistics or, equivalently, the $F$ statistic,
$$F_c = \frac{\mathrm{MSS}_c}{\mathrm{RSS}_c},$$
to assess the significance of a fitted model.
For the $p$ clusters generated by the cluster analysis, we can extend the cluster-based statistic $F_c$ and similarly define $F(p)$ in the following way:
$$F_1(p) = \frac{1}{p} \sum_{c=1}^{p} F_c.$$
Our results presented for the modified $F$ statistic are calculated from the $F_1(p)$ definition. Note that there is an alternative way to define $F(p)$:
$$F_2(p) = \frac{\mathrm{MSS}(p)}{\mathrm{RSS}(p)}.$$
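To see that two collective $F$ definitions need not agree, the sketch below compares an average of per-cluster ratios with a ratio of the averaged statistics. Which of these corresponds to $F_1(p)$ and which to the alternative is an assumption made here for illustration, as are the function names and the toy values.

```python
def f_average_of_ratios(per_cluster):
    """Mean of the per-cluster F_c = MSS_c / RSS_c values (assumed F1-style form)."""
    if not per_cluster:
        raise ValueError("at least one cluster is required")
    if any(rss_c == 0 for _, rss_c in per_cluster):
        raise ValueError("RSS_c must be non-zero to form the ratio")
    return sum(mss_c / rss_c for mss_c, rss_c in per_cluster) / len(per_cluster)


def f_ratio_of_averages(per_cluster):
    """MSS(p) / RSS(p), using the averaged statistics (assumed alternative form)."""
    if not per_cluster:
        raise ValueError("at least one cluster is required")
    p = len(per_cluster)
    mss_p = sum(mss_c for mss_c, _ in per_cluster) / p
    rss_p = sum(rss_c for _, rss_c in per_cluster) / p
    if rss_p == 0:
        raise ValueError("RSS(p) must be non-zero to form the ratio")
    return mss_p / rss_p


# Two hypothetical clusters with per-cluster F values of 8 and 2.
stats = [(16.0, 2.0), (8.0, 4.0)]
print(f_average_of_ratios(stats))  # 5.0
print(f_ratio_of_averages(stats))  # 4.0
```

The two forms coincide when every cluster has the same residual statistic, and otherwise the average of ratios gives more weight to clusters with small residuals.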
In theory, the $F$-based statistic appears potentially useful as a measure of informativeness, since a large $\mathrm{MSS}_c$ and a small $\mathrm{RSS}_c$ will give rise to a large $F$ value. However, in tests using simulated datasets, the $F$-based statistic as defined in (10) did not consistently estimate the correct number of clusters (see Supplementary Material). For the experimental dataset, the $F$-based statistic estimated a set of clusters that was sub-optimal for describing the diversity of expression profiles for the biological classes in this dataset. This can be demonstrated by comparing the profiles in panel A with those in panel B, where the emergence of Cluster 1 in panel B reveals a cluster that would otherwise have been masked when fewer clusters are specified, as shown in panel A (and Supplementary Material). Therefore, our tests of the $F$-based statistics on both simulated and experimental datasets indicate that these statistics do not perform reliably as measures of cluster information content.
By clustering the expression dataset, the goal is to reveal underlying substructures that reflect the gene sets driving the observed group-specific differences. Changing the number of clusters effectively alters the resolution at which those substructures can be observed, and the optimal number of clusters gives rise to a set of clusters that highlights these group-specific differences at maximum resolution. Therefore, as the number of clusters ($p$) approaches this optimal value, the clusters become more informative, as reflected by an increase in the $\mathrm{MSS}(p)$ statistic and a much smaller $\mathrm{RSS}(p)$ statistic. The $F(p)$-based statistic failed to provide reliable discriminatory power, and a careful analysis of all three collective measures on both simulated and real data indicates that $\mathrm{MSS}(p)$ has the greatest discriminatory power. Consequently, we chose $\mathrm{MSS}(p)$ as the measure of cluster information content (the informativeness metric), coupled with the simple expectation that an informative set of clusters will be associated with a much smaller $\mathrm{RSS}(p)$ statistic.
To determine the optimal number of clusters using the informativeness metric, we vary the number of clusters in a cluster analysis over a finite range and calculate the informativeness metric for each value in this range. The value that maximizes the informativeness metric is taken to be the optimal number of clusters, with the accompanying condition that the $\mathrm{RSS}(p)$ statistic computed for the $p$ clusters should be much smaller than the informativeness metric. Instances where the informativeness metric and the $\mathrm{RSS}(p)$ statistic produce similar values over the interval in which the number of clusters is varied suggest an absence of group structure (see Supplementary Material). The range over which the number of clusters ($p$) is tested can be chosen arbitrarily. In practice, we set the lower limit of this range to one and the upper limit to the maximum value of $p$ that gives rise to clusters that each contain a minimum number of genes (for example, a minimum of five genes).
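The selection rule can be sketched as a small search over precomputed results. This is a minimal illustration under stated assumptions: the clustering itself is not shown, the input dictionary format and the function name `choose_num_clusters` are hypothetical, and the `max_rss_fraction` threshold is an invented stand-in for the qualitative requirement that the residual measure be "much smaller" than the informativeness metric.

```python
def choose_num_clusters(results, min_genes=5, max_rss_fraction=0.1):
    """Pick the number of clusters that maximizes the informativeness metric.

    results          : dict mapping p -> (mss_p, rss_p, smallest_cluster_size)
                       (hypothetical format; values come from a prior analysis)
    min_genes        : minimum number of genes every cluster must contain
    max_rss_fraction : assumed cut-off for "RSS(p) much smaller than MSS(p)";
                       not a value taken from the text
    Returns (best_p, has_group_structure).
    """
    # Keep only values of p whose smallest cluster meets the size requirement.
    eligible = {p: v for p, v in results.items() if v[2] >= min_genes}
    if not eligible:
        raise ValueError("no candidate number of clusters satisfies min_genes")
    # The informativeness metric is the first element of each tuple.
    best_p = max(eligible, key=lambda p: eligible[p][0])
    mss_p, rss_p, _ = eligible[best_p]
    # Flag whether the residual measure is small relative to the metric.
    has_group_structure = rss_p <= max_rss_fraction * mss_p
    return best_p, has_group_structure


# Hypothetical sweep over p = 1..5; p = 5 yields a cluster with only 2 genes.
results = {
    1: (0.5, 2.0, 40),
    2: (6.0, 1.0, 15),
    3: (9.0, 0.4, 8),
    4: (7.5, 0.3, 6),
    5: (12.0, 0.2, 2),
}
print(choose_num_clusters(results))  # (3, True)
```

Here $p = 5$ has the largest metric but is excluded because one of its clusters falls below the minimum size, so the search returns $p = 3$. When the metric and the residual measure are of similar size at the selected value, the second element of the result is `False`, which corresponds to the absence of group structure.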