Many of the popular methods for analyzing DNA microarray expression data, from
clustering [

1] to more sophisticated
machine-learning approaches [

2-

5], require expression data over a
large number of different conditions as input. However, it is common to only have
expression data for a few different strains and/or conditions. In this case, what is of
interest are the changes in mRNA abundance for each gene, usually represented as the
logarithm of the fold-change between test and control. The traditional way of analyzing
such data is to first identify significantly up- and down-regulated genes, and
subsequently to characterize these sets in terms of enrichment for functional annotation
[

6] or upstream promoter elements
[

7-

9]. However, by requiring statistically significant differential
expression at the level of individual genes, a lot of information about differential
expression will be lost that could have been detected using analysis methods working at
the level of pathways.

To understand this, assume that we are comparing two conditions and that the measurement
error for the fold-change of individual genes is 20%. Now consider a specific pathway
consisting of 100 genes that are all upregulated by 10%. This level of differential
expression is well within the noise for individual genes, none of which will therefore
be classified as significantly induced. However, the error in the

*average
*expression of 100

*randomly *chosen genes will be on the order of
20%/

= 2%. The 10% change in expression at the level of the whole pathway therefore
corresponds to five units of standard error and is highly statistically significant.

In recent years, two distinct classes of methods have been developed that use prior
information about how genes can be viewed as belonging to different regulatory or
functional pathways (Figure ). This information can be used to
score differential expression at the pathway level rather than at the gene level. The
first class of methods represents pathways as gene sets, to which individual genes
either belong or do not belong. One well-known source of such gene sets is the Gene
Ontology (GO) project [

6], where the
classification is based on the function of the proteins encoded by each gene. The second
class of methods takes a more sophisticated approach by assigning a regulatory
susceptibility to each gene, quantifying how strongly this gene is expected to respond
to a change in the activity of a specific regulatory pathway. For example, the affinity
of a gene's promoter sequence for a specific transcription factor (TF) could be
predicted using consensus motifs or weight matrices [

10] and be used to predict the response of that gene to changes in
TF activity.

In this review, we describe how such pathway-level analyses can be implemented
mathematically. It is helpful to understand that, in general, information about genes
comes in two different types:

*categorical *information of boolean type ("true"
or "false"), which tells us whether or not a gene belongs to a specific gene set; and

*quantitative *information, e.g., the mRNA expression log-ratio between two
conditions for a gene or the ChIP-chip [

11] fold
enrichment for the gene's promoter region. Given any two distinct features
characterizing each gene, their genomewide statistical association can be scored using
an appropriate statistical test (Table ).

| **Table 1**When to use which statistical test. |

The traditional approach: scoring over-representation of predefined gene sets

Suppose that we want to know whether a specific set of genes of interest is
statistically enriched for genes with a specific annotation in Gene Ontology. In this
case, both features (namely, "does the gene belong to the set of genes of interest"
and "is the gene associated with GO term X") are categorical, and the appropriate
statistic is the *overlap *between both gene sets. Let the total number of
genes in set *A *be *a*, the total number of genes in set *B *be
*b*, and the total number of genes in the genome be *n*.
Furthermore, let the overlap *x *denote the number of genes shared between
*A *and *B*. If the two sets are chosen randomly and independently,
the average overlap will be:

This makes sense: if a fraction *b/n *of all genes belongs to set *B
*then the expected fraction of genes in set *A *that also belongs to set
*B *equals *x/a*. In the case of over-representation, when *x
*> <*x*>, the P-value that quantifies how likely it is to get at least
the same number of overlapping genes by chance, is given by

where *H *is the hypergeometric distribution given by

and

It is also possible to have significant under-representation (*x *<
<*x*>). In that case, the P-value is given by

This use of the cumulative hypergeometric distribution is also known as "Fisher's
exact test." The test is by nature non-parametric because both input features are
non-parametric. Under specific conditions the hypergeometric distribution may be
approximated by the binomial or chi-square distribution. Several implementations of
this approach are reviewed by Khatri and Draghici [

12]. Since typically a large number of gene sets are scored in
parallel, the p-values must be corrected for multiple testing. Grossman et al.
[

13] recently addressed technical
complications arising from the strong overlap between the hierarchically organized
Gene Ontology categories.

An alternative: scoring the distribution of expression levels for predefined gene
sets

An early example of the use of predefined gene sets to analyze differential
expression at the pathway level can be found in Lascaris et al. [

14]. The authors used a z-score to represent the
difference between the average expression in a gene set

*S *consisting of

*n *genes and the genomewide mean

*μ*:

Here

=

*σ*/

is the standard error of the mean,

*σ *being the standard deviation of
the genomewide distribution of log-ratios. The same metric is used by the "parametric
analysis of gene expression" (PAGE) method of Kim and Volsky [

15]. For larger gene sets, however, the standard t-test for
the difference between means yields more accurate results [

16]. The t-test, in general, scores the statistical
association between a categorical and quantitative feature. The categorical feature
is used to partition the set of all genes,

*G*, into two complementary subsets

*S *and S'.

The

*t *statistic measures the difference between the means of the two subsets
in units of its standard error:

Here

_{S
}and

are the mean expression value of genes in set

*S *and

*S'*,
respe

ctively, and the standard error of the
difference is given by

with *σ*_{S }and *σ*_{S'
}the standard deviation of the expression values of the genes within set
*S *and *S'*, respectively. Using a t-distribution with *n *-
2 degrees of freedom, each t-value can be converted to a p-value, which should again
be corrected for multiple testing.

Figure shows a side-by-side comparison of Fisher's exact test
and the t-test for a specific combination of GO category and genomewide differential
expression profile. Fisher's exact test can only be applied once a set of "genes of
interest" has been defined. We thresholded the fold-induction of individual genes to
define this gene set, and computed GO category enrichment P-values at different
thresholds (solid line/symbols). The smallest, most significant, P-value is obtained
at an individual-gene threshold significantly below 2-fold induction, satisfied by
over 500 genes. In general, the optimal threshold will depend on both the GO category
and the expression data. By contrast, the two-sample t-test uses the expression value
for all genes; no threshold for individual genes is required, an important practical
advantage. While the optimal P-value from Fisher's exact test is slightly smaller
than that of the two-sample t-test (dashed line), this seeming advantage disappears
as soon as multiple-testing correction associated with the required threshold
optimization is taken into account. Note that at the commonly used threshold of
2-fold induction, the two-sample t-test performs dramatically better.

Other statistical tests have also been used to detect differential expression of gene
sets based on the distribution of expression values. The original version of the
"gene set enrichment analysis" (GSEA) method [

17] used the Kolmogorov-Smirnov (KS) statistic to test whether
the distribution of expression levels in a specific gene set was different from that
of all genes; this approach was later found to require a modification to work
reliably [

18]. The Wilcoxon-Mann-Whitney
test, a non-parametric equivalent of the t-test that uses expression values only to
rank the genes, has also been applied to this problem [

19].

Beyond gene sets: approaches based on regression analysis

The assignment of genes to gene sets is categorical: Either the gene belongs to the
set, or it does not. However, gene sets are often a proxy for regulatory pathways.
This is most obvious in the case of the gene sets based on ChIP-chip data
[

11], which were used by Boorsma et al.
[

16] to analyze differential mRNA
expression using the t-test. The strict delineation of "targets" of a given TF based
on thresholding of the ChIP-chip signals is an oversimplification. In reality, the
degree to which the transcription rate for a given gene responds to a change in the
activity of the TF depends in a continuous fashion on the binding affinity between
the TF and the promoter DNA (as well as interactions with co-factors, chromatin,
etc.). Thus, if an estimate of this affinity is used as a predictor for changes in
transcription rate (and therefore expression), a single parameter that quantifies the
global change in TF activity may explain a wide range of transcriptional responses
across the genome. This intuition can be formalized in the form of a linear
regression model:

where *C *is an intercept and *F *a slope estimating the change in TF
activity. The dependent ("response") variable *A*_{g }is the
mRNA expression log-ratio of gene *g *between conditions. The independent
("predictor") variable *N*_{g }represents the regulatory
network connectivity between the TF and the promoter region of gene *g*. For
given *A*_{g }and *N*_{g}, the
deviance *D *between the measured and predicted expression values

is minimized. The solution is given by

and

where <

*X*> = (1/

*G*) ∑

_{g
}*X*_{g }denotes an average over all genes
and

*δX*_{g } *X*_{g }-
<

*X*> denotes the deviation from the genomic mean, so that
<

*δX*^{2}> equals the variance of

*X*. Because we
are dealing with

*univariate *regression (a single independent variable), the
Pearson correlation coefficient between

*A *and

*N*,

can be directly related to the slope *F *by the following equation:

It can furthermore be shown that, in the univariate case, *R*^{2},
defined as the fraction of the variance in expression that can be explained by the
linear model, is given by the square of Pearson correlation:

A transformation of *r *due to R.A. Fisher

yields a statistic *t *that is distributed according a t-distribution with
*n *- 3 degrees of freedom, and can thus be easily converted to a p-value.
Again, multiple testing will need to be accounted for whenever the association with
multiple features is scored in parallel.

There are many ways in which the regulatory network connectivities

*N*_{g }can be chosen. The first application of
regression analysis to microarray data, by Bussemaker et al. [

20], used integer motif counts in promoter regions. Continuous
sequence scores based on position-specific scoring matrices (PSSMs) [

21,

22] and position-specific
affinity matrices (PSAMs) [

23,

24] have also been used. The values for

*R*^{2
}obtained with such sequence-based predictors are typically in the range of
1–5%. Another possible choice for

*N *are ChIP-chip enrichment
(log-)ratios [

25,

26].
As these values are relatively noisy experimental measurements, the values for

*R*^{2 }observed in this case are usually smaller (< 1%).