The term “pathway analysis” has been used in very broad contexts in the literature. It has been applied to the analysis of Gene Ontology (GO) terms (also referred to as a “gene set”), physical interaction networks (e.g., protein–protein interactions), kinetic simulation of pathways, steady-state pathway analysis (e.g., flux-balance analysis), and in the inference of pathways from expression and sequence data. However, the definition of a “pathway” in some of these uses may be misleading or incorrect. For instance, the cellular compartment ontology in GO does not describe a pathway.
It is beyond the scope of this review to discuss the large number of analytic methods covered by such a broad application of the term “pathway analysis.” Therefore, this review focuses on methods that exploit pathway knowledge in public repositories such as GO or Kyoto Encyclopedia of Genes and Genomes (KEGG), rather than on methods that infer pathways from molecular measurements. We call this approach knowledge base–driven pathway analysis. It identifies pathways that may be affected in a condition by correlating information in at least one pathway knowledge base with gene expression patterns for the condition. The result describes differential expression of sets of genes or proteins, rather than a list of individual genes.
Instead of individually reviewing a large number of pathway analysis approaches, our goal here is to group approaches by the type of analysis they perform and discuss their relative merits. However, for those desiring specific information about individual tools, Text S2 provides feature comparisons for a number of individual tools in each group.
Virtually all of the approaches and tools discussed here are independent of the high-throughput technology used to generate the data (including next-generation sequencing) and of the knowledge bases used for pathway annotations. In this review, we use gene expression measurements as example data for discussing and explaining various approaches.
First Generation: Over-Representation Analysis (ORA) Approaches
The immediate need for functional analysis of microarray gene expression data and the emergence of GO during that period gave rise to over-representation analysis (ORA), which statistically evaluates the fraction of genes in a particular pathway found among the set of genes showing changes in expression. It is also referred to as the “2×2 table method” in the literature. ORA uses one or more variations of the following strategy: first, an input list is created using a certain threshold or criterion. For example, a researcher may choose genes that are differentially over- or under-expressed in a given condition at a false discovery rate (FDR) of 5%. Then, for each pathway, input genes that are part of the pathway are counted. This process is repeated for an appropriate background list of genes (e.g., all genes measured on a microarray). Next, every pathway is tested for over- or under-representation in the list of input genes. The most commonly used tests are based on the hypergeometric, chi-square, or binomial distribution. We refer readers to recent comparisons of ORA tools for more details. Many of the ORA tools differ only slightly from each other, as they use the same statistical tests as well as overlapping pathway databases (Table S1).
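The ORA strategy described above can be illustrated with a minimal sketch, assuming a one-sided hypergeometric test for over-representation. The function names, gene identifiers, and pathway definitions are hypothetical and do not correspond to any particular ORA tool.

```python
from math import comb

def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn without replacement from N
    background genes, K of which belong to the pathway."""
    upper = min(K, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)

def ora(input_genes, background_genes, pathways):
    """Return an over-representation p-value for every pathway."""
    background = set(background_genes)
    selected = set(input_genes) & background
    results = {}
    for name, members in pathways.items():
        in_pathway = set(members) & background   # pathway genes that were measured
        overlap = len(selected & in_pathway)     # input genes in the pathway
        results[name] = hypergeom_upper_tail(
            overlap, len(in_pathway), len(selected), len(background)
        )
    return results

# Toy data with hypothetical gene identifiers.
background = [f"g{i}" for i in range(1, 21)]      # 20 measured genes
input_list = ["g1", "g2", "g3", "g4", "g15"]      # 5 genes passing the threshold
pathways = {
    "pathway_A": ["g1", "g2", "g3", "g4", "g5"],
    "pathway_B": ["g15", "g16", "g17", "g18"],
}
p_values = ora(input_list, background, pathways)
```

In this toy example, four of the five input genes fall in pathway_A, so it receives a much smaller p-value than pathway_B. A real analysis would also correct the p-values for multiple testing across pathways.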
Examples of pathway analysis tools in each generation.
Overview of existing pathway analysis methods using gene expression data as an example.
Despite the availability of a large number of tools and their widespread usage, ORA has a number of limitations. First, the different statistics used by ORA (e.g., hypergeometric, binomial, and chi-square distributions) are independent of the measured changes. This means that these tests consider the number of genes alone and ignore any values associated with them, such as probe intensities. By discarding these data, ORA treats each gene equally. However, information about the extent of regulation (e.g., fold-changes or the significance of a change) can be useful in assigning different weights to input genes, as well as to the pathways they are involved in, which in turn can provide more information than current ORA approaches.
Second, ORA typically uses only the most significant genes and discards the others. For instance, the input list of genes from a microarray experiment is usually obtained using an arbitrary threshold (e.g., genes with fold-change > 2 or p < 0.05). With this method, marginally less significant genes (e.g., fold-change = 1.999 or p = 0.051) are missed, resulting in information loss. Breitling et al. addressed this problem by proposing an ORA method that avoids thresholds. It uses an iterative approach that adds one gene at a time to find a set of genes for which a pathway is most significant.
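The idea of adding one gene at a time can be sketched as follows. This is an illustration in the spirit of a threshold-free ORA, not the published algorithm of Breitling et al.: it assumes genes are already ranked from most to least changed, uses a hypergeometric tail probability at every candidate cutoff, and keeps the cutoff at which the pathway is most significant. All names and data are hypothetical.

```python
from math import comb

def tail(k, K, n, N):
    """Hypergeometric P(X >= k): n draws from N genes, K in the pathway."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

def best_cutoff(ranked_genes, pathway):
    """Scan the ranked list and return (smallest p-value, list length at which it occurs)."""
    members = set(pathway) & set(ranked_genes)
    N, K = len(ranked_genes), len(members)
    best_p, best_t, hits = 1.0, 0, 0
    for t, gene in enumerate(ranked_genes, start=1):
        if gene in members:          # p can only improve when a member is added
            hits += 1
            p = tail(hits, K, t, N)
            if p < best_p:
                best_p, best_t = p, t
    return best_p, best_t

# Genes ranked by decreasing evidence of differential expression (toy data).
ranked = ["g3", "g7", "g1", "g9", "g2", "g5", "g8", "g4", "g6", "g10"]
pathway = ["g3", "g7", "g9"]
min_p, cutoff = best_cutoff(ranked, pathway)
```

Because the minimum is taken over many cutoffs, the resulting value is optimistically biased and cannot be read as an ordinary p-value without further correction (for example, by permutation).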
Third, by treating each gene equally, ORA assumes that each gene is independent of the other genes. However, biology is a complex web of interactions between gene products that constitute different pathways. One goal of gene expression analysis might be to gain insights into how interactions between gene products are manifested as changes in gene expression. A strategy that assumes the genes are independent is significantly limited in its ability to provide insights in this regard. Furthermore, assuming independence between genes amounts to “competitive null hypothesis” testing (see below), which ignores the correlation structure between genes. Consequently, the estimated significance of a pathway may be biased or incorrect.
Fourth, ORA assumes that each pathway is independent of other pathways, which is erroneous. For instance, GO defines a biological process as a series of events accomplished by one or more ordered assemblies of molecular functions (http://www.geneontology.org/GO.doc.shtml). Another example of dependence between pathways is the cell cycle pathway in KEGG (http://www.genome.jp/kegg/pathway/hsa/hsa04110.html), where the presence of a growth factor activates the MAPK signaling pathway. This, in turn, activates the cell cycle pathway. No ORA methods account for this dependence between molecular functions in GO and signaling pathways in KEGG.
Second Generation: Functional Class Scoring (FCS) Approaches
The hypothesis of functional class scoring (FCS) is that although large changes in individual genes can have significant effects on pathways, weaker but coordinated changes in sets of functionally related genes (i.e., pathways) can also have significant effects. With few exceptions, all FCS methods use a variation of a general framework that consists of the following three steps: first, a gene-level statistic is computed using the molecular measurements from an experiment. This involves computing differential expression of individual genes or proteins. Statistics currently used at gene-level include correlation of molecular measurements with phenotype, ANOVA, Q-statistic, signal-to-noise ratio, and Z-score. Although the choice of a gene-level statistic has a negligible effect on the identification of significantly enriched gene sets, when there are few biological replicates, a regularized statistic may be better. Furthermore, untransformed gene-level statistics can fail to identify pathways with up- and down-regulated genes. In this case, transformation of gene-level statistics (e.g., absolute values, squared values, ranks, etc.) is preferable.
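The effect of transforming gene-level statistics can be seen in a small sketch. It assumes the signal-to-noise ratio is computed as the difference in class means divided by the sum of class standard deviations; the expression values and gene names are invented for illustration.

```python
from statistics import mean, stdev

def signal_to_noise(case, control):
    """Difference in class means scaled by the sum of class standard deviations."""
    return (mean(case) - mean(control)) / (stdev(case) + stdev(control))

# Hypothetical pathway: two genes strongly up-regulated, two strongly down-regulated.
high = [8.0, 8.2, 7.9]
low = [5.0, 5.1, 4.9]
samples = {
    "up1": (high, low), "up2": (high, low),
    "down1": (low, high), "down2": (low, high),
}

raw = {gene: signal_to_noise(case, control)
       for gene, (case, control) in samples.items()}

# Untransformed statistics cancel out; absolute values do not.
raw_mean = mean(raw.values())
abs_mean = mean(abs(s) for s in raw.values())
```

Here every gene changes strongly, yet the average of the untransformed statistics is zero because the up- and down-regulated genes cancel. Averaging absolute values retains the signal, at the cost of discarding the direction of change.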
Second, the gene-level statistics for all genes in a pathway are aggregated into a single pathway-level statistic. This statistic can be multivariate and account for interdependencies among genes, or it can be univariate and disregard interdependencies among genes. The pathway-level statistics used by current approaches include the Kolmogorov-Smirnov statistic; the sum, mean, or median of the gene-level statistics; the Wilcoxon rank sum; and the maxmean statistic. Irrespective of its type, the power of a pathway-level statistic can depend on the proportion of differentially expressed genes in a pathway, the size of the pathway, and the amount of correlation between genes in the pathway. Interestingly, although multivariate statistics are expected to have higher statistical power, univariate statistics show more power at stringent cutoffs when applied to real biological data, and power equal to that of multivariate statistics at less stringent cutoffs.
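As a concrete illustration of the aggregation step, the sketch below computes three univariate pathway-level statistics from a list of gene-level statistics. The maxmean statistic is written here as the larger of the average positive part and the average negative part, both taken over all genes in the set; the input values are hypothetical.

```python
from statistics import mean, median

def maxmean(stats):
    """Larger of the mean positive part and the mean negative part,
    each averaged over every gene in the set."""
    n = len(stats)
    positive_part = sum(s for s in stats if s > 0) / n
    negative_part = -sum(s for s in stats if s < 0) / n
    return max(positive_part, negative_part)

# Hypothetical gene-level statistics for one pathway.
gene_stats = [2.5, 1.5, -0.5, -3.0, 0.5]

pathway_mean = mean(gene_stats)        # 0.2: opposing changes largely cancel
pathway_median = median(gene_stats)    # 0.5
pathway_maxmean = maxmean(gene_stats)  # 0.9: dominated by the up-regulated genes
```

The three summaries differ noticeably on the same input, which is one reason the choice of pathway-level statistic affects power.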
The final step in FCS is assessing the statistical significance of the pathway-level statistic. When computing statistical significance, the null hypothesis tested by current pathway analysis approaches can be broadly divided into two categories: i) competitive null hypothesis and ii) self-contained null hypothesis. A test of a self-contained null hypothesis permutes class labels (i.e., phenotypes) across samples and compares the set of genes in a given pathway with itself, while ignoring the genes that are not in the pathway. On the other hand, a test of a competitive null hypothesis permutes gene labels for each pathway, and compares the set of genes in the pathway with a set of genes that are not in the pathway. Text S2 has a detailed discussion on the differences between the two null hypotheses.
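The two permutation schemes can be contrasted in a minimal sketch. It assumes a difference in class means as the gene-level statistic and the mean absolute difference as the pathway-level statistic; these choices, the function names, and the toy data are illustrative only.

```python
import random
from statistics import mean

def gene_stat(values, labels):
    """Difference in mean expression between class 1 and class 0."""
    case = [v for v, lab in zip(values, labels) if lab == 1]
    ctrl = [v for v, lab in zip(values, labels) if lab == 0]
    return mean(case) - mean(ctrl)

def pathway_stat(expr, genes, labels):
    return mean(abs(gene_stat(expr[g], labels)) for g in genes)

def self_contained_p(expr, pathway, labels, n_perm=1000, seed=0):
    """Permute sample (class) labels; only genes in the pathway are used."""
    rng = random.Random(seed)
    observed = pathway_stat(expr, pathway, labels)
    count = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        count += pathway_stat(expr, pathway, shuffled) >= observed
    return (count + 1) / (n_perm + 1)

def competitive_p(expr, pathway, labels, n_perm=1000, seed=0):
    """Permute gene labels: compare against random gene sets of equal size."""
    rng = random.Random(seed)
    observed = pathway_stat(expr, pathway, labels)
    all_genes = sorted(expr)
    count = 0
    for _ in range(n_perm):
        random_set = rng.sample(all_genes, len(pathway))
        count += pathway_stat(expr, random_set, labels) >= observed
    return (count + 1) / (n_perm + 1)

# Toy data: 4 case and 4 control samples, 3 pathway genes, 7 background genes.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
expr = {
    "p1": [9.8, 10.1, 10.3, 9.9, 5.1, 4.9, 5.2, 5.0],
    "p2": [8.9, 9.2, 9.0, 9.1, 6.0, 6.2, 5.9, 6.1],
    "p3": [7.5, 7.7, 7.4, 7.6, 5.5, 5.4, 5.6, 5.5],
}
for i in range(7):  # background genes with essentially no class difference
    expr[f"b{i}"] = [5.0 + 0.1 * ((i + j) % 3) for j in range(8)]

p_self = self_contained_p(expr, ["p1", "p2", "p3"], labels)
p_comp = competitive_p(expr, ["p1", "p2", "p3"], labels)
```

Note that with only eight samples there are few distinct label permutations, so the self-contained p-value has a floor that the competitive p-value does not share; this is one practical difference between the two schemes.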
FCS methods address three limitations of ORA. First, they do not require an arbitrary threshold for dividing expression data into significant and non-significant pools. Rather, FCS methods use all available molecular measurements for pathway analysis. Second, while ORA completely ignores molecular measurements when identifying significant pathways, FCS methods use this information in order to detect coordinated changes in the expression of genes in the same pathway. Finally, by considering the coordinated changes in gene expression, FCS methods account for dependence between genes in a pathway, which ORA does not.
Although FCS is an improvement over ORA, it also has several limitations. First, similar to ORA, FCS analyzes each pathway independently. This is a limitation because a gene can function in more than one pathway, meaning that pathways can cross and overlap. Consequently, while only one pathway may be affected in an experiment, one may observe other pathways being significantly affected due to the set of overlapping genes. Such a phenomenon is very common when using GO terms to define pathways, due to the hierarchical nature of GO.
Second, many FCS methods use changes in gene expression to rank genes in a given pathway, and discard the changes from further analysis. For instance, assume that two genes in a pathway, A and B, are changing by 2-fold and 20-fold, respectively. As long as they both have the same respective ranks in comparison with other genes in the pathway, most FCS methods will treat them equally, although the gene with the higher fold-change should probably get more weight. Importantly, however, considering only the ranks of genes is also advantageous, as it is more robust to outliers. A notable exception to this scenario is approaches that use gene-level statistics (e.g., t-statistic) to compute pathway-level scores. For example, an FCS method that computes a pathway-level statistic as a sum or mean of the gene-level statistic accounts for a relative difference in measurements (e.g., Category, SAFE in Table S2).
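The difference between rank-based and magnitude-based scoring can be shown with a toy comparison. In the sketch below, gene B changes 20-fold in one scenario and 2.5-fold in the other, while all ranks stay the same; the gene names and fold-changes are hypothetical, and the two scores are simplified stand-ins for rank-based and mean-based pathway statistics.

```python
from math import log2

pathway = ["A", "B"]
scenario_1 = {"A": 2.0, "B": 20.0, "C": 1.1, "D": 1.2, "E": 0.9}
scenario_2 = {"A": 2.0, "B": 2.5, "C": 1.1, "D": 1.2, "E": 0.9}

def ranks(fold_changes):
    """Rank genes from smallest (1) to largest fold-change."""
    order = sorted(fold_changes, key=fold_changes.get)
    return {gene: rank for rank, gene in enumerate(order, start=1)}

def rank_sum(fold_changes):
    r = ranks(fold_changes)
    return sum(r[g] for g in pathway)

def mean_log_fc(fold_changes):
    return sum(log2(fold_changes[g]) for g in pathway) / len(pathway)

rank_scores = (rank_sum(scenario_1), rank_sum(scenario_2))            # identical
magnitude_scores = (mean_log_fc(scenario_1), mean_log_fc(scenario_2)) # differ
```

The rank-based score cannot distinguish the two scenarios, whereas the mean of log fold-changes can. The same insensitivity is what makes the rank-based score robust if the 20-fold value is an outlier.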
Third Generation: Pathway Topology (PT)-Based Approaches
A large number of publicly available pathway knowledge bases provide information beyond simple lists of genes for each pathway. Unlike GO and the Molecular Signatures Database (MSigDB), these knowledge bases also provide information about gene products that interact with each other in a given pathway, how they interact (e.g., activation, inhibition, etc.), and where they interact (e.g., cytoplasm, nucleus, etc.). These knowledge bases include KEGG, MetaCyc, Reactome, RegulonDB, STKE (http://stke.sciencemag.org), BioCarta (http://www.biocarta.com), and PantherDB.
ORA and FCS methods consider only the number of genes in a pathway or gene coexpression to identify significant pathways, and ignore the additional information available from these knowledge bases. Hence, even if the pathways are completely redrawn with new links between the genes, as long as they contain the same set of genes, ORA and FCS will produce the same results. Pathway topology (PT)-based methods (Table S3) have been developed to utilize the additional information. PT-based methods are essentially the same as FCS methods in that they perform the same three steps as FCS methods. The key difference between the two is the use of pathway topology to compute gene-level statistics.
Rahnenfuhrer et al. proposed ScorePAGE, which computes similarity between each pair of genes in a pathway (e.g., correlation, covariance, etc.). The similarity measurement between each pair of genes is analogous to gene-level statistics in FCS methods, which is averaged to compute a pathway-level score. However, instead of giving equal weight to all pairwise similarities, ScorePAGE divides the pairwise similarities by the number of reactions needed to connect two genes in a given pathway. Although the approach is designed to analyze metabolic pathways, it is theoretically also applicable to signaling pathways.
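The distance-weighting idea can be sketched as follows. This is an illustration of the principle rather than the ScorePAGE implementation: it assumes Pearson correlation as the similarity measure, represents the pathway as an undirected graph, and takes the shortest-path length between two genes as the number of reactions needed to connect them. Gene names and expression profiles are hypothetical.

```python
from collections import deque
from itertools import combinations
from math import sqrt

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def distances_from(graph, source):
    """Breadth-first search: shortest path length from source to every reachable gene."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def topology_weighted_score(expr, graph):
    """Average of pairwise similarities, each divided by the pair's graph distance."""
    scores = []
    for a, b in combinations(sorted(graph), 2):
        d = distances_from(graph, a).get(b)
        if d:  # skip pairs that are not connected
            scores.append(pearson(expr[a], expr[b]) / d)
    return sum(scores) / len(scores)

# Toy linear pathway g1 - g2 - g3 with hypothetical expression profiles.
graph = {"g1": ["g2"], "g2": ["g1", "g3"], "g3": ["g2"]}
expr = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8], "g3": [4, 3, 2, 1]}
score = topology_weighted_score(expr, graph)
```

Pairs that are adjacent in the pathway contribute their full similarity, while the g1 and g3 pair, two steps apart, contributes only half of its correlation.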
A recent impact factor (IF) analytic approach was proposed to analyze signaling pathways. IF considers the structure and dynamics of an entire pathway by incorporating a number of important biological factors, including changes in gene expression, types of interactions, and the positions of genes in a pathway. Briefly, IF analysis models a signaling pathway as a graph, where nodes represent genes and edges represent interactions between them. Further, it defines a gene-level statistic, called perturbation factor (PF) of a gene, as a sum of its measured change in expression and a linear function of the perturbation factors of all genes in a pathway (see Equation 1 in Text S1). Because the PF of each gene is defined by a linear equation, the entire pathway is defined as a linear system. Representing a pathway as a linear system also addresses loops in the pathways. The IF of a pathway (pathway-level statistic) is defined as a sum of PF of all genes in a pathway (see Equation 2 in Text S1). IF analysis was recently improved to address the dominating effect of change in expression on PF and high false positive rate for a small list of input genes.
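The linear-system view can be made concrete with a small sketch. It does not reproduce Equations 1 and 2 of Text S1; instead it assumes a generic form PF = ΔE + B·PF, where ΔE holds the measured expression changes and B is a hypothetical weight matrix whose entry B[i][j] is the influence of gene j on gene i. Under that assumption the perturbation factors solve (I − B)·PF = ΔE, and the pathway-level score is taken as their sum, following the description above.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def perturbation_factors(delta_e, B):
    """Solve (I - B) PF = delta_e for the assumed linear model."""
    n = len(delta_e)
    I_minus_B = [[(1.0 if i == j else 0.0) - B[i][j] for j in range(n)]
                 for i in range(n)]
    return solve(I_minus_B, delta_e)

# Toy pathway with a loop: g0 -> g1 -> g2 -> g0, each edge with weight 0.5.
B = [
    [0.0, 0.0, 0.5],   # g0 is influenced by g2
    [0.5, 0.0, 0.0],   # g1 is influenced by g0
    [0.0, 0.5, 0.0],   # g2 is influenced by g1
]
delta_e = [1.0, 0.0, 0.0]   # only g0 shows a measured change

pf = perturbation_factors(delta_e, B)   # approximately [8/7, 4/7, 2/7]
pathway_score = sum(pf)
```

Even though only g0 has a measured change, g1 and g2 receive non-zero perturbation factors through the topology, and the loop back to g0 is handled by solving all equations simultaneously rather than by propagating changes step by step.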
FCS methods that use correlations among genes implicitly assume that the underlying network, as defined by the correlation structure, does not change as the experimental conditions change. However, this assumption may be inaccurate. For example, the correlation structure between ARG2 and other genes in the urea-cycle pathway changes with a change in expression of ARG2, suggesting changes in the topology of the pathway.
Shojaie et al. proposed a method, called NetGSA, that accounts for the change in correlation as well as the change in network structure as experimental conditions change. Their approach, like IF analysis, models gene expression as a linear function of other genes in the network. However, it differs from IF in two aspects. First, it accounts for a gene's baseline expression by representing it as a latent variable in the model. Second, it requires that the pathways be represented as directed acyclic graphs (DAGs). If a pathway contains cycles, NetGSA requires additional latent variables affecting the nodes in the cycle. In contrast, IF analysis does not impose any constraint on the structure of a pathway.
Although PT-based methods are difficult to generalize, they have several common limitations. One obvious problem is that the true pathway topology depends on the cell type, because gene expression profiles are cell-specific, and on the condition being studied. However, this information is rarely available and is fragmented in knowledge bases, even if it is fully understood. As annotations improve, these approaches are expected to become more useful. Other limitations of PT-based methods include the inability to model dynamic states of a system and the inability to consider interactions between pathways, because the inter-pathway links available to account for interdependence between pathways are weak. These limitations are discussed in detail in the Outstanding Challenges section below.