PathVar identifies and analyzes deregulation patterns in pathway expression using two possible analysis modes, a supervised and an unsupervised mode, chosen automatically depending on the availability of sample class labels.
In the first step, the user uploads a pre-normalized, tab-delimited microarray dataset and chooses an annotation database to map genes/proteins onto cellular pathways and processes (see Section 4). Next, in the supervised analysis mode, the software computes two gene/protein set rankings in terms of differential pathway expression variance using a parametric T-test and a non-parametric Mann–Whitney U-test (or respectively, an F-test and Kruskal–Wallis test for multi-class data). Alternatively, in the unsupervised analysis mode, three feature rankings are obtained from the pathway expression variance matrix (rows = pathways, columns = samples) by computing the absolute variances across the columns/samples, the magnitude of the loadings in a sparse principal component analysis (Zou and Hastie, 2008) and a recently proposed entropy score (Varshavsky et al., 2006). These rankings are combined by computing the sum of ranks across the three methods and normalizing the sum-of-ranks scores by dividing by the maximum possible score. The resulting sortable ranking table of pathways contains the test statistics and significance scores, the number and identifiers of the mapped genes/proteins, and buttons to generate box plots for each pathway and forward the genes/proteins to other bioscientific web services for further analysis. Moreover, a heat-map visualization of the expression level variances is provided as output.
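The construction of the pathway expression variance matrix and the sum-of-ranks aggregation can be sketched as follows. This is a minimal Python illustration under stated assumptions, not PathVar's implementation: the function names and the dictionary-based data layout are hypothetical, the per-sample pathway variance is assumed to be the sample variance over the expression values of the genes mapped to a pathway, and ranks are assumed to run from 1 (lowest score) to n (highest score), so that a pathway ranked top by all methods receives a normalized score of 1. The sparse PCA loadings and entropy scores are treated as precomputed inputs.

```python
from statistics import variance


def pathway_variance_matrix(expr, pathways):
    """Per-sample variance of the expression values of each pathway's genes.

    expr:     dict gene id -> list of expression values, one per sample
    pathways: dict pathway name -> list of gene ids (assumed layout)
    Returns a dict pathway name -> list of variances, one per sample.
    """
    matrix = {}
    for name, genes in pathways.items():
        mapped = [g for g in genes if g in expr]
        if len(mapped) < 2:  # a sample variance needs at least two values
            continue
        n_samples = len(expr[mapped[0]])
        matrix[name] = [
            variance([expr[g][s] for g in mapped]) for s in range(n_samples)
        ]
    return matrix


def ascending_ranks(scores):
    """Rank 1 = lowest score; tied scores share their average rank."""
    ordered = sorted(scores, key=scores.get)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[ordered[k]] = average
        i = j + 1
    return ranks


def normalized_rank_sum(score_tables):
    """Sum the ranks over all score tables and divide by the maximum sum.

    score_tables: list of dicts pathway name -> score (higher = more relevant),
    e.g. absolute variance, sparse PCA loading magnitude and entropy score.
    """
    names = list(score_tables[0])
    max_sum = len(score_tables) * len(names)  # top rank n in every table
    rank_tables = [ascending_ranks(table) for table in score_tables]
    return {n: sum(r[n] for r in rank_tables) / max_sum for n in names}
```

Dividing by the largest attainable rank sum keeps the combined score in the interval (0, 1] regardless of how many pathways pass the filtering step, which makes scores comparable between datasets of different size.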
In the next step, the user can forward the extracted pathway variance data to a clustering module, for identifying sample groups with similar expression variance across multiple pathways, or to a classification module (for labelled data), to build models for sample classification. The clustering module provides a selection of four hierarchical clustering algorithms, three partition-based approaches and one consensus clustering approach to combine the results of the individual methods [see Glaab et al. (2009) and Supplementary Material]. To compare the outcomes of different clustering approaches and to identify a number of clusters that is optimal in terms of cluster compactness and separation between the clusters, five validity indices are computed and aggregated as the sum of validity score ranks across all methods and numbers of clusters. Moreover, the clustering results are visualized using both 2D plots (cluster validity score plots, principal component plots, dendrograms and silhouette plots) and interactive 3D visualizations using dimensionality reduction methods (Supplementary Material).
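One way to read the aggregation of validity indices is sketched below in Python. The sketch rests on assumptions that are not taken from the text: the function name and data layout are hypothetical, every index is assumed to be oriented so that higher values indicate better clusterings (indices where lower is better would need to be negated first), and ties are broken by input order rather than by average ranks.

```python
def aggregate_validity(index_scores):
    """Pick the (method, number of clusters) pair with the best rank sum.

    index_scores: dict index name -> dict (method, k) -> validity score,
                  with higher scores assumed to be better for every index.
    Returns (best configuration, dict configuration -> rank sum).
    """
    configs = list(next(iter(index_scores.values())))
    rank_sum = {config: 0 for config in configs}
    for per_index in index_scores.values():
        # Rank 1 goes to the best-scoring configuration for this index.
        ordered = sorted(configs, key=lambda c: per_index[c], reverse=True)
        for rank, config in enumerate(ordered, start=1):
            rank_sum[config] += rank
    best = min(rank_sum, key=rank_sum.get)  # smallest rank sum wins
    return best, rank_sum
```

Working on ranks rather than raw scores avoids having to put indices with different value ranges on a common scale before combining them.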
For a supervised analysis of the data, the classification module contains six diverse feature selection methods and six prediction algorithms, which can be combined freely by the user [see Glaab et al. (2009) and Supplementary Material]. To estimate the accuracy of the generated classification models, the available evaluation schemes include an external n-fold cross-validation as well as user-defined training/test set partitions. In addition to the average prediction accuracy and SD obtained from these evaluation methods, several other performance statistics, such as the sensitivity, the specificity and Cohen's kappa statistic, are computed. Additionally, a Z-score estimate of each gene set's utility for sample classification is determined from the frequency of its selection across different cross-validation cycles, and a heat map is generated to visualize the expression variance for the most informative gene sets. All machine learning technique implementations stem from a fully automated data analysis framework (Glaab et al., 2009), which has previously been employed in a variety of bioscientific studies (Bassel et al., 2011; Glaab et al., 2010; Habashy et al., 2011).
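For the two-class case, the performance statistics named above follow directly from the confusion matrix, as the Python sketch below illustrates. The function name, argument layout and the `positive` label are hypothetical and not part of PathVar; the sketch also assumes that both classes occur in the true labels and that chance agreement is below 1, since otherwise the ratios are undefined.

```python
def binary_performance(y_true, y_pred, positive=1):
    """Accuracy, sensitivity, specificity and Cohen's kappa for two classes.

    Assumes both classes are present in y_true; otherwise a division by
    zero occurs in the sensitivity or specificity.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    n = tp + tn + fp + fn

    accuracy = (tp + tn) / n  # observed agreement
    # Agreement expected by chance from the marginal label frequencies.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return {
        "accuracy": accuracy,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "kappa": (accuracy - expected) / (1 - expected),
    }
```

Kappa corrects the accuracy for the agreement expected by chance, which matters when the sample classes are unbalanced and a trivial majority-class predictor already reaches a high accuracy.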
To alleviate statistical limitations resulting from incomplete mappings of genes/proteins onto pathways and from multiple hypothesis testing, only pathways with a minimum of 10 mapped identifiers are considered in all analyses and p-values are adjusted according to Benjamini and Hochberg (1995) (see section on limitations in the Supplementary Material for details and advice).
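Both safeguards can be illustrated with a short Python sketch. The function names and data layout are hypothetical; the adjustment is the standard Benjamini and Hochberg step-up procedure, and the threshold of 10 mapped identifiers is taken from the text above.

```python
def filter_pathways(pathways, measured_ids, min_mapped=10):
    """Keep only pathways with at least min_mapped measured identifiers.

    pathways:     dict pathway name -> list of gene/protein identifiers
    measured_ids: identifiers present in the uploaded dataset
    Returns a dict pathway name -> list of mapped identifiers.
    """
    measured = set(measured_ids)
    kept = {}
    for name, members in pathways.items():
        mapped = [m for m in members if m in measured]
        if len(mapped) >= min_mapped:
            kept[name] = mapped
    return kept


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Filtering before testing also reduces the number of hypotheses m, so the adjustment is applied only to pathways for which the variance estimate rests on a reasonable number of genes/proteins.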