Databases of genome-wide gene expression data represent a rich source of detailed information about patterns of gene expression that result from experimental perturbations and disease states. An important use of such storehouses is to understand the similarities and differences between various biological perturbations. Current approaches for querying these gene expression databases limit the analysis to comparisons of gene expression differences between pre-determined experimental groups. In contrast, openSESAME searches for instances of coordinate differential gene expression of a query signature without using phenotypic data.
The phenotype-naïve approach of openSESAME is similar in some regards to a strategy developed by West and colleagues [
37,
38]. In this approach, the major principal components of a gene expression signature identified in the query dataset are used to train a binary regression model to serve as a predictor of signature "activation". This model can then be used to predict signature activation in other datasets and to examine the potential association between signature activation probabilities and phenotypes within these target datasets. For example, probabilities for a gene expression signature reflecting the response to lactic acidosis is associated with better survival in breast cancer patients, consistent with the hypothesis that tumors in these patients may repress glycolysis via inhibition of the Akt pathway [
39]. Such signature activation probabilities are thus an alternative to the SA scores calculated by openSESAME in that both can be calculated per sample without regard to phenotypic comparisons. However, we are not aware of work to assess the significance of observed signature activation probabilities across datasets other than by association with phenotypic variables. The aim of this work is to demonstrate the ability of a phenotype-naïve method to identify datasets that are biologically related to each other.
The openSESAME algorithm consists of two components: a signature association (SA) score that reflects the similarity of a sample's pattern of relative gene expression to a query signature, and a test to determine whether the distribution of SA scores of all samples in a dataset represents a significant pattern of coordinate differential expression of the query signature. The SA score is a normalized Wilcoxon rank-sum statistic that compares the expression levels of the up- and down-regulated genes in the query signature. We chose this statistic over parametric methods such as a Pearson correlation to the query signature vector or Student's t test to compare the expression levels of the "up" and "down" genes because we could not assume that the normalized expression values would be normally distributed in the absence of significant coordinate differential expression. Using a signature of E2 treatment and a dataset of MCF7 cells treated with various compounds, including E2, we have validated that this SA score is a sensitive and specific method for detecting differential expression of a query signature.
We have explored two approaches to determine whether the SA scores for all samples in a dataset represents significant coordinate differential expression of a query signature. One method first establishes the significance of each SA score using threshold values and then uses Fisher's exact test to determine whether the proportion of samples exceeding these thresholds in each dataset is significantly different from the proportion of samples exceeding these thresholds across all samples in the compendium. The threshold values are determined based on the FDR corrected p value (q value) of the SA scores. The second method uses a Kolmogorov-Smirnov (K-S) test to determine whether the distribution of SA scores in each dataset is significantly different from the distribution of all SA scores across all samples in the compendium. The specificity of Fisher's exact test appears to be superior to that of the K-S test, presumably because the categorical assignment of significance to SA scores acts as a "band-pass filter" to remove noise from moderately skewed distributions of SA scores. However, the relative sensitivity of the Fisher's exact and K-S tests seems to be dependent on the strength of the coordinate differential signature expression signal, the former being more sensitive when the signal is strong and the latter being more sensitive when the signal is weaker.
To explore how openSESAME might perform with other query signatures, we examined how the performance of openSESAME depends on the size of the query signature and the relative proportion of up- and down-regulated genes. To do this, we sampled genes from the E2 treatment signature and tested the ability of these derivative queries to detect significant estrogen pathway activation in two datasets: one in which the gene expression signal of estrogen pathway activation was strong (GSE2225 [
33]) and another in which the signal is more difficult to detect (GSE21653 [
34]). These analyses indicate that the performance of openSESAME increases as both the size of the query signature increases and the relative proportion of up- and down- regulated genes becomes more balanced, although the absolute performance of openSESAME is strongly dependent upon the strength of the signal of coordinate differential expression. For example, when using Fisher's exact test, openSESAME achieves > 90% sensitivity to detect the estrogen pathway activation in GSE2225 at a
p value threshold of 0.001 when the signature contains at least 80 genes and 30% of the signature is composed of upregulated genes. Similar performance is achieved with a 68-gene signature containing 47% upregulated genes. However, in GSE21653, which has a weaker signal of estrogen pathway activation, the overall effect of signature size and composition upon performance remains similar, but openSESAME has only 11% sensitivity to detect the activation of the estrogen pathway at
p < 0.001 using Fisher's exact test (although sensitivity increases 23% using the K-S test). These analyses suggest the potential utility of openSESAME to identify coordinate differential expression signals with a wide range of signature sizes of varying composition.
We used a similar approach to explore how the size of a target dataset affects the ability of openSESAME to detect coordinate differential expression of a query signature within that dataset. The power of openSESAME to detect coordinate differential expression of a query signature generally increased with the size of the target dataset, but the absolute performance of the algorithm was again highly dependent upon the strength of the signal of coordinate differential expression. In series GSE2225, which has a strong estrogen pathway activation signal, openSESAME achieves > 90% sensitivity to detect this signal in subsets of GSE2225 as small as 12 samples (67% of the full-size dataset) at a p value threshold of 0.001 using Fisher's exact test. However, in GSE21653, which has a weaker signal of estrogen pathway activation, openSESAME only achieves > 90% sensitivity to detect this signal in subsets of GSE21653 containing between 225 and 250 samples (85-94% of the full-size dataset) at a p value threshold of 0.01 using the K-S test. With approximately the same number of samples, Fisher's exact test only achieves 36% sensitivity to detect the pathway activation signal at the same p value threshold. These analyses suggest the potential for openSESAME to detect strong signals of coordinate differential gene expression in small datasets and the potential importance of the K-S approach and larger datasets to powerfully detect weaker signals of coordinate differential gene expression.
We used openSESAME to identify experiments related to estrogen receptor (ER) pathway activation in a collection of approximately 75,000 human samples profiled on Affymetrix microarrays that have been deposited in the Gene Expression Omnibus (GEO) data repository. Our success in identifying numerous datasets representing differential ER pathway activation is due not only to the size of the gene expression compendium we searched, but also the relatively high frequency with which the ER pathway is modulated in the experimental and clinical datasets contained within this repository due to its biological and clinical importance. We identified experiments related to ER pathway activation using a signature of genes that are differentially expressed in response to E2 treatment in MCF7 cells. openSESAME identified experiments in which MCF7 cells were perturbed with E2, phytoestrogens such as genistein, or the E2 precursor testosterone (in cells overexpressing aromatase).
In addition to these results, we identified significant enrichment of the estrogen pathway activation signature in several series of primary breast tumors or breast cancer cell lines and found that ER-positive tumors and cell lines had significantly higher association with the E2 treatment signature than did ER-negative samples. These results likely reflect increases in steady-state ER pathway activation in response to available estrogen in ER-positive tumors and cell lines. Similarly, openSESAME identified two series of human endometrial tissue samples in which the E2 treatment signature was positively associated with the proliferative phase of the menstrual cycle and negatively associated with the mid secretory phase. These two phases correspond to higher and lower serum estrogen levels, respectively, suggesting that the perturbation of the E2 treatment signature in these datasets reflects a response to changing levels of available estrogen.
We also explored conditions related to the differentiation and maintenance of stratified squamous epithelia using a gene expression signature of the consequences of p63 downregulation. In two independent experiments, the pattern of gene expression associated with p63 silencing was evident in samples of metastatic melanomas relative to primary melanomas. Subsequent analysis of these datasets confirmed that the expression of
TP63 is down regulated in metastatic melanoma. This confirms previously published observations in which p63 staining was absent from greater than 90% of malignant melanomas [
40-
44] and may offer a mechanistic explanation as to why genes involved in stratified epithelial differentiation are markedly downregulated in such tumors [
32]. Taken together, the results of this openSESAME query implicate p63 as a regulator of the differentiation, maintenance and proliferation of stratified epithelium, highlighting the utility of openSESAME for the elucidation of gene function.
Finally, we chose to compare openSESAME with two other methods that relate gene expression differences in GEO-deposited data to a query signature. One of the key differences between openSESAME and both GeneChaser [
12] and MARQ [
14] is that, unlike openSESAME, these methods precompute the degree of differential expression associated with phenotypic comparisons gleaned from sample annotation information that has been deposited together with the gene expression data. We found that GeneChaser was not useful if any of the genes in the "up" or "down" sets were not differentially expressed in the same direction as the others in each set. This highlights one of the strengths of openSESAME, which is that it can identify experiments that are relevant to a signature of interest even if a subset of genes in a signature are not differentially expressed in the same direction in two different experiments.
Using the E2 treatment signature, MARQ identified perturbations of estrogen signaling in four of the series detected by openSESAME. Similarly, when using the p63 silencing signature, MARQ identified phenotypic comparisons related to stratified squamous cell type or differentiation within three of the series detected by openSESAME. MARQ also identified phenotypic comparisons within a number of series that were not detected by openSESAME. For example, MARQ identified E2 signaling as downregulated in MCF7 cells treated with dioxin, which is a well-known antiestrogenic agent [
45], and identified p63 signaling as being upregulated in acne lesions compared with normal skin and in the bronchial epithelium of smokers compared with that of nonsmokers (both of which may reflect squamous repair mechanisms in injured or inflamed tissue). The failure of openSESAME to identify datasets identified by MARQ suggests that methods like MARQ that leverage phenotypic data to identify patterns of coordinate differential gene expression may generally be more sensitive than phenotype-naïve approaches such as openSESAME, as they incorporate additional sources of information.
However, there were a number of datasets that were identified by openSESAME that were not identified by MARQ, and these point to potential advantages of a phenotype-naïve approach. For example, openSESAME detected significant perturbation of the p63 silencing signature in GSE2280 [
36], in which p63 silencing is positively associated with metastatic squamous cancer samples, consistent with the loss of p63 expression observed in the metastatic melanoma samples described earlier. MARQ did not assign a significant
p value to this comparison, however, because there was substantial heterogeneity among these samples. Taken together, these results suggest that methods such as MARQ may generally be more specific in identifying interesting biological perturbations, given the directed nature of its queries, but openSESAME is more robust to variability within prespecified comparisons. There were also a number of examples of significant coordinate differential expression of the E2 and p63 query signatures that were detected by openSESAME but were not detected by MARQ (Additional Files
2 and
4), including the phytoestrogen and melanoma datasets, because these datasets were not included in the curated data from which MARQ calculated phenotype-associated differential expression. The perturbations that are detected uniquely by openSESAME highlight the key advantage of openSESAME in that it is able to detect coordinate differential expression of a query signature without regard to explicit annotation-driven comparisons between defined experimental groups.
In this work, we have shown that openSESAME can identify datasets with similar patterns of coordinate differential gene expression using an approach that does not require identification of phenotype-associated differential expression and that these gene expression similarities can be used to identify conditions that are biologically related to each other. We therefore believe that openSESAME will be a broadly useful approach to leverage the increasing body of available genome-wide gene expression profiling experiments to generate hypotheses about the causes and consequences of observed patterns of differential gene expression. To further explore the potential of this approach, we have created a publicly accessible implementation of openSESAME (
http://opensesame.bu.edu) for the scientific community to test with their own gene expression signatures.