We present an inferential framework for selecting gene sets that are associated with a continuous phenotypic endpoint for microarray data with multiple categories of experimental conditions. We proceed in three steps: 1) compute gene specific R2 between gene expression and endpoints for each individual category; 2) score the association of sets of genes with endpoints based on the gene specific association measurements; 3) combine category specific inference results to identify sets of genes that are associated with endpoint in most of the categories. When the phenotype is binary, many statistics used for identifying differentially expressed genes can be used in step 1. However, these statistics might not work well for continuous and non-linear/non-monotonic relationships between endpoint and gene expression. Based on natural cubic spline regression, our proposed R2 not only captures non-linear associations between endpoint and gene expression but also accommodates any existing linear association between the two variables. In step 2, we adapt the framework of GSEA/SAFE using our R2 as the association measurement. The combination of steps 1 and 2 can be regarded as a special case of the GSEA/SAFE procedure in a generalized sense. In step 3, we combine the results from multiple categories for each gene set and give a conservative FDR upper bound. This step depends on a subjective choice of threshold that can be tailored to the inferential goal. Different thresholds would result in different numbers of significant sets.
Assuming that the experimental information is fully represented by gene expression levels (Equation 1), we calculate the gene specific R2 separately for each category. For some genes, this assumption might be violated if the curves relating gene expression level and endpoints are different for the same category at different experimental conditions. In this case, a pooled approach inside each category might yield reduced R2 values. However, limited sample size of the category might not allow splitting the data into subcategories. Also, the curve (Xij, Yi) itself is driven by the experimental conditions, e.g., time and dose. If we fix all these factors, then pairs (Xij, Yi) may distribute around an average point and we would not be able to capture the main association trend.
We use the quantity R2 to screen strong associations between endpoint and gene expression levels rather than to select the best model for predicting the endpoint from the gene expression level. Since the spline's degrees of freedom is fixed, it is not necessary to use a penalty term to control over-fitting. Finer model adjustment for purposes beyond screening can be considered after associated gene sets are identified.
We illustrate our method using the NCT compendium data in which the expression values from 20,500 genes in both rat liver and blood were analyzed at 3–4 dose levels and 3 time points for 8 hepatotoxicants. We are interested in identifying pre-defined gene sets that are associated with liver injury indicated by the ALT activity level in blood for most of the 8 hepatotoxicants. For this compendium data, we did observe non-linear association between ALT and gene expression. This might be due to the fact that ALT level increases as the degree of liver injury increases and, on the other side, the gene expression levels can be either up-regulated or down-regulated.
In the data example, we fixed 4 inner knots at 20%, 40%, 60%, and 80% quantiles of the gene expression. This model space was selected by manually examining scatter plots between gene expression levels and ALT levels. We believe that this model space is rich enough to account for biologically plausible non-linear/non-monotonic trend. If extra degrees of freedom are desired, a graph similar to Figure can be used to evaluate the benefits. Although our proposed R2 can capture non-linear associations between ALT and gene expression and lead to biologically meaningful inferences, computational feasibility restricts us from considering more complicated models, e.g., with two or more genes jointly in the model. While a high R2 indicates strong association, a low R2 does not always imply no association. Also, weak association may not be well captured by R2. However, this concern is alleviated in GSEA/SAFE procedure, because the method is designed to accumulate the weak local signals from individual genes into a stronger global signal at gene set level.
We compare the R2
from a non-linear model and a linear model in Figure to illustrate the need for using a non-linear model. Further performance comparison based on identified sets is difficult without knowing the true state of nature of whether a specific gene set is associated with the phenotypic endpoint. Also, we believe that in our case, it will not provide a very helpful reference to pursue a computational simulation, which has to include arguable assumptions that expressions of genes from one pathway follow some convenient multivariate distribution (e.g., multivariate normal distribution) and the outcome variable follows a skew distribution. As shown in the additional file 1
, section 1, the null distributions of R2
will be very sensitive to these assumptions.
In summary, we describe a framework of GSEA/SAFE for microarray data in which the gene expression data were generated under multiple categories of experimental conditions and the phenotypic endpoint was continuous. The proposed association measure R2 successfully captured non-linear trends between the gene expression levels and endpoint. The R2 was then incorporated into the GSEA/SAFE procedure in per category analysis. The usage of R2 has the advantage over the t-statistics or Pearson correlation in identifying the gene sets with both genes positively and negatively correlated to endpoints. Inference across categories serves to identify gene sets, and the corresponding functional pathways whose alteration plays a role in related biological mechanism. Our method is general and can be applied to GSEA/SAFE analysis of microarray data with other continuous phenotype or multiple GSEA/SAFE analyses.
Finally, it is important to note that, in GSEA/SAFE analyses, global statistics is usually designed to test whether the distribution of local statistics within a gene set is different from that of genes outside the gene set. Proper local statistics and global statistics should be selected carefully to avoid the situation where a gene set is statistically significant but actually clusters in a region of weak association in the gene list and is off biological interest. In this article, we have used unsigned local statistics R2, signed global statistics and one-side p-value to eliminate this possibility.