Gene set enrichment (GSE) is a computational technique used in the analysis of gene expression data. The technique determines whether an a priori defined set of genes shows statistically significant differential expression between two sample tissues, time points or conditions [1]. Gene sets are determined by prior biological knowledge relating to co-expression, function, location or known biochemical pathways. The fundamental principle of GSE is that biochemical pathways are determined by sets of genes and that, if a pathway is in any way related to a biological trait, then its co-functioning genes should display a higher degree of enrichment than the rest of the transcriptome. A focus on the expression of gene sets rather than that of individual genes makes better use of the information generated by a microarray experiment, because genes that show only minor differential expression can still contribute to the calculation of the enrichment score (ES). The GSE approach should also lead to a greater incidence of replication within array data, because it identifies the same biological processes underlying a particular phenotype. These arguments can also be applied to the interpretation of multiple weak association signals in genomewide association studies (GWAS) [2].
The most commonly used algorithm to detect the presence of enrichment for a particular gene set is the gene set enrichment analysis (GSEA) technique [1]. GSEA determines whether members of a gene set S tend to occur towards the top or the bottom of a ranked list L, indicating a correlation with a particular phenotype. Calculation of the GSEA requires N, the total number of genes being examined; k, the number of samples; S, the gene set of interest; and L, a list containing the N genes ranked by their correlation scores with a specific phenotype (L = {g_1, ..., g_N}). For each gene set, the values of P_hit - P_miss are calculated across all possible positions i in the list L. This quantity is the difference between the fraction of the genes in S that are present before a given position i (P_hit) and the fraction of all the N genes (except those in S) that are present before position i (P_miss).
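In symbols, the two fractions compared at each position can be written as below. This is a sketch of the unweighted form only: N_S (the number of genes in S) and the convention of counting positions up to and including i are assumptions made for illustration, and P_miss denotes the second fraction, the one computed over the genes outside S.

```latex
\[
P_{\mathrm{hit}}(S, i) = \frac{\left|\{\, g_j \in S : j \le i \,\}\right|}{N_S},
\qquad
P_{\mathrm{miss}}(S, i) = \frac{\left|\{\, g_j \notin S : j \le i \,\}\right|}{N - N_S}
\]
```

The quantity followed along the list is then P_hit(S, i) - P_miss(S, i), which returns to exactly zero at i = N because both fractions equal one there.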
The measure of whether there is a significant difference in expression values for a given gene set between two phenotypes is determined by the ES, which is the maximum value that this difference reaches over all positions i in the list L.
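The calculation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the reference GSEA implementation: every gene contributes an equal step (no correlation weighting), and the ES is taken as the largest deviation of the running difference from zero with its sign kept, so that sets concentrated at the bottom of L score negatively. The function name `enrichment_score` and the gene identifiers are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score for one gene set.

    ranked_genes: the list L, gene identifiers ordered by their
                  correlation with the phenotype (best first).
    gene_set:     the set S of gene identifiers of interest.
    """
    members = set(gene_set) & set(ranked_genes)
    n_total = len(ranked_genes)
    n_hit = len(members)
    if n_hit == 0 or n_hit == n_total:
        raise ValueError("gene set must overlap the ranked list without covering it")

    hit_step = 1.0 / n_hit                 # increment of the in-set fraction
    miss_step = 1.0 / (n_total - n_hit)    # increment of the out-of-set fraction

    p_hit = 0.0
    p_miss = 0.0
    best = 0.0
    for gene in ranked_genes:
        if gene in members:
            p_hit += hit_step
        else:
            p_miss += miss_step
        diff = p_hit - p_miss
        # Assumption: keep the signed deviation that lies farthest from zero.
        if abs(diff) > abs(best):
            best = diff
    return best


# Hypothetical example: both members of S sit at the top of a six-gene list.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
print(enrichment_score(ranked, {"g1", "g2"}))  # 1.0
```

A set whose members are scattered evenly through the list keeps the running difference close to zero, which is why the score rewards clustering at either end of L.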
Determining the statistical significance (P-value) of the ES for each gene set requires a permutation test. The phenotype labels are randomly permuted 1000 times, the ES for the gene set is re-calculated for each permutation and the P-value is estimated as the proportion of the 1000 random permutations that have an ES greater than or equal to the ES for the actual experimental data.
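The permutation procedure can be sketched as follows. The code is illustrative only and rests on several assumptions that are not taken from the text: genes are ranked by the difference in mean expression between the two phenotypes (GSEA implementations offer other ranking metrics), the ES is the unweighted signed running-sum maximum, and the P-value is one-sided, counting permutations whose ES is greater than or equal to the observed ES. All function and parameter names are hypothetical.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum ES: signed largest deviation from zero."""
    members = set(gene_set) & set(ranked_genes)
    n_hit = len(members)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranked list without covering it")
    running = 0.0
    best = 0.0
    for gene in ranked_genes:
        running += 1.0 / n_hit if gene in members else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def rank_genes(expression, labels):
    """Order genes by mean(phenotype 1) - mean(phenotype 0), largest first.

    expression: dict mapping gene -> list of k expression values.
    labels:     list of k phenotype labels, each 0 or 1.
    """
    def mean_difference(values):
        ones = [v for v, lab in zip(values, labels) if lab == 1]
        zeros = [v for v, lab in zip(values, labels) if lab == 0]
        return sum(ones) / len(ones) - sum(zeros) / len(zeros)

    return sorted(expression, key=lambda g: mean_difference(expression[g]), reverse=True)


def permutation_p_value(expression, labels, gene_set, n_perm=1000, seed=0):
    """Return (observed ES, estimated one-sided P-value)."""
    if set(labels) != {0, 1}:
        raise ValueError("labels must contain both phenotypes, coded 0 and 1")
    rng = random.Random(seed)
    observed = enrichment_score(rank_genes(expression, labels), gene_set)

    shuffled = list(labels)
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute phenotype labels, keeping group sizes
        null_es = enrichment_score(rank_genes(expression, shuffled), gene_set)
        if null_es >= observed:
            at_least_as_large += 1
    return observed, at_least_as_large / n_perm
```

Calling `permutation_p_value(expression, labels, {"g1", "g2"})` with a dictionary of per-gene expression vectors returns the observed ES together with the estimated P-value. With very few samples the number of distinct label permutations is small, so the attainable P-values are coarse regardless of how many random permutations are drawn.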
Although there are several variations on the original GSEA algorithm (including parametric analysis of gene set enrichment [5] and generally applicable gene set enrichment [6]), all means of calculating enrichment are highly dependent on the nature of the gene sets used. One major determinant of the ES in GSE is simply the size of the gene set. The use of larger sets results in higher statistical power and higher sensitivity where there is only slight enrichment, making them suitable for detecting subtle changes in gene expression. Conversely, a large gene set decreases sensitivity where there is a greater degree of enrichment. The composition of the gene set is also important, as each individual gene will have a varying degree of association with the specified trait that the set is designed to encapsulate [7]. GSE has weak power to detect a differentially expressed gene set where there is a mixture of strongly associated genes and weakly associated genes, as the calculated enrichment will not reflect the diversity of the expression values. It is also wrong to assume that genes with large changes in expression values make a stronger contribution to a pathway than those with smaller changes. In addition, some variation in expression levels may simply be a consequence of other signal regulation events (this is arguably a weakness of both the single gene method and of GSE). Here, we assess the current sources of gene sets and how gene expression data may be used to develop methodologies for the creation of new, more specific gene sets for GSE.