Cells sharing the same genomic information are able to express it in different ways to achieve cell-specific functions or to respond to environmental changes. Transcriptional regulation is the first step at which this specificity is determined, as it is the most basic level at which gene expression is controlled. Recent surveys of transcriptomic data across numerous cell types revealed two broad categories of gene expression: ubiquitous expression, and tissue- or cell-type-specific expression [1,2]. The first category contains genes that are expressed in most tissues at similar levels and are thought to provide core cellular functionality [3,4]. The second category comprises genes with distinct expression in a few tissues or conditions, which are likely to be important for defining cell-specific functions.
In datasets with only a few conditions, it is possible to compare pairs of conditions using standard or moderated t-tests [5-7]. However, this becomes impractical with large datasets, as the number of pairwise comparisons increases quadratically with the number of conditions studied. An alternative method is the non-standard ANOVA, which tests all possible groups of samples against each other. However, this involves computationally intensive dynamic programming and cannot detect specificity in individual conditions. Moreover, the method requires equal standard deviations across all groups of conditions being compared. This cannot be assumed, as genes might have similar expression levels in some conditions (and thus small standard deviations) and more divergent expression levels in others. A further alternative is the Tukey test, although this method requires independence between groups of conditions and a normal distribution of group means, criteria that are often not met in microarray experiments. Importantly, most of these and other methods assume that expression values follow a single normal distribution. This assumption is generally not satisfied, which means that the methods do not model the data correctly and therefore lead to false positive results [8].
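To make the scaling of pairwise testing concrete, the following minimal sketch counts the unordered pairs of conditions, n(n-1)/2 for n conditions. The function name is hypothetical and the condition counts are illustrative; each pair would additionally have to be tested for every gene in the dataset.

```python
from math import comb


def pairwise_comparisons(n_conditions: int) -> int:
    """Number of unordered pairs of conditions, n * (n - 1) / 2."""
    return comb(n_conditions, 2)


# Illustrative condition counts (not taken from any particular dataset).
for n in (4, 32, 100):
    print(n, pairwise_comparisons(n))
```

For 32 conditions this already gives 496 pairs per gene, which illustrates why pairwise tests do not scale well to large expression atlases.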
An alternative to these approaches is a mixture model-based procedure to model gene expression. EMMIX-GENE [9] and EMMIX-FDR [10] are software packages that apply this technique to cluster genes displaying similar expression patterns. However, these packages were not specifically developed to detect condition-specific expression, and therefore cannot be readily applied for this purpose on large datasets. Moreover, they are not implemented in commonly used analysis platforms such as Bioconductor, making them difficult to integrate with additional analysis pipelines.
Two additional methods were recently developed with the specific aim of identifying condition-specific gene expression. First, a method called ROKU [11] uses Shannon's information-theoretic entropy followed by an outlier detection method [12] to detect tissue specificity. This method is implemented in the Tissue Specific Genes Analysis (TSGA) R package [13]. It returns a list of conditions in which each gene is specifically expressed. Unfortunately, this method depends on a pre-defined set of ubiquitously expressed genes to model background expression levels, information that is generally not available prior to analysis. Furthermore, the TSGA method produces qualitative outputs: a gene is classified as either condition-specific or not, without ranking genes or conditions, which makes the resulting lists difficult to prioritize for further analysis. Second, Vaquerizas et al. [2] previously used a propensity measure for a given gene to be expressed at a certain level in particular conditions relative to its expression across other conditions. The method provides a ranking of condition-specificity across samples. However, there is no control over the number of conditions in which a gene can be specific and there is no statistically meaningful threshold for specificity. Therefore, to our knowledge there is currently no straightforward and statistically robust method available to detect condition-specific gene expression.
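As an illustration of the entropy idea mentioned above, the sketch below computes the Shannon entropy of a single gene's expression profile after normalizing it to proportions. It covers only the plain entropy calculation, not the full ROKU procedure or its outlier detection step. The normalization, the function name and the example profiles are assumptions made for this example.

```python
import math


def expression_entropy(profile):
    """Shannon entropy (in bits) of one gene's expression across conditions.

    Assumes non-negative expression values; the profile is normalized so
    that the values sum to 1 before the entropy is computed.
    """
    if any(v < 0 for v in profile):
        raise ValueError("expression values must be non-negative")
    total = sum(profile)
    if total == 0:
        raise ValueError("profile has no expression in any condition")
    proportions = [v / total for v in profile]
    # 0 * log2(0) is taken as 0, so zero entries are skipped.
    return sum(-p * math.log2(p) for p in proportions if p > 0)


# Hypothetical profiles over eight conditions.
uniform = [5.0] * 8                                      # ubiquitous expression
specific = [0.0, 0.0, 0.0, 100.0, 0.0, 0.0, 0.0, 0.0]    # one specific condition

print(expression_entropy(uniform))
print(expression_entropy(specific))
```

A gene expressed equally in all eight conditions reaches the maximum of log2(8) = 3 bits, whereas a gene expressed in a single condition has an entropy of zero. Low entropy therefore indicates specificity, but the value alone does not say which conditions are specific.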
Here we present a new method called SpeCond (for Specific Condition) to detect condition-specificity from a dataset of gene expression measurements. The method fits a normal mixture model to the expression profile of each gene, and identifies outlier conditions. We compare SpeCond against several alternative approaches using a gold standard dataset and demonstrate that SpeCond outperforms other methods. Finally, we apply the SpeCond approach to a subset of the Genomics Institute of the Novartis Research Foundation (GNF) SymAtlas dataset [14], and identify specifically expressed genes from 32 human tissue samples. The method is freely available as an R package within the Bioconductor software project [15-17] at [18].
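The general idea of fitting a normal mixture model to one gene's profile and flagging outlier conditions can be sketched as follows. This is not the SpeCond implementation: it is a simplified Python illustration, and the restriction to two components, the BIC comparison against a single normal distribution, the posterior cutoff, the `min_sd` floor and all function names are assumptions made for this example.

```python
import math
from statistics import mean, median, pstdev


def normal_pdf(x, mu, sd):
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posteriors(values, ws, mus, sds):
    """Posterior probability of each component for each value."""
    out = []
    for x in values:
        d = [w * normal_pdf(x, m, s) for w, m, s in zip(ws, mus, sds)]
        total = sum(d)
        # If every density underflows to zero, split the responsibility evenly.
        out.append([dk / total for dk in d] if total > 0 else [1.0 / len(d)] * len(d))
    return out


def fit_two_component_mixture(values, n_iter=100, min_sd=0.05):
    """EM for a two-component 1D normal mixture (assumed, simplified setup)."""
    n = len(values)
    centre = median(values)
    # Start one component at the median and one at the most distant value.
    mus = [centre, max(values, key=lambda v: abs(v - centre))]
    sds = [max(pstdev(values), min_sd)] * 2
    ws = [0.5, 0.5]
    for _ in range(n_iter):
        resp = posteriors(values, ws, mus, sds)          # E-step
        for k in range(2):                               # M-step
            nk = sum(r[k] for r in resp)
            ws[k] = nk / n
            if nk < 1e-12:
                continue  # empty component: keep its previous mean and sd
            mus[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, values)) / nk
            # The floor on the sd prevents a component collapsing to zero width.
            sds[k] = max(math.sqrt(var), min_sd)
    return ws, mus, sds


def log_likelihood(values, ws, mus, sds):
    return sum(
        math.log(max(sum(w * normal_pdf(x, m, s) for w, m, s in zip(ws, mus, sds)), 1e-300))
        for x in values
    )


def specific_conditions(values, posterior_cutoff=0.5, min_sd=0.05):
    """Indices of conditions unlikely to belong to the major component."""
    n = len(values)
    one = ([1.0], [mean(values)], [max(pstdev(values), min_sd)])
    two = fit_two_component_mixture(values, min_sd=min_sd)
    # BIC = -2 * log-likelihood + (number of free parameters) * log(n)
    bic_one = -2.0 * log_likelihood(values, *one) + 2 * math.log(n)
    bic_two = -2.0 * log_likelihood(values, *two) + 5 * math.log(n)
    if bic_one <= bic_two:
        return []  # a single normal distribution is preferred: nothing flagged
    ws, mus, sds = two
    major = 0 if ws[0] >= ws[1] else 1
    post = posteriors(values, ws, mus, sds)
    return [i for i, r in enumerate(post) if r[major] < posterior_cutoff]


# Hypothetical log-scale profile: ten similar conditions and one high value.
profile = [5.0, 5.2, 4.9, 5.1, 5.3, 4.8, 5.0, 5.1, 4.9, 5.2, 12.0]
print(specific_conditions(profile))
```

In this toy profile only the last condition is flagged. The result depends on the arbitrary `min_sd` floor and the posterior cutoff, so the sketch conveys the principle rather than a calibrated test.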