|Home | About | Journals | Submit | Contact Us | Français|
Global gene expression measurements are increasingly obtained as a function of cell type, spatial position within a tissue and other biologically meaningful coordinates. Such data should enable quantitative analysis of the cell-type specificity of gene expression, but such analyses can often be confounded by the presence of noise. We introduce a specificity measure Spec that quantifies the information in a gene's complete expression profile regarding any given cell type, and an uncertainty measure dSpec, which measures the effect of noise on specificity. Using global gene expression data from the mouse brain, plant root and human white blood cells, we show that Spec identifies genes with variable expression levels that are nonetheless highly specific of particular cell types. When samples from different individuals are used, dSpec measures genes’ transcriptional plasticity in each cell type. Our approach is broadly applicable to mapped gene expression measurements in stem cell biology, developmental biology, cancer biology and biomarker identification. As an example of such applications, we show that Spec identifies a new class of biomarkers, which exhibit variable expression without compromising specificity. The approach provides a unifying theoretical framework for quantifying specificity in the presence of noise, which is widely applicable across diverse biological systems.
Multicellular organisms have evolved a diversity of cell types, which attain their distinct identity and function through differential gene activity. An understanding of the global regulation of genes within specialized cells addresses fundamental biological questions, such as how different cell types carry out distinct functions, how new cell types evolve, and which genes are the best diagnostic markers for cancer cells (1–3). Recent studies have characterized genome-wide transcription of cell types within an organ, such as in mouse brain (4), the Arabidopsis root (5,6) and other complex tissues (7,8). A theoretical basis for analyzing such data is needed to address questions about the global structure of gene expression within an organism, e.g. which components of the genome are dedicated to the specialization of single cell types? How is gene expression at the genome level partitioned and reused among specialized cells?
While the concept of cell specificity is fundamental in developmental biology, the field lacks a measure that quantifies the biological concept of specificity. The need for a quantitative description of specificity arises from the inherent variability of gene expression within cells and cell types (9–12). For example, Figure 1a depicts three idealized genes whose distributions represent their biological variance in gene expression within three cell-type populations. Gene A varies in a narrow range in each cell type. Gene B's profile exhibits inherently more variability among target cells, giving it reduced specificity even though its mean expression level is the same as gene A. Gene C has virtually no specificity. How should such profiles be quantified with respect to cell-type specificity? Here we develop a quantitative measure, based on the information content of gene expression, which provides both a conceptual basis for describing cell type specificity in general and a quantitative approach that we apply to obtain a genome-wide view of cell-specific gene expression.
To obtain the estimate of the specificity measure (Spec) based on a few discrete samples from the distribution P(x|y) obtained from microarray data, we employ a conservative binning procedure that is designed to minimize spurious high specificities that could otherwise be obtained due to outlier replicates. The number of bins used is denoted nbin, and we use a value of nbin=3, except if indicated otherwise. See below for a justification of bin number. For each gene, we obtain the size of each bin as follows. In each cell type y, we calculate the harmonic mean of the expression level of the gene over all replicates, h(y). We find the maximum value of this quantity over all cell types, denoted hmax. The bin size is then given by hmax/nbin, with the first bin spanning expression levels (0, hmax/nbin), and continuing with equally spaced bins. Expression levels greater than hmax are binned into the last bin. In this way, the continuous distribution P(x|y) is replaced by a discrete distribution in x over the nbin bins; from this point, the computation of Spec proceeds as described in Figure 1. We note that each gene's bin size is chosen separately.
While the above procedure provides one way to estimate Spec, and works well on the data analyzed here, it is by no means the only way to estimate Spec, and may not necessarily be ideal in every scenario. For example, the distribution P(x|y) could be modeled as a Gaussian or other continuous probability distribution, in which case a parametric estimate of Spec can be made. However, such an approach is only useful in cases where the appropriate model of the noise is known for the given set of experiments, for each gene in each cell type. In the absence of such knowledge (e.g. for the data used here), a non-parametric approach such as our binning method is strongly preferred, since errors in the choice of model will result in large errors in the estimate of Spec. The shuffling controls presented in Figure 2 provide a rigorous test of the reliability of any given estimator, and should be carried out each time Spec estimation is performed on a new data set.
To test whether the bin-based estimator of Spec gave reliable results, given the small number of replicates available in each cell type (i.e. between two and four replicates per cell type), we constructed continuous probability distributions, P(x|y), using the plant microarray data. We calculated the mean and variance of the logarithm of expression levels, in each cell type. These were used to construct lognormal mock distributions of gene expression levels in each cell type. These distributions are merely for testing purposes, and do not necessarily reflect the true shapes of the distributions in each cell type. Using these mock distributions, we resampled the entire dataset, obtaining the same number of replicates in each cell type as existed in the real dataset. The constructed test dataset was used to obtain the bin-based estimate of Spec. The estimates were compared with the values obtained from the mock distributions by numerically integrating the lognormal distributions, according to the Spec formula given in Figure 1. The results are plotted in Supplementary Figure S5, and demonstrate that using three bins of expression provides a significant improvement over two bins. The improvement of using four bins over three bins is marginal, and we chose to use three bins for the entire analysis, since using fewer bins provides more statistical power when comparing the analysis with shuffled data. As the number of bins is increased beyond four, the correlation between estimated and true Spec values decreases, due to under-sampling of bins which results in spurious specificity values. Matlab and Perl code for Spec are available upon request.
The cell-type network was generated by identifying all genes whose expression domain size was 2 or 3 cell types (raw data, plant Supplementary Table S6, mouse Supplementary Table S7). By delineating the ‘fingers’ in Figure 2 that emanate from integral Nspec values along the dSpec=0 axis, genes were identified for expression domains of size d=2 or 3, according to the respective criteria 1.5≤Nspec<2.5 or 2.5≤Nspec<3.5. We also required a dSpec level <0.4, to avoid the false positive regions in Figure 2. Genes were then grouped into patterns based on their expression domain size d: the d cell types with the highest Spec values were labeled 1, and all other cell types were labeled 0. The number of genes g that comprised a significantly enriched pattern was identified by permuting gene expression values, generating Nspec values, and repeating the expression domain identification procedure. The observed value of g for each pattern was used to detect significantly enriched patterns, by requiring a value of g that was beyond the 95% percentile expected by chance assessed using the permuted data. This corresponded to having at least five genes display a pattern for the plant data and at least three genes for the mouse data; patterns satisfying this criterion were used as follows. The data was converted to an n-by-n matrix A, by summing the number of genes contributing to each pair of cell types occurring in each pattern. For example, for n=5 cell types, if 10 genes shared pattern (1, 1, 0, 0, 1) a value of 10 was added to the entries A12, A15 and A25, and so on for all significant patterns. To visualize major patterns, only edges with >100 genes for plants and >50 genes for mouse were drawn. The network was generated from the cell type by cell-type matrix using the Matlab biograph function with a hierarchal layout. Dendrograms in Figure 4 were generated using hierarchical clustering using Pearson correlation of overall gene expression values, and average linkage, after filtering out the 25% least varying genes in the dataset based on the variance of their average expression level across all cell types.
Lists of genes as described in the text were tested for over-representation of Gene ontology (GO) terms using Virtual Plant (13) (http://virtualplant.bio.nyu.edu) for Arabidopsis and GOrilla (14,15) for mouse. The background set was all genes in the Spec analysis (Supplementary Tables S1–S3), which excluded probes that mapped to two or more different transcripts. In each case, the hypergeometric distribution was used to assess overrepresentation.
The data for hormone marker analysis included all treatments for each hormone without corresponding controls (raw data; Supplementary Table S8). Treatments for any given hormone were used as replicates regardless of whether they were generated in the same experimental series or lab in order to test performance in a metadata analysis. Spec values were generated as described above with each hormone treated as a separate category (Supplementary Table S9). GenePattern scores were generated using the Comparative Marker Selection tool with a ‘One versus All’ analysis for each hormone category using default settings (Supplementary Table S10). Genes were ranked by their overall ‘score’ statistic, as output from GenePattern. The documented auxin-responsive gene list was gathered from the literature and GO annotations (Supplementary Table S4).
For the cell-type network analysis, microarray datasets for mouse neuronal cell types were obtained from (4) and Arabidopsis root tip cell types were obtained from (5,6,16). The publicly available mapping between microarray probes and genomic loci were used for each microarray (ftp://ftp.arabidopsis.org/home/tair/Microarrays/Affymetrix/). In most cases, Spec was computed for each locus based on a unique probe. In the rare cases when multiple probes mapped to a single genetic locus, the probe with the lowest Nspec value was used. All Affymetrix cell files were normalized with the MAS5 average intensity normalization method as implemented in the Affymetrix software. At a target intensity of 250, we determined empirically, using known markers, that a hybridization value of 50 represented a reliable expression signal. Thus, genes which did not express >50 were excluded from the analysis.
Published microarray datasets used for the hormone analysis were as follows (names of laboratories/datasets follow the conventions established in http://affymetrix.arabidopsis.info/narrays/help/usefulfiles.html). Abscisic Acid (ABA)—Helenius, Holman, RIKEN-GODA; Gibberellins (GA)—De Grauwe, Griffiths, Riken-Goda, RIKEN-LI; GA inhibitor—Riken-Goda; Auxin—Riken-Goda, Raghavan; Auxin inhibitor—Riken-Goda; Auxin transport inhibitor—Riken-Goda, Wenzel; Brassinosteroids—Riken-Goda; Brassinosteroid inhibitor—Riken-Goda; Cytokinin—Ljung, Riken-Goda, Sakakibara; Ethylene—Millenaar, Riken-Goda; Ethylene inhibitor—Riken-Goda; Jasmonate—Lewsey, Matthes, Riken-Goda; Salicylic Acid—Lewsey, Riken-Goda, St Clair. While all 13 types of hormones or inhibitors were used in the Spec analysis, Figure 7 displays the results for a subset of 10 selected hormones. For Human white blood cell profiles, all data and probe annotations were taken from Supplementary Data in (17). We used the quantile normalized, log2 transformed data in the file HaemAtlasMKEBNormalizedIntensities.csv.proc, which was obtained from ArrayExpress (E-TABM-633), converting values back to their original scale (anti-log2). We used a summary of well-documented CD marker by cell type expression (18) to infer a pattern of expression for 51 CD markers that we extracted from gene annotation files and manually annotated for presence/absence in seven cell types, filling a matrix with +1 (positive marker), −1 (negative marker), and 0 (unspecified marker) for each cell type. The megakaryocyte cell type was left out of the analysis because it was not specifically annotated in the summary table. A corresponding matrix of the Nspec and dSpec values was computed based on the mRNA expression profiles using each of the probes mapping to the known markers. The corresponding rows (CD markers) of the known expression table and the Nspec table were compared using Pearson correlation.
We begin by addressing the problem of cell-type specificity abstractly, and then tailor our approach to various kinds of data. We assume that a gene is characterized by a distribution of its expression level x, in each cell type y (out of n different cell types) and label this distribution P(x|y). The question of how to measure P(x|y) in practice is addressed in ‘Materials and Methods’ section. For now, let us imagine that P(x|y) is known to us and to a colleague, and suppose we play the following game. In a given cell type y, we make a single measurement of the gene expression level. We present this measurement x to our colleague, but we do not reveal the cell identity y. We would like to know, when presenting the measurement from cell type y, on average how much information regarding the cell type have we given to our colleague. For example, if there are many cell types, and a gene is expressed only in a single cell type y0, a measurement of the gene's level in y0 will be highly informative for our colleague, while a measurement from a different cell type will be uninformative. The measure Spec(y), defined below, quantifies this notion precisely.
To define Spec(y), we first invert the distribution using Bayes’ rule (19), to obtain P(y|x), i.e. the probability that a cell is of type y given that the gene's level is observed to be x (Figure 1b),
where we assume that our colleague has no a priori knowledge about cell types, i.e. P(y)=1/n, though one can easily incorporate a non-uniform prior. We then ask how informative is expression level x regarding the cell type. Figure 1b presents an idealized example, where a single gene has been measured in six cell types, and expression levels are binned into three possible states. Measurements in the ‘low’ range are uninformative, since five of the cell types are found to exhibit such levels. Measurements in the ‘medium’ and ‘high’ ranges are partially informative, since only two cell types have these levels, with the high range being more informative than the medium range. No measurement is maximally informative, since some ambiguity about the cell type remains no matter which level is measured.
We compute the ‘information’ (20), I(x), of gene expression level x:
By using the base of the logarithm to be n (the number of cell types), possible values of I(x) lie in the range from 0 (uninformative) to 1 (maximally informative). To obtain the ‘specificity value’, Spec(y), we average the values of I(x) over the distribution P(x|y) of gene expression levels for each cell type:
Intuitively, Spec(y) indicates the average amount of information about the cell's identity that is provided by a measurement of the gene's expression level in cell type y. In the Figure 1b example, Spec(y) is found to be highest in cell type C, intermediate in cell types E and B, and low in the other three types; I(x) indicates that this gene is not maximally informative at any level, and therefore Spec is not maximally high in any cell type.
We also define the quantity Nspec(y)n1−Spec(y) which is the effective number of cell types (not necessarily integral) specified by the gene's expression level in cell type y. This is equivalent to the effective number of states specified by a probability distribution using the exponential of the distribution's entropy (20). To summarize specificity of a gene, we will omit y, writing Nspec to indicate the maximal value of Nspec(y) over all cell types. A maximally specific gene has Nspec=1, while a minimally specific gene has an Nspec=n, the number of cell types.
Spec is related to a well-known information-theoretic measure, the mutual information (20), Imutual, which in our above formulation is given by
where P(x,y)=P(x)P(y|x) is the joint probability of x and y. If we average the values of Spec(y) over the cell type distribution P(y), we find
Thus, the average value of Spec(y) over all cell types gives the mutual information between cell type and gene expression level. Mutual information provides an overall measure of the specificity of a gene, but is not a cell-type specific measure and can give counter-intuitive results for highly specific genes. To see this, we return to the example of the gene that is expressed only in cell type y0. It is clear that the larger the number of cell types n, the lower the mutual information between this gene's expression level and the cell type; Spec(y0), on the other hand, will be 1 regardless of the number of cell types. Spec(y) is a more detailed and sensitive quantity than mutual information.
Depending on experimental approach, different types of noise exist in each data set. Typical sources of noise include (i) technical noise, due to the experimental preparation and measurement technique, (ii) sample composition noise, due to replicate samples, which may differ depending on how they were collected, and (iii) single-cell biological noise, due to inherently random processes within single cells. For the datasets we examine, which consist of large pools of cells, the main sources of noise are (i) and (ii). Our approach does not attempt to disentangle these different sources, and their sum total effect results in the distribution of expression levels P(x|y). We now consider how specificity is affected in genes with different levels of noise.
Consider Scenario A (low noise) in which two different genes are measured in 12 cell types (Supplementary Figure S1). Gene 1 is expressed at high level in two cell types, and at low level in the 10 other cell types. This gene has a Spec value of 0.72 in the two high-expression cell types, with a corresponding Nspec value of 2. Gene 2 is expressed at high level in three cell types, and at low levels in nine other cell types. Accordingly, in the three high-expression cell types, its Spec value is 0.56 with an Nspec value of 3. With low noise in each case, high Spec values identify the cell types that share a specific expression level, i.e. the gene's ‘expression domain’. However, in a separate Scenario B (high noise), expression levels in the first gene are noisy at low levels and its Spec value is thus significantly <0.72, with a value of 0.52 and an Nspec value of 3.3. Adding noise to a gene with an expression domain of two cell types leads to Spec values resembling a gene with an expression domain of three cell types.
To distinguish the lack of specificity due to larger expression domains from lack of specificity due to noise we generalize the Spec measure. We analyze the case of presenting our colleague with two independent measurements, x1 and x2, from the same cell type. In Scenario A above, two measurements provide no more information than one measurement, since low noise implies that both measurements will record similar values. In Scenario B, however, the two measurements are likely to differ in cell types in which expression is noisy, and to be identical in cell types with less noise. This results in excess information of two measurements versus one, which our colleague can use to distinguish among cell types. The lower the excess information, the more difficult it is to distinguish among cell types within the gene's expression domain, and the more specific of the domain as a unit is the gene. The measure dSpec(y), below, quantifies this notion.
To define dSpec, we first use Bayes’ rule to obtain the distribution of y conditional on the two measurements, x1 and x2:
The information is likewise generalized as I(x1,x2)=, and we define the quantity Spec2(y), which is the amount of information expected based on the two measurements in cell type y:
We then define dSpec as the relative excess information:
In Scenario B (Supplementary Figure S1), dSpec detects that the gene's specificity is affected by noise, with dSpec values of 0.21 in cell types 2 and 4, and dSpec values of 0.91 in types 5 and 11. In Scenario C, however, expression in three cell types is noisy, but with very similar noise profiles in all three types. Although noise is present, expression is nevertheless specific of three cell types (2, 4 and 7), resulting in an Nspec value of 3 (Spec=0.56), and a dSpec value of 0. Thus, noise does not necessarily reduce specificity, and dSpec effectively distinguishes and quantifies the effect of noise on specificity.
It is natural to consider further generalization, in which m independent measurements from cell type y are presented to our colleague, and a more general quantity Specm(y) is computed, for m>2. Clearly, the larger the number of measurements, the finer the differences in gene expression profiles P(x|y) that can be distinguished. While it is straightforward to work out the amount of information obtained in the limit of large m, this limit is not useful for our purposes, because it is dominated by the process of distinguishing among the most similar cell types, typically those belonging to the gene's expression domain. For example, for a gene expressed only in cell type y0, any noise in its expression level reduces its specificity in a way that is detected by dSpec(y0) and thus attributed to noise rather than to having a larger expression domain. If instead of m=2, we use a much larger number of measurements, we will find that Specm(y0)≈1, i.e. noise no longer reduces specificity, and for large enough m we may even have Specm(y)≈1 in many other cell types y≠y0. The value m=2 is thus appropriate for assessing the impact of noise on expression domains. One can think of dSpec as measuring the first-order or principal effect of noise on domains, whereas larger values of m can detect higher order or secondary effects.
We applied Spec to quantify transcriptome structure over a broad taxonomic range, using two datasets that profiled a large number of cell types within a specific organ, and calculated Spec for each gene in each cell type. The first dataset included profiles of 13 morphologically identifiable cell types in the Arabidopsis root (5,6,16), obtained by Fluorescence Activated Cell Sorting (FACS) of cell-type specific marker lines and profiled on the ATH1 microarray (Affymetrix). The second dataset consisted of 12 different types of mouse neurons (4), which were manually sorted, pooled, and applied to the MOE430A microarray (Affymetrix).
We define the gene's expression domain as the set of cell types for which the gene's expression level is most informative (see ‘Materials and Methods’ section). Nspec measures the size of the gene's expression domain, i.e. the effective number of cell types which the gene specifies in its expression pattern. For example, the gene in Scenario B (Supplementary Figure S1) has an Nspec value of 3.3, indicating that the gene is effectively informative of between three and four cell types, without forming a perfect pattern. The value of dSpec≈0.2 in the two highest Spec cell types (types 2 and 4) indicates the presence of noise in other cell types (types 5 and 11) which reduces specificity. Alternatively, for genes that display perfect patterns, dSpec values will be 0, and Nspec values will be integral.
The global distribution of specificity (Figure 2) displays the expression domain sizes and their noise levels across the genome. The bright spot at Nspec=1 indicates the existence of a large set of ‘private transcripts’, genes whose expression domain is exclusive to single cell types. One striking feature of the data is that a majority of genes have larger expression domains, i.e. they are not specific to a single cell type but are rather shared by at least two to five other cell types (Figure 2). These expression domains appear as bright fingers along the left side of the Figure 2 in the low-noise part of the distribution, which demonstrates that these are essentially clean patterns, with a small amount of gene expression noise. The shuffling analysis described in the caption to Figure 2 shows that these patterns are significant and indeed more prevalent than single cell-type specificity. While previous studies of genomic datasets identified particular expression patterns using clustering methods (5,6), the Spec-based analysis shows rigorously that the vast majority of the transcriptome exhibits multi cell-type expression domains, a feature that appears to be shared across the wide taxonomic distance between plants and mice.
The Spec measure provides a way to systematically study cell-type specificity across the genome by examining genes with increasingly larger expression domains, as shown in Figure 3 for two different cell types. The different Nspec intervals identify genes with different size expression domains that overlap a given cell type. The number of unique and shared transcripts in each cell type are shown in Supplementary Figure S3. We used the Spec measure to map expression domains onto specific cells using a network representation (Spec network), shown in Figure 4, in which each edge denotes a sufficiently large number of genes whose domains include the two cell types (see ‘Materials and Methods’ section). For comparison, we also generated a similarity tree based on averages of the normalized data.
The analysis reveals the transcriptional programs that are shared by neighboring tissues. For example, in plants, the Spec network reveals a strong transcriptional link between phloem cells and a subset of pericycle cells that neighbor them in the central cylinder (Figure 4); this connection is not apparent in the similarity tree, yet is supported by both the spatial proximity of the cell types and genetic perturbations that simultaneously affect both phloem and pericycle identity adjacent to the phloem (21). In mouse, the similarity tree is unable to establish the affinity between amygdala cells collected from adjacent layers in the brain, possibly because they were harvested from mice of different genetic backgrounds (Figure 4). However, the Spec network reveals a large set of transcripts shared between them.
We also find that transcripts can be shared by cell types that are separated spatially across organs. For example, quiescent center (QC), which supports the growth of stem cells in the root niche (22,23), was connected largely to adjacent cell types, as in the similarity tree (Figure 4). However, the QC also shared one major edge with progenitor cells of the lateral root meristem, a distant cell type that is nonetheless also associated with root growth (24,25) (Figure 4). A few critical regulators for meristem function, such as PLETHORA2 (26), are known to be specifically expressed in both stem cell populations. However, the Spec network identifies a more global functional link between these cell types. For example, genes that share a high Spec in QC and lateral root meristem were over-represented in plastid formation (plastid fission, P<10−2; Supplementary Table S1). In support of this functional link, mutants in plastid formation have been shown to have stem cell-specific defects in plant roots (27). Once again, the specific connection between the two cell types was not obvious from the similarity tree or common clustering routines (Figure 4 and Supplementary Figure S2). In the mouse brain, the Spec network linked together four cell types or subtypes within a part of the classically defined limbic system, the hippocampus and amygdala, for which functional and synaptic ties have been noted by recent morphological and molecular genetic studies (28). Interestingly, in both organisms, we found that certain functional categories of genes tended to have similar domain sizes even if they were expressed in different cell types (Supplementary Tables S2 and S3).
As shown in Figures 2 and and3,3, high Spec values identify highly specific gene expression patterns, with dSpec assessing the noisiness of the pattern. These metrics should also be useful for biomarker discovery, as is frequently sought in cancer diagnostics (29,30). We tested this possibility by using two case studies in which large datasets and known markers for specific conditions were available.
The first dataset consisted of expression profiles of eight different white blood cell types from seven healthy subjects analyzed on the Illumina BeadChip platform (17). This dataset provided an opportunity to test Spec against the large and well-documented set of cluster of differentiation (CD) markers, whose specificity varies expression in one to many of the cell types (31). We reasoned that, in a large-scale validation using markers selected to reflect on–off states, mRNA expression patterns should broadly reflect the protein levels measured by cell-surface markers. In addition, the samples in this dataset represented individuals, permitting us to use dSpec to assess robustness of expression in replicate cell types, a measure of cell-type plasticity among individuals that has clinical relevance.
We coded known positive and negative markers for each cell type into a table of +1 and −1 values, respectively, and correlated the table to Nspec values generated from the empirical data in the expression study (see ‘Materials and Methods’ section, Figure 5). The highest correlation was among the narrowly expressed markers, those present in one, two or three cell types (average r=0.72, 0.85, 0.63 respectively), with 12 out of these 26 markers having correlation 0.97 or higher. Correlation drops off rapidly for markers that were broadly expressed (i.e. in four, five, six, or seven cell types). The results showed that Spec is highly effective in identifying known markers, including markers with more complex patterns of expression in several cell types and absence in others.
The analysis also illustrates the potential of dSpec to assess robustness for a given marker at high resolution. For example, CD14 is specific to monocytes and granulocytes, as known from its use as a cell surface marker and predicted by its Nspec values. The marker's dSpec values are relatively low in its negative cell types but moderately high in one of its positive cell types (granulocytes). Such a noise profile suggests that the marker is prone to false negatives but not false positives, a case where noise may be tolerable when multiple markers are used. The marker exhibits a reliable-when-present plasticity such that when it is present, it identifies granulocytes with high confidence. However, not all individuals express the marker in granulocytes. On the other hand, CD123 is highly specific to granulocytes, where it exhibits low noise (low dSpec), while its dSpec is high in all cell types other than granulocytes. Such a profile suggests that CD123 is highly informative of granulocytes but its high noise in the other cell types, where its expression levels vary among individuals, makes it potentially prone to false positives. The marker exhibits a plasticity in which virtually all individuals express CD123 at a specific level in granulocytes yet individuals show variable expression levels in almost all other cell types.
Overall, we find that mRNA expression plasticity as measured by dSpec increases for markers transcribed in an increasingly larger number of cell types (Figure 5). Such a trend, as detected by Spec's quantification of specificity and noise, has practical implications. For example, if a negative marker is needed to sort against a cell population, that population will be more reliably identified by a set of narrow markers than a single broad marker (assuming that mRNA noise levels reflect protein-level noise in the case of cell-surface markers). Thus, Spec and dSpec provide a way to assess marker specificity and plasticity, respectively; in particular, dSpec can detect whether specificity is reduced due to noise inside or outside of the gene's expression domain.
In the second dataset, we used a large compendium of hormone treatment microarray data from many different labs (http://affymetrix.arabidopsis.info/narrays/experimentbrowse.pl). The use of this dataset permitted us to benchmark Spec against one other common biomarker discovery tool using a large set of genes known to respond in one condition, auxin treatment (Supplementary Table S4). We asked whether Spec could make use of the information in a meta-analysis consisting of 250 samples collected in 14 different labs.
A number of quantitative approaches have been applied to finding biomarkers, including support vector machines, neural networks and others (32–36). We tested Spec against one highly cited and well-documented tool, GenePattern (1,37), which bases pattern discovery on a t-test with calculated false discovery rates. The test set contained treated samples for 13 hormones or hormone inhibitors, with no controls included (since the 12 other classes served as background or non-target classes). In each case, we tested for auxin-specific responses, examining the top ranked genes for each algorithm to compare performance of the two approaches (Figure 6A). Spec was notably precise at stringent levels. Among the top 20 ranked markers identified by each method, Spec identified 10 known markers while GenePattern identified two (Supplementary Table S5). Both methods capture highly specific markers with consistently high expression in the target class and low expression in the non-target class, although we note this is a relatively small percentage of the known markers. At low stringency levels (>250 highest ranked), GenePattern was able to find more markers than Spec (24 versus 15 in the top 500 ranked) but precision at these stringency levels was very low.
Another unique property of Spec as a biomarker discovery tool is its ability to identify complex markers that are informative of more than one condition, as demonstrated in Figures 2–4. For example, the documented auxin-responsive transcript (AT4G34770) has a relatively high Spec for auxin and for ethylene (Figure 7, purple circle). Interestingly, its noise within the auxin data is relatively high and it is not likely to be identified as an auxin marker using traditional statistical methods. Similarly, brassinosteroid-6-oxidase 2, which is involved in brassinosteroid biosynthesis, is induced by both brassinosteroid and GA inhibitors, the latter of which exhibits high variance within treatments and would likely have been missed by statistical methods (Figure 7, green circle). Thus, analogous to its potential in linking the function of cells, the quantification of specificity provides a way of linking common responses to diverse conditions.
We have described a rigorous approach that uses information theory to formalize the concept of specificity in gene expression and to quantify cell identity in the presence of noise. More generally, the approach is applicable when biological measurements x (e.g. mRNA expression levels, protein abundances, epigenetic modifications, etc.) are mapped onto a biological organization y (e.g. cell types, spatial structure, treatments, disease states, etc.)—and the mapping is given by a probability distribution P(x|y). Information-based approaches in developmental biology have previously focused on transmission of information within developmental regulatory circuits (38,39). Our application of Spec here addresses a novel question, namely how much information does a gene's expression level provide about a cell's identity. As such, Spec provides both a unifying conceptual framework and a measurement tool in the study of cell identity, and with it the ability to quantify on a genome-wide scale this central concept of developmental biology.
The formulation presented here makes it possible to distinguish noise that detracts from information on cell-type specificity from noise which does not, by using the measure dSpec. This unique feature of the method is generally missing in purely statistical approaches, such as analysis of variance or measures based on the t-test. The ability to cope with various sources of noise in data reveals critical connections between cell types, as well as unique and promising classes of biomarkers. When technical replicate data are used, dSpec detects cases when variability among samples detracts from specificity. When samples from different individuals are used, dSpec provides a measure of transcriptional plasticity for each gene in each cell type. In its biomarker applications, dSpec therefore makes it possible to pinpoint cell types in which plasticity is most likely to result in false positives, and to identify markers that are more likely to be reliable for a given target class.
The Spec analysis provides a new tool to explore the mosaic character of gene expression in a multicellular organism. In the plant and animal examples examined, most specifically expressed genes have expression domains containing between two and five cell types (Figures 2 and and3),3), with shared gene expression describing potentially novel mechanistic links between cell types. The quantification of specificity and noise also reveals the limits of complex pattern detection, i.e. patterns consisting of more than five or six cell-type domains could not be precisely delineated given the noise in the data (Figure 2). The genomic signature of cell-type specificity, i.e. the bright fingers in Figure 2, is notably absent in the hormone dataset (Supplementary Figure S4), demonstrating that the signature is neither a generic feature of microarray data nor an artifact of the approach. In addition, cell identity has a component of specifically absent gene activity, i.e. transcripts expressed in all but one or a few cell types (Figure 3).
As a quantitative measure of specificity, Spec also opens possibilities for phylogenetic studies to map changes in cellular complexity during evolution (3,40–43). The next generation of genomics may allow entire transcriptomes to be routinely measured in individual cells, rather than in pooled samples. It will be particularly interesting to see individual differences among single cells using Spec, and to determine which parts of the overall genomic distribution of specificity, shown in Figure 2, are maintained at the single cell level, and which new aspects of specificity are revealed. By virtue of having information theory at its basis, Spec provides a consistent framework for comparing our current measurements of specificity, with those enabled by future technological advances in genomics.
Spec's significantly higher precision in biomarker identification is due to its handling of noise, which permits markers to exhibit significant variability within the target class (Figure 6B). This is arguably a common property of many otherwise reliable markers, which are specific to the target condition but variable in their response. On the other hand, most of the markers missed by Spec and identified by GenePattern showed high levels of noise in the non-target set but a consistently higher level of expression in the target set (Figure 6B). Such markers could be prone to false positive tests and may pose a problem for diagnostics.
We have described the formulation and application of Spec, an information-theoretic specificity measure that allows a rigorous quantification of cell identity in biological systems. Using information theory as the basis for measuring specificity allows both ease of interpretation and flexibility of application. Moreover, it necessitates the incorporation of noise as an integral component of the specificity measure. As we have shown, this critical facet of our approach allows features of genomic data that are typically ignored or discarded due to variability to be meaningfully analyzed and quantified. Without our explicit accounting of noise, many of the structures revealed in the datasets we have analyzed (transcription patterns, biomarkers, cell-type connections) would have been entirely missed. The flexibility and generality of the method, combined with its rigorous treatment of noise, provide a powerful approach for quantitative analysis of specificity in biology.
Supplementary Data are available at NAR Online.
Burroughs Wellcome Career Award at the Scientific Interface (to E.K.); National Institutes of Health (grant R01 GM078279 to K.D.B.). Funding for open access charge: Burroughs Wellcome Fund and National Institutes of Health.
Conflict of interest statement. None declared.
The authors would like to thank Mark Siegal, Erik van Nimwegen and three anonymous referees for their comments and careful reading of the manuscript.