We have described a rigorous approach that uses information theory to formalize the concept of specificity in gene expression and to quantify cell identity in the presence of noise. More generally, the approach is applicable when biological measurements x (e.g. mRNA expression levels, protein abundances, epigenetic modifications, etc.) are mapped onto a biological organization y (e.g. cell types, spatial structure, treatments, disease states, etc.) and the mapping is given by a probability distribution P. Information-based approaches in developmental biology have previously focused on transmission of information within developmental regulatory circuits (38). Our application of Spec here addresses a novel question, namely how much information a gene's expression level provides about a cell's identity. As such, Spec provides both a unifying conceptual framework and a measurement tool in the study of cell identity, and with it the ability to quantify on a genome-wide scale this central concept of developmental biology.
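To make the question of how much information an expression level carries about cell identity concrete, the following minimal sketch estimates the standard mutual information between discretized expression levels and cell-type labels. It is not the Spec formula itself: the function name `mutual_information`, the binning into "high"/"low", and the toy data are all assumptions introduced here for illustration.

```python
import math
from collections import Counter

def mutual_information(levels, cell_types):
    """Mutual information I(X;Y) in bits between discretized expression
    levels X and cell-type labels Y, estimated from paired samples."""
    n = len(levels)
    joint = Counter(zip(levels, cell_types))  # counts of (level, type) pairs
    px = Counter(levels)                      # marginal counts of levels
    py = Counter(cell_types)                  # marginal counts of cell types
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

# Hypothetical toy data: four cell types, two replicates each.
cell_types = ["A", "A", "B", "B", "C", "C", "D", "D"]
marker = ["high", "high", "low", "low", "low", "low", "low", "low"]
uniform = ["low"] * 8

mi_marker = mutual_information(marker, cell_types)    # expressed only in type A
mi_uniform = mutual_information(uniform, cell_types)  # same level everywhere
```

In this toy case the gene expressed only in type A carries a positive amount of information about cell identity, while the uniformly expressed gene carries none, which is the qualitative behavior a specificity measure is expected to show.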
The formulation presented here makes it possible to distinguish noise that detracts from information on cell-type specificity from noise which does not, by using the measure dSpec. This unique feature of the method is generally missing in purely statistical approaches, such as analysis of variance or measures based on the t-test. The ability to cope with various sources of noise in data reveals critical connections between cell types, as well as unique and promising classes of biomarkers. When technical replicate data are used, dSpec detects cases when variability among samples detracts from specificity. When samples from different individuals are used, dSpec provides a measure of transcriptional plasticity for each gene in each cell type. In its biomarker applications, dSpec therefore makes it possible to pinpoint cell types in which plasticity is most likely to result in false positives, and to identify markers that are more likely to be reliable for a given target class.
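The distinction between noise that detracts from specificity and noise that does not can be illustrated with a simplified stand-in for the measures discussed here. The sketch below is not the dSpec definition: it assumes a uniform prior over cell types, uses a normalized per-level information of 1 - H(Y|x)/log2(n), and the function name `target_information` and the binned replicate data are hypothetical.

```python
import math
from collections import Counter

def target_information(samples, target):
    """Average information (0..1) that the target's expression bins carry
    about cell identity. `samples` maps cell type -> list of binned levels.
    Assumes a uniform prior over cell types (an assumption of this sketch)."""
    types = list(samples)
    n = len(types)
    # P(x | y): fraction of replicates of type y that fall in bin x
    p_x_given_y = {y: {x: c / len(v) for x, c in Counter(v).items()}
                   for y, v in samples.items()}

    def info(x):
        # Posterior P(y | x) under a uniform prior, then 1 - H(Y|x)/log2(n)
        weights = [p_x_given_y[y].get(x, 0.0) for y in types]
        total = sum(weights)
        h = -sum((w / total) * math.log2(w / total) for w in weights if w > 0)
        return 1.0 - h / math.log2(n)

    levels = samples[target]
    return sum(info(x) for x in levels) / len(levels)

# Gene A: target replicates vary (bins 3, 4, 5) but never overlap other types.
gene_a = {"T": [3, 4, 5], "U": [0, 0, 0], "V": [0, 0, 0], "W": [0, 0, 0]}
# Gene B: same spread, but one target replicate falls into the background bin.
gene_b = {"T": [0, 3, 5], "U": [0, 0, 0], "V": [0, 0, 0], "W": [0, 0, 0]}

info_a = target_information(gene_a, "T")
info_b = target_information(gene_b, "T")
```

Both hypothetical genes are variable within the target class, yet only gene B loses information, because its variability pushes a replicate into a bin shared with the other cell types. A purely variance-based statistic would penalize both genes alike.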
The Spec analysis provides a new tool to explore the mosaic character of gene expression in a multicellular organism. In the plant and animal examples examined, most specifically expressed genes have expression domains containing between two and five cell types ( and ), with shared gene expression describing potentially novel mechanistic links between cell types. The quantification of specificity and noise also reveals the limits of complex pattern detection, i.e. patterns consisting of more than five or six cell-type domains could not be precisely delineated given the noise in the data (). The genomic signature of cell-type specificity, i.e. the bright fingers in , is notably absent in the hormone dataset (Supplementary Figure S4), demonstrating that the signature is neither a generic feature of microarray data nor an artifact of the approach. In addition, cell identity has a component of specifically absent gene activity, i.e. transcripts expressed in all but one or a few cell types ().
As a quantitative measure of specificity, Spec also opens possibilities for phylogenetic studies to map changes in cellular complexity during evolution (3). The next generation of genomics may allow entire transcriptomes to be routinely measured in individual cells, rather than in pooled samples. It will be particularly interesting to examine individual differences among single cells using Spec, and to determine which parts of the overall genomic distribution of specificity, shown in , are maintained at the single-cell level, and which new aspects of specificity are revealed. By virtue of having information theory at its basis, Spec provides a consistent framework for comparing our current measurements of specificity with those enabled by future technological advances in genomics.
Spec's significantly higher precision in biomarker identification is due to its handling of noise, which permits markers to exhibit significant variability within the target class (B). This is arguably a common property of many otherwise reliable markers, which are specific to the target condition but variable in their response. On the other hand, most of the markers missed by Spec and identified by GenePattern showed high levels of noise in the non-target set but a consistently higher level of expression in the target set (B). Such markers could be prone to false positive tests and may pose a problem for diagnostics.
We have described the formulation and application of Spec, an information-theoretic specificity measure that allows a rigorous quantification of cell identity in biological systems. Using information theory as the basis for measuring specificity allows both ease of interpretation and flexibility of application. Moreover, it necessitates the incorporation of noise as an integral component of the specificity measure. As we have shown, this critical facet of our approach allows features of genomic data that are typically ignored or discarded due to variability to be meaningfully analyzed and quantified. Without our explicit accounting of noise, many of the structures revealed in the datasets we have analyzed (transcription patterns, biomarkers, cell-type connections) would have been entirely missed. The flexibility and generality of the method, combined with its rigorous treatment of noise, provide a powerful approach for quantitative analysis of specificity in biology.