|Home | About | Journals | Submit | Contact Us | Français|
Documenting the extent of cellular diversity is a critical step in defining the functional organization of tissues and organs. To infer cell type diversity from partial or incomplete transcription factor expression data we devised a sparse Bayesian framework that is able to handle estimation uncertainty, and can incorporate diverse cellular characteristics to optimize experimental design. Focusing on spinal V1 inhibitory interneurons, for which the spatial expression of 19 transcription factors has been mapped, we infer the existence of approximately 50 candidate V1 neuronal types, many of which localize in compact spatial domains in the ventral spinal cord. We have validated the existence of inferred cell types by direct experimental measurement, establishing this Bayesian framework as an effective platform for cell type characterization in the nervous system and elsewhere.
Tissues and organs are comprised of diverse cell types, possessing characteristic morphology and specialized function. The diversification of cell types attains prominence in the nervous system, where neuronal distinctions depend on the activities of transcription factors (TFs) and their downstream effectors (Kohwi and Doe, 2013). Attempts to define the link between transcriptional identity and neuronal diversity have benefitted from the analysis of long-distance projection neurons, for which distinctions in target innervation provide a clear correlate of functional divergence (Molyneaux et al, 2007; Sanes and Masland, 2015). In the retina and cerebral cortex, functional subclasses of ganglion and pyramidal neurons have been delineated through their transcriptional identities (Siegert et al, 2009; Greig et al, 2013). Similarly, the hierarchical ordering of motor neuron subtypes in the spinal cord has its origins in discrete profiles of transcription factor expression (Dasen et al, 2005; Dasen and Jessell, 2009). Yet, local interneurons represent by far the most prevalent neurons within the mammalian CNS, collectively shaping the output of long-range-projection neurons (Isaacson and Scanziani, 2011). The local confinement of interneuron axons, however, has made it difficult to obtain objective measures of identity and diversity.
Genome-wide mRNA expression profiles have been informative in distinguishing neuronal cell types (Usoskin et al, 2015; Zeisel et al, 2015; Macosko et al, 2015, Tasic et al, 2016). Nevertheless, documented dissociations between mRNA and protein expression (Gygi et al, 1999; Vogel & Marcotte, 2012) emphasize the merits of analysis of protein expression at the level of individual neurons (Sharma et al, 2015). But if many genes are involved in defining individual subpopulations, then the validation of protein co-expression will be constrained by the limited repertoire of primary and secondary antibodies.
This practical limitation could be overcome through the development of a statistical method that is able to resolve the extent of neuronal diversity from sparsely sampled transcriptional datasets. Such a method should provide: (i) an objective measure of confidence in the existence of cell types and their prevalence within a parental population, (ii) improvement in estimation accuracy upon integrating independent cellular characteristics with molecular phenotype, and (iii) informative predictions to guide further experiments. To meet these goals we developed a sparse Bayesian framework that models co-expression data based on incomplete combinations of TFs. Our focus on TF expression was governed by the well-established role of DNA-binding proteins in defining neuronal identity (Dalla Torre di Sanguinetto et al, 2008; Amamoto and Arlotta, 2014).
We used this Bayesian approach to assess the diversity of V1 interneurons in the spinal cord, a major inhibitory interneuron population implicated in motor control (Zhang et al, 2014). V1 interneurons are defined by developmental expression of the homeodomain transcription factor En1 (Saueressig et al, 1999) and include Renshaw cells and Group Ia reciprocal interneurons, which mediate recurrent and reciprocal inhibition, respectively (Sapir et al, 2004 and Zhang et al, 2014). Yet these two physiologically defined subtypes represent only a small fraction of the parental V1 population (Alvarez et al, 2005), implying a greater diversity of V1 neurons. Indeed, the V1 population has recently been subdivided on the basis of the expression of 19 TFs (Bikoff et al, 2016).
Here we apply a sparse Bayesian approach to assess the diversity of V1 neuronal subtypes marked by these TFs. Our analysis has generated three major findings: (i) an estimate of the number of V1 interneuron types, on the order of 50 subtypes; (ii) an estimate of the expression profiles of the 19 TFs across these inferred types; and (iii) a cladistic description of V1 diversity that guides experiments aimed at monitoring and manipulating distinct V1 subpopulations. In several instances, the predicted assignment of V1 neuronal type has been validated through single-cell quantitative RT-PCR, immunohistochemistry and assessment of spatial distribution. Finally, we demonstrate that this sparse Bayesian analysis serves as an effective platform for identifying cell types within other diverse populations.
The companion paper (Bikoff et al, 2016) used comparative microarray screening to identify and map the expression of 19 TFs in V1 interneurons. We used three sets of data to infer V1 neuronal diversity: (i) the fraction of neurons within the parental V1 population that express each of the 19 TFs (Figure 1A), (ii) the fractions of neurons co-expressing various pairs of TFs (Figure 1B), and (iii) the position of V1 interneurons expressing each of the 19 TFs (Figure 3A). Complete analysis of all TF pairs, however, is hindered by the fact that primary antibodies generated in the same host species cannot be distinguished easily by fluorescently-tagged secondary antibodies. We therefore developed an approach that permits statistical inference on the basis of incomplete data.
In this statistical analysis, a cell type is defined by the expression pattern of the 19 TFs under consideration. We characterize TFs as either expressed or not expressed, and thus each expression pattern is specified by a vector of 19 binary numbers, Jk,a, for pattern k, with a ranging from 1 to 19. Jk,a is set to 1 if TF a is expressed in expression pattern k, and to 0 if it is not. This results in 219 possible binary expression patterns for the 19 TFs. This large number could be reduced by eliminating combinations that include pairs of factors not co-expressed within the same neuron. Analysis of the pairwise expression revealed that 67 out of 148 measured TF pairs fail to co-express (Figure 1B), thereby reducing the possible diversity to 1,978 potential expression patterns (Supplemental Information, Section Statistical Model). Thus, for the variables Jk,a specifying the expression patterns, k runs from 1 to 1,978.
We designate the fraction of cells with expression pattern k, the cell-type fraction, denoted by fk, with k again ranging across all the potential expression patterns (1 to 1,978). Cell-type fractions must be positive (fk ≥ 0) and sum to 1 (Σk fk = 1), indicating that the entire V1 population is accounted for. The fraction of V1 neurons expressing TF a (the data in Figure 1A) is Σk fk Jk,a, and the fraction co-expressing factors a and b (the data in Figure 1B) is Σk fk Jk,a Jk,b (Supplemental Information). Fitting data within this framework amounts to choosing a set of cell-type fractions that provide a good match to the expression and co-expression data and that satisfy non-negativity and sum-to-one constraints (by the definition of fk). Under these conditions, the number of nonzero inferred cell-type fractions determines the inferred number of cell types, and the variables Jk,a for a = 1, … 19 and for k values with fk ≠ 0, provide candidate expression patterns of these selected cell types.
In principle, the model could be fit to observed data by minimizing the summed squared difference between the measurements and the predictions generated by the inferred fractions. This amounts to a non-negative constrained least squares (NNCLS) minimization problem (see Experimental Procedures; Wang et al, 2006; Abbas et al, 2009; Gong et al, 2011; Grange, et al, 2014). But the NNCLS approach fails in this case because, despite the constraint of non-negativity, it generates an infinite number of equally valid solutions. Indeed, for any single presumed cell type it is possible to find alternative solutions that exclude this cell type while maintaining an optimal summed squared difference (Supplemental Figure 3A).
We therefore resorted to a Bayesian approach in which unknown cell-type fractions are modeled as random variables, allowing their uncertainty to be characterized by probability distributions. The use of a prior distribution enables previous knowledge and expectations to be incorporated into the model, and a likelihood function reflects the probability that the observed data were generated by the model. As a biologically plausible prior distribution over cell-type fractions, we chose a constrained “spike-and-slab” (SnS) distribution (Ishwaran and Rao, 2005). This prior incorporates the biologically reasonable assumption that only a small fraction of the 1,978 potential cell types actually exist within the parental V1 population. The SnS prior forces that only a subset of potential expression patterns is required to explain the measurements (Supplemental Information).
The use of Bayes’ rule to combine prior and data likelihoods results in a posterior distribution from which estimates of confidence about the existence and identity of cell types can be determined. In our case, the posterior distribution cannot be computed directly, necessitating the use of a Monte Carlo sampling method (Gelman et al 2013). In particular, we adapted a Hamiltonian Monte Carlo (HMC) algorithm to draw random samples from the posterior distribution. This Monte Carlo procedure is specialized for constrained SnS posteriors and permits efficient sampling from our posterior distributions (Pakman and Paninski 2013, 2014).
Each iteration of the sampling algorithm generates a set of cell-type fractions that satisfy the constraints and provide a good fit to the data. The number of selected cell types and their expression patterns vary across iterations. Combining samples across a large number of iterations allows us to infer the properties of the posterior probability distribution. For example, the proportion of Monte Carlo samples for which a particular expression pattern is selected determines that type’s posterior inclusion probability, and serves as a confidence measure of its necessity to explain the data. We also computed the distribution of the number of cell types selected in each iteration, which provides an estimate of the total number of distinct cell types required to explain the observed data. Repeated sampling also enables us to compute cross-correlations between cell-type fractions that are used to construct a list of candidate expression profiles along with the probabilities of their correspondence to actual cell types. As with an expression pattern, a candidate expression profile is a 19-component vector, with each component representing a different TF, but now these components are allowed to be real numbers between 0 and 1. In this scheme, component a of the candidate expression profile represents the probability that TF a is expressed. Finally, the quantification of uncertainty in the Bayesian approach provides a tool that can enhance experimental design by selecting, in a principled way, the measurement that is expected to maximally reduce uncertainty (Supplemental Information, Section Experimental Design).
We validated the ability of the Bayesian approach to accurately infer cellular diversity by performing computational cross-validation experiments, as well as experiments on simulated datasets, for which the underlying cell types and corresponding cell-type fractions are known (see Computational Validation section of Supplemental Information). This approach provides convincing evidence that meaningful and accurate estimates of cellular diversity can be extracted, encouraging us to apply the Bayesian approach to V1 datasets.
We first applied this Bayesian framework to TF expression without including spatial information (Figure 1A and 1B, but not not3A).3A). As discussed, each iteration of the HMC sampling algorithm generates a possible set of cell types, but their number and identity vary across HMC iterations. Over the course of the full HMC run, the number of types selected (those with non-zero cell-type fractions) ranged from 25 to 33 with a mean ± standard deviation of 29 ± 2 (Figure 1C). The identity of the selected cell types varied across different HMC iterations.
Computing the posterior inclusion probability of each expression pattern across many samples led to a rank-ordered list of candidate expression patterns. The 40 candidate patterns with the highest inclusion probabilities and their inferred cell-type fractions are shown in Figure 1D. The expression pattern with the highest inclusion probability corresponds to the Renshaw interneuron, a defined V1 neuronal type that mediates recurrent inhibition of motor neurons (Renshaw, 1946) and co-expresses the TFs Oc1, Oc2 and MafB (Stam et al, 2012). This analysis also infers the existence of MafA+ and MafA− subsets of Renshaw interneurons (patterns 1 and 30 in Figure 1D), a molecular diversity that may correspond to their known morphological heterogeneity (Fyffe, 1990).
We examined the sensitivity of these results to the number of TFs used in the analysis, selecting 11 to 18 of the 19 measured TFs. The average number of selected cell types, 29 for the full 19 factors, decreases only gradually when smaller numbers of TFs are analyzed. Moreover, when 16 to 19 TFs are incorporated, the number of selected cell types remains relatively constant, close to 29 (Figure 1E). In contrast, the number of potential cell types (1,978 for the case of 19 factors) depends much more strongly on the chosen TF subset (Figure 1F). These findings suggest that Bayesian calculations of cellular diversity based solely on the available TF data may be close to saturating with the use of a majority subset of the 19 TFs examined here.
What is the diversity of potential TF expression patterns within the V1 population? We detected 131 different expression patterns with posterior inclusion probabilities greater than 0.05 (i.e. appearing in more than 5% of the HMC samples). We constructed candidate expression profiles by clustering the 131 most likely expression patterns into ‘groups’ (Figure 2A and Supplemental Information, Section Clustering Cell Types). A group is defined as a set of expression patterns that satisfies two conditions: (i) the members of a group express similar sets of TFs (Supplemental Figure 2A), and (ii) in all or almost all of the HMC samples, only a single member of a group is selected (i.e. has a non-zero cell-type fraction), although different members may be selected in different samples (Supplemental Figure 2B). The second condition causes the members of a group to be negatively correlated with each other (Supplemental Figure 2A). These conditions imply that a group is likely to represent a single cell type with an uncertain expression pattern, rather than multiple cell types.
We developed a recursive algorithm for constructing these groups. All candidate expression profiles with inclusion probabilities greater than 5% were assigned, with most groups having only a single member selected across all of the HMC samples, and no group having more than one member selected in >3% of the samples (Supplemental Figure 2D). To examine the robustness of the inferred groups, we varied systematically the threshold for selecting the list of candidate expression patterns from which groups were constructed. As this threshold is lowered, the number of groups first increases linearly because each high-ranked expression pattern spawns its own group (Supplemental Figure 2C). However, this growth slows as lower-ranked patterns join existing groups, resulting in a weak dependence on the inclusion threshold. With an inclusion threshold of 5%, the clustering algorithm identifies 35 groups (Figure 2A).
Each group gives rise to a single candidate expression profile (Figure 2B), and for each profile we assign an expression probability to each TF, weighting the binary expression patterns of each member of the group by the frequency with which it appear in the HMC samples (Figure 2B, top). In addition, we compute a posterior inclusion probability (Figure 2B, bottom) for each candidate expression profile. These are much higher than the inclusion probabilities of the corresponding candidate expression patterns from which they are constructed (Figure 1D). Nevertheless, there is still considerable uncertainty in the identity of the cell types predicted by TF expression data alone (Figure 2B). Furthermore, the existence of 131 candidates for only ~30 cell types (the average number per sample; Figure 1C) and the fact that few of the top expression patterns in Figure 1D have posterior inclusion probabilities near one both indicate that expression-only data is insufficient for specifying cell-type identity (Supplemental Figure 3A).
Certain classes of spinal interneurons are known to localize in discrete spatial domains (Thomas and Wilson, 1965; Hultborn et al, 1971), prompting us to ask whether the incorporation of spatial information could refine estimates of V1 group diversity. For spatial analysis, we divided the ventral spinal cord into 196 bins, mapped spatial expression data into specific bins, and defined cell-type fractions for each bin (Figure 3A). We then applied an appropriately generalized Bayesian analysis to the spatially-resolved data (Supplemental Information Sections 2.1 and 3.1).
Incorporating spatial information into the Bayesian analysis increased the number of candidate cell types that the HMC sampler selected per iteration from 29 to 50 ± 2 (mean ± SD); and the degree of confidence in the inferred expression profiles also increased (Figure 3B). With the addition of spatial information only 75 total expression patterns are assigned posterior inclusion probabilities greater than 0.05 (compared to 131 for the non-spatial analysis), and many of their inclusion probabilities are close to one (Supplemental Figure 2F; c.f. Figure 1D). We repeated the grouping procedure for these 75 total cell types and uncovered 57 candidate expression profiles, most of which are identified with high inclusion probabilities (Figure 3C), permitting a more confident assignment of expression patterns (Figure 3D).
An additional benefit of this spatial analysis is that it provides estimates of how each inferred cell type localizes in the ventral spinal cord (Figure 4). Although the method does not impose continuity on cell type distributions we find that many of the inferred cell types are localized in relatively compact, contiguous domains, covering the full positional spectrum of the parental V1 interneuron distribution along both the dorsoventral and mediolateral axes. Notably, one inferred cell type with the expression profile of Renshaw interneurons (expressing MafB, Oc1 and Oc2) is predicted to be confined to the most ventral region within the parental V1 population (Figure 4I), in agreement with their known settling position (Alvarez and Fyffe, 2007; Stam et al, 2012). Other inferred cell types, characterized by FoxP2, FoxP4, Nr3b3, and/or Nr4a2 expression, showed clustered distributions ventral to the central canal, and dorsomedial to motor neurons (Figure S4). Such distributions are similar to the proposed location of group Ia reciprocal interneurons (Hultborn et al, 1971), a subset of which are known to reside within the parental V1 population (Zhang et al, 2014). Taken together, these findings document novel molecular and spatial diversity in the V1 interneuron population.
To characterize the minimal number of TFs needed to provide selective access to an individual cell type, we developed a classification scheme that relies on a recursive algorithm to sequentially subdivide the parental population and arrange every cell type along a clade diagram (see Supplemental Information). In this representation, the central node of the diagram corresponds to the full V1 population, with branches representing TFs expressed in a mutually exclusive fashion covering the highest fraction of the parental population. This process is repeated until the candidate cell types from which the analysis is constructed are revealed at the extremities of the plot. Our analysis reveals that 64% of the V1 parental population can be divided into four main clades on the basis of the mutually exclusive expression of FoxP2, Pou6f2, Sp8, and MafA (Figure 5A). Each clade contains from 4 to 19 cell types, with a total of 36 cell types falling within the 4 main clades (Figure 5B). The minimum number of TFs needed to target each group ranges from 2 (in the case of MafA, Zfhx4 neurons) to 6 (as in the final leaves of the FoxP2 clade) with a mean number of 4 ± 1 (mean ± SD).
Analysis of clade settling positions shows that the MafA, Sp8, and Pou6f2 clades exhibited little spatial overlap, whereas the FoxP2 clade displays a broad spatial distribution with significant overlap with the other three clades (Figure 5C). At the second tier of our clade assignment, V1 subclasses become more restricted spatially (Figure 5D), with additional restrictions for progressively higher tiers (not shown). Cell types within the Pou6f2 clade exhibit medio-lateral gradations in their spatial distributions, determined by the expression of the TFs Nr5a2 and Lmo3 respectively. In certain instances, however, cell types within a single clade show overlapping spatial distributions, best exemplified by the FoxP2 clade, characterized by numerous intermingled cell types with no statistically significant difference in their centroid coordinates (Supplemental Figure 4). In summary, this clade analysis provides predictive insight into the relative contributions of individual TFs as well as spatial information of use in delineating the hierarchy of V1 interneuron candidate cell types.
The merits of our computational analysis depend critically on the ability to infer cellular diversity with accuracy. We sought to assess the biological accuracy of the Bayesian model’s predictions, comparing first experimental findings and inferred results from Bayesian analysis with single-cell quantitative real-time PCR (qRT-PCR) data (Figure 6). We focused on 15 TFs for which reliable qRT-PCR probes could be identified, and analyzed TF expression within En1+ neurons, isolated at random from p0 lumbar spinal segments of En1::Cre; RCE::lsl.GFP mice. We assessed whether qRT-PCR and immunocytochemistry gave comparable co-expression values. Appropriate thresholds for each gene were set, relative to the expression of the ubiquitously expressed gene β-actin, with the aim of comparing the measured qRT-PCR patterns against our immunohistochemical measurements. Thresholds were chosen by minimizing the distance between qRT-PCR generated expression values and immunohistochemical measurements (Experimental Procedure, Single cell qRT-PCR characterization). After applying appropriate thresholds, TFs were classified either as ‘expressed’ or ‘not expressed’ within individual En1+ interneurons. The patterns of gene expression emerging from quantitative PCR exhibited high correlation with immunohistochemical data for both individual and paired TF measurements (Figure 6A–C, correlation coefficient = 0.88).
We then asked whether qRT-PCR transcriptional patterns correlate with the clade results of our Bayesian analysis, seeking to validate the general organization of our candidate expression patterns, instead of individual cell types. Individual V1 interneuron gene expression profiles can be segregated along the four major inferred clade populations, indicating that our computational predictions accurately reflect gene expression relationships in vivo (Figure 6D). Importantly, qRT-PCR identified closely-related expression patterns predicted by the model. These results validate the general organization of the Bayesian cell-type predictions.
We next tested predictions involving triple labeling of combinations of TFs not previously measured. We focused on V1Sp8 interneurons, noting that predicted fractions for the 7 potential combinations of Sp8, Prox1 and Prdm8 are in good agreement with their measured values (Figure 7A, Supplementary Table 1). Predictions arising from spatial analyses tend to be more constrained, with smaller standard deviations, and are generally more accurate than the non-spatial predictions. Moreover, we validated the predicted absence of the combination Prdm8+, Prox1+, Sp8− (Figure 7B). Taken together, these results indicate that the Bayesian approach accurately infers the TF expression profile of neuronal types within the parental V1 population.
Finally, our results also enabled us to test predictions about unmeasured spatial distributions of neurons expressing pairs of TFs, on the basis of measured spatial information for single TFs. We focused on cases in which a combination of two TFs confined an inferred V1 neuronal type to a highly restricted region of the parental V1 distribution. We found that our predictions faithfully co-localize with the actual distributions assessed in p0 caudal lumbar spinal segments (Figure 7C–E). These results indicate that the Bayesian approach, by virtue of incorporating dual cellular sources of information, correctly predicts the spatial distribution of novel gene combinations.
To establish the general applicability of our Bayesian approach, we evaluated its ability to identify cell types in systems where an estimate of cellular diversity had been extracted by other analysis procedures.
We first focused on the zebrafish embryo, for which single cells have been transcriptionally profiled by RNAseq and mapped to their location of origin (Satija et al 2015). Although the delineation of the entire cellular repertoire was not attempted in that work, the analysis of single cell cluster profiles across the marginal region of the embryo is consistent with seven cell types (Figure 7G). We sought to determine whether our sparse Bayesian methods are able to achieve this result given simulated data generated by randomly subsampling the dataset from Satija et al (2015) (Supplemental Information, Ground-truth data generation from RNAseq measurements). In the absence of spatial information, the sparse Bayesian algorithm estimated 5 +/− 1 cell types. The transcriptional profile of each inferred cell type corresponds to one of the ground-truth candidates, but our procedure underestimated total cell-type number (Figure 7I). Introducing spatial information into the analysis increases the number of correctly inferred cell types to 6 +/− 1 (Figure 7J), close to 7, the ground-truth number. Thus as in the V1 study, inference is improved by incorporating additional cellular characteristics – in this case location. The coarse description of the spatial distributions and their similar shapes, together with the random selection of the subset of genes incorporated in our analysis (see Experimental Procedures for details), are likely reasons that the algorithm slightly underestimates cell-type diversity. Nevertheless, the results obtained by the Bayesian approach are generally in good agreement with those obtained by clustering the original RNAseq data.
We next analyzed cortical interneuron diversity, where 16 interneuronal cell types have been identified on the basis of RNAseq data in mouse somatosensory cortex and hippocampal CA1 neurons (Zeisel et al, 2015). From this data set, single and pairwise measurements were created, with errors assigned to each measurement (Supplemental Information). We used this dataset to construct Bayesian estimates of neocortical diversity in the absence of spatial information, varying the number of genes used in the analysis, the noise in the measurements and the amount of missing data (representing antibody incompatibility). From this, we inferred 12.7 +/− 0.3 cell types over a range of 13 to 16 genes used, all 12 corresponding to correctly inferred expression profiles (Figure 7K). In every example, the sparse Bayesian analysis outperformed the NNCLS approach, which overestimated the number of cell types by nearly 100%. We next used all selected genes to estimate sensitivity to noise and missing data, and observed a larger effect for noise (Figure 7L). As the noise level and amount of missing data tend to zero, we correctly infer the total number of cell types and their expression profiles (Figure 7M).
These analyses establish the sparse Bayesian approach as an effective means of estimating neuronal type diversity, and provide further insight into the benefits of incorporating spatial information when obtaining accurate estimates.
Objective assessment of the extent of mammalian cellular diversity has remained challenging. The sparse Bayesian framework presented here provides a general method for characterizing cellular heterogeneity on the basis of sparsely sampled biological information. We have used this framework to study spinal V1 interneuron diversity. By analyzing the spatial expression densities of individual TFs as well as their patterns of pairwise expression, candidate expression profiles for ~50 inferred V1 cell types are provided. The integration of distinct phenotypic aspects of cellular heterogeneity, in this instance TF expression and settling position, markedly enhances the confidence in assignment of predicted V1 interneuron cell types. We note that this approach provides a general method for delineating the heterogeneity of cell types in any mixed tissue.
Estimates of cell type diversity can be obtained through computational approaches that employ hierarchical clustering (see Armañanzas and Ascoli, 2015, for a recent review). But these methods have drawbacks - it is challenging to use hierarchical clustering to determine the number of cell types automatically, and their inferences can be sensitive to the choice of similarity measures (Augen, 2005).
A different set of computational approaches based on deconvolution algorithms have been used to characterize cellular diversity from information about gene expression profiles (Shen-Orr and Gaujoux, 2013). These can be divided into two major methodologies. Regression approaches can be applied when the expression profile of cell types of interest is known a priori (Wang et al, 2006; Abbas et al, 2009; Gong et al, 2011; Zuk et al, 2013; Grange, et al, 2014). In contrast, matrix-factorization approaches become relevant when cell type expression profiles are not known (Repsilber et al, 2010; Erkkilä et al, 2010; Bazot et al, 2013; Zhong et al, 2013; Liebner et al, 2014). The latter approaches suffer from similar limitations as the NNCLS method: matrix factorizations are inherently non-unique, and estimation of confidence levels can be difficult. Past attempts to overcome these challenges have relied on the premise that particular genes are expressed in only a single cell type, which does not represent the general biological case (Gaujoux and Seoighe, 2012). A more recent approach, conceptually akin to the methods developed here, uses SnS priors to reduce the mathematical ambiguities inherent in matrix factorization and has recently been applied to the identification of genetic disease pathways (Shen et al, 2015), albeit without addressing cell type diversity.
Our approach sidesteps previous limitations by focusing on the fractional prevalence of individual and paired TFs within a parental population, with well-defined values between zero and one, in contrast to expression levels that may be scaled arbitrarily (see Supplemental Information for further discussion). Mathematically, our model resembles previous regression models (Wang et al, 2006; Abbas et al, 2009; Gong et al, 2011; Zuk et al, 2013; Grange, et al, 2014), but has the virtue of incorporating the SnS prior and a binary set of regressors that span all possible cell types. For the data considered here, the resulting regression problem is highly ill-posed (Supplemental Figure 3), necessitating a fully Bayesian approach to capture uncertainty in resultant estimates.
A distinctive feature of our approach is that we start by considering all of the possible expression patterns for the genes being considered, 219 in our case and, in general, 2N for a study involving N genes. We note that this factor 2N may be prohibitive for applications of the method to RNA-seq data, where a large number of genes are typically tracked. Although applications in which N is several thousand would appear impractical, it may be possible to identify a subset of genes expected to be particularly informative about cell type and restrict the analysis to this subset. Even with a reduced N, 2N may be dauntingly large, but it is important to recall that in our analysis a preliminary screening reduced the number of expression patterns by a factor of ~265, from 219 (524,288) to 1,978. Greater N values may yield even larger reductions.
Our analysis has identified extensive transcriptional diversity within V1 interneurons on the basis of the expression of 19 TFs. The first issue this raises is whether further diversity will follow inevitably with the inclusion of additional V1 TFs. We analyzed the impact of varying the number of TFs in our analysis and found only a weak dependence of the number of cell types on the number of TFs (Figure 1E–F). Thus, 19 TFs appear sufficient to uncover a substantial fraction of the total underlying transcriptional heterogeneity.
A second issue is the impact of incorporating spatial information on cell type. In our analysis the inclusion of spatial data increased inferred cell type number by ~70%, and markedly enhanced confidence in the inferred expression profiles. Spatial information has a much stronger impact on cell type assignment than variation in the number of TFs, indicating that settling position carries significant cell type information which is independent from the information carried by the expression patterns of the 19 TFs examined here.
A third issue is the impact of V1 cell type spatial segregation on synaptic input specificity (Bikoff et al., 2016). An inordinate diversity in spatially-restricted cell types may be necessary to satisfy highly diverse spinal circuit computations. Our data indicate that spatial segregation only partially accounts for spinal V1 diversity. A large number of inferred cell types, marked by expression of FoxP2, localize in the same ventral region (Supplemental Figure 4). Additional heterogeneity might be realized by the expression of surface molecules that impose and constrain connectivity with different synaptic partners.
We note that transcriptional diversity could, in some instances, reflect variation in functional cell state, rather than indicating a distinct neuronal subtype. However, consistent with the genetic specification of cell type identity, we find that the position of transcriptionally distinct V1 subsets are segregated, stereotyped from animal to animal, and stable across development (see Bikoff et al, 2016, Figures 3H and S4A–B). The spatial segregation argues strongly against the ‘cell state’ possibility. Nevertheless, activity-shaped differences in Er81 expression in fast-spiking cortical interneurons have been shown to mark delay-type or non-delayed firing states (Dehorter et al, 2015). In addition, activity-dependent induction of Npas4 expression has been described for cortical neurons (Lin et al., 2008), with implications for homeostatic regulation of sensitivity to inhibitory transmitters. Further studies will therefore be needed to dissect the functional consequences of V1 diversity, to resolve whether certain state-dependent functional properties are reflected in the diversity of V1 transcriptional profiles.
Our statistical approach has relevance well beyond a focus on spinal V1 interneurons, and could prove useful in further delineating neuronal cell types elsewhere in the nervous system. Cortical projection neurons fractionate into a few broad classes based on patterns of target innervation and distinctions in gene expression, yet the extent to which any single broad class of pyramidal neurons is itself heterogeneous remains unclear (Greig et al, 2013). The classification of interneuron cell types in the brain has proven particularly challenging (Ascoli, 2008; DeFelipe et al, 2013; Kepecs and Fishell, 2014), although studies of hippocampal interneuron diversity suggest a degree of heterogeneity that approaches that found for spinal V1 interneurons. Within CA1 hippocampus, over 20 inhibitory interneuron subtypes have been identified, based on anatomical, molecular, or electrophysiological distinctions (Krook-Magnusen et al, 2012). Single-cell transcriptome analysis of primary somatosensory cortex or CA1 hippocampus interneurons has identified 16 molecularly distinct interneuron cell types, which likely represents a lower bound on diversity (Zeisel et al, 2015). Thus, insight into interneuronal diversity in the spinal cord may inform studies to address heterogeneity throughout the brain.
Our analysis also has implications for genetic strategies aimed at manipulating circuit elements throughout the nervous system. The minimal number of TFs needed to define a single V1 cell type uniquely has been identified on the basis of clade profiles and is, on average, 4 ± 1. This indicates that individual TFs are generally not sufficient to isolate V1 neuronal types, consistent with findings in other neuronal systems (Sanes and Masland, 2015). The difficulty in identifying single TFs that uniquely define a cell type may reflect the prevalence of combinatorial TF codes (Philippidou and Dasen, 2015), and could explain the difficulty in delineating individual motor neuron pools (De Marco Garcia and Jessell, 2008).
Immunohistochemistry was performed as in Bikoff et al, 2016. Briefly, p0 En1::cre; Tau.lsl.nLacZ mice were transcardially perfused with 4% paraformaldehyde in 0.1M phosphate buffer, followed by a 2 hour postfixation. Tissue was then washed, cryoprotected by equilibration in 30% sucrose in 0.1M phosphate buffer, embedded in OCT and cryostat-sectioned in the transverse plane at 20 μm. Immunohistochemistry was performed on tissue through sequential exposure to primary antibodies (overnight at 4°C) and fluorophore-conjugated secondary antibodies (1 hour at room temperature). Sections were mounted using Fluoromount-G (SouthernBiotech) and coverslipped for imaging. Confocal images were obtained on an LSM 710 Meta Confocal microscope (Carl Zeiss) at 1024×1024 resolution, using a Plan-Apochromat 20x/0.8 M27 objective. See Bikoff et al, 2016 for a description of antibodies used.
Confocal images of transcription factor co-expression were analyzed in Imaris (Bitplane) using the “Colocalization” and “Spots” functions with thresholds set to exclude nonspecific background reactivity, followed by manual validation to confirm co-expression within V1 interneurons. For each transcription factor combination, at least 2 independent sections from 3 or more animals were analyzed, totaling >580 lumbar sections and >1100 lumbar hemisections in this dataset.
Interneuron spatial distributions are described in Bikoff et al, (2016). Sections were normalized to a standardized spinal cord hemisection (distance from central canal to lateral boundary: 650 μm; distance from central canal to bottom-most boundary: 400 μm). Coordinates were exported from Imaris, and plotted using the contour function in Matlab.
Supplemental Information includes a detailed description of the Bayesian model and Hamiltonian Monte Carlo algorithm for sampling from the posterior distribution of the fractional values f given the observed data, under the SnS prior.
GFP+ cells from lumbar spinal cords of p0 En1::Cre; RCE.lsl.GFP mice were dissociated using the Papain Dissociation Kit (Worthington), and single cells were isolated by uorescence-activated cell sorting using a Beckman Coulter MoFlo Astrios cell sorter. Cells were directly deposited into 96-well plates containing lysis buffer (Ambion Single Cell-to-CT Kit (Life Technologies). Reverse transcription and pre-amplification were performed according to the manufacturer’s protocol (Ambion, Life Technologies). Pre-amplified products were loaded on Biomark 48.48 Dynamic Arrays, and run on the Biomark HD microfluidic multiplex qRT-PCR platform (Fluidigm, San Francisco, CA). Gene expression levels were assayed using TaqMan probes (Life Technologies) directed against housekeeping genes and 15 of the 19 transcription factor-encoding genes, excluding FoxP1, MafA, Nr3b3, and Zfhx4 for which reliable probes could not be identified. (See Supplemental Information Table 2 for a description of TaqMan probes and extended experimental procedures). Ct values were measured by Biomark software, where relative transcript levels were determined by 2–Ct normalization to Actb transcript levels. The co-expression matrix for single cell qRT-PCR data was calculated using an optimization approach. Brie y, a threshold for each gene was computed by minimizing the distance between a co-expression matrix calculated by qRT-PCR and immunohistochemistry. The fraction of cells expressing gene X was computed simply by calculating the number of cells expressing gene X divided by the total number of cells.
We thank H. Lee for sharing advice on qPCR experiments. R. Axel, W. Fischler, A. Miri, D. Gutnisky and J.C. Tapia provided valuable discussions and comments on the manuscript, and C. Zuker suggested validation experiments. We thank A. Karpova and S. Druckman for quantitative insight. M.I.G. was supported by the National Science Foundation through a GRFP fellowship. T.M.J was supported by NIH grant NS033245, the Harold and Leila Y. Mathers Foundation, the Brain Research Foundation, and Project A.L.S., and is an investigator of the Howard Hughes Medical Institute. L.F.A. was supported by NIH grant MH093338 and by the Gatsby, Swartz and Mathers Foundations. L.P. and A.P. were supported by ONR grant N00014-14-1-0243 and ARO MURI grant W911NF-12-1-0594.
Supplemental Information can be found with this article online at …..
AUTHOR CONTRIBUTIONSM.I.G., A.P., L.A., L.P. designed computational analysis. J.B.B. performed biological experiments upon which statistical analysis was based. M.I.G. and J.B.B. performed and analyzed qPCR experiments. M.I.G., J.B.B., A.P., L.A., L.P., and T.M.J. prepared the manuscript.
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.