Home | About | Journals | Submit | Contact Us | Français |

**|**HHS Author Manuscripts**|**PMC4831714

Formats

Article sections

Authors

Related links

Cell. Author manuscript; available in PMC 2017 March 24.

Published in final edited form as:

Published online 2016 March 3. doi: 10.1016/j.cell.2016.01.026

PMCID: PMC4831714

NIHMSID: NIHMS765564

Mariano I. Gabitto,^{1,}^{4,}^{*} Ari Pakman,^{2,}^{*} Jay B. Bikoff,^{1,}^{4,}^{*} L. F. Abbott,^{1,}^{3} Thomas M. Jessell,^{1,}^{4} and Liam Paninski^{1,}^{2}

Documenting the extent of cellular diversity is a critical step in defining the functional organization of tissues and organs. To infer cell type diversity from partial or incomplete transcription factor expression data we devised a sparse Bayesian framework that is able to handle estimation uncertainty, and can incorporate diverse cellular characteristics to optimize experimental design. Focusing on spinal V1 inhibitory interneurons, for which the spatial expression of 19 transcription factors has been mapped, we infer the existence of approximately 50 candidate V1 neuronal types, many of which localize in compact spatial domains in the ventral spinal cord. We have validated the existence of inferred cell types by direct experimental measurement, establishing this Bayesian framework as an effective platform for cell type characterization in the nervous system and elsewhere.

Tissues and organs are comprised of diverse cell types, possessing characteristic morphology and specialized function. The diversification of cell types attains prominence in the nervous system, where neuronal distinctions depend on the activities of transcription factors (TFs) and their downstream effectors (Kohwi and Doe, 2013). Attempts to define the link between transcriptional identity and neuronal diversity have benefitted from the analysis of long-distance projection neurons, for which distinctions in target innervation provide a clear correlate of functional divergence (Molyneaux *et al*, 2007; Sanes and Masland, 2015). In the retina and cerebral cortex, functional subclasses of ganglion and pyramidal neurons have been delineated through their transcriptional identities (Siegert *et al*, 2009; Greig *et al*, 2013). Similarly, the hierarchical ordering of motor neuron subtypes in the spinal cord has its origins in discrete profiles of transcription factor expression (Dasen *et al*, 2005; Dasen and Jessell, 2009). Yet, local interneurons represent by far the most prevalent neurons within the mammalian CNS, collectively shaping the output of long-range-projection neurons (Isaacson and Scanziani, 2011). The local confinement of interneuron axons, however, has made it difficult to obtain objective measures of identity and diversity.

Genome-wide mRNA expression profiles have been informative in distinguishing neuronal cell types (Usoskin *et al*, 2015; Zeisel *et al*, 2015; Macosko *et al*, 2015, Tasic *et al*, 2016). Nevertheless, documented dissociations between mRNA and protein expression (Gygi *et al*, 1999; Vogel & Marcotte, 2012) emphasize the merits of analysis of protein expression at the level of individual neurons (Sharma *et al,* 2015). But if many genes are involved in defining individual subpopulations, then the validation of protein co-expression will be constrained by the limited repertoire of primary and secondary antibodies.

This practical limitation could be overcome through the development of a statistical method that is able to resolve the extent of neuronal diversity from sparsely sampled transcriptional datasets. Such a method should provide: (i) an objective measure of confidence in the existence of cell types and their prevalence within a parental population, (ii) improvement in estimation accuracy upon integrating independent cellular characteristics with molecular phenotype, and (iii) informative predictions to guide further experiments. To meet these goals we developed a sparse Bayesian framework that models co-expression data based on incomplete combinations of TFs. Our focus on TF expression was governed by the well-established role of DNA-binding proteins in defining neuronal identity (Dalla Torre di Sanguinetto *et al*, 2008; Amamoto and Arlotta, 2014).

We used this Bayesian approach to assess the diversity of V1 interneurons in the spinal cord, a major inhibitory interneuron population implicated in motor control (Zhang *et al*, 2014). V1 interneurons are defined by developmental expression of the homeodomain transcription factor En1 (Saueressig *et al*, 1999) and include Renshaw cells and Group Ia reciprocal interneurons, which mediate recurrent and reciprocal inhibition, respectively (Sapir *et al*, 2004 and Zhang *et al*, 2014). Yet these two physiologically defined subtypes represent only a small fraction of the parental V1 population (Alvarez *et al*, 2005), implying a greater diversity of V1 neurons. Indeed, the V1 population has recently been subdivided on the basis of the expression of 19 TFs (Bikoff *et al*, 2016).

Here we apply a sparse Bayesian approach to assess the diversity of V1 neuronal subtypes marked by these TFs. Our analysis has generated three major findings: (i) an estimate of the number of V1 interneuron types, on the order of 50 subtypes; (ii) an estimate of the expression profiles of the 19 TFs across these inferred types; and (iii) a cladistic description of V1 diversity that guides experiments aimed at monitoring and manipulating distinct V1 subpopulations. In several instances, the predicted assignment of V1 neuronal type has been validated through single-cell quantitative RT-PCR, immunohistochemistry and assessment of spatial distribution. Finally, we demonstrate that this sparse Bayesian analysis serves as an effective platform for identifying cell types within other diverse populations.

The companion paper (Bikoff *et al*, 2016) used comparative microarray screening to identify and map the expression of 19 TFs in V1 interneurons. We used three sets of data to infer V1 neuronal diversity: (i) the fraction of neurons within the parental V1 population that express each of the 19 TFs (Figure 1A), (ii) the fractions of neurons co-expressing various pairs of TFs (Figure 1B), and (iii) the position of V1 interneurons expressing each of the 19 TFs (Figure 3A). Complete analysis of all TF pairs, however, is hindered by the fact that primary antibodies generated in the same host species cannot be distinguished easily by fluorescently-tagged secondary antibodies. We therefore developed an approach that permits statistical inference on the basis of incomplete data.

In this statistical analysis, a cell type is defined by the expression pattern of the 19 TFs under consideration. We characterize TFs as either expressed or not expressed, and thus each expression pattern is specified by a vector of 19 binary numbers, *J _{k,a}*, for pattern

We designate the fraction of cells with expression pattern k, the *cell-type fraction*, denoted by *f _{k}*, with

In principle, the model could be fit to observed data by minimizing the summed squared difference between the measurements and the predictions generated by the inferred fractions. This amounts to a non-negative constrained least squares (NNCLS) minimization problem (see Experimental Procedures; Wang *et al*, 2006; Abbas *et al*, 2009; Gong *et al*, 2011; Grange, *et al*, 2014). But the NNCLS approach fails in this case because, despite the constraint of non-negativity, it generates an infinite number of equally valid solutions. Indeed, for any single presumed cell type it is possible to find alternative solutions that exclude this cell type while maintaining an optimal summed squared difference (Supplemental Figure 3A).

We therefore resorted to a Bayesian approach in which unknown cell-type fractions are modeled as random variables, allowing their uncertainty to be characterized by probability distributions. The use of a *prior* distribution enables previous knowledge and expectations to be incorporated into the model, and a *likelihood* function reflects the probability that the observed data were generated by the model. As a biologically plausible prior distribution over cell-type fractions, we chose a constrained “spike-and-slab” (SnS) distribution (Ishwaran and Rao, 2005). This prior incorporates the biologically reasonable assumption that only a small fraction of the 1,978 potential cell types actually exist within the parental V1 population. The SnS prior forces that only a subset of potential expression patterns is required to explain the measurements (Supplemental Information).

The use of Bayes’ rule to combine prior and data likelihoods results in a posterior distribution from which estimates of confidence about the existence and identity of cell types can be determined. In our case, the posterior distribution cannot be computed directly, necessitating the use of a Monte Carlo sampling method (Gelman *et al* 2013). In particular, we adapted a Hamiltonian Monte Carlo (HMC) algorithm to draw random samples from the posterior distribution. This Monte Carlo procedure is specialized for constrained SnS posteriors and permits efficient sampling from our posterior distributions (Pakman and Paninski 2013, 2014).

Each iteration of the sampling algorithm generates a set of cell-type fractions that satisfy the constraints and provide a good fit to the data. The number of selected cell types and their expression patterns vary across iterations. Combining samples across a large number of iterations allows us to infer the properties of the posterior probability distribution. For example, the proportion of Monte Carlo samples for which a particular expression pattern is selected determines that type’s *posterior inclusion probability*, and serves as a confidence measure of its necessity to explain the data. We also computed the distribution of the number of cell types selected in each iteration, which provides an estimate of the total number of distinct cell types required to explain the observed data. Repeated sampling also enables us to compute cross-correlations between cell-type fractions that are used to construct a list of *candidate expression profiles* along with the probabilities of their correspondence to actual cell types. As with an expression pattern, a candidate expression profile is a 19-component vector, with each component representing a different TF, but now these components are allowed to be real numbers between 0 and 1. In this scheme, component *a* of the candidate expression profile represents the probability that TF *a* is expressed. Finally, the quantification of uncertainty in the Bayesian approach provides a tool that can enhance experimental design by selecting, in a principled way, the measurement that is expected to maximally reduce uncertainty (Supplemental Information, Section Experimental Design).

We validated the ability of the Bayesian approach to accurately infer cellular diversity by performing computational cross-validation experiments, as well as experiments on simulated datasets, for which the underlying cell types and corresponding cell-type fractions are known (see Computational Validation section of Supplemental Information). This approach provides convincing evidence that meaningful and accurate estimates of cellular diversity can be extracted, encouraging us to apply the Bayesian approach to V1 datasets.

We first applied this Bayesian framework to TF expression without including spatial information (Figure 1A and 1B, but not not3A).3A). As discussed, each iteration of the HMC sampling algorithm generates a possible set of cell types, but their number and identity vary across HMC iterations. Over the course of the full HMC run, the number of types selected (those with non-zero cell-type fractions) ranged from 25 to 33 with a mean ± standard deviation of 29 ± 2 (Figure 1C). The identity of the selected cell types varied across different HMC iterations.

Computing the posterior inclusion probability of each expression pattern across many samples led to a rank-ordered list of candidate expression patterns. The 40 candidate patterns with the highest inclusion probabilities and their inferred cell-type fractions are shown in Figure 1D. The expression pattern with the highest inclusion probability corresponds to the Renshaw interneuron, a defined V1 neuronal type that mediates recurrent inhibition of motor neurons (Renshaw, 1946) and co-expresses the TFs Oc1, Oc2 and MafB (Stam *et al*, 2012). This analysis also infers the existence of MafA^{+} and MafA^{−} subsets of Renshaw interneurons (patterns 1 and 30 in Figure 1D), a molecular diversity that may correspond to their known morphological heterogeneity (Fyffe, 1990).

We examined the sensitivity of these results to the number of TFs used in the analysis, selecting 11 to 18 of the 19 measured TFs. The average number of selected cell types, 29 for the full 19 factors, decreases only gradually when smaller numbers of TFs are analyzed. Moreover, when 16 to 19 TFs are incorporated, the number of selected cell types remains relatively constant, close to 29 (Figure 1E). In contrast, the number of *potential* cell types (1,978 for the case of 19 factors) depends much more strongly on the chosen TF subset (Figure 1F). These findings suggest that Bayesian calculations of cellular diversity based solely on the available TF data may be close to saturating with the use of a majority subset of the 19 TFs examined here.

What is the diversity of potential TF expression patterns within the V1 population? We detected 131 different expression patterns with posterior inclusion probabilities greater than 0.05 (i.e. appearing in more than 5% of the HMC samples). We constructed *candidate expression profiles* by clustering the 131 most likely expression patterns into ‘groups’ (Figure 2A and Supplemental Information, Section Clustering Cell Types). A group is defined as a set of expression patterns that satisfies two conditions: (i) the members of a group express similar sets of TFs (Supplemental Figure 2A), and (ii) in all or almost all of the HMC samples, only a single member of a group is selected (i.e. has a non-zero cell-type fraction), although different members may be selected in different samples (Supplemental Figure 2B). The second condition causes the members of a group to be negatively correlated with each other (Supplemental Figure 2A). These conditions imply that a group is likely to represent a single cell type with an uncertain expression pattern, rather than multiple cell types.

We developed a recursive algorithm for constructing these groups. All candidate expression profiles with inclusion probabilities greater than 5% were assigned, with most groups having only a single member selected across all of the HMC samples, and no group having more than one member selected in >3% of the samples (Supplemental Figure 2D). To examine the robustness of the inferred groups, we varied systematically the threshold for selecting the list of candidate expression patterns from which groups were constructed. As this threshold is lowered, the number of groups first increases linearly because each high-ranked expression pattern spawns its own group (Supplemental Figure 2C). However, this growth slows as lower-ranked patterns join existing groups, resulting in a weak dependence on the inclusion threshold. With an inclusion threshold of 5%, the clustering algorithm identifies 35 groups (Figure 2A).

Each group gives rise to a single *candidate expression profile* (Figure 2B), and for each profile we assign an *expression probability* to each TF, weighting the binary expression patterns of each member of the group by the frequency with which it appear in the HMC samples (Figure 2B, top). In addition, we compute a posterior inclusion probability (Figure 2B, bottom) for each candidate expression profile. These are much higher than the inclusion probabilities of the corresponding candidate expression patterns from which they are constructed (Figure 1D). Nevertheless, there is still considerable uncertainty in the identity of the cell types predicted by TF expression data alone (Figure 2B). Furthermore, the existence of 131 candidates for only ~30 cell types (the average number per sample; Figure 1C) and the fact that few of the top expression patterns in Figure 1D have posterior inclusion probabilities near one both indicate that expression-only data is insufficient for specifying cell-type identity (Supplemental Figure 3A).

Certain classes of spinal interneurons are known to localize in discrete spatial domains (Thomas and Wilson, 1965; Hultborn *et al*, 1971), prompting us to ask whether the incorporation of spatial information could refine estimates of V1 group diversity. For spatial analysis, we divided the ventral spinal cord into 196 bins, mapped spatial expression data into specific bins, and defined cell-type fractions for each bin (Figure 3A). We then applied an appropriately generalized Bayesian analysis to the spatially-resolved data (Supplemental Information Sections 2.1 and 3.1).

Incorporating spatial information into the Bayesian analysis increased the number of candidate cell types that the HMC sampler selected per iteration from 29 to 50 ± 2 (mean ± SD); and the degree of confidence in the inferred expression profiles also increased (Figure 3B). With the addition of spatial information only 75 total expression patterns are assigned posterior inclusion probabilities greater than 0.05 (compared to 131 for the non-spatial analysis), and many of their inclusion probabilities are close to one (Supplemental Figure 2F; c.f. Figure 1D). We repeated the grouping procedure for these 75 total cell types and uncovered 57 candidate expression profiles, most of which are identified with high inclusion probabilities (Figure 3C), permitting a more confident assignment of expression patterns (Figure 3D).

An additional benefit of this spatial analysis is that it provides estimates of how each inferred cell type localizes in the ventral spinal cord (Figure 4). Although the method does not impose continuity on cell type distributions we find that many of the inferred cell types are localized in relatively compact, contiguous domains, covering the full positional spectrum of the parental V1 interneuron distribution along both the dorsoventral and mediolateral axes. Notably, one inferred cell type with the expression profile of Renshaw interneurons (expressing MafB, Oc1 and Oc2) is predicted to be confined to the most ventral region within the parental V1 population (Figure 4I), in agreement with their known settling position (Alvarez and Fyffe, 2007; Stam *et al*, 2012). Other inferred cell types, characterized by FoxP2, FoxP4, Nr3b3, and/or Nr4a2 expression, showed clustered distributions ventral to the central canal, and dorsomedial to motor neurons (Figure S4). Such distributions are similar to the proposed location of group Ia reciprocal interneurons (Hultborn *et al*, 1971), a subset of which are known to reside within the parental V1 population (Zhang *et al*, 2014). Taken together, these findings document novel molecular and spatial diversity in the V1 interneuron population.

To characterize the minimal number of TFs needed to provide selective access to an individual cell type, we developed a classification scheme that relies on a recursive algorithm to sequentially subdivide the parental population and arrange every cell type along a clade diagram (see Supplemental Information). In this representation, the central node of the diagram corresponds to the full V1 population, with branches representing TFs expressed in a mutually exclusive fashion covering the highest fraction of the parental population. This process is repeated until the candidate cell types from which the analysis is constructed are revealed at the extremities of the plot. Our analysis reveals that 64% of the V1 parental population can be divided into four main clades on the basis of the mutually exclusive expression of FoxP2, Pou6f2, Sp8, and MafA (Figure 5A). Each clade contains from 4 to 19 cell types, with a total of 36 cell types falling within the 4 main clades (Figure 5B). The minimum number of TFs needed to target each group ranges from 2 (in the case of MafA, Zfhx4 neurons) to 6 (as in the final leaves of the FoxP2 clade) with a mean number of 4 ± 1 (mean ± SD).

Analysis of clade settling positions shows that the MafA, Sp8, and Pou6f2 clades exhibited little spatial overlap, whereas the FoxP2 clade displays a broad spatial distribution with significant overlap with the other three clades (Figure 5C). At the second tier of our clade assignment, V1 subclasses become more restricted spatially (Figure 5D), with additional restrictions for progressively higher tiers (not shown). Cell types within the Pou6f2 clade exhibit medio-lateral gradations in their spatial distributions, determined by the expression of the TFs Nr5a2 and Lmo3 respectively. In certain instances, however, cell types within a single clade show overlapping spatial distributions, best exemplified by the FoxP2 clade, characterized by numerous intermingled cell types with no statistically significant difference in their centroid coordinates (Supplemental Figure 4). In summary, this clade analysis provides predictive insight into the relative contributions of individual TFs as well as spatial information of use in delineating the hierarchy of V1 interneuron candidate cell types.

The merits of our computational analysis depend critically on the ability to infer cellular diversity with accuracy. We sought to assess the biological accuracy of the Bayesian model’s predictions, comparing first experimental findings and inferred results from Bayesian analysis with single-cell quantitative real-time PCR (qRT-PCR) data (Figure 6). We focused on 15 TFs for which reliable qRT-PCR probes could be identified, and analyzed TF expression within En1^{+} neurons, isolated at random from p0 lumbar spinal segments of *En1::Cre; RCE::lsl.GFP* mice. We assessed whether qRT-PCR and immunocytochemistry gave comparable co-expression values. Appropriate thresholds for each gene were set, relative to the expression of the ubiquitously expressed gene β-actin, with the aim of comparing the measured qRT-PCR patterns against our immunohistochemical measurements. Thresholds were chosen by minimizing the distance between qRT-PCR generated expression values and immunohistochemical measurements (Experimental Procedure, Single cell qRT-PCR characterization). After applying appropriate thresholds, TFs were classified either as ‘expressed’ or ‘not expressed’ within individual En1^{+} interneurons. The patterns of gene expression emerging from quantitative PCR exhibited high correlation with immunohistochemical data for both individual and paired TF measurements (Figure 6A–C, correlation coefficient = 0.88).

We then asked whether qRT-PCR transcriptional patterns correlate with the clade results of our Bayesian analysis, seeking to validate the general organization of our candidate expression patterns, instead of individual cell types. Individual V1 interneuron gene expression profiles can be segregated along the four major inferred clade populations, indicating that our computational predictions accurately reflect gene expression relationships *in vivo* (Figure 6D). Importantly, qRT-PCR identified closely-related expression patterns predicted by the model. These results validate the general organization of the Bayesian cell-type predictions.

We next tested predictions involving triple labeling of combinations of TFs not previously measured. We focused on V1^{Sp8} interneurons, noting that predicted fractions for the 7 potential combinations of Sp8, Prox1 and Prdm8 are in good agreement with their measured values (Figure 7A, Supplementary Table 1). Predictions arising from spatial analyses tend to be more constrained, with smaller standard deviations, and are generally more accurate than the non-spatial predictions. Moreover, we validated the predicted absence of the combination Prdm8^{+}, Prox1^{+}, Sp8^{−} (Figure 7B). Taken together, these results indicate that the Bayesian approach accurately infers the TF expression profile of neuronal types within the parental V1 population.

Finally, our results also enabled us to test predictions about unmeasured spatial distributions of neurons expressing pairs of TFs, on the basis of measured spatial information for single TFs. We focused on cases in which a combination of two TFs confined an inferred V1 neuronal type to a highly restricted region of the parental V1 distribution. We found that our predictions faithfully co-localize with the actual distributions assessed in p0 caudal lumbar spinal segments (Figure 7C–E). These results indicate that the Bayesian approach, by virtue of incorporating dual cellular sources of information, correctly predicts the spatial distribution of novel gene combinations.

To establish the general applicability of our Bayesian approach, we evaluated its ability to identify cell types in systems where an estimate of cellular diversity had been extracted by other analysis procedures.

We first focused on the zebrafish embryo, for which single cells have been transcriptionally profiled by RNAseq and mapped to their location of origin (Satija *et al* 2015). Although the delineation of the entire cellular repertoire was not attempted in that work, the analysis of single cell cluster profiles across the marginal region of the embryo is consistent with seven cell types (Figure 7G). We sought to determine whether our sparse Bayesian methods are able to achieve this result given simulated data generated by randomly subsampling the dataset from Satija *et al* (2015) (Supplemental Information, Ground-truth data generation from RNAseq measurements). In the absence of spatial information, the sparse Bayesian algorithm estimated 5 +/− 1 cell types. The transcriptional profile of each inferred cell type corresponds to one of the ground-truth candidates, but our procedure underestimated total cell-type number (Figure 7I). Introducing spatial information into the analysis increases the number of correctly inferred cell types to 6 +/− 1 (Figure 7J), close to 7, the ground-truth number. Thus as in the V1 study, inference is improved by incorporating additional cellular characteristics – in this case location. The coarse description of the spatial distributions and their similar shapes, together with the random selection of the subset of genes incorporated in our analysis (see Experimental Procedures for details), are likely reasons that the algorithm slightly underestimates cell-type diversity. Nevertheless, the results obtained by the Bayesian approach are generally in good agreement with those obtained by clustering the original RNAseq data.

We next analyzed cortical interneuron diversity, where 16 interneuronal cell types have been identified on the basis of RNAseq data in mouse somatosensory cortex and hippocampal CA1 neurons (Zeisel *et al*, 2015). From this data set, single and pairwise measurements were created, with errors assigned to each measurement (Supplemental Information). We used this dataset to construct Bayesian estimates of neocortical diversity in the absence of spatial information, varying the number of genes used in the analysis, the noise in the measurements and the amount of missing data (representing antibody incompatibility). From this, we inferred 12.7 +/− 0.3 cell types over a range of 13 to 16 genes used, all 12 corresponding to correctly inferred expression profiles (Figure 7K). In every example, the sparse Bayesian analysis outperformed the NNCLS approach, which overestimated the number of cell types by nearly 100%. We next used all selected genes to estimate sensitivity to noise and missing data, and observed a larger effect for noise (Figure 7L). As the noise level and amount of missing data tend to zero, we correctly infer the total number of cell types and their expression profiles (Figure 7M).

These analyses establish the sparse Bayesian approach as an effective means of estimating neuronal type diversity, and provide further insight into the benefits of incorporating spatial information when obtaining accurate estimates.

Objective assessment of the extent of mammalian cellular diversity has remained challenging. The sparse Bayesian framework presented here provides a general method for characterizing cellular heterogeneity on the basis of sparsely sampled biological information. We have used this framework to study spinal V1 interneuron diversity. By analyzing the spatial expression densities of individual TFs as well as their patterns of pairwise expression, candidate expression profiles for ~50 inferred V1 cell types are provided. The integration of distinct phenotypic aspects of cellular heterogeneity, in this instance TF expression and settling position, markedly enhances the confidence in assignment of predicted V1 interneuron cell types. We note that this approach provides a general method for delineating the heterogeneity of cell types in any mixed tissue.

Estimates of cell type diversity can be obtained through computational approaches that employ hierarchical clustering (see Armañanzas and Ascoli, 2015, for a recent review). But these methods have drawbacks - it is challenging to use hierarchical clustering to determine the number of cell types automatically, and their inferences can be sensitive to the choice of similarity measures (Augen, 2005).

A different set of computational approaches based on deconvolution algorithms have been used to characterize cellular diversity from information about gene expression profiles (Shen-Orr and Gaujoux*,* 2013). These can be divided into two major methodologies. Regression approaches can be applied when the expression profile of cell types of interest is known *a priori* (Wang *et al*, 2006; Abbas *et al*, 2009; Gong *et al*, 2011; Zuk *et al*, 2013; Grange, *et al*, 2014). In contrast, matrix-factorization approaches become relevant when cell type expression profiles are not known (Repsilber *et al*, 2010; Erkkilä *et al*, 2010; Bazot *et al*, 2013; Zhong *et al*, 2013; Liebner *et al*, 2014). The latter approaches suffer from similar limitations as the NNCLS method: matrix factorizations are inherently non-unique, and estimation of confidence levels can be difficult. Past attempts to overcome these challenges have relied on the premise that particular genes are expressed in only a single cell type, which does not represent the general biological case (Gaujoux and Seoighe, 2012). A more recent approach, conceptually akin to the methods developed here, uses SnS priors to reduce the mathematical ambiguities inherent in matrix factorization and has recently been applied to the identification of genetic disease pathways (Shen *et al*, 2015), albeit without addressing cell type diversity.

Our approach sidesteps previous limitations by focusing on the fractional prevalence of individual and paired TFs within a parental population, with well-defined values between zero and one, in contrast to expression levels that may be scaled arbitrarily (see Supplemental Information for further discussion). Mathematically, our model resembles previous regression models (Wang *et al*, 2006; Abbas *et al*, 2009; Gong *et al*, 2011; Zuk *et al*, 2013; Grange, *et al*, 2014), but has the virtue of incorporating the SnS prior and a binary set of regressors that span all possible cell types. For the data considered here, the resulting regression problem is highly ill-posed (Supplemental Figure 3), necessitating a fully Bayesian approach to capture uncertainty in resultant estimates.

A distinctive feature of our approach is that we start by considering all of the possible expression patterns for the genes being considered, 2^{19} in our case and, in general, 2* ^{N}* for a study involving

Our analysis has identified extensive transcriptional diversity within V1 interneurons on the basis of the expression of 19 TFs. The first issue this raises is whether further diversity will follow inevitably with the inclusion of additional V1 TFs. We analyzed the impact of varying the number of TFs in our analysis and found only a weak dependence of the number of cell types on the number of TFs (Figure 1E–F). Thus, 19 TFs appear sufficient to uncover a substantial fraction of the total underlying transcriptional heterogeneity.

A second issue is the impact of incorporating spatial information on cell type. In our analysis the inclusion of spatial data increased inferred cell type number by ~70%, and markedly enhanced confidence in the inferred expression profiles. Spatial information has a much stronger impact on cell type assignment than variation in the number of TFs, indicating that settling position carries significant cell type information which is independent from the information carried by the expression patterns of the 19 TFs examined here.

A third issue is the impact of V1 cell type spatial segregation on synaptic input specificity (Bikoff *et al*., 2016). An inordinate diversity in spatially-restricted cell types may be necessary to satisfy highly diverse spinal circuit computations. Our data indicate that spatial segregation only partially accounts for spinal V1 diversity. A large number of inferred cell types, marked by expression of FoxP2, localize in the same ventral region (Supplemental Figure 4). Additional heterogeneity might be realized by the expression of surface molecules that impose and constrain connectivity with different synaptic partners.

We note that transcriptional diversity could, in some instances, reflect variation in functional cell state, rather than indicating a distinct neuronal subtype. However, consistent with the genetic specification of cell type identity, we find that the position of transcriptionally distinct V1 subsets are segregated, stereotyped from animal to animal, and stable across development (see Bikoff *et al*, 2016, Figures 3H and S4A–B). The spatial segregation argues strongly against the ‘cell state’ possibility. Nevertheless, activity-shaped differences in Er81 expression in fast-spiking cortical interneurons have been shown to mark delay-type or non-delayed firing states (Dehorter e*t al*, 2015). In addition, activity-dependent induction of Npas4 expression has been described for cortical neurons (Lin *et al*., 2008), with implications for homeostatic regulation of sensitivity to inhibitory transmitters. Further studies will therefore be needed to dissect the functional consequences of V1 diversity, to resolve whether certain state-dependent functional properties are reflected in the diversity of V1 transcriptional profiles.

Our statistical approach has relevance well beyond a focus on spinal V1 interneurons, and could prove useful in further delineating neuronal cell types elsewhere in the nervous system. Cortical projection neurons fractionate into a few broad classes based on patterns of target innervation and distinctions in gene expression, yet the extent to which any single broad class of pyramidal neurons is itself heterogeneous remains unclear (Greig *et al*, 2013). The classification of interneuron cell types in the brain has proven particularly challenging (Ascoli, 2008; DeFelipe *et al*, 2013; Kepecs and Fishell, 2014), although studies of hippocampal interneuron diversity suggest a degree of heterogeneity that approaches that found for spinal V1 interneurons. Within CA1 hippocampus, over 20 inhibitory interneuron subtypes have been identified, based on anatomical, molecular, or electrophysiological distinctions (Krook-Magnusen *et al*, 2012). Single-cell transcriptome analysis of primary somatosensory cortex or CA1 hippocampus interneurons has identified 16 molecularly distinct interneuron cell types, which likely represents a lower bound on diversity (Zeisel *et al*, 2015). Thus, insight into interneuronal diversity in the spinal cord may inform studies to address heterogeneity throughout the brain.

Our analysis also has implications for genetic strategies aimed at manipulating circuit elements throughout the nervous system. The minimal number of TFs needed to define a single V1 cell type uniquely has been identified on the basis of clade profiles and is, on average, 4 ± 1. This indicates that individual TFs are generally not sufficient to isolate V1 neuronal types, consistent with findings in other neuronal systems (Sanes and Masland, 2015). The difficulty in identifying single TFs that uniquely define a cell type may reflect the prevalence of combinatorial TF codes (Philippidou and Dasen, 2015), and could explain the difficulty in delineating individual motor neuron pools (De Marco Garcia and Jessell, 2008).

Immunohistochemistry was performed as in Bikoff *et al*, 2016. Briefly, p0 *En1::cre; Tau.lsl.nLacZ* mice were transcardially perfused with 4% paraformaldehyde in 0.1M phosphate buffer, followed by a 2 hour postfixation. Tissue was then washed, cryoprotected by equilibration in 30% sucrose in 0.1M phosphate buffer, embedded in OCT and cryostat-sectioned in the transverse plane at 20 μm. Immunohistochemistry was performed on tissue through sequential exposure to primary antibodies (overnight at 4°C) and fluorophore-conjugated secondary antibodies (1 hour at room temperature). Sections were mounted using Fluoromount-G (SouthernBiotech) and coverslipped for imaging. Confocal images were obtained on an LSM 710 Meta Confocal microscope (Carl Zeiss) at 1024×1024 resolution, using a Plan-Apochromat 20x/0.8 M27 objective. See Bikoff *et al*, 2016 for a description of antibodies used.

Confocal images of transcription factor co-expression were analyzed in Imaris (Bitplane) using the “Colocalization” and “Spots” functions with thresholds set to exclude nonspecific background reactivity, followed by manual validation to confirm co-expression within V1 interneurons. For each transcription factor combination, at least 2 independent sections from 3 or more animals were analyzed, totaling >580 lumbar sections and >1100 lumbar hemisections in this dataset.

Interneuron spatial distributions are described in Bikoff *et al*, (2016). Sections were normalized to a standardized spinal cord hemisection (distance from central canal to lateral boundary: 650 μm; distance from central canal to bottom-most boundary: 400 μm). Coordinates were exported from Imaris, and plotted using the contour function in Matlab.

Supplemental Information includes a detailed description of the Bayesian model and Hamiltonian Monte Carlo algorithm for sampling from the posterior distribution of the fractional values ** f** given the observed data, under the SnS prior.

GFP+ cells from lumbar spinal cords of p0 En1::Cre; RCE.lsl.GFP mice were dissociated using the Papain Dissociation Kit (Worthington), and single cells were isolated by uorescence-activated cell sorting using a Beckman Coulter MoFlo Astrios cell sorter. Cells were directly deposited into 96-well plates containing lysis buffer (Ambion Single Cell-to-CT Kit (Life Technologies). Reverse transcription and pre-amplification were performed according to the manufacturer’s protocol (Ambion, Life Technologies). Pre-amplified products were loaded on Biomark 48.48 Dynamic Arrays, and run on the Biomark HD microfluidic multiplex qRT-PCR platform (Fluidigm, San Francisco, CA). Gene expression levels were assayed using TaqMan probes (Life Technologies) directed against housekeeping genes and 15 of the 19 transcription factor-encoding genes, excluding FoxP1, MafA, Nr3b3, and Zfhx4 for which reliable probes could not be identified. (See Supplemental Information Table 2 for a description of TaqMan probes and extended experimental procedures). Ct values were measured by Biomark software, where relative transcript levels were determined by 2–Ct normalization to Actb transcript levels. The co-expression matrix for single cell qRT-PCR data was calculated using an optimization approach. Brie y, a threshold for each gene was computed by minimizing the distance between a co-expression matrix calculated by qRT-PCR and immunohistochemistry. The fraction of cells expressing gene X was computed simply by calculating the number of cells expressing gene X divided by the total number of cells.

Click here to view.^{(839K, pdf)}

Click here to view.^{(16K, docx)}

Click here to view.^{(1.4M, pdf)}

We thank H. Lee for sharing advice on qPCR experiments. R. Axel, W. Fischler, A. Miri, D. Gutnisky and J.C. Tapia provided valuable discussions and comments on the manuscript, and C. Zuker suggested validation experiments. We thank A. Karpova and S. Druckman for quantitative insight. M.I.G. was supported by the National Science Foundation through a GRFP fellowship. T.M.J was supported by NIH grant NS033245, the Harold and Leila Y. Mathers Foundation, the Brain Research Foundation, and Project A.L.S., and is an investigator of the Howard Hughes Medical Institute. L.F.A. was supported by NIH grant MH093338 and by the Gatsby, Swartz and Mathers Foundations. L.P. and A.P. were supported by ONR grant N00014-14-1-0243 and ARO MURI grant W911NF-12-1-0594.

Supplemental Information can be found with this article online at …..

**AUTHOR CONTRIBUTIONS**

**Publisher's Disclaimer: **This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

- Abbas AR, Wolslegel K, Seshasayee D, Modrusan Z, Clark HF. Deconvolution of blood microarray data identifies cellular activation patterns in systemic lupus erythematosus. PLoS ONE. 2009;7:e6098. [PMC free article] [PubMed]
- Alvarez FJ, Jonas PC, Sapir T, Hartley R, Berrocal MC, Geiman EJ, Todd AJ, Goulding M. Postnatal phenotype and localization of spinal cord V1 derived interneurons. J Comp Neurol. 2005;493:177–192. [PMC free article] [PubMed]
- Alvarez FJ, Fyffe RE. The continuing case for the Renshaw cell. J Physiol. 2007;584:31–45. [PubMed]
- Amamoto R, Arlotta P. Development-inspired reprogramming of the mammalian central nervous system. Science. 2014;343:1239882. [PMC free article] [PubMed]
- Armañanzas R, Ascoli GA. Towards the automatic classification of neurons. Trends Neurosci. 2015;5:307–318. [PMC free article] [PubMed]
- Ascoli GA. Neuroinformatics grand challenges. Neuroinformatics. 2008;I:1–3. [PubMed]
- Augen J. Bioinformatics in the Post-Genomic Era: Genome, Transcriptome, Proteome, and Information-Based Medicine. Addison-Wesley Press; 2005.
- Bazot C, Dobigeon N, Tourneret JY, Zaas AK, Ginsburg GS, Hero AO. Unsupervised Bayesian linear unmixing of gene expression microarrays. BMC Bioinformatics. 2013;14:99. [PMC free article] [PubMed]
- Bikoff JB, Gabitto MI, Rivard AF, Drobac E, Machado TA, Miri A, Brenner-Morton S, Famojure E, Diaz C, Alvarez FJ, et al. Spinal inhibitory interneuron diversity delineates variant motor microcircuits. Cell. 2016 this issue. [PMC free article] [PubMed]
- Dalla Torre di Sanguinetto SA, Dasen JS, Arber S. Transcriptional mechanisms controlling motor neuron diversity and connectivity. Curr Opin Neurobiol. 2008;1:36–43. [PubMed]
- Dasen JS, Tice BC, Brenner-Morton S, Jessell TM. A Hox regulatory network establishes motor neuron pool identity and target-muscle connectivity. Cell. 2005;3:477–91. [PubMed]
- Dasen JS, Jessell TM. Hox networks and the origins of motor neuron diversity. Curr Top Dev Biol. 2009;88:169–200. [PubMed]
- DeFelipe J, López-Cruz PL, Benavides-Piccione R, Bielza C, Larrañaga P, Anderson S, Burkhalter A, Cauli B, Fairén A, Feldmeyer D, et al. New insights into the classification and nomenclature of cortical GABAergic interneurons. Nat Rev Neurosci. 2013;14:202–216. [PMC free article] [PubMed]
- Dehorter N, Ciceri G, Bartolini G, Lim L, Pino I, Marín O. Tuning of fast-spiking interneuron properties by an activity-dependent transcriptional switch. Science. 2015;349:1216–1220. [PMC free article] [PubMed]
- De Marco Garcia NV, Jessell TM. Early motor neuron pool identity and muscle nerve trajectory defined by postmitotic restrictions in Nkx6.1 activity. Neuron. 2008;57:217–231. [PMC free article] [PubMed]
- Erkkilä T, Lehmusvaara S, Ruusuvuori P, Visakorpi T, Shmulevich I, Lähdesmäki H. Probabilistic analysis of gene expression measurements from heterogeneous tissues. Bioinformatics. 2010;20:2571–2587. [PMC free article] [PubMed]
- Fyffe RE. Evidence for separate morphological classes of Renshaw cells in the cat’s spinal cord. Brain Res. 1990;1:301–304. [PubMed]
- Gaujoux R, Seoighe C. Semi-supervised Nonnegative Matrix Factorization for gene expression deconvolution: a case study. Infect Genet Evol. 2012;5:913–921. [PubMed]
- Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A. Bayesian Data Analysis. Boca Raton, FL: Chapman & Hall/CRC; 2013.
- Gygi SP, Rochon Y, Franza BR, Aebersold R. Correlation between protein and mRNA abundance in yeast. Mol Cell Biol. 1999;3:1720–1730. [PMC free article] [PubMed]
- Gong T, Hartmann N, Kohane IS, Brinkmann V, Staedtler F, Letzkus M, Bongiovanni S, Szustakowski JD. Optimal deconvolution of transcriptional profiling data using quadratic programming with application to complex clinical blood samples. PLoS ONE. 2011;11:e27156. [PMC free article] [PubMed]
- Grange P, Bohland J, Okaty B, Sugino K, Bokil H, Nelson S, Ng L, Hawrylycz M, Mitra P. Cell-type-based model explaining coexpression patterns of genes in the brain. Proc Nat Acad Sci. 2014;111:5397–5402. [PubMed]
- Greig LC, Woodworth MB, Galazo MJ, Padmanabhan H, Macklis JD. Molecular logic of neocortical projection neuron specification, development and diversity. Nat Rev Neurosci. 2013;11:755–769. [PMC free article] [PubMed]
- Hultborn H, Jankowska E, Lindström Recurrent inhibition of interneurones monosynaptically activated from group Ia afferents. J Physiol. 1971;3:613–636. [PubMed]
- Isaacson JS, Scanziani M. How inhibition shapes cortical activity. Neuron. 2011;2:231–243. [PMC free article] [PubMed]
- Ishwaran, Rao JS. Spike and slab variable selection: frequentist and Bayesian strategies. Ann Statist. 2005;33:730–773.
- Kepecs A, Fishell G. Interneuron cell types are fit to function. Nature. 2014;505:318–326. [PMC free article] [PubMed]
- Klein AM, Mazutis L, Akartuna I, Tallapragada N, Veres A, Li V, Peshkin L, Weitz DA, Kirschner MW. Droplet barcoding for single-cell transcriptomics applied to embryonic stem cells. Cell. 2015;5:1187–1201. [PMC free article] [PubMed]
- Kohwi M, Doe CQ. Temporal fate specification and neural progenitor competence during development. Nat Rev Neurosci. 2013;12:823–838. [PMC free article] [PubMed]
- Krook-Magnuson E, Varga C, Lee SH, Soltesz I. New dimensions of interneuronal specialization unmasked by principal cell heterogeneity. Trends Neurosci. 2012;35:175–184. [PMC free article] [PubMed]
- Liebner DA, Huang K, Parvin JD. MMAD: microarray microdissection with analysis of differences is a computational tool for deconvoluting cell type-specific contributions from tissue samples. Bioinformatics. 2014;5:682–689. [PMC free article] [PubMed]
- Lin Y, Bloodgood BL, Hauser JL, Lapan AD, Koon AC, Kim TK, Hu LS, Malik AN, Greenberg ME. Activity-dependent regulation of inhibitory synapse development by Npas4. Nature. 2008;455:1198–1204. [PMC free article] [PubMed]
- Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, Tirosh I, Bialas AR, Kamitaki N, Martersteck EM, et al. Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets. Cell. 2015;5:1202–1214. [PMC free article] [PubMed]
- Molyneaux BJ, Arlotta P, Macklis JD. Molecular development of corticospinal motor neuron circuitry. Novartis Found Symp. 2007;288:3–15. [PubMed]
- Pakman A, Paninski L. Auxiliary-variable exact Hamiltonian Monte Carlo samplers for binary distributions. Advances in Neural Information Processing Systems. 2013:2490–2498.
- Pakman A, Paninski L. Exact Hamiltonian Monte Carlo for truncated Multivariate Gaussian. Journal of Computational and Graphical Statistics. 2014;23:2.
- Philippidou P, Dasen JS. Sensory-Motor Circuits: Hox Genes Get in Touch. Neuron. 2015;88:437–440. [PubMed]
- Repsilber D, Kern S, Telaar A, Walzl G, Black GF, Selbig J, Parida SK, Kaufmann SH, Jacobsen M. Biomarker discovery in heterogeneous tissue samples -taking the in-silico deconfounding approach. BMC Bioinformatics. 2010;11:27. [PMC free article] [PubMed]
- Sanes JR, Masland RH. The Types of Retinal Ganglion Cells: Current Status and Implications for Neuronal Classification. Annu Rev Neurosci. 2015;38:221–246. [PubMed]
- Sapir T, Geiman EJ, Wang Z, Velasquez T, Mitsui S, Yoshihara Y, Frank E, Alvarez FJ, Goulding M. Pax6 and Engrailed 1 regulate two distinct aspects of Renshaw cell development. J Neurosci. 2004;5:1255–1264. [PMC free article] [PubMed]
- Satija R, Farrell JA, Gennert D, Schier AF, Regev A. Spatial reconstruction of single-cell gene expression data. Nat Biotechnol. 2015;33:495–502. [PMC free article] [PubMed]
- Saueressig H, Burrill J, Goulding M. Engrailed-1 and netrin-1 regulate axon pathfinding by association interneurons that project to motor neurons. Development. 1999;19:4201–4212. [PubMed]
- Sharma K, Schmitt S, Bergner CG, Tyanova S, Kannaiyan N, Manrique-Hoyos N, Kongi K, Cantuti L, Hanisch UK, Philips MA, et al. Cell type– and brain region– resolved mouse brain proteome. Nat Neurosci. 2015;18:1819–1831. [PubMed]
- Shen Y, Rahman M, Piccolo SR, Gusenleitner D, El-Chaar NN, Cheng L, Monti S, Bild AH, Johnson WE. ASSIGN: context-specific genomic profiling of multiple heterogeneous biological pathways. Bioinformatics. 2015;11:1745–1753. [PMC free article] [PubMed]
- Shen-Orr S, Gaujoux R. Computational deconvolution: extracting cell type-specific information from heterogeneous samples, Curr. Opin Immunol. 2013;5:571–578. [PMC free article] [PubMed]
- Siegert S, Scherf BG, Del Punta K, Didkovsky N, Heintz N, Roska B. Genetic address book for retinal cell types. Nat Neurosci. 2009;9:1197–1204. [PubMed]
- Stam FJ, Hendricks TJ, Zhang J, Geiman EJ, Francius C, Labosky PA, Clotman F, Goulding M. Renshaw cell interneuron specialization is controlled by a temporally restricted transcription factor program. Development. 2012;131:179–190. [PubMed]
- Tasic B, Menon V, Nguyen TN, Kim TK, Jarsky T, Yao Z, Levi B, Gray LT, Sorensen SA, Dolbeare T, et al. Adult mouse cortical cell taxonomy revealed by singel cell transcriptomics. Nat Neurosci. 2016 doi: 10.1038/nn.4216. [PMC free article] [PubMed] [Cross Ref]
- Thomas RC, Wilson VJ. Precise localization of Renshaw cells with a new marking technique. Nature. 1965;980:211–213. [PubMed]
- Usoskin D, Furlan A, Islam S, Abdo H, Lönnerberg P, Lou D, Hjerling-Leffler J, Haeggström J, Kharchenko O, Kharchenko PV, Linnarsson S, Ernfors P. Unbiased classification of sensory neuron types by large-scale single-cell RNA sequencing. Nat Neurosci. 2015;18:145–153. [PubMed]
- Vogel C, Marcotte EM. Insights into the regulation of protein abundance from proteomic and transcriptomic analyses. Nat Rev Genet. 2012;13:227–232. [PMC free article] [PubMed]
- Wang M, Master SR, Chodosh LA. Computational expression deconvolution in a complex mammalian organ. BMC Bioinformatics. 2006;7:328. [PMC free article] [PubMed]
- Zeisel A, Munoz-Manchado AB, Codeluppi S, Lonnerberg P, La Manno G, Jureus A, Marques S, Munguba H, He L, Betsholtz C, et al. Brain structure. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science. 2015;347:1138–1142. [PubMed]
- Zhang J, Lanuza GM, Britz O, Wang Z, Siembab VC, Zhang Y, Velasquez T, Alvarez FJ, Frank E, Goulding M. V1 and V2b interneurons secure the alternating flexor-extensor motor activity mice require for limbed locomotion. Neuron. 2014;1:138–150. [PMC free article] [PubMed]
- Zhong Y, Wan YW, Pang K, Chow LM, Liu Z. Digital sorting of complex tissues for cell type-specific gene expression profiles. BMC Bioinformatics. 2013;14:89. [PMC free article] [PubMed]
- Zuk O, Amir A, Zeisel A, Shamir O, Shental N. Accurate profiling of microbial communities from massively parallel sequencing using convex optimization. String Processing and Information Retrieval 2013

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |