A number of biologically important processes involve transitions through distinct cell states. Differentiation1,2,3,4,5,6,7,8,9
and disease initiation and progression12,13,14
are among the many examples of this kind. State changes in such processes are in general stochastic, as reflected in experimentally observed variation in transition latency even in the setting where transitions arise in homogenous cell cultures subjected to defined driving events (e.g. Hanna et al.17
Stochasticity of transitions at the single-cell level () imply that during such a process a cell population is a mixture of cells in different states, with the state composition of the cell population itself time-varying (). Studying single-cell events in heterogenous, time-varying populations is challenging and the global changes in single-cell transcriptional, metabolic, and epigenetic state that are involved in these processes remain incompletely understood. High-throughput assays based on homogenates provide only population-averaged data; in transition processes such data represent averages over heterogenous states (). Genome-wide single-cell protocols are now emerging2,4
, but their efficiency, availability and depth remain limited. Furthermore, these are not live cell assays, so cannot be used to directly track genome-wide molecular profiles of single cells undergoing state transitions.
Stochastic cell state transitions and population-averaged molecular data (illustrated, without loss of generality, with reference to reprogramming and gene expression data).
Here we present a general stochastic model of transition processes that links parameters at the single cell level to time-course data at the cell population level, as obtained for example in conventional expression, proteomic or epigenetic assays based on homogenates. The key novelty of our approach is to specify latent stochastic models at the single-cell level and then (mathematically) aggregate the models to give a likelihood at the level of homogenate data. As we show below, this allows parameters specific to single-cell states and transitions between them to be estimated from homogenate, time-course data. To facilitate analysis of data collected at non-uniform time points we use continuous-time Markov processes as the single-cell models.
Estimation of model parameters from population-averaged time-course data then gives information on several aspects of the single-cell states and transitions, including:
- Single-cell state profiles (e.g. state-specific expression, protein or epigenetic profiles);
- State markers (e.g. genes, proteins or marks that are highly specific to individual states); and,
- Dynamical information concerning transition rates, cell residence times, and population composition through time.
To fix ideas and illustrate our approach, we develop and apply our model in the context of reprogramming of mouse embryonic fibroblasts (MEFs) to a state of pluripotency10,15,16
. This is a process that has been widely studied in recent years, and where a number of advanced experimental approaches have been brought to bear. Recent studies have shown that reprogramming has a substantial stochastic component. Subclones derived from the same transduced somatic cells activate pluripotency markers, such as Nanog-GFP, at very different times, over a range of a few weeks10,15,16
. Further, there is evidence that the entire cell population has the potential to give rise to pluripotent cells during direct reprogramming, i.e., there is not an “elite” group of cells that are uniquely able to do so17
. Thus, current evidence suggests reprogramming is an inherently stochastic process17
in which individual cells change from an initial differentiated state to an induced pluripotent stem cell (iPSC) state. Single-cell studies using pre-selected sets of genes have begun to elucidate cellular events in reprogramming19,20,21,22
. However, at the genome-wide level many questions remain open and our understanding of the state transitions, including the number of traversed states, their marker genes and transition rates, remains limited.
In our model, we suppose that a cell can stochastically visit a set of n
states during the transition process (). Transitions between these states are described by a latent continuous-time Markov process whose discrete state space is identified with cell states (see Methods and SI
for details). The model parameters are (i) the transition rates wi,i′
between states i
′ and (ii) state-specific parameters βij
that represent the mean expression level for gene j
in state i
(we focus on transcriptomic data here, but the analysis could be readily applied to e.g. proteomic or epigenomic data). We refer to the β
's as state-specific signatures (). The population dynamics are characterized solely by the transition rates: given the rates wi,i′
, the Markov model yields the probability pi
) of being in state i
at time t
. For a large number of cells, the population-averaged expression xj
) of gene j
at time t
is then a combination of state-specific expression levels weighted by the probability of being in each state ():
can be estimated from time-course data. Complicated transition networks may require ancillary data to ensure identifiability. Here, for simplicity, here we limit ourselves to consider only linear forward-transition models (i.e., no reverse arrows in ); this constraint allows direct application to conventional, time-course data. In the reprogramming context, we note that recent data17
support the idea that almost all donor cells eventually give rise to iPS cells. These results were determined from single cell assays that observed the appearance of one marker for the final state, expression of the Nanog pluripotency gene, and indicated an irreversible switch to pluripotency during reprogramming11
. For the reprogramming application we present, as discussed in detail below, we further assume that all cells in the starting population are in an initial (differentiated) state.
We put forward a computationally efficient approach for estimation, as implemented in software called STAMM (State Transitions using Aggregated Markov Models; see Methods and SI
for details). As we illustrate below, STAMM can be readily applied to full-genome studies. Furthermore, since STAMM is rooted in a probabilistic model, model selection methods allow exploration of the likely number of single-cell states in a transition process of interest.
Hidden Markov models (HMMs) are widely used to describe latent processes in biological applications and have previously been used to describe cell populations26
and model the cell cycle27,28
. It is interesting to contrast our model with a classical HMM. The key differences are twofold. First, our model involves aggregation of single-cell level Markov chains, thus it deals with states that are not only hidden, but whose connection to population-level observables necessarily involves averaging over multiple instances of the latent process. In contrast, a HMM applied to time-course data from a transition process does not provide a model at the single-cell level. Second, our model operates in continuous time and applies naturally to non-uniformly sampled data. In contrast, in a HMM the underlying Markov process operates in discrete time, such that the probability of a state transition is the same between successive time points regardless of the intervening time period. This assumption generally will not hold at all under uneven time sampling of a heterogenous population. Due to these reasons, in our view HMMs are intrinsically ill-suited to the study of transition processes of the type we consider here.
An alternative approach to using HMMs is to attempt direct deconvolution based on a model of single-cell expression profiles, e.g.29,30,31
. These approaches have greater deconvolution power but are hindered by the upfront requirement for an expression model. For example, Siegal-Gaskins et al.29
established a model for the progression of Caulobacter
cells through their own cell cycle. Similarly, Rowicka et al.31
measured the distribution of cell cycle time-shifts based on well-known cell cycle regulated genes. While these methods can in principle be adapted for other organisms and systems, the STAMM method presented here is immediately and directly applicable to time-course data from general transition processes. In particular, it is not necessary to find genes following regular expression profiles, nor is it necessary to have an a priori
knowledge of the phases of the process of interest, which in many applications, including reprogramming as considered here, remain incompletely understood.
Differential expression analysis is widely used to highlight potentially important players in high-throughput studies. Approaches have also been proposed for time course data that rank genes (or proteins) according to whether they show evidence of change over time or relative to a control time course23,24,32
or that cluster together genes that show similar temporal profiles25
. However, these approaches do not attempt to model single-cell state transitions nor account for cellular heterogeneity.