Flow Cytometry is among the most widely used platforms in biomedical research and clinical labs. It is used for investigation of a wide variety of biological problems at single cell level. Classical applications of flow cytometry include quantitative measurements of DNA content and cell cycle progression [
1]. It is also one of the key platforms for studying dynamic cellular properties such as differentiation, proliferation and apoptosis, especially in the contexts of stem cells and cancer [
2]. Such applications make flow cytometry the ideal platform for the purpose of identifying and monitoring the myriad states and functions in different specimens that vary over time under particular conditions and stimuli.
Typically, a flow sample is stained with fluorescent dyes, possibly attached to antibodies, and per cell events such as the expression of a cell-surface marker or the DNA content are measured in terms of fluorescence intensity. The distribution of these events are then plotted or modeled statistically for identification of important features in the sample. While developments in computational cytomics have produced many useful analytical methods (e.g. [
3]), several important problems have not yet been addressed adequately. One such issue involves precise parametric modeling of dynamic features in temporal profiles such that the model parameters can characterize the transition of the populations in a sample through different cellular states. Often simple statistics such as population mean or size can be imprecise in the presence of unusually shaped distributions and outliers in temporal profiles. The modeling scenario could be complicated further by the adoption of different trajectories by different subpopulations. Indeed a rigorous algorithm for modeling cellular state transitions can not only automate the traditionally manual approach, which is subjective and labor-intensive, but also extend it to increasingly complex and high-throughput experiments.
Many major cytometric studies have highlighted the importance of characterizing temporal profiles at single cell resolution for a variety of purposes such as cell cycle expression kinetics (e.g. [
4,
5]), pharmacodynamics (e.g. [
6,
7]), signaling alterations in specific subpopulations (e.g. [
8,
9]), dynamics of differentiation into distinct lineages (e.g. [
10,
11]), and so on. Clearly, mathematical formulation of a cellular state-space, and the transitions therein, can help us model a given collection of temporal flow cytometric profiles with the required rigor. Thereupon we can study the changes in features (say, in comparison with those in control profiles) and monitor trends in parametric detail. Precise probabilistic modeling of sample distributions at each stage can automatically reveal such dynamic features as emergence of a tail subpopulation or change in the skewness of a cluster that are statistically well-defined as well as biologically insightful [
3].
Temporal profiling of cellular state transitions in flow data can, however, present unique modeling challenges. Often the transient states produce non-Gaussian features such as asymmetric or trailing subpopulations owing to rush or delay in progression from one state to another [
5]. Intermediate states might also produce outliers that cannot be clearly distinguished from the more distinctive states. Moreover certain metastable states may appear only inconsistently in a given time course [
11]. Often the transient features appear and disappear at the tails of the more prominent distributions, and may be hard to model via automation. Thus a framework that uses robust probabilistic density functions to model time course data may be the best way to represent the underlying state-space, and reveal any sudden or gradual transition therein. In terms of the distribution of events in a flow sample, characteristics of different states may be determined by variation in size (say, percentage of cells in a peak or cluster), location (such as mean or mode) or significance (peak density) of the model components. While traditionally such changes were detected with manual or non-parametric techniques, several model-based frameworks have recently been applied with success, e.g. [
3,
12-
15].
Here we present StateProfiler, a new framework based on finite mixture models of skew t-Normal distributions (STNMIX) for statistical characterization of flow cytometric time course data. In particular, we present in StateProfiler a new greedy Expectation-Maximization (EM) algorithm for fitting our STNMIX model. The greedy EM algorithm starts with a minimum number of distributions (or components) and sequentially inserts a new component to the mixture until model convergence is achieved. This parsimonious approach allows us to detect the dynamic appearance (and disappearance) of transient features that are characteristic of many state transitions. In addition, intermediate states are known to produce spatial features in the form of distributions with unusual shapes or low separation, which can lead to overlapping components, and hence to an overestimated number of model components. For optimal model selection, we therefore also provide in StateProfiler a new procedure for merging skew t-Normal components that are significantly overlapping in the mixture such that the change in entropy of the resulting model is minimal. Besides profiling of unusually shaped distributions and less well-separated features, this allows StateProfiler to tackle cellular heterogeneity that exists even within clonal populations.
We applied StateProfiler to learn the temporal features of cell cycle progression in two mutant strains of budding yeast Saccharomyces cerevisiae. Based on knockout of S-phase triggering cyclins Clb5 and Clb6, we compared the S-phase delay phenotypes resulting from the differential regulation of the two cyclins. Also we used StateProfiler to construct the overall temporal profile of clonal divergence underlying lineage selection in mammalian hematopoietic progenitor EML cells. By comparing the fitted models at each time point, we observed a slow and non-montonic convergence of clonal outlier subpopulations to a final median state.