Multiparametric single-cell analysis has advanced our understanding of diverse biological and pathological processes, providing insights into cellular differentiation, intracellular signaling cascades and clinical immunophenotyping. The use of flow cytometry has increased steadily, fueled by growing interest in the identification of rare stem cell populations and the use of intracellular markers (e.g., phosphorylated proteins) for drug targeting. Modern flow cytometers typically provide simultaneous single-cell measurements of up to 12 fluorescent parameters in routine cases, and analysis of up to 17 protein parameters has been reported [1]. Recently, the first commercial next-generation mass cytometry platform (CyTOF™, DVS Sciences Inc., Toronto, ON, Canada) has become available, allowing routine measurement of 30 or more single-cell parameters [2]. Despite growing research in cytometric analysis and technological advances that increase the number of parameters acquired per single cell, methods for analyzing multidimensional single-cell data remain inadequate. We present a novel analytical approach, Spanning-tree Progression Analysis of Density-normalized Events (SPADE), to organize high-dimensional cytometry data in an unsupervised manner, and to investigate natural and pathogenic cellular heterogeneity for biological insight.
Traditional methods for flow cytometry data analysis are often subjective and labor-intensive, and they require expert knowledge of the underlying cellular phenotypes. One common but cumbersome step is the selection of subsets of cells in a process called “gating” [3]. A gate is a region, defined in a biaxial plot of two measurements, that is used to select cells with a desired phenotype for downstream analysis. Gates are either drawn manually using software such as FlowJo (http://www.treestar.com/) or FlowCore [4], or defined automatically by clustering algorithms [5]. Manual gating is highly subjective and dependent on the investigator’s knowledge and interpretation of the experiment. Automatic gating algorithms cluster cells by optimizing the objective that cells in the same cluster be more similar to each other than to cells from other clusters. Because these algorithms strive to define maximally different clusters, they often miss the underlying continuity of phenotypes (progression) that is inherent in cellular differentiation [11]. In addition, the optimization objectives of most automatic gating algorithms are predisposed to capture the most abundant cell populations, while rare cell types, such as stem cells, are either excluded as outliers or absorbed by larger clusters. Some algorithms, such as SamSPECTRAL, a recent approach for automated gating, have begun to include mechanisms for identifying rare cell types [12].
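As an illustration of the gating step described above, the following is a minimal sketch of a rectangular gate applied to a biaxial plot, followed by a second gate applied to the surviving cells. The function name `rectangular_gate`, the marker names, and the intensity values are hypothetical and chosen only for the example; real gates are often polygons drawn on transformed data, which this sketch does not attempt to reproduce.

```python
def rectangular_gate(cells, x_marker, y_marker, x_range, y_range):
    """Select cells inside a rectangle drawn on one biaxial plot.

    Only the two plotted markers are consulted; every other measurement
    on the cell is ignored by the gate.
    """
    x_lo, x_hi = x_range
    y_lo, y_hi = y_range
    return [
        cell for cell in cells
        if x_lo <= cell[x_marker] <= x_hi and y_lo <= cell[y_marker] <= y_hi
    ]


# Hypothetical toy events: marker names and intensities are made up.
cells = [
    {"id": 1, "CD45": 3.2, "CD3": 2.9, "CD19": 0.1},
    {"id": 2, "CD45": 3.0, "CD3": 0.2, "CD19": 2.7},
    {"id": 3, "CD45": 0.3, "CD3": 0.1, "CD19": 0.2},
    {"id": 4, "CD45": 2.8, "CD3": 3.1, "CD19": 0.3},
]

# Gate 1 keeps CD45-high cells (the CD3 axis is left unconstrained).
cd45_high = rectangular_gate(cells, "CD45", "CD3", (2.0, 4.0), (0.0, 4.0))
# Gate 2, applied to the survivors, keeps CD3-high, CD19-low cells.
t_like = rectangular_gate(cd45_high, "CD3", "CD19", (2.0, 4.0), (0.0, 1.0))

gated_ids = [cell["id"] for cell in t_like]
print(gated_ids)
```

Each gate inspects only two markers, so the selected population depends entirely on which plots the investigator chooses to draw and where the boundaries are placed.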
Traditional cytometry data analysis methods also commonly suffer from limitations in scalability and visualization, and these limitations become more acute as the number of measurements per single cell increases. To fully visualize an $m$-dimensional flow dataset, $\binom{m}{2} = m(m-1)/2$ biaxial plots are needed, where each biaxial plot displays the correlation of only two markers at a time. It is difficult to comprehend the correlations among three or more markers from a series of biaxial plots. One recent approach that partly addresses the scalability issue is the probability state model, implemented in the Gemstone™ software package (Verity Software House, Inc.). This approach rearranges cells into a non-branching linear order, according to an investigator’s knowledge or expectation of how known markers fluctuate along a progression underlying the measured cell population [13]. Because cells are ordered in a non-branching fashion, a new model must be constructed for each mutually exclusive cell type (e.g., T cells, B cells).
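To make the scalability problem concrete, note that one biaxial plot is needed per unordered pair of markers, so an m-dimensional dataset requires m(m-1)/2 plots. The short sketch below evaluates this count for the panel sizes mentioned in the text (8, 12, 17 and 31 parameters); the helper names `biaxial_plot_count` and `biaxial_pairs` are hypothetical.

```python
from itertools import combinations


def biaxial_plot_count(m):
    """Number of distinct biaxial plots for m markers: m choose 2."""
    return m * (m - 1) // 2


def biaxial_pairs(markers):
    """Enumerate every unordered marker pair, one per biaxial plot."""
    return list(combinations(markers, 2))


for m in (8, 12, 17, 31):
    print(m, biaxial_plot_count(m))
# 8 28
# 12 66
# 17 136
# 31 465
```

The count grows quadratically, so moving from an 8-parameter panel to a 31-parameter panel raises the number of plots from 28 to 465, and each plot still shows only two markers at a time.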
SPADE complements existing approaches for analyzing cytometry data by providing a visualization of multiple cell types in a branched tree structure that is constructed without requiring the user to define a known cellular ordering. Through a simple 2D visualization, SPADE shows how measured markers behave across all cell types in the data, whereas gating focuses only on user-selected cell types. SPADE partitions cytometry data into many hierarchically organized clusters that reflect all of the dimensions in the data, thus empowering investigators to identify and annotate known cell types, and to find unexpected ones. To demonstrate SPADE’s ability to detect a branched hierarchy underlying a heterogeneous population of real cells, we applied SPADE to a conventional (8-parameter) flow cytometry dataset of normal mouse bone marrow, a well-defined biological system with multiple known developmental transition points. We demonstrated the scalability of SPADE with a next-generation (31-parameter) mass cytometry dataset of normal human bone marrow acquired using two staining panels and multiple experimental stimulatory conditions [14]. SPADE organized the data in a tree structure that partially recapitulated known biology of hematopoiesis, identified surrogate markers that define a functionally distinct cell type, overlaid data from complementary staining panels, and mapped intracellular signal activation of functional markers across a landscape of hematopoietic development.