|Home | About | Journals | Submit | Contact Us | Français|
Advances in multi-parameter flow cytometry (FCM) now allow for the independent detection of larger numbers of fluorochromes on individual cells, generating data with increasingly higher dimensionality. The increased complexity of these data has made it difficult to identify cell populations from high-dimensional FCM data using traditional manual gating strategies based on single-color or two-color displays.
To address this challenge, we developed a novel program, FLOCK (FLOw Clustering without K), that uses a density-based clustering approach to algorithmically identify biologically relevant cell populations from multiple samples in an unbiased fashion, thereby eliminating operator-dependent variability.
FLOCK was used to objectively identify seventeen distinct B cell subsets in a human peripheral blood sample and to identify and quantify novel plasmablast subsets responding transiently to tetanus and other vaccinations in peripheral blood. FLOCK has been implemented in the publically available Immunology Database and Analysis Portal – ImmPort (http://www.immport.org) for open use by the immunology research community.
FLOCK is able to identify cell subsets in experiments that use multi-parameter flow cytometry through an objective, automated computational approach. The use of algorithms like FLOCK for FCM data analysis obviates the need for subjective and labor intensive manual gating to identify and quantify cell subsets. Novel populations identified by these computational approaches can serve as hypotheses for further experimental study.
Flow cytometry (FCM) is a method used to detect the presence and amount of cellular constituents in large numbers of single cells allowing for the enumeration and functional characterization of different cell types in complex cellular mixtures (9, 22, 23, 30, 46, 52). FCM can be used to measure the expression of membrane proteins, intracellular and secreted cell constituents, nucleic acids, phospholipids and post-translationally modified forms of proteins allowing FCM to be used to monitor cellular processes like DNA replication, programmed cell death and signal transduction activation (14, 18, 21, 24, 25, 27, 35). Thus, FCM has become an invaluable tool for investigation of basic physiological and biological processes in the experimental lab.
FCM has also become important as a diagnostic tool in the clinical lab (3, 5, 19, 37, 38, 50). For example, Chronic Lymphocytic Leukemia (CLL) and Monoclonal B cell Lymphocytosis (MBL) are both diagnosed based on abnormal patterns of CD5, CD20 and immunoglobulin light chains expression as assessed by FCM, even though the latter may show normal absolute leukocyte counts (32, 36, 44).
In the last decade, FCM has experienced dramatic technical advances in both instrumentation and reagent development such that the simultaneous assessment of 19 fluorescent and light scatter parameters on each individual cell is possible (8, 41). However, the standard methods for analyzing FCM data have not kept pace with these advances in technology. The majority of investigators still rely on iterative manual gating strategies in two-dimensional dot or contour plots to identify cell population subsets, which is somewhat subjective, time-consuming and difficult to reproduce even by the same investigator. In addition, if each of the 19 parameters is simply used to divide cell mixtures into expressing (+) and non-expressing (−) subsets, the number of potential cell populations could be as large as 524,288 (219). Thus, novel techniques are required to eliminate the dependence on manual gating in order to comprehensively analyze these rich datasets and extract biologically relevant findings.
Several groups have proposed new strategies for FCM data analysis, including a series of recent reports based on the modeling of cell population characteristics as mixtures of mathematical distributions (13, 17, 29, 42). Here we describe a model-independent approach – FLOw Clustering without K (FLOCK) – that utilizes grid-based partitioning and density distribution analysis to identify cell populations in high dimensional FCM data. The utility of FLOCK is demonstrated by the identification of seventeen B cell subsets in human peripheral blood and the monitoring of B cell population dynamics induced in response to vaccination.
Eight healthy adult subjects ages 22 to 65 years (mean 42 ± 15 year) were enrolled in this study at the University of Rochester Medical Center in 2008–2010; three were men and five were women. Peripheral blood mononuclear cells (PBMC) were isolated from heparinized blood by Ficoll-Hypaque density gradient centrifugation (Pharmacia Biotech). Seven subjects received the following vaccinations separately or in combination as a part of routine medical care: tetanus and diptheria toxoids (Td, Sanofi Pasteur, Swiftwater, PA), trivalent influenza vaccine 2009 (Fluzone, Sanofi,Pasteur. Swiftwater, PA), H1N1 monovalent influenza vaccine 2009 (Sanofi Pasteur, Swiftwater, PA), and Hepatitis Aand Hepatitis B (Twinrix, GlaxoSmithKline, Research Triangle Park, NC). PBMC were isolated on days 7 for all vaccination subjects; one subject had blood analyzed pre and post tetanus vaccination on days 5, 6, 7, 8, and 15. All studies were approved by the Research Subjects Review Boards at the University of Rochester Medical Center.
FCM data from three sample sets were analyzed using FLOCK in this manuscript. PBMC from the one untreated subject were stained with the following anti-human antibody staining reagents: IgD-FITC (BD Biosciences, San Diego, CA), CD1c-PE (Miltenyi, Auburn, CA), CD24-PE Alexa610 (Invitrogen), IgG-PE Cy5 (BD), CD3-PerCP Cy5.5 (BD), B220-PE Cy7 (BD), CD38-PacificBlue (BD custom conjugate), live/dead cell-staining PacificOrange Aqua (Invitrogen, Carlsbad, CA), CD27-APC (eBioscience, San Diego, CA), and CD19-APC-Cy7 (BD Biosciences, San Diego, CA). Note that the B220-PE-Cy7 reagent contains the RA3-6B2 monoclonal antibody, which recognizes the B220 glycoform of the CD45R protein. Stained cells were analyzed on an LSRII flow cytometer (Becton Dickinson, Franklin Lakes, NJ). Single conjugated mouse anti-human antibodies were added to Simply Cellular Anti-mouse compensation standard bead (Bangs Laboratories, Fishers, IN) and compensated automatically using Flow jo.8.8.6 (Treestar, Ashland, OR). The dataset used was CD3-CD19+ live lymphocyte gated (~67,000 events after gating) to focus the analysis on B cell populations. Data for CD3, CD19, CD1c and Aqua fluorescence were excluded from the FLOCK analysis.
The second dataset (tetanus study) examined B cell populations in serial PBMC samples from another subject after tetanus vaccination. Samples were taken immediately before vaccination (Day 0) and on Day 5, 6, 7, 8, and 15 after vaccination. PBMC were stained with the following anti-human antibodies to surface markers at 4°C for 20 minutes: IgD-PE (BD Biosciences, San Diego, CA), CXCR4-PE-Cy5 (eBioscience, San Diego, CA), CD3-PE-Cy5.5 (Caltag Laboratories, Carlsbad, CA), CD14-PE-Cy5.5 (Invitrogen, Carlsbad, CA), CD19-PE-Cy7 (BD Biosciences, San Diego, CA), CD38-Pacific Blue (BD Biosciences, San Diego, CA), CD138-APC (Miltenyi Biotec, Auburn, CA), HLA-DR-AlexaF700 (eBioscience, San Diego, CA), and CD27-APC-Alexa750 (eBioscience, San Diego, CA). Cells were then washed and resuspended in Cytoperm/Cytofix (BD Biosciences, San Jose, CA) containing anti-human Ki67-FITC antibodies (Invitrogen, Carlsbad, CA), and incubated for 20 minutes at 4°C. The dataset analyzed using FLOCK included onlyCD3-CD14-CD19+ gated events – Day 0 (13721 events), Day 5 (27832 events), Day 6 (27627 events), Day 7 (32204 events), Day 8 (31640 events), and Day 15 (39208 events).
The third data set (cross vaccine study) included PBMC samples from different subjects collected seven days after vaccination with various combinations of tetanus and diptheria toxoid, trivalent influenza 2009, monovalent H1N1 influenza 2009, Hepatitis Aand Hepatitis B vaccines. Samples were stained as above except excluding HLA-DR-AlexaF700. The dataset analyzed using FLOCK included only CD3-CD14-CD19+ gated events – 21702-8 (61691 events), 21702-19 (40333 events), 21702-6 (48124 events), 25251-134 (89261 events), 3334-139 (23951 events), and 21702-12 (68882 events).
PBMC were added to 96-well ELISpot plates (MAIPS4510 96 well) coated with tetanus toxoid (1 Lf/mL Cylex Incorporated, Columbia, MD) or anti-human IgG (5μg/mL, Jackson Immunoresearch, West Grove, PA) and incubated overnight. Wells were washed and bound antibodies were detected with alkaline phosphatase-conjugated anti-human IgG antibody (1μg/mL, Jackson Immunoreseach), and reactions developed with VECTOR Blue Alkaline Phosphatase Substrate Kit III (Vector Laboratories, Burlingame, CA). Spots in each well were counted using the CTL immunospot reader (Cellular Technologies Ltd, Shaker Heights, OH).
PBMC were isolated from a subject on day 7 after trivalent influenza vaccination and stained with the vaccination panel as described above. Three plasmablast/plasma cell populations identified by FLOCK were sorted and total cellular RNA was isolated using the RNeasy Mini Kit (Qiagen Inc., Valencia, CA) according to the manufacturer's protocol. After DNase I treatment (Invitrogen, Carlsbad, CA), 1 μg of RNA was subjected to reverse transcription using the iScript RT Kit (BioRad Inc., Hercules, CA). Aliquots of the resulting single-stranded cDNA products were included with BLIMP-1, Pax5, or GAPDH primers (6) in a 20 μl PCR reaction using HotStarTaq PCR Kit (Qiagen Inc., Valencia, CA), and amplified using a Bio-Rad MyCycler (Bio-Rad, Hercules, CA) with the following conditions: after an initial step of 95°C for 5 min, 40 cycles of 95°C for 30 sec, 55°C for 30 sec, and 68°C for 1 min, ending with a final extension step of 68°C for 5 min.
There are five major components in a FLOCK analysis pipeline: data preprocessing, grid-based density clustering, cross-sample comparison, result visualization, and population statistics calculations.
Data preprocessing consists of FCS file conversion and data normalization. FCS files used in this paper have been compensated before preprocessing. Binary .fcs files were converted to tab-delimited ASCII text format using the channel number export function in the FlowJo software package (Tree Star, Inc.). The exported data constitute a data matrix table with each row corresponding to an event (cell) and each column to either a light scatter parameter such as FSC or a fluorescence emission channel such as FL1.
As the exported matrix columns have different value ranges, they could potentially contribute differently to the similarity of events being measured and used for clustering. To balance their contributions to the similarity measure, each column in the data matrix was normalized. Two normalization methods have been used successfully in our studies. For the peripheral B cell subset analysis, the data matrix was column-wise z-score normalized. Z-score normalization has been shown to be effective in classifying hematopoietic bone marrow cell populations (57). For the tetanus study, the data matrix was column-wise, 0–1, min-max normalized so that the Stone’s binning method (51) could be used. The formulas for the z-score and the min-max normalization are:
where S is the original value, S’ the normalized value, μand σ the average and the standard deviation of the data column of S, respectively, Min and Max are the smallest and the largest value of the data column of S. Although normalized data are used for event clustering, original values are used for visualization and population statistics calculations.
The core of the FLOCK algorithm begins by first partitioning each dimension into equally sized bins, which results in partition of the n-dimensional space into “hyper-regions”. Figure 1A shows a 2D mock example of how the data space is partitioned. However, in the FLOCK algorithm, partitioning is applied to all dimensions available in the dataset simultaneously. Each dimension of the data space is partitioned into the same number of equal-sized bins. Therefore a d-dimensional data space is partitioned into t*d hyper-regions with t equal-sized bins in each dimension.
In the second step, each hyper-region is assessed to determine the number of events present, and any hyper-region in which the number of events exceeds a certain threshold is labeled as being “dense” (Figure 1B). Equal-sized binning generates hyper-regions of equal volume. Therefore we can define the density of a hyper-region as the number of events in the region. A density threshold is used to distinguish a dense hyper-region from sparse and empty hyper-regions. As the density threshold increases, the number of dense hyper-regions decreases.
In the third step, dense hyper-regions adjacent to each other in n-dimensional space (two dense hyper-regions are adjacent if they are next to and contiguous with each other in one dimension and have the same position on all other dimensions) are grouped together into dense hyper-region groups, and the centroid (a point whose coordinates are the averages of the corresponding coordinates of a given set of point) representing all events in each dense hyper-region group determined (Figure 1C).
Finally, the dense hyper-region centroids are used to assign each event to its closest centroid based on Euclidean distance (Figure 1D). Other distance metric (e.g. Mannahalobis distance) could be used, but these have not yet been tested to determine their affect on the FLOCK results. Then the centroids are updated based on the member events of the dense regions and the cluster membership is computed again with the new centroids. This procedure can be regarded as a modified K-means procedure with only a few iterations. As initial centroids are not randomly generated, but rather based on data density, FLOCK quickly converges to a stable result during the centroid recalculation. This approach provides for two important advantages to traditional K-means– it determines – the number of populations in the high dimensional space from the number of dense hyper-region groups and it arrives at an estimated location of the center of the density distribution for each.
Two parameters need to be provided to execute the core of the FLOCK algorithm the number of partitions used to divide each dimension (# of bins) and the value of the density threshold cut-off. Both parameters appear to be dependent on a number of characteristics being measured, i.e. the number of events captured and the distribution of the data in the n-dimensional space. In order to automatically select the number of partitions, two different methods were employed. The first method is based on our observation of the growth of non-empty hyper-regions generated, which initially increases as the number of bins increases, but then starts to decrease when the number of bins used per dimension is larger than necessary. For the peripheral B cell subset analysis, we chose a bin number to maximize the ratio of the number of non-empty hyper-regions versus the number of bins to ensure that distinct populations are separated into different hyper- regions. For the tetanus data set, Stone’s method (51, 54) was used to optimize the bias-variance trade-off for different number of bins, which allows FLOCK to partition the space sufficiently in order to ensure that distinct cell populations are separated, while avoiding excessive partitioning that would eliminate the density characteristics of the data space.
The other parameter to select is the density threshold cut-off. Different cut-off values lead to different clustering results. The one that gives rise to the best Silhouette value (47) in the clustering is used. To avoid exhaustively testing each possible cut-off value we employed several different threshold selection methods, and the method that generates the best Silhouette value is chosen. For the peripheral B cell subset analysis, an ad hoc method designed to distinguish the dense hyper-regions from background based on the average density of the hyper-regions was used. For the tetanus data set, the density threshold cut-off was selected based on the minimum description length (MDL) principle (2, 45) that is commonly used to identify the best cut-off value within a data sequence. We have also developed another ad hoc method to identify the inflexion point of the decrease of the number of dense hyper-regions as the cut-off increases, which usually generates a lower density threshold value than the MDL principle and is more effective at identifying sparse cell populations. The use of these different approaches for density threshold estimation allows FLOCK to be tailored for each data set and to identify both relatively rare and relatively abundant cell populations.
The FLOCK algorithm has been implemented in the Immunology Database and Analysis Portal (ImmPort; http://www.immport.org). The runtime of a single FLOCK analysis is largely dependent on the number of events in a single data file. A relatively large file of ~2 million events returned results in less than 20 minutes; more typical files in the range of 10 thousand to 100 thousand events return results in less than 2 minutes.
The visualization module of FLOCK as implemented in ImmPort supports 2-dimensional dot plots with each population colored uniquely. An individual population can be viewed in its own color with other populations in white as background. Users can navigate the multidimensional space by changing the dimensions or populations that are displayed.
Expression profiles of a population are presented that indicate the approximate expression level of each marker for the population. To generate the profile, FLOCK partitions each dimension into four bins with equal width. The four bins, from lowest values to highest values, are denoted with integers 1, 2, 3, and 4, respectively. The bin a population belongs to is decided by the majority vote of population events, and therefore each population’s profile is represented with a set of the integers –one integer (from 1–4) per marker.
Statistics calculated in FLOCK include population proportions, mean fluorescence intensity (MFI), cell count, standard deviation of the population on each parameter, and coefficient of variation.
For the analysis of the tetanus and cross vaccine studies, the group of population centroids generated by FLOCK analysis of a single sample was applied to multiple samples to facilitate population comparison. Each event is assigned to its closest centroid in each sample. Populations generated in such a way have the same ID and color across all samples. For these data, the FCM results from the tetanus study Day 6 sample was used to identify the initial centroid locations.
FLOCK was used as an objective method for the identification of B cell subsets in a dataset derived from the staining of a peripheral blood mononuclear cell (PBMC) sample derived from a normal human volunteer with reagents selected to identify various B lymphocyte (CD19+, CD3-) cell types based on six cell characteristics - expression of IgD, IgG, CD27, CD38, CD24, and B220. Seventeen distinct cell populations were identified by FLOCK using these characteristics. Figure 2A shows three two-dimensional dot plot combinations (IgD vs CD27; CD24 vs B220; IgG vs CD38) with cell events in a given B cell population subset highlighted in the same color. Six distinct regions can be observed from IgD versus CD27 dot plots, which appear to match previously defined B cell subsets (1, 20, 53, 55):
FLOCK was able to further identify additional subsets within each of these regions from the differential expression of the remaining markers.
While an IgD+CD27- phenotype is commonly used to globally define naïve B cells, there is mounting evidence demonstrating heterogeneity within this population, including the presence of early (T1 and T2) and late (T3, also called pre-naïve) transitional cells (28, 39). As shown in Figure 2B, FLOCK is indeed able to further subdivide this larger population into several discreet subsets based on the expression pattern of the remaining markers: N1 (B220+, CD38−), N2 (B220+, CD38+), and N3 (B220low, CD38+). The N2 subset was found to be the most abundant B cell subset in this peripheral blood sample at 48.9% of the CD19+ cells (see Table 1 for population proportions for all B cell subsets). Figure 2B also documents how FLOCK can take advantage of the differential expression of additional markers (in this case the relative expression of CD24) to provide a clean differentiation of naïve subsets that are identified in the literature as Bm1 (N1) and Bm2 (N2) based on the expression of IgD and CD38 (10, 40, 49, 55). These findings contrast with the recent report by Caraux in which circulating naïve B cells were reported as being CD38− (12). However, we found that FLOCK was able to clearly delineate low-level expression of CD38 in N2 and N3 populations as distinct from the negative expression in N1. Similar distinctions in CD38 expression were found in several memory B cell subsets as well (see below). This feature powerfully illustrates the ability of FLOCK to identify within multicolor data small populations previously recognized using conventional analysis with a limited number of markers. Additional scrutiny of these subpopulations using the relative expression of CD24 and CD38 suggest that CD24++, CD38++ T1/T2 transitional cells are likely included within the N2 population. (This important fraction, which has been proposed to include IL-10-producing regulatory B cells and to correlate with good clinical responses to B cell depletion in SLE, has been identified and quantified by FLOCK within the larger IgD+, CD27- N2 population on the basis of the expression of other markers in other samples (data not shown)).
Unswitched memory B cells, also designated by different investigators as natural memory or marginal zone-like B cells, are characterized by expression of both CD27 and IgD. These unswitched memory B cells were further subdivided into four distinct subsets by FLOCK based on the relative amount of IgD (UM1 - 4, Figure 2A), B220 and CD38 (Figure 2C). Interestingly, the majority of UM cells (UM3 & UM4) belong to the IgDint fraction and are B220low, a feature that has been proposed to coincide with memory cell differentiation of B220+ naïve cells, and which these cells share with isotype switched CD27+ memory cells. In contrast, the remainder of CD27+ UM cells (UM1 & UM2) express levels of IgD and B220 comparable to naïve cells, a feature suggesting that they might represent an earlier differentiation stage of UM cells. Of note, as before for naïve B cells, distinct CD38+ and CD38- fractions are readily discriminated by FLOCK within both IgD+ and IgDint fractions (UM1/UM2 and UM3/UM4, respectively).
Memory B cells that have lost surface IgD expression through AID-mediated switch recombination to downstream isotypes like IgG and IgA (CD27+, IgD-) are considered isotype switched memory cells that by and large express surface IgA or IgG. These cells could be first subdivided based on expression of IgG into IgG+ switched memory B cells (GSM, Figure 2D) and IgG- switched (largely IgA+ according to our own published and unpublished data) memory B cells (GNSM, Figure 2E). These two groups could be further subdivided based on the relative expression of CD38, B220 and CD24. Consistent with previous studies, the majority of GSM cells downregulate B220 expression. One CD27+, IgG+ population was found to express IgD as well (GSM1). This population likely represents B cells that have recently undergone class switch recombination to the gamma heavy chain locus, but still retain residual IgD protein. Of note, these unique cells also retain expression of B220, a feature possibly suggesting an early/incomplete differentiation stage from naïve B cells. As for previous populations, CD38 presence or absence could also be resolved with FLOCK analysis for both GSM and GNSM populations. Based on CD38 expression, it appears that the combination of GSM2, GNSM1 and GNSM3 would correspond to the traditional early Bm5 subset (IgD-, CD38+) and the combination of GSM3, GNSM2 and GNSM4 would correspond to the late Bm5 subset (IgD-, CD38-) (49, 55). Some interesting differences however were detected between these two subsets of switched memory cells. Thus, no IgD co-expression was detected in any GNSM cells. Moreover, while the majority of GNSM cells behave like the majority of GSM cells, the GNSM3 subset retained expression of B220. Further analysis will be needed to explore the interesting possibility that these cells could represent the relatively rare IgM-only memory subset identified by others. Such possibility would be consistent with the B220 expression and in keeping with the early differentiation stage proposed for these cells (26). Similarly, since no reagents for the detection of Ig mu heavy chain or kappa and lambda light chains were included, the presence of an IgD-only, lambda positive B cell subset (5) cannot be addressed from these data. Finally, GNSM cells also contained one of only two peripheral blood subsets with negative expression of both CD24 and CD38 (GNSM4).
Cells lacking expression of both CD27 and IgD (predominantly expressing either IgG or IgA and containing a significant degree of somatic hypermutation) have been reported by different groups. We have reported that patients with SLE are characterized by large expansion of these cells; significant expansions have also been reported in patients with HIV infection and malaria (34, 55, 56). As for other populations, these so-called double negative memory B cells (DNM1 - 2) that express low levels of CD27 and lack expression of CD38 and IgD were also found in both IgG+ and IgG- varieties (Figure 2F). In contrast to conventional memory cells and in keeping with our previous observations, these cells express higher levels of B220 and are devoid of CD24 expression.
Finally, the plasmablast subset (PB) can be easily delineated based on the relatively high level of both CD27 and CD38 expression (Figure 2A & G). PB were found to be predominantly B220-, CD24- and surface Ig−.
In the end, each of the seventeen B cell subsets identified by FLOCK can be distinguished from every other population, based on some combination of staining parameters, and thus represents a distinct B cell subset in peripheral blood. In addition, although not necessarily symmetric, the distributions of events for each population in each dimension appear unimodal indicating that these subsets cannot be further subdivided based on the data available.
As a test of the utility of this strategy for automated cell population identification in FCM data, FLOCK was applied to a dataset from multiple peripheral blood samples before and after tetanus vaccination. In order to support direct cross-sample comparison, FLOCK was first applied to the Day 6 sample, and the centroid locations used to cluster each of the remaining data files.
Twelve distinct CD19+ cell populations were identified in this dataset using reagents designed to identify particular plasmablast and plasma cell populations (Figure 3A). (Plots of all CD19+ cell populations are provided in Supplemental Figure 2. Supplemental Table 1 lists the expression profiles and population proportions generated by FLOCK for the identified populations). Again, FLOCK was able to identify conventional B cell subsets as observed in IgD vs CD27 plots (49). Populations #1, #2, #11 and #12 correspond to mature naïve (B cells that have not yet encountered their cognate antigen) and, perhaps, transitional B cells in that they express IgD but lack CD27 (Figure 3A). These four naïve B cell populations are distinguishable by differential expression of other cell surface proteins. For example, Population #1 expresses CD38, suggesting that it may represent a transitional B cell subset. Population #2 expresses low levels of CD138; Population #11 lacks HLA-DR expression; while Population #12 expresses HLA-DR (Supplemental Figure 2). The functional significance of these expression differences is not known at this time and requires further investigation.
Population #3 corresponds to conventional unswitched memory B cells, which express both CD27 and IgD; Population #4 may constitute a subset in transition between naïve and unswitched memory B cells. Populations #8 and #10 correspond to class switched memory B cells, which express CD27 but lack IgD; again these two sub-populations can be distinguished based on HLA-DR expression. Population # 9 lacks IgD and CD27 expression suggestive of the double-negative switched memory phenotype (55).
The traditional plasmablast subset has been described as being CD38high and CD27high. Interestingly, FLOCK identifies 3 different plasmablast/plasma cell populations using this 10-parameter panel. All three of these plasmablast/plasma cell subsets are positive for intracellular Ki67, a marker of recent proliferation. Population #6 (CD19+, IgD-, CD27high, CD38high, CD138-) expresses high levels of CD27 and CD38 but lacks CD138, consistent with the traditional plasmablast phenotype. In contrast, Population #7 (CD19+, IgD-, CD27high, CD38high, CD138+) not only expresses high levels of CD27 and CD38 but also CD138, a marker of long-lived plasma cells, suggesting the mobilization of a novel plasma cell subset in blood after vaccination. In addition, a third putative plasmablast subset, Population #5 (CD19+, IgD−, CD27+, CD38+, CD138−), was identified by FLOCK analysis.
Plasmablast/plasma cell populations #5, #6 and #7 show interesting dynamics in response to tetanus vaccination (Figure 3B). Population #6 has the characteristic phenotype of the traditional plasmablast subset and shows an oscillatory response in which population proportions dip by Day 5 only to rise again on Day 6 before falling on Day 7. In contrast, Population #7 (plasma cells) starts out low and remains low at Day 5, and then shows a dramatic but transient spike on Day 6. By clearly segregating these population subsets, FLOCK was able to demonstrate that these cells responding to vaccination express higher levels of Ki67, suggesting that the responding cells are proliferating, and also exhibit cytokine-secreting effector functions (data not shown). In contrast to the prevailing notion of old CD138+ plasma cells displaced from the bone marrow after vaccination, these data suggest that circulating CD138+ plasma cells are newly generated, as shown by Ki67 positive staining, and are active participants in the newly induced immune response (43). Population #5 does not express the characteristic high levels of CD27 (Figure 3A) and CD38 (Supplemental Figure 2) of the traditional plasmablasts and plasma cells but shows the same kind of oscillatory response as Population #6 (plasmablasts). Based on the gene expression studies described below, this subset may represent an early plasmablast precursor based on its phenotype.
The results of FLOCK were similar to the results of populations identified using manual gating (Supplemental Figure 3; Figure 3C, D). In addition, tetanus-specific IgG antibody secreting cells (ASCs) and total IgG ASCs were enumerated and compared with plasmablast populations identified by flow cytometry using FLOCK or manual analysis. The proportions of Population #7 identified by FLOCK closely match the results obtained from manual gating for CD138-expressing, CD38+ plasma cells (Figure 3C, D). The dynamic pattern of Population #7 proportions also closely paralleled the levels of total IgG-secreting and tetanus-specific plasma cells as quantified by ELISPOT.
Plasma cell differentiation is initiated by down-regulation of Pax5 and up-regulation of BLIMP-1. To validate that these 3 novel plasmablast-like populations are indeed of plasma cell lineage, Populations #5, 6, and 7 were sorted from the PBMC of a patient on day 7 after trivalent inactivated influenza vaccine (2009) vaccination and Pax5 and BLIMP-1 mRNA expression measured (Figure 3E). Population #7 demonstrates a lack of Pax5 and high levels of BLIMP-1 expression, evidence of the fully-differentiated plasma cell phenotype; whereas Population #6 has low Pax5 and high BLIMP-1 levels suggesting an intermediary phenotype. Lastly, Population #5 has high levels of both Pax5 and BLIMP-1 suggesting that this novel population might correspond to an early plasmablast population that has acquired BLIMP-1 expression but has not yet downregulated Pax5. Thus, FLOCK was able to discover previously characterized and novel plasmablast/plasma cell populations.
Finally, to demonstrate the reproducibility of cell population identification using FLOCK and to further validate the presence of these 3 distinct plasmablast/plasma cell populations in PBMC, an additional 6 subjects who received routine vaccinations with tetanus and diptheria toxoid, trivalent influenza vaccine 2009, H1N1 monovalent influenza vaccine 2009, Hepatitis A, and Hepatitis B were analyzed 7 days post-vaccination. These 3 plasmablast/plasma cell populations were identified in all 6 subjects with varying percentages (Figure 4A) – Population #5 (1.14–2.79%), Population #6 (1.86–9.68%), and Population #7 (0.62–9.1%). The population proportions derived from FLOCK analysis were compared with the results from manual gating (Figure 4B). For Populations #6 and #7, the results are highly correlated, with an r > 0.99. For Population #5, the correlation is less strong; this likely relates to the overlapping nature of the staining distributions in the CD38 and CD138 dimensions, which is difficult to address using manual gating in 2D plots. However, taken together, these data suggest that automated population identification using FLOCK is highly reproducible and demonstrate that the FLOCK program can accurately and objectively identify relevant human B cell populations in clinical samples without the need for labor-intensive manual gating.
Recent progress in FCM data analysis has mainly focused on model-based approaches, including the applications of Bayes mixture Gaussian models (13], t-mixture models as implemented in flowClust (29) and skew-t models as implemented in FLAME (Flow Analysis with Automated Multivariate Estimation) (42). User-driven clustering procedures for FCM population identification have also been described (57). The main advantage of model-based clustering is the robustness in identifying populations that follow the predefined models, if the model parameters can be estimated in an optimal way without becoming trapped in a local minimum solution. However, it remains an open question if model-based approaches can solve the high-dimensional clustering problem. Estimating and refining model parameters is a slow procedure for high-dimensional data set. In addition, the populations may not follow the distributional models when they become sparse in the high-dimensional space due to the curse of dimensionality.
Unlike model-based clustering, grid-based density clustering used in FLOCK partitions the data based on the closeness among events and the density variance. The analysis is conducted on data regions generated from the grid, which makes the whole procedure highly efficient as the number of data regions is much smaller than the number of events. Also the grid structure helps FLOCK rapidly locate the central areas of cell population distributions, which, as Bradley (11) and Ding (16) have pointed out, is critical for obtaining high-dimensional data clustering because optimization-based procedures are much more likely to become trapped in local minima with high-dimensional distributions than with low-dimensional distributions. In this respect, FLOCK could be complementary to recent model-based methods as the results of FLOCK could be used to provide the model-based approaches with initial parameters. An additional advantage to the FLOCK approach to cross-sample comparison is that the approach appears to be relatively robust to slight difference in fluorescence intensity of cell population between samples. This feature is especially useful for studies in which differences in sample preparation and staining procedures might generate technical variability in fluorescence that are not biologically relevant.
In the studies reported here, FLOCK was able to identify cell subsets that represented as few as 0.17% (23 events) of CD19+ cells in the complex PBMC mixture (Population #7 in the tetanus Day 0 sample). Visual inspection of the cell events in Figure 3B suggest that these are real. Thus, if a lower limit of 20 events is used as a confidence cutoff, a minimum of 20,000 (gated) events would need to be collected to identify cell populations present at a proportion of 0.1% or greater.
As the number of staining dimensions increases, the high dimensional space becomes increasing sparse, making it difficult to find population clusters based on density. We have recently developed a modification to the core FLOCK algorithm that incorporates a dimension selection step that focuses the initial clustering on the most informative 4–6 dimensions, followed by a refinement of the results based on data from other dimensions. Using this modification, FLOCK is able to handle datasets up to the 19-dimension limit of the current cytometer instrumentation (data not shown)
One of the main obstacles for the assessment of FLOCK is that there are no gold standards of cell population identity. In this paper we attempted to use population histograms to evaluate the results of FLOCK-based population identification. Populations identified by FLOCK showed unimodal distributions for all parameters measured in the experiment; 2-dimensional dot plots demonstrated the uniqueness of each population subset. Since these are objective characteristics that are independent from the criteria used in the FLOCK clustering algorithm (full-dimensional closeness and density), the results of FLOCK are validated using criteria that are not incorporated in the algorithm itself. Ultimately, it will be up to the biologists to determine their functional relevance. At least FLOCK will provide a more objective perspective for the understanding of FCM data instead of relying solely on human interpretation to supervise the data classification. To this end, a density-based approach seems the best candidate as it assumes very little about the data distributions but only uses the inherent density variation to partition the data.
Collaborative efforts have been made to provide an objective way to compare and evaluate the FCM data clustering methods, and also to establish guidance about appropriate use and application of these methods. These efforts have recently resulted in the open project FlowCAP (Flow Cytometry: Critical Assessment of Population Identification Methods, http://flowcap.flowsite.org/). FlowCAP is seeking and collecting FCM data to test different computational methods. Additional evaluation of FLOCK will be performed within the FlowCAP project once sufficient data sets have been collected.
The value of FLOCK is illustrated by two main features reflected in this manuscript, namely the automated recognition of multiple populations from data sets generated by multicolor flow analysis and the ability to identify unheralded populations that were not be readily suspected with iterative bi-dimensional gating. Indeed, by and large, the human B cell literature is confined to the classification of B cell subsets based on the relative expression of IgD/CD27/CD38/CD24, typically in 2 to 4-color combinations. In contrast, the multicolor approach used in our studies allowed FLOCK to consistently identify 17 well-defined subsets in human peripheral blood (the results shown in Figure 2 and Table 1 were corroborated in 4 additional unimmunized healthy subjects; results not shown). Thus, FLOCK identified major parental populations conventionally defined by the relative expression of IgD and CD27 (naïve, unswitched memory, switched memory, CD27- switched memory and plasmablasts). But, in addition to these classical B cell subsets, FLOCK was able to further subdivide these populations based on the expression of additional cell surface markers –CD38, CD24 and B220 - that contribute important information regarding the developmental and activation state of B cells. In contrast to previous studies, the ability to accurately segregate these subsets using FLOCK will allow their consistent and more objective quantification in longitudinal studies and across samples obtained in multicenter studies, thereby providing reliable information to understand the homeostasis of the different B cells subsets and the significance of their perturbation in different clinical settings, including both physiological and pathological immune responses (31, 48). Furthermore, this approach will greatly facilitate the isolation of individual subsets for the ex vivo and in vitro study of their biology and function.
In the tetanus study, FLOCK was able to identify several B cell subsets that responded to vaccination in a recall response following well-established kinetics patterns. While validating the reproducibility of FLOCK measurements, these studies also powerfully illustrate the ability of FLOCK to provide detailed phenotypic characterization of predetermined populations and to demonstrate new functional properties. The latter capability is perhaps best encapsulated in the recognition by FLOCK of the high levels of Ki67 universally present in the CD38high plasmablast (Population #6) and plasma cell (Population #7) populations also expressing high levels of CD27. The expression of Ki67 in both the CD138- plasmablast and CD138+ plasma cell populations in normal peripheral blood is consistent with a recent report by Caraux et al. (12). Moreover the longitudinal studies demonstrated the ability of FLOCK to uncover a previously unrecognized candidate population (Population #5 –CD38+, CD138-) likely to represent an early stage of plasma cell differentiation in the context of an antigen-specific acute recall response. This finding will now permit detailed studies of clonal relatedness between the different populations as well as analysis of antigen-affinity and affinity maturation across the different antibody-secreting cell subsets in order to understand the rules that govern the maturation of plasma cells and their selection into long-lived compartments (indicated at least in part by the differential expression of CXCR4 (4, 43)).
In the cross vaccine study, the proportions of the three plasmablast/plasma cell subsets varied dramatically between subjects, as much as 15 fold for the plasma cell (Population #7) population. Given the relatively small sample size in this pilot study it is difficult to determine if this variability in vaccine response is due to genetic differences in the human subjects enrolled in the study, differences in the prior vaccination or pathogen exposure history of the subjects, or differences in the vaccine formulations. The fact that the strongest response and one of the weakest responses were observed in the two subjects receiving the trivalent influenza vaccine 2009 argues against the latter explanation. It will be interesting to determine if the absolute levels of one or more of these plasmablast/plasma cell populations, or their dynamics, will serve as an effective surrogate marker of protective immunity following vaccination.
One challenges that arises from these studies is how to compare the 17 B cell populations identified in normal peripheral blood by FLOCK using this combination of staining reagents compare with other studies investigating B cell subsets in other normal and pathologic conditions, including MBL. Unfortunately, this is a difficult question to address because reagent panels used to distinguish between normal B cell subsets are not exactly the same as those used for disease diagnosis. Markers that are useful for disease diagnosis are often selected based on their abnormality of expression. For example, MBL was recognized partly through the investigation of chronic lymphocytic leukemia, with both characterized by abnormalities in the expression pattern of CD5, CD20 and immunoglobulin light chains. More recently, Nieto et al (36) used an 8-marker panel, including CD20, CD5, CD23, and kappa and lambda immunoglobulin light chains to study CLL-like MBL and found that these cells could be routinely detected in peripheral blood of healthy subjects. Unfortunately, the markers used in the analysis reported here were selected to highlight differences in normal memory B cell populations, and did not include CD5 and CD20. This raises the general challenge of comparing results from separate studies in which distinct, but overlapping sets of cellular markers are used. Indeed, this challenge even arose when we tried to compare the populations identified in the three data sets analyzed in this study, which is why different naming conventions were used. Collaborative efforts may be necessary in order to develop strategies to determine the relationships between cell populations identified in different studies, perhaps replacing less informative markers in our study (e.g. CD24, CD1c) with markers better suited to identify CLL-like MBL cells (e.g. CD5, CD20)(36) and immature/transitional early B cells (e.g. CD10) (12).
In order to address this challenge, one of our near-term goals is to link FLOCK results to the Cell Ontology (7, 15, 33). The Cell Ontology is being developed to provide a standard description of normal cell types based on experimentally-defined properties. In the case of hematopoietic cells, the expression of cell surface and intracellular proteins is being used as the major property defining distinct cell types in the ontology. The fluorescence profile of a population identified by FLOCK indicates the relative expression of the marker protein in the population, and thus could be used to link FLOCK results with Cell Ontology definitions. This would facilitate the management and exchange of cell population data among different labs. And since cell types are represented as a hierarchy of type-subtype relations in the ontology, it will be possible to understand the relationship of cell types identified at different levels of granularity.
FLOCK is a computational method that automatically identifies cell populations in multi-dimensional flow cytometry data. Based on a variety of assessment characteristics, FLOCK appears to objectively identify distinct cell types, including both previously characterized B cell subsets and novel subgroups of cells of unknown significance. As the main FCM analytical software developed in the context of the Bioinformatics Integration Support Contract (BISC) project, FLOCK has been implemented within Immunology Database and Analysis Portal -ImmPort (http://www.immport.org) environment, and is freely available for scientific research purposes. While FLOCK was developed for the analysis of FCM data, the clustering strategy developed can also be applied to other data mining problems that have previously relied on K-means to perform cluster analysis. Ultimately, the use of objective algorithmic approaches like FLOCK should play an important role in the improvement of clinical diagnostics based on flow cytometry.
This work was supported in part by the U.S. National Institutes of Health N01-AI40076 (Bioinformatics Integration Support Contract), N01-AI50029 (Rochester Program for Biodefense of Immunocompromised Populations), and U19-AI56390 (Rochester Autoimmunity Center of Excellence).