Eight healthy adult subjects ages 22 to 65 years (mean 42 ± 15 year) were enrolled in this study at the University of Rochester Medical Center in 2008–2010; three were men and five were women. Peripheral blood mononuclear cells (PBMC) were isolated from heparinized blood by Ficoll-Hypaque density gradient centrifugation (Pharmacia Biotech). Seven subjects received the following vaccinations separately or in combination as a part of routine medical care: tetanus and diptheria toxoids (Td, Sanofi Pasteur, Swiftwater, PA), trivalent influenza vaccine 2009 (Fluzone, Sanofi,Pasteur. Swiftwater, PA), H1N1 monovalent influenza vaccine 2009 (Sanofi Pasteur, Swiftwater, PA), and Hepatitis Aand Hepatitis B (Twinrix, GlaxoSmithKline, Research Triangle Park, NC). PBMC were isolated on days 7 for all vaccination subjects; one subject had blood analyzed pre and post tetanus vaccination on days 5, 6, 7, 8, and 15. All studies were approved by the Research Subjects Review Boards at the University of Rochester Medical Center.
2.2 FCM Analysis
FCM data from three sample sets were analyzed using FLOCK in this manuscript. PBMC from the one untreated subject were stained with the following anti-human antibody staining reagents: IgD-FITC (BD Biosciences, San Diego, CA), CD1c-PE (Miltenyi, Auburn, CA), CD24-PE Alexa610 (Invitrogen), IgG-PE Cy5 (BD), CD3-PerCP Cy5.5 (BD), B220-PE Cy7 (BD), CD38-PacificBlue (BD custom conjugate), live/dead cell-staining PacificOrange Aqua (Invitrogen, Carlsbad, CA), CD27-APC (eBioscience, San Diego, CA), and CD19-APC-Cy7 (BD Biosciences, San Diego, CA). Note that the B220-PE-Cy7 reagent contains the RA3-6B2 monoclonal antibody, which recognizes the B220 glycoform of the CD45R protein. Stained cells were analyzed on an LSRII flow cytometer (Becton Dickinson, Franklin Lakes, NJ). Single conjugated mouse anti-human antibodies were added to Simply Cellular Anti-mouse compensation standard bead (Bangs Laboratories, Fishers, IN) and compensated automatically using Flow jo.8.8.6 (Treestar, Ashland, OR). The dataset used was CD3-CD19+ live lymphocyte gated (~67,000 events after gating) to focus the analysis on B cell populations. Data for CD3, CD19, CD1c and Aqua fluorescence were excluded from the FLOCK analysis.
The second dataset (tetanus study) examined B cell populations in serial PBMC samples from another subject after tetanus vaccination. Samples were taken immediately before vaccination (Day 0) and on Day 5, 6, 7, 8, and 15 after vaccination. PBMC were stained with the following anti-human antibodies to surface markers at 4°C for 20 minutes: IgD-PE (BD Biosciences, San Diego, CA), CXCR4-PE-Cy5 (eBioscience, San Diego, CA), CD3-PE-Cy5.5 (Caltag Laboratories, Carlsbad, CA), CD14-PE-Cy5.5 (Invitrogen, Carlsbad, CA), CD19-PE-Cy7 (BD Biosciences, San Diego, CA), CD38-Pacific Blue (BD Biosciences, San Diego, CA), CD138-APC (Miltenyi Biotec, Auburn, CA), HLA-DR-AlexaF700 (eBioscience, San Diego, CA), and CD27-APC-Alexa750 (eBioscience, San Diego, CA). Cells were then washed and resuspended in Cytoperm/Cytofix (BD Biosciences, San Jose, CA) containing anti-human Ki67-FITC antibodies (Invitrogen, Carlsbad, CA), and incubated for 20 minutes at 4°C. The dataset analyzed using FLOCK included onlyCD3-CD14-CD19+ gated events – Day 0 (13721 events), Day 5 (27832 events), Day 6 (27627 events), Day 7 (32204 events), Day 8 (31640 events), and Day 15 (39208 events).
The third data set (cross vaccine study) included PBMC samples from different subjects collected seven days after vaccination with various combinations of tetanus and diptheria toxoid, trivalent influenza 2009, monovalent H1N1 influenza 2009, Hepatitis Aand Hepatitis B vaccines. Samples were stained as above except excluding HLA-DR-AlexaF700. The dataset analyzed using FLOCK included only CD3-CD14-CD19+ gated events – 21702-8 (61691 events), 21702-19 (40333 events), 21702-6 (48124 events), 25251-134 (89261 events), 3334-139 (23951 events), and 21702-12 (68882 events).
2.3 Tetanus-specific and Total IgG Antibody Secreting Cell (ASC) ELISpot Assay
PBMC were added to 96-well ELISpot plates (MAIPS4510 96 well) coated with tetanus toxoid (1 Lf/mL Cylex Incorporated, Columbia, MD) or anti-human IgG (5μg/mL, Jackson Immunoresearch, West Grove, PA) and incubated overnight. Wells were washed and bound antibodies were detected with alkaline phosphatase-conjugated anti-human IgG antibody (1μg/mL, Jackson Immunoreseach), and reactions developed with VECTOR Blue Alkaline Phosphatase Substrate Kit III (Vector Laboratories, Burlingame, CA). Spots in each well were counted using the CTL immunospot reader (Cellular Technologies Ltd, Shaker Heights, OH).
2.4 Quantitative PCR for Expression of B Cell Transcription Factors
PBMC were isolated from a subject on day 7 after trivalent influenza vaccination and stained with the vaccination panel as described above. Three plasmablast/plasma cell populations identified by FLOCK were sorted and total cellular RNA was isolated using the RNeasy Mini Kit (Qiagen Inc., Valencia, CA) according to the manufacturer's protocol. After DNase I treatment (Invitrogen, Carlsbad, CA), 1 μg of RNA was subjected to reverse transcription using the iScript RT Kit (BioRad Inc., Hercules, CA). Aliquots of the resulting single-stranded cDNA products were included with BLIMP-1, Pax5, or GAPDH primers (6
) in a 20 μl PCR reaction using HotStarTaq PCR Kit (Qiagen Inc., Valencia, CA), and amplified using a Bio-Rad MyCycler (Bio-Rad, Hercules, CA) with the following conditions: after an initial step of 95°C for 5 min, 40 cycles of 95°C for 30 sec, 55°C for 30 sec, and 68°C for 1 min, ending with a final extension step of 68°C for 5 min.
There are five major components in a FLOCK analysis pipeline: data preprocessing, grid-based density clustering, cross-sample comparison, result visualization, and population statistics calculations.
2.5.1 Data Preprocessing
Data preprocessing consists of FCS file conversion and data normalization. FCS files used in this paper have been compensated before preprocessing. Binary .fcs files were converted to tab-delimited ASCII text format using the channel number export function in the FlowJo software package (Tree Star, Inc.). The exported data constitute a data matrix table with each row corresponding to an event (cell) and each column to either a light scatter parameter such as FSC or a fluorescence emission channel such as FL1.
As the exported matrix columns have different value ranges, they could potentially contribute differently to the similarity of events being measured and used for clustering. To balance their contributions to the similarity measure, each column in the data matrix was normalized. Two normalization methods have been used successfully in our studies. For the peripheral B cell subset analysis, the data matrix was column-wise z-score normalized. Z-score normalization has been shown to be effective in classifying hematopoietic bone marrow cell populations (57
). For the tetanus study, the data matrix was column-wise, 0–1, min-max normalized so that the Stone’s binning method (51
) could be used. The formulas for the z-score and the min-max normalization are:
is the original value, S’
the normalized value, μand σ the average and the standard deviation of the data column of S, respectively, Min
are the smallest and the largest value of the data column of S
. Although normalized data are used for event clustering, original values are used for visualization and population statistics calculations.
2.5.2 Grid-based Density Clustering
The core of the FLOCK algorithm begins by first partitioning each dimension into equally sized bins, which results in partition of the n-dimensional space into “hyper-regions”. shows a 2D mock example of how the data space is partitioned. However, in the FLOCK algorithm, partitioning is applied to all dimensions available in the dataset simultaneously. Each dimension of the data space is partitioned into the same number of equal-sized bins. Therefore a d-dimensional data space is partitioned into t*d hyper-regions with t equal-sized bins in each dimension.
Algorithmic components of FLOCK
In the second step, each hyper-region is assessed to determine the number of events present, and any hyper-region in which the number of events exceeds a certain threshold is labeled as being “dense” (). Equal-sized binning generates hyper-regions of equal volume. Therefore we can define the density of a hyper-region as the number of events in the region. A density threshold is used to distinguish a dense hyper-region from sparse and empty hyper-regions. As the density threshold increases, the number of dense hyper-regions decreases.
In the third step, dense hyper-regions adjacent to each other in n-dimensional space (two dense hyper-regions are adjacent if they are next to and contiguous with each other in one dimension and have the same position on all other dimensions) are grouped together into dense hyper-region groups, and the centroid (a point whose coordinates are the averages of the corresponding coordinates of a given set of point) representing all events in each dense hyper-region group determined ().
Finally, the dense hyper-region centroids are used to assign each event to its closest centroid based on Euclidean distance (). Other distance metric (e.g. Mannahalobis distance) could be used, but these have not yet been tested to determine their affect on the FLOCK results. Then the centroids are updated based on the member events of the dense regions and the cluster membership is computed again with the new centroids. This procedure can be regarded as a modified K-means procedure with only a few iterations. As initial centroids are not randomly generated, but rather based on data density, FLOCK quickly converges to a stable result during the centroid recalculation. This approach provides for two important advantages to traditional K-means– it determines – the number of populations in the high dimensional space from the number of dense hyper-region groups and it arrives at an estimated location of the center of the density distribution for each.
Two parameters need to be provided to execute the core of the FLOCK algorithm the number of partitions used to divide each dimension (# of bins) and the value of the density threshold cut-off. Both parameters appear to be dependent on a number of characteristics being measured, i.e. the number of events captured and the distribution of the data in the n
-dimensional space. In order to automatically select the number of partitions, two different methods were employed. The first method is based on our observation of the growth of non-empty hyper-regions generated, which initially increases as the number of bins increases, but then starts to decrease when the number of bins used per dimension is larger than necessary. For the peripheral B cell subset analysis, we chose a bin number to maximize the ratio of the number of non-empty hyper-regions versus the number of bins to ensure that distinct populations are separated into different hyper- regions. For the tetanus data set, Stone’s method (51
) was used to optimize the bias-variance trade-off for different number of bins, which allows FLOCK to partition the space sufficiently in order to ensure that distinct cell populations are separated, while avoiding excessive partitioning that would eliminate the density characteristics of the data space.
The other parameter to select is the density threshold cut-off. Different cut-off values lead to different clustering results. The one that gives rise to the best Silhouette value (47
) in the clustering is used. To avoid exhaustively testing each possible cut-off value we employed several different threshold selection methods, and the method that generates the best Silhouette value is chosen. For the peripheral B cell subset analysis, an ad hoc
method designed to distinguish the dense hyper-regions from background based on the average density of the hyper-regions was used. For the tetanus data set, the density threshold cut-off was selected based on the minimum description length (MDL) principle (2
) that is commonly used to identify the best cut-off value within a data sequence. We have also developed another ad hoc
method to identify the inflexion point of the decrease of the number of dense hyper-regions as the cut-off increases, which usually generates a lower density threshold value than the MDL principle and is more effective at identifying sparse cell populations. The use of these different approaches for density threshold estimation allows FLOCK to be tailored for each data set and to identify both relatively rare and relatively abundant cell populations.
The FLOCK algorithm has been implemented in the Immunology Database and Analysis Portal (ImmPort; http://www.immport.org
). The runtime of a single FLOCK analysis is largely dependent on the number of events in a single data file. A relatively large file of ~2 million events returned results in less than 20 minutes; more typical files in the range of 10 thousand to 100 thousand events return results in less than 2 minutes.
2.5.3 Visualization and Statistics
The visualization module of FLOCK as implemented in ImmPort supports 2-dimensional dot plots with each population colored uniquely. An individual population can be viewed in its own color with other populations in white as background. Users can navigate the multidimensional space by changing the dimensions or populations that are displayed.
Expression profiles of a population are presented that indicate the approximate expression level of each marker for the population. To generate the profile, FLOCK partitions each dimension into four bins with equal width. The four bins, from lowest values to highest values, are denoted with integers 1, 2, 3, and 4, respectively. The bin a population belongs to is decided by the majority vote of population events, and therefore each population’s profile is represented with a set of the integers –one integer (from 1–4) per marker.
Statistics calculated in FLOCK include population proportions, mean fluorescence intensity (MFI), cell count, standard deviation of the population on each parameter, and coefficient of variation.
2.6 Cross-sample Comparison
For the analysis of the tetanus and cross vaccine studies, the group of population centroids generated by FLOCK analysis of a single sample was applied to multiple samples to facilitate population comparison. Each event is assigned to its closest centroid in each sample. Populations generated in such a way have the same ID and color across all samples. For these data, the FCM results from the tetanus study Day 6 sample was used to identify the initial centroid locations.