1.  ISAC's Gating-ML 2.0 data exchange standard for gating description 
The lack of software interoperability with respect to gating has traditionally been a bottleneck preventing the use of multiple analytical tools and reproducibility of flow cytometry data analysis by independent parties. To address this issue, ISAC developed Gating-ML, a computer file format to encode and interchange gates. Gating-ML 1.5 was adopted and published as an ISAC Candidate Recommendation in 2008. Feedback during the probationary period from implementors, including major commercial software companies, instrument vendors and the wider community, has led to a streamlined Gating-ML 2.0. Gating-ML has been significantly simplified and is therefore easier for software tools to support. To aid developers, free, open source reference implementations, compliance tests and detailed examples are provided to stimulate further commercial adoption. ISAC has approved Gating-ML as a standard ready for deployment in the public domain and encourages its support within the community, as it is at a mature stage of development, having undergone extensive review and testing under both theoretical and practical conditions.
PMCID: PMC4874733  PMID: 25976062
flow cytometry; bioinformatics; gating; data standard; file format
2.  ISAC’s Classification Results File Format (CLR)
Identifying homogeneous sets of cell populations in flow cytometry is an important process for sorting and selecting populations of interest for further data acquisition and analysis. Many computational methods are now available to automate this process, with several algorithms partitioning cells based on high-dimensional separation versus the traditional pairwise two-dimensional visualization approach of manual gating. ISAC’s Classification Results File Format (CLR) was developed to exchange the results of both manual gating and algorithmic classification approaches in a standardized way based on per-event classifications, including the potential for soft classifications expressed as the probability of an event being a member of a class.
PMCID: PMC4874736  PMID: 25407887
flow cytometry; classification; clustering; standard; software interoperability; file format; analysis interchange
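As a rough illustration of the per-event, soft-classification idea described in the CLR entry above, the sketch below stores one row per event and one column per class, with each value giving the probability that the event belongs to that class. The comma-separated layout, the class names and the helper function names are assumptions made for this example; they are not taken from the CLR specification.

```python
import csv
import io


def write_soft_classification(class_names, probabilities):
    """Serialize per-event class probabilities as CSV text (one row per event)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(class_names)
    for row in probabilities:
        if len(row) != len(class_names):
            raise ValueError("each event needs one probability per class")
        if any(p < 0.0 or p > 1.0 for p in row):
            raise ValueError("probabilities must lie in [0, 1]")
        writer.writerow(row)
    return buffer.getvalue()


def hard_labels(class_names, probabilities):
    """Collapse soft classifications to the most probable class per event."""
    return [class_names[max(range(len(row)), key=row.__getitem__)]
            for row in probabilities]


classes = ["lymphocytes", "monocytes"]          # hypothetical class names
events = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.0]]   # hypothetical probabilities
clr_text = write_soft_classification(classes, events)
labels = hard_labels(classes, events)
```

A manual gate fits the same layout as the special case in which every probability is either 0 or 1.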
3.  Deep profiling of multitube flow cytometry data 
Bioinformatics  2015;31(10):1623-1631.
Motivation: Deep profiling the phenotypic landscape of tissues using high-throughput flow cytometry (FCM) can provide important new insights into the interplay of cells in both healthy and diseased tissue. But often, especially in clinical settings, the cytometer cannot measure all the desired markers in a single aliquot. In these cases, tissue is separated into independently analysed samples, leaving a need to electronically recombine these to increase dimensionality. Nearest-neighbour (NN) based imputation fulfils this need but can produce artificial subpopulations. Clustering-based NN imputation can reduce these, but it requires prior domain knowledge to parameterize the clustering, so it is unsuited to discovery settings.
Results: We present flowBin, a parameterization-free method for combining multitube FCM data into a higher-dimensional form suitable for deep profiling and discovery. FlowBin allocates cells to bins defined by the common markers across tubes in a multitube experiment, then computes aggregate expression for each bin within each tube, to create a matrix of expression of all markers assayed in each tube. We show, using simulated multitube data, that flowType analysis of flowBin output reproduces the results of that same analysis on the original data for cell types of >10% abundance. We used flowBin in conjunction with classifiers to distinguish normal from cancerous cells. We used flowBin together with flowType and RchyOptimyx to profile the immunophenotypic landscape of NPM1-mutated acute myeloid leukemia, and present a series of novel cell types associated with that mutation.
Availability and implementation: FlowBin is available in Bioconductor under the Artistic 2.0 free open source license. All data used are available in FlowRepository under accessions: FR-FCM-ZZYA, FR-FCM-ZZZK and FR-FCM-ZZES.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4426837  PMID: 25600947
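The binning step described in the flowBin entry above can be sketched roughly as follows. This is a minimal illustration only: it assumes a simple median split on each shared marker and a per-bin median as the aggregate, neither of which is stated in the abstract, and the function name `combine_tubes` and the toy data are hypothetical.

```python
from statistics import median


def combine_tubes(tubes, n_common):
    """Bin cells on shared markers, then summarize tube-specific markers per bin.

    tubes: list of tubes; each tube is a list of cells; each cell is a list of
    marker values whose first n_common entries are the markers shared by all tubes.
    Returns {bin_key: [aggregate of every tube-specific marker, tube by tube]}.
    """
    # Assumed binning rule: one median cut per shared marker, pooled over all tubes
    cuts = [median(cell[j] for tube in tubes for cell in tube)
            for j in range(n_common)]

    def bin_key(cell):
        # A bin is the tuple of low (0) / high (1) calls on the shared markers
        return tuple(int(cell[j] >= cuts[j]) for j in range(n_common))

    keys = sorted({bin_key(cell) for tube in tubes for cell in tube})
    combined = {key: [] for key in keys}
    for tube in tubes:
        n_specific = len(tube[0]) - n_common
        per_bin = {}
        for cell in tube:
            per_bin.setdefault(bin_key(cell), []).append(cell[n_common:])
        for key in keys:
            cells = per_bin.get(key)
            if cells:
                # Median of each tube-specific marker within this bin
                summary = [median(column) for column in zip(*cells)]
            else:
                # The bin is empty in this tube: keep the columns aligned
                summary = [None] * n_specific
            combined[key].extend(summary)
    return combined


# Two toy tubes sharing one marker (column 0); column 1 is tube-specific
tube_a = [[1.0, 10.0], [2.0, 12.0], [8.0, 50.0], [9.0, 54.0]]
tube_b = [[1.5, 100.0], [2.5, 104.0], [8.5, 7.0], [9.5, 9.0]]
combined = combine_tubes([tube_a, tube_b], n_common=1)
```

Each row of the result then carries one aggregate value for every marker measured in any tube, which is the higher-dimensional form that downstream analysis would consume.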
4.  flowCL: ontology-based cell population labelling in flow cytometry 
Bioinformatics  2014;31(8):1337-1339.
Motivation: Finding one or more cell populations of interest, such as those correlating to a specific disease, is critical when analysing flow cytometry data. However, labelling of cell populations is not well defined, making it difficult to integrate the output of algorithms with external knowledge sources.
Results: We developed flowCL, a software package that performs semantic labelling of cell populations based on their surface markers and applied it to labelling of the Federation of Clinical Immunology Societies Human Immunology Project Consortium lyoplate populations as a use case.
Conclusion: By providing automated labelling of cell populations based on their immunophenotype, flowCL allows for unambiguous and reproducible identification of standardized cell types.
Availability and implementation: Code, R script and documentation are available under the Artistic 2.0 license through Bioconductor.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4393520  PMID: 25481008
5.  flowDensity: reproducing manual gating of flow cytometry data by automated density-based cell population identification 
Bioinformatics  2014;31(4):606-607.
Summary: flowDensity facilitates reproducible, high-throughput analysis of flow cytometry data by automating a predefined manual gating approach. The algorithm is based on a sequential bivariate gating approach that generates a set of predefined cell populations. It chooses the best cut-off for individual markers using characteristics of the density distribution. The Supplementary Material is linked to the online version of the manuscript.
Availability and implementation: R source code is freely available through BioConductor. Data are available as dataset FR-FCM-ZZBW.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4325545  PMID: 25378466
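To make the idea of a density-based cut-off in the flowDensity entry above more concrete, the sketch below places a threshold in the emptiest stretch between the two tallest peaks of a histogram. This is only one plausible reading of "characteristics of the density distribution"; the histogram estimate, the peak rule and the function name `density_cutoff` are assumptions and do not reproduce the package's actual algorithm.

```python
def density_cutoff(values, n_bins=20):
    """Return a cut-off between the two tallest histogram peaks, or None."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return None  # all values identical: no density structure to cut on
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp so that the maximum value falls in the last bin
        counts[min(int((v - lo) / width), n_bins - 1)] += 1
    # A peak is strictly higher than its left neighbour and
    # at least as high as its right neighbour
    peaks = [i for i in range(n_bins)
             if counts[i] > 0
             and (i == 0 or counts[i] > counts[i - 1])
             and (i == n_bins - 1 or counts[i] >= counts[i + 1])]
    if len(peaks) < 2:
        return None  # unimodal under this crude estimate
    left, right = sorted(sorted(peaks, key=lambda i: counts[i], reverse=True)[:2])
    between = range(left, right + 1)
    lowest = min(counts[i] for i in between)
    valley_bins = [i for i in between if counts[i] == lowest]
    # Cut at the centre of the middle bin among the emptiest ones
    valley = valley_bins[len(valley_bins) // 2]
    return lo + (valley + 0.5) * width


# Toy marker intensities: a dim population near 1 and a bright one near 9
intensities = [1.0, 1.1, 1.2, 0.9, 1.0, 1.3, 9.0, 9.1, 8.9, 9.2, 9.0, 8.8]
cutoff = density_cutoff(intensities)
positive = [v for v in intensities if v >= cutoff]
```

Applying such a cut-off to one marker at a time, and then to pairs of markers in sequence, would mirror the sequential bivariate strategy the abstract describes.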
6.  RchyOptimyx: Cellular Hierarchy Optimization for Flow Cytometry 
Analysis of high-dimensional flow cytometry datasets can reveal novel cell populations with poorly understood biology. Following discovery, characterization of these populations in terms of the critical markers involved is an important step, as this can help to both better understand the biology of these populations and aid in designing simpler marker panels to identify them on simpler instruments and with fewer reagents (i.e., in resource poor or highly regulated clinical settings). However, current tools to design panels based on the biological characteristics of the target cell populations work exclusively based on technical parameters (e.g., instrument configurations, spectral overlap, and reagent availability). To address this shortcoming, we developed RchyOptimyx (cellular hieraRCHY OPTIMization), a computational tool that constructs cellular hierarchies by combining automated gating with dynamic programming and graph theory to provide the best gating strategies to identify a target population to a desired level of purity or correlation with a clinical outcome, using the simplest possible marker panels. RchyOptimyx can assess and graphically present the trade-offs between marker choice and population specificity in high-dimensional flow or mass cytometry datasets. We present three proof-of-concept use cases for RchyOptimyx that involve 1) designing a panel of surface markers for identification of rare populations that are primarily characterized using their intracellular signature; 2) simplifying the gating strategy for identification of a target cell population; 3) identification of a non-redundant marker set to identify a target cell population.
PMCID: PMC3726344  PMID: 23044634
polychromatic flow cytometry; mass cytometry; exploratory data analysis; cellular hierarchy; graph theory; gating; marker panel; bioinformatics; statistics
7.  FCS 3.1 Implementation Guidance
The Flow Cytometry Standard (FCS) format was developed back in 1984. Since then, FCS has become the standard file format supported by all flow cytometry software and hardware vendors. Over the years, updates were incorporated to adapt to technological advancements in both flow cytometry and computing technologies. However, flexibility in how data may be stored in FCS has led to implementation difficulties for instrument vendors and third party software developers. In this technical note, we provide implementation guidance and examples related to FCS 3.1, the latest version of the standard. By publishing this text, we intend to prevent potential compatibility issues, raised during discussions among some implementers, that could be faced when implementing the FCS spillover and preferred display keywords.
PMCID: PMC3676281  PMID: 22278913
flow cytometry; FCS; data standard; file format; bioinformatics
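As an illustration of the kind of implementation detail the guidance above is concerned with, the sketch below parses a spillover keyword value. It assumes the value is a comma-separated string holding the number of parameters n, then n parameter names, then n×n coefficients in row-major order; that layout, the function name `parse_spillover` and the sample value are assumptions for this example and should be checked against the normative FCS 3.1 text.

```python
def parse_spillover(value):
    """Split a spillover keyword value into parameter names and a square matrix."""
    fields = [f.strip() for f in value.split(",")]
    n = int(fields[0])
    names = fields[1:1 + n]
    numbers = [float(x) for x in fields[1 + n:]]
    if len(names) != n or len(numbers) != n * n:
        raise ValueError("expected n names followed by n*n coefficients")
    # Coefficients are read row by row (row-major order is assumed here)
    matrix = [numbers[i * n:(i + 1) * n] for i in range(n)]
    return names, matrix


# Hypothetical keyword value for three fluorescence parameters
sample = "3,FL1-A,FL2-A,FL3-A,1.0,0.03,0.2,0.1,1.0,0.0,0.05,0,1.0"
names, matrix = parse_spillover(sample)
```

Validating the field count up front is a deliberate choice: a value with a missing coefficient would otherwise shift every later entry and produce a silently wrong matrix.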
8.  Enhanced flowType/RchyOptimyx: a Bioconductor pipeline for discovery in high-dimensional cytometry data 
Bioinformatics  2014;30(9):1329-1330.
Summary: We present a significantly improved version of the flowType and RchyOptimyx BioConductor-based pipeline that is 14 times faster and can accommodate multiple levels of biomarker expression for up to 96 markers. With these improvements, the pipeline is positioned to be an integral part of data analysis for high-throughput experiments on high-dimensional single-cell assay platforms, including flow cytometry, mass cytometry and single-cell RT-qPCR.
Availability: FlowType and RchyOptimyx are distributed under the Artistic 2.0 license through Bioconductor.
PMCID: PMC3998128  PMID: 24407226
9.  Correlation Analysis of Intracellular and Secreted Cytokines via the Generalized Integrated Mean Fluorescence Intensity (GiMFI)
The immune response in humans is usually assessed using immunogenicity assays to provide biomarkers as correlates of protection (CoP). Flow cytometry is the assay of choice to measure intracellular cytokine staining (ICS) of cell-mediated immune (CMI) biomarkers. For CMI analysis, the integrated mean fluorescence intensity (iMFI) was introduced as a metric to represent the total functional CMI response as a CoP. iMFI is computed by multiplying the relative frequency (% positive) of cells expressing a particular cytokine with the mean fluorescence intensity (MFI) of that population, and correlates better with protection in challenge models than either the percentage or the MFI of the cytokine-positive population. While determination of the iMFI as a CoP can readily be accomplished in animal models that allow challenge/protection experiments, this is not feasible in humans for ethical reasons. As a first step towards extending the iMFI concept to humans, we investigated the correlation of the iMFI derived from a human innate immune response ICS assay with functional cytokine release into the culture supernatant, as innate cytokines need to be released to have a functional impact. Next we developed a quantitatively more correlative mathematical approach for calculating the functional response of cytokine producing cells by incorporating the assignment of different weights to the magnitude (frequency of cytokine-positive cells) and the quality (the MFI) of the observed innate immune response. We refer to this model as GiMFI (Generalized iMFI).
PMCID: PMC2930075  PMID: 20629196
GiMFI; correlation analysis; functional response; culture supernatant; cytokine; flow cytometry; antigen presenting cells; integrated mean fluorescent intensity
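The iMFI defined in the entry above is a simple product, and a weighted generalization can be sketched in a few lines. The abstract does not give the GiMFI formula, so the exponent-based weighting below is an assumption chosen for illustration; the function and parameter names are hypothetical as well.

```python
def imfi(percent_positive, mfi):
    """Integrated MFI: relative frequency (% positive) times the MFI."""
    return percent_positive * mfi


def gimfi(percent_positive, mfi, magnitude_weight=1.0, quality_weight=1.0):
    """Weighted generalization of iMFI (assumed form: weights act as exponents).

    With both weights equal to 1 this reduces to the plain iMFI.
    """
    return (percent_positive ** magnitude_weight) * (mfi ** quality_weight)


# Hypothetical measurement: 12.5% cytokine-positive cells with an MFI of 2000
plain = imfi(12.5, 2000.0)
same_as_plain = gimfi(12.5, 2000.0)
# Down-weighting the magnitude term (frequency) relative to the quality term (MFI)
reweighted = gimfi(4.0, 100.0, magnitude_weight=0.5)
```

In practice the weights would be chosen to maximize the correlation with an external readout such as cytokine release into the supernatant, which is the comparison the abstract describes.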
10.  Data File Standard for Flow Cytometry, Version FCS 3.1
The flow cytometry data file standard provides the specifications needed to completely describe flow cytometry data sets within the confines of the file containing the experimental data. In 1984, the first Flow Cytometry Standard format for data files was adopted as FCS 1.0. This standard was modified in 1990 as FCS 2.0 and again in 1997 as FCS 3.0. We report here on the next generation Flow Cytometry Standard data file format. FCS 3.1 is a minor revision based on suggested improvements from the community. The unchanged goal of the Standard is to provide a uniform file format that allows files created by one type of acquisition hardware and software to be analyzed by any other type.
The FCS 3.1 standard retains the basic FCS file structure and most features of previous versions of the standard. Changes included in FCS 3.1 address potential ambiguities in the previous versions and provide a more robust standard. The major changes include simplified support for international characters and improved support for storing compensation. The major additions are support for preferred display scale, a standardized way of capturing the sample volume, information about originality of the data file, and support for plate and well identification in high throughput, plate based experiments. Please see the normative version of the FCS 3.1 specification in supplementary material to this manuscript (or online, in the Current standards section) for a complete list of changes.
PMCID: PMC2892967  PMID: 19937951
Flow cytometry; FCS; data standard; file format; bioinformatics
12.  XML-based Gating Descriptions in Flow Cytometry 
The lack of software interoperability with respect to gating due to lack of a standardized mechanism for data exchange has traditionally been a bottleneck preventing reproducibility of flow cytometry (FCM) data analysis and the usage of multiple analytical tools.
To facilitate interoperability among FCM data analysis tools, members of the International Society for the Advancement of Cytometry (ISAC) Data Standards Task Force (DSTF) have developed an XML-based mechanism to formally describe gates (Gating-ML).
Gating-ML, an open specification for encoding gating, data transformations and compensation, has been adopted by the ISAC DSTF as a Candidate Recommendation (CR).
Gating-ML can facilitate exchange of gating descriptions the same way that FCS facilitated exchange of raw FCM data. Its adoption will open new collaborative opportunities as well as possibilities for advanced analyses and methods development. The ISAC DSTF is satisfied that the standard addresses the requirements for a gating exchange standard.
PMCID: PMC2585156  PMID: 18773465
Flow cytometry; gating; XML; data standard; compensation; transformation; bioinformatics
13.  Data Standards for Flow Cytometry 
Flow cytometry (FCM) is an analytical tool widely used for cancer and HIV/AIDS research and treatment, stem cell manipulation and detection of microorganisms in environmental samples. Current data standards do not capture the full scope of FCM experiments and there is a demand for software tools that can assist in the exploration and analysis of large FCM datasets. We are implementing a standardized approach to capturing, analyzing, and disseminating FCM data that will facilitate both more complex analyses and analysis of datasets that could not previously be efficiently studied. Initial work has focused on developing a community-based guideline for recording and reporting the details of FCM experiments. Open source software tools that implement this standard are being created, with an emphasis on facilitating reproducible and extensible data analyses. In addition, tools for electronic collaboration will assist the integrated access and comprehension of experiments to empower users to collaborate on FCM analyses. This coordinated, joint development of bioinformatics standards and software tools for FCM data analysis has the potential to greatly facilitate both basic and clinical research, impacting a notably diverse range of medical and environmental research areas.
PMCID: PMC2768474  PMID: 16901228
14.  Immune Biomarkers Predictive of Respiratory Viral Infection in Elderly Nursing Home Residents 
PLoS ONE  2014;9(10):e108481.
To determine if immune phenotypes associated with immunosenescence predict risk of respiratory viral infection in elderly nursing home residents.
Residents ≥65 years from 32 nursing homes in 4 Canadian cities were enrolled in Fall 2009, 2010 and 2011, and followed for one influenza season. Following influenza vaccination, peripheral blood mononuclear cells (PBMCs) were obtained and analysed by flow cytometry for T-regs, CD4+ and CD8+ T-cell subsets (CCR7+CD45RA+, CCR7-CD45RA+ and CD28-CD57+) and CMV-reactive CD4+ and CD8+ T-cells. Nasopharyngeal swabs were obtained and tested for viruses in symptomatic residents. A Cox proportional hazards model, adjusted for age, sex and frailty, determined the relationship between immune phenotypes and time to viral infection.
1072 residents were enrolled; median age 86 years and 72% female. 269 swabs were obtained, 87 were positive for virus: influenza (24%), RSV (14%), coronavirus (32%), rhinovirus (17%), human metapneumovirus (9%) and parainfluenza (5%). In multivariable analysis, high T-reg% (HR 0.41, 95% CI 0.20–0.81) and high CMV-reactive CD4+ T-cell% (HR 1.69, 95% CI 1.03–2.78) were predictive of respiratory viral infection.
In elderly nursing home residents, high CMV-reactive CD4+ T-cells were associated with an increased risk and high T-regs were associated with a reduced risk of respiratory viral infection.
PMCID: PMC4183538  PMID: 25275464
15.  Automated analysis of multidimensional flow cytometry data improves diagnostic accuracy between mantle cell lymphoma and small lymphocytic lymphoma 
Mantle cell lymphoma (MCL) and small lymphocytic lymphoma (SLL) exhibit similar but distinct immunophenotypic profiles. While many cases can be diagnosed with high confidence based on flow cytometry (FCM) results alone, ambiguous cases are frequently encountered and necessitate additional studies including immunohistochemistry for cyclin D1 and fluorescence in-situ hybridization (FISH) analysis for t(11;14) translocation.
Design and Methods
In order to determine if greater diagnostic accuracy could be achieved from flow cytometry data alone, we developed an unbiased, machine-based algorithm and used it to automatically identify those features within the multidimensional space that best distinguish between the two disease types.
Data from 44 MCL cases and 70 SLL cases were analyzed. Using conventional diagnostic criteria, we were able to accurately assign only 64% of MCL and 69% of SLL cases. Using features identified by our automated approach, we were able to assign 100% of MCL and 97% of SLL cases correctly. The most discriminating feature was the ratio of mean fluorescence intensities (MFI) between CD20 and CD23. Unexpectedly, we also observed that inclusion of FMC7 expression in the diagnostic algorithm reduced its accuracy.
Computational methods allow objective assessment of the relative contribution of component data features to overall diagnostic accuracy, and reveal some conventional criteria can actually compromise this accuracy. Furthermore, computational approaches enable exploiting the full dimensionality of FCM data and can potentially lead to discovery of novel biomarkers relevant for clinical outcome.
PMCID: PMC4090220  PMID: 22180480
16.  The Logic of Surveillance Guidelines: An Analysis of Vaccine Adverse Event Reports from an Ontological Perspective 
PLoS ONE  2014;9(3):e92632.
When increased rates of adverse events following immunization are detected, regulatory action can be taken by public health agencies. However, to be interpreted, reports of adverse events must be encoded in a consistent way. Regulatory agencies rely on guidelines to help determine the diagnosis of the adverse events. Manual application of these guidelines is expensive, time consuming, and open to logical errors. Representing these guidelines in a format amenable to automated processing can make this process more efficient.
Methods and Findings
Using the Brighton anaphylaxis case definition, we show that existing clinical guidelines used as standards in pharmacovigilance can be logically encoded using a formal representation such as the Adverse Event Reporting Ontology we developed. We validated the classification of vaccine adverse event reports using the ontology against existing rule-based systems and a manually curated subset of the Vaccine Adverse Event Reporting System. However, we encountered a number of critical issues in the formulation and application of the clinical guidelines. We report these issues and the steps being taken to address them in current surveillance systems, and in the terminological standards in use.
By standardizing and improving the reporting process, we were able to automate diagnosis confirmation. By allowing medical experts to prioritize reports, such a system can accelerate the identification of adverse reactions to vaccines and the response of regulatory agencies. This approach of combining ontology and semantic technologies can be used to improve other areas of vaccine adverse event report analysis and should inform both the design of clinical guidelines and how they are used in the future.
Sufficient material to reproduce our results is available, including documentation, ontology, code and datasets.
PMCID: PMC3965435  PMID: 24667848
17.  B Cells With High Side Scatter Parameter by Flow Cytometry Correlate With Inferior Survival in Diffuse Large B-Cell Lymphoma 
Despite advances in the understanding of diffuse large B-cell lymphoma (DLBCL) biology, only the clinically based International Prognostic Index (IPI) is used routinely for risk stratification at diagnosis. To find novel prognostic markers, we analyzed flow cytometric data from 229 diagnostic DLBCL samples using an automated multiparameter data analysis approach developed in our laboratory. By using the developed automated data analysis pipeline, we identified 71 of 229 cases as having more than 35% B cells with a high side scatter (SSC) profile, a parameter reflecting internal cellular complexity. This high SSC B-cell feature was associated with inferior overall and progression-free survival (P = .001 and P = .01, respectively) and remained a significant predictor of overall survival in multivariate Cox regression analysis (IPI, P = .001; high SSC, P = .004; rituximab, P = .53).
This study suggests that high SSC among B cells may serve as a useful biomarker to identify patients with DLBCL at high risk for relapse. This is of particular interest because this biomarker is readily available in most clinical laboratories without significant alteration to existing routine diagnostic strategies or incurring additional costs.
PMCID: PMC3718075  PMID: 22523221
Side scatter; Flow cytometry; Diffuse large B-cell lymphoma; Lymphoma; Survival
18.  Integration of Lyoplate Based Flow Cytometry and Computational Analysis for Standardized Immunological Biomarker Discovery 
PLoS ONE  2013;8(7):e65485.
Discovery of novel immune biomarkers for monitoring of disease prognosis and response to therapy in immune-mediated inflammatory diseases is an important unmet clinical need. Here, we establish a novel framework for immunological biomarker discovery, comparing a conventional (liquid) flow cytometry platform (CFP) and a unique lyoplate-based flow cytometry platform (LFP) in combination with advanced computational data analysis. We demonstrate that LFP had higher sensitivity compared to CFP, with increased detection of cytokines (IFN-γ and IL-10) and activation markers (Foxp3 and CD25). Fluorescent intensity of cells stained with lyophilized antibodies was increased compared to cells stained with liquid antibodies. LFP, using a plate loader, allowed medium-throughput processing of samples with comparable intra- and inter-assay variability between platforms. Automated computational analysis identified novel immunophenotypes that were not detected with manual analysis. Our results establish a new flow cytometry platform for standardized and rapid immunological biomarker discovery with wide application to immune-mediated diseases.
PMCID: PMC3701052  PMID: 23843942
19.  GenePattern flow cytometry suite 
Traditional flow cytometry data analysis is largely based on interactive and time consuming analysis of a series of two-dimensional representations of up to 20-dimensional data. Recent technological advances have increased the amount of data generated by the technology and outpaced the development of data analysis approaches. While there are advanced tools available, including many R/BioConductor packages, these are only accessible programmatically and therefore out of reach for most experimentalists. GenePattern is a powerful genomic analysis platform with over 200 tools for analysis of gene expression, proteomics, and other data. A web-based interface provides easy access to these tools and allows the creation of automated analysis pipelines enabling reproducible research.
In order to bring advanced flow cytometry data analysis tools to experimentalists without programmatic skills, we developed the GenePattern Flow Cytometry Suite. It contains 34 open source GenePattern flow cytometry modules covering methods from basic processing of flow cytometry standard (i.e., FCS) files to advanced algorithms for automated identification of cell populations, normalization and quality assessment. Internally, these modules leverage functionality developed in R/BioConductor. Using the GenePattern web-based interface, they can be connected to build analytical pipelines.
GenePattern Flow Cytometry Suite brings advanced flow cytometry data analysis capabilities to users with minimal computer skills. Functionality previously available only to skilled bioinformaticians is now easily accessible from a web browser.
PMCID: PMC3717030  PMID: 23822732
Flow cytometry; Data analysis; GenePattern; FCS; Data preprocessing; Quality assessment; Normalization; Clustering
20.  The Luminal Progenitor Compartment of the Normal Human Mammary Gland Constitutes a Unique Site of Telomere Dysfunction 
Stem Cell Reports  2013;1(1):28-37.
Telomeres are essential for genomic integrity, but little is known about their regulation in the normal human mammary gland. We now demonstrate that a phenotypically defined cell population enriched in luminal progenitors (LPs) is characterized by unusually short telomeres independently of donor age. Furthermore, we find that multiple DNA damage response proteins colocalize with telomeres in >95% of LPs but in <5% of basal cells. Paradoxically, 25% of LPs are still capable of exhibiting robust clonogenic activity in vitro. This may be partially explained by the elevated telomerase activity that was also seen only in LPs. Interestingly, this potential telomere salvage mechanism declines with age. Our findings thus reveal marked differences in the telomere biology of different subsets of primitive normal human mammary cells. The chronically dysfunctional telomeres unique to LPs have potentially important implications for normal mammary tissue homeostasis as well as the development of certain breast cancers.
• Normal human mammary gland luminal progenitors (LPs) have very short telomeres
• LP nuclei selectively exhibit telomere-associated DNA damage responses
• LPs have selectively elevated hTERT expression and telomerase activity
• These LP features may play a role in mammary tissue homeostasis and transformation
PMCID: PMC3757746  PMID: 24052939
21.  Early immunologic correlates of HIV protection can be identified from computational analysis of complex multivariate T-cell flow cytometry assays
Bioinformatics  2012;28(7):1009-1016.
Motivation: Polychromatic flow cytometry (PFC) has enormous power as a tool to dissect complex immune responses (such as those observed in HIV disease) at a single cell level. However, analysis tools are severely lacking. Although high-throughput systems allow rapid data collection from large cohorts, manual data analysis can take months. Moreover, identification of cell populations can be subjective and analysts rarely examine the entirety of the multidimensional dataset (focusing instead on a limited number of subsets, the biology of which has usually already been well-described). Thus, the value of PFC as a discovery tool is largely wasted.
Results: To address this problem, we developed a computational approach that automatically reveals all possible cell subsets. From tens of thousands of subsets, those that correlate strongly with clinical outcome are selected and grouped. Within each group, markers that have minimal relevance to the biological outcome are removed, thereby distilling the complex dataset into the simplest, most clinically relevant subsets. This allows complex information from PFC studies to be translated into clinical or resource-poor settings, where multiparametric analysis is less feasible. We demonstrate the utility of this approach in a large (n=466), retrospective, 14-parameter PFC study of early HIV infection, where we identify three T-cell subsets that strongly predict progression to AIDS (only one of which was identified by an initial manual analysis).
Availability: The ‘flowType: Phenotyping Multivariate PFC Assays’ package is available through Bioconductor. Additional documentation and examples are available online.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3315712  PMID: 22383736
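The exhaustive subset enumeration described in the entry above can be sketched as a combinatorial expansion over markers. The sketch assumes each marker is either required positive, required negative, or ignored, which gives 3^k candidate phenotypes for k markers; that three-state encoding, the marker panel and the function names are assumptions for illustration and are not taken from the abstract.

```python
from itertools import product


def enumerate_phenotypes(markers):
    """Yield one dict per phenotype: marker -> '+', '-' or None (marker ignored)."""
    for states in product(("+", "-", None), repeat=len(markers)):
        yield dict(zip(markers, states))


def label(phenotype):
    """Build a readable name such as 'CD4+CD8-' from the constrained markers."""
    parts = [marker + state for marker, state in phenotype.items()
             if state is not None]
    return "".join(parts) or "all cells"


def count_matching(cells, phenotype):
    """cells: list of dicts mapping marker -> True if the cell is positive."""
    def matches(cell):
        return all(state is None or cell[marker] == (state == "+")
                   for marker, state in phenotype.items())
    return sum(matches(cell) for cell in cells)


markers = ["CD4", "CD8", "CD45RA"]  # hypothetical panel
phenotypes = list(enumerate_phenotypes(markers))
cells = [
    {"CD4": True, "CD8": False, "CD45RA": True},
    {"CD4": True, "CD8": False, "CD45RA": False},
    {"CD4": False, "CD8": True, "CD45RA": True},
]
counts = {label(p): count_matching(cells, p) for p in phenotypes}
```

Under this encoding ten markers already give 3^10 = 59,049 candidate subsets, which is consistent with the "tens of thousands of subsets" scale the abstract mentions; the per-subset counts would then feed the correlation with clinical outcome.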
22.  Scoring relevancy of features based on combinatorial analysis of Lasso with application to lymphoma diagnosis 
BMC Genomics  2013;14(Suppl 1):S14.
One challenge in applying bioinformatic tools to clinical or biological data is the high number of features that might be provided to the learning algorithm without any prior knowledge of which ones should be used. In such applications, the number of features can drastically exceed the number of training instances, which is often limited by the number of available samples for the study. The Lasso is one of many regularization methods that have been developed to prevent overfitting and improve prediction performance in high-dimensional settings. In this paper, we propose a novel algorithm for feature selection based on the Lasso; our hypothesis is that defining a scoring scheme that measures the "quality" of each feature can provide a more robust feature selection method. Our approach is to generate several samples from the training data by bootstrapping, determine the best relevance-ordering of the features for each sample, and finally combine these relevance-orderings to select highly relevant features. In addition to the theoretical analysis of our feature scoring scheme, we provide empirical evaluations on six real datasets from different fields to confirm the superiority of our method in exploratory data analysis and prediction performance. For example, we applied FeaLect, our feature scoring algorithm, to a lymphoma dataset, and according to a human expert, our method led to selecting more meaningful features than those commonly used in the clinics. This case study built a basis for discovering interesting new criteria for lymphoma diagnosis. Furthermore, to facilitate the use of our algorithm in other applications, the source code that implements our algorithm was released as FeaLect, a documented R package in CRAN.
PMCID: PMC3549810  PMID: 23369194
23.  Rapid Cell Population Identification in Flow Cytometry Data
We have developed flowMeans, a time-efficient and accurate method for automated identification of cell populations in flow cytometry (FCM) data based on K-means clustering. Unlike traditional K-means, flowMeans can identify concave cell populations by modelling a single population with multiple clusters. flowMeans uses a change point detection algorithm to determine the number of sub-populations, enabling the method to be used in high throughput FCM data analysis pipelines. Our approach compares favourably to manual analysis by human experts and current state-of-the-art automated gating algorithms. flowMeans is freely available as an open source R package through Bioconductor.
PMCID: PMC3137288  PMID: 21182178
flow cytometry; data analysis; cluster analysis; model selection; bioinformatics; statistics
24.  Flow cytometry data standards 
BMC Research Notes  2011;4:50.
Flow cytometry is a widely used analytical technique for examining microscopic particles, such as cells. The Flow Cytometry Standard (FCS) was developed in 1984 for storing flow data and it is supported by all instrument and third party software vendors. However, FCS does not capture the full scope of flow cytometry (FCM)-related data and metadata, and data standards have recently been developed to address this shortcoming.
The Data Standards Task Force (DSTF) of the International Society for the Advancement of Cytometry (ISAC) has developed several data standards to complement the raw data encoded in FCS files. Efforts started with the Minimum Information about a Flow Cytometry Experiment (MIFlowCyt), a minimal data reporting standard of details necessary to include when publishing FCM experiments to facilitate third party understanding. MIFlowCyt is now being recommended to authors by publishers as part of manuscript submission, and manuscripts are being checked by reviewers and editors for compliance. Gating-ML was then introduced to capture gating descriptions - an essential part of FCM data analysis describing the selection of cell populations of interest. The Classification Results File Format was developed to accommodate results of the gating process, mostly within the context of automated clustering. Additionally, the Archival Cytometry Standard bundles data with all the other components describing experiments. Here, we introduce these recent standards and provide the very first example of how they can be used to report FCM data including analysis and results in a standardized, computationally exchangeable form.
Reporting standards and open file formats are essential for scientific collaboration and independent validation. The recently developed FCM data standards are now being incorporated into third party software tools and data repositories, which will ultimately facilitate understanding and data reuse.
PMCID: PMC3060130  PMID: 21385382
25.  Data reduction for spectral clustering to analyze high throughput flow cytometry data 
BMC Bioinformatics  2010;11:403.
Recent biological discoveries have shown that clustering large datasets is essential for better understanding biology in many areas. Spectral clustering in particular has proven to be a powerful tool amenable to many applications. However, it cannot be directly applied to large datasets due to time and memory limitations. To address this issue, we have modified spectral clustering by adding an information preserving sampling procedure and applying a post-processing stage. We call this entire algorithm SamSPECTRAL.
We tested our algorithm on flow cytometry data as an example of large, multidimensional data containing potentially hundreds of thousands of data points (i.e., "events" in flow cytometry, typically corresponding to cells). Compared to two state of the art model-based flow cytometry clustering methods, SamSPECTRAL demonstrates significant advantages in proper identification of populations with non-elliptical shapes, low density populations close to dense ones, minor subpopulations of a major population and rare populations.
This work is the first successful attempt to apply spectral methodology to flow cytometry data. An implementation of our algorithm as an R package is freely available through BioConductor.
PMCID: PMC2923634  PMID: 20667133

