The Flow Cytometry Standard (FCS) format was developed in 1984. Since then, FCS has become the standard file format supported by all flow cytometry software and hardware vendors. Over the years, updates were incorporated to adapt to technological advancements in both flow cytometry and computing technologies. However, flexibility in how data may be stored in FCS has led to implementation difficulties for instrument vendors and third party software developers. In this technical note, we provide implementation guidance and examples related to FCS 3.1, the latest version of the standard. By publishing this text, we intend to prevent the potential compatibility issues, raised in discussions among some implementers, that could be faced when implementing the FCS spillover and preferred display keywords.
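As an illustration of the spillover keyword mentioned above, the following is a minimal sketch of reading a $SPILLOVER value. It assumes the FCS 3.1 layout of a parameter count n, followed by n parameter names, followed by n×n coefficients in row-major order, and it assumes that parameter names contain no commas. The function name and the sample value are hypothetical and are not taken from the technical note.

```python
def parse_spillover(value):
    """Split a $SPILLOVER keyword value into parameter names and an n x n matrix.

    Assumed layout: "n,name1,...,namen,c11,c12,...,cnn" (row-major).
    """
    fields = [field.strip() for field in value.split(",")]
    n = int(fields[0])
    expected = 1 + n + n * n
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, found {len(fields)}")
    names = fields[1:1 + n]
    coeffs = [float(x) for x in fields[1 + n:]]
    # Cut the flat coefficient list into n rows of n values each.
    matrix = [coeffs[row * n:(row + 1) * n] for row in range(n)]
    return names, matrix


# Hypothetical three-parameter example value.
names, matrix = parse_spillover(
    "3,FL2-A,FL1-A,FL3-A,1.0,0.03,0.2,0.1,1.0,0.0,0.05,0,1.0"
)
print(names)
print(matrix)
```

Checking the field count up front is a deliberate choice: a value whose length does not match 1 + n + n² is rejected rather than silently truncated.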
flow cytometry; FCS; data standard; file format; bioinformatics
Telomeres are essential for genomic integrity, but little is known about their regulation in the normal human mammary gland. We now demonstrate that a phenotypically defined cell population enriched in luminal progenitors (LPs) is characterized by unusually short telomeres independently of donor age. Furthermore, we find that multiple DNA damage response proteins colocalize with telomeres in >95% of LPs but in <5% of basal cells. Paradoxically, 25% of LPs are still capable of exhibiting robust clonogenic activity in vitro. This may be partially explained by the elevated telomerase activity that was also seen only in LPs. Interestingly, this potential telomere salvage mechanism declines with age. Our findings thus reveal marked differences in the telomere biology of different subsets of primitive normal human mammary cells. The chronically dysfunctional telomeres unique to LPs have potentially important implications for normal mammary tissue homeostasis as well as the development of certain breast cancers.
•Normal human mammary gland luminal progenitors (LPs) have very short telomeres
•LP nuclei selectively exhibit telomere-associated DNA damage responses
•LPs have selectively elevated hTERT expression and telomerase activity
•These LP features may play a role in mammary tissue homeostasis and transformation
Motivation: Polychromatic flow cytometry (PFC) has enormous power as a tool to dissect complex immune responses (such as those observed in HIV disease) at the single-cell level. However, analysis tools are severely lacking. Although high-throughput systems allow rapid data collection from large cohorts, manual data analysis can take months. Moreover, identification of cell populations can be subjective, and analysts rarely examine the entirety of the multidimensional dataset (focusing instead on a limited number of subsets, the biology of which has usually already been well described). Thus, the value of PFC as a discovery tool is largely wasted.
Results: To address this problem, we developed a computational approach that automatically reveals all possible cell subsets. From tens of thousands of subsets, those that correlate strongly with clinical outcome are selected and grouped. Within each group, markers that have minimal relevance to the biological outcome are removed, thereby distilling the complex dataset into the simplest, most clinically relevant subsets. This allows complex information from PFC studies to be translated into clinical or resource-poor settings, where multiparametric analysis is less feasible. We demonstrate the utility of this approach in a large (n=466), retrospective, 14-parameter PFC study of early HIV infection, where we identify three T-cell subsets that strongly predict progression to AIDS (only one of which was identified by an initial manual analysis).
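To give a sense of how quickly the number of candidate subsets grows with the number of markers, the sketch below enumerates phenotype labels under the assumption that each marker is either positive, negative, or left out of the definition, which gives 3^k labels for k markers. This is an illustration only and not the flowType implementation; the function name and marker names are hypothetical.

```python
from itertools import product


def enumerate_phenotypes(markers):
    """Return one label per combination of '+', '-' or 'ignored' for each marker."""
    labels = []
    for states in product(("+", "-", None), repeat=len(markers)):
        # Markers whose state is None do not appear in the label.
        label = "".join(m + s for m, s in zip(markers, states) if s is not None)
        labels.append(label or "all cells")
    return labels


phenotypes = enumerate_phenotypes(["CD3", "CD4", "CD8"])
print(len(phenotypes))  # 3**3 = 27
```

With ten markers the same enumeration yields 3^10 = 59,049 labels, which is why manual inspection of every subset is impractical.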
Availability: The ‘flowType: Phenotyping Multivariate PFC Assays’ package is available through Bioconductor. Additional documentation and examples are available at: www.terryfoxlab.ca/flowsite/flowType/
Supplementary data are available at Bioinformatics online.
A lack of software interoperability in gating, caused by the absence of a standardized mechanism for data exchange, has traditionally been a bottleneck preventing reproducibility of flow cytometry (FCM) data analysis and the use of multiple analytical tools.
To facilitate interoperability among FCM data analysis tools, members of the International Society for the Advancement of Cytometry (ISAC) Data Standards Task Force (DSTF) have developed an XML-based mechanism to formally describe gates (Gating-ML).
Gating-ML, an open specification for encoding gating, data transformations and compensation, has been adopted by the ISAC DSTF as a Candidate Recommendation (CR).
Gating-ML can facilitate the exchange of gating descriptions in the same way that FCS facilitated the exchange of raw FCM data. Its adoption will open new collaborative opportunities as well as possibilities for advanced analyses and methods development. The ISAC DSTF is satisfied that the standard addresses the requirements for a gating exchange standard.
Flow cytometry; gating; XML; data standard; compensation; transformation; bioinformatics
One challenge in applying bioinformatic tools to clinical or biological data is the high number of features that might be provided to the learning algorithm without any prior knowledge of which ones should be used. In such applications, the number of features can drastically exceed the number of training instances, which is often limited by the number of samples available for the study. The Lasso is one of many regularization methods that have been developed to prevent overfitting and improve prediction performance in high-dimensional settings. In this paper, we propose a novel algorithm for feature selection based on the Lasso. Our hypothesis is that defining a scoring scheme that measures the "quality" of each feature can provide a more robust feature selection method. Our approach is to generate several samples from the training data by bootstrapping, determine the best relevance-ordering of the features for each sample, and finally combine these relevance-orderings to select highly relevant features. In addition to the theoretical analysis of our feature scoring scheme, we provide empirical evaluations on six real datasets from different fields to confirm the superiority of our method in exploratory data analysis and prediction performance. For example, we applied FeaLect, our feature scoring algorithm, to a lymphoma dataset, and according to a human expert, our method led to the selection of more meaningful features than those commonly used in the clinic. This case study built a basis for discovering interesting new criteria for lymphoma diagnosis. Furthermore, to facilitate the use of our algorithm in other applications, the source code that implements it has been released as FeaLect, a documented R package on CRAN.
We have developed flowMeans, a time-efficient and accurate method for automated identification of cell populations in flow cytometry (FCM) data based on K-means clustering. Unlike traditional K-means, flowMeans can identify concave cell populations by modelling a single population with multiple clusters. flowMeans uses a change point detection algorithm to determine the number of sub-populations, enabling the method to be used in high throughput FCM data analysis pipelines. Our approach compares favourably to manual analysis by human experts and current state-of-the-art automated gating algorithms. flowMeans is freely available as an open source R package through Bioconductor.
flow cytometry; data analysis; cluster analysis; model selection; bioinformatics; statistics
The immune response in humans is usually assessed using immunogenicity assays to provide biomarkers as correlates of protection (CoP). Flow cytometry is the assay of choice to measure intracellular cytokine staining (ICS) of cell-mediated immune (CMI) biomarkers. For CMI analysis, the integrated mean fluorescence intensity (iMFI) was introduced as a metric to represent the total functional CMI response as a CoP. iMFI is computed by multiplying the relative frequency (% positive) of cells expressing a particular cytokine with the mean fluorescence intensity (MFI) of that population, and correlates better with protection in challenge models than either the percentage or the MFI of the cytokine-positive population. While determination of the iMFI as a CoP can readily be accomplished in animal models that allow challenge/protection experiments, this is not feasible in humans for ethical reasons. As a first step towards extending the iMFI concept to humans, we investigated the correlation of the iMFI derived from a human innate immune response ICS assay with functional cytokine release into the culture supernatant, as innate cytokines need to be released to have a functional impact. Next we developed a quantitatively more correlative mathematical approach for calculating the functional response of cytokine producing cells by incorporating the assignment of different weights to the magnitude (frequency of cytokine-positive cells) and the quality (the MFI) of the observed innate immune response. We refer to this model as GiMFI (Generalized iMFI).
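The iMFI definition above (relative frequency multiplied by MFI) can be written out directly. The second function below is only a hypothetical way of giving magnitude and quality different weights, here as exponents; the abstract does not state the functional form of GiMFI, so that form, the parameter names, and the example values are all assumptions made for illustration.

```python
def imfi(percent_positive, mfi):
    """Integrated MFI: % cytokine-positive cells multiplied by their MFI."""
    return percent_positive * mfi


def weighted_imfi(percent_positive, mfi, w_magnitude=1.0, w_quality=1.0):
    """Hypothetical weighted variant (assumed form, not taken from the source).

    Weights act as exponents on the magnitude (% positive) and the
    quality (MFI); with both weights equal to 1 this reduces to iMFI.
    """
    return (percent_positive ** w_magnitude) * (mfi ** w_quality)


# Made-up values: 12.5% positive cells with an MFI of 800.
print(imfi(12.5, 800.0))
print(weighted_imfi(12.5, 800.0))
```

Keeping the plain iMFI as the special case with unit weights makes it easy to compare any weighted variant against the original metric on the same data.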
GiMFI; correlation analysis; functional response; culture supernatant; cytokine; flow cytometry; antigen presenting cells; integrated mean fluorescent intensity
Flow cytometry is a widely used analytical technique for examining microscopic particles, such as cells. The Flow Cytometry Standard (FCS) was developed in 1984 for storing flow data and it is supported by all instrument and third party software vendors. However, FCS does not capture the full scope of flow cytometry (FCM)-related data and metadata, and data standards have recently been developed to address this shortcoming.
The Data Standards Task Force (DSTF) of the International Society for the Advancement of Cytometry (ISAC) has developed several data standards to complement the raw data encoded in FCS files. Efforts started with the Minimum Information about a Flow Cytometry Experiment (MIFlowCyt), a minimal data reporting standard listing the details that should be included when publishing FCM experiments to facilitate third party understanding. MIFlowCyt is now being recommended to authors by publishers as part of manuscript submission, and manuscripts are being checked by reviewers and editors for compliance. Gating-ML was then introduced to capture gating descriptions, an essential part of FCM data analysis describing the selection of cell populations of interest. The Classification Results File Format was developed to accommodate results of the gating process, mostly within the context of automated clustering. Additionally, the Archival Cytometry Standard bundles data with all the other components describing experiments. Here, we introduce these recent standards and provide the very first example of how they can be used to report FCM data including analysis and results in a standardized, computationally exchangeable form.
Reporting standards and open file formats are essential for scientific collaboration and independent validation. The recently developed FCM data standards are now being incorporated into third party software tools and data repositories, which will ultimately facilitate understanding and data reuse.
Recent biological discoveries have shown that clustering large datasets is essential for better understanding biology in many areas. Spectral clustering in particular has proven to be a powerful tool amenable to many applications. However, it cannot be directly applied to large datasets due to time and memory limitations. To address this issue, we have modified spectral clustering by adding an information-preserving sampling procedure and applying a post-processing stage. We call this entire algorithm SamSPECTRAL.
We tested our algorithm on flow cytometry data as an example of large, multidimensional data containing potentially hundreds of thousands of data points (i.e., "events" in flow cytometry, typically corresponding to cells). Compared to two state of the art model-based flow cytometry clustering methods, SamSPECTRAL demonstrates significant advantages in proper identification of populations with non-elliptical shapes, low density populations close to dense ones, minor subpopulations of a major population and rare populations.
This work is the first successful attempt to apply spectral methodology to flow cytometry data. An implementation of our algorithm as an R package is freely available through Bioconductor.
Experimental descriptions are typically stored as free text without using standardized terminology, creating challenges in comparison, reproduction and analysis. These difficulties impose limitations on data exchange and information retrieval.
The Ontology for Biomedical Investigations (OBI), developed as a global, cross-community effort, provides a resource that represents biomedical investigations in an explicit and integrative framework. Here we detail three real-world applications of OBI, provide detailed modeling information and explain how to use OBI.
We demonstrate how OBI can be applied to different biomedical investigations to both facilitate interpretation of the experimental process and increase the computational processing and integration within the Semantic Web. The logical definitions of the entities involved allow computers to unambiguously understand and integrate different biological experimental processes and their relevant components.
OBI is available at http://purl.obolibrary.org/obo/obi/2009-11-02/obi.owl
Ontology development is a rapidly growing area of research, especially in the life sciences domain. To promote collaboration and interoperability between different projects, the OBO Foundry principles require that these ontologies be open and non-redundant, avoiding duplication of terms through the re-use of existing resources. As current options to do so present various difficulties, a new approach, MIREOT, allows specifying import of single terms. Initial implementations allow for controlled import of selected annotations and certain classes of related terms.
OntoFox (http://ontofox.hegroup.org/) is a web-based system that allows users to input terms, fetch selected properties, annotations, and certain classes of related terms from the source ontologies and save the results using the RDF/XML serialization of the Web Ontology Language (OWL). Compared to an initial implementation of MIREOT, OntoFox allows additional and more easily configurable options for selecting and rewriting annotation properties, and for inclusion of all or a computed subset of terms between low and top level terms. Further methods for including related classes are a SPARQL-based ontology term retrieval algorithm that extracts terms related to a given set of signature terms and an option to extract the hierarchy rooted at a specified ontology term. OntoFox's output can be directly imported into a developer's ontology. OntoFox currently supports term retrieval from a selection of 15 ontologies accessible via SPARQL endpoints and allows users to extend this by specifying additional endpoints. An OntoFox application in the development of the Vaccine Ontology (VO) is demonstrated.
OntoFox is a timely, publicly available service that offers users different options for collecting terms from external ontologies and makes those terms available for reuse by import into client OWL ontologies.
Flow cytometry (FCM) is widely used in health research and in treatment for a variety of tasks, such as the diagnosis and monitoring of leukemia and lymphoma patients, the counting of helper-T lymphocytes needed to monitor the course and treatment of HIV infection, the evaluation of peripheral blood hematopoietic stem cell grafts, and applications in many other diseases. In practice, FCM data analysis is performed manually, a process that requires an inordinate amount of time and is error-prone, nonreproducible, nonstandardized, and not open for re-evaluation, making it the most limiting aspect of this technology. This paper reviews state-of-the-art FCM data analysis approaches using a framework introduced to report each of the components in a data analysis pipeline. Current challenges and possible future directions in developing fully automated FCM data analysis tools are also outlined.
The development of the Functional Genomics Investigation Ontology (FuGO) is a collaborative, international effort that will provide a resource for annotating functional genomics investigations, including the study design, protocols and instrumentation used, the data generated and the types of analysis performed on the data. FuGO will contain both terms that are universal to all functional genomics investigations and those that are domain specific. In this way, the ontology will serve as the “semantic glue” to provide a common understanding of data from across these disparate data sources. In addition, FuGO will reference out to existing mature ontologies to avoid the need to duplicate these resources, and will do so in such a way as to enable their ease of use in annotation. This project is in the early stages of development; the paper will describe efforts to initiate the project, the scope and organization of the project, the work accomplished to date, and the challenges encountered, as well as future plans.
The Minimum Information for Biological and Biomedical Investigations (MIBBI) project provides a resource for those exploring the range of extant minimum information checklists and fosters coordinated development of such checklists.
Flow cytometry (FCM) is an analytical tool widely used for cancer and HIV/AIDS research and treatment, stem cell manipulation, and the detection of microorganisms in environmental samples. Current data standards do not capture the full scope of FCM experiments, and there is a demand for software tools that can assist in the exploration and analysis of large FCM datasets. We are implementing a standardized approach to capturing, analyzing, and disseminating FCM data that will facilitate both more complex analyses and analysis of datasets that could not previously be efficiently studied. Initial work has focused on developing a community-based guideline for recording and reporting the details of FCM experiments. Open source software tools that implement this standard are being created, with an emphasis on facilitating reproducible and extensible data analyses. In addition, tools for electronic collaboration will support integrated access to and comprehension of experiments, empowering users to collaborate on FCM analyses. This coordinated, joint development of bioinformatics standards and software tools for FCM data analysis has the potential to greatly facilitate both basic and clinical research, impacting a notably diverse range of medical and environmental research areas.
The recent development of semiautomated techniques for staining and analyzing flow cytometry samples has presented new challenges. Quality control and quality assessment are critical when developing new high throughput technologies and their associated information services. Our experience suggests that significant bottlenecks remain in the development of high throughput flow cytometry methods for data analysis and display. In particular, data quality control and quality assessment are crucial steps in processing and analyzing high throughput flow cytometry data.
We propose a variety of graphical exploratory data analytic tools for exploring ungated flow cytometry data. We have implemented a number of specialized functions and methods in the Bioconductor package rflowcyt. We demonstrate the use of these approaches by investigating two independent sets of high throughput flow cytometry data.
We found that graphical representations can reveal substantial nonbiological differences in samples. Empirical cumulative distribution function (ECDF) plots and summary scatterplots were especially useful in the rapid identification of problems not identified by manual review.
Graphical exploratory data analytic tools are quick and useful means of assessing data quality. We propose that the described visualizations should be used as quality assessment tools and where possible, be used for quality control.
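To make the empirical cumulative distribution function comparison concrete, the sketch below computes an ECDF for one channel and the largest vertical gap between the ECDFs of two samples. The gap is the Kolmogorov-Smirnov statistic, used here as one possible numerical summary of what an ECDF overlay shows; it is an assumption of this sketch rather than a method described above, and it is not the rflowcyt API. Function names and data values are hypothetical.

```python
from bisect import bisect_right


def ecdf(values):
    """Return the sorted values and the ECDF evaluated at each of them."""
    xs = sorted(values)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]


def max_ecdf_gap(sample_a, sample_b):
    """Largest vertical distance between two ECDFs (Kolmogorov-Smirnov statistic)."""
    a, b = sorted(sample_a), sorted(sample_b)
    # Both ECDFs are step functions, so the maximum gap occurs at an observed value.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


# Made-up fluorescence values for the same channel in two wells.
well_1 = [102.0, 110.0, 98.0, 105.0, 101.0]
well_2 = [250.0, 243.0, 260.0, 255.0, 248.0]
print(max_ecdf_gap(well_1, well_2))
```

A gap close to 1 between wells that should be replicates would flag a sample for closer inspection, while a gap close to 0 indicates nearly identical distributions.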
flow cytometry; high throughput; quality assessment; visualization; exploratory data analysis; statistics; software
Flow cytometry (FCM) software packages from R/Bioconductor, such as flowCore and flowViz, serve as an open platform for development of new analysis tools and methods. We created plateCore, a new package that extends the functionality in these core packages to enable automated negative control-based gating and to make the processing and analysis of plate-based data sets from high-throughput FCM screening experiments easier. plateCore was used to analyze data from a BD FACS CAP screening experiment where five peripheral blood mononuclear cell (PBMC) samples were assayed for 189 different human cell surface markers. This same data set was also manually analyzed by a cytometry expert using the FlowJo data analysis software package (TreeStar, USA). We show that the expression values for markers characterized using the automated approach in plateCore are in good agreement with those from FlowJo, and that using plateCore allows for more reproducible analyses of FCM screening data.
As a high-throughput technology that offers rapid quantification of multidimensional characteristics for millions of cells, flow cytometry (FCM) is widely used in health research, medical diagnosis and treatment, and vaccine development. Nevertheless, there is increasing concern about the lack of appropriate software tools to provide an automated analysis platform that parallels the high-throughput data-generation platform. Currently, to a large extent, FCM data analysis relies on the manual selection of sequential regions in 2-D graphical projections to extract the cell populations of interest. This is a time-consuming task that ignores the high dimensionality of FCM data.
In view of the aforementioned issues, we have developed an R package called flowClust to automate FCM analysis. flowClust implements a robust model-based clustering approach based on multivariate t mixture models with the Box-Cox transformation. The package provides the functionality to identify cell populations whilst simultaneously handling the commonly encountered issues of outlier identification and data transformation. It offers various tools to summarize and visualize a wealth of features of the clustering results. In addition, to ensure its convenience of use, flowClust has been adapted for the current FCM data format, and integrated with existing Bioconductor packages dedicated to FCM analysis.
flowClust addresses the issue of a dearth of software that helps automate FCM analysis with a sound theoretical foundation. It tends to give reproducible results, and helps reduce the significant subjectivity and human time cost encountered in FCM analysis. The package contributes to the cytometry community by offering an efficient, automated analysis platform which facilitates the active, ongoing technological advancement.
Recent advances in automation technologies have enabled the use of flow cytometry for high throughput screening, generating large complex data sets often in clinical trials or drug discovery settings. However, data management and data analysis methods have not advanced sufficiently far from the initial small-scale studies to support modeling in the presence of multiple covariates.
We developed a set of flexible open source computational tools in the R package flowCore to facilitate the analysis of these complex data. A key component is a set of suitable data structures that support the application of similar operations to a collection of samples or a clinical cohort. In addition, our software constitutes a shared and extensible research platform that enables collaboration between bioinformaticians, computer scientists, statisticians, biologists and clinicians. This platform will foster the development of novel analytic methods for flow cytometry.
The software has been applied in the analysis of various data sets and its data structures have proven to be highly efficient in capturing and organizing the analytic work flow. Finally, a number of additional Bioconductor packages successfully build on the infrastructure provided by flowCore, opening new avenues for flow data analysis.
Huntington disease (HD) is a neurodegenerative disorder caused by the abnormal expansion of CAG repeats in the HD gene on chromosome 4p16.3. A recent genome scan for genetic modifiers of age at onset of motor symptoms (AO) in HD suggests that one modifier may reside in the region close to the HD gene itself. We used data from 535 HD participants of the New England Huntington cohort and the HD MAPS cohort to assess whether AO was influenced by any of the three markers in the 4p16 region: MSX1 (Drosophila homeo box homologue 1, formerly known as homeo box 7, HOX7), Δ2642 (within the HD coding sequence), and BJ56 (D4S127). Suggestive evidence for an association was seen between MSX1 alleles and AO, after adjustment for normal CAG repeat, expanded repeat, and their product term (model P value 0.079). Of the variance of AO that was not accounted for by HD and normal CAG repeats, 0.8% could be attributed to the MSX1 genotype. Individuals with MSX1 genotype 3/3 tended to have younger AO. No association was found between AO and either Δ2642 (P=0.44) or BJ56 (P=0.73). This study supports previous studies suggesting that there may be a significant genetic modifier for AO in HD in the 4p16 region. Furthermore, the modifier may be present on both HD and normal chromosomes bearing the 3 allele of the MSX1 marker.
Huntington disease; Modifier; Onset age; Genetics; Trinucleotide repeat; HD gene
Despite advances in the understanding of diffuse large B-cell lymphoma (DLBCL) biology, only the clinically based International Prognostic Index (IPI) is used routinely for risk stratification at diagnosis. To find novel prognostic markers, we analyzed flow cytometric data from 229 diagnostic DLBCL samples using an automated multiparameter data analysis approach developed in our laboratory. Using this pipeline, we identified 71 of 229 cases as having more than 35% B cells with a high side scatter (SSC) profile, a parameter reflecting internal cellular complexity. This high SSC B-cell feature was associated with inferior overall and progression-free survival (P = .001 and P = .01, respectively) and remained a significant predictor of overall survival in multivariate Cox regression analysis (IPI, P = .001; high SSC, P = .004; rituximab, P = .53).
This study suggests that high SSC among B cells may serve as a useful biomarker to identify patients with DLBCL at high risk for relapse. This is of particular interest because this biomarker is readily available in most clinical laboratories without significant alteration to existing routine diagnostic strategies or incurring additional costs.
Side scatter; Flow cytometry; Diffuse large B-cell lymphoma; Lymphoma; Survival
Discovery of novel immune biomarkers for monitoring of disease prognosis and response to therapy in immune-mediated inflammatory diseases is an important unmet clinical need. Here, we establish a novel framework for immunological biomarker discovery, comparing a conventional (liquid) flow cytometry platform (CFP) and a unique lyoplate-based flow cytometry platform (LFP) in combination with advanced computational data analysis. We demonstrate that LFP had higher sensitivity compared to CFP, with increased detection of cytokines (IFN-γ and IL-10) and activation markers (Foxp3 and CD25). Fluorescent intensity of cells stained with lyophilized antibodies was increased compared to cells stained with liquid antibodies. LFP, using a plate loader, allowed medium-throughput processing of samples with comparable intra- and inter-assay variability between platforms. Automated computational analysis identified novel immunophenotypes that were not detected with manual analysis. Our results establish a new flow cytometry platform for standardized and rapid immunological biomarker discovery with wide application to immune-mediated diseases.
Traditional flow cytometry data analysis is largely based on interactive and time-consuming analysis of a series of two-dimensional representations of up to 20-dimensional data. Recent technological advances have increased the amount of data generated by the technology and outpaced the development of data analysis approaches. While there are advanced tools available, including many R/Bioconductor packages, these are only accessible programmatically and therefore out of reach for most experimentalists. GenePattern is a powerful genomic analysis platform with over 200 tools for analysis of gene expression, proteomics, and other data. A web-based interface provides easy access to these tools and allows the creation of automated analysis pipelines enabling reproducible research.
In order to bring advanced flow cytometry data analysis tools to experimentalists without programming skills, we developed the GenePattern Flow Cytometry Suite. It contains 34 open source GenePattern flow cytometry modules covering methods from basic processing of Flow Cytometry Standard (FCS) files to advanced algorithms for automated identification of cell populations, normalization and quality assessment. Internally, these modules leverage functionality developed in R/Bioconductor. Using the GenePattern web-based interface, they can be connected to build analytical pipelines.
GenePattern Flow Cytometry Suite brings advanced flow cytometry data analysis capabilities to users with minimal computer skills. Functionality previously available only to skilled bioinformaticians is now easily accessible from a web browser.
Flow cytometry; Data analysis; GenePattern; FCS; Data preprocessing; Quality assessment; Normalization; Clustering
The flow cytometry data file standard provides the specifications needed to completely describe flow cytometry data sets within the confines of the file containing the experimental data. In 1984, the first Flow Cytometry Standard format for data files was adopted as FCS 1.0. This standard was modified in 1990 as FCS 2.0 and again in 1997 as FCS 3.0. We report here on the next generation Flow Cytometry Standard data file format. FCS 3.1 is a minor revision based on suggested improvements from the community. The unchanged goal of the Standard is to provide a uniform file format that allows files created by one type of acquisition hardware and software to be analyzed by any other type.
The FCS 3.1 standard retains the basic FCS file structure and most features of previous versions of the standard. Changes included in FCS 3.1 address potential ambiguities in the previous versions and provide a more robust standard. The major changes include simplified support for international characters and improved support for storing compensation. The major additions are support for preferred display scale, a standardized way of capturing the sample volume, information about originality of the data file, and support for plate and well identification in high throughput, plate based experiments. Please see the normative version of the FCS 3.1 specification in supplementary material to this manuscript (or at http://www.isac-net.org/ in the Current standards section) for a complete list of changes.
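The basic FCS file structure referred to above can be illustrated with a minimal reader for the header and TEXT segment. The sketch assumes the usual layout: a 6-byte version string, 4 spaces, then six right-justified 8-byte ASCII byte offsets (TEXT, DATA and ANALYSIS start and end, with inclusive end offsets), and a TEXT segment whose first character is the delimiter separating keywords and values. It does not handle doubled (escaped) delimiters inside values or offsets stored only in TEXT keywords, and the function names and synthetic file content are hypothetical; the normative specification remains the reference for the full rules.

```python
def parse_fcs_header(header):
    """Read the version string and six segment offsets from the first 58 bytes."""
    if len(header) < 58:
        raise ValueError("FCS header must be at least 58 bytes")
    version = header[0:6].decode("ascii")
    names = ("text_start", "text_end", "data_start",
             "data_end", "analysis_start", "analysis_end")
    offsets = {}
    for i, name in enumerate(names):
        field = header[10 + 8 * i:18 + 8 * i].decode("ascii").strip()
        offsets[name] = int(field) if field else 0  # blank field treated as 0
    return version, offsets


def parse_text_segment(segment):
    """Split a TEXT segment into a keyword/value dict (no escaped delimiters)."""
    text = segment.decode("utf-8")
    delimiter = text[0]
    parts = text[1:].split(delimiter)
    if parts and parts[-1] == "":
        parts.pop()  # drop the empty piece after the trailing delimiter
    return dict(zip(parts[0::2], parts[1::2]))


# Synthetic in-memory example instead of a real file.
text = b"/$TOT/1000/$PAR/4/"
header = b"FCS3.1" + b" " * 4 + b"".join(
    b"%8d" % v for v in (58, 58 + len(text) - 1, 0, 0, 0, 0)
)
blob = header + text
version, offsets = parse_fcs_header(blob[:58])
keywords = parse_text_segment(blob[offsets["text_start"]:offsets["text_end"] + 1])
print(version, keywords)
```

Reading the offsets from the header first, and only then slicing the TEXT segment, mirrors how the format is meant to be navigated: nothing about the keyword content needs to be known in advance.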
Flow cytometry; FCS; data standard; file format; bioinformatics