One challenge in applying bioinformatic tools to clinical or biological data is the high number of features that may be provided to the learning algorithm without any prior knowledge of which ones should be used. In such applications, the number of features can drastically exceed the number of training instances, which is often limited by the number of samples available for the study. The Lasso is one of many regularization methods developed to prevent overfitting and improve prediction performance in high-dimensional settings. In this paper, we propose a novel feature selection algorithm based on the Lasso, building on the hypothesis that a scoring scheme measuring the "quality" of each feature can provide a more robust feature selection method. Our approach is to generate several samples from the training data by bootstrapping, determine the best relevance ordering of the features for each sample, and finally combine these relevance orderings to select highly relevant features. In addition to a theoretical analysis of our feature scoring scheme, we provide empirical evaluations on six real datasets from different fields to confirm the superiority of our method in exploratory data analysis and prediction performance. For example, we applied FeaLect, our feature scoring algorithm, to a lymphoma dataset, and according to a human expert, our method selected more meaningful features than those commonly used in the clinic. This case study provides a basis for discovering interesting new criteria for lymphoma diagnosis. Furthermore, to facilitate the use of our algorithm in other applications, we released the source code as FeaLect, a documented R package on CRAN.
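The bootstrap-and-aggregate idea can be sketched in a few lines. The snippet below is a toy illustration only: the real FeaLect scores features along the Lasso regularization path, whereas here a simple absolute-correlation score stands in for that step, purely to show how per-bootstrap relevance orderings are combined into a single feature score.

```python
import random

def _corr(a, b):
    """Pearson correlation of two equal-length sequences (0 if degenerate)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    denom = (va * vb) ** 0.5
    return cov / denom if denom else 0.0

def feature_scores(X, y, n_boot=50, seed=0):
    """Toy sketch of bootstrap-based feature scoring (FeaLect-style).

    For each bootstrap sample, features are ranked by a stand-in
    relevance measure (|correlation| with the labels, instead of the
    Lasso path used by the actual method) and awarded Borda-style
    points; totals accumulate across bootstraps.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    totals = [0.0] * d
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]        # bootstrap sample
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        rel = [abs(_corr([row[j] for row in Xb], yb)) for j in range(d)]
        order = sorted(range(d), key=lambda j: -rel[j])   # best feature first
        for rank, j in enumerate(order):
            totals[j] += d - rank                         # Borda points
    return totals  # higher = more consistently relevant
```

Features that rank highly across many bootstrap samples accumulate high totals, which is the robustness argument made in the abstract: a feature that looks relevant only in one resample is penalized relative to one that is relevant consistently.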
The lack of a standardized mechanism for exchanging gating descriptions has traditionally limited software interoperability, creating a bottleneck that prevents reproducible flow cytometry (FCM) data analysis and the use of multiple analytical tools.
To facilitate interoperability among FCM data analysis tools, members of the International Society for the Advancement of Cytometry (ISAC) Data Standards Task Force (DSTF) have developed an XML-based mechanism to formally describe gates (Gating-ML).
Gating-ML, an open specification for encoding gating, data transformations and compensation, has been adopted by the ISAC DSTF as a Candidate Recommendation (CR).
Gating-ML can facilitate the exchange of gating descriptions in the same way that FCS facilitated the exchange of raw FCM data. Its adoption will open new collaborative opportunities as well as possibilities for advanced analyses and methods development. The ISAC DSTF is satisfied that the standard addresses the requirements for a gating exchange standard.
Flow cytometry; gating; XML; data standard; compensation; transformation; bioinformatics
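To make the idea of machine-readable gate descriptions concrete, the sketch below parses a simplified XML gate and tests event membership. The element and attribute names here are invented for illustration and do not follow the normative Gating-ML schema, which also covers transformations and compensation.

```python
import xml.etree.ElementTree as ET

# Illustrative gate description only; not the normative Gating-ML schema.
GATE_XML = """
<gate id="lymphocytes" type="rectangle">
  <dimension name="FSC-A" min="20000" max="80000"/>
  <dimension name="SSC-A" min="5000" max="40000"/>
</gate>
"""

def in_gate(event, gate_xml=GATE_XML):
    """Return True if an event (dict of channel name -> value) falls
    inside the rectangular gate described by the XML document."""
    gate = ET.fromstring(gate_xml)
    for dim in gate.findall("dimension"):
        value = event[dim.get("name")]
        if not (float(dim.get("min")) <= value <= float(dim.get("max"))):
            return False
    return True
```

The point of an open, formal encoding like this is that any compliant tool can evaluate the same gate definition and reproduce the same population selection.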
We have developed flowMeans, a time-efficient and accurate method for automated identification of cell populations in flow cytometry (FCM) data based on K-means clustering. Unlike traditional K-means, flowMeans can identify concave cell populations by modelling a single population with multiple clusters. flowMeans uses a change point detection algorithm to determine the number of sub-populations, enabling the method to be used in high throughput FCM data analysis pipelines. Our approach compares favourably to manual analysis by human experts and current state-of-the-art automated gating algorithms. flowMeans is freely available as an open source R package through Bioconductor.
flow cytometry; data analysis; cluster analysis; model selection; bioinformatics; statistics
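A key step described above is using change point detection to decide how many sub-populations to keep as clusters are merged. The sketch below finds an "elbow" in a decreasing curve (such as the minimum inter-cluster distance at each merge step) as the point farthest from the line joining the curve's endpoints; flowMeans uses its own, more formal change-point criterion, so this is only an illustrative stand-in.

```python
def change_point(values):
    """Return the index of the elbow in a decreasing curve: the point
    with the largest perpendicular distance to the straight line
    connecting the first and last points of the curve."""
    n = len(values)
    x1, y1, x2, y2 = 0, values[0], n - 1, values[-1]

    def dist(i, v):
        # distance from point (i, v) to the line through the endpoints
        num = abs((y2 - y1) * i - (x2 - x1) * v + x2 * y1 - y2 * x1)
        den = ((y2 - y1) ** 2 + (x2 - x1) ** 2) ** 0.5
        return num / den

    return max(range(n), key=lambda i: dist(i, values[i]))
```

Stopping the merging at the detected change point removes the need for the user to specify the number of populations in advance, which is what makes the method usable in unattended, high throughput pipelines.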
The immune response in humans is usually assessed using immunogenicity assays to provide biomarkers as correlates of protection (CoP). Flow cytometry is the assay of choice to measure intracellular cytokine staining (ICS) of cell-mediated immune (CMI) biomarkers. For CMI analysis, the integrated mean fluorescence intensity (iMFI) was introduced as a metric to represent the total functional CMI response as a CoP. iMFI is computed by multiplying the relative frequency (% positive) of cells expressing a particular cytokine with the mean fluorescence intensity (MFI) of that population, and correlates better with protection in challenge models than either the percentage or the MFI of the cytokine-positive population. While determination of the iMFI as a CoP can readily be accomplished in animal models that allow challenge/protection experiments, this is not feasible in humans for ethical reasons. As a first step towards extending the iMFI concept to humans, we investigated the correlation of the iMFI derived from a human innate immune response ICS assay with functional cytokine release into the culture supernatant, as innate cytokines need to be released to have a functional impact. Next, we developed a mathematical approach that correlates more closely, calculating the functional response of cytokine-producing cells by assigning different weights to the magnitude (frequency of cytokine-positive cells) and the quality (the MFI) of the observed innate immune response. We refer to this model as GiMFI (Generalized iMFI).
GiMFI; correlation analysis; functional response; culture supernatant; cytokine; flow cytometry; antigen presenting cells; integrated mean fluorescent intensity
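The iMFI definition above is a simple product, and the generalization adds separate weights to its two factors. The snippet below encodes both; note that the weighted-power form used for `gimfi` is an assumption for illustration, since the abstract describes weighting the magnitude and quality terms without giving the exact formula.

```python
def imfi(freq_positive, mfi):
    """Integrated MFI: fraction of cytokine-positive cells multiplied by
    the mean fluorescence intensity (MFI) of that positive population."""
    return freq_positive * mfi

def gimfi(freq_positive, mfi, w_freq=1.0, w_mfi=1.0):
    """Sketch of a generalized iMFI that weights magnitude (frequency)
    and quality (MFI) separately. The power-weighting form here is an
    assumed illustration, not necessarily the paper's GiMFI formula;
    with unit weights it reduces to the plain iMFI."""
    return (freq_positive ** w_freq) * (mfi ** w_mfi)
```

Fitting the weights against an external functional readout (such as cytokine release into the supernatant) is what would let the generalized metric correlate more closely than the unweighted product.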
Flow cytometry is a widely used analytical technique for examining microscopic particles, such as cells. The Flow Cytometry Standard (FCS) was developed in 1984 for storing flow data and is supported by all instrument and third-party software vendors. However, FCS does not capture the full scope of flow cytometry (FCM)-related data and metadata, and data standards have recently been developed to address this shortcoming.
The Data Standards Task Force (DSTF) of the International Society for the Advancement of Cytometry (ISAC) has developed several data standards to complement the raw data encoded in FCS files. Efforts started with the Minimum Information about a Flow Cytometry Experiment (MIFlowCyt), a minimal data reporting standard of details necessary to include when publishing FCM experiments to facilitate third-party understanding. MIFlowCyt is now being recommended to authors by publishers as part of manuscript submission, and manuscripts are being checked by reviewers and editors for compliance. Gating-ML was then introduced to capture gating descriptions - an essential part of FCM data analysis describing the selection of cell populations of interest. The Classification Results File Format was developed to accommodate results of the gating process, mostly within the context of automated clustering. Additionally, the Archival Cytometry Standard bundles data with all the other components describing experiments. Here, we introduce these recent standards and provide the very first example of how they can be used to report FCM data, including analysis and results, in a standardized, computationally exchangeable form.
Reporting standards and open file formats are essential for scientific collaboration and independent validation. The recently developed FCM data standards are now being incorporated into third party software tools and data repositories, which will ultimately facilitate understanding and data reuse.
Recent biological discoveries have shown that clustering large datasets is essential for better understanding biology in many areas. Spectral clustering in particular has proven to be a powerful tool amenable for many applications. However, it cannot be directly applied to large datasets due to time and memory limitations. To address this issue, we have modified spectral clustering by adding an information preserving sampling procedure and applying a post-processing stage. We call this entire algorithm SamSPECTRAL.
We tested our algorithm on flow cytometry data as an example of large, multidimensional data containing potentially hundreds of thousands of data points (i.e., "events" in flow cytometry, typically corresponding to cells). Compared to two state-of-the-art model-based flow cytometry clustering methods, SamSPECTRAL demonstrates significant advantages in properly identifying populations with non-elliptical shapes, low-density populations close to dense ones, minor subpopulations of a major population, and rare populations.
This work is the first successful attempt to apply spectral methodology to flow cytometry data. An implementation of our algorithm is freely available as an R package through Bioconductor.
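The information-preserving sampling step is what makes spectral methods tractable on data of this size. The sketch below illustrates the idea with a crude grid-based variant: events are binned, one representative is kept per occupied cell, and each representative carries a weight equal to the number of events it stands for. SamSPECTRAL's actual sampling is density-based rather than a fixed grid, so treat this purely as an illustration of size reduction that preserves structure.

```python
def faithful_sample(points, cell=1.0):
    """Reduce a large point set to weighted representatives.

    Bins each point (a tuple of coordinates) into a hypercube of side
    `cell`; the first point seen in a cell becomes its representative,
    and the cell's weight counts how many events it represents. Spectral
    clustering can then run on the (much smaller) weighted set.
    """
    reps = {}
    for p in points:
        key = tuple(int(c // cell) for c in p)
        if key in reps:
            rep, w = reps[key]
            reps[key] = (rep, w + 1)
        else:
            reps[key] = (p, 1)
    return list(reps.values())  # [(representative_point, weight), ...]
```

Because low-density regions still contribute representatives (one per occupied cell), rare populations survive the reduction, unlike uniform subsampling, which tends to discard them.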
Experimental descriptions are typically stored as free text without using standardized terminology, creating challenges in comparison, reproduction and analysis. These difficulties impose limitations on data exchange and information retrieval.
The Ontology for Biomedical Investigations (OBI), developed as a global, cross-community effort, provides a resource that represents biomedical investigations in an explicit and integrative framework. Here we detail three real-world applications of OBI, provide detailed modeling information and explain how to use OBI.
We demonstrate how OBI can be applied to different biomedical investigations to both facilitate interpretation of the experimental process and increase the computational processing and integration within the Semantic Web. The logical definitions of the entities involved allow computers to unambiguously understand and integrate different biological experimental processes and their relevant components.
OBI is available at http://purl.obolibrary.org/obo/obi/2009-11-02/obi.owl
Ontology development is a rapidly growing area of research, especially in the life sciences domain. To promote collaboration and interoperability between different projects, the OBO Foundry principles require that these ontologies be open and non-redundant, avoiding duplication of terms through the reuse of existing resources. As current options for doing so present various difficulties, a new approach, MIREOT, allows the import of individual terms to be specified. Initial implementations allow controlled import of selected annotations and certain classes of related terms.
OntoFox http://ontofox.hegroup.org/ is a web-based system that allows users to input terms, fetch selected properties, annotations, and certain classes of related terms from the source ontologies and save the results using the RDF/XML serialization of the Web Ontology Language (OWL). Compared to an initial implementation of MIREOT, OntoFox allows additional and more easily configurable options for selecting and rewriting annotation properties, and for inclusion of all or a computed subset of terms between low and top level terms. Additional methods for including related classes include a SPARQL-based ontology term retrieval algorithm that extracts terms related to a given set of signature terms and an option to extract the hierarchy rooted at a specified ontology term. OntoFox's output can be directly imported into a developer's ontology. OntoFox currently supports term retrieval from a selection of 15 ontologies accessible via SPARQL endpoints and allows users to extend this by specifying additional endpoints. An OntoFox application in the development of the Vaccine Ontology (VO) is demonstrated.
OntoFox provides a timely publicly available service, providing different options for users to collect terms from external ontologies, making them available for reuse by import into client OWL ontologies.
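Since OntoFox retrieves terms from SPARQL endpoints, the flavour of such a retrieval can be shown with a small query builder. The query below is an illustrative sketch in the spirit of MIREOT-style single-term import, not the query OntoFox itself issues, and the OBI term IRI in the usage example is just a sample input.

```python
def term_annotation_query(term_iri):
    """Build a SPARQL query that fetches all literal-valued properties
    (labels, definitions, other annotations) of one ontology term.
    Illustrative only; OntoFox's internal queries are more elaborate."""
    return f"""
SELECT ?property ?value
WHERE {{
  <{term_iri}> ?property ?value .
  FILTER (isLiteral(?value))
}}
"""

# Example: query text for an OBI term (sample IRI, shown for illustration).
query = term_annotation_query("http://purl.obolibrary.org/obo/OBI_0000070")
```

A client would POST such a query to one of the supported SPARQL endpoints and serialize the bindings into the RDF/XML OWL output that is then imported into the developer's ontology.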
Flow cytometry (FCM) is widely used in health research and treatment for a variety of tasks, such as the diagnosis and monitoring of leukemia and lymphoma patients, providing the helper T-lymphocyte counts needed to monitor the course and treatment of HIV infection, the evaluation of peripheral blood hematopoietic stem cell grafts, and the study of many other diseases. In practice, FCM data analysis is performed manually, a process that requires an inordinate amount of time and is error-prone, nonreproducible, nonstandardized, and not open for re-evaluation, making it the most limiting aspect of this technology. This paper reviews state-of-the-art FCM data analysis approaches using a framework introduced to report each of the components in a data analysis pipeline. Current challenges and possible future directions in developing fully automated FCM data analysis tools are also outlined.
The development of the Functional Genomics Investigation Ontology (FuGO) is a collaborative, international effort that will provide a resource for annotating functional genomics investigations, including the study design, protocols and instrumentation used, the data generated, and the types of analysis performed on the data. FuGO will contain both terms that are universal to all functional genomics investigations and terms that are domain specific. In this way, the ontology will serve as the “semantic glue” providing a common understanding of data from across these disparate data sources. In addition, FuGO will reference existing mature ontologies to avoid duplicating these resources, and will do so in a way that makes them easy to use in annotation. This project is in the early stages of development; the paper describes efforts to initiate the project, the scope and organization of the project, the work accomplished to date, and the challenges encountered, as well as future plans.
The Minimum Information for Biological and Biomedical Investigations (MIBBI) project provides a resource for those exploring the range of extant minimum information checklists and fosters coordinated development of such checklists.
Flow cytometry (FCM) is an analytical tool widely used for cancer and HIV/AIDS research and treatment, stem cell manipulation, and detecting microorganisms in environmental samples. Current data standards do not capture the full scope of FCM experiments, and there is a demand for software tools that can assist in the exploration and analysis of large FCM datasets. We are implementing a standardized approach to capturing, analyzing, and disseminating FCM data that will facilitate both more complex analyses and analysis of datasets that could not previously be efficiently studied. Initial work has focused on developing a community-based guideline for recording and reporting the details of FCM experiments. Open source software tools that implement this standard are being created, with an emphasis on facilitating reproducible and extensible data analyses. In addition, tools for electronic collaboration will assist integrated access to, and comprehension of, experiments to empower users to collaborate on FCM analyses. This coordinated, joint development of bioinformatics standards and software tools for FCM data analysis has the potential to greatly facilitate both basic and clinical research, impacting a notably diverse range of medical and environmental research areas.
The recent development of semiautomated techniques for staining and analyzing flow cytometry samples has presented new challenges. Quality control and quality assessment are critical when developing new high throughput technologies and their associated information services. Our experience suggests that significant bottlenecks remain in the development of high throughput flow cytometry methods for data analysis and display. In particular, data quality control and quality assessment are crucial steps in processing and analyzing high throughput flow cytometry data.
We propose a variety of graphical exploratory data analytic tools for exploring ungated flow cytometry data. We have implemented a number of specialized functions and methods in the Bioconductor package rflowcyt. We demonstrate the use of these approaches by investigating two independent sets of high throughput flow cytometry data.
We found that graphical representations can reveal substantial nonbiological differences between samples. Empirical cumulative distribution function (ECDF) plots and summary scatterplots were especially useful for rapidly identifying problems not caught by manual review.
Graphical exploratory data analytic tools are quick and useful means of assessing data quality. We propose that the described visualizations should be used as quality assessment tools and where possible, be used for quality control.
flow cytometry; high throughput; quality assessment; visualization; exploratory data analysis; statistics; software
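The ECDF is the simplest of the diagnostics mentioned above and is worth spelling out: plotting the ECDF of the same channel for every sample on shared axes makes shifted, truncated, or bimodal acquisitions stand out immediately. A minimal implementation:

```python
def ecdf(values):
    """Empirical cumulative distribution function of a sample.

    Returns (sorted values, cumulative fractions), i.e. for each sorted
    value x_i the fraction of the sample that is <= x_i. Overlaying these
    step curves across samples is a quick visual quality check."""
    xs = sorted(values)
    n = len(xs)
    ys = [(i + 1) / n for i in range(n)]
    return xs, ys
```

Because the ECDF uses every event without binning choices, two runs of the same panel should produce nearly coincident curves; a visibly displaced curve flags a sample worth re-examining before any gating is attempted.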
Flow cytometry (FCM) software packages from R/Bioconductor, such as flowCore and flowViz, serve as an open platform for development of new analysis tools and methods. We created plateCore, a new package that extends the functionality in these core packages to enable automated negative-control-based gating and to simplify the processing and analysis of plate-based data sets from high-throughput FCM screening experiments. plateCore was used to analyze data from a BD FACS CAP screening experiment in which five peripheral blood mononuclear cell (PBMC) samples were assayed for 189 different human cell surface markers. This same data set was also manually analyzed by a cytometry expert using the FlowJo data analysis software package (TreeStar, USA). We show that the expression values for markers characterized using the automated approach in plateCore are in good agreement with those from FlowJo, and that using plateCore allows for more reproducible analyses of FCM screening data.
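The essence of negative-control-based gating is to derive a positivity threshold from a matched unstained or isotype control rather than by eye. The sketch below uses a high quantile of the negative control as the threshold; the specific 99.5th-percentile rule is an assumed illustration, as plateCore's actual gating logic is more involved.

```python
def negative_control_threshold(neg_values, quantile=0.995):
    """Set a positivity threshold as a high quantile of the matched
    negative control's fluorescence distribution (illustrative rule)."""
    xs = sorted(neg_values)
    k = min(len(xs) - 1, int(quantile * len(xs)))
    return xs[k]

def percent_positive(test_values, threshold):
    """Percentage of test-well events above the control-derived threshold."""
    return 100.0 * sum(v > threshold for v in test_values) / len(test_values)
```

Deriving the gate from the control well, identically for every one of the 189 markers, is what makes the automated analysis reproducible where manual region-drawing is not.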
As a high-throughput technology that offers rapid quantification of multidimensional characteristics for millions of cells, flow cytometry (FCM) is widely used in health research, medical diagnosis and treatment, and vaccine development. Nevertheless, there is an increasing concern about the lack of appropriate software tools providing an automated analysis platform to match the high-throughput data-generation platform. Currently, to a large extent, FCM data analysis relies on the manual selection of sequential regions in 2-D graphical projections to extract the cell populations of interest. This is a time-consuming task that ignores the high dimensionality of FCM data.
In view of the aforementioned issues, we have developed an R package called flowClust to automate FCM analysis. flowClust implements a robust model-based clustering approach based on multivariate t mixture models with the Box-Cox transformation. The package provides the functionality to identify cell populations whilst simultaneously handling the commonly encountered issues of outlier identification and data transformation. It offers various tools to summarize and visualize a wealth of features of the clustering results. In addition, to ensure its convenience of use, flowClust has been adapted for the current FCM data format, and integrated with existing Bioconductor packages dedicated to FCM analysis.
flowClust addresses the dearth of software for automating FCM analysis on a sound theoretical foundation. It tends to give reproducible results and helps reduce the significant subjectivity and human time cost encountered in FCM analysis. The package contributes to the cytometry community by offering an efficient, automated analysis platform that supports the technology's active, ongoing advancement.
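Of the two modelling ingredients named above, the Box-Cox transformation is easy to state on its own: it is a power transformation that makes skewed fluorescence data more symmetric before the t-mixture model is fitted. A minimal one-dimensional version (the package applies it within a multivariate model):

```python
import math

def box_cox(x, lam):
    """Box-Cox transformation of a positive value x with parameter lam:
    (x**lam - 1) / lam for lam != 0, and log(x) in the limit lam -> 0.
    In flowClust the parameter is estimated jointly with the mixture."""
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam
```

Combining this transformation with heavy-tailed t components is what lets a single model simultaneously handle skewness and outlying events, the two issues the abstract says are commonly encountered.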
Recent advances in automation technologies have enabled the use of flow cytometry for high throughput screening, generating large complex data sets often in clinical trials or drug discovery settings. However, data management and data analysis methods have not advanced sufficiently far from the initial small-scale studies to support modeling in the presence of multiple covariates.
We developed a set of flexible open source computational tools in the R package flowCore to facilitate the analysis of these complex data. A key component is a set of data structures that support applying the same operations to a collection of samples or a clinical cohort. In addition, our software constitutes a shared and extensible research platform that enables collaboration between bioinformaticians, computer scientists, statisticians, biologists and clinicians. This platform will foster the development of novel analytic methods for flow cytometry.
The software has been applied to the analysis of various data sets, and its data structures have proven highly efficient in capturing and organizing the analytic workflow. Finally, a number of additional Bioconductor packages successfully build on the infrastructure provided by flowCore, opening new avenues for flow data analysis.
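The design point made above, that one data structure should let the same operation run over a whole cohort, can be illustrated with a toy container. The class and method names below are invented for the sketch and do not mirror the flowCore API (where the analogous idea is a flowSet with an apply-style iterator).

```python
class FlowSet:
    """Toy illustration of a cohort container: a named collection of
    samples (here, plain lists of events) supporting one operation
    applied uniformly across all samples at once."""

    def __init__(self, samples):
        self.samples = dict(samples)      # sample name -> list of events

    def apply_all(self, fn):
        """Apply fn to every sample, returning name -> result."""
        return {name: fn(events) for name, events in self.samples.items()}

# Example: count events per sample across the whole cohort in one call.
cohort = FlowSet({"patient_01": [1, 2, 3], "patient_02": [4]})
counts = cohort.apply_all(len)
```

Centralizing the iteration in the container, rather than in every analysis script, is what keeps downstream packages interoperable: they all consume and produce the same cohort-level structure.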
Huntington disease (HD) is a neurodegenerative disorder caused by the abnormal expansion of CAG repeats in the HD gene on chromosome 4p16.3. A recent genome scan for genetic modifiers of age at onset of motor symptoms (AO) in HD suggests that one modifier may reside in the region close to the HD gene itself. We used data from 535 HD participants of the New England Huntington cohort and the HD MAPS cohort to assess whether AO was influenced by any of the three markers in the 4p16 region: MSX1 (Drosophila homeo box homologue 1, formerly known as homeo box 7, HOX7), Δ2642 (within the HD coding sequence), and BJ56 (D4S127). Suggestive evidence for an association was seen between MSX1 alleles and AO, after adjustment for normal CAG repeat, expanded repeat, and their product term (model P value 0.079). Of the variance of AO that was not accounted for by HD and normal CAG repeats, 0.8% could be attributed to the MSX1 genotype. Individuals with MSX1 genotype 3/3 tended to have younger AO. No association was found between Δ2642 (P=0.44) and BJ56 (P=0.73) and AO. This study supports previous studies suggesting that there may be a significant genetic modifier for AO in HD in the 4p16 region. Furthermore, the modifier may be present on both HD and normal chromosomes bearing the 3 allele of the MSX1 marker.
Huntington disease; Modifier; Onset age; Genetics; Trinucleotide repeat; HD gene
The flow cytometry data file standard provides the specifications needed to completely describe flow cytometry data sets within the confines of the file containing the experimental data. In 1984, the first Flow Cytometry Standard format for data files was adopted as FCS 1.0. This standard was modified in 1990 as FCS 2.0 and again in 1997 as FCS 3.0. We report here on the next-generation Flow Cytometry Standard data file format. FCS 3.1 is a minor revision based on suggested improvements from the community. The unchanged goal of the Standard is to provide a uniform file format that allows files created by one type of acquisition hardware and software to be analyzed by any other type.
The FCS 3.1 standard retains the basic FCS file structure and most features of previous versions of the standard. Changes included in FCS 3.1 address potential ambiguities in the previous versions and provide a more robust standard. The major changes include simplified support for international characters and improved support for storing compensation. The major additions are support for preferred display scale, a standardized way of capturing the sample volume, information about originality of the data file, and support for plate and well identification in high throughput, plate based experiments. Please see the normative version of the FCS 3.1 specification in supplementary material to this manuscript (or at http://www.isac-net.org/ in the Current standards section) for a complete list of changes.
Flow cytometry; FCS; data standard; file format; bioinformatics
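The fixed-width FCS header is simple enough to sketch a reader for: a 6-byte version string ("FCS3.1"), four spaces, then six 8-character right-justified ASCII fields giving the byte offsets of the TEXT, DATA and ANALYSIS segments. The sketch below parses only this header segment; consult the normative FCS 3.1 specification for the authoritative layout and the segment contents themselves.

```python
def parse_fcs_header(header_bytes):
    """Parse the fixed-width header segment of an FCS file into a dict
    of the version string and the six segment byte offsets."""
    version = header_bytes[0:6].decode("ascii")
    names = ["text_start", "text_end", "data_start",
             "data_end", "analysis_start", "analysis_end"]
    offsets = {}
    for i, name in enumerate(names):
        start = 10 + 8 * i                       # after 6-byte version + 4 spaces
        offsets[name] = int(header_bytes[start:start + 8])
    return {"version": version, **offsets}
```

A reader would next seek to `text_start` and parse the delimited keyword/value TEXT segment, which in turn describes how to decode the binary DATA segment.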
A fundamental tenet of scientific research is that published results are open to independent validation and refutation. Minimum data standards aid data providers, users and publishers by providing a specification of what is required to unambiguously interpret experimental findings. Here, we present the Minimum Information about a Flow Cytometry Experiment (MIFlowCyt) standard, stating the minimum information required to report flow cytometry (FCM) experiments.
We brought together a cross-disciplinary international collaborative group of bioinformaticians, computational statisticians, software developers, instrument manufacturers, and clinical and basic research scientists to develop the standard. The standard was subsequently vetted by the International Society for the Advancement of Cytometry (ISAC) Data Standards Task Force, Standards Committee, membership and Council.
The MIFlowCyt standard includes recommendations about descriptions of the specimens and reagents included in the FCM experiment, the configuration of the instrument used to perform the assays and the data processing approaches used to interpret the primary output data.
MIFlowCyt has been adopted as a standard by ISAC, representing the flow cytometry scientific community, including scientists as well as software and hardware manufacturers.
MIFlowCyt’s adoption by the scientific and publishing communities will facilitate third-party understanding and reuse of flow cytometry data.
immunology; fluorescence-activated cell sorting; knowledge representation
Flow cytometry technology is widely used in both health care and research. The rapid expansion of flow cytometry applications has outpaced the development of data storage and analysis tools. Collaborative efforts being taken to eliminate this gap include building common vocabularies and ontologies, designing generic data models, and defining data exchange formats. The Minimum Information about a Flow Cytometry Experiment (MIFlowCyt) standard was recently adopted by the International Society for the Advancement of Cytometry. This standard guides researchers on the information that should be included in peer reviewed publications, but it is insufficient for data exchange and integration between computational systems. The Functional Genomics Experiment (FuGE) formalizes common aspects of comprehensive and high throughput experiments across different biological technologies. We have extended the FuGE object model to accommodate flow cytometry data and metadata.
We used the MagicDraw modelling tool to design a UML model (Flow-OM) according to the FuGE extension guidelines and the AndroMDA toolkit to transform the model to a markup language (Flow-ML). We mapped each MIFlowCyt term to either an existing FuGE class or to a new FuGEFlow class. The development environment was validated by comparing the official FuGE XSD to the schema we generated from the FuGE object model using our configuration. After the Flow-OM model was completed, the final version of the Flow-ML was generated and validated against an example MIFlowCyt compliant experiment description.
The extension of FuGE for flow cytometry has resulted in a generic FuGE-compliant data model (FuGEFlow), which accommodates and links together all information required by MIFlowCyt. The FuGEFlow model can be used to build software and databases using FuGE software toolkits to facilitate automated exchange and manipulation of potentially large flow cytometry experimental data sets. Additional project documentation, including reusable design patterns and a guide for setting up a development environment, was contributed back to the FuGE project.
We have shown that an extension of FuGE can be used to transform minimum information requirements in natural language to markup language in XML. Extending FuGE required significant effort, but in our experience the benefits outweighed the costs. The FuGEFlow is expected to play a central role in describing flow cytometry experiments and ultimately facilitating data exchange, including with public flow cytometry repositories currently under development.