High Content Screening (HCS) and High Content Analysis (HCA) have emerged over the past 10 years as powerful technologies for both drug discovery and systems biology. Founded on the automated, quantitative image analysis of fluorescently labeled cells or engineered cell lines, HCS provides unparalleled levels of multi-parameter data on cellular events and is being widely adopted, with great benefit, in many aspects of life science, from gaining a better understanding of disease processes, through better models of toxicity, to generating systems views of cellular processes. This paper looks at the role of informatics and bioinformatics in both enabling and driving HCS to further our understanding of both the genome and the cellome, and looks into the future to see where such deep knowledge could take us.
Completed in 2003, the Human Genome Project (HGP) resulted in the sequencing of the roughly 30,000 genes contained in the entire human genome. While this was a remarkable effort, we are only at the beginning of our understanding of the role of all these genes in living systems. Functional genomics, the role of genes in complex traits and disease, gene regulation and complex systems biology are just some of the questions that were raised by the HGP and are the subject of much research. The cell can be considered the simplest unit of life that is amenable to the study of many of the questions raised by the HGP. However, “sequencing” the cell, i.e. gaining a deep understanding of cellular processes, interactions, signaling, death and so on, represents a technological challenge akin to sequencing genes, a challenge many have termed the ‘cellome’. The emergence of HCS, with its ability to quantitatively measure what, where and when an event occurs in a cell, offers just that richness of data from which many of the key questions about gene function may be answered. However, in the same manner that genomics solved the problem of high throughput data acquisition but then hit a bottleneck with respect to the infrastructure and tools needed to manage and mine that data for knowledge, so HCS is now reaching the same point.
HCS systems typically scan a multi-well plate with cells or cellular components in each well, acquire multiple images of cells, and extract multiple features (or measurements) relevant to the biology, resulting in a large quantity of data and images. The amount of data and images generated from a single microtiter plate can range from hundreds of megabytes (MB) to multiple gigabytes (GB). Large numbers of plates are typically analyzed in screening operations and large-scale systems biology experiments, often resulting in billions of features and millions of images, with a need for multiple terabytes (TB) of storage in a short period of time.
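As a rough illustration of where these volumes come from, the back-of-the-envelope estimate below (a sketch in Python; the plate format, fields per well, channel count, image size and cell counts are all assumptions and vary widely between instruments and assays) shows how the image data quickly dominates storage:

```python
# Illustrative per-plate storage estimate for an HCS run.
# All constants are assumptions for the sake of the example.

WELLS = 384                 # assumed plate format
FIELDS_PER_WELL = 4         # assumed image fields acquired per well
CHANNELS = 2                # assumed fluorescence channels
IMAGE_PIXELS = 1024 * 1024  # assumed camera resolution
BYTES_PER_PIXEL = 2         # 16-bit greyscale

image_bytes = WELLS * FIELDS_PER_WELL * CHANNELS * IMAGE_PIXELS * BYTES_PER_PIXEL

CELLS_PER_WELL = 1000       # assumed cells analyzed per well
FEATURES_PER_CELL = 50      # assumed measurements extracted per cell
BYTES_PER_FEATURE = 8       # double-precision value

derived_bytes = WELLS * CELLS_PER_WELL * FEATURES_PER_CELL * BYTES_PER_FEATURE

print(f"Images : {image_bytes / 1e9:.1f} GB per plate")     # ~6.4 GB
print(f"Derived: {derived_bytes / 1e6:.0f} MB per plate")   # ~154 MB
print(f"1,000 plates of images: {1000 * image_bytes / 1e12:.1f} TB")
```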
While the management of this kind of data is becoming commonplace, tools to generate ‘omic knowledge from billions of cellular measurements are less mature, and we believe this immaturity may hinder HCS from achieving its full potential of solving the cellome.
Our goal in this chapter is to provide a brief overview of informatics for managing HCS data, then to provide a series of examples of the use of HCS to solve key discovery problems and of how informatics and bioinformatics are playing a role in this. Finally, we look to the future to see how computational modeling and simulation could impact our insight into the cellome.
In order to best describe the role of informatics and bioinformatics and the impact on driving the adoption and penetration of HCS into discovery, we have considered tiers of functionality, with each tier contributing to the overall systems view.
This MID model provides a simple way to discuss the relative functionality needed to deal appropriately with HCS data and the value derived from HCS at each tier.
HCS data, derived from some form of automated imaging instrument, can easily consume many terabytes of disk space. HCS data can be classified into three categories: image data, derived data (the measurements extracted from the images) and meta-data (the experimental context).
From a data volume perspective, the data to be saved per sample is dominated by the image data and the derived data; meta-data is negligible in proportion from a storage perspective but is highly valuable for providing context.
Management of HCS data is the foundation of being able to derive value from it. Poor data management impacts both the ease of performing the other steps as well as the scientific robustness of the conclusions drawn from that data. Fig. (1) shows how HCS data must be considered together with other kinds of data in order for true value to be obtained. Ideal HCS data management ensures that all these disparate sources of data can be federated together to provide the knowledge with which to make biological decisions.
From an information technology perspective, the management of large volumes of images and associated derived and meta-data represents a challenge, and certainly when HCS was in its infancy this issue was a bottleneck to its adoption. However, for the most part, with the move towards reliable, scalable n-tier architectures for HCS image and data management, this bottleneck has been mitigated.
However, there remain two major challenges to HCS data management that will need to be solved if the cellome is to be achieved; these challenges relate to the format of HCS data storage and to how HCS data can be integrated in meaningful ways with other data such as chemical structures, gene sequences and pathway information. Key to solving both these issues is the role of standards. Many disciplines such as genomics and proteomics have proposed and adopted standards for recording and describing data, chief amongst these being the Minimum Information standards [3-5] as well as the Open Microscopy Environment (OME) for describing data derived from images. The minimum information and OME standards provide a way to store HCS data and images in an open and self-describing format (XML) to facilitate data interchange, such that images and meta-data acquired on one platform may be analyzed and interpreted by a wide variety of tools. Flow cytometry has used standard file formats and data models for many years, which has resulted in a variety of tools to analyze flow data. In many ways flow data is very comparable to HCS derived and meta-data, and so many of the tools used for flow data may well contribute to the analysis of HCS data. Recently the flow community proposed its own minimum information standard (MIFlowCyt).
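To make the idea of a self-describing image format concrete, the following is a minimal, schematic sketch in Python using only the standard library. The element and attribute names echo the OME-XML style (Image, Pixels, Channel), but this is an illustration of the concept only, not a complete or schema-valid OME document:

```python
# Schematic sketch of self-describing image meta-data in an OME-flavoured
# XML form. A real OME-XML document must conform to the full published
# schema and carries many more required fields and namespaces.
import xml.etree.ElementTree as ET

ome = ET.Element("OME")  # root element (namespace omitted for brevity)
image = ET.SubElement(ome, "Image", ID="Image:0", Name="PlateA_Well_B02_Field_1")
pixels = ET.SubElement(
    image, "Pixels",
    ID="Pixels:0", DimensionOrder="XYCZT", Type="uint16",
    SizeX="1024", SizeY="1024", SizeC="2", SizeZ="1", SizeT="1",
)
ET.SubElement(pixels, "Channel", ID="Channel:0", Name="Hoechst")
ET.SubElement(pixels, "Channel", ID="Channel:1", Name="GFP")

print(ET.tostring(ome, encoding="unicode"))
```

Because the meta-data travels with the pixel data in a documented structure, any tool that understands the schema can interpret images acquired on a different platform.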
The HCS community has yet to adopt a standard, though MIAHA (Minimum Information about a High Content Assay) has been proposed.
Once data are available in an open and self-describing form, analysis and interpretation are possible; however, key to this is the semantic meaning of the data. Whole-cellome analysis of HCS data, perhaps bringing together data from different HCS platforms and imaging tools as well as other data sources (chemical structures, gene sequences), will not be possible unless clear semantic meaning can be assigned. Indeed this is not just a challenge for HCS; genomics and proteomics face many of the same issues. Overcoming the problem of assigning meaning requires the use of ontologies, essentially frameworks of controlled vocabularies that provide annotations to data and meta-data to facilitate analysis. The Open Biomedical Ontologies (OBO) effort, for example, defines multiple ontologies for biology and brings together a number of previously developed ontologies (e.g., GO, the Gene Ontology). Once ontologies are agreed they can be adopted as part of the standards process.
To be clear, we discuss here the interpretation and analysis of the derived and meta-data; the image analysis involved in HCS is not discussed.
Leveraging the foundation of robust data management for HCS, the analysis tier provides the basic insights into HCS data, allows for quality control and statistical analyses and, most importantly, utilizes the multi-parameter data to make decisions. It is also important to recognize that knowledge generation from HCS data is a multi-step workflow, requiring a number of functional stages from the basic gathering of data through QA, visualization, annotation and data mining. Fig. (2) explains these functional units, and we discuss the tiers in the context of that workflow.
Basic interpretation of data covers the initial part of the HCS workflow from quality control, visualization and basic reporting to annotations (Fig. 2). Such relatively simple functional steps have allowed HCS to be well integrated into many lead selection campaigns, either at the primary screening stage or at the secondary screening and lead prioritization stages. Although good tools exist for analysis of HTS data, some of which can be applied to HCS, HCS data presents some interesting challenges and opportunities because there is more than one value per well. Early analyses of HCS data focused on reducing the data to a single measurement at the well level, reflecting the fact that early on, HCS was often the only way to perform a hitherto intractable assay (e.g., neurite outgrowth). The multiple measurements of HCS allow not only activity at the target to be elucidated, but also simple toxicity (measuring cell number, for example) as well as off-target effects to be determined. Traditional HTS approaches, such as thresholds based on descriptive statistics, ignore the value of this multi-parameter data, yet simple statistical and data visualization methods can be used to refine hit selection for HCS. Allowing scientists to merely view all the HCS data in a variety of visualizations together with the image provides useful information on small numbers of plates and allows the user to drill down through the well-level multi-parameter data to the cell subpopulation data (Fig. 3). Use of data visualization tools such as Spotfire® allows more sophisticated visualizations; Fig. (4) shows a simple viewing technique where the target measurement is represented by color and the number of cells in the well by the size of the spot. More sophisticated 3D plots can also quickly visualize multiple parameters (Fig. 5), providing useful toxicity data in addition to the target measurements. In addition to filtering and visualization, the concept of building rule sets to analyze multiple parameters can also be employed. For example, it has been possible to successfully classify hits in toxicology into late stage, early stage and reversible status simply by using Boolean rule sets (i.e. parameter 1 >= 50 AND parameter 2 < 30 OR parameter 3 >= 200), as sketched below. Such rule sets are based on a priori knowledge; however, decision trees that are capable of being learned from data offer greater flexibility and can often elucidate subtle effects.
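The sketch below shows how such a Boolean rule set might be applied to well-level HCS data using pandas. The parameter names, values and thresholds are hypothetical, chosen only to mirror the rule style described above:

```python
# Minimal sketch of Boolean rule-set hit classification on well-level data.
# Parameters and thresholds are illustrative, not from any real assay.
import pandas as pd

wells = pd.DataFrame({
    "well":       ["A01", "A02", "A03", "A04"],
    "parameter1": [65.0, 42.0, 80.0, 10.0],    # e.g. target intensity
    "parameter2": [20.0, 55.0, 10.0, 25.0],    # e.g. toxicity marker
    "parameter3": [150.0, 300.0, 90.0, 210.0], # e.g. neurite length
})

# Rule: (parameter1 >= 50 AND parameter2 < 30) OR parameter3 >= 200
is_hit = ((wells["parameter1"] >= 50) & (wells["parameter2"] < 30)) | (
    wells["parameter3"] >= 200
)

print(wells.assign(hit=is_hit))
```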
The plethora of measurements possible with HCS using sophisticated image analysis often needs to be reduced to a smaller subset so as to determine the key parameters that separate the stimulated (positive biological effect) from the un-stimulated (control) condition. It is commonplace to make a number of measurements of the stimulated/un-stimulated biology during assay development and then determine which are the top parameters that separate those states. In an internal study at Thermo Fisher Scientific, we employed t-tests, Z' measurements, Self-Organizing Maps (SOM) and K-nearest neighbor (K-NN) analyses to determine the optimal set of morphological parameters. Fig. (6) shows the results of using K-NN to separate un-stimulated populations from stimulated populations. The K-NN analysis identifies 3 key parameters (from a set of 52) that allow maximal separation. Such data reduction techniques can then be used to reduce the number of measurements made in a screening campaign without losing any discriminatory power, while maintaining manageable data set sizes in screens that may generate billions of data points.
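As an illustrative sketch only (simulated data and scikit-learn, not the actual internal pipeline of the study above), parameters can be ranked by a univariate t-test between stimulated and un-stimulated wells and a K-NN classifier used to confirm that the top-ranked subset retains the discriminatory power of the full set:

```python
# Illustrative parameter reduction: rank 52 simulated measurements by
# t-statistic, keep the top 3, and compare K-NN accuracy on full vs reduced sets.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_params = 52
X_unstim = rng.normal(0.0, 1.0, size=(200, n_params))
X_stim = rng.normal(0.0, 1.0, size=(200, n_params))
X_stim[:, :3] += 2.0                      # only 3 parameters truly differ

X = np.vstack([X_unstim, X_stim])
y = np.array([0] * 200 + [1] * 200)

t_stats, _ = ttest_ind(X_stim, X_unstim, axis=0)
top = np.argsort(-np.abs(t_stats))[:3]    # indices of the 3 most discriminating params

knn = KNeighborsClassifier(n_neighbors=5)
score_full = cross_val_score(knn, X, y, cv=5).mean()
score_top = cross_val_score(knn, X[:, top], y, cv=5).mean()
print("top parameters:", sorted(top.tolist()))
print(f"accuracy, all {n_params} params: {score_full:.2f}; top 3 params: {score_top:.2f}")
```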
Whole-well analysis of multiple parameters, while more sophisticated than a single number, ignores the value of the subpopulation effects inherent in cell-based imaging assays. While descriptive statistics such as mean, median, standard deviation and standard error provide some insight into the variation of the underlying cell data, more powerful statistics such as the K-S (Kolmogorov-Smirnov) test have been widely adopted to compare the significance of distributions of cell populations for up to two parameters across experimental conditions, e.g., test vs. control. While these techniques still reduce the data to a single number, they provide increased confidence that the single number reflects the cell-based data variation, and the K-S statistic has been used successfully in a variety of studies [16-18].
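A minimal sketch of this idea, using SciPy's two-sample K-S test on simulated per-cell values (real values would come from the cell-level HCS output):

```python
# Compare per-cell distributions of a measurement between a treated and a
# control well using the two-sample Kolmogorov-Smirnov statistic.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
control_cells = rng.normal(loc=100.0, scale=15.0, size=800)  # e.g. nuclear intensity
treated_cells = rng.normal(loc=110.0, scale=25.0, size=750)  # shifted, broader distribution

ks_stat, p_value = ks_2samp(treated_cells, control_cells)
print(f"K-S statistic = {ks_stat:.3f}, p = {p_value:.2e}")
# The single K-S value summarises the whole distribution shift, rather than
# collapsing each well to a mean before comparison.
```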
In probably the first example of leveraging the power of more than one parameter in HCS studies, the authors used relatively simple population density distributions of over 30 shape, texture and location measurements of cells against a range of concentrations of several known anticancer compounds. Plotting the natural log of these parameters for various concentrations of the drugs allowed a ‘high content profile’ to be generated that permitted easy comparison of drug effects on various cellular processes. Further visualizations such as quadrant plots, dot plots and scatter plots of cell-based data revealed new insights into the interactions of drugs at the cell level in unprecedented detail. Similar visualizations of a number of cell measurements demonstrated that a panel of cell-based assays could detect and classify threat agents based on cellular responses in those assays.
Early analyses of HCS data, described above, began to reveal the power of measuring multiple parameters and demonstrated that relatively simple statistics and visualizations (available in common informatics and statistical packages) could elegantly elucidate cellular responses.
It is now recognized that much of the power of HCS lies in generating cellular profiles or phenotypes from multivariate cell-based data. Sophisticated informatics and bioinformatics techniques can be employed to analyze these phenotypes, resulting in insights into cell biology. Such tools are represented further downstream in the HCS workflow (Fig. 2) and build on the conclusions and insights made earlier in the workflow. Data quality control is of particular importance since data-driven methods such as those detailed in this section require robust data sets to avoid poor performance and potentially misleading conclusions.
Classifiers of one type or another (e.g., supervised, unsupervised, statistical and machine learning) are very powerful techniques for analyzing multi-parameter data and have been successfully used for HCS. In a study of the morphological effects of 107 compounds known to inhibit protein kinases, on a panel of 5 cell lines, Principal Component Analysis (PCA) of the morphological phenotypes following treatment with known kinase inhibitors identified a novel compound that inhibited CRB1, an enzyme involved in cell signaling. What was interesting was that the phenotype detected was different from the cell phenotype of the known compound, yet the compounds differed chemically by only one hydroxyl group, indicating that HCS is able to clearly differentiate a minor structural difference on the basis of analysis of complex phenotypes. The availability of such complex phenotypes and their analysis is key to realizing the potential of HCS data and utilizing the subtle effects of multiple cellular measurements. In another study, factor analysis of cell phenotypes based on cell cycle measures was used to profile a compound library and infer, based on the phenotypic profiles, the mechanism of action of compounds. This work also demonstrates that phenotypic profiles are rich enough to provide biological meaning.
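The pattern of such an analysis can be sketched as follows (simulated profiles and scikit-learn, not the published study's data or pipeline): each row is a treatment, each column a morphological feature summarised over cells, and PCA projects treatments into a low-dimensional space where shared phenotypes cluster together:

```python
# Illustrative PCA of per-treatment phenotypic profiles (simulated data).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
profiles = rng.normal(size=(107, 30))   # 107 treatments x 30 morphological features
profiles[:20, :5] += 1.5                # one group sharing a phenotype

scaled = StandardScaler().fit_transform(profiles)
pca = PCA(n_components=2)
scores = pca.fit_transform(scaled)

print("explained variance ratio:", np.round(pca.explained_variance_ratio_, 3))
# Treatments close together in the score plot share a morphological phenotype;
# outliers flag unusual (potentially novel) mechanisms.
print("first two component scores for treatment 0:", np.round(scores[0], 2))
```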
In addition to PCA, other techniques such as hierarchical clustering have been used to classify cellular phenotypes in response to both drug and RNAi treatments, furthering the impact of HCS in combinatorial biology experiments. Classifiers have also been shown to play a valuable role in predicting actives and non-actives in a screen. Several classifiers were trained on the cell profiles of known reference compounds, and then the classifiers were used to predict actives and inactives in a screen for neurite outgrowth. A combination of K-nearest neighbors (K-NN), Fisher Linear Discriminant Analysis (LDA) and support vector machines was used to create a system able to predict an “active phenotype” in screens five times better than traditional hit selection methods.
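A minimal sketch of hierarchical clustering of phenotypic profiles, again on simulated data (rows are perturbations such as compounds or RNAi reagents, columns are per-treatment feature summaries), using SciPy:

```python
# Hierarchical clustering of simulated phenotypic profiles.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
profiles = rng.normal(size=(40, 20))
profiles[:10] += 2.0           # a block of perturbations with a shared phenotype

# Ward linkage on Euclidean distances between profiles
Z = linkage(profiles, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")
print("cluster assignments:", labels)
```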
Highly complex data mining tools such as Support Vector Machines (SVM) can be employed to analyze HCS data and may hold promise as they are tolerant to noise in data sets, a consideration of some importance for cell-based measurements. SVMs have been successfully used to recognize phases of the cell cycle by classifying a set of fifty-nine morphological measurements of cells. The SVM classification was compared with human annotations and demonstrated a high degree of accuracy and specificity in predicting mitotic sub-phases. SVMs have also been successful in cellular multi-phenotypic mitotic analysis as well as in determining the best segmentation of images from morphological measurements.
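The shape of such an SVM classification is sketched below with simulated features and labels (the cited work used 59 morphological measurements and expert annotations as training data; this is not that study's pipeline):

```python
# Sketch of an SVM assigning cells to mitotic sub-phases from morphological features.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(4)
n_cells, n_features = 600, 59
X = rng.normal(size=(n_cells, n_features))
y = rng.integers(0, 4, size=n_cells)       # 4 hypothetical mitotic sub-phases
X += y[:, None] * 0.5                      # make the phases partly separable

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
print("test accuracy:", round(accuracy_score(y_test, clf.predict(X_test)), 2))
```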
There is no doubt that classification of cellular phenotypes can begin to unlock the types of cellular knowledge that are useful for both drug screening as well as systems biology .
From a pure systems engineering standpoint, biology from the ecosystem level to the genome level is a highly interconnected network. Attempting to probe such a highly interconnected system using a reductionist approach ignores this richness of connections and limits our ability to generate valuable knowledge. HCS provides sophisticated multi-parametric probes that when coupled with powerful bioinformatics tools can yield an understanding of these connections.
Pathways represent some of the more complex connection networks and ones that are heavily involved in cell regulation. A key regulatory network is the cell cycle, and using a combination of HCS measurements of morphological changes in cells and RNAi knockdown of genes involved in cell cycle regulation, complex phenotypic data sets have been generated. Analysis of these data sets using a combination of clustering and functional annotations revealed a number of new pathways and processes involved in cell cycle and cell-size regulation, and identified a new translational inhibitor of the Cyclin/Cdk pathway. Generating these kinds of insights utilized data not just from HCS but from FACS as well as gene annotation, functional assignments and so on. Such system-wide analyses using sophisticated tools are beginning to be used, with great benefit, in areas such as transcriptional changes in breast cancer cells, modeling Parkinson’s disease, and the search for therapies for hepatocarcinomas.
In recent years, pathway analysis and modeling tools have been adopted for a wide variety of approaches, from elucidating pathways to genome-wide association studies. For the most part these tools have used genomic data (expression profiles) as a data source, but there is an increasing demand to use HCS data as a model input.
In a genome-wide RNAi screen of human kinases involved in neurodegeneration, HCS was used to identify candidate kinases involved in neurite outgrowth and retraction. The candidate kinases were then grouped and linked using pathway analysis software (PathwayArchitect, Stratagene) to create a regulatory network of the kinases involved in signaling. By combining HCS data, RNAi knockdown and pathway analysis, the authors were able to obtain the first overall picture of the signaling that occurs during neurite degeneration as well as identify novel cross-talk between unrelated signaling pathways. This would not have been possible without such sophisticated bioinformatics tools, and it is now clear that the richness of data available from HCS is starting to spark interest in the modeling community.
The Virtual Cell, which is hosted at the National Resource for Cell Analysis and Modeling, is a novel tool that combines computational biology with imaging. The tool allows scientists to model and simulate specific cellular functions, from simple molecular motors to complex signaling, in a simple Java environment. At this time it is used together with images in order to model compartments, but this author considers that using HCS data instead of the images themselves could lead to a revolution in the complexity and breadth of cell modeling. Models always need data, both to improve the model and to validate it. The Virtual Cell brings together data from physiological models, cellular structures, reactions, fluxes and so on with spatial data (from images), as well as external data such as pathway analysis and the literature, along with sources such as KEGG (a database of the building blocks of biological systems such as pathways and genes). These biological facts are converted into mathematical models, and the simulation engine runs to provide information on time response, steady-state behavior and sensitivities. These data can then be used to drive experiments, refine protocols, and do further modeling. To date, there are approximately 30 published papers using the Virtual Cell, covering a wide variety of topics from signaling to cell structure dynamics to calcium transport.
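The "biology to mathematical model to simulation" step can be illustrated generically as follows. This is a sketch of a two-species reaction written as ODEs and solved with SciPy, not the Virtual Cell's own API or model format; the rate constants are hypothetical:

```python
# Generic sketch: convert a simple reversible reaction A <-> B into ODEs
# and simulate its time course.
import numpy as np
from scipy.integrate import solve_ivp

k_fwd, k_rev = 0.5, 0.1   # hypothetical rate constants (1/s)

def reactions(t, y):
    a, b = y
    return [-k_fwd * a + k_rev * b,    # dA/dt
             k_fwd * a - k_rev * b]    # dB/dt

sol = solve_ivp(reactions, t_span=(0, 20), y0=[10.0, 0.0],
                t_eval=np.linspace(0, 20, 5))
for t, (a, b) in zip(sol.t, sol.y.T):
    print(f"t={t:5.1f}s  A={a:6.2f}  B={b:6.2f}")
# The steady state (A/B = k_rev/k_fwd) and the time to reach it are the kinds
# of model outputs that feed back into experiment design.
```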
HCS has come a long way in the past 10 years, and informatics and bioinformatics have played a key role, from its early use to perform assays that were intractable without imaging, through phenotype analyses, to today’s genome-wide, multidisciplinary studies.
The next steps lie in leveraging this data to build models and perform simulations, as these methods allow the researcher to test many more conditions than are possible in the wet laboratory. It is highly conceivable that a future laboratory could take HCS data from many cell types as source data for pathway models as well as virtual cell models, opening up the tantalizing possibility of modeling gene function, compound mechanism of action and cellular responses in cells and tissues in silico, truly achieving the cellome.