High-throughput screening studies in biology and other fields are increasingly popular due to ease of sample tracking and decreasing technology costs. These experimental setups enable researchers to obtain numerous measurements across multiple individuals in parallel (e.g. gene expression and diverse plate-based assays) or in series (e.g. metabolomics and proteomics platforms). The large number of measurements collected often comes at the cost of measurement precision or the overall power of detection. For many large-scale studies, the experimental design aims to maximize the number of compounds or individuals tested, resulting in limited replication and few to no controls. In the case of microarray studies, several methods for normalizing arrays have been developed [1
] with no universal method adopted as the standard. Quantitative PCR faces the same issues as it is used more frequently in high throughput platforms, with analysis methodologies being developed paralleling those for expression arrays [4
Metabolite profiling is a rapidly expanding area of high throughput measurements, where samples having large amounts of biological variability and diverse physical properties makes quantification of large numbers of structurally diverse metabolites challenging [5
]. Few strategies exist for normalization in metabolite analysis to control for run-to-run variance other than to include negative and positive controls. For large-scale screens involving mutagenized populations (plant, bacteria) or crosses (plant breeding), the goal is to identify putative hits, or individuals that are likely to be different from the bulk of the samples for subsequent follow-up (e.g. [6
]). In these conditions, properties of the sample cohort serve as controls with the measure of differences between an individual and its cohort used to identify samples differentially accumulating a metabolite [6
]. This strategy can streamline sample processing and maximize throughput when the expected effects are large and easily observable.
For studies where comparisons are sought across an experiment conducted over the course of several months or in different sample batches, normalizing factors are necessary, especially given typically high levels of biological and technical variability [7
]. Ideal experiments include technical and biological replication within each set as well as controls facilitating comparisons between sample batches, but these are often limited or omitted entirely due to likely increases in experimental costs or the negative impacts on throughput. However, absence of these experimental controls limits the ability to handle variability between sample groups (e.g. remove batch effects) making it a greater challenge to identify individuals within the range between normal and aberrant phenotypes. Without the ability to normalize the data provided by experimental controls, some of the benefits of high throughput screens are lost, yet the desire to maximize throughput places constraints on the experimental design.
The motivation for algorithm development came from the Arabidopsis thaliana
Chloroplast 2010 Project large-scale reverse genetic phenotypic screen (Chloroplast 2010, http://www.plastid.msu.edu/
]). This project leverages the collection of T-DNA insertion lines and genomic sequence for the plant model species A. thaliana
to screen large numbers of putative gene knockouts with the aim of functionally characterizing chloroplast-targeted genes. The presence of a large T-DNA insertion can block or reduce expression of the gene it lands in, and altered phenotypes can provide insights into the normal function of the gene and its protein or RNA product(s).
In addition to qualitative and semi-quantitative measures of physiological and morphological characteristics, the pipeline assayed levels of leaf fatty acids and leaf and seed free amino acids, important outputs of chloroplast metabolism. The pipeline assays were preformed on groups of individual plants planted in units of up to thirty-two per tray and three trays of plants per assay group. Two assay groups were grown concurrently under controlled environment plant growth conditions. Individuals representing T-DNA insertion events in different locations within the same gene (alleles) are present in the dataset, and it is of interest to compare the assay responses of these individuals as well as to identify other individuals with similar responses. Because the experimental design lacked cross-group controls (e.g. designated WT), the ability to make even semi-quantitative cross-dataset comparisons was not possible using existing methodology.
Developing phenotypic annotation for un- and under- annotated genes is a primary goal for the Chloroplast 2010 project and identification of individuals with like phenotypes (phenotypic clustering) is a way to achieve that goal. Thus, a method that would allow cross-dataset comparisons and identify putative mutants was needed to achieve the goal. The resulting method, MIPHENO (Mutant Identification by Probabilistic High throughput-Enabled Normalization), is aimed at improving first-pass screening capabilities for large datasets in the absence of defined controls. Algorithm performance was tested using a synthetic data set and the Chloroplast 2010 high throughput phenotypic dataset. The executable code and data for the Chloroplast 2010 analysis are available as Additional file 1
and as a CRAN package (MIPHENO, http://cran.r-project.org/web/packages/MIPHENO/index.html
The following describes a quality control process for identifying aberrant groups followed by a data normalization method, which aims to bring samples into the same distribution allowing for dataset-wide comparisons. Additionally, we describe a hit detection function based on the cumulative distribution function (CDF) to identify samples with putative, 'non-normal' phenotypes. For clarity, the terms normal and wild type (WT) are used to describe the typical response of the population. Generally, this could be the untreated (chemically or genetically) population or the base level of the system (e.g. background response). Non-wild type responses, a hit or mutant, refer to a response that is distinct from the normal response distribution, with a putative hit/putative mutant referring to a sample that is predicted to have a response different from the normal response distribution but has yet been confirmed. In high throughput screens, the objective is to identify putative hits balancing the false positive rate (FPR), or the number of WT samples that are called hits, with the false non-discovery rate (FNDR), the number of true hits that are missed. Results are presented from analysis of the synthetic dataset and biological data.