Home | About | Journals | Submit | Contact Us | Français |

**|**Bioinformatics**|**PMC2853690

Formats

Article sections

Authors

Related links

Bioinformatics. 2010 April 15; 26(8): 1043–1049.

Published online 2010 March 4. doi: 10.1093/bioinformatics/btq097

PMCID: PMC2853690

* To whom correspondence should be addressed.

Associate Editor: John Quackenbush

Received 2010 January 13; Revised 2010 February 22; Accepted 2010 February 25.

Copyright © The Author(s) 2010. Published by Oxford University Press.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

This article has been cited by other articles in PMC.

**Motivation:** Global expression patterns within cells are used for purposes ranging from the identification of disease biomarkers to basic understanding of cellular processes. Unfortunately, tissue samples used in cancer studies are usually composed of multiple cell types and the non-cancerous portions can significantly affect expression profiles. This severely limits the conclusions that can be made about the specificity of gene expression in the cell-type of interest. However, statistical analysis can be used to identify differentially expressed genes that are related to the biological question being studied.

**Results:** We propose a statistical approach to expression deconvolution from mixed tissue samples in which the proportion of each component cell type is unknown. Our method estimates the proportion of each component in a mixed tissue sample; this estimate can be used to provide estimates of gene expression from each component. We demonstrate our technique on xenograft samples from breast cancer research and publicly available experimental datasets found in the National Center for Biotechnology Information Gene Expression Omnibus repository.

**Availability:** R code (http://www.r-project.org/) for estimating sample proportions is freely available to non-commercial users and available at http://www.med.miami.edu/medicine/x2691.xml

**Contact:** ude.imaim.dem@ekralcj

In the past decade, gene expression profiling has demonstrated an amazing potential for identifying disease biomarkers and improving our understanding of cellular processes (Pittman *et al.*, 2004; van't Veer *et al.*, 2002; Wheelan *et al.*, 2008). An issue not often discussed is that many biological samples contain mixtures of cell or tissue types (Wang *et al.*, 2006); for example, cancer cells may only constitute part of a biopsy sample. The amount of each mRNA detected in a microarray experiment is influenced by the composition of the sample; observed changes in gene expression may simply reflect a change in the distribution of the cell types in the sample population (Causton *et al.*, 2003). In breast cancer Cleator *et al.* (2006) noticed that the proportion of benign tissue of biopsy samples can significantly affect expression profiles, and taking into consideration this proportion can improve response prediction. Sample heterogeneity severely limits the conclusions that can be made about specificity of gene expression and may explain in part why the results of numerous gene expression experiments have failed rigorous validation (Michiels *et al.*, 2005).

Given a heterogeneous sample there exist laboratory approaches to separate cells of distinct types. Laser capture microdissection (LCM; Fend and Raffeld, 2000) is a popular technique for isolating regions of a biological sample that are separated by distances of a few cell widths. However, the cell types of interest need to be morphologically distinct. LCM, is very time-consuming and specialized equipment, is required to obtain a sufficient quantity of biological material for profiling. If the sample of interest is in suspension, cell-sorting methods can be used to isolate cells of interest. This requires a suitable biomarker for the cell type of interest. The main drawback of cell sorting with respect to profiling is that the act of separation itself can alter gene expression (Gosink *et al.*, 2007).

We present a method for deconvoluting expression from a heterogeneous sample into components that reflect the contributions to the observed expression attributable to each component cell or tissue type. The key component of this method is the estimation, from a mixed tissue sample, of the proportion of mRNA from a single tissue type. Estimation is based on specific logarithmic data transformations and theory from differential geometry regarding the radius of curvature (Lipschutz, 1969). We demonstrate our method on several datasets from breast cancer xenograft studies, from both proprietary sources and the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) repository (http://www.ncbi.nlm.nih.gov/geo/; Barrett *et al.*, 2006).

Several approaches have been taken to the problem of expression deconvolution and each approach depends on access to different types of information, different statistical assumptions and different objectives.

If there are genes known to be expressed exclusively in one tissue type, then these genes can be used to estimate the proportion of expression coming from that tissue. For example, the program DECONVOLUTE (Lu *et al.*, 2003) uses simulated annealing and genes expressed only during specific cell cycles to identify the proportion of cells in each cycle from an asynchronous cellular sample. These methods depend on known tissue- or cell-specific genes, and technology that can detect their expression with little or no cross-hybridization. If these conditions is not met, widely varying estimates of *p*_{A} can be obtained by selecting different subsets of tissue-specific genes. Note that low specificity of microarray hybridizations has been suggested to be one of the prime measures affecting discrepancies in gene-expression profiles between different probes targeting the same region of a given transcript or between different microarray platforms (Koltai and Weingarten-Baror, 2008). We do not assume knowledge of cell- or tissue-specific genes in our method, although such knowledge may be available, particularly for samples from xenograft studies (where the tissues of interest are from different species).

Similarly, several researchers have used expression data from purified reference tissue types to determine the expression of each tissue type in heterologous samples (Lahdesmaki *et al.*, 2005; Venet *et al.*, 2001). For example, Wang *et al.* (2006) use a method similar to that of Lu *et al.* (2003), mentioned above, to determine the proportions of each cell type in a mixed sample. This method generates estimates by obtaining solutions to linear equations via simulated annealing. These approaches depend on having expression data from a purified reference sample for each cell or tissue type, which may not be available.

Another approach uses proportions of each sample or cell type, assessed by pathologists, to establish either tissue-specific expression or differential expression between mixed and control samples. In Stuart *et al.* (2004) linear regression models, regressing expression on fractional content of tumor (or stroma), were used to estimate the expected cell-type expression as the regression coefficient. A more sophisticated statistical approach was used by Ghosh (2004) to determine differential expression in the presence of mixed cell populations. In his approach, a pathologist's assessments of the proportions of each cell type were used in a hierarchical mixture model to model the data. A combination of methods of moments procedures and the expectation–minimization (EM) algorithm provided estimates of the model parameters. Although not shown in the publication, this method could be adapted to provide expression estimates specific to each cell type, as opposed to estimates of differential expression. Unfortunately, the assessment of a pathologist only provides the proportion of each cell or tissue type in the sample, and not an assessment of the amount of mRNA or protein attributable to each. It is well known that the total amount of mRNA generated by tumor cells, for example, is much higher than the amount generated by normal cells. As a result estimates of expression based on pathological assessments of tissue proportions may not be accurate.

Finally, an approach exists to use expression data from a single cell type to determine the proportion of each cell type in a heterogeneous sample (Gosink *et al.*, 2007). This method depends on the estimation of the minimum of a proportion, a minimum that provides a good estimate in noiseless or simulated data. However, this minimum is much more difficult to estimate in noisy data, and microarray data is inherently noisy. Our research builds upon this work by providing a method for estimating this minimum that has reasonable accuracy and can be applied in situations where one or multiple heterologous samples are available.

First, we will discuss the idea of estimating the proportion of a single cell or tissue type in a two-type mixed sample. We will then describe the role of data transformation in this estimation and the interest in finding the point of minimum radius of curvature. Finally, we will describe the use of the bootstrap (Efron, 1979) for obtaining a standard error for our estimate.

The idea of estimating the proportion of one type in a two-type mixed sample comes from Gosink *et al.* (2007). As they describe, let *A* be a purified sample of one type and *AB* be a mixed sample, composed of tissue or cell types *A* and *B*. Let *E*_{i}(*AB*)(*E*_{i}(*A*)) be expression of gene *i* in Sample *AB*(*A*) for *i*=1,…, *m*. Let **E**(**AB**)={*E*_{1}(*AB*),…, *E*_{m}(*AB*)}. We want to estimate *p*_{A}, the proportion of expression in the mixed sample (Sample *AB*) due to tissue type *A*. For a given gene *i* we can express *E*_{i}(*AB*) as

Let *R*_{i}=*R*_{imix/pure} = *E*_{i}(*AB*)/*E*_{i}(*A*). In the noiseless case,

Note that for a fixed *p*_{A} this ratio is at its minimum when *E*_{i}(*B*) = 0, since expression is assumed to be non-negative. Hence, if *E*_{i}(*A*)>0,

Thus, under the assumption that *E*_{i}(*B*)→0 for some sequence of *i*'s, min_{i}*R*_{i}=*p*_{A}. This can be seen in Figure 1 where rank-sorted ratios *R*_{i} have been plotted from ‘electronic’ simulated data at a range of proportion values *p*_{A}. The ‘electronic data’ was generated by computationally combining expression values from purified samples of each composite type in these specific proportions, e.g. for *p*_{A} = 0.25 the electronic data is 0.25***E**(**A**) + 0.75***E**(**B**) where **E**(**A**) are expression values from a purified sample of breast cancer cell mRNA and **E**(**B**) the expression values from a purified sample of normal mouse lung mRNA. Note that the values of *R*_{i} are sorted from lowest to highest.

Unfortunately, the minimum ratio is an underestimate of the true proportion value *p*_{A} for simulated noisy data and for observed data [as Gosink *et al.* (2007) establish]. For example, Figure 2 shows observed data from a titration series (*p*_{A}=0.25, 0.5 and 0.75) of breast cancer cell mRNA (MDA231) and normal mouse lung mRNA. By a titration series, we mean a set of mixed samples (breast cancer and normal lung) in which each sample has a fixed proportion of each tissue/cell type. What we observe is expression data from each mixed sample in this series, so a total of three samples with proportions of breast cancer mRNA to normal mouse lung mRNA of {(0.25, 0.75),(0.5,0.5) and (0.75,0.25)}. Hence for *p*_{A} = 0.25 the observed data is expression from a mixed sample (*AB*) composed as 0.25**A*+0.75**B*. The ‘electronic data’ is the same data as shown in Figure 1. The values of min_{i}*R*_{i} are very accurate estimates of *p*_{A} for the ‘electronic’ data but are poor estimates of *p*_{A} for the observed data. Clearly, the ability of min_{i}*R*_{i} to estimate *p*_{A} is greatly affected by the noise in the data; understanding and incorporating the noise and its effect on min_{i}*R*_{i} in the estimation process is the key to finding an accurate estimate of *p*_{A}.

The noise in the observed expression data from mixed samples causes the minimum ratio to be an underestimate of the true proportion value. A transformation that increases small ratio values while shrinking larger ratio values may improve the accuracy of this estimate. To explore this proposition, we considered transforming both **E**(**AB**) and **E**(**A**) with a transformation of the form

for some α > 0 and for all *i*. The untransformed values of *R*_{i} have a skewed distribution with a long tail of large values (data not shown). As such the mean of the *R*_{i}*s* is larger than the median. The above transformation, by decreasing large values and increasing small values, brings the mean and the median closer together.

We discovered that across several datasets a value for α does exist for which min_{i}*tR*_{i}=min_{i}*tE*_{i}(*AB*)/*tE*_{i}(*A*) is an accurate estimate of *p*_{A}. Unfortunately, this value for α varies with each dataset and with the value of *p*_{A}, i.e. within each dataset and across datasets the value of α that provides an accurate estimate of *p*_{A} is different for each value of *p*_{A}. For any given dataset and value of *p*_{A} we could successfully model α as a function of *p*_{A}, using a function of the form −log(θ**p*_{A}^{γ}+1)/(*p*_{A}−1) for some θ, γ > 0. However, this function depends on *p*_{A}, the value we are trying to estimate.

We acknowledge that the minimum value of *tR*_{i} is sensitive to the noise in the data, particularly in relation to the mean or the median. Hence we decided to explore the possibility of using information from a summary statistic of *tR*_{i} (e.g. mean or median) as a function of α to determine the correct value of α, and hence the value of our estimate min*tR*_{i}. The mean of as a function of α is defined as

where *m* is the number of expression values (i.e. number of genes). The median is analogously defined. We decided to plot the mean and median of *tR*_{i} for a fixed *p*_{A} across a range of values of α. Example plots for two different titration series [MDA231/mouse lung at *p*_{A} = 0.5 and data from MicroArray Quality Control (MAQC) project 2006 (GEO accession 5350) at *p*_{A}=0.75] are shown in Figure 3. The value of α that provides the most accurate estimate of *p*_{A} is marked by a vertical black line.

Values of and md(*tR*(α)) as functions of α for (**a**) MDA231/mouse titration data at *p*_{A}=0.5 and (**b**) MAQC human titration data at *p*_{A}=0.75. The vertical line indicates the correct value of α.

The value of α that provides the most accurate estimate of *p*_{A}, in these plots and many others, is located at what one may refer to as the ‘knee’ or ‘elbow’ of the curve. This point may be familiar from principal components analysis as the point on a scree plot that indicates the number of significant principal components (Jolliffe, 2002). To calculate this point, we need a mathematical definition for the ‘elbow’ of a curve.

We want to find the value of α at the ‘elbow’ of the curve defined by as a function of α. The ‘elbow’ of a curve is the point at which the tradeoff between pulling low values up and pulling high values down (values of *R*_{i}(α)) is optimal. Here, we formalize this by choosing that point at which the radius of curvature is at its minimum. The radius of curvature ρ(*s*) is defined as the inverse of the vector norm of the second derivative of the curve, expressed as a function of arc length *s*, i.e.

(1)

where *C* is the curve of interest originally parameterized in terms of *x* (Lipschutz, 1969). Thus, to find the value of α of interest several steps are required. First, we need to represent the function as a curve in the plane. Second, we must reparameterize this curve in terms of the arc length *s*. Third, we use the reparameterized curve to determine the value of arc length *s*^{*} that minimizes the radius of curvature ρ(*s*). Finally, we determine the value of α that corresponds to *s*^{*}.

Recall that a parameterized curve in the plane is of the form

(2)

where *x*_{1}, *x*_{2} are the coordinate functions, *e*_{1} = (1, 0) and *e*_{2} = (0, 1) the natural basis and α the parameter of the curve. To define the radius of curvature of *x*(α) at a point *x*, we first reparameterize in terms of arc length *s*. The arc length parameterization is defined to be the parameterization with unit speed along the curve. This eliminates the possibility of an unnaturally high or low radius of curvature simply due to the local speed of transversal of the curve.

The arc length *s* of a curve is defined as

(3)

where ‖·‖ is the Euclidean norm. Now consider a function *f*(α) and observe that its graph (α, *f*(α)) is a geometric curve in the plane. Thus, as in Equation (2), we can write

Hence

So the arc length parameter [Equation (3)] is given in terms of α by

Since the parameterization is in terms of unit speed, it is invertible, so we can write α=α(*s*) as well. Thus, *s*′=*s*(α′) and *s*′′=*s*(α′′). The radius of curvature of a geometric curve *C* as stated in Equation (1) can now be defined as

for the curve . We will argue that choosing α to minimize ρ(*s*) leads to a good estimate of *p*_{A} over [α′, α′′].

Given the definitions in the last subsection, it remains to obtain the arc length parameterization for the curve and find the value of α that corresponds to *s*^{*}. Replacing *f*(α) with we have the following:

Partition the interval [α′, α′′] uniformly by setting

Then,

where we approximate as

This will yield a one-to-one relationship between α and *s*, hence a one-to-one relationship between *s* and . Once we have this we can find the value of *s* that minimizes the radius of curvature ρ(α(*s*)), i.e. maximizes ‖*f*′′(*x*(α(*s*)))‖ over *s*[*s*′, *s*′′].

To find the maximum of

we find the value of *s*, *s*^{*}, which maximizes the absolute value of the second derivative with respect to *s* using centered difference approximations (Ames, 1977). Approximate by

Using this approximation, we calculate over a range of values [*s*′, *s*′′] and determine the value *s*^{*} that minimizes .

Note that this method for finding *s*^{*} (and subsequently α) only works if the two axes of the plot for , are similarly scaled. If the two scales are not equal, they must be equalized prior to calculating *s*^{*} by rescaling one axis to be the same length as the other. For example, to rescale the axis for we would use values of the following in place of

That is, the range of the function is the same as the range of the parameter α. This ‘scaling’, like the arc length parameterization, seems necessary to prevent arbitrary choices from dominating the solution.

One key task is choosing *k* large enough so that the approximation of the second derivative with respect to second differences is accurate over the range [α_{0}, α_{j}]. We found that *k* of several thousand worked well in the examples in Section 4.

We used a simple bootstrap resampling procedure (Efron, 1979) to generate standard errors for our estimate of *p*_{A}. For a given dataset of *n* observations and *m* genes, we draw *T* bootstrap samples; each sample contains expression values of *m*′ genes drawn at random with replacement where *m*′ ≈ 0.6**m* (so a total of *nm*′ values). From each sample *j*, *j* = 1,…, *T*, we calculate the mean of across values of α in a given range. We then determine the value of α that corresponds to the minimum radius of curvature (*s*^{*}) of , plotted as a function of α (as described in Section 3.3). This value of α is used to generate *tR*_{i} for the genes in sample *j* and determine its minimum, i.e. our estimate of *p*_{A}. The result of our bootstrap procedure is *T* estimates of *p*_{A}, , one for each sample. The SD of these estimates is taken as the standard error of our estimate of *p*_{A}, and a (1 − τ) confidence interval for our estimate is calculated as where and are the (τ/2)th and (1−(τ/2))th percentiles of our 100 estimates of *p*_{A}.Stated as psuedo-code for clarity, our procedure is as follows:

- Generate
*T*bootstrap samples where each sample contains*nm*′ expression values, i.e. expression values for*m*′ genes from each sample. The*m*′ genes,*m*′≈0.6**m*, are selected at random and with replacement. - For each sample, calculate the values of
*tR*_{i},*i*=1,…,*m*′, for a range of values of α. - For each sample, calculate the values on the curve for a range of values of α, using the result of step 2.
- For each sample, use the curve calculated in Step 3 to determine the value of α that corresponds to the minimum radius of curvature
*s*^{*}(as described in Section 3.3). Label this value as α_{j}for each sample*j*,*j*=1,…,*T*. - For each sample
*j*, use the values of*tR*_{i}that correspond to α_{j}(as calculated in Step 2) and determine its minimum, i.e. our estimate of*p*_{A}. This yields . - Calculate the standard error of our estimate (as the SD of ) and a (1 − τ) confidence interval (as where and are the (τ/2)th and (1−(τ/2))th percentiles of our 100 estimates of
*p*_{A}).

As a sort of stability analysis, we chose a range of values for *T* in our computations below to see whether there was any obvious relationship between the size of *T* and the likelihood that a bootstrapped confidence interval contained the true value. The results in Table 2 suggest that the size of *T* and the accuracy of the bootstrap intervals is slight at most.

We implemented our procedure for estimating *p*_{A} in several gene expression datasets, both proprietary and public, in which expression data was generated from samples composed of two tissue/cell types. Some of the samples consist of different cell types from the same organism, while other samples are a mix of cell types from different organisms. The proportion of each component type is known, as the data come from titration series; we use these values to assess the accuracy of our estimates.

Our data consists of six datasets obtained either from the University of Miami School of Medicine (UMiami) or the NCBI GEO (Barrett *et al.*, 2006). The UMiami datasets were created as a titration series of RNA from breast cancer cells (either MDA231 or MCF7) and normal mouse lung cells. The expression platform is Illumina Human WG-6 version 2 (MCF7) or version 3 (MDA231) (Illumina Inc., 2009); chips were processed at two different laboratories. The data from GEO includes titration series of Universal Human Reference RNA and Human Brain RNA from the MAQC study (MAQC Consortium, 2006). We selected data processed at two different laboratories and on two different platforms, either Human-6 BeadChip 48K version 1 (Illumina Inc., 2009) or HG-U133 Plus 2.0 GeneChip (Affymetrix Inc., 2009). Two other datasets from GEO were also included in our studies; these data include two titration series of mouse T and B cells (Shearstone *et al.*, 2006). These sets were processed on the Mouse 430A version 2 GeneChip platform (Affymetrix Inc., 2009). The details of each dataset are presented in Table 1.

The method of normalization of gene expression data can impact substantially which probes are identified as detected and which probes are identified as differentially expressed between conditions (Dunning *et al.*, 2008b; Johnstone *et al.*, 2008). For this reason, we implemented several normalization methods on our proprietary datasets, while using the available normalized data for the publicly available datasets. The normalization methods for the Illumina data include quantile normalization and qspline normalization as implemented in the R package beadarray (Dunning *et al.*, 2008a; R Development Core Team, 2009) and cubic normalization as implemented in the Illumina BeadStudio software (Illumina Inc., 2009). After normalization, only those genes with a detection *P*-value <0.01 in all samples (Illumina Inc., 2009) or considered present in all samples according to the Affymetrix MAS5 algorithm (Affymetrix Inc., 2009) were included in further analyses (i.e. bootstrap estimation of *p*_{A} by the procedure described in Section 3.4).

Select results of our bootstrap estimation procedure for each dataset are shown in Figure 4. For the UMiami datasets, we chose to display results for only one normalization method for brevity.

Bootstrap estimates of *p*_{A} with 90% confidence intervals. Boxes indicate the point estimates of *p*_{A}; light grey vertical lines indicate the true values of *p*_{A}. (**a**) MDA231 qspline-normalized data; (**b**) MCF7 quantile-normalized data; (**c**) MAQC ILM cubic spline-normalized **...**

In ~90% of cases, our point estimate is within 5% of the true proportion; in ~80% of cases, the 90% bootstrap confidence interval for our estimate contains the true value of *p*_{A}. We note that our method found the BIIB 100 dataset to be the most challenging. This is no surprise as this titration series was designed with very low levels of mRNA, as a challenge to the procedure used for RNA amplification prior to running the expression assay (Shearstone *et al.*, 2006). In other words, this data was generated from a very small amount of biological material so the estimation of the proportion of the biological components is very challenging.

There is evidence in Figure 4 of an interaction between the normalization procedure and the accuracy of our estimation method. For example, we tend to overestimate *p*_{A} when the data is qspline normalized, as with the UMiami MDA231 data, but we tend to underestimate *p*_{A} when the data is quantile normalized. We note that this relationship could also be a consequence of other experimental variables, such as the expression platform or the specific laboratory in which the data were generated. Further, datasets and analysis are required to determine which factors (e.g. normalization, platform and laboratory) have significant effects on the accuracy of our procedure.

In addition, the stated confidence level of the confidence intervals (90%) is predicated on the validity of the underlying model (Leeb, 2009; Shen *et al.*, 2004). Because our underlying model has some level of uncertainty, the stated level of confidence is an overestimate of the actual level of confidence. In other words, model uncertainty tells us that a true 90% confidence interval is larger than the stated 90% confidence interval. In light of this the accuracy of our method is most likely better than the results in Table 2 would suggest.

An accurate estimate of *p*_{A} can be used to generate estimates of expression specific to each tissue/cell type. Given expression from a mixed sample *AB* and an estimate of *p*_{A} we can estimate *E*(*A*) and *E*(*B*) as

As we observe *E*(*A*), we can compare with the observed *E*(*A*) to assess the quality of our estimate. Whether the error in using as an estimate of *E*(*A*) can be used to improve our estimate of *p*_{A} is a topic for future research.

We have demonstrated a statistical method for estimating the proportions of each sample (Samples *A* and *B*) in a two-sample mixture (*AB*). This method requires expression data generated from the mixed sample *AB* and expression data generated from a purified sample of one type *A*. Given this information, the method approximates the proportion *p*_{A} as the minimum of the ratios of expression in the mixed and purified samples, where the minimum is taken over genes. For this estimate to be accurate, it is required that the data be transformed; the value of the parameter of the transformation is determined by a geometric argument involving the minimum radius of curvature of a function, parameterized as a curve in the plane. Our results show that our method provides a reasonably accurate estimate of *p*_{A} on both proprietary and publicly available datasets.

As demonstrated in Cleator *et al.* (2006) a large value of *p*_{A} (say, over 0.5) can have a substantial effect on the results of tests for differential expression and confound tumor classification. However, whether a large *p*_{A} should be cause for concern depends on the specific study. We would argue that *p*_{A} should be assessed in all samples, but the action of the investigator in response to a large value of *p*_{A} may vary from no action to discarding the sample from further consideration. In the case where *p*_{A} is very large, our method will still give a reasonable estimate of but the variability in this estimate could be large. Whether a large *p*_{A} necessitates a renormalization of the data is unknown; we conjecture that if and *E*(*A*) are comparable then renormalization is unnecessary.

The results presented are preliminary and as such further research is required to optimize and validate our method. Our bootstrap point estimates and confidence intervals could be substantially improved by increasing the number of bootstrap samples *T* and running diagnostics to ensure that the number of samples and size of samples are adequate for generating valid bootstrap quantities of interest (Canty *et al.*, 2002). In addition, we would like to explore the relationship between the method of normalization and our estimation technique. By altering the noise distribution, normalization alters the relationship between the noise and the values of *R*_{i}, thereby influencing the accuracy of min*R*_{i} as an estimate of *p*_{A}. The extent of this influence is unknown, but further research may help determine which normalization method yields the most accurate estimate of *p*_{A}. Finally, the calculation of the radius of curvature depends on the estimation of the second derivative of the curve; we approximate the second derivative by the second difference equation [Equation (4)]. This approximation is accurate if the curve is smooth and is well sampled, i.e. the distance between *s*_{k} and *s*_{k+1} is small. Using a well-sampled curve in our method can be computationally expensive if the range of value of α (i.e. values of *s*) is large. We would like to design a variation of our method which starts with a sparsely sampled curve over a large range of values of α and iteratively narrows the range of interest and increases the sampling density as information about the probable location of *s*^{*} is obtained. This should yield a better estimate of *p*_{A} at lower computational expense. We hope to implement this variation and provide our approach to the statistics community as an R package (R Development Core Team, 2009).

Our definition of the ‘elbow’ of a curve as the point of minimum radius of curvature is applicable to other problems in statistics, such as the choice of the number of principal components in a principal components analysis (Jolliffe, 2002). One existing way to make this choice is to identify the ‘elbow’ of the curve from a scree plot and choose the number of components closest to the ‘elbow’. Our procedure for finding the minimum radius of curvature, coupled with a curve-fitting method, may be directly applicable to this problem. This would provide a formalization, in the spirit of Zhu and Ghodsi Zhu and Ghosdi (2006), of what is currently an *ad hoc* approach.

We would like to thank the Center for Computational Science and the lab of Marc E. Lippman, MD, for their suggestions and input.

*Funding*: National Cancer Institute of the National Institutes of Health (K25 CA111636 to J.C.).

*Conflict of Interest*: none declared.

- Affymetrix Inc. Affymetrix Expression Console Software Version 1.0 — User Guide. Santa Clara, CA: 2009.
- Ames W. Numerical Methods for Partial Differential Equations. New York: Academic Press; 1977.
- Barrett T, et al. NCBI GEO: mining tens of millions of expression profiles—database and tools. Nucleic Acids Res. 2006;35:D760–D765. [PubMed]
- Canty A, et al. Bootstrap diagnostics and remedies. Can. J. Stat. 2002;34:5–27.
- Causton H, et al. Microarray gene expression data analysis: A beginner's guide. Malden, MA: Blackwell Science; 2003.
- Cleator S, et al. The effect of the stromal component of breast tumours on prediction of clinical outcome using gene expression microarray analysis. Breast Cancer Res. 2006;8:R32. [PMC free article] [PubMed]
- Dunning M, et al. beadarray: R classes and methods for illumina bead-based data. Bioinformatics. 2008a;23:2183–2184. [PubMed]
- Dunning M, et al. Statistical issues in the analysis of illumina data. BMC Bioinformatics. 2008b;9:85. [PMC free article] [PubMed]
- Efron B. Bootstrap methods: Another look at the jackknife. Ann. Stat. 1979;7:1–26.
- Fend F, Raffeld M. Laser capture microdissection in pathology. J. Clin. Pathol. 2000;53:666–672. [PMC free article] [PubMed]
- Ghosh D. Mixture models for assessing differential expression in complex tissues using microarray data. Bioinformatics. 2004;20:1663–1669. [PubMed]
- Gosink M, et al. Electronically subtracting expression patterns from a mixed cell population. Bioinformatics. 2007;23:3328–3334. [PubMed]
- Illumina Inc. BeadStudio Gene Expression Module v3.4 User Guide (11317265 Rev A). San Diego, CA: 2009.
- Johnstone D, et al. Proceedings of the 19th International Conference on Genome Informatics (GIW 2008) Poster; 2008. Effects of different normalisation and analysis procedures on illumina gene expression microarray data.
- Jolliffe I. Principal components analysis. 2nd. Berlin: Springer; 2002.
- Koltai H, Weingarten-Baror C. Specificity of DNA microarray hybridization: characterization, effectors, and approaches for data correction. Nucleic Acids Res. 2008;36:2395–2405. [PMC free article] [PubMed]
- Lahdesmaki H, et al. In silico microdissection of microarray data from heterogeneous cell populations. BMC Bioinformatics. 2005;6:54. [PMC free article] [PubMed]
- Leeb H. Conditional predictive inference post model selection. Ann. Stat. 2009;37:2838–2876.
- Lipschutz M. Schaum's outline of theory and problems of differential geometry. New York: McGraw Hill; 1969.
- Lu P, et al. Expression deconvolution: A reinterpretation of DNA microarray data reveals dynamic changes in cell populations. Proc. Nat. Acad. Sci. USA. 2003;100:10370–10375. [PubMed]
- MAQC Consortium. The microarray quality control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat. Biotechnol. 2006;24:1151–1161. [PubMed]
- Michiels S, et al. Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet. 2005;365:488–492. [PubMed]
- Pittman J, et al. Clinico-genomic models for personalized prediction of disease outcomes. Proc. Nat. Acad. Sci. USA. 2004;101:8431–8436. [PubMed]
- R Development Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: Foundation for Statistical Computing; 2009. ISBN 3-900051-07-0.
- Shearstone J, et al. Accurate and precise transcriptional profiles from 50 pg of total RNA or 100 flow-sorted primary lymphocytes. Genomics. 2006;88:111–121. [PubMed]
- Shen X, et al. Inference after model selection. J. Am. Stat. Assoc. 2004;99:751–761.
- Stuart R, et al. In silico dissection of cell-type-associated patterns of gene expression in prostate cancer. Proc. Nat. Acad. Sci. USA. 2004;101:615–620. [PubMed]
- van't Veer L, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002;415:530–536. [PubMed]
- Venet D, et al. Separation of samples into their constituents using gene expression data. Bioinformatics. 2001;17:S279–S287. [PubMed]
- Wang M, et al. Computational expresssion deconvolution in a complex mammalian organ. BMC Bioinformatics. 2006;7:328. [PMC free article] [PubMed]
- Wheelan S, et al. The incredible shrinking world of DNA microarrays. Mol. Biosyst. 2008;4:726–732. [PMC free article] [PubMed]
- Zhu M, Ghosdi A. Automatic dimensionality selection from the scree plot via the use of profile likelihood. Comput. Stat. Data Anal. 2006;51:918–930.

Articles from Bioinformatics are provided here courtesy of **Oxford University Press**

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |