Bioinformatics. 2010 October 15; 26(20): 2571–2577.

Published online 2010 July 14. doi: 10.1093/bioinformatics/btq406

PMCID: PMC2951082

Timo Erkkilä,^{1,2,*} Saara Lehmusvaara,^{3} Pekka Ruusuvuori,^{1,2} Tapio Visakorpi,^{3} Ilya Shmulevich,^{1,2} and Harri Lähdesmäki^{1,4,*}

* To whom correspondence should be addressed.

Associate Editor: Trey Ideker

Received 2010 January 28; Revised 2010 June 7; Accepted 2010 July 6.

Copyright © The Author(s) 2010. Published by Oxford University Press.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.


**Motivation:** Tissue heterogeneity, arising from multiple cell types, is a major confounding factor in experiments that focus on studying cell types, e.g. their expression profiles, in isolation. Although sample heterogeneity can be addressed by manual microdissection prior to conducting experiments, computational treatment of heterogeneous measurements has become a reliable alternative that performs this microdissection *in silico*. Favoring computation over manual purification has its advantages, such as reduced time consumption, measuring responses of multiple cell types simultaneously, keeping samples free of external perturbations and preserving the yield of molecular content.

**Results:** We formalize a probabilistic model, DSection, and show with simulations as well as with real microarray data that DSection attains increased modeling accuracy in terms of (i) estimating cell-type proportions of heterogeneous tissue samples, (ii) estimating replication variance and (iii) identifying differential expression across cell types under various experimental conditions. As our reference we use the corresponding linear regression model, which mirrors the performance of the majority of current non-probabilistic modeling approaches.

**Availability and Software:** All code is written in Matlab and is freely available upon request as well as at the project web page http://www.cs.tut.fi/~erkkila2/. Furthermore, a web application for DSection exists at http://informatics.systemsbiology.net/DSection.

**Contact:** timo.p.erkkila@tut.fi; harri.lahdesmaki@tut.fi

To fully utilize the capabilities of high-throughput measurement techniques, which often have to deal with physically small and heterogeneous tissue samples, attention should be paid to how heterogeneity, i.e. the presence of multiple cell types in a tissue, is addressed. Many studies focus on identifying behavioral differences across cell types, and in such cases sample heterogeneity clearly has a confounding effect on downstream experiments and analysis.

Although laser-capture microdissection (LCM; Emmert-Buck *et al.*, 1996) offers a direct way to address tissue heterogeneity by allowing for isolation of morphologically distinguishable cell types, there are occasions when it is not feasible. The yield of biological content (e.g. mRNA) available for experiments is consequently lowered, which often needs to be compensated for with either more sensitive measurement devices or amplification of molecular quantities (Sooriakumaran *et al.*, 2009). However, amplification of mRNA from small albeit pure cell samples has its shortcomings, most notably nonlinearity (Otsuka *et al.*, 2007), which obscures the underlying profiles of distinct cell types.

Several authors have already studied performing computational microdissection for heterogeneous tissues, and proposed promising methods for microarray expression data. Initial attempts stem from Venet *et al.* (2001), who proposed a linear model for estimating both cell-type proportions and cell-type-specific gene expression profiles; the model assumes that, as prior information, there exist known, exclusively expressed genes for each cell-type. Subsequent studies have then demonstrated that the linearity assumption and prior information on either gene expression profiles, cell-type proportions, or both, can yield meaningful interpretations for the constituents of heterogeneous tissues (Abbas *et al.*, 2009; Gosink *et al.*, 2007; Hoffmann *et al.*, 2006; Jacobsen *et al.*, 2006; Lähdesmäki *et al.*, 2005; Quon and Morris, 2009; Stuart *et al.*, 2004).

In real experiments, conducted on the basis of heterogeneous tissue samples, having *precise* prior information is unrealistic, even though current models consistently rely on such information. We incorporate this missing functionality into the already-familiar linear regression framework through Bayesian prior densities whose shapes reflect the uncertainties associated with the prior information, such as cell-type proportions or cell-type-specific expression profiles.

For all model parameters, an efficient Markov chain Monte Carlo (MCMC) sampler is proposed. Extending existing microdissection models, we further assume that the heterogeneous tissues have been measured under various experimental conditions that may have an impact on cell-type-specific expression profiles. As cell-type-specific profiles are assumed to differ across both cell types and experimental conditions, assessment of statistically significant differential expression is performed with the two-sample *t*-test, though other tests for differential expression can be used.

We use simulated and real gene expression data for assessing the performance of the Bayesian model in contrast to a linear regression model that essentially captures properties common to the aforementioned, deterministic approaches. A series of case studies is used to demonstrate that the proposed method is capable of (i) de-noising uncertain prior information about cell-type proportions, (ii) more accurate estimation of replication variance, consequently leading to (iii) more accurate identification of differential expression across cell types and experimental conditions.

We denote the tissue sample index with *j* and assume that there are *J* tissue samples in total. The number of cell types represented in the *J* samples needs to be known, and it is crucial that each of the *J* samples have *the same* cell types represented. We denote the cell type index by *t* and assume that there are *T* cell types in total. Lastly, we denote the number of probes (a generic term, e.g. a gene or miRNA) in an experiment by *I* so that the modeled data, which we denote by **Y**, consists of *I* × *J* probe measurements,^{1} *y*_{ij}, one for each probe *i* and tissue sample *j*.

In the simplest form this is all that is required. In addition, samples are often prepared under various experimental conditions, say, under ‘No treatment’, ‘Treatment 1’, ‘Treatment 2’, etc. and the analysis may be focused on finding differences in probe measurements across experimental conditions. Therefore, we incorporate the condition information into the model with variable *c*(*j*) that takes on values 1, 2,…, *C*, being linked to the *C* different experimental conditions. For instance, if tissue samples 2 and 4 were measured under experimental condition ‘No treatment’, that information could be encoded by assigning *c*(2) = *c*(4) = 1; thus, condition ‘No treatment’ would be associated with index 1, and so on.
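As a minimal sketch of this encoding, the condition labels can be mapped to the indices 1,…,*C* and the samples grouped by condition. The helper name `encode_conditions` and the four-sample design below are hypothetical and serve only as illustration.

```python
def encode_conditions(labels, order):
    """Encode per-sample condition labels as indices 1..C.

    labels : condition label of each sample, in sample order (sample 1 first)
    order  : condition names in the order that defines indices 1..C
    Returns (c, groups), where c[j] is the condition index of sample j and
    groups[k] lists the samples j with c(j) = k.
    """
    index_of = {name: k for k, name in enumerate(order, start=1)}
    c = {j: index_of[label] for j, label in enumerate(labels, start=1)}
    groups = {k: [j for j in c if c[j] == k] for k in index_of.values()}
    return c, groups


# Hypothetical design: samples 2 and 4 are untreated, samples 1 and 3 are treated.
order = ["No treatment", "Treatment 1"]
labels = ["Treatment 1", "No treatment", "Treatment 1", "No treatment"]
c, groups = encode_conditions(labels, order)
```

Here `c[2]` and `c[4]` both equal 1, matching the example in the text, and `groups` gives the per-condition sample sets needed later when summing over *j* with *c*(*j*) = *c*.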

For tissue sample *j* under experimental condition *c*(*j*), the data point for probe *i*, *y*_{ij}, is modeled as a sum of pure probe readings of all cell types, **x**_{ic(j)} = (*x*_{1ic(j)}, *x*_{2ic(j)},…, *x*_{Tic(j)}), weighted by the respective cell type proportions, **p**_{j} = (*p*_{1j}, *p*_{2j},…, *p*_{Tj}), plus an additive, normally distributed noise term, ϵ_{ij}, reflecting replication noise with variance 1/λ_{i}:

$$y_{ij} = \sum_{t=1}^{T} p_{tj}\, x_{tic(j)} + \epsilon_{ij} \tag{1}$$

so that the likelihood of data point *y*_{ij} becomes *y*_{ij}|**p**_{j}, **x**_{i}, λ_{i} ~ Normal(∑_{t=1}^{T}*p*_{tj}*x*_{tic(j)}, 1/λ_{i}). Thus, we model the replication variance, 1/λ_{i}, as heteroscedastic across probes and homoscedastic across cell types and experimental conditions. Assuming independent and identically distributed (IID) measurements (the elements *y*_{ij} of the data **Y**), a factorized form for the joint data likelihood can then be written as *f*(**Y**|**θ**) = ∏_{i=1}^{I}∏_{j=1}^{J}*f*(*y*_{ij}|**p**_{j}, **x**_{i}, λ_{i}), where **θ** is a collection of all model parameters, i.e. *p*_{tj}'s, *x*_{tic}'s, and λ_{i}'s. The assumptions of additive, normally distributed noise and IID measurements are standard practice, although there is statistical evidence that at least the IID assumption may not always be valid (Efron, 2009).
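The factorized likelihood can be illustrated with a short Python sketch. The function names, the nested-list data layout and the 0-based condition index are assumptions made for this illustration; the authors' implementation is in Matlab and is not reproduced here.

```python
import math


def model_mean(p_j, x_ic):
    """Mixture mean sum_t p_tj * x_tic(j) for one probe and one sample."""
    return sum(p * x for p, x in zip(p_j, x_ic))


def log_likelihood(y, p, x, lam, cond):
    """Joint log-likelihood under y_ij ~ Normal(mean_ij, 1 / lam_i).

    y[i][j]    : measurement of probe i in sample j
    p[j][t]    : proportion of cell type t in sample j
    x[i][c][t] : pure expression of cell type t for probe i under condition c
    lam[i]     : replication precision of probe i
    cond[j]    : 0-based condition index of sample j (c(j) - 1 in the text)
    """
    total = 0.0
    for i, row in enumerate(y):
        for j, y_ij in enumerate(row):
            e = y_ij - model_mean(p[j], x[i][cond[j]])
            # Log of the normal density with precision lam[i].
            total += 0.5 * math.log(lam[i] / (2.0 * math.pi)) - 0.5 * lam[i] * e * e
    return total
```

Working on the log scale turns the double product over probes and samples into a double sum, which avoids numerical underflow when *I* × *J* is large.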

The model is next extended to account for parameter priors, so that the posterior distribution of all unknown model parameters required for sampling could be formulated. The prior assignments are done in a way that allows for easy sampling, and the shapes of the prior distributions are chosen to reflect the assumed variability of parameters.

We impose a normal prior *x*_{tic} ~ N(μ_{tic}, ν) for the cell type and condition-specific probe measurement *i*, where the prior expression means and precision, μ_{tic} and ν, are extracted from the least-squares solution to the corresponding linear regression model assuming cell-type proportions known (see Supplementary Material for details). Normality is preferred so as to make use of the property of conjugate priors (posterior for *x*_{tic} will be a normal density, given that the prior and likelihood densities are also normal). Furthermore, a shared Gamma prior, Gamma(α, β), is placed on the inverses of replication variances, i.e. precisions, λ_{1},…, λ_{i},…, λ_{I}. Positive support and flexibility of Gamma(·, ·) make it useful in modeling precision parameters in a Bayesian framework (Gelman, 2006). Furthermore, the shared prior shrinks posterior estimates of λ_{i}'s toward their common prior mean, α/β, regularizing estimates especially when dealing with small sample sizes (Smyth, 2004).

The mixing proportions for tissue sample *j*, **p**_{j} = (*p*_{1j},…, *p*_{Tj}), are limited to a *T*-simplex; all elements in **p**_{j}'s are non-negative and, vector-wise, sum up to one. A natural prior density for such vectors is the Dirichlet density, which we parameterize with *w*_{0} and **p**_{0j} as **p**_{j} ~ Dirichlet(*w*_{0}**p**_{0j}). The parametrization is done in a way that allows for prior knowledge on *p*_{tj}'s to be plugged into the model in a straightforward manner. Namely, we assume that a user has obtained prior information on the cell-type proportion in the *J* samples (e.g. by looking at the histology slides of the samples and making rough estimates or in an automated manner using digital microscopy images of the samples, or with flow cytometry, etc.), and these prior proportions are stored in **p**_{0j}. Moreover, the belief of the correctness of prior proportions is specified by the multiplicative weight *w*_{0}. This way the user can tune the peakedness of the prior density around the prior guess, **p**_{0j}; increasing *w*_{0} increases the peakedness and vice versa. For compactness, we encapsulate the aforementioned parameters in a vector **ξ** = (α, β, μ_{111},…, μ_{TIC}, *w*_{0}, **p**_{01},…, **p**_{0J}).
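The effect of *w*_{0} on the peakedness of the Dirichlet prior can be checked numerically. The sketch below draws Dirichlet vectors by normalising Gamma variates, a standard construction; the function names, the number of draws and the seed are illustrative assumptions.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one vector from Dirichlet(alpha) by normalising Gamma(alpha_t, 1) draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [v / total for v in g]


def prior_spread(p0, w0, n_draws=2000, seed=0):
    """Empirical variance of the first proportion under Dirichlet(w0 * p0)."""
    rng = random.Random(seed)
    alpha = [w0 * q for q in p0]
    first = [sample_dirichlet(alpha, rng)[0] for _ in range(n_draws)]
    mean = sum(first) / n_draws
    return sum((v - mean) ** 2 for v in first) / n_draws
```

For a Dirichlet(*w*_{0}**p**_{0j}) density the variance of component *t* is *p*_{0tj}(1 − *p*_{0tj})/(*w*_{0} + 1), so `prior_spread([0.25, 0.75], 100)` should come out markedly smaller than `prior_spread([0.25, 0.75], 10)`.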

Unknown parameters, i.e. **θ**, in our model are estimated in an MCMC fashion, which means we first must devise a sampling scheme under which samples are drawn from the posterior density of our parameters, given data **Y** and fixed parameters **ξ**, *f*(**θ**|**Y**, **ξ**) ∝ *f*(**Y**|**θ**)*f*(**θ**|**ξ**). Assuming *S* samples drawn from the posterior, the samples are subsequently used for summarization, i.e. approximating the expected value of the parameters with Monte Carlo integration (Gelman *et al.*, 2004), E[**θ**|**Y**, **ξ**] ≈ 1/*S*∑_{s=1}^{S}**θ**^{(s)}. Gibbs sampling (Gelman *et al.*, 2004) is one such sampling method; it draws a value from the conditional posterior of each parameter in turn, while conditioning on the data and on all other model parameters, which are set to their previously sampled values.
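The Monte Carlo summarization amounts to averaging the retained draws. A minimal sketch follows, assuming each draw is stored as a flat list of parameter values and that an initial burn-in segment is discarded, as is customary; the function name is hypothetical.

```python
def posterior_mean(chain, burn_in):
    """Approximate E[theta | data] by averaging the draws kept after burn-in.

    chain   : list of draws, each a list with one value per parameter
    burn_in : number of initial draws to discard
    """
    kept = chain[burn_in:]
    if not kept:
        raise ValueError("no draws left after burn-in")
    dim = len(kept[0])
    return [sum(draw[d] for draw in kept) / len(kept) for d in range(dim)]
```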

Next, we will construct a hybrid Gibbs and Metropolis–Hastings (M–H) sampler for all the model parameters; detailed derivations are shown in the Supplementary Material. The posterior for *x*_{tic} is

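$$x_{tic} \mid \cdot \;\sim\; \mathrm{Normal}\!\left(\frac{P_{tic}}{Q_{tic}},\; \frac{1}{Q_{tic}}\right) \tag{2}$$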

where the parameters of that distribution are *P*_{tic} = λ_{i}∑_{j:c(j)=c}(*y*_{ij}*p*_{tj} − *p*_{tj}∑_{t′≠t}*p*_{t′j}*x*_{t′ic}) + νμ_{tic} and *Q*_{tic} = λ_{i}∑_{j:c(j)=c}*p*_{tj}^{2} + ν. In a similar fashion, one finds the posterior for λ_{i} to be

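$$\lambda_{i} \mid \cdot \;\sim\; \mathrm{Gamma}\!\left(\alpha + \frac{J}{2},\; \beta + \frac{1}{2}\sum_{j=1}^{J} e_{ij}^{2}\right) \tag{3}$$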

where *e*_{ij} is the model residual *e*_{ij} = *y*_{ij}−∑_{t=1}^{T}*p*_{tj}*x*_{tic(j)}. However, one cannot find such a density for the cell-type proportions since the normalizing constant for that posterior is computationally infeasible to solve. Thus, we cannot proceed with Gibbs sampling in this particular case but make use of M–H sampling (Gelman *et al.*, 2004) instead; Gibbs sampling is a special case of M–H, thus, both Gibbs and M–H sampling can be utilized in the same framework (Andrieu *et al.*, 2003).
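The two conjugate Gibbs updates can be sketched in Python as follows. The code uses *P*_{tic} and *Q*_{tic} exactly as defined above, and assumes the standard conjugate forms Normal(*P*_{tic}/*Q*_{tic}, 1/*Q*_{tic}) for *x*_{tic} and Gamma(α + *J*/2, β + ½∑_{j}*e*_{ij}^{2}) for λ_{i}, with the second Gamma argument read as a rate. Function names and data layout are illustrative, not the authors' Matlab code.

```python
import random


def x_posterior_params(t, y_i, p, x_ic, lam_i, mu_tic, nu, samples_c):
    """Return (mean, variance) = (P/Q, 1/Q) of the conditional for x_tic.

    y_i[j]    : measurement of probe i in sample j
    p[j][t]   : proportion of cell type t in sample j
    x_ic[t]   : current expressions of probe i under condition c
    samples_c : indices j of the samples measured under condition c
    """
    P = nu * mu_tic
    Q = nu
    for j in samples_c:
        # Contribution of all other cell types to the mixture mean.
        others = sum(p[j][u] * x_ic[u] for u in range(len(x_ic)) if u != t)
        P += lam_i * p[j][t] * (y_i[j] - others)
        Q += lam_i * p[j][t] ** 2
    return P / Q, 1.0 / Q


def lambda_posterior_params(e_i, alpha, beta):
    """Return (shape, rate) of the Gamma conditional for lambda_i given residuals."""
    return alpha + 0.5 * len(e_i), beta + 0.5 * sum(e * e for e in e_i)


def gibbs_draw_x(rng, t, y_i, p, x_ic, lam_i, mu_tic, nu, samples_c):
    """Draw x_tic from its normal conditional."""
    mean, var = x_posterior_params(t, y_i, p, x_ic, lam_i, mu_tic, nu, samples_c)
    return rng.gauss(mean, var ** 0.5)


def gibbs_draw_lambda(rng, e_i, alpha, beta):
    """Draw lambda_i from its Gamma conditional."""
    shape, rate = lambda_posterior_params(e_i, alpha, beta)
    # random.gammavariate is parameterised by (shape, scale); scale = 1 / rate.
    return rng.gammavariate(shape, 1.0 / rate)
```

Separating the parameter computation from the random draw keeps the deterministic part easy to check against the formulas.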

For employing M–H sampling, one needs an un-normalized posterior of **p**_{j} and a transition kernel. The un-normalized posterior is

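$$f(\mathbf{p}_{j} \mid \cdot) \;\propto\; \exp\!\left(s_{j} - \frac{1}{2}\sum_{i=1}^{I} \lambda_{i}\, e_{ij}^{2}\right) \tag{4}$$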

where *e*_{ij} is, again, the model residual and *s*_{j} = ∑_{t=1}^{T}(*w*_{0}*p*_{0tj} − 1)ln(*p*_{tj}). Dirichlet density as the transition kernel for M–H works well in our case since the sampler for the posterior of **p**_{j} must stay within the *T*-simplex, as previously explained. Now, if the previous value in the Markov chain is denoted by **p**_{j}^{*}, a proposal value, denoted by **p**_{j}, will be drawn from Dirichlet(*w***p**_{j}^{*}), and the corresponding kernel, i.e. Dirichlet density function, is denoted by *K*(**p**_{j}^{*} → **p**_{j}). The role of *w* is analogous to that of *w*_{0}, as *w* is used to control the peakedness of the transition kernel around the previously sampled value, **p**_{j}^{*}. The acceptance of the proposed, newly sampled value then depends on the factor

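$$\rho_{j}(\mathbf{p}_{j}^{*} \to \mathbf{p}_{j}) = \frac{f(\mathbf{p}_{j} \mid \cdot)\, K(\mathbf{p}_{j} \to \mathbf{p}_{j}^{*})}{f(\mathbf{p}_{j}^{*} \mid \cdot)\, K(\mathbf{p}_{j}^{*} \to \mathbf{p}_{j})} \tag{5}$$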

and the probability of acceptance is determined by Pr[accept] = min{1, ρ_{j}(**p**_{j}^{*} → **p**_{j})}.
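One M–H update of **p**_{j} with a Dirichlet proposal might look like the sketch below. It assumes the un-normalized log posterior *s*_{j} − ½∑_{i}λ_{i}*e*_{ij}^{2} and the standard M–H ratio with forward and reverse Dirichlet kernels; function names and the data layout are illustrative assumptions rather than the authors' implementation.

```python
import math
import random


def log_dirichlet_pdf(v, alpha):
    """Log density of Dirichlet(alpha) evaluated at v."""
    return (math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
            + sum((a - 1.0) * math.log(x) for a, x in zip(alpha, v)))


def log_unnorm_posterior(p_j, y_j, x_j, lam, p0_j, w0):
    """s_j - 0.5 * sum_i lam_i * e_ij^2 for one sample j.

    y_j[i]    : measurement of probe i in sample j
    x_j[i][t] : expression of cell type t for probe i under condition c(j)
    """
    s_j = sum((w0 * q - 1.0) * math.log(v) for q, v in zip(p0_j, p_j))
    sq = 0.0
    for y_ij, x_i, lam_i in zip(y_j, x_j, lam):
        e = y_ij - sum(v * x for v, x in zip(p_j, x_i))
        sq += lam_i * e * e
    return s_j - 0.5 * sq


def mh_step(p_cur, y_j, x_j, lam, p0_j, w0, w, rng):
    """One Metropolis-Hastings update of p_j with proposal Dirichlet(w * p_cur)."""
    g = [rng.gammavariate(w * v, 1.0) for v in p_cur]
    total = sum(g)
    p_new = [v / total for v in g]
    if min(p_new) <= 0.0:
        return p_cur  # reject proposals that fall on the simplex boundary
    log_rho = (log_unnorm_posterior(p_new, y_j, x_j, lam, p0_j, w0)
               + log_dirichlet_pdf(p_cur, [w * v for v in p_new])   # reverse kernel
               - log_unnorm_posterior(p_cur, y_j, x_j, lam, p0_j, w0)
               - log_dirichlet_pdf(p_new, [w * v for v in p_cur]))  # forward kernel
    if rng.random() < math.exp(min(0.0, log_rho)):
        return p_new
    return p_cur
```

Because the Dirichlet kernel is not symmetric, the ratio of reverse to forward kernel densities has to be kept in the acceptance factor. Evaluating everything on the log scale avoids overflow when the residual term is large.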

In computing the forthcoming results with DSection, we used the following values for controlling parameters of our model. Namely, we set peakedness around prior cell-type proportions to *w*_{0} = 10, peakedness of transition kernel to *w* = 100, burn-in period to *B* = 2000 iterations, and chain length to *S* = 500 iterations. Along sampling, we also computed and visualized estimates of autocovariance functions of the sampled parameters, which indicated that our choice for the chain length was reasonable, i.e. covariance diminished relatively rapidly as lag was increased (data not shown) (Cowles and Carlin, 1996; Rasmussen, 2000).
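The autocovariance check mentioned above can be sketched as follows. This is the usual biased sample estimator applied to the chain of one scalar parameter; the function name is hypothetical and the article's own diagnostic code is not shown.

```python
def autocovariance(chain, lag):
    """Biased sample autocovariance of a scalar chain at the given lag."""
    n = len(chain)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(chain)")
    mean = sum(chain) / n
    return sum((chain[k] - mean) * (chain[k + lag] - mean)
               for k in range(n - lag)) / n
```

A chain that mixes well shows autocovariance values that fall toward zero within a few lags, which is the behavior the text reports.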

In order to demonstrate the full functionality of DSection, we designed a simulation experiment containing both multiple cell types and experimental conditions; an analysis of simpler, real data will follow. Expression profiles of 700 genes of three cell types under two experimental conditions were created. The expressions, *x*_{tic}, were chosen so that there existed probes for which expression profiles were identical across cell types and conditions, differed only across cell types, differed only across conditions, or differed across both. Expressions were set to vary within the range 100…1600; thus, the theoretical maximum achievable log_{2} fold-change is log_{2}(1600/100) = 4. Next, for each gene, a precision, λ_{i}, was drawn from Gamma(5, 1/0.0003) (mean precision 0.0015); the justification for using the Gamma density is the same as with prior densities. In total, 14 samples, 7 per experimental condition, were created and normally distributed noise with variance 1/λ_{i} was added.
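A single simulated probe could be generated along the following lines. The sketch reads the second argument of Gamma(5, 1/0.0003) as a rate, which reproduces the stated mean precision 5 × 0.0003 = 0.0015; the function name, the data layout and the example proportions used with it are assumptions.

```python
import math
import random


def simulate_probe(x_ic, p, cond, rng, shape=5.0, rate=1.0 / 0.0003):
    """Simulate one probe: draw its precision, then noisy mixed measurements.

    x_ic[c][t] : pure expression of cell type t under condition c (0-based)
    p[j][t]    : cell-type proportions of sample j
    cond[j]    : 0-based condition index of sample j
    """
    # random.gammavariate takes (shape, scale), so the rate is inverted.
    lam = rng.gammavariate(shape, 1.0 / rate)
    sd = 1.0 / math.sqrt(lam)
    y = [sum(q * x for q, x in zip(p[j], x_ic[cond[j]])) + rng.gauss(0.0, sd)
         for j in range(len(p))]
    return lam, y


# Largest log2 fold-change possible when expressions lie in 100...1600.
max_log2_fold_change = math.log2(1600 / 100)
```

With a mean precision of 0.0015, the typical noise standard deviation is on the order of 1/√0.0015 ≈ 26 expression units.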

Performance of the models is assessed on the basis of their ability to identify differential expression across cell types and experimental conditions—that is, probe *i* may be differentially expressed across some cell types and experimental conditions, at most in different ways, which are tested separately with the two-sample *t*-test (see Supplementary Material for more details).

The data are analyzed with the two models, linear regression and DSection, where the latter is utilized both with fixed cell-type proportions and by sampling from posterior of cell-type proportions. Simulation results (Fig. 1) show an increase in identification accuracy of differential expression for DSection, in contrast to our reference, the linear regression model. Thus, the analysis results indicate that our method with uncertainty in proportions incorporated actually attains an accuracy comparable with the ‘best-case’ scenario, i.e. cell-type proportions are known precisely and a linear regression model is used.

Fig. 1. Analysis results with simulated data: 3 cell types, 2 experimental conditions, 700 genes and 14 samples (seven for each experimental condition). (**a**) Estimation of cell-type proportions (bright spots), given noisy priors (faint spots). (**b**) ROC curves **...**

The methods differ mostly in estimation of replication variance, 1/λ_{i}. In fact, the discrepancy between ground-truth and estimates is sometimes so high that we visualize the replication standard deviation (SD), 1/√λ_{i}, instead. As the visuals suggest, only those models assuming fixed and precisely known cell-type proportions suffer from these high biases (Fig. 1c–e), whereas for DSection, which assumes noisy cell-type proportion priors, this bias is absent (Fig. 1f). Importantly, the bias is most strongly present in probes for which differential expression across cell types and experimental conditions is high; to elucidate this, we labeled each SD estimate with a color, and the intensity of that color increased along with average differential expression.

Next, we analyzed a publicly available dataset from Affymetrix oligonucleotide arrays [data downloaded from Affymetrix (2009)], consisting of over 15 000 genes whose heterogeneous expressions, comprising human brain and heart cells, were summarized using the robust multi-array averaging (RMA) procedure (Irizarry *et al.*, 2003). There are 33 samples in the dataset in total, each sample being designed to contain specific proportions of the distinct cell types. Table 1 contains all the samples provided within the Affymetrix dataset, but we only use those that contain cell types with ratio 25% : 75% and vice versa. Other samples, especially the pure samples that we used for reference, were discarded from the analysis to better reflect the scarcity of repeated measurements and the heterogeneity within samples that is usually the case. Moreover, we use the procedure described in the Supplementary Material for deriving noisy estimates for cell-type proportions, in turn reflecting inaccurate prior proportion predictions.

Although no ground-truth for replication variances of Affymetrix data truly exists, we can exploit the samples for each mixture experiment to at least derive good estimates (see Supplementary Material for details). Using these derived ground-truth estimates, Figure 2 shows, again, a similar bias pattern to what is observable with simulated data (Fig. 1). Bias in SD estimation accuracy for most highly differentially expressed genes is visible for the linear regression model that assumes fixed cell-type proportions, whereas DSection, which accounts for noisy cell-type proportion priors, reduces such biases.

Fig. 2. Analysis results with Affymetrix data: 2 cell types, 1 experimental condition, ~15 000 genes and 6 samples (25%/75% and vice versa). (**a**) Estimation of cell-type proportions (bright spots), given noisy priors (faint spots). (**b**) ROC curves **...**

Moreover, no ground-truth for truly differentially and non-differentially expressed genes exists for Affymetrix data. However, as we have samples representing pure cell types, it can be derived as well (see Supplementary Material for details). As can be seen in Figure 2b, the receiver operating characteristic (ROC) curves clearly have a similar pattern to what we observed with simulated data. DSection not only outperforms the linear regression model in terms of ROC, but also the performance of DSection is comparable with the ‘best-case’, which we computed by plugging the true cell-type proportions into the linear regression model, as described earlier.
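For readers unfamiliar with ROC analysis, the curve and its area can be computed as sketched below. The score is assumed to be any quantity that grows with the evidence for differential expression (for instance, a negated *t*-test *p*-value); the function names are hypothetical and this is not the evaluation code used in the article.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by thresholding scores from high to low.

    scores[i] : evidence that item i is differentially expressed (higher = stronger)
    labels[i] : 1 if item i is truly differentially expressed, else 0
    Both classes must be present in labels.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    for k, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only after the last item of a group of tied scores.
        if k + 1 == len(ranked) or ranked[k + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

An area of 1 corresponds to a perfect ranking of differentially expressed genes above the rest, and 0.5 to a random ranking.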

Additionally, we assessed the effect an increase in sample size has on both cell-type proportion estimation and expression profiling. In addition to the six samples (25%/75% and vice versa) we already used in the previous case study, we augment that data by the ones which contain cell types with ratio 10%/90% and vice versa—that is, 6 more samples making 12 samples in total.

The assessment of improvement was made in the following manner: (i) the six samples of 25%/75% etc. purity were augmented by a subset of 0, 1,…, 6 samples of 10%/90% etc. purity, (ii) noise was added to the ground-truth cell-type proportions of the selected samples with the previously used method, (iii) the linear regression model and DSection were fitted to the data and (iv) this was repeated 10 times.

For each iteration, mean absolute differences (MAD) between the estimates and ground-truth cell-type proportions and expression profiles were computed, followed by computing a sample mean over the 10 iterations. MAD was preferred as it essentially captures both bias and variance in a single quantity. As we increased the number of samples from 6 to 12, MAD was consistently lower for DSection than for the noisy estimates of cell-type proportions (those used directly with the linear regression model) (Fig. 3). A decreasing trend in MAD is observable as more samples were added; however, that is due to our way of adding noise to cell-type proportions. Namely, the closer the true cell-type proportions are to 1/*T*, i.e. as heterogeneous a sample as possible, the more noise is added. Since the augmented samples were less heterogeneous than the 25%/75% ones, increasing the sample size decreased the average MAD of the noisy cell-type proportions, in turn decreasing the MAD of DSection estimates. We did not observe any significant difference in MAD for expression profiling between the two models (data not shown), indicating that DSection relies heavily upon the priors derived using the deterministic linear regression counterpart.
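The MAD summary can be written compactly. In this sketch the estimates and the ground truth are assumed to be stored as equally shaped nested lists (one row per sample), and the function names are hypothetical.

```python
def mean_absolute_difference(estimates, truth):
    """Mean of |estimate - truth| over all entries of two equally shaped nested lists."""
    diffs = [abs(e - t)
             for est_row, true_row in zip(estimates, truth)
             for e, t in zip(est_row, true_row)]
    return sum(diffs) / len(diffs)


def average_over_repeats(mads):
    """Sample mean of the MAD values obtained in the repeated runs."""
    return sum(mads) / len(mads)
```

For example, estimates of (0.30, 0.70) and (0.80, 0.20) against true proportions (0.25, 0.75) and (0.75, 0.25) give a MAD of 0.05.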

Previous studies, including this one, have almost exclusively considered microarray gene expression data. However, due to recent revolutionary improvements in sequencing techniques, gene expression measurement by sequencing, or RNA-seq (Wang *et al.*, 2009; Wilhelm and Landry, 2009), has become a serious competitor to standard probe-based microarray alternatives, not only due to the increased genome coverage offered by RNA-seq, but also due to increased measurement reproducibility (Marioni *et al.*, 2008). Although data preprocessing and normalization steps differ between microarray and RNA-seq data, there are no fundamental factors that would directly make current modeling approaches obsolete. In fact, since a strong linear relationship between RNA concentrations and sequence reads has been reported (Mortazavi *et al.*, 2008), in contrast to not-so-linear microarrays (Quackenbush, 2002), one would expect the modeling transition from array-based analysis to RNA-seq to be rather effortless for any model, including ours.

We propose a framework under which measurements, arising from heterogeneous tissues, can be analyzed without having to rely upon manual—and possibly time consuming—sample preprocessing steps such as LCM. Instead, DSection assumes that measurements contain profiles of all cell types of interest with varying proportions in the tissue samples. Furthermore, as without constraints this task would contain no unique solution for expression profiles and cell-type proportions, uncertain information is assumed to be available on the cell-type proportions. In realistic situations where information about cell-type proportions is extracted on the basis of, say, microscopy or flow cytometry, it is evident that such estimates are prone to inaccuracy. We showed that, under the Bayesian framework, not only the passing of uncertain information to our model is straightforward due to the notion of prior information, but also that our model is capable of ‘de-noising’ that uncertain information, thus resulting in more accurate overall modeling performance in contrast to traditional models without this functionality implemented.

The extraction of information about cell-type proportions was not addressed in this article, although it is a crucial part required to make the model work as intended. In real experiments, i.e. those including real tissue samples with unknown cell-type proportions, as opposed to data we used, such precise information as cell-type proportions does not exist. However, as our results suggest, prior information about the proportions of different cell types can be exploited in modeling even though the estimates of proportions would include uncertainty. Thus, including image-based prior estimation could provide a valuable addition into the current analysis framework, but in order to be useful the image analysis needs to be done in an automated manner. Numerous tissue image analysis methods have been presented in the literature, such as those in Kleiner *et al.* (2009); Newberg and Murphy (2008) and Strömberg *et al.* (2007), and incorporating similar methods as a part of the analysis pipeline is one of our main objectives.

Imposing *w*_{0} = 10 results in a lightly concentrated density surface around the prior cell-type proportions, **p**_{0j}, which along with the results suggests that having strong prior information, at least on cell-type proportions, is not required. However, constraining model parameters, albeit vaguely, is required as the model would otherwise become unidentifiable. If proportions for some cell types are missing, due to morphological indistinguishability, for instance, one could consider pooling those cell types together and modeling them as one; this approximation would be accurate only in cases where pooled cell types share similar expression profiles. On the other hand, if the precise value for *T* is debatable but cell-type proportions for different values of *T* are available, cross-validation, reversible-jump MCMC (Green, 1995), etc. could be utilized for determining the most suitable *T*.

Although the assumed linearity may not strictly hold for some or even most of the genes being considered, it is still expected that such a linear model can, to some extent, capture nearly linear responses with sufficient accuracy (Hoffmann *et al.*, 2006). In fact, during parameter estimation, we used Affymetrix data with and without log-transform (results shown here are for non-log data) with comparable accuracy in terms of ROC, suggesting that the linearity assumption indeed is fairly robust. Furthermore, Gaussian processes (Rasmussen and Williams, 2006) are currently under investigation as part of incorporating nonlinear responses into the model.

*Funding*: Academy of Finland (application number 134290; application number 213462, Finnish Programme for Centres of Excellence in Research 2006–2011); Finnish Graduate School in Computational Sciences; National Institutes of Health (application number NIHP50GMO76547).

*Conflict of Interest*: none declared.

^{1}Data in linear form is preferred as modeling assumptions may otherwise become violated; see Section 4 for further discussion.

- Abbas AR, et al. Deconvolution of blood microarray data identifies cellular activation patterns in systemic lupus erythematosus. PLoS One. 2009;4:e6098.
- Affymetrix (2009) Available at http://www.affymetrix.com/support/technical/sample_data/gene_1_0_array__data.affx (last accessed June 22, 2009).
- Andrieu C, et al. An introduction to MCMC for machine learning. Mach. Learn. 2003;50:5–43.
- Cowles MK, Carlin BP. Markov chain Monte Carlo convergence diagnostics: a comparative review. J. Am. Stat. Assoc. 1996;91:883–904.
- Efron B. Are a set of microarrays independent of each other? Ann. Appl. Stat. 2009;3:922–942.
- Emmert-Buck MR, et al. Laser capture microdissection. Science. 1996;274:998–1001.
- Gelman A, et al. Bayesian Data Analysis. Chapman & Hall/CRC; 2004.
- Gelman A. Prior distributions for variance parameters in hierarchical models. Bayesian Anal. 2006;1:1–19.
- Gosink MM, et al. Electronically subtracting expression patterns from a mixed cell population. Bioinformatics. 2007;23:3328–3334.
- Green P. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika. 1995;82:711–732.
- Hoffmann M, et al. Robust computational reconstitution - a new method for the comparative analysis of gene expression in tissues and isolated cell fractions. BMC Bioinformatics. 2006;7:369.
- Irizarry RA, et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003;4:249–264.
- Jacobsen M, et al. Deconfounding microarray analysis - independent measurements of cell type proportions used in a regression model to resolve tissue heterogeneity bias. Methods Inf. Med. 2006;45:557–563.
- Kleiner HE, et al. Tissue microarray analysis of eIF4E and its downstream effector proteins in human breast cancer. J. Exp. Clin. Cancer Res. 2009;28:5.
- Lähdesmäki H, et al. In silico microdissection of microarray data from heterogeneous cell populations. BMC Bioinformatics. 2005;6:54.
- Marioni JC, et al. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008;18:1509–1517.
- Mortazavi A, et al. Mapping and quantifying mammalian transcriptomes by RNA-seq. Nat. Methods. 2008;5:621–628.
- Newberg J, Murphy RF. A framework for the automated analysis of subcellular patterns in human protein atlas images. J. Proteome Res. 2008;7:2300–2308.
- Otsuka Y, et al. Correlating purity by microdissection with gene expression in gastric cancer tissue. Scand. J. Clin. Lab. Invest. 2007;67:367–379.
- Quackenbush J. Microarray data normalization and transformation. Nat. Genet. 2002;32(Suppl.):496–501.
- Quon G, Morris Q. Isolate: a computational strategy for identifying the primary origin of cancers using high-throughput sequencing. Bioinformatics. 2009;25:2882–2889.
- Rasmussen CE. The infinite Gaussian mixture model. Adv. Neural Inf. Process. Syst. 2000;12:554–560.
- Rasmussen CE, Williams CKI. Gaussian Processes for Machine Learning. The MIT Press; 2006.
- Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 2004;3 Article 3.
- Sooriakumaran P, et al. A novel method of obtaining prostate tissue for gene expression profiling. Int. J. Surg. Pathol. 2009;17:238–243.
- Strömberg S, et al. A high-throughput strategy for protein profiling in cell microarrays using automated image analysis. Proteomics. 2007;7:2142–2150.
- Stuart RO, et al. In silico dissection of cell-type-associated patterns of gene expression in prostate cancer. Proc. Natl Acad. Sci. USA. 2004;101:615–620.
- Venet D, et al. Separation of samples into their constituents using gene expression data. Bioinformatics. 2001;17(Suppl. 1):S279–S287.
- Wang Z, et al. RNA-seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet. 2009;10:57–63.
- Wilhelm BT, Landry J-R. RNA-seq-quantitative measurement of expression through massively parallel RNA-sequencing. Methods. 2009;48:249–257.
