|Home | About | Journals | Submit | Contact Us | Français|
RNA-seq, a next-generation sequencing based method for transcriptome analysis, is rapidly emerging as the method of choice for comprehensive transcript abundance estimation. The accuracy of RNA-seq can be highly impacted by the purity of samples. A prominent, outstanding problem in RNA-seq is how to estimate transcript abundances in heterogeneous tissues, where a sample is composed of more than one cell type and the inhomogeneity can substantially confound the transcript abundance estimation of each individual cell type. Although experimental methods have been proposed to dissect multiple distinct cell types, computationally "deconvoluting" heterogeneous tissues provides an attractive alternative, since it keeps the tissue sample as well as the subsequent molecular content yield intact.
Here we propose a probabilistic model-based approach, Transcript Estimation from Mixed Tissue samples (TEMT), to estimate the transcript abundances of each cell type of interest from RNA-seq data of heterogeneous tissue samples. TEMT incorporates positional and sequence-specific biases, and its online EM algorithm only requires a runtime proportional to the data size and a small constant memory. We test the proposed method on both simulation data and recently released ENCODE data, and show that TEMT significantly outperforms current state-of-the-art methods that do not take tissue heterogeneity into account. Currently, TEMT only resolves the tissue heterogeneity resulting from two cell types, but it can be extended to handle tissue heterogeneity resulting from multi cell types. TEMT is written in python, and is freely available at https://github.com/uci-cbcl/TEMT.
The probabilistic model-based approach proposed here provides a new method for analyzing RNA-seq data from heterogeneous tissue samples. By applying the method to both simulation data and ENCODE data, we show that explicitly accounting for tissue heterogeneity can significantly improve the accuracy of transcript abundance estimation.
The rapidly advancing next-generation sequencing based transcriptome analysis tool, RNA-seq, provides a comprehensive and accurate method for analyzing the entire RNA components of the transcriptome . The efficiency and sensitivity of RNA-seq make it a primary method for detecting alternatively-spliced forms and estimating their abundances [2,3]. However, estimating transcript abundances in heterogeneous tissues by RNA-seq remains an unsolved, outstanding problem because of the confounding effect from different cell types . Many tissue samples from native environments are heterogeneous. For example, tumor samples are usually composed of tumor cells and surrounding normal cells . Therefore, reads from an RNA-seq experiment of tumor samples will consist of contributions from both tumor and normal cells. Additionally, tumor tissues themselves are often heterogeneous, consisting of different subclones (e.g. breast cancer subtypes ), leading to even more complicated tissue environments.
Experimental methods have been proposed to address issues arising from contamination of different cell types, such as laser-capture microdissection , which allows dissection of morphologically distinguishable cell types. The mRNA content yield by this technology is consequently lowered, and needs to be compensated for, usually by molecular amplification. However, the nonlinearity induced by amplifying mRNA  has its own problems, and can make the expression profiles of distinct cell types less distinguishable, weakening the sensitivity of RNA-seq technology. Other experimental approaches, including cell purification and enrichment, are comparatively expensive and laborious . Therefore developing alternative in silico approaches to resolving the tissue heterogeneity problem, especially in cancer research, remains a major problem in RNA-seq analysis .
Research in computational approaches to resolving the tissue heterogeneity problem of different biotechnologies has a fairly long history [11-14]. The first attempt to computationally micro-dissect heterogeneous tissues for microarray expression data was based on a linear model , which estimated both cell-type proportion and gene expression level. Prior information regarding "marker genes", which are genes uniquely expressed in each cell-type, was incorporated into the linear model to identify distinct cell types. The linear model was extended with Bayesian prior densities of cell-type proportions , and a posterior sampling approach was then constructed for cell-type-specific expression profiling. A statistical testing method  was proposed for single nucleotide polymorphism (SNP) array based copy number alterations analysis from heterogeneous tissue samples. In this method, Bayesian differentiation between hemizygous deletion and homozygous deletion were used to infer the underlying normal cell proportion and copy number profiles of both normal cells and tumor cells. One common feature shared by these methods is that they all adopted probabilistic models, not only allowing prior information about different cell types to be smoothly incorporated into the models, but also taking advantages of the flexibility of probabilistic model to capture specific aspects of each data type.
To the best of our knowledge, no computational approaches have been proposed to resolve the tissue heterogeneity problem from RNA-seq data in a probabilistic fashion. Typically, researchers apply transcriptional profiling tools designed for homogeneous tissue samples directly to RNA-seq data from heterogeneous tissue samples. Subsequent estimation results are interpreted as transcriptional profiling of a particular single cell type of interest. Therefore, we ask whether it is possible to estimate trancriptive abundances of individual cell types from RNA-seq of heterogeneous tissues, by decoupling the contributions from multiple cell types. We propose a probabilistic model-based approach, Transcript Estimation from Mixed Tissue samples (TEMT) to address this question. Currently, TEMT requires two sets of single-end RNA-seq reads. One read set is from a heterogeneous tissue sample composed of two cell types, while the other is from a pure tissue sample composed of one of the two cell types. TEMT incorporates prior information of cell type proportion and can calculate probabilities of RNA-seq reads sampled from each cell type. Because TEMT implements an online EM algorithm , it has a time requirement proportional to the data size and a constant memory requirement. To further improve the estimation accuracy, TEMT also implements a bias module, which incorporates both positional bias [16-18] and sequence-specific bias [19,20].
To assess the performance of TEMT, we analyzed a series of both simulation and real data from ENCODE , and compared the transcript relative abundances estimation from TEMT to those obtained from other methods that do not take the tissue heterogeneity into account. Our results show that explicitly accounting for tissue heterogeneity can significantly improve transcript abundance estimation accuracy.
In this section, we first introduce the generative mixture model of TEMT. Combined with cell type proportion as prior information, we propose a maximum a posteriori estimation approach for finding model parameters. Next, we explain how to incorporate a positional and sequence-specific bias module into the model. Finally, we introduce an online EM algorithm for parameter estimation, reducing the time complexity to be proportional to the data size and the space complexity to be constant.
We focus on transcript abundance estimation. Denote as a set of reference transcripts, which we assume is known and complete. Let denote the length of transcript t in the set with , where T is the total number of transcripts in the reference set. Suppose we are interested in transcriptome analysis in two cell types: a and b. Let and denote the relative transcript abundance of transcript t in cell type a and b, respectively, with . We assume are properly normalized such that and .
We assume RNA-seq reads are available in two samples: one consisting of cells of only type a, which we call the "pure sample", and the other consisting of cells of both type a and b with percentage from cell type a and from cell type b, which we call the "mixed sample." In the cancer transcriptome analysis, cell type a can represent normal cells as it is usually easy to obtain a pure tissue sample, while cell type b can represent tumor cells as most tumor tissue samples are contaminated by normal cells.
Because the pure sample consists of only cell type a, its relative transcript abundance is described by for all t. However, the relative abundance of transcript t within the mixed sample is a weighted sum of the transcript abundance of both cell type a and b
Denote the read set from the pure sample by and the read set from the mixed sample by. Our goal is to estimate the relative abundance of each transcript in the reference set T from the RNA-seq read data and in both cell type a and b
We first map reads to the reference transcript set and convert the raw read data into a corresponding alignment representation. Denote the alignment representation of the read set by =, where if read i from aligns to transcript t and 0 otherwise, and Np is the total number of reads in read set . The alignment representation is similarly defined for read set from the mixed sample. Note that one read might map to multiple transcripts due to alternative splicing, sequence similarity shared by homologous genes, or other reasons. As a result, the summation of over all transcripts may be bigger than 1 for some i. These "ambiguous reads" introduce a major source of uncertainty into transcript abundance estimation.
We model the sequencing of reads as a sampling process, randomly chooses a transcript t from the reference transcript set according to its relative abundance and effective length, and then generates a read from a random location of the chosen transcript. Under this model, the probability of a read originating from transcript t is
with s being either p for the pure sample or m for the mixed sample. Here, is the effective length of transcript t, which quantifies the number of positions at which a read can start within transcript t. Different methods have been proposed to model the effective length [19,22]. In TEMT, the effective length is modeled with consideration to the length distribution of RNA-seq fragments 
We assume the fragment length x has a normal distribution with mean µ and variance , and is the normal probability density function of. By renormalizing , we obtain the discrete distribution of all possible fragment lengths. The effective length is then the expectation of the number of positions a read can start within transcript t, based on the discrete distribution of fragment length.
Suppose a read is generated uniformly from each location covered by the effective length of each transcript. Then the probability of observing read i as represented by its alignment map is
for s = p or m.
Assume each read is generated independently in both the pure and the mixed samples. The likelihood of observing the read set from the pure sample and from the mixed sample is then described by
We are interested in estimating the relative transcript abundances set , but since it can be uniquely defined by the read sampling probability set ,
We can directly estimate the read sampling probability set from the likelihood function Equation (5) instead. Note that, again for all t as it is the parameter of pure sample, but unlike the linear form in Equation (1), in terms of is given as a nonlinear form
Where, the factor induce the nonlinearity. But due to the averaging effect of the large number of transcripts, practically lies within 1 ± 0:05. So we approximate with the linear form
As it brings computational convenience in the following learning step.
Finally, we define
as the parameters of our model. The likelihood in Equation (5) can be then expressed as
Several analysis have noticed the identifiability problem [12,13] in estimating cell type specific expression in heterogeneous tissue samples. Ideally, if the proportion information for some cell types is missing, we can then pool these cell types as one type, making the expression of each individual cell type inside unidentifiable. Previously, prior constraints have been used to resolve the problem [12,13]. In our model, the prior knowledge of cell type proportions is combined with the model likelihood, and we subsequently use maximum a posteriori (MAP) estimation to find the optimal parameters.
Specifically, we place a Beta distribution as the prior for cell proportions of type a and type b. The parameter quantify the location and sharpness of the prior. Practically, we found setting 10 times as the data size gave a good convergence rate and accuracy. Combining the prior with the likelihood given in Equation (11), the posterior distribution of the model is proportional to
Both positional [16-18] and sequence-specific [19,20] sequencing biases have been observed in next generation sequencing data. These biases mainly result from non-uniformly distributed cDNA fragments during the RNA-seq library preparation . Under positional bias, reads positioning is not uniformly distributed across the effective length of the target transcript, but preferentially distributed around either the 5' end or the 3' end of the target transcript. Under sequence-specific bias, the sequences near the two ends of the fragments affect their probability to be sequenced. To account for these non-uniformity effects during transcript abundance estimation, we incorporate the bias module of  into our model.
In order to further describe the local alignment context, we define another two sets of variables. Specifically, for read i from either read set or , we denote as the starting position of the alignment within transcript t relative to the 5' end of the strand. We also denote , where , as the local sequence of transcript t with length L and centered at .Then we define the bias weight as
for s=p or m.
This bias weight is essentially the ratio of the probability of observing and under the bias model to the probability under the uniform model. If no bias exists, the weight reduces to 1. The bias re-weighted Equation (4) is then:
To calculate the bias weight, we use the bin method and Markov chain for positional bias and sequence-specific bias respectively. Complete details can be found in the Supplementary (Additional file 1). The final unnormalized posterior distribution of the model is then described as
Where and are the bias weights computed based on read set and . The directed graphical model of TEMT is shown in Figure Figure1.1. The estimated parameters are given by
We solve the maximum a posteriori problem in Equation (16) using the Expectation-Maximization (EM)  framework. For each read i from read set of pure sample, we denote the latent variable of the transcript alignment representation as , where if read i aligns to transcript t and 0 otherwise. But now , which means only one , indicating read i is actually originating from transcript t. Similarly, for each read i from read set of mixed sample, we denote the latent variable of the transcript alignment representation as , where if read i aligns to transcript t and is originating from cell type a within the mixed sample, and 0 otherwise. or 0 is similar defined for cell type b. Thus means read i is actually originating from only one transcript, and either from cell type a or b within the mixed sample. We also define the auxiliary variable , and as the conditional probability weight of each latent variable , and conditional on model parameters Θ and the observed read alignment representations . Then based on Jensen's inequality , the complete posterior distribution, which is also the lower bound of Equation (15) can be written as
In which C is a normalizing constant and the equality holds only if the conditional probabilities , , are the true posterior distributions of latent variables .
The EM framework maximizes Equation (17) by iteratively applying the expectation step and the maximization step to update both the conditional probabilities , , and model parameters Θ until convergence. The expectation step of typical batch EM algorithm has to fetch all the data points into memory, and calculates the conditional probabilities based on the average of all the data points. While this batch method guarantee's the log-likelihood function to monotonically increase, it also induces inefficiency in both time and space complexity. Considering the high-throughput nature of next-generation sequencing technology as well as its huge data size, we implemented the EM algorithm in an online fashion  to both lower the memory requirement and boost the convergence rate.
The main difference between the batch EM and the online EM is in the E-step. The E-step of the online EM algorithm first calculates the conditional probabilities of only one new data point, and then updates the conditional probabilities of all the current data points by interpolating between the conditional probabilities of all the previous data points and the conditional probabilities of the new data point, with a forgetting factor σ controlling the convergence rate.
It is shown in  that with the constraint 0.5 < σ ≤ 1, the online EM algorithm is asymptotically equivalent to stochastic gradient ascent, and is guaranteed to converge to the maximum likelihood estimator, which is extended to the maximum a posteriori estimator in our model.
Specifically, the online EM updates in our model is given by
In Equation (18-20), we compute the conditional probabilities , , of just one new read i + 1 based on previous parameter estimation , , ; In Equation (21-23), we compute the new conditional probabilities average , , by interpolating between the previous conditional probabilities average , , and , , . n is the index of iteration step and i is the index of data points. σ is the forgetting factor which controls the convergence rate, with the constraint 0.5 < σ ≤ 1.
In the subsequent M-step, parameters , , , are updated according to new conditional probabilities average , , .
Next we test the performance of the proposed method on both simulation data and the recently released ENCODE data . For both datasets, we used the following three-step protocol and parameters to construct the analysis:
1. We aligned the raw read set from either simulation or the ENCODE data to a given transcript set using bowtie-0.12.7 . For each read, we allowed 2 mismatches and reported at most 10 candidate alignments.
2. The abundance of each transcript in terms of estimated counts was estimated via both TEMT and a control model. Estimated counts is defined as the estimated number of reads generated from the target transcript. In TEMT, the prior of each cell type proportion was set to the same as the proportion used in simulation and ENCODE data respectively, and βa, βb was set to 10 times the size of the read set . μ = 200; σ = 80 were used as the mean and standard deviation of the RNA-seq fragment length distribution. We chose eXpress-0.9.4  as the control model, as it is the state-of-the-art method for transcript abundance estimation and also utilizes an online-EM algorithm. Note that, to run TEMT, we need two read sets, in which one is for the pure sample and the other is for the mixed sample, as previously mentioned. In contrast, to run eXpress, we only need one read set from either the pure sample or the mixed sample. The forgetting factor for the on-line EM algorithms in both TEMT and eXpress was set to be σ = 0:85, and the error-model in eXpress was disabled for comparison.
3. To measure the model accuracy, we used the Error Fraction (EF) measure introduced by  to quantify the discrepancy between the model estimates and the ground truth estimates. The Error Fraction is defined as the fraction of transcripts for which the estimates are significantly different (percent error >10% in our case) from the ground truth.
To show the utility of TEMT, we first carried out a series of simulation studies. To obtain simulated read sets, we used FluxSimulator , a software for transcriptome and read generation by simulating the biochemical processes underlying the library preparation. FluxSimulator requires a reference transcript set to start the simulation process, so we manually downloaded 406 transcripts of 208 alternatively spliced genes in human from Alternative Splicing Structural Genomics Project (AS3D) , and used these 406 transcripts as the reference transcript set. We first simulated the transcript expression process twice producing two sets of relative transcript abundances, corresponding to cell type a and b respectively. Based on these two transcript abundance sets, we then simulated 6 pairs of 1 million 75-bp single-end read sets corresponding to six different cell type b proportions from 40% up to 90%. The relative transcript abundances of cell type a and b were kept the same throughout these simulations. For each paired read set, one read set is for the pure sample composed of only cell type a, whereas the other read set is for the mixed sample composed of both cell type a and b, mixed with the cell type b proportion. Within the mixed-sample read set, we also extracted the reads simulated purely from cell type b, which was used for control model eXpress.
The simulated data are analyzed with the bias module both enabled and disabled. Surprisingly, the positional and sequence-specific bias module did not improve the accuracy of the transcript abundance estimation as measured by the Error Fraction of estimated counts in both TEMT and eXpress. This result may due to the stochasticity during the simulation of FluxSimulator. So we only present the results with the bias module disabled in both TEMP and eXpress in Figure Figure22.
We note that the estimates of cell type a from TEMT achieve roughly the same accuracy, compared with the estimates from eXpress based on the read set of the pure sample of cell type a. Also, this accuracy does not change significantly under the effect of different cell type b proportions. This is mainly due to the pure sample read set of cell type a within the input data for TEMT.
The accuracy of the estimates of cell type b from TEMT is also shown in Figure Figure2,2, which shows that TEMT generally outperforms the direct estimation method. To the best of our knowledge, there are no computational tools similar to our model that can estimate the relative transcript abundances of cell type b via RNA-seq data generated from mixed samples. Typically, computational methods are applied directly to the noisy data of mixed samples and results are interpreted as the estimates of cell type b. To compare the estimates of cell type b from TEMT with direct estimates using the current method, we applied the control model eXpress directly to the read set of the mixed sample. The estimated counts from eXpress were then compared with the true counts from another 1 million simulated read set purely of cell type b, while keeping the same relative transcript abundance as the previous simulations. The corresponding Error Fractions are shown as the red line in Figure Figure22 regarding different cell type b proportions. Although the accuracy of cell type b estimates from TEMT is affected by different cell type b proportions, it is generally better than the direct estimates. This can be further illustrated in Figure Figure3,3, which shows that the direct estimated counts of cell type b from eXpress deviate more from the true counts as the cell type b proportion decrease, while the estimates of TEMT have much reduced deviation. We notice that as the cell type b proportion gradually decreases, the accuracy of the estimates of cell type b from TEMT also decreases. This is the result of the contamination effect from the cell type a within the mixed sample. A recent paper  also observed this similar phenomenon when studying copy number aberrations from heterogeneous tumor tissue.
Next we analyzed the recently released ENCODE data. Due to the lack of RNA-seq data sampled from mixed tissue samples with known cell type proportions, we artificially generated the mixed-sample read sets by mixing reads obtained from two different cell types. Specifically, we chose two Tier 1 cell lines, GM12878 and K562, and treated them as cell type a and cell type b respectively. The corresponding single-end RNA-seq data of these two cell lines, GM78 1×75D A 1 (UCSC Accession: wgEncodeEH000125) and K562 1×75D A 1 (UCSC Accession: wgEncodeEH000126) from the Wold lab  at Caltech, were download from ENCODE (2012). The data downloaded from the same lab under similar protocols is intended to reduce the deviation resulting from experiments. We then randomly selected 10 million reads from GM12878 cells to form the read set of the pure sample, and 10 million reads from both GM12878 and K562 cells using different K562 cells proportions to form the read set of the mixed sample. Similar to the previous simulation study, we extracted the reads purely selected from K562 cells within the mixed sample, and used them for the eXpress control model. We studied 6 different K562 cells proportions from 40% to 90% in order to compare with the previous simulation study. 36908 human RefSeq  transcripts from UCSC known genes  were used as the transcript set for the ENCODE data.
One major issue in studying the ENCODE data is that the ground truth of relative transcript abundance in each cell type is unknown. We used the estimates from eXpress based on the GM12878 and K562 pure samples as the ground truth. Again, the bias module was disabled for both TEMT and eXpress. The general result of ENCODE data is shown in Figure Figure4.4. Similar to the simulated data, the indirect estimates for K562 cells from TEMT generally outperforms the direct estimates from eXpress based on the read set of the mixed sample. The contamination effect from cell type a within the mixed sample observed in Figure Figure33 is also seen in the eXpress analysis of ENCODE data, while TEMT does not have this issue. Note that the measure of relative transcript abundances as shown in the red line of Figure Figure44 is no longer estimated counts, but reads per kilobase of transcript per million mapped reads (RPKM), as the total number of reads from K562 cells within the mixed sample is less than the total number of reads of the mixed sample, so that normalization is necessary for comparison. We notice TEMT underperforms direct estimates from eXpress when K562 cells proportion equals 90%. Possibly the contamination effect of GM12878 cells within the mixed sample is not severe enough at this point, as we can imagine the red line in Figure Figure44 will finally reach 0% Error Fraction when K562 cells proportion reaches 100%. On the other hand, since the estimates from eXpress based on the pure sample are considered the ground truth, the lower bound Error Fraction of K562 cells estimates from TEMT should be the same as the Error Fraction of GM12878 cells estimates, which is around 20% to 30% in Figure Figure44.
We formulated our model under the assumption that the heterogeneous tissue is only composed of two cell types, but in reality, a heterogeneous tissue might be much more complicated, consisting of multiple cell types. To relax this constraint, our model needs to be further extended to analyze more complex cases in which each cell type may have its own subtypes, e.g. breast cancer subtypes, leading to a more sophisticated heterogeneous tissue environment. Further dissecting cell subtype heterogeneity is the next step in refining our model. Moving from two cell types to arbitrarily many cell types is of great interest, since it may substantially facilitate transcriptome study of heterogeneous tissues.
One critical component necessary to make our model work is the prior information of cell type b proportion, which is necessary to resolve the identifiability problem of mixed samples. In real experiments, precise prior information regarding cell type proportions may be unavailable. One solution in the context of our model is to down weight the effect of the prior by decreasing the parameter βa, βb, which adds more uncertainty to the cell mixture proportion. However, this approach may decrease the performance of the model as the uncertainty in cell mixture proportion cannot be distinguished from the uncertainty in transcript abundance estimation. This observation suggests another direction to further improving our model which is to solely estimate cell type b proportion without the prior information. To fulfill this requirement, the identifiability problem needs to be resolved as mentioned in section 2.3, which turns out to be comparatively hard for RNA-seq data. Unlike the heterozygous and homozygous deletions in , which can be utilized to differentiate between the SNP array data generated by normal cells and tumor cells, there are no such explicit differences between the reads generated by distinct cell types in RNA-seq data, thus making the generative mixture model unconstrained. The "marker genes" method proposed by , which tries to distinguish distinct cell types by utilizing genes uniquely expressed in each cell type, provides a future potential direction to extend the current model.
In this article, we propose a probabilistic model-based method TEMT to estimate transcript abundance of individual cell types based on RNA-seq data from heterogeneous tissue samples. TEMT utilizes prior information to distinguish reads generated by each cell type within the heterogeneous tissue sample. Positional and sequence-specific biases are also incorporated to improve estimation accuracy. TEMT is able to process large datasets as the online EM algorithm is adopted to guarantee a time complexity proportional to the data size and a constant space complexity. Our experiments on both simulated datasets and ENCODE data shows that explicitly accounting for tissue heterogeneity can significantly improve the accuracy of transcript abundance estimation.
The authors declare that they have no conflict of interests.
Designed the experiments: LY and XX; Performed the experiments: LY; Wrote the paper: LY and XX; All authors contributed to the analysis, and approved the paper.
Supplementary. Complete details for calculating positional and sequence-specific bias weights.
We gratefully acknowledge helpful discussions with Jake Biesinger, Daniel Newkirk and Ali Mortazavi. The work is partly supported by National Institute of Health grant R01HG006870.
Publication of this article was supported by National Institute of Health grant R01HG006870.
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 5, 2013: Proceedings of the Third Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMB-seq 2013). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S5.