|Home | About | Journals | Submit | Contact Us | Français|
Motivation: Next-generation (NextGen) sequencing is becoming increasingly popular as an alternative for transcriptional profiling, as is the case for micro RNAs (miRNA) profiling and classification. miRNAs are a new class of molecules that are regulated in response to differentiation, tumorigenesis or infection. Our primary motivating application is to identify different viral infections based on the induced change in the host miRNA profile. Statistical challenges are encountered because of special features of NextGen sequencing data: the data are read counts that are extremely skewed and non-negative; the total number of reads varies dramatically across samples that require appropriate normalization. Statistical tools developed for microarray expression data, such as principal component analysis, are sub-optimal for analyzing NextGen sequencing data.
Results: We propose a family of Poisson factor models that explicitly takes into account the count nature of sequencing data and automatically incorporates sample normalization through the use of offsets. We develop an efficient algorithm for estimating the Poisson factor model, entitled Poisson Singular Value Decomposition with Offset (PSVDOS). The method is shown to outperform several other normalization and dimension reduction methods in a simulation study. Through analysis of an miRNA profiling experiment, we further illustrate that our model achieves insightful dimension reduction of the miRNA profiles of 18 samples: the extracted factors lead to more accurate and meaningful clustering of the cell lines.
Availability: The PSVDOS software is available on request.
Supplementary information: Supplementary data are available at Bioinformatics online.
Gene expression profiling is at the center of targeted therapy and rapid disease diagnosis. High-throughput or NextGen sequencing has recently emerged as an alternative platform to hybridization-based microarrays for the purpose of gene transcription profiling. For example, Witten (2011) claims that NextGen sequencing is ‘on track to replace microarray as the technology of choice’ for characterizing gene expression.
NextGen sequencing data have several features that create statistical challenges. First of all, sequencing data record the number of reads between a sample and a particular region of interest, which are naturally skewed non-negative counts with a large number of zeros. Second, the nature of the sequencing experiment, such as technical sequence lane capacity, can result in different samples with dramatically different total number of sequence reads, which suggest that the samples need to be normalized in a certain way. It is well established that for high-throughput sequencing data applications, Poisson distribution represents an appropriate choice (Chen et al., 2008; Jiang and Wong, 2009; Srivastava and Chen, 2010). However, the predominant form of sequencing data analysis is to forcefully transform the data and then use statistical tools that were initially developed for microarray data, which are continuous and reasonably modeled using normal distributions (with or without transformation). This leads to sub-optimal results and prompted the development of Poisson-based methods (Bullard et al., 2010; Marioni et al., 2008; Witten, 2011).
Different from existing Poisson-based methods for analyzing sequencing data, we focus on dimension reduction, i.e. identifying low-dimensional features or factors in the data. Genetic studies usually involve a large number of genetic markers (e.g. thousands of genes) for a small group of samples (e.g. individuals, tumor samples or virus-infected cells), which face the ‘curse-of-dimensionality’. Dimension reduction is thus a desirable (and sometimes necessary) pre-processing step, and the identified features can then be used as inputs for unsupervised clustering. In this article, we specifically take into account the Poisson nature of NextGen expression profiling count data and develop a new family of Poisson factor models for efficient dimension reduction on a collection of sequencing samples. Our approach extends the earlier work of Shen and Huang (2008) to automatically address the issue of sample normalization through the use of unknown offset parameters, which are simultaneously estimated along with the underlying factors. Model identification constraints are derived and incorporated in an efficient alternating estimation algorithm. We also introduce Poisson factor models to sequencing analysis and consider follow-up clustering analysis.
The rest of the article is organized as follows. We present our model in Section 2 and develop a computationally efficient estimation algorithm. A simulation study is reported in Section 3.1 to demonstrate the performance of our algorithm, as well as several methodological details. In Section 3.2, we then illustrate its performance through our primary motivating application—micro RNA (miRNA) profiling. miRNAs are a class of 21–25 nt non-coding RNAs that are able to post-transcriptionally regulate gene expression (O’Hara et al., 2008, 2009a, b). In addition to being heavily skewed, miRNA NextGen sequencing data are also sparse because in a typical cell, <1% of all known miRNAs are expressed, and 99% are not. The numerical illustrations suggest that our method results in accurate extraction of key features from the sequencing data, which then leads to more meaningful clustering of various experimental cell lines. We conclude the article with some discussion of future work in Section 4.
We first review standard factor models that can be used to analyze NextGen sequencing data, and then propose our Poisson factor model. Consider an sequencing data matrix where the n rows correspond to samples (cell lines), the m columns correspond to the different genetic markers (e.g. miRNAs) and the entry records the read count of the jth miRNA from the ith cell line. Denote the ith row of Y as , referred to as the count profile of the ith sample.
Consider the following K-factor model on the count profiles :
where is the kth factor and is the corresponding score. For model identifiability, the factors are assumed to be orthonormal.
The factor model can be estimated through either principal component analysis (PCA) of Y or equivalently its singular value decomposition (SVD). PCA is a classical non-parametric linear dimension reduction technique that can be used for estimating the factor models, and the outcomes of PCA are often considered as inputs to unsupervised clustering analysis. PCA seeks to lower dimensions to a smaller number of components that capture most of the relevant structure in the data. It is particularly useful in identifying clusters of related samples, e.g. tumor subtypes based on gene expression levels, or in identifying clusters of co-regulated genes in a collection of different samples.
Not surprisingly, many investigations use PCA/SVD-based approaches in genetic studies to infer significant population or genetic structures (Holter et al., 2000; Lee et al., 2010; Liu et al., 2003; Patterson et al., 2006; Price et al., 2006; Simon et al., 2004; Witten et al., 2010). Most of genetics modalities, such as microarray data, include systematic variations caused by non-biological sources, e.g. instrument error. For PCA to be more effectively, normalization or transformation is a common pre-processing step to reduce the non-biological variation (Bowtell and Sambrook, 2003; Cui et al., 2003). Previous studies have used other variants of PCA for non-normal data, such as generalized PCA for exponential family (Collins et al., 2002; Roy and Gordon, 2002) and sparse non-negative generalized PCA (Allen and Maletić-Savatić, 2011). We show that our approach augments the repertoire of tools for the analysis of extremely sparse multi-dimensional count data, such as those encountered in RNA and miRNA sequencing experiments.
Note that the raw RNAseq counts are often skewed. Hence, in practice, PCA/SVD is usually applied to transformed RNAseq data. For example, one option is to first normalize the data through the following cube-root transformation (Gentleman, 2005):
where IQR stands for the inter-quartile range. One can also use relative frequency profiles of miRNA-seq data where the miRNA count profile of each sample is divided by the total number of hit counts across all miRNA targets for that sample, i.e. the row count, and then apply SVD to the centered relative frequency data. Alternatively, one can apply quantile normalization (Bolstad et al., 2003) before SVD. We refer to these methods as SVD-Cuberoot, RSVD and QN-SVD, respectively. Although their implementations are fairly straightforward, such transformations ignore the distributional nature of the data, and potentially can lose important features of the data (Witten, 2011). Our numerical comparisons (Section 3) show that they perform inferiorly to our proposed Poisson Singular Value Decomposition with Offset (PSVDOS) method.
As discussed earlier, NextGen sequencing data exhibit special features that are not seen in hybridization microarray data, which create statistical challenges that need to be addressed. Later in the text we propose a new class of Poisson factor models with offsets to explicitly incorporate the special features: the Poisson count nature, the abundance of zero reads and the need for sample normalization.
We consider Poisson factor models within the generalized linear model framework and simultaneously incorporate normalization and dimension reduction. We assume that the read count is a Poisson random variable with rate , and let denote the hidden Poisson rate matrix. Specifically, we consider the following Poisson factor model:
where the scalar is the offset parameter for the ith sample, is the normalized proportion of the jth miRNA in the ith rate profile is the canonical link function for Poisson variables used in generalized linear model, is the kth factor score for the ith rate profile and is the kth factor. For the identifiability of Model (3), the factors need to satisfy some constraints as discussed in the Supplementary Material.
To estimate Model (3), we propose to maximize the corresponding Poisson likelihood. The offset parameters, the factors and their scores are all unknown, which makes direct likelihood maximization over all the unknown parameters challenging. Hence, we consider an alternating maximum-likelihood algorithm to estimate the parameters.
Assuming that the factors are known, for each row i, we can estimate the offset parameter and the factor scores by fitting a log-linear Poisson regression model with the ith count profile as the response, the factors as the covariates and the offset parameter as the intercept. The parameters are estimated via iteratively re-weighted least squares (IRLS), which has nice convergence properties (McCullagh and Nelder, 1989). Then, given and the factor scores , we estimate the factors by fitting a log-linear Poisson regression model with the jth column count profile as the response, as the fixed offset and the factor scores as the covariates. The identifiability constraints on the factors are incorporated into the IRLS algorithm.
The previous discussion suggests the following iterative algorithm for estimating the Poisson factor Model (3). We refer to the algorithm as PSVDOS, in the sense that the algorithm extends the SVD algorithm for fitting the standard factor Model (1) to incorporate Poisson distributions with offset parameters. Note that, although PCA and SVD make no distributional assumptions of the data Y, there exist some theoretical justifications for using PCA and SVD when the data are approximately normally distributed: the SVD estimates then are in fact the maximum-likelihood estimates of Model (1) (Gabriel and Zamir, 1979). The alternating algorithm increases the likelihood function at each iteration and is guaranteed to converge because of the convexity of the optimization criterion at every step. At the end of each iteration, we apply SVD to , where and . This process ensures the uniqueness and the orthogonality of the updated components. The code is written in R (Chambers et al., 1992), using the built-in glm function. In our numerical studies, the algorithm converges within 30 iterations on average.
We make three comments regarding the offset parameters and selection of the number of factors. First, the row-centering in Step 3 enforces the identifiability of the offset parameters. See Supplementary Materials for details. Second, sometimes it makes sense to assume the offsets as known from a priori knowledge. For example, one can treat the total read count of a sample as the offset. In that case, there is no need to update or estimate the offsets as part of the aforementioned PSVDOS algorithm. Finally, in practice, the number of factors K needs to be selected in a data-driven fashion. We propose to use the deviance reduction-based approach suggested by Shen and Huang (2008). More details are given in Section 3.1.
We illustrate the performance of our proposed PSVDOS method through a simulation study (Section 3.1) and an analysis of an miRNA-sequencing dataset (Section 3.2). We compare PSVDOS with five other SVD-based methods:
The comparison will illustrate the shortcomings of ignoring the Poisson count nature of the data, as well as the necessity of incorporating sample-specific scaling effect through offsets. Both numerical studies suggest that PSVDOS performs the best. We also show that our data-driven approach can select the number of underlying factors accurately and stably, and that the PSVDOS algorithm can estimate the offset parameters accurately.
We generate a synthetic miRNA-seq dataset according to Model (3). The Poisson rate matrix follows
where the offset matrix , and the proportion matrix P (in log-scale) follows a four-factor SVD model, with the left singular vector matrix U, the right singular vector matrix V and the diagonal singular value matrix S containing the four positive singular values as its diagonal entries.
We simulate different samples measured on miRNAs. Figure 1a displays the true offsets in the log-scale, which are generated as follows: in the log-scale, they fall into six clusters with distinct cluster means (gray dotted lines) and are uniformly distributed around the cluster mean within each cluster; the overall mean across the clusters is 4 (solid line). Each offset’s cluster membership is displayed with different colors and markers.
The four diagonal elements of S, i.e. the singular values, are . The four columns of U and V, i.e. the left and right singular vectors, are plotted in the columns of Figure 1b and c, respectively. Each row in Figure 1b corresponds to one sample, which depicts four different clustering patterns in the samples. Figure 1c shows the heat map of the right singular vectors in the 200 × 4 matrix V or the miRNA factor profiles, embedded with certain clustering patterns. Each column indicates one miRNA factor profile, and each row corresponds to one miRNA. Additional details regarding the data generation can be found in the Supplementary Materials.
Figure 2a displays the true used in the simulation using a blue (negative)–red (positive) color coding. The color bar on the left side of the heat map shows the clustering membership of the rows, which fall into six clusters as indicated by the six colors (red, cyan, green, blue, magenta and yellow). The clustering pattern is obtained by applying the complete linkage hierarchical clustering analysis on the rows (Eisen et al., 1998; Wilkinson and Friendly, 2009). Similarly, the color bar above the heat map shows the cluster membership of the miRNAs, as represented by the color coding and the corresponding clustering dendrogram plot.
All six methods are applied to the synthetic data to extract the first four SVD components, which result in best-rank-four approximation for the underlying signal, as plotted in Panels b–g of Figure 2. We superimpose the hierarchical clustering results based on the rows and columns, respectively, of the estimated signal matrices. To better illustrate the accuracy of the clustering, we use the ordinal cluster membership contained in to color code the rows and the columns in the dendrogram plots. PSVDOS (Panel b) offers the best approximation to the true signal, and it gives the most accurate clustering result, which is as expected, as the dataset was designed in a way that takes advantage of the unique features of PSVDOS.
Figure 2c shows that PSVD incorrectly separates red and purple clusters, and it cannot separate the blue and green clusters. As the original PSVD algorithm does not take into account abundance sequence depth, the variation of the total number of reads was reflected in the first components. SVD-Raw in Figure 2d performs badly overall, as it is driven by absolute abundance. As the data were generated from Poisson distribution, exponential relationship is naturally imposed. Thus, in the absence of any normalizing previous transformation, SVD cannot recover the underlying pattern. SVD-Cuberoot in Figure 2e improves on SVD-Raw by reducing the influence of estimated sample/miRNAs, but the clustering result of the rows differs significantly from the truth signal. RSVD (Fig. 2f) fails to separate the blue and green clusters. QN-SVN in Figure 2g performs well for bi-clustering, but worse than the PSVDOS. This indicates that quantile normalization is not sufficient to overcome the extreme differences in sequence depth, even though it does take into account abundance to improve the clustering results. Heatmaps of the estimated singular vectors can be found in the supplementary document.
We use the deviance reduction plot as suggested by Shen and Huang (2008) to choose the number of underlying factors, which is a likelihood-based extension of the screen plot. Figure 3a displays the deviance reduction by the number of factors for one particular simulated dataset (dotted line), which shows how much additional data ‘variability’ can be explained by every extra factor. The elbow of the deviance reduction plot suggests that four factors are sufficient for modeling the data. To better illustrate the ignorable contribution of the additional factors, a zoomed-in version of the plot is included in the Supplementary Material.
This way of selecting the number of factors is stable. We repeated the simulation 100 times and obtained the deviance reduction plot for each simulation run. The horizontal gray lines represent the pointwise 95% intervals of the 100 obtained deviance reduction plots (gray lines) by the numbers of factors. The deviance reduction plots all have an elbow when the model includes four factors.
We now demonstrate that PSVDOS can accurately estimate the offset parameters, the ’s and their overall mean (in log-scale). For each of the 100 simulation runs, we applied PSVDOS with four factors (as suggested by the deviance reduction plot) to obtain the estimates for . For each , we calculated the 2.5 and 97.5% quantiles of , which provide the empirical 95% confidence intervals (CI) for the differences, as depicted by the vertical lines of Figure 3b. All CIs contain the value zero (the dotted line). Similar 95% CI is plotted for the overall mean in the Supplementary Materials.
We also investigated how the bias of the offset estimates depends on the number of factors K included in the model. As shown in the Supplementary Materials, the estimates seem to be biased when K is under-estimated, i.e. <4, whereas the estimates become less biased as K increases; eventually when K is ≥4, the offset estimates do not have any bias.
We first illustrate the application of PSVDOS to NextGen-based miRNA expression profiles of different virus-infected samples (the validity of our approach will be further established through a simulation study in Section 3.1.). The miRNA-sequencing data were collected on a series of samples that were infected with human and non-human primate herpesviruses. One such virus, the monkey B virus, is fatal to humans, whereas its relative, the human herpes simplex virus, only causes cold sores in 99% of the infections. These experiments (to be described elsewhere) were designed to identify novel biomarkers that can differentiate between harmless and fatal exposures. Because viral infection is deduced based on the transcription profile of hot miRNA, this indirect approach is particularly useful if the infecting virus is not known or even entirely novel.
In this particular application, we want to test the hypothesis (i) that the cells are infected by a virus of the family α-herpesviruses, and (ii) does the infecting virus have biological consequences similar to herpes simplex virus, in which case, the patient develops minor skin ulcerations, i.e. cold sores, or to monkey B virus, in which case the patient dies within 7 days. PSVDOS sensitively revealed meaningful clusterings among the samples, as well as the corresponding miRNA markers that can be potentially used to differentiate the samples.
A key characteristic of this dataset is the even greater sparsity of the data and more limited depth of the count data compared with the simulated dataset in Section 3.1 because of the expense associated with comprehensive NextGen sequencing. This makes for an important practical role. PSVDOS outperformed other methods under generous experimental constraints. The ratio of samples to targets was 40:200 or 1:5. Under those experimental parameters, PSVDOS recovered the ‘true’ data structure. As there are >2000 human miRNAs, a corresponding experiment would require at least 400 samples. Such large number of samples can be obtained and sequenced only by large consortia, e.g. Hudson et al. (2010). Most published studies use 1:50 ratios, for which we expect the practical benefit of PSVDOS compared with previous methods to be even more pronounced.
Briefly, the Illumina small RNA kit v1.5 was used to establish cDNA libraries of small RNAs from human fibroblasts infected with either human or non-human primate α-herpesviruses. These were the human herpes simplex virus 1 and 2 (HSV-1, HSV-2) and their homologous primate viruses for baboons, squirrel monkeys and macaques. The small RNAs libraries were then sequenced using the Illumina platform for single-end sequencing. As additional known controls, we used HUVEC and CHME cells, which were either mock infected or infected with an irrelevant, widely divergent virus, namely, West Nile virus (WNV). We used a slightly different method of isolation and purification for each cDNA library of small RNAs, as we tried different versions of the manufacturer’s kit. This is clearly not an ideal experimental design, as it introduces additional technical variation, but one that is typical for experimental science. Further samples were lymphoma (PEL), human tonsil and another cell line. The resulting reads for each sample were aligned to a human miRNA database (Kozomara and Griffiths-Jones, 2011).
The read counts for each of 398 miRNAs were obtained for each sample of cells infected with human and non-human primate α-herpesviruses (HSV-1, HSV-2, HVP2, SQHV, BV, SA8 and ChHV) and other control and infected cell lines and tissues (PEL-A, PEL-B, fibroblasts, tonsil, Control1, Control2, Control3, CHMEinfected, CHME5mock, HUVECinfected and HUVECmock). As the dataset was sparse indeed, in that many miRNAs have zero or small number of reads, only miRNAs with the total count over all the cell lines are included in the analysis, which means that 265 miRNAs are used in the analysis reported later in the text.
For these 265 miRNAs, the total counts of each cell line are displayed in Table 1, along with the number (%) of miRNAs with zero reads. Note that 50% of miRNAs had zero counts. This is typical for the biology of miRNA expression. The group of HSV-infected cells have ~3000–4000 total counts, the cells in the other group have mildly varying total counts, ~30 000–80 000 and others have total counts >900 000. The heterogeneity among the total counts, i.e. wide spread, is one of the features that PSVDOS is designed to accommodate.
As suggested by the deviance reduction plot (not shown here, also see Section 3.1), we extracted three factors using the methods except PSVD, which could only produce estimates for the first two factors. The legend in Figure 4 lists the cell lines numbered and colored according to their correct grouping. Panels (af) then display the scatter plots of the extracted factor scores for each method, respectively, using the corresponding numbers and colors provided in the legend. We then applied complete linkage hierarchical clustering (Eisen et al., 1998; Wilkinson and Friendly, 2009) to the cell lines based on the dimension reduced data, setting the number of clusters as six. The dendrograms plots are displayed on top of the corresponding scatter plots, where the leaves are colored according to the clustering results, whereas the labels are colored according to the legend, i.e. the correct grouping. Hence, a miss match between the colors of the leaf and the label would mean wrong clustering.
The plots suggest that PSVDOS correctly clusters all the HSV-infected cells together and groups the other cells according to their true subtypes. The fibroblasts sample is clustered with the HSV-infected samples. This actually represents the expected biological results, as the HSV samples represent fibroblasts that were infected with the different herpes simplex viruses, whereas the other samples stem from different tissues of origin. This is also apparent from the similarity in terms of the numerical summaries in Table 1 between the HSV samples and the fibroblasts sample. Generally, miRNA profiles tend to be linked to tissue of origin.
By contrast, the other five methods give inferior clustering results for these samples. PSVD (Panel b) gives the second best clustering performance; however, it groups one control sample together with the tonsil sample. SVD-Raw (Panel c) lacks clear clustering. SVD-Cuberoot (Panel d), RSVD (Panel e) and QN-SVD (Panel f) split the control group samples into different clusters; in addition, the HUVEC and CHME5 cells that are infected with the same virus are not clustered together using the RSVD method. Taken together, these clustering results reveal that PSVDOS outperforms the other methods and reveals more accurate hierarchical clustering results from different sample groups of NextGen sequencing data.
Furthermore, we applied a separate cluster analysis to the eight HSV-infected cells, including the seven herpesvirus and the one fibroblast samples. The clustering results are included in Figure 4 of the Supplementary Material, which show that (i) PSVDOS and PSVD separate fibroblast from the other cells, whereas the other methods place fibroblast close to HSV samples; (ii) all approaches, except RSVD and QN-SVD, confirm the finding of Ohsawa et al. (1999) that HVP2 is more similar to BV than to HSV1 or HSV2.
NextGen sequencing-based mRNA and miRNA expression profiling is rapidly gaining popularity and may eventually replace other methods; however, it yields a completely different data structure, compared with hybridization-based microarray experiments. Microarray-based expression profiling data can, with some difficulty, be transformed into a dataset that has a normal distribution and is amenable to statistical tools for pattern discovery and classification; NextGen sequencing-based expression data cannot.
Expression patterns of miRNAs represent examples of the most skewed expression data that are encountered in the biological literature. The reasons for this extreme distortion are both technical and biological, as only a handful of miRNAs tend to dominate the miRNA population within cells. It offers a practical justification to develop statistical methods for extreme data. Profiling miRNAs is a novel approach to query cell status and to classify samples, in our case, different viral infections. Hence, there exists an urgent need to develop appropriate clustering and classification approaches that take into account the particulars of these data structures.
PSVD is a factorization method to perform dimension reduction, specifically for Poisson count data. It has been applied for data dimension reduction in non-biological applications, such as call center data (Shen and Huang, 2008). In this article, we proposed an extended approach (PSVDOS) to improve unsupervised clustering of RNAseq data by incorporating offset parameters, so that necessary normalization of miRNA sequencing data can be automatically accounted for.
Using simulated data, as well as an even sparser experimental example set, that highlights variation and limitations of real-world NextGen sequencing data, we show that PSVDOS was superior to other approaches that separate normalization from dimension reduction, such as SVD on cube-rooted or relative frequency or the raw counts. Those normalization methods are commonly used to eliminate variation because of technical imperfection. PSVDOS correctly clustered samples on the basis of NextGen-derived miRNA expression profiles. This new approach should help the analysis of NextGen-based RNAseq data in general and miRNA-based classification of experimental and clinical samples in particular.
There are several future research directions worth pursuing. The current Poisson factor Model (3) uses row-specific (or cell line-specific) offset parameters to normalize the cell lines. More generally, the offset parameters can be allowed to depend on both the cell line and the miRNA, e.g. denoted as for the ith cell line and the jth miRNA. One can then impose some two-way analysis of variance model on the offsets to model effects of cell lines and miRNAs, such as
subject to some identifiability constraints, such as . Interaction terms can also be included if necessary.
Our framework makes use of the Poisson distributional nature of the sequencing data through the Poisson likelihood function. The likelihood approach is general and flexible enough that it can be extended to model other distributions. For example, researchers have noticed that some sequencing count data exhibit overdispersion with respect to Poisson distribution and propose to use negative-binomial distribution instead. See, among others, Anders and Huber (2010) and Robinson and Oshlack (2010) in the context of supervised clustering. An interesting direction for future research is to develop an appropriate factor model to address the overdispersion.
Funding: NIH (DE018304, CA019014 to D.P.D.) (in part); NIH/NIDA (1 RC1 DA029425-01); the Xerox Foundation UAC Award; NSF (CMMI-0800575, DMS-1106912 to H.S.) (in part) and PHS (P40 RR12317 to R.E.).
Conflict of Interest: none declared.