Home | About | Journals | Submit | Contact Us | Français |

Formats

Article sections

Authors

Related links

PeerJ. 2017; 5: e3797.

Published online 2017 September 8. doi: 10.7717/peerj.3797

PMCID: PMC5592911

Academic Editor: Jun Chen

Department of Biostatistics, School of Public Health, Nanjing Medical University, China

Feng Chen: nc.ude.umjn@nehcgnef

Received 2017 April 13; Accepted 2017 August 21.

Copyright ©2017 Yang et al.

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.

RNA sequencing (RNA-Seq) enables the measurement and comparison of gene expression with isoform-level quantification. Differences in the effect of each isoform may make traditional methods, which aggregate isoforms, ineffective. Here, we introduce a variance component-based test that can jointly test multiple isoforms of one gene to identify differentially expressed (DE) genes, especially those with isoforms that have differential effects. We model isoform-level expression data from RNA-Seq using a negative binomial distribution and consider the baseline abundance of isoforms and their effects as two random terms. Our approach tests the global null hypothesis of no difference in any of the isoforms. The null distribution of the derived score statistic is investigated using empirical and theoretical methods. The results of simulations suggest that the performance of the proposed set test is superior to that of traditional algorithms and almost reaches optimal power when the variance of covariates is large. This method is also applied to analyze real data. Our algorithm, as a supplement to traditional algorithms, is superior at selecting DE genes with sparse or opposite effects for isoforms.

The availability of massively parallel transcriptome sequencing technology has improved our understanding of the central dogma processes of transcription and translation and the molecular mechanisms of complex diseases (Koboldt et al., 2013; Modelska, Quattrone & Re, 2015; Wang, Gerstein & Snyder, 2009). RNA sequencing (RNA-Seq) has shown that coding genes are stochastically spliced into different transcripts (called isoforms), including partial exons or selected exons (Kalsotra & Cooper, 2011; Kanitz et al., 2015; Oshlack, Robinson & Young, 2010; Pan et al., 2008). This process is referred to as alternative splicing (AS). The different sequence characteristics of isoforms may exhibit different expression patterns, and these differences further influence the translation of proteins and affect cellular phenotypes (Li et al., 2014).

The elementary problem of transcriptomic data analysis is accurate identification of differentially expressed (DE) genes (Garber et al., 2011). Emerging software packages and pipelines for such analyses have been designed and developed, including methods that rely on gene-level measurement, such as DESeq, edgeR and the two-stage passion model (TSPM), and others based on isoform-level measurements, such as Cuffdiff2, IsoDE and EBSeq (Al Seesi et al., 2014; Anders & Huber, 2010; Lehmann et al., 2011; Li & Tibshirani, 2013; Robinson, McCarthy & Smyth, 2010; Trapnell et al., 2013). The gene-level quantification methods assume that the sum of isoform expression levels constitutes the expression of one gene, which may ignore the variability of different transcripts (Trapnell et al., 2013; Trapnell et al., 2010). Thus, the development of a method for considering isoform-level measurements is of increasing importance. For example, considering both the splicing structure and expression levels of isoforms, Cuffdiff2 uses the beta negative binomial distribution to define DE isoforms (Trapnell et al., 2013). Unfortunately, testing for individual isoforms separately neglects the relationships between each isoform (Dialsingh, Austin & Altman, 2015).

Currently, the idea of a set test, which supposes that some related variables form one set and tests the set, likely overcomes the drawbacks of gene-level measurement methods and the disadvantage of testing isoforms individually. The theoretical basis is that the score statistic of the variance component in a generalized linear mixed model (GLMM) follows the mixture chi-squared distribution with the null hypothesis (Lin, 1997). The sequencing kernel association test (SKAT), a popular algorithm in genome-wide association studies (GWASs), assumes that the effects of each locus are in the same functional region as random effects and uses the score statistics to test the set (Wu et al., 2011). Since the estimation process is under the null hypothesis, this method reduces computation time. The test for the effect of gene set (TEGS), which regresses gene expression against phenotypes, is widely used to select DE genes in a microarray platform. The random effect and the residual error of the model are due to the disease effect and the correlation between genes, respectively (Huang & Lin, 2013). The set test is effectively applied to analyze other types of high-dimensional genomic data (Ionita-Laza et al., 2013; Wu et al., 2016).

Considering the disadvantages of traditional algorithms in selecting heterogeneous genes and the accessibility of the set test, we apply the idea of the set test to identify DE genes in isoform-level measurements. This study involved three parts. First, the relationship between the expression level of one gene and the phenotype is formulated as a GLMM with the assumption of a Poisson and negative binomial (NB) distribution. Second, we study the empirical and theoretical distributions of score statistics and measure their statistical performance for different parameter settings. Third, we analyze level 3 mRNA-Seq data on lung squamous cell carcinoma (LUSC) from The Cancer Genome Atlas (TCGA). Comparisons with traditional algorithms are described in the ‘Simulations’ and ‘Real data analysis’.

Assume that there are *N*subjects in the studied sample, and suppose that for all individuals there is one sequenced gene with *p* isoforms. For the *i*th individual, *Y*_{i1}, *Y*_{i2}, …, *Y*_{ip} denotes the RNA-seq data of one gene. The vector **Y** is assumed as a Poisson or NB distribution. With the assumption of independent samples, *x*_{i} is the covariate of individual *i*. For example, its value is 0 or 1 in a case-control study. We assume *α*_{j} to be a random term that follows a normal distribution. Its variance represents the heterogeneity of the baseline abundance of the isoforms. This distinct assumption is the difference between gene-level measurement methods and isoform-level measurement methods. The disease effect is also defined as a random effect. Based on the above assumptions, a GLMM of two random terms is constructed as follows:

where $g\left(.\right)$ is a monotonic differentiable link function, such as $g\left(x\right)=log\left(x\right)$ for the Poisson regression. If *x*_{i} is equal to zero in a case-control study, *μ* + *α*_{j} indicates the expression of the *j*th isoform in the control group, and *β*_{j} indicates the disease effect.

The matrix form is written as follows:

$\mathbf{Y}={\left({\mathbf{Y}}_{1}^{T},{\mathbf{Y}}_{2}^{T},\dots ,{\mathbf{Y}}_{N}^{T}\right)}^{T}$ is an *N* × *p* vector consisting of *N***Y**_{i} vectors (**Y**_{i} = (*Y*_{i1}, *Y*_{i2}, …, *Y*_{ip})^{T}). *μ* is also an *N* × *p* vector, in which all elements are $\mu \cdot \mathbf{K}={\left({\mathbf{I}}_{p},\dots ,{\mathbf{I}}_{p}\right)}^{T}$ is an *Np* × *p* matrix in which **I**_{p} is the *p*-dimensional identity matrix, $\alpha ={\left({\alpha}_{1},{\alpha}_{2},\dots ,{\alpha}_{p}\right)}^{T}$, $\mathbf{X}={\left({x}_{1}{\mathbf{I}}_{p},{x}_{2}{\mathbf{I}}_{p},\dots ,{x}_{N}{\mathbf{I}}_{p}\right)}^{T}$, and $\beta ={\left({\beta}_{1},{\beta}_{2},\dots ,{\beta}_{p}\right)}^{T}$.

In fact, the test of the disease effect is performed to test the variance component of the second random effect. As the basic assumption of mixed models, the third term in the above equation follows a normal distribution ($\beta \sim N\left(0,\phantom{\rule{3.30002pt}{0ex}}{\tau}^{2}{\mathbf{I}}_{p}\right)$) (Gad & El Kholy, 2012). Therefore, the test of *β* equals the test of its variance component *τ*. The final null hypothesis (*H*_{0}) of the model is written as *τ* = 0. The idea of a score test is used to test *τ*. Based on the theory of score statistics in a GLMM and the special assumption of this model, the null distribution and parameter estimation of the statistic *U* is as follows:

First, the matrix form of the first order derivate of the log-likelihood is as follows:

$$\frac{\partial l\left(\alpha ,\tau \right)}{\partial \tau}=\frac{1}{2}\left({\left(\mathbf{Y}-\mu -\mathbf{K}\alpha \right)}^{T}{\Delta}^{-\mathbf{1}}\mathbf{WX}{\mathbf{X}}^{T}{\Delta}^{-\mathbf{1}}\mathbf{W}\left(\mathbf{Y}-\mu -\mathbf{K}\alpha \right)-tr\left({\mathbf{W}}_{0}\mathbf{X}{\mathbf{X}}^{T}\right)\right)$$

where Δ^{−1}, **W** and **W**_{0} are block diagonal matrixes, and $\mathbf{W}=E\left({\mathbf{W}}_{0}\right)$. Each block of the three matrices is a diagonal matrix whose diagonal elements are similar. Assuming ${\gamma}_{i}=E\left({y}_{i}\right)$, their diagonal elements are ${\delta}_{i}=1\u2215{g}^{\prime}\left({\gamma}_{i}\right),{w}_{i}=V{\left({\gamma}_{i}\right)}^{-1}{\delta}_{i}^{2}\text{and}{w}_{0i}={w}_{i}+{e}_{i}\left({y}_{i}-{\gamma}_{i}\right)$, where ${e}_{i}=\left({V}^{\prime}\left({\gamma}_{i}\right){g}^{\prime}\left({\gamma}_{i}\right)+V\left({\gamma}_{i}\right){g}^{\u2033}\left({\gamma}_{i}\right)\right)\u2215\left({V}^{2}\left({\gamma}_{i}\right){\left({g}^{\prime}\left({\gamma}_{i}\right)\right)}^{3}\right)$. With different variances and the same link function, Δ^{−1} of Poisson and NB distribution is the same, but **W** and **W**_{0} are different. For the Poisson assumption, the product of Δ^{−1} and **W** is an identity matrix, and **W** = **W**_{0}. For the NB assumption, we take the product of Δ^{−1} and **W** as Φ, which has diagonal elements of $1\u2215\left(1+\varphi {\gamma}_{i}\right)$ . *ϕ* is the scale parameter of an NB. The diagonal elements of the block of **W**_{0} are ${w}_{0i}={\gamma}_{i}\u2215\left(1+\varphi {\gamma}_{i}\right)+\left(\varphi {\gamma}_{i}\left({y}_{i}-{\gamma}_{i}\right)\right)\u2215{\left(1+\varphi {\gamma}_{i}\right)}^{2}$.

Second, the statistic *U* depends on the first derivate of the log-likelihood function. The formula for the *U* statistic depends on an assumption. Under the Poisson assumption, *U* is equal to ${U}_{Pois}=\frac{1}{2}\left({\left(\mathbf{Y}-{g}^{-1}\left(\stackrel{\u02c6}{\mu}+\mathbf{K}\stackrel{\u02c6}{\alpha}\right)\right)}^{T}\mathbf{X}{\mathbf{X}}^{T}\left(\mathbf{Y}-{g}^{-1}\left(\stackrel{\u02c6}{\mu}+\mathbf{K}\stackrel{\u02c6}{\alpha}\right)\right)-tr\left(\stackrel{\u02c6}{\mathbf{W}}\mathbf{X}{\mathbf{X}}^{T}\right)\right)$, while under the NB assumption, *U* is formulated as ${U}_{NB}=\frac{1}{2}\left({\left(\mathbf{Y}-{g}^{-1}\left(\stackrel{\u02c6}{\mu}+\mathbf{K}\stackrel{\u02c6}{\alpha}\right)\right)}^{T}\stackrel{\u02c6}{\Phi}\mathbf{X}{\mathbf{X}}^{T}\stackrel{\u02c6}{\Phi}\right.\left.\left(\mathbf{Y}-{g}^{-1}\left(\stackrel{\u02c6}{\mu}+\mathbf{K}\stackrel{\u02c6}{\alpha}\right)\right)-tr\left({\stackrel{\u02c6}{\mathbf{W}}}_{\mathbf{0}}\mathbf{X}{\mathbf{X}}^{T}\right)\right)$. We estimate $\stackrel{\u02c6}{\mu}$ and $\stackrel{\u02c6}{\alpha}$ of the model under *H*_{0}.

Third, the chi-square statistic is shown as ${\chi}^{2}=U\tilde{I}{\left(\stackrel{\u02c6}{\alpha}\right)}^{-1}U$. The formula of the information matrix is as follows:

$$\tilde{I}={I}_{\tau \tau}-{I}_{\tilde{\alpha}\tau}{I}_{\tilde{\alpha}\tilde{\alpha}}^{-1}{I}_{\tilde{\alpha}\tau}$$

where ${I}_{\tau \tau}=E\left(\frac{\partial l}{\partial \tau}\frac{\partial l}{\partial \tau}\right),{I}_{\tilde{\alpha}\tau}=E\left(\frac{\partial l}{\partial \tilde{\alpha}}\frac{\partial l}{\partial \tau}\right),{I}_{\tilde{\alpha}\tilde{\alpha}}=E\left(\frac{\partial l}{\partial \tilde{\alpha}}\frac{\partial l}{\partial \tilde{\alpha}}\right)$, and $\tilde{\alpha}=\left(\mu ,\alpha \right)$. Then, the formula is rewritten as follows:

$$g\left(E\left(\mathbf{Y}\right)\right)=\tilde{\mathbf{K}}\tilde{\alpha}+\mathbf{X}\beta$$

where $\tilde{\mathbf{K}}={\left({\tilde{\mathbf{I}}}_{p},\dots ,{\tilde{\mathbf{I}}}_{p}\right)}^{T},{\tilde{\mathbf{I}}}_{p}=\left(1,{\mathbf{I}}_{p}\right)$.

Finally, suppose **A** = **K**^{T}**K** with a diagonal element of *a*_{ii}. *a*_{ii} consists of a vector, denoted as $\tilde{\mathbf{a}}$. Let *κ*_{2i}, *κ*_{3i} and *κ*_{4i} be the cumulants of **Y**_{i}. With the assumption of an exponential family of distributions, the relationships between the three cumulants are as follows:

$$\begin{array}{c}\hfill {\kappa}_{2i}=V\left({\gamma}_{i}\right)\hfill \\ \hfill {\kappa}_{3i}={V}^{\prime}\left({\gamma}_{i}\right)V\left({\gamma}_{i}\right)\hfill \\ \hfill {\kappa}_{3i}=\left({V}^{\u2033}\left({\gamma}_{i}\right)V\left({\gamma}_{i}\right)+{\left({V}^{\prime}\left({\gamma}_{i}\right)\right)}^{2}\right)V\left({\gamma}_{i}\right).\hfill \end{array}$$

Let **R** ^{Np×Np} be a block diagonal matrix. In each block, the diagonal elements and non-diagonal elements are calculated as ${r}_{ii}^{j}={w}_{i}^{4}{\delta}_{i}^{-4}{\kappa}_{4i}+2{w}_{i}^{2}+{e}_{i}^{2}{\kappa}_{2i}-2{w}_{i}^{2}{\delta}_{i}^{-2}{e}_{i}{\kappa}_{3i}$ and ${r}_{i{i}^{\prime}}^{j}=2{w}_{i}{w}_{{i}^{\prime}},i\ne {i}^{\prime}$, respectively. **C** ^{Np×Np} is composed of a *p* diagonal matrix, *C*^{j}, whose diagonal elements are ${c}_{ii}^{j}={w}_{i}^{4}{\delta}_{i}^{-4}{\kappa}_{3i}+2{w}_{i}^{2}+{e}_{i}^{2}{\kappa}_{2i}-2{w}_{i}^{2}{\delta}_{i}^{-2}{e}_{i}{\kappa}_{3i}$. Then, the elements of the information matrix are as follows:

$$I}_{\tau \tau}=\frac{1}{4}{\mathbf{J}}^{T}\left(\mathbf{ARA}\right)\mathbf{J},\phantom{\rule{10.00002pt}{0ex}}{I}_{\tilde{\alpha}\tau}=\frac{1}{2}\tilde{\mathbf{K}}\mathbf{C}\tilde{\mathbf{a}},\phantom{\rule{10.00002pt}{0ex}}{I}_{\tilde{\alpha}\tilde{\alpha}}={\tilde{\mathbf{K}}}^{T}\mathbf{W}\tilde{\mathbf{K}$$

where **J** is an *Np* vector, and its elements are 1. The difference in variances leads to different expressions for **R** and **C**. For the Poisson assumption, the diagonal and off-diagonal elements of **R** are ${r}_{ii}^{j}={\gamma}_{i}+2{\gamma}_{i}^{2}$ and ${r}_{i\tilde{i}}=2{\gamma}_{i}{\gamma}_{\tilde{i}},i\ne \tilde{i}$, respectively. The diagonal elements of **C** are ${c}_{ii}^{j}={\gamma}_{i}$. For the NB assumption, the diagonal and off-diagonal elements of **R** are ${r}_{ii}^{j}=\frac{2{\gamma}_{i}^{3}{\varphi}^{2}+3{\gamma}_{i}^{3}+4{\gamma}_{i}\varphi +2{\gamma}_{i}^{2}+{\gamma}_{i}}{{\left(1+{\gamma}_{i}\varphi \right)}^{3}}$ and ${r}_{i\tilde{i}}=2\frac{{\gamma}_{i}}{1+{\gamma}_{i}\varphi}\frac{{\gamma}_{\tilde{i}}}{1+{\gamma}_{\tilde{i}}\varphi},i\ne \tilde{i}$, respectively. The diagonal elements of **C** are ${c}_{ii}^{j}=\frac{{\gamma}_{i}+{\gamma}_{i}^{2}\varphi}{{\left(1+{\gamma}_{i}^{2}\varphi \right)}^{2}}$.

The permutation algorithm generates the empirical distribution. The two different assumptions cause different score statistics. For the Poisson assumption, the statistic is ${U}_{Pois}=\frac{1}{2}\left({\left(\mathbf{Y}-{g}^{-1}\left(\stackrel{\u02c6}{\mu}+\mathbf{K}\stackrel{\u02c6}{\alpha}\right)\right)}^{T}\mathbf{X}{\mathbf{X}}^{T}\left(\mathbf{Y}-{g}^{-1}\left(\stackrel{\u02c6}{\mu}+\mathbf{K}\stackrel{\u02c6}{\alpha}\right)\right)-tr\left(\stackrel{\u02c6}{\mathbf{W}}\mathbf{X}{\mathbf{X}}^{T}\right)\right)$. For the NB assumption, the statistic is equal to ${U}_{NB}=\frac{1}{2}\left({\left(\mathbf{Y}-{g}^{-1}\left(\stackrel{\u02c6}{\mu}+\mathbf{K}\stackrel{\u02c6}{\alpha}\right)\right)}^{T}\stackrel{\u02c6}{\Phi}\mathbf{X}{\mathbf{X}}^{T}\stackrel{\u02c6}{\Phi}\left(\mathbf{Y}-{g}^{-1}\left(\stackrel{\u02c6}{\mu}+\mathbf{K}\stackrel{\u02c6}{\alpha}\right)\right)-tr\left({\stackrel{\u02c6}{\mathbf{W}}}_{\mathbf{0}}\mathbf{X}{\mathbf{X}}^{T}\right)\right)$. The hypothesis testing of *U* follows two steps: (a) construction of the empirical distribution by shuffling the label and (b) calculation of the *P* value of the original statistic *U*. The estimation of *μ* and *α* is similar to the above. To improve robustness, the default setting of the number of permutations is 5,000.

We perform simulations to examine the type I error and power of the proposed score statistics, *U*, for identifying differential expression under a range of scenarios. The parameter settings are based on the analysis of real data and references (Lehmann et al., 2011). We assume that the RNA-Seq data follow NB and Poisson distributions. Five parameters are involved: the variance of *α* (*s*), the variance of the disease effect (*l*), the number of isoforms (*p*), the logarithm of expression levels ($exp\left(M\right)$) and the dispersion parameter (*d*). *s* shows the heterogeneity of the baseline abundance of each isoform. *l* refers to the heterogeneity of the disease effect. The other three parameters describe the characteristics of the expression levels. The variance of *α* varies from 0.25 to 1.75 in steps of 0.75. The variance of *β* varies from 0.20 to 1.00 in steps of 0.4. The number of isoforms can be 2, 4 or 8. *d* is set at 0.5 or 2. We let $exp\left(M\right)$ be 2.5 or 5. The sample size is fixed at 40. In the type I error simulations, the effect size is set to zero (*l* = 0).

We compare our approach with three traditional algorithms: DESeq, edgeR and TSPM (Anders & Huber, 2010; Lehmann et al., 2011; Robinson, McCarthy & Smyth, 2010). The sum of isoforms is assumed to represent the expression level of one gene in these methods. Therefore, the loop number is the gene number in simulations. When *l* is equal to 0, the proportion of significant genes is the false positive rate.

We download the LUSC Batch 101 mRNA sequencing data from TCGA (Network, 2012). The sample size is 12 for the case-control traits. The number of genes is 20,533. Each sample has four files: *gene expression*, *normalized gene expression*, *isoform expression* and *normalized isoform expression*. We only used the *gene expression* and *isoform expression* datasets. These data are attached as File S1. To utilize the advantages of isoVCT, the candidate genes are the genes showing a higher expression level than the total sample, with isoforms that exhibit heterogeneous effect sizes. Finally, we select 6,134 genes.

This algorithm is completed using the Microsoft R Open (v3.3.0; Microsoft Corp., Seattle, WA, USA). The functions *glmer* and *glmer.nb* in the *lme4* package fit the mixed models in our algorithm. The function *ginv* in the MASS package is employed to calculate the generalized inverse of the singular matrix. The packages *DESeq* and *edgeR* are both from Bioconductor. In consideration of the potential computing time of the simulations, the *doParallel* package is used for parallel computation.

This algorithm provides self-adaptation for real data analysis. If the fitness of the NB assumption fails or the dispersion parameter is close to one, the Poisson assumption is used. Due to the application of the idea of the variance component test, our method is referred to isoVCT. The distribution of the score test statistic is fitted via theoretical and empirical methods, which we term isoVCT-The and isoVCT-Emp, respectively. The R program of isoVCT and the simulation code are attached as File S2.

The results for the empirical type I error for the NB assumption are shown in Table 1 and Fig. 1. *s*, *l* and *p* may be unrelated to type I error rates, but *ϕ* may show an inverse correlation to the type I error rate. isoVCT-The is exactly conservative in specific scenarios. The type I error rate of isoVCT-Emp is close to 0.05, which might mean that the permutation can fit the distribution of *U* in a small sample size setting. DESeq can control type I errors; however, the type I error rates of edgeR and TSPM are both dispersed, especially when *ϕ* = 0.5. Generally, isoVCT can control type I error rates around the nominal level for the NB assumption.

The results regarding empirical power for the NB assumption are shown in Tables 2, ,33 and Fig. 2. Three findings of this analysis are as follows: (a) *M* exhibits a positive correlation with power; (b) *p* and *s* exhibit a negative correlation with power; and (c) *l* also exhibits a positive correlation with power, but its increase is much sharper than those of *p* and *M*. Our results show that isoVCT-Emp is more advantageous than traditional methods, especially when the effect size is small and the effect heterogeneity is large.

The results of the type I error and empirical power under the Poisson assumption are shown in File S3.

The results of real data analysis are shown in Fig. 3. Only 4,422 genes are fit to the GLMM. The three traditional algorithms, DESeq, edgeR and TSPM, define 364, 259 and six DE genes, respectively. The intersection of DESeq and edgeR is 221 genes, representing a high proportion of the DE genes that these algorithms define. TSPM is almost ineffective for these genes.

The results of isoVCT are as follows. First, isoVCT-Emp defines 263 DE genes; of these genes, isoVCT-Emp specifically selected 242 DE genes, and only a small portion of these genes is also found by the other algorithms. Second, the DE genes identified by isoVCT-Emp included those that are identified by isoVCT-The. In general, isoVCT is superior at selecting heterogeneous genes.

Here, we propose isoVCT, a variance component score test for testing the coefficients of covariates to select DE genes. Simulations and real data analysis both suggest that this method is advantageous in selecting weak effects or heterogeneous genes. The results of the simulations are discussed from the following two aspects. First, isoVCT controls type I errors in each setting under any of the distribution assumptions. The type I error rate of the empirical distribution is almost controlled. However, the type I error rate of isoVCT-The is strictly controlled, which is likely caused by an inaccurate estimation of the *U* statistic in the small sample size setting. The type I error rates of edgeR and TSPM increase in some settings, especially with the NB assumption (Yang et al., 2015). Second, the power of isoVCT is higher than that of other methods; furthermore, the strength is evident for the small *l* and NB assumption. isoVCT-Emp is superior to other methods in each setting,especially for *l* = 0.6. The small sample size setting may cause the inverse proportion between *p* and the power of isoVCT. Nevertheless, the powers of five algorithms are high for the Poisson assumption.

In real data analysis, the Venn diagram of DE genes directly indicates the relationships among the four algorithms. The fact that DESeq and edgeR use the same distribution assumption and the same idea for testing likely results in the number of intersections representing a majority of the DE genes identified by these algorithms. TSPM defines the smallest number of DE genes. However, the use of different ideas leads to the different result of isoVCT. isoVCT-Emp specifically defines 242 heterogeneous or small effect size genes. For example, *CASP7* and *STAT6* are related to LUSC (Dubey & Saini, 2015; Lee et al., 2009).

Furthermore, isoVCT exhibits three innovations. First, isoVCT derives the empirical and theoretical distribution of the variance component score statistics in the framework of a GLMM and evaluates the performance of score statistics. Second, the random effect α represents the correlation of isoforms in the framework of the GLMM. Third, isoVCT further verifies the effectiveness of the set test in RNA-Seq data and supplies a new view of biological functions with isoform information.

However, isoVCT has some limitations. In the derivation of the score statistic, the random term *α* is regarded as a fixed parameter, which may cause isoVCT-The to be quite conservative. For the NB assumption, the power of isoVCT is very low in some settings, because the estimation of the dispersion parameter may be biased. The methods of weighted likelihood and quasi-likelihood likely overcome this drawback in estimating the NB mixed model (Lund et al., 2012).

In conclusion, isoVCT, a supplement to DESeq or edgeR, is powerful and robust at selecting small effect and heterogeneous genes.

10.7717/peerj.3797/supp-3#### File S3

**Summary of simulations in Poisson assumption:**

Click here for additional data file.^{(36K, docx)}

This work was supported by the National Natural Science Foundation of China (No. 81502888, 81473070 and 81373102), the Jiangsu Shuangchuang Plan, the Science and Technology Development Fund Key Project of Nanjing Medical University (2014NJMUZD003 and 2016NJMUZD014), and the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

The following grant information was disclosed by the authors:

National Natural Science Foundation of China 815028888147307081373102.

Science and Technology Development Fund Key Project of Nanjing Medical University 2014NJMUZD0032016NJMUZD014.

Competing Interests

The authors declare there are no competing interests.

Author Contributions

Sheng Yang conceived and designed the experiments, performed the experiments, wrote the paper, prepared figures and/or tables.

Fang Shao analyzed the data, contributed reagents/materials/analysis tools.

Weiwei Duan and Yang Zhao reviewed drafts of the paper.

Feng Chen conceived and designed the experiments.

Data Availability

The following information was supplied regarding data availability:

The raw data has been supplied as Supplemental Files.

Al Seesi et al. (2014) Al Seesi S, Tiagueu YT, Zelikovsky A, Mandoiu II. Bootstrap-based differential gene expression analysis for RNA-Seq data with and without replicates. BMC Genomics. 2014;15(Suppl 8):S2. doi: 10.1186/1471-2164-15-S8-S2. [PMC free article] [PubMed] [Cross Ref]

Anders & Huber (2010) Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biology. 2010;11 doi: 10.1186/gb-2010-11-10-r106. Article R106. [PMC free article] [PubMed] [Cross Ref]

Dialsingh, Austin & Altman (2015) Dialsingh I, Austin SR, Altman NS. Estimating the proportion of true null hypotheses when the statistics are discrete. Bioinformatics. 2015;31:2303–2309. doi: 10.1093/bioinformatics/btv104. [PMC free article] [PubMed] [Cross Ref]

Dubey & Saini (2015) Dubey R, Saini N. STAT6 silencing up-regulates cholesterol synthesis via miR-197/FOXJ2 axis and induces ER stress-mediated apoptosis in lung cancer cells. Biochimica Et Biophysica Acta. 2015;1849:32–43. doi: 10.1016/j.bbagrm.2014.10.002. [PubMed] [Cross Ref]

Gad & El Kholy (2012) Gad AM, El Kholy RB. Generalized linear mixed models for longitudinal data. International Journal of Probability and Statistics. 2012;1:41–47. doi: 10.5923/j.ijps.20120103.03. [Cross Ref]

Garber et al. (2011) Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nature Methods. 2011;8:469–477. doi: 10.1038/nmeth.1613. [PubMed] [Cross Ref]

Huang & Lin (2013) Huang YT, Lin X. Gene set analysis using variance component tests. BMC Bioinformatics. 2013;14:210. doi: 10.1186/1471-2105-14-210. [PMC free article] [PubMed] [Cross Ref]

Ionita-Laza et al. (2013) Ionita-Laza I, Lee S, Makarov V, Buxbaum JD, Lin X. Sequence kernel association tests for the combined effect of rare and common variants. American Journal of Human Genetics. 2013;92:841–853. doi: 10.1016/j.ajhg.2013.04.015. [PubMed] [Cross Ref]

Kalsotra & Cooper (2011) Kalsotra A, Cooper TA. Functional consequences of developmentally regulated alternative splicing. Nature Reviews Genetics. 2011;12:715–729. doi: 10.1038/nrg3052. [PMC free article] [PubMed] [Cross Ref]

Kanitz et al. (2015) Kanitz A, Gypas F, Gruber AJ, Gruber AR, Martin G, Zavolan M. Comparative assessment of methods for the computational inference of transcript isoform abundance from RNA-seq data. Genome Biology. 2015;16 doi: 10.1186/s13059-015-0702-5. Article 150. [PMC free article] [PubMed] [Cross Ref]

Koboldt et al. (2013) Koboldt DC, Steinberg KM, Larson DE, Wilson RK, Mardis ER. The next-generation sequencing revolution and its impact on genomics. Cell. 2013;155:27–38. doi: 10.1016/j.cell.2013.09.006. [PMC free article] [PubMed] [Cross Ref]

Lee et al. (2009) Lee WK, Kim JS, Kang H, Cha SI, Kim DS, Hyun DS, Kam S, Kim CH, Jung TH, Park JY. Polymorphisms in the Caspase7 gene and the risk of lung cancer. Lung Cancer. 2009;65:19–24. doi: 10.1016/j.lungcan.2008.10.022. [PubMed] [Cross Ref]

Lehmann et al. (2011) Lehmann BD, Bauer JA, Chen X, Sanders ME, Chakravarthy AB, Shyr Y, Pietenpol JA. Identification of human triple-negative breast cancer subtypes and preclinical models for selection of targeted therapies. Journal of Clinical Investigation. 2011;121:2750–2767. doi: 10.1172/JCI45014. [PMC free article] [PubMed] [Cross Ref]

Li et al. (2014) Li HD, Menon R, Omenn GS, Guan Y. The emerging era of genomic data integration for analyzing splice isoform function. Trends in Genetics. 2014;30:340–347. doi: 10.1016/j.tig.2014.05.005. [PMC free article] [PubMed] [Cross Ref]

Li & Tibshirani (2013) Li J, Tibshirani R. Finding consistent patterns: a nonparametric approach for identifying differential expression in RNA-Seq data. Statistical Methods in Medical Research. 2013;22:519–536. doi: 10.1177/0962280211428386. [PMC free article] [PubMed] [Cross Ref]

Lin (1997) Lin X. Variance component testing in generalised linear models with random effects. Biometrika. 1997;84:309–326. doi: 10.1093/biomet/84.2.309. [Cross Ref]

Lund et al. (2012) Lund SP, Nettleton D, McCarthy DJ, Smyth GK. Detecting differential expression in RNA-sequence data using quasi-likelihood with shrunken dispersion estimates. Statistical Applications in Genetics and Molecular Biology. 2012;11:307–314. doi: 10.1515/1544-6115.1826. [PubMed] [Cross Ref]

Modelska, Quattrone & Re (2015) Modelska A, Quattrone A, Re A. Molecular portraits: the evolution of the concept of transcriptome-based cancer signatures. Briefings in Bioinformatics. 2015;16:1000–1007. doi: 10.1093/bib/bbv013. [PMC free article] [PubMed] [Cross Ref]

Network (2012) Network CGAR. Comprehensive genomic characterization of squamous cell lung cancers. Nature. 2012;489:519–525. doi: 10.1038/nature11404. [PMC free article] [PubMed] [Cross Ref]

Oshlack, Robinson & Young (2010) Oshlack A, Robinson MD, Young MD. From RNA-seq reads to differential expression results. Genome Biology. 2010;11 doi: 10.1186/gb-2010-11-12-220. Article 220. [PMC free article] [PubMed] [Cross Ref]

Pan et al. (2008) Pan Q, Shai O, Lee LJ, Frey BJ, Blencowe BJ. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nature Genetics. 2008;40:1413–1415. doi: 10.1038/ng.259. [PubMed] [Cross Ref]

Robinson, McCarthy & Smyth (2010) Robinson MD, McCarthy DJ, Smyth GK. edgeR: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26:139–140. doi: 10.1093/bioinformatics/btp616. [PMC free article] [PubMed] [Cross Ref]

Trapnell et al. (2013) Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nature Biotechnology. 2013;31:46–53. doi: 10.1038/nbt.2450. [PMC free article] [PubMed] [Cross Ref]

Trapnell et al. (2010) Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, Van Baren MJ, Salzberg SL, Wold BJ, Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology. 2010;28:511–515. doi: 10.1038/nbt.1621. [PMC free article] [PubMed] [Cross Ref]

Wang, Gerstein & Snyder (2009) Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics. 2009;10:57–63. doi: 10.1038/nrg2484. [PMC free article] [PubMed] [Cross Ref]

Wu et al. (2016) Wu C, Chen J, Kim J, Pan W. An adaptive association test for microbiome data. Genome Medicine. 2016;8 doi: 10.1186/s13073-016-0302-3. Article 56. [PMC free article] [PubMed] [Cross Ref]

Wu et al. (2011) Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. American Journal of Human Genetics. 2011;89:82–93. doi: 10.1016/j.ajhg.2011.05.029. [PubMed] [Cross Ref]

Yang et al. (2015) Yang S, Guo L, Shao F, Zhao Y, Chen F. A systematic evaluation of feature selection and classification algorithms using simulated and real miRNA sequencing data. Computational and Mathematical Methods in Medicine. 2015;2015 doi: 10.1155/2015/178572. Article 178572. [PMC free article] [PubMed] [Cross Ref]

Articles from PeerJ are provided here courtesy of **PeerJ, Inc**

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's national science library in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |