Brief introduction to Wu et al’s method
Wu et al’s method is based on the original method by Jiang & Wong
]. The original method assumes that the reads are uniformly sampled from the whole transcripts and models the reads count on a specific exon as a Poisson random variable with parameter
. Here, m
is the number of isoforms of the specific gene, θi
is the expression level of the i
-th isoform, w
is the total reads count in this sample, lj
is the length of the j
-th exon and (aij
) is a matrix that indicates the gene structure, with aij
1 or 0 indicating that exon j
is included or excluded in isoform i
. Thus the likelihood function of isoform expression based on the observation of j
-th exon is
. Assuming that exons are independent, we can get the joint log likelihood function of the whole gene as:
The setting of aij
1 for all exons of an isoform implies that reads are distributed uniformly across the whole isoform. To compensate for non-uniform distribution, Wu et al’s method takes nonparametric models of read distribution across an isoform
]. It can deal with different kinds of read distribution. The nonparametric read models used include a global bias curve (GBC) for all genes and a local bias curve (LBC) for each gene. The GBC is used to capture the general trend of non-uniform read distribution with regard to the relative position in a gene, shared by all genes in the dataset, and LBC reflects the distribution pattern specific to each gene. Methods for estimating the GBC and LBC curves are described in
]. Based on GBC curve and LBC curves, we get two corrected gene structure matrices: (GBM)ij
, short for Global Bias Matrix and Local Bias Matrix respectively. The corrected structure matrices are weighted indicator matrices, instead of the 0–1 indicator matrices (aij
). Weights in these two matrices reflect the bias tendency of corresponding exons in this gene. We mix these two matrices as the bias-corrected gene structure which is denoted as (bij
where α is a weight parameter indicating the relative importance of (GBM)ij matrix verse the (LBM)ij matrix in the final gene structure matrix.
By replacing the 0–1 indicator gene structure matrix (aij) with the weighted gene structure matrix (bij), we re-define the log-likelihood function as:
Further, we can get the gradient of this log-likelihood function by taking derivates for each θi:
The isoform expression levels can be estimated by maximizing this log-likelihood function. The log-likelihood function has been proved concave in previous work
] and global optimum can be found by proper optimization algorithm.
Procedures of NURD
The input data of our NURD are the read-mapping file and gene annotation file. The output is isoform expression of each gene. Figure
shows the detailed flowchart of the algorithm.
Figure 1 Flowchart of NURD algorithm. Using NURD to estimate the isoform expression mainly consists of five steps: (1) read mapping, which is not a part of NURD, but the preparation for it; (2) getting gene annotation; (3) counting reads based on gene annotation (more ...)
Using NURD to estimate isoform expression typically involves the following five steps.
This procedure is not a part of our algorithm, but is a preparation step for it. There have been many read-mapping tools published, such as Bowtie
] or Tophat
]. The current implementation of NURD requires that the reads mapping file be in the SAM format.
2 Gene annotation.
Gene annotation file can be downloaded from a database or assembled by another software such as Cufflinks
]. The current implementation of NURD requires that the gene annotation file be in the GTF format or refflat format. We extract the basic information of all genes from the gene annotation file, including gene names, number of isoforms and isoform names, number of exons of each gene, length of each exon, and gene structure. Gene structure information tells which exons are contained by each isoform.
3 Reads counts. Using reads mapping file and gene annotation file, NURD gets read counts for all exons of all genes in the annotation file.
4 Log-likelihood function. Based on the information we get in step 2 and step 3, NURD calculates the GBC of the whole data, LBC for each gene and the bias-corrected gene structure matrix, and furthermore get the corrected log-likelihood function.
5 Expression estimation.
After the log-likelihood functions are calculated, the major task for NURD is to infer the best expression estimations that maximize the log-likelihood functions. The optimization algorithm we use here is the binary search algorithm, which is very effective and widely used in dealing with search problem
Implementation of the NURD software
The key step for the efficient implementation of the method is the optimization of the log-likelihood functions for all genes and isoforms. We use the binary interval search technique for the optimization.
Binary search for single-isoform genes
First, we will illustrate how to use the binary interval search technique to optimize the log-likelihood function if one gene has only one isoform.
The log-likelihood function has been proved to be concave in our previous work
], so the optimization problem can be transformed to finding the point where the gradient function is equal to zero. Since the objective function is concave and the gene has only one isoform, the corresponding gradient function is a univariate monotone function, in which situation binary interval search
can be used. Obviously, the log-likelihood function is a real number function and the search space is a real number interval, so the objective of the algorithm is to find a very short interval to cover the optimum point. We initialize the search algorithm with a large enough interval and after each step of binary search algorithm, the length of the interval will shrink to half of the previous interval’s length. As the algorithm goes on, the length of interval will exponentially decrease. As a result, given the precision limit ϵ which is a small real number, the running time complexity of finding the interval covering the optimum point with the gradient equal to zero is O(log(1/ϵ)) .
Gradient ascending algorithm
] is another technique that is widely used in optimization problem. The binary search algorithm has advantages over the gradient ascending algorithm: Binary search algorithm guarantees to converge to the optimum point in O(log(1/ϵ)), which is really a short time and is fixed given the precision ϵ, while time complexity of gradient ascending algorithm usually depends on step length and the shape of the optimized function. If the step size is not proper, it can be sometime difficult for gradient ascending algorithms to converge. Binary search algorithm doesn’t need to find the proper step size. In some kind of gradient ascending algorithm, there need be another procedure called line search
for finding a proper step size.
A limitation of binary search compared with gradient ascending algorithm is that the binary search is not as general as the latter for it requires the optimized function to be concave. This is not a problem for the problem in NURD as the log-likelihood function in NURD has been proved concave.
Coordinate binary search for multi-isoform genes
Because binary search is a 1-dimension search technique, it can only handle the optimization problem of univariate functions. When estimating the expressions of multi-isoform genes, the objective function will be a multivariate function and we need to use coordinate binary search technique. The strategy of coordinate binary algorithm is described in the following pseudo code:
In the innermost loop, we hold all the expressions of isoforms fixed except for some θi and the log-likelihood function degenerates to a 1-dimension function. In each step of innermost loop, we maximize the log-likelihood with respect to θi, given the expressions of other isoforms. Because the log-likelihood function is concave, the 1-dimension objective function is also concave, in which case binary search is suitable.
The pseudo code clearly shows that after each step of the innermost loop, the log-likelihood function will ascend in some degree. As the log-likelihood function is concave, the global optimum will be found after a number of iterations.
The coordinate optimization in 2-dimension case is illustrated in Figure
Figure 2 Illustration of 2-dimension coordinate binary search algorithm. This is a visualization of a bivariate concave function with its contours. The maximum point is (1, 2). The arrowed line segments show the procedures of coordinate binary search algorithm (more ...)
Usage of the software
NURD is implemented in C++ and the runtime environment is the Linux systems. The source code is available for free for academic use at http://bioinfo.au.tsinghua.edu.cn/software/NURD/
. After getting the source code, one can compile it and get the executable file by simply using make
command. The software can be efficiently executed with command line inputs. The acceptable gene annotation file format is GTF and refFlat. The acceptable read mapping file is SAM. Short reads should be mapped to genome reference.
The current implementation has not taken the characteristic of paired-end sequencing into consideration, which means NURD regards paired reads as two independent single end reads.