IsoformEx based on non-negative least squares
The estimation method of IsoformEx is based on weighted non-negative least squares. Figure shows the algorithm flow chart and an example to predict isoform expression levels from short read data using expression levels of exon slices. A transcript block consists of transcript isoforms of a gene and other transcripts overlapping them that belong to other genes. For each transcript block, IsoformEx collects slices of exons and computes values of RPKM [
29] (reads per kilobase of exon per million mapped reads) of all exon slices and splice junctions. When we computed RPKM of splice junctions, the length of splice junction was fixed at 54. For splice junction mapping, the 32 bp tag should have at least five nucleotide matches from either side of splice junction.
Both constitutive exons and alternative exons were used for obtaining slices of exons. Thus, we build an exon structure matrix (A
exon) of the transcript block, where A
exon (
i,j)
= 1 when the
j-th slice is a part of the
i-th transcript, otherwise A
exon (
i,j)
= 0. It also builds a splice junction structure matrix (
Asj) of the transcript block, of which element
Asj (
i,j) = 1 when the
j-th junction is a part of the
i-th transcript, otherwise
Asj (
i,j) = 0. After combining the exon structure matrix and splice junction structure matrix, and combining a RPKM column vector (b
exon) of exon slices and a RPKM vector (
bsj) of splice junctions, we solve a weighted non-negative least squares problem for each transcript block,
where A = [Aexon Asj
] is the combined exon structure matrix, b = [bexon;b {sj}] is the combined column vector having all RPKM values, x is the solution vector having transcript isoform expression levels, and W is the weight diagonal matrix whose diagonal elements are the corresponding weights of exon slice or splice junction. The observed short reads in an exon slice is actually the sum of short reads generated from different transcript isoforms having the exon slice. We identify the contribution of individual transcripts when exon slices are shared by multiple transcripts. But, sometimes, the number of reads of an exon slice is small just because the exon slice is short. A weighting scheme was used since the confidence level of the number of reads in smaller exon slice is lower. In addition, the total concentration of transcript isoforms should be zero or positive. Thus, we constructed a weighted non-negativity-constraint optimization problem. This problem can be solved by a non-negative least squares algorithm. After estimating expression levels of transcripts, we estimated an expression level for each gene by the sum of the expression levels of the individual transcript isoforms of the gene.
The non-negative least squares problem has a unique solution when
A is full rank and the algorithm converges to the solution [
30]. When
A is rank-deficient, the code of
lsqnonneg in Matlab (Matlab2009a, Natick, MA) solves the non-negative least squares problem with a linearly independent subset of the columns of
ATand leaves all other values of
x at zero. This is a particular solution to the minimization problem subject to non-negativity constraints. Here is the simplest example of rank-deficient
A generated from four transcripts having three constitutive exons and two splice junctions:
The sum of the first and second rows of
A is same as the sum of the third and fourth rows of
A. The second row vector equals the third plus the fourth minus the first,
i.e. A(2,:) =
A(3,:) +
A(4,:) -
A(1,:). The maximal number of linearly independent columns of
A is 3, i.e. rank(
A) = 3. The rank of 4 × 5 matrix
A is smaller than min(4, 5) = 4, so
A is rank-deficient. Such non-identifiable cases were already discussed in the previous works [
20,
31]. If a larger transcript block has above four transcripts, the bigger structure matrix is also rank deficient and the problem does not have a unique solution. A more complicated example having four transcripts, four constitutive exons and four splice junctions follows:
The sum of the first and third rows of
A is same as the sum of the second and fourth rows of
A. The rank of
A is 3, so
A is rank-deficient. Although these examples contain only constitutive exons without any alternative exon, having only constitutive exons does not guarantee non-identifiability. The simple rule for detecting non-identifiability of a transcript block is testing the rank-deficiency of structure matrix
A and convergence of a non-negative least squares problem. Although it is theoretically possible to have non-identifiable cases, the frequency of non-identifiable cases was lower than 2% of clusters in our estimation of transcript expression levels in MCF7 cell line, whereas the percentages of non-identifiable models under the Affymetrix Exon 1.0 ST array (four probes per targeting exons) and the Affymetrix Human Exon Junction array (HJAY) (eight probes per targeting exons or splice junctions) were 74% and 31% for 2256 alternative spliced human genes [
31].
When the length of exon slice is smaller than the tag length of 32 bp, no reads can be detected. We excluded such small exon slices for accurate estimation. Even if the length of an exon slice is larger than the tag length, the smaller exon tends to have the smaller number of mapped reads. We defined a weight saturation curve to represent the confidence level of RPKM with respect to lengths of slices,
i.e. w = 1-exp(-
x/100), where
x is the length of an exon slice (see Additional File
1 Figure S3). The exon slice length confidence level starts from 0 at length = 0, gradually increases to 1.0 from around length = 1000 bp. When length = 500 bp, the length confidence level is already reached to
w≈0.99. If an exon slice is only used for a transcript in the block, the exon slice is called as a discriminative exon slice and is highly weighted if the exon slice length confidence is larger than 0.5. Thus, a discriminative exon slice whose length is very small was not highly weighted; instead, it is lowly weighted due to low slice length confidence. As for splice junctions, the junction length confidence was fixed (
w≈0.42) since we used a fixed length of splice junction (54 bp) and the same saturation curve for the confidence level. Basically, RPKMs for splice junctions were lowly weighted, but discriminative splice junctions were highly weighted. The confidence level for each slice was multiplied to the corresponding column of the structure matrix
A and the vector element of RPKMs so as to construct a weighted non-negative least squares problem. As for discriminative exon slices, the confidence level was increased by
w =
w +
we, where
we = 10 is the additional weight for discriminative exon slices. As for discriminative splice junctions, the confidence level was increased by
w =
w +
ws, where
ws = 2 is the addition weight for discriminative splice junctions. When a transcript has discriminative splice junctions as well as discriminative exon slices, the confidence levels of discriminative splice junctions were not increased so that expression level of the transcript can be determined by more reliable RPKM values of discriminative exon slices.