High-throughput sequencing of RNA, known as RNA-seq, is a promising new approach to transcriptome profiling. RNA-seq has a greater dynamic range than microarrays, which suffer from non-specific hybridization and saturation biases. Transcriptional subsequences spanning multiple exons can be directly observed, allowing more precise estimation of the expression levels of splice variants. Moreover, unlike traditional expression arrays, RNA-seq produces sequence information that can be used for genotyping and phasing of haplotypes, thus permitting inferences to be made about the expression of each of the two parental haplotypes of a transcript in a diploid organism.
The first step in RNA-seq experiments is the preparation of cDNA libraries, whereby RNA is isolated, fragmented and synthesized to cDNA. Sequencing of one or both ends of the fragments then takes place to produce millions of short reads and an associated base call uncertainty measure for each position in each read. The reads are then aligned, usually allowing for sequencing errors and polymorphisms, to a set of reference chromosomes or transcripts. The alignments of the reads are the fundamental data used to study biological phenomena such as isoform expression levels and allelic imbalance. Methods have recently been developed to estimate these two quantities separately but no approaches exist to make inferences about them simultaneously to estimate expression at the haplotype and
isoform ('haplo-isoform') level. In diploid organisms, this level of analysis can contribute to our understanding of cis
] and epigenetic effects such as genomic imprinting [2
]. We first set out the problems of isoform level expression, allelic mapping biases and allelic imbalance, and then propose a pipeline and statistical model to deal with them.
Isoform level expression
Multiple isoforms of the same gene and multiple genes within paralogos gene families often exhibit exonic sequence similarity or identity. Therefore, given the short length of reads relative to isoforms, many reads map to multiple transcripts (Table ). Discarding multi-mapping reads leads to a significant loss of information as well as a systematic underestimation of expression estimates. For reads that map to multiple locations, one solution is to distribute the multi-mapping reads according to the coverage ratios at each location using only single-mapping reads [3
]. However, this does not address the problem of inferring expression levels at the isoform level.
Multi-mapping reads. Approximate proportion of reads mapping to multiple Ensembl transcripts or genes in human using 37 bp single-end or paired-end data obtained from HapMap individuals.
Essentially, the estimation of isoform level expression can be done by constructing a matrix of indicator functions Mit = 1 if region i belongs to transcript t. The 'regions' may for now be thought of as exons or part exons, though we later define them more generally. Using this construction it is natural to define a model:
are the (unobserved) counts of reads from region i
of transcript t
is a normalization constant used when comparing experiments, μt
is a parameter representing the expression of transcript t
is the effective
length of region i
(that is the number of possible start positions for reads in the region). This model can be fit using an expectation maximization (EM) algorithm, since the Xit
are unobserved but their sums across transcripts
This model has been used by [4
] in their POEM software, with i
representing exons. Their method does not use reads that span multiple exons or reads that map to multiple genes. The same model has been used in [5
], with i
representing exons or part exons, or regions spanning exon junctions, enabling good estimation of isoform expression within genes. They do not, however, include reads mapping to multiple genes. The RSEM method [6
] employs a similar model, but models the probability of each read individually, rather than read counts. This method allows reads to come from multiple genes as well as multiple isoforms of the same gene. The modeling of individual reads allows RSEM to accommodate general position-specific biases in the generation of reads. However, two recent papers [7
] have shown that deviations from uniformity in the generation of reads are in great part sequence rather than position-dependent for a given experimental protocol and sequencing platform. Furthermore, the computational requirements of modeling individual reads increasing proportionately with read depth, which, in the case of RSEM, is exacerbated further by the use of computationally intensive bootstrapping procedures to estimate standard errors. None of the above methods are compatible with paired-end data. A recently published method, Cufflinks [9
], focuses on transcript assembly as well as expression estimation using an extension of the [5
] model that is compatible with paired-end data. However, this method does not model sequence-specific uniformity biases and uses a fixed down-weighting scheme to account for reads mapping to more than one transcription locus, meaning that the abundances of transcripts in different regions are estimated independently.
Studies of imbalances between the expression of two parental haplotypes have mostly been restricted to testing the null hypothesis of equal expression between two alleles at a single heterozygous base, typically with a binomial test [1
]. However, as transcripts may contain multiple heterozygotes, a more powerful approach is to assess the presence of a consistent imbalance across all the heterozygotes in a gene together. This has been done on a case-by-case basis using read pairs that overlap two heterozygous SNPs [11
] while [12
] propose an extension to the binomial test for detecting allelic imbalance that takes into account all SNPs and their positions in a gene. However, this approach, which is a statistical test rather than a method of quantifying haplotype-specific expression, assumes imbalances to be homogeneous along genes and thus does not take into account the possibility of asymmetric imbalances between isoforms of the same gene.
Allelic mapping biases
Aligners usually have a maximum tolerance threshold for mismatches between reads and the reference. Reads containing non-reference alleles are less likely to align than reads matching the reference exactly, so genes with a high frequency of non-reference alleles may be underestimated. Ideally, aligners would accept ambiguity codes for alleles that segregate in the species (cf. Novoalign [13
]), but no free software is currently able to do this. A possible workaround is to change the nucleotide at each SNP to an allele that does not segregate in the species, as has been proposed to remove biases when estimating allelic imbalance [10
]. However, in the context of gene expression analysis, this leads to even greater underestimation of genes with many non-reference alleles and an increase in incorrect alignments to homologous regions. Instead, we propose aligning to a sample-specific transcriptome reference, constructed from (potentially phased) genotype calls.
In this paper we present a new pipeline, including a novel statistical method called MMSEQ, for estimating haplotype, isoform and gene specific expression. The MMSEQ software is straightforward to use, fully documented and freely available online [14
] and as part of ArrayExpressHTS [15
]. Our pipeline exploits all reads that can be mapped to at least one annotated transcript sequence and reduces the number of alignments missed due to the presence of non-reference alleles. It is compatible with paired-end data and makes use of inferred insert size information to choose the best alignments. Our method permits estimating the expression of the two versions of each heterozygote-containing isoform ('haplo-isoform') individually and thus it can detect asymmetric imbalances between isoforms of the same gene. Our software further takes into account sequence-specific deviations from uniform sampling of reads using the model described in [8
] but can flexibly accommodate other models. We validate our method at the isoform level with a simulation study, comparing our results to RSEM's, and applying it to a published Illumina dataset consisting of lymphoblastoid cell lines from 61 HapMap individuals [16
]. We validate our method at the haplo-isoform level by showing we can deconvolve the expression estimates of haplo-isoforms on the non-pseudoautosomal (non-PAR) region of the X chromosome using a pooled dataset of two HapMap males. We further apply our method to a published dataset of F1
initial and reciprocal crosses of CAST/EiJ (CAST) and C57BL/6J (C57) inbred mice [2
] and demonstrate that MMSEQ is able to detect parental imbalance between the two haplotypes of each isoform.