Genetic variation in gene expression is an important determinant of human phenotypic variation; a number of studies have elucidated genome-wide patterns of heritability and population differentiation and are beginning to unravel the role of gene expression in the etiology of disease1-5
. Interrogation of the transcriptome in these studies has been greatly facilitated by the use of microarrays, which quantify transcript abundance by hybridization. However, microarrays possess several limitations and recent advances in transcriptome sequencing in second generation sequencing platforms have now provided single nucleotide resolution of gene expression providing access to rare transcripts, more accurate quantification of abundant transcripts (above the signal saturation point of arrays), novel gene structure, alternative splicing and allele-specific expression6-14
. Although RNA-Seq studies have addressed issues of transcript complexity, they have not yet addressed how genetic studies can benefit from this increased resolution to reveal novel effects of sequence variants on the transcriptome.
To understand the quantitative differences in gene expression within a human population as determined from second generation sequencing, we sequenced the mRNA fraction of the transcriptome of lymphoblastoid cell lines (LCLs) from 60 CEU individuals (from CEPH - Centre d’Etude du Polymorphisme Humain) using 37-bp paired-end Illumina sequencing. Each individual’s transcriptome was sequenced in one lane of an Illumina GAII analyzer and yielded 16.9±5.9 (mean ± SD) million reads which were then mapped to the NCBI36 assembly of the human genome (Supp. Fig 1) using MAQ16
. We subsequently filtered reads which had low mapping quality, mapped sex chromosomes or mitochondrial DNA and were not correctly-paired which yielded 9.4±3.3 million reads. On average, 86% of the filtered reads mapped to known exons in EnsEMBL version 54)17
and 15% of read pairs spanned more than one exon. Evaluation of sequence and mapping quality measures was preformed in order to ensure that the data quality is acceptable for analysis (Supp. Fig 2 – see also methods).
We quantified reads for known exons, transcripts and whole genes. Read counts for each individual were scaled to a theoretical yield of 10 million reads and corrected for peak insert size across corresponding libraries. Each quantification was filtered to exclude those with missing data for >10% of the individuals. For exons, this resulted in data for 90,064 exons for 10,777 genes. Of these, 95% had on average more than 10 reads, 38% more than 50 reads and 20% had a mean quantification of ≥100 reads (Supp. Fig 3). For transcript quantification, new methods needed to be developed to map reads into specific isoforms18, 19
. We developed a methodology, called the FluxCapacitor, to quantify abundances of annotated alternatively spliced transcripts (see Methods). Using this method, we obtain relative quantities for 15967 transcripts from 11674 genes. For each individual, we compared whole gene read counts to array intensities generated with Illumina HG-6 version 2 microarrays. Correlations coefficients between RNA-Seq and array quantities and among RNA-Seq samples were high and consistent with previous studies20
(Supp. Fig 4 and 5). Finally, we explored whether the correlation structure of abundance among exons could facilitate the development of a framework that will allow the imputation of abundance values for exons that are not screened, given a set of reference RNA-Seq samples. This is the same principle as using the correlation structure (Linkage Disequilibrium) of genetic variants to impute variants from a reference to any population sample of interest21
. For each of the 10,777 genes, we assessed the pairwise correlation of all exons and on average, any two pairs of exons within a gene were moderately correlated (mean Pearson’s correlation R2
= 0.378 ± 0.261) (Supp. Fig. 6). This correlation increased with increase in total number of reads present in each exon. It is worth noting that the average correlation coefficient between SNPs within the same recombination hotspot interval in HapMap3 is r2
= 0.326±0.174, suggesting that the correlation structure within genes is stronger and likely more accessible by imputation methodologies than SNPs, however this needs to be assessed in a tissue-specific context.
Association of gene expression measured by RNA-Seq with genetic variation was evaluated in cis
with the use of 1.2 million HapMap3 SNPs (methods previously described22
). We evaluated association in exons, transcripts and genes and determined the unique number of genes containing a significant association through permutation23
(). RNA-Seq eQTLs, significant at 0.001 and 0.01 permutation thresholds, replicate significantly (46% for 0.01 and 81% for 0.001) in the array data for the same SNP-gene combinations, as indicated by the enrichment in low p-values () and the effect sizes are of very similar magnitude (Supp. Fig. 7). Overall, the number of genes with eQTLs eQTL at the 0.01 permutation threshold using exon quantification was higher than the number of genes discovered by arrays for the same sample of individuals (836 genes vs. 539 genes at the 0.01 permutation threshold), even when normalized for the number of genes tested (Supp. Table 1), suggesting that increased resolution contributed to the identification of larger number of genetic regulatory effects. The RNA-Seq exon eQTLs were mainly enriched in the higher abundance classes relative to array eQTLs and whole gene eQTLs (Supp. Fig. 8-10). This is likely due to two reasons: (i) exons capture genetic effects in splicing complexity, which is higher in higher abundance genes (Spearman Rank Correlation between abundance and number of transcipts in EnsEmbl, p < 2.2 × 10−16
); (ii) saturation of intensity signal above a certain abundance level in arrays but not in RNA-Seq data. RNA-Seq exon eQTLs have lower representation in low abundance genes suggesting that rare transcripts are not well quantified at this level of coverage. Finally, we performed eQTL analysis of 102 well-quantified long noncoding RNAs (not overlapping any known protein-coding gene, see Methods), and found 6 with significant eQTLs (), highlighting that regulatory variation extends beyond well characterized protein coding genes.
Table 1 eQTL Discoveries. eQTL discoveries for genes, transcripts, exons, splicing events and long non-coding RNAs for each of the 2 sequencing based quantifications (by-transcript and by-exon) and matching array samples are shown using Spearman Rank correlation. (more ...)
Array association p-values for RNA-Seq significant eQTLs
To replicate our eQTL discoveries, we compared associations between our study and those obtained from sequencing the transcriptomes of an African population (Pickrell, submitted for the same issue)24
. We assessed the p-value distribution of matching CEU associations given the top associated SNP for 500 genes from the African population (Supp. Fig. 11). We estimated that ~33% of these signals were shared (p < 0.0001 assessed by permutations). This result indicates the robustness of the eQTL discovery of the two transcriptome sequencing based studies and, given the degree of differentiation of the two populations, the magnitude of replication is consistent with past array studies for the same samples22
As previously observed22
, we have detected enrichment of eQTLs around the transcription start site (TSS) (Supp. Fig 12). We have further investigated the discovery rate and distribution of eQTLs given an exon’s location in multi-exonic genes. We identified increased number of discoveries for the first, second and last exon compared to any middle exons (). We find that we make more discoveries for the last exon than for the first exon. When we assess the distribution of significant eQTLs around the 5′ end of the exon of interest, we find that significant eQTLs when found associated with the last exon are closer to the last exon than any other exon followed by first exons, second exons and middle exons (Supp. Fig. 13). This is consistent with our understanding of expression modulating effects within the 3′ UTR and upstream region of genes25
Exon eQTLs by exon relative location
Transcriptome sequencing allows the quantification of allele-specific expression (ASE)26-28
. We found an average of 4000 heterozygote confirmed HapMap3 SNP positions per individual, which could be used to assess ASE. Of these we assessed the proportion where both alleles were detected in the sequencing of that individual as a function of mapping quality using Samtools (Supp. Fig. 14)29
. At MAQ mapping quality 10, we find that 72% of heterozygote sites have from both alleles detectable at least once. As expected, this fraction slightly reduces with increasing mapping quality. Furthermore, we find 41% of the heterozygotes have more than 6 reads. We tested for ASE after correcting for reference to non-reference differential mapping for each library (Supp. Fig. 15). We tested the relationship between eQTLs and ASE by first phasing double heterozygotes for both. We found that as the number of reads increased, the correlation between the eQTL effect size and the strength of ASE increased (Supp. Fig 16). Reads were then summed across individuals to assess the 1-sided ASE binomial p-value distribution with respect to eQTL phasing. We found that for 0.01 and 0.001 significant eQTLs, the tail of the ASE p-value distribution was enriched. For exons without eQTLs, both tails of this distribution were enriched (Supp Fig 17), which highlights the presence of other non-genetic or rare genetic factors that affect ASE.
To investigate if ASE signals could be marking recent rare eQTLs, undetected through standard genotypic association, we selected SNPs heterozygous in 6 or more individuals in exons with no evidence for an eQTL (exons not significant at permutation threshold of 0.05), and examined patterns of haplotype homozygosity between individuals that shared a significant ASE signal (at p<0.05) with those that did not. Haplotype homozygosity assesses the length of perfectly shared alleles on a haplotype as a proxy for the age of a haplotype30
. We calculated haplotype homozygosity comparing those haplotypes that had an ASE signal to each other and then separately with those that did not have an ASE signal and found greater haplotype homozygosity for haplotypes sharing a common ASE signal (). This differentiation was highly significant when only 2-3 individuals had significant ASE (Wilcoxon Paired test, p = 0.00039) and disappeared when 4 or more individuals had significant ASE (Wilcoxon Paired test, p = 0.55), consistent with the idea that these rare ASE effects are a result of recent and rare eQTL variants. We also assessed the direction of effect for these potential rare eQTL haplotypes and found no significant bias in the direction of effect for new mutations (48.5% increased expression for 2-3 individuals compared to 47.1% for haplotypes shared in 4 or more individuals). These results highlight the potential of using second generation sequencing to identify rare regulatory haplotypes.
Haplotype homozygosity for shared ASE haplotypes versus shared and unshared ASE haplotypes
We have further investigated features of the genetic basis of alternative splicing. First, we performed association between known variants affecting splicing signals with their respective genes and exons; in total, we tested 963 variants for 788 genes. We compared associations for gene RNA-Seq quantification and arrays and found similar enrichment (8.30% versus 8.51% true positives). We stratified splice variants in donor and acceptor variants and tested against abundance of exons 5′ and 3′ to the intron they are residing. For donor variants we found a large enrichment (3.17 fold) of associations with the 5′ exon relative to the 3′ exon, while for acceptor variants we found large enrichment of associations with the 3′ exon relative to the 5′ exon (7.02 fold), consistent with them affecting the inclusion/exclusion of the associated exon in the mature transcript. We further hypothesize that if genetic variants are effecting transcript-specific expression, we should be able to detected heterogeneity in the transcript distribution found between chromosomes within an individual. To verify this hypothesis we tested for heterogeneity in paired-end insert sizes, used as proxy to heterogeneity in the transcript distribution. We compared reads over one allele relative to the other in significant ASE SNPs vs. non-significant ASE SNPs for positions with at least 50 reads to have adequate comparable transcript distributions, which resulted in 901 heterozygote positions. We found a significant enrichment in transcript distribution (insert size) heterogeneity (KS p-value < 0.05) over significant ASE SNPs relative to non-significant ASE SNPs (Supp Fig 18 and example in ). Of the heterozygotes, 235 were significant for ASE and of those 105 had significant transcript distribution heterogeneity; this corresponded to 72 of 105 genes that contained an ASE significant heterozygote. Visual inspection suggested that such heterogeneity is driven both by differential structure in internal exons as well as alternative 3′ends of genes. Genotypic associations with mean insert size and 3′ ends of genes showed enrichment in low p-values suggesting the presence of genetic variants affecting such processes (Suppl Figures 19 and 20). Finally, we assessed the effect of genetic variants on events that contribute to alternative isoforms (e.g. inclusion/exclusion of exons) derived from the FluxCapacitor quantification. We found that of 6600 quantified events 110 are significant at the 0.01 permutation threshold (). Of these 41% were exon skipping, 17% were due to an alternative acceptor, 13% were double or triple exon skipping, 6% were alternative donors, 5% were mutually exclusive exons and 5% were retained introns. This analysis suggests extensive genetic variation in the determination of isoform diversity and transcript structure, which is expected to have direct consequences in protein sequence diversity.
Allelic alternative splicing effects
Our study and ref 24
have described the first attempts of interrogation of genetic effects on the transcriptome using 2nd
generation sequencing technologies. The increasing accessibility of sequencing has increased our ability to resolve new features of regulatory complexity. We have confirmed the feasibility of interrogating eQTLs in population transcriptome sequencing and have discovered more eQTLs than with array data for the same population sample. Furthermore,, despite relatively low sequencing depth, the association signals were well replicated across populations. We have also identified the potential and power of such studies in resolving rare regulatory haplotypes. Finally, we have uncovered a variety of genetic effects influencing isoform abundance and transcript structure. As sequencing technologies continue to increase the depth and breadth of the interrogation of the genome and the transcriptome, it is anticipated that our understanding of finer scale cellular processes will become more detailed and robust.