Recent genotyping, exome and genome sequencing projects have generated an extraordinary list of genetic variants across human populations and diseases. Most of these variants may not have a significant function, and identifying the functional ones remains a major challenge. Ultimately, the biological impact and functionality of these variants need to be examined and validated experimentally. Nevertheless, bioinformatic predictions that can narrow down the search for causal/functional variants are in high demand to guide experimental studies effectively. Here, we presented methods to analyze RNA-Seq data that enable identification of genes and alternatively processed regions whose expression is regulated by genetic variants. Compared to previous methods (e.g. eQTL analysis), our approach uses RNA-Seq data of a single subject to provide insights that were previously obtainable only through massively parallel expression assays of a large number of subjects. In addition, unlike eQTL and other recent ASE studies, our approach not only identifies ASE patterns, but also predicts the functional mechanisms of genetic variants in specific categories of cis-regulation of gene expression. This provides essential information to facilitate discovery of causal variants, as shown by our experimental studies.
The strength of our methods stems from the unique advantages of RNA-Seq. First, RNA-Seq provides mRNA sequence information at single-nucleotide resolution. With enough read coverage, RNA-Seq can potentially interrogate all expressed SNVs of a gene, thereby providing a powerful tool for ASE studies. Second, RNA-Seq quantifies single-nucleotide expression and exon/gene expression simultaneously, so a single data set can provide not only allelic expression of SNVs, but also whole-gene and alternative isoform expression. In this sense, RNA-Seq is a cost-effective approach for studies of genetic control of gene expression.
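To illustrate how allelic read counts at a single heterozygous SNV can be turned into evidence for ASE, the following is a minimal sketch using a two-sided exact binomial test against a 50:50 expectation. This is a generic illustration, not the statistical procedure used in this study (which is described in the Supplementary Methods); the function name, its parameters and the read counts are hypothetical.

```python
from math import exp, lgamma, log


def allelic_imbalance_pvalue(ref_reads, alt_reads, expected_ref_fraction=0.5):
    """Two-sided exact binomial p-value for allelic imbalance at one SNV.

    Hypothetical helper for illustration only; not the authors' method.
    expected_ref_fraction must lie strictly between 0 and 1.
    """
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0  # no reads, no evidence of imbalance
    p = expected_ref_fraction

    def pmf(k):
        # Binomial probability computed in log space to stay stable
        # at high read coverage.
        log_prob = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                    + k * log(p) + (n - k) * log(1 - p))
        return exp(log_prob)

    probs = [pmf(k) for k in range(n + 1)]
    observed = probs[ref_reads]
    # Sum the probabilities of all outcomes that are no more likely
    # than the observed one (small tolerance for floating-point ties).
    total = sum(x for x in probs if x <= observed * (1 + 1e-7))
    return min(1.0, total)


# Made-up read counts for three SNVs: (reference reads, alternative reads)
for ref, alt in [(10, 10), (30, 10), (20, 0)]:
    print(ref, alt, allelic_imbalance_pvalue(ref, alt))
```

In practice, the expected reference fraction may deviate from 0.5 if read-mapping bias is not removed, which is why the parameter is exposed here.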
To demonstrate the utility of the methods, we analyzed RNA-Seq data of two different types of cancer samples. Read-mapping bias of the alternative alleles of SNVs (20–22) was removed using our previously developed mapping strategy (10). Our results suggest that 26–45% of genes demonstrated ASE patterns in the studied cancer cells at an FDR of ~5%. We also demonstrated that the cis-regulatory mechanisms underlying ASE may be inferred from RNA-Seq data for hundreds of genes in each sample at FDRs <20%. We chose parameters in this analysis such that a relatively relaxed FDR was reached, since these predictions provide candidate events that can be further examined for disease relevance (such as in GWAS results) and validated molecularly. Thus, a relatively large initial repository of candidates may be beneficial. Nevertheless, the parameters can be adjusted (e.g. a smaller P-value cutoff of the Fisher’s exact test, Supplementary Methods) if a lower FDR is desired. In addition, approaches other than FDR analysis (e.g. Bonferroni correction) may be used to account for multiple hypothesis testing, which may change the number and statistical stringency of the final results.
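The trade-off between FDR control and Bonferroni correction can be sketched as follows. The example uses the standard Benjamini-Hochberg step-up procedure as a stand-in for FDR analysis; the FDR estimation actually used in this study may differ (see Supplementary Methods). The function names and the P-values are hypothetical and chosen only for illustration.

```python
def bonferroni(pvalues, alpha=0.05):
    """Flag tests that pass a Bonferroni-corrected threshold (alpha / m)."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]


def benjamini_hochberg(pvalues, fdr=0.05):
    """Flag tests rejected by the Benjamini-Hochberg step-up procedure.

    Used here as a generic example of FDR control; an assumption,
    not necessarily the procedure applied in the study.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k with p_(k) <= fdr * k / m.
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= fdr * rank / m:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for i in order[:max_rank]:
        rejected[i] = True
    return rejected


# Made-up P-values, e.g. from per-SNV tests
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]

print("Bonferroni, alpha 0.05:", sum(bonferroni(pvals, 0.05)))
print("BH, FDR 0.05:", sum(benjamini_hochberg(pvals, 0.05)))
print("BH, FDR 0.20:", sum(benjamini_hochberg(pvals, 0.20)))
```

On these made-up values, relaxing the FDR target from 0.05 to 0.20 enlarges the candidate set considerably, whereas Bonferroni correction retains the fewest events.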
The ASE profiles of shared SNVs of the two data sets overlap significantly, despite their substantial difference in the types of cells and diseases involved. This observation confirms that genetic factors play an important role in gene regulation. On the other hand, the prevalence of ASE differs between the two samples, with the breast cancer data showing a much higher percentage of genes with ASE. This difference may be explained by the considerable difference in their genetic backgrounds or, possibly, expression or functional difference of trans-acting factors regulating the ASE patterns. For the small number of SNVs predicted to be under allele-specific regulation at the levels of whole gene or mRNA processing in both samples, their associated categories of potential cis-regulatory mechanisms are highly concordant in the two samples (Supplementary Table S4). This finding indicates that studies of ASE in a specific cell type may be extrapolated to infer functional roles of disease-associated SNPs or mutations even if the cell type is not directly related to the disease. Indeed, we found many common genes between those with allele-specific regulation and those reported in recent GWASs. Combining the two samples, 182 such genes were involved in GWAS disease association, confirming that cis-regulation is an essential aspect in the function of genetic variants.
Although RNA-Seq allows de novo identification of biallelically expressed SNVs, we used known SNVs in the respective samples to avoid complications from RNA editing events (10). Although whole-genome sequencing data are not yet available for many samples, knowledge of heterozygous SNVs can be readily obtained from exome sequencing or microarray analysis. With the extraordinary improvement in high-throughput technologies in recent years, an unprecedented amount of transcriptome sequencing data is becoming available. Bioinformatic analyses that can examine and integrate such data sets to address a wide variety of biological questions are highly desirable. In this study, we demonstrated that analyses of RNA-Seq data revealed a large number of allele-specific events potentially associated with different types of cis-regulatory mechanisms of gene expression. Such studies may provide a solid foundation to facilitate further investigations of the genetic basis of human diseases.