We started by assembling a set of sequence variants from the 1000 Genomes Project for the NA12878 individual. We then generated deeply sequenced ChIP-Seq data sets for cFos, cMyc, JunD, Max, and RNA Polymerase II for the GM12878 cell line. We also created a matching deeply sequenced RNA-Seq data set for the same cell line. We combined these data sets with previously published matching data sets for RNA Polymerase II, RNA Polymerase III, NfκB, and CTCF (Kasowski et al, 2010; McDaniell et al, 2010; Raha et al, 2010). We summarize these data sets in (see Materials and methods for further details).
GM12878 RNA-Seq and ChIP-Seq data sets
Determining allele-specific behavior from functional genomic data alone
Intuitively, if one has performed a deeply sequenced functional genomic experiment such as RNA-Seq or ChIP-Seq from a single individual, it should be possible to determine allele-specific behavior solely from the sequences obtained. The first step in this approach would be to determine the SNPs and other sequence variants directly from the sequence reads obtained. This might be feasible for certain regions sequenced at great depth; however, functional genomic data (e.g., reads from a ChIP-Seq experiment) cover the genome with greatly varying sequence depth due to the nature of the functional assay. Thus, the accuracy of SNP (and other variant) calling from functional genomics data will necessarily vary across the genome. In contrast, the average sequencing depth across the genome for conventional genomic DNA sequencing is nearly uniform (with some differences due to repeated regions and compositional biases).
We find that the accuracy of de novo SNP calling using reads from a functional genomic sequencing experiment such as RNA-Seq with a standard SNP caller package (e.g., SNVmix; Shah et al, 2009) is not sufficient for determining allele-specific behavior (see Supplementary Table 1 for the results of de novo SNP calling of heterozygous SNPs). Any significant amount of miscalling of heterozygous SNPs will lead to ill-determined allele-specific behavior. There are a number of possible explanations for such miscalls. First, the very events we would like to find, SNPs within regions showing ASE, could appear as homozygous using only the RNA-Seq sequence reads: if one experimentally only obtains sequences from a region that is expressed on one allele (due to ASE), then there is no way to know that any base within that region is polymorphic. Second, RNA editing could also lead to variations in RNA sequences that are not present at the DNA level. Finally, sequencing of RNA involves additional experimental steps, such as the use of reverse transcriptase, that can increase the chance of mis-sequencing.
Obviously, determining short indels from sequenced functional genomic data would be even harder and SVs would be nearly impossible. Thus, while it might be possible to determine certain sequence variants from the functional genomic sequence reads, in order to generate a comprehensive set of polymorphic sites as well as other forms of sequence variation it is necessary to have an independently determined set from sequenced genomic DNA (such as from the 1000 Genomes Consortium).
Building an individual diploid reference genome for NA12878
It might not seem obvious, but for a number of reasons, reconstructing a diploid personal genome sequence and using it instead of the reference genome is a critical step preceding allele-specific analysis. First, using the reference genome introduces biases in read mapping: reads originating from the non-reference allele are more susceptible to mismapping since, when aligned to the reference allele, they contain at least one mismatch (in the case of SNPs) or gap (in the case of indels). This is the reference bias effect; that is, the two alleles are not treated equally by default. Second, expression or binding in regions of genomic SV could be misinterpreted as ASE or ASB. For example, duplication of an allele in the studied genome will double the binding signal for that allele, while the signal for the allele on the other haplotype will be unchanged. Last, but not least, SNP calling in regions of SV is likely to be less precise and to contain more false positives compared with non-SV regions (The 1000 Genomes Project Consortium, 2010). Thus, we construct a personal diploid genome of NA12878 (see Materials and methods) by utilizing genomic variations (see for summary statistics) determined in the framework of The 1000 Genomes Project Consortium (2010) and, additionally, SVs determined by sequencing of fosmid clones (Kidd et al, 2008).
Statistics on variants used to construct personal genome of NA12878
To accomplish this, we have developed a tool, vcf2diploid, for personal genome construction, which constitutes the first part of the AlleleSeq pipeline (see ). The tool uses as input VCF files with all the SNPs, indels, and SVs available for an individual of interest and outputs fasta sequences for each allele for each chromosome, along with equivalence map files (see and Supplementary Figure 1) that map nucleotide positions between paternal, maternal, and reference haplotypes. It is important to be able to map annotation (i.e., genes) from the reference genome to the personal genome sequences. This is done using chain files, which facilitate the mapping of annotated regions between genomes using the liftOver tool (Rhead et al, 2010). This is particularly important for RNA-Seq, where we also build maternal and paternal versions of the gene annotation (including, most importantly, the splice-junction library) by mapping the GENCODE annotation (GENCODE 3c annotation is available from the UCSC Genome Browser; Harrow et al, 2006) onto the personal diploid genome.
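The core idea of personal genome construction can be illustrated with a minimal sketch. It is not the vcf2diploid implementation: it handles only phased SNPs (so coordinates are unchanged and no equivalence map is needed), and the function name `build_haplotypes` and the tuple layout of the input are assumptions made for illustration.

```python
def build_haplotypes(reference, snps):
    """Apply phased SNPs to a reference sequence.

    reference: reference sequence as a string.
    snps: list of (pos, ref_base, paternal_base, maternal_base) tuples,
          with 0-based positions (hypothetical layout, not the VCF format).
    Returns (paternal_sequence, maternal_sequence).
    """
    paternal = list(reference)
    maternal = list(reference)
    for pos, ref_base, pat_base, mat_base in snps:
        # Guard against variants that do not match the reference build.
        if reference[pos] != ref_base:
            raise ValueError(f"reference base mismatch at position {pos}")
        paternal[pos] = pat_base
        maternal[pos] = mat_base
    return "".join(paternal), "".join(maternal)


def count_differences(reference, haplotype):
    """Number of positions at which a SNP-only haplotype differs from the reference."""
    return sum(1 for r, h in zip(reference, haplotype) if r != h)
```

Indels and SVs shift coordinates between the haplotypes and the reference, which is why the real tool also has to emit equivalence maps and chain files; the sketch deliberately leaves that part out.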
Figure 1 (A) Construction of a personal genome by the vcf2diploid tool is made by incorporating personal variants into the reference genome. Personal variants may require additional pre-processing, that is, filtering, genotyping, and/or phasing. The output is the ...
The constructed diploid genome of NA12878 differed from the reference in 3 962 637 (~0.14%) bases for the paternal and in 3 947 162 (~0.14%) bases for the maternal alleles. The software package to perform personal genome sequence construction (the vcf2diploid tool and associated source code), the actual diploid sequence for NA12878, splice-junction sequences and personalized gene annotation for NA12878, and the corresponding equivalence maps (between the maternal and paternal sequences as well as the reference genome, NCBI36/hg18) are available from http://alleleseq.gersteinlab.org. The diploid sequence for NA12878 is a valuable resource for anyone performing any sequence-based analysis on this genome. The GM12878 cell line is a primary tier one cell line under detailed investigation by the ENCODE Consortium. It should also be noted that a constructed personal genome is only as good and as complete as the variants used in its construction. In light of this, the diploid genome of NA12878 that is presented here is not perfect, but we believe it is the best possible sequence to date since it includes the most comprehensive set of variants. We intend to update this assembly as a resource as sequence variants are determined even more accurately.
In order to assess the effect on functional genomic data of using the maternal and paternal sequences rather than the reference genome sequence, we aligned the reads from the Pol II and CTCF ChIP-Seq data for GM12878 against each of the three sequences using BOWTIE (Langmead et al, 2009; see Supplementary Figure 2). In , we compare the Pol II reads that align to each of the three genome sequences (reference, maternal, and paternal haplotypes). We observe that, allowing up to two mismatches, more reads (0.3% for paternal and 0.4% for maternal) align to the NA12878 haplotype sequences than to the reference genome sequence (NCBI36). The major difference in numbers between the paternal/maternal and reference haplotypes is due to reads that map to one haplotype but not the other. Namely, only about 0.1–0.2% of reads that map to the reference cannot be mapped to the paternal or maternal haplotype, while a significantly higher fraction of reads (~0.5%) map to the paternal or maternal genome and cannot be mapped to the reference. For the paternal and maternal haplotypes, unmapped reads and reads with different mapping locations contribute roughly equally to the differences in overall mapping, presumably mostly due to short indels and SVs. We see similar results for the same analysis applied to the reads for CTCF ChIP-Seq (see Supplementary Table 2). This demonstrates that it is important to use a correctly assembled personal genome for aligning reads when performing an allele specificity analysis.
Comparison of read mappings to reference genome and paternal and maternal haplotypes of GM12878
Similarly, transcription factor binding sites also overlapped more when aligned to the maternal and paternal genomes of NA12878, rather than the reference sequence. For this comparison, we used the set of independently mapped reads for all three genome sequences to determine binding sites using PeakSeq (Rozowsky et al, 2009), and performed a pair-wise nucleotide overlap of the binding sites between the three genome sequences (Supplementary Table 3). In addition, we observe that the differences in binding sites, among the three genomes, are greater than the underlying differences in read mapping.
Determining ASE and ASB
The second part of the AlleleSeq pipeline determines ASE using RNA-Seq data and ASB using ChIP-Seq data. After the maternal- and paternal-derived haploid sequences are constructed, sequenced reads are aligned against the maternal and paternal sequences separately using BOWTIE (Langmead et al, 2009). Locations of mapping are determined by selecting the best alignment to both genome sequences. Read counts for the maternal and paternal alleles are then generated at each heterozygous SNP nucleotide position, and ASE/ASB events are reported by applying a binomial test followed by correction for multiple hypothesis testing. SNP positions that by read-depth analysis (Abyzov et al, 2011) are determined to be potentially in a CNV (the read depth of genomic DNA reads in a 1-kb window around the SNP is either <1 or >3) are excluded (see Materials and methods). We correct for multiple hypothesis testing by estimating the false-discovery rate (FDR) by explicit simulation of the number of false positives given an even null background (i.e., no allele-specific events); see for a schematic of the second part of the pipeline (see Materials and methods for further technical details). We also align reads to the maternal and paternal splice-junction libraries and determine splice-junction ASE SNPs in a similar way.
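The testing step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the AlleleSeq code: the two-sided definition of the binomial P-value, the function names, the number of simulation rounds, and the way the FDR is formed (mean number of null hits divided by observed hits) are all choices made here for illustration; the actual procedure is described in Materials and methods.

```python
import random
from math import comb


def binomial_two_sided_p(k, n, p=0.5):
    """Two-sided binomial P-value: total probability of outcomes
    that are no more likely than the observed count k."""
    probs = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = probs[k]
    # Small tolerance so that ties are not lost to floating-point error.
    return min(1.0, sum(q for q in probs if q <= observed * (1 + 1e-9)))


def simulated_fdr(counts, threshold, n_sim=100, seed=0):
    """Estimate the FDR at a P-value threshold by simulating an even null.

    counts: list of (maternal_reads, paternal_reads) per heterozygous SNP.
    Each simulation keeps the depth of every SNP and redraws the maternal
    count with probability 0.5 per read (assumed null model).
    """
    rng = random.Random(seed)
    observed_hits = sum(
        1 for m, p in counts if binomial_two_sided_p(m, m + p) <= threshold
    )
    if observed_hits == 0:
        return 0.0
    null_hits = 0
    for _ in range(n_sim):
        for m, p in counts:
            n = m + p
            k = sum(rng.random() < 0.5 for _ in range(n))
            if binomial_two_sided_p(k, n) <= threshold:
                null_hits += 1
    return (null_hits / n_sim) / observed_hits
```

In practice one would scan a range of thresholds and keep the largest one whose estimated FDR stays below the desired level.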
Results for GM12878 RNA-Seq and ChIP-Seq data
We start our study of allele-specific phenomena by first focusing on analyses of individual events that occur within a single experimental data set. We then analyze the coordination between binding and expression in a pair-wise manner using direct correlation. Finally, we investigate higher order coordination of ASB and ASE using a regulatory network framework that allows us to study the agreement between multiple regulatory interactions and target expression simultaneously.
We summarize the results of the AlleleSeq pipeline applied to the RNA-Seq data and ChIP-Seq data for GM12878 in . In the upper half of the table, we present the results for all the autosomes and in the lower half for chromosome X (with all the transcription factors combined). In the second column of , we list the number of genomic elements (genes, novel TARs, and binding sites) that are accessible for allele-specific behavior, that is, those in which we could have detected allele-specific activity. This is the set of genomic elements that contain at least one heterozygous SNP and are sequenced at sufficient depth to detect allele-specific activity (see Materials and methods for further details). The third column shows the number of genomic elements that we determine to show allele-specific behavior. The fourth column shows the fraction of genomic elements accessible for allele-specific behavior that do show either ASE or ASB. Finally, for allele-specific events that can be phased, we report those that are maternal or paternal.
List of ASE and ASB events for each data set (a) only autosomes (b) only chr X
We observe that ~19.4% of all autosomal GENCODE genes that are accessible for allele-specific behavior exhibit ASE. We similarly find that 21.6% of accessible heterozygous SNPs within splice junctions of genes also show ASE. We also find that 9.3% of autosomally expressed accessible novel TARs show ASE; we expect this number to be lower than for genes, as novel TARs correspond to exons of genes. For genes that contained two or more heterozygous SNPs showing allele-specific behavior, we found that >81% of the time all the SNPs showed consistent ASE from the same allele (significantly greater than expected by chance); some of the exceptions could be due to allele-specific alternative splicing. For the transcription factors, the fraction of accessible autosomal binding sites that exhibit allele-specific behavior varies between 2% (for cMyc) and 11% (for Pol II). A possible explanation for the difference between binding and expression allele specificity is that even though these ChIP-Seq data sets have been sequenced quite deeply (see ), in order to have comparable power with the RNA-Seq data one would need to sequence an order of magnitude or two further. The number of overlapping sequence reads in binding site peaks for ChIP-Seq data depends on the efficiency of the antibody used, and for most ChIP-Seq data sets we do not have sufficient read depth within a binding site as compared with the read depth within exons of highly expressed genes. As expected, when restricted to the autosomes, both ASE for genes and novel TARs and ASB for all the transcription factors and polymerases are evenly divided between the maternal and paternal alleles.
In the lower half of , we present similar results for chromosome X. Since NA12878 is female there are two copies of chromosome X. We first observe that almost 80% of the accessible genes on chromosome X exhibit allele-specific behavior and 95% of these are expressed on the maternal copy. This is consistent with our knowledge of X-chromosome inactivation (Lyon, 1961; Goto and Monk, 1998), where one of the two copies is shut off. There are four genes on chromosome X that show ASE on the paternal copy; however, all of these are located in the pseudo-autosomal region of chromosome X, which is known to escape X-chromosome inactivation (these include Xist, SLC25A6 and SFRS17A). We observe similar results on chromosome X for the allele-specific behavior of novel TARs as well as transcription factor binding, where a greater fraction of sites exhibit allele-specific behavior compared with the autosomes and almost all are on the maternal copy. It is interesting to note that most of the novel transcription and binding that shows paternal allele specificity is also in the pseudo-autosomal region, similar to Xist, and possibly has an associated functional role.
We make available the complete list of SNPs that show ASB or ASE in VCF format as a resource from our website http://alleleseq.gersteinlab.org. We imagine that this file may be a useful resource for researchers interested in focusing on allele-specific sites in less deeply sequenced functional genomic experimental data sets. Even if the functional genomic experiment was not performed on the GM12878 cell line, these sites might be valuable as candidate regions to investigate for allele-specific behavior.
Figure 2 For each heterozygous SNP location covered at a depth greater than six, we can compute the fraction of reads derived from the alternative allele relative to the reference sequence. We then plotted the distribution of alternative allele fraction for all ...
There are a number of technical issues associated with determining allele-specific behavior including various biases that can be introduced as part of the analysis. We investigate some of these potential effects in detail below:
- Reference bias: In order to assess whether our pipeline has some residual bias toward the reference allele versus the alternate, we can plot the fraction of reads from the alternative allele for each heterozygous SNP location sequenced sufficiently deeply. If there were no bias, we would expect that this distribution would be symmetric having as many reference allele-specific locations as for the alternate allele. In , we plot the alternative allele fraction distribution for the RNA-Seq data, Pol II, and the other transcription factors combined. We first observe that the overall distributions are fairly symmetric and that the allele-specific events (indicated in blue) are also symmetric—this indicates that there is no residual bias toward or against the reference allele. In Supplementary Figure 3, we observe similar distributions for the fractions of reads from the maternal allele for each heterozygous SNP location that could be phased and that was sequenced sufficiently deeply in the appropriate assay.
- Correlation with SNP quality (genotype likelihood scores): SNPs determined by The 1000 Genomes Project Consortium (2010) each have a genotype likelihood score. In Supplementary Figure 4, we plot the distribution of all heterozygous SNPs and the subset of ASE SNPs versus this genotype likelihood score. We see a slight enrichment of ASE SNPs with lower scores; however, the majority of SNPs from both distributions have the highest possible score. For comparison, non-synonymous SNPs also show a similar distribution.
- Relation to genome duplications (effect of Degner et al, 2009): It has been reported by Degner et al (2009) that heterozygous sites showing apparent allele-specific behavior can be caused by regions in the genome that have been duplicated. Thus, even though there might be a similar number of reads coming from each allele, only one of the alleles would have uniquely mapping reads, which would lead to seemingly allele-specific behavior (see Supplementary Figure 5 for a schematic comparing a region without a duplication to regions that have been duplicated). In order to assess the size of this effect on our results, we used the Pol II ChIP-Seq reads mapped uniquely and independently to each of the maternal and paternal genomes. This is as opposed to the normal mapping procedure in the AlleleSeq pipeline, where we map independently to both haplotypes and then select the allele with the better mapping location. At each heterozygous SNP location determined to show ASB we computed the haplotype fraction, the fraction of reads mapped to one allele over the sum of reads mapped to both alleles (we choose the haplotype fraction corresponding to the allele that has the greater fraction, see Supplementary Figure 5). For sites that have not been duplicated the haplotype fraction should have a value close to 0.5, while for duplicated regions exhibiting the Degner effect the fraction would be close to 1 (where all the uniquely mapped reads would come from one allele). In Supplementary Figure 6, we plot the distribution of haplotype fractions for all Pol II ASB sites. We observe that only a minority of the sites (<15%) shows a significant skew in the haplotype fraction toward one haplotype (a fraction >0.6). As valid ASB sites that contain additional proximal sequence variants (such as additional SNPs or indels) would also exhibit a fraction biased toward one haplotype, we conclude that this is an upper bound on the size of this effect and do not consider it significant.
- Modified binomial test: In order to assess the effect of aligning reads against the constructed diploid genome sequence for NA12878 versus using the reference genome sequence, we performed the following analysis. For the RNA-Seq reads, we also aligned the reads against the reference genome and determined ASE using an even binomial distribution (threshold applied in a similar manner). As an additional comparison, we also applied the methodology of Montgomery et al (2010), where a skewed binomial distribution is used with the reads aligned against the reference genome (they compensate for the reference bias induced by mapping to the reference genome by modifying the binomial distribution). Similar to , we plotted the distribution of all expressed heterozygous SNPs (ASE SNPs in blue) for these two methods in Supplementary Figure 7. Using the naive methodology with an even binomial, we see a skew of the ASE SNPs toward the reference genome, which is largely removed using the modified binomial test. When comparing against our set of 5862 ASE SNPs determined using the personal genome, we find that only 69% are shared with those determined using the naive approach. Using the modified binomial methodology from Montgomery et al (2010), we see an improvement (75% in common); however, a significant number of ASE sites detected using the correct diploid genome are still missed when aligning to the reference genome and only modifying the binomial test.
- Comparison with McDaniell et al (2010): We have also compared the ASB sites for the CTCF ChIP-Seq data from the AlleleSeq pipeline against those from McDaniell et al (2010). Restricting the comparison to heterozygous SNPs common to the two analyses (McDaniell et al, 2010 used an earlier set of SNP calls for NA12878 from The 1000 Genomes Project Consortium, 2010) and using a P-value threshold of 0.01 on their results, we find that >85% of the ASB SNPs are in common.
- Allele-specific elements using heterozygous indels: The AlleleSeq pipeline determines allele-specific behavior for genomic elements (transcribed regions or binding sites) that contain heterozygous SNPs. It is also possible to determine allele-specific behavior for genomic elements that contain a heterozygous indel. In Supplementary Table 4, we show the results for transcribed exons and novel TARs as well as binding sites for Pol II and CTCF that can be determined to show allele-specific behavior using a heterozygous indel to distinguish the haplotypes.
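Two of the diagnostics in the list above reduce to simple ratios of read counts. The sketch below shows one way they could be computed; the function names and input layout are assumptions for illustration, while the depth cutoff (greater than six) and the 0.6 skew cutoff are the values quoted in the text.

```python
def alt_allele_fractions(sites, min_depth=7):
    """Alternative-allele fraction per heterozygous SNP (reference-bias check).

    sites: list of (reference_reads, alternative_reads).
    Only sites covered at a depth greater than six are kept.
    """
    return [alt / (ref + alt) for ref, alt in sites if ref + alt >= min_depth]


def haplotype_fraction(maternal_reads, paternal_reads):
    """Fraction of uniquely mapped reads on the better-covered haplotype.

    Values near 0.5 suggest no duplication artifact; values near 1 are
    what the Degner effect would produce.
    """
    total = maternal_reads + paternal_reads
    if total == 0:
        raise ValueError("no reads at this site")
    return max(maternal_reads, paternal_reads) / total


def fraction_skewed(read_pairs, cutoff=0.6):
    """Share of sites whose haplotype fraction exceeds the cutoff."""
    fractions = [haplotype_fraction(m, p) for m, p in read_pairs]
    return sum(1 for f in fractions if f > cutoff) / len(fractions)
```

For the reference-bias check, an unbiased pipeline should give a distribution of `alt_allele_fractions` that is roughly symmetric around 0.5.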
Correlation of ASB in binding sites containing known motifs
In our analysis of ASB events, heterozygous SNPs are initially only used for distinguishing between the maternal and paternal alleles (presumably allele-specific behavior also occurs in genomic regions not containing heterozygous SNPs). However, in some locations the heterozygous SNP might itself cause the difference in binding between the alleles; this would most likely occur in ASB sites where the heterozygous SNP is within a known DNA binding motif for a transcription factor. In order to see how ASB is correlated with perturbations to known DNA binding motifs, we compared the allele that is bound against the allele that matches better with the known motif. Thus, we first scanned ASB sites for known motifs, using position weight matrices (PWMs) obtained from TRANSFAC (Matys et al, 2006) and JASPAR (Portales-Casamar et al, 2010) (see Materials and methods for further details). We correlated the nucleotide of the allele that is preferentially bound with the fitness score of the PWM. We observed a number of significant correlations between the PWM score for both alleles and the allele that is bound. The allele that is bound tends to correspond to the better match to the known PWM. In particular, we see this for the known cMyc motif within both cMyc and Max binding sites (see ). This is interesting as we observe a correlation between the allele-specific behavior of cMyc motifs with Max binding sites, indicating that these transcription factors tend to significantly regulate the same target genes. We also include in Supplementary Figure 8 additional examples of the correlation between motif fitness score and the allele being bound for CTCF binding sites containing CTCF motifs and cMyc motifs within Pol II binding sites.
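The motif comparison can be made concrete with a small sketch of scoring both alleles against a PWM and taking the difference. The log-odds scoring against a uniform background is a common convention assumed here, not necessarily the scoring used in Materials and methods, and the function names and any matrix passed in are hypothetical (not a TRANSFAC or JASPAR entry).

```python
import math


def pwm_log_odds(seq, pwm, background=0.25):
    """Log-odds score of a sequence against a PWM.

    pwm: list of dicts, one per motif position, mapping base -> probability
         (all probabilities must be > 0, e.g. after adding pseudocounts).
    """
    if len(seq) != len(pwm):
        raise ValueError("sequence and PWM must have the same length")
    return sum(math.log2(column[base] / background) for column, base in zip(pwm, seq))


def motif_score_difference(maternal_seq, paternal_seq, pwm):
    """Maternal minus paternal motif score at a site overlapping an ASB SNP.

    A positive value means the maternal allele is the better motif match.
    """
    return pwm_log_odds(maternal_seq, pwm) - pwm_log_odds(paternal_seq, pwm)
```

Plotting this difference against the fraction of maternally derived reads at each ASB SNP would then show whether the bound allele tends to be the better motif match.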
Figure 3 We plot the difference of motif scores (see Materials and methods) between the maternal and paternal alleles against the fraction of maternally derived reads for ASB SNPs overlapping motifs within binding sites. Here, we plot this for ASB SNPs in cMyc ...
Correlation of ASE with protein structural fitness
Some heterozygous SNPs within genes can result in one allele losing its ability to function as a transcript (i.e., heterozygous loss of function). Additionally, non-synonymous SNPs within the protein-coding sequence can cause a difference in the structure fitness of the protein coded from each allele. We studied the coordination between these effects and the genes that show ASE.
We first investigated the overlap between genes that exhibit ASE and genes that show loss of function on one allele due to a premature stop codon, a frame-shift, or a disruption of an intron–exon splice site (all caused by heterozygous SNPs). We find four cases of genes that show ASE as well as heterozygous loss of function, and in all four cases the allele that is expressed is opposite to the allele that has lost its ability to code for a protein. We speculate that in some of these cases the transcript from the allele suffering from a disruption might be degraded due to nonsense-mediated decay, which, in turn, might cause the ASE from the other allele.
Since some heterozygous SNPs that show ASE are within the protein-coding sequence of genes, it is natural to ask whether the allele that is expressed could track with allele-dependent structural changes (for SNPs in non-synonymous positions in the protein-coding sequence). However, it is not clear that we should expect to find a correlation between structural fitness and ASE, as many of these SNPs are not selected for in the human population in any case. In order to assess whether the allele that is expressed (for genes showing ASE) is correlated with the allele containing mutations deleterious to protein structure, we performed the following analysis. We compared the occurrences of ASE SNPs within genes where the SNP corresponds to a non-synonymous substitution within the protein-coding sequence of the gene. Using the tool SIFT (Ng and Henikoff, 2003), we can compare the preference of the allele that is expressed with the allele whose amino-acid sequence shows better structural fitness. We find that 20 of the 37 genes that meet these criteria show expression on the allele whose protein sequence has better fitness. While we find slightly more genes where the allele with better structural fitness is the allele that is expressed, this result is not significant.
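As a rough check on why 20 of 37 is not significant, one can treat each gene as a fair coin flip and compute a two-sided binomial (sign test) P-value. The text does not state which test was used, so the choice of test and the function name below are assumptions.

```python
from math import comb


def two_sided_sign_test(k, n):
    """Two-sided P-value for k successes in n trials under p = 0.5,
    computed as twice the smaller tail (capped at 1)."""
    tail = min(k, n - k)
    if 2 * tail == n:
        return 1.0
    lower = sum(comb(n, i) for i in range(tail + 1)) / 2**n
    return min(1.0, 2 * lower)


p_value = two_sided_sign_test(20, 37)
print(f"P-value for 20 of 37: {p_value:.2f}")
```

Under this assumed test the P-value is roughly 0.74, far from any conventional significance threshold.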
Correlating ASB with ASE
In the upper panel of , we present an example of the gene SKA3 on chromosome 13, which has multiple heterozygous SNPs within exons that show consistent maternal ASE. This agrees with the maternal ASB of another heterozygous SNP within a Pol II binding site proximal to the 5′ end of the gene. In this example, we see coordinated maternal binding of Pol II with expression of the gene. In the lower panel of , we present a similar example of a novel transcribed region on chromosome 4 where we see coordinated paternal binding of Pol II with the paternal expression of the novel TAR.
Figure 4 Examples showing ASE and ASB for a gene (SKA3 on chromosome 13) and a novel TAR (on chromosome 4). Paternal SNPs exhibiting either ASE or ASB are indicated in blue and corresponding maternal SNPs are indicated in red. We also indicate the region of enriched ...
These two examples show how ASB is coordinated with ASE for a known gene and a novel TAR. To investigate this trend, we assess to what extent allele-specific behaviors detected using heterozygous SNPs are coordinated on a genome-wide scale. In , we tabulate the number of genes and novel TARs that have evidence for ASE and for proximal ASB. We also tabulate the total counts of genes that have a proximal binding site where both the gene and binding are jointly accessible for detecting allele-specific behavior. We perform a similar calculation for novel TARs and their proximal sites.
Association of transcription factor binding (for Pol II & Pol III, CTCF, and the other TFs combined) and expression of genes and novel TARs
In , we present the tabulated counts for Pol II & Pol III and CTCF separately from all the other transcription factors combined. We find a number of genes (74 genes proximal to Pol II & Pol III sites, 29 genes proximal to CTCF binding sites, and 44 genes with proximal transcription factor sites other than CTCF) and novel TARs (55, 8, and 15 for Pol II & Pol III, CTCF, and remaining transcription factors, respectively) in which we see ASB proximal to ASE. We separate CTCF from the remaining transcription factors because of its function as an insulator. While these numbers might seem relatively small, they reflect the low chance of having both a proximal binding site and an expressed gene that are jointly accessible for the detection of allele-specific activity. This underscores the need to sequence deeply and to use a comprehensively determined set of SNPs; otherwise we would have significantly fewer genes in which to compare ASB and ASE.
In order to assess the degree of coordinated allele-specific behavior for the 74 genes that exhibit ASE and have a proximal ASB Pol II binding site, we performed the following analysis. For each gene, we tabulated the allele-specific behavior of the most significant ASE SNP versus the most significant ASB SNP (if there is more than one significant heterozygous allele-specific SNP present). In , we tabulate maternal and paternal ASB of binding sites against maternal and paternal ASE of proximal genes (we do this separately for Pol II & Pol III, CTCF, and the remaining transcription factors combined). We see that there is a statistically significant coordination between ASB of Pol II & Pol III and genes that exhibit ASE (Fisher's exact test, P-value=1.4e−3). Similarly, as seen in , there is also a statistically significant correlation between the 45 genes that exhibit ASE and ASB for all the combined transcription factors excluding CTCF (Fisher's exact test, P-value=1.8e−5). We do not, however, see a significant correlation of ASB with ASE for CTCF, which is probably due to its role as an insulator.
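The coordination test above operates on a 2×2 table (maternal/paternal ASB against maternal/paternal ASE). A minimal standard-library sketch of a two-sided Fisher's exact test is given below; the function name is an assumption, and any counts passed to it in examples are hypothetical rather than the values from the table.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows could be maternal/paternal ASB and columns maternal/paternal ASE.
    The P-value sums the hypergeometric probabilities of all tables with
    the same margins that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)
```

A table concentrated on the diagonal (maternal binding with maternal expression, paternal with paternal) yields a small P-value, which is the pattern of coordination described above.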
We tabulate the ASB SNPs proximal to genes (within 2.5 kb of the TSS to the TTS of the gene) containing ASE SNPs
Further coordination of allele-specific behavior
As a further way to assess the coordination of allele-specific events within genes, we combined all the ASE and ASB SNPs that occurred within a gene (from 2.5 kb upstream of the transcription start site (TSS) to the transcription termination site (TTS), including introns). Using only genes that contained >10 SNPs showing ASE or ASB, we could compute the fraction of SNPs that show maternal specificity. Ideally, if all SNPs were perfectly coordinated then the fraction would be either zero or one. In the first panel of , we see the actual distribution; most genes do show a high degree of coordination. Under a random null (where each ASE or ASB event could be maternal or paternal with equal chance) for the same genes, each with the same number of SNPs, we would expect the null distribution of maternal fraction that is computationally simulated in panel two of . Using a Kolmogorov–Smirnov test, we find significantly more coordination of ASB and ASE SNPs in genes than under the random null expectation (P-value=8.45e−8; see the third panel of ).
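The comparison against the random null can be sketched in a few lines. The functions below compute per-gene maternal fractions, simulate the null by redrawing each SNP as maternal or paternal with equal chance, and compute the two-sample Kolmogorov–Smirnov statistic; the names and input layout are assumptions, and the sketch stops at the statistic without deriving a P-value.

```python
import random
from bisect import bisect_right


def maternal_fractions(genes):
    """genes: list of (n_maternal_snps, n_paternal_snps) per gene."""
    return [m / (m + p) for m, p in genes]


def simulate_null_fractions(genes, seed=0):
    """Redraw every allele-specific SNP as maternal with probability 0.5,
    keeping the number of SNPs per gene fixed."""
    rng = random.Random(seed)
    fractions = []
    for m, p in genes:
        n = m + p
        k = sum(rng.random() < 0.5 for _ in range(n))
        fractions.append(k / n)
    return fractions


def ks_statistic(xs, ys):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    d = 0.0
    for t in sorted(set(xs) | set(ys)):
        fx = bisect_right(xs, t) / len(xs)
        fy = bisect_right(ys, t) / len(ys)
        d = max(d, abs(fx - fy))
    return d
```

Coordinated genes pile up near fractions of zero and one, while the simulated null concentrates around 0.5, so a large statistic indicates coordination.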
Figure 5 We compare the degree of coordination in the maternal or paternal preference of ASB and ASE SNPs within a gene, to that of a random null distribution. All genes that contain 10 or more such SNPs across all our GM12878 data sets are included. Using this ...
Representing ASE and ASB behavior on a regulatory network
In the previous analysis, we showed that binding and expression were correlated in a pair-wise manner. Next, we would like to investigate the coordination of allele-specific behavior between multiple factors and expression simultaneously. This is hard to represent in a two-dimensional correlation plot; thus, we have developed a simplified representation of ASE and ASB in a regulatory network framework. Looking at the occurrences of network motifs (Milo et al, 2002) in this framework allows us to measure the coordination of allele-specific behavior between multiple transcription factors and target expression.
The network shown in represents the regulatory network of six transcription factors (cMyc, Max, cFos, JunD, NfκB, and CTCF) and two polymerases (Pol II and Pol III). The network discretizes the ASB events into allele-specific regulation of target genes and novel TARs and their ASE. The edges in the network represent ASB of the TF or polymerase to the target gene or novel TAR (red edges indicate predominantly maternal regulation, blue edges indicate paternal regulation, and gray edges indicate allele-specific regulation that could not be phased). ASE of the target genes is indicated by the color of the target gene or novel TAR (red: maternal, blue: paternal, and gray: unphased allele-specific behavior). Thus, the network contains all information on the allele specificity of the regulation and the expression of the targets. One can observe that there is a clear agreement between the allele specificity of the regulation and the expression of the target. When we tabulate the maternal/paternal regulation edges with maternal/paternal expressed genes or novel TARs (see first part of ) we find that they are highly coordinated (P-value <1e−3, Fisher's exact test). Furthermore, we can scan the network for coordinated regulation using multiple-input motifs (MIMs) and single-input motifs (SIMs) (Milo et al, 2002). MIMs are network motifs where at least two transcription factors regulate the same target gene or novel TAR, while SIMs are motifs where a single transcription factor regulates a pair of targets. In the second part of , we tabulated the number of MIMs where the pairs of TFs (or polymerases) exhibit both maternal or both paternal regulation against the maternal or paternal expression of the target gene or novel TAR.
Figure 6 This figure shows a regulatory network of genes and novel TARs that are regulated by TFs in an allele-specific manner. The TFs are represented by green triangles, while the genes and novel TARs are represented by squares and circles, respectively. The ...
Number of transcription factors (or polymerases) that maternally or paternally regulate GENCODE genes or novel TARs that are maternally or paternally expressed
We find that for MIMs the regulatory allele-specific behavior is highly coordinated with the ASE of the target gene or novel TAR (P-value <1e−3, Fisher's exact test). As we can see in , most MIMs correspond to the coordinated regulation of a target gene or novel TAR by Polymerase II and a transcription factor. In the third part of , we count the occurrences of SIMs in which a TF (or polymerase) exhibits maternal or paternal regulation of each of its targets, which can each be either maternally or paternally expressed. We similarly see a significant degree of coordination of allele-specific expression and regulation in SIMs as with MIMs.
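Counting motifs in such a network is straightforward once the edges and node colors are stored explicitly. The sketch below enumerates MIMs and tallies how many are coordinated with the expression of their target; the data layout (a dict of phased edges and a dict of phased target expression, with "M" and "P" labels), the function names, and the gene names used in any example are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def multiple_input_motifs(edges):
    """Enumerate MIMs: pairs of regulators that share a target.

    edges: dict mapping (regulator, target) -> "M" or "P" (phased ASB).
    Returns a list of (target, regulator1, allele1, regulator2, allele2).
    """
    by_target = defaultdict(list)
    for (regulator, target), allele in sorted(edges.items()):
        by_target[target].append((regulator, allele))
    motifs = []
    for target, regulators in by_target.items():
        for (r1, a1), (r2, a2) in combinations(regulators, 2):
            motifs.append((target, r1, a1, r2, a2))
    return motifs


def count_coordinated_mims(edges, expression):
    """Among MIMs whose two edges agree (both maternal or both paternal),
    count how many match the phased expression of the target.

    expression: dict mapping target -> "M" or "P" (phased ASE).
    Returns (coordinated, total).
    """
    coordinated = total = 0
    for target, _, a1, _, a2 in multiple_input_motifs(edges):
        if a1 != a2 or target not in expression:
            continue
        total += 1
        if expression[target] == a1:
            coordinated += 1
    return coordinated, total
```

SIMs can be counted the same way by grouping edges by regulator instead of by target; the resulting counts are what would feed a Fisher's exact test like the one reported above.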