Variation in transcriptional regulation of gene expression plays a significant role in determining the diversity of human phenotypes. The components of transcriptional control include
cis-acting regulatory elements that may act across large genomic distances, hundreds or thousands of kb away from the transcript they regulate
[18]. Studies of ASE indicate that allele-specific differences among transcripts within an individual can affect up to 30% of loci and, at the population level, ~30% of expressed genes show evidence of
cis regulation by common variants
[6]. In population studies, an even larger proportion of genes showed ASE that could not be mapped, which could be ascribed to rare genetic variants or epigenetic effects
[8]. However, it is also possible that some distal regulatory regions have escaped detection because they are located far from the regulated transcript: first, because their large distance may have left them unexplored and, second, because mapping could have failed if the tests required knowledge of the chromosomal phase. Sample size, the reliability of genotyping, and the accuracy and completeness of AI ascertainment will affect the outcome of all tests. Because the genotype-based test is independent of chromosomal phasing and phasing errors, its mapping efficacy is also unaffected by the genetic distance separating the regulatory site from the regulated transcript. Phasing errors are unavoidable, even when using the best haplotype-phase inference algorithms
[23]. Their number increases with increasing chromosomal distance and with the number of intervening recombination hotspots. They may also be more frequent in admixed individuals and in newly studied populations with unknown haplotype catalogs
[24]. The most accurate algorithms, such as PHASE, require very long computation times on a regular 2 GHz computer, which may extend from days to months for sets of hundreds of thousands of SNPs in a hundred genotyped individuals
[16],
[25]. While this was not an issue with our simulated data sets of 50 diploid individuals and an average of about 500 SNPs (θ = 100), phasing still took more than 50 min on a 2.67 GHz processor. Faster programs exist; for example, it takes about 2 min to phase the same data set using ShapeIT
[25]. Regrettably, this speed comes at the expense of accuracy, which varies as a function of the sample size and the number of markers
[26]. In other words, the genotype-based test is less computationally demanding: we gain in time and, when phasing errors are an issue, in accuracy.
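To illustrate why a genotype-based test needs no phase information, the following minimal Python sketch cross-tabulates heterozygosity at a candidate SNP against AI status and applies a two-sided Fisher exact test to the resulting 2x2 table. The choice of Fisher's test, the function names and the toy data are assumptions made for illustration only; they are not necessarily the exact contingency procedure used in this study.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


def genotype_test(genotypes, ai_status):
    """Cross-tabulate heterozygosity at one SNP against AI status.

    genotypes: unphased genotypes coded 0/1/2 (hypothetical coding: 1 = heterozygote)
    ai_status: booleans, True if the individual shows allelic imbalance
    Returns the table (het&AI, het&nonAI, hom&AI, hom&nonAI) and the p-value.
    """
    a = b = c = d = 0
    for g, ai in zip(genotypes, ai_status):
        het = (g == 1)
        if het and ai:
            a += 1
        elif het:
            b += 1
        elif ai:
            c += 1
        else:
            d += 1
    return (a, b, c, d), fisher_exact_two_sided(a, b, c, d)


# Toy data (assumed): heterozygosity coincides perfectly with AI status
genotypes = [1, 1, 1, 1, 1, 0, 0, 2, 2, 0]
ai_status = [True] * 5 + [False] * 5
table, p_value = genotype_test(genotypes, ai_status)
print(table, p_value)  # (5, 0, 0, 5) and p = 2/252, about 0.0079
```

Only unphased genotypes enter the table, so neither the distance between the SNP and the transcript nor the phasing algorithm can affect the result of such a test.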
Our analyses on real data were carried out in very well phased CEU individuals from the HapMap project, where phasing was additionally improved by using parent-offspring trios. In most cases the haplotype-based binomial test worked equally well or even better than the genotype-based test as judged by significance levels. However, while both tests “found” the PTGER4
cis-regulatory segment located almost 200 kb upstream of this gene (), our binomial and linear regression tests failed to identify such a region more than 500 kb downstream of the TTC39b transcript (). When we rephased the genotype data from using fastPhase
[16] (but not PHASE
[17] or Shape-IT
[25]), the upstream regulatory segment of the PTGER4 transcript also became “invisible” in the haplotype-based tests. Therefore, when chromosomes are well phased, these tests can be expected to lead to the same or largely overlapping results (e.g. ). Importantly, these two examples ( and ) illustrate well the problem of locating regulatory variants or regions from ASE data. An informative marker whose alleles are at least partly consistent with the direction of up- and down-regulation of transcription may itself be revealed as significant. The chances of this happening increase when many such markers are used (or when many transcripts are tested with highly informative markers), and haplotype-based tests seem to be more vulnerable to this kind of error. When a number of SNPs in the zone of informative markers are significant in the haplotype-based test but not in the genotype-based test, as in , this strongly suggests that they do not indicate the location of the regulatory region but rather reflect a partial overlap in heterozygosity and phase between marker and regulatory sites. In the reverse case, significance in the genotype-based test but not in the haplotype-based test may suggest a different genetic mechanism. For example, in the case of an imprinted locus, when one of the parental chromosomes is silenced, a signal of AI will be observed
[27]. This is observed in the SNRPN locus (Figure S19), reported to be imprinted [28], and in the L3MBTL locus (Figure S20), where haplotype-based analyses failed and the contingency test revealed as significant the informative markers and other SNPs in their linkage group. Likewise, an “artificial” AI signal could also reflect random mono-allelic expression in a fraction of individuals (cell lines), due to aberrant methylation of the genome
[27]. In other words, combining the results of haplotype- and genotype-based tests may provide leads to AI-causing mechanisms other than genetic variation within regulatory elements. In we listed selected SNPs found by us, which were earlier reported in either different GWAS or expression studies by others. For example, PTGER4 rs7720838 was found to be associated with the risk of Crohn's disease
[14]. The rs1384883 SNP from LRRIQ3 was reported in a GWAS of blood pressure and hypertension
[29], while other SNPs associated with the ASE of the LRRIQ3 transcript were reported in gene expression studies
[30],
[31]. The SNP rs751173 (transcript 404053) was reported in a study of susceptibility to cutaneous nevi
[32], while those associated with AI of transcript 404105 were highlighted in a GWAS of late-onset Alzheimer disease, with rs2180566 found in the promoter of DEFB123
[33]. In turn, the LTA locus with rs2844484 and rs2239704 was found in studies of cancer susceptibility and the risk of ischemic stroke
[34]–
[36]. All the remaining sites were earlier identified in studies of gene expression in the context of eQTL mapping. Thus, our findings here can be considered confirmatory. Interestingly, however, the three SNPs listed in the context of TTC39b, and found in the larger cluster of significant sites based on the data shown in , were reported in the context of the PSIP1 transcript, about 150 kb upstream from TTC39b
[7]. Likewise, rs1963273 identified in the context of the FMO1 transcript
[37] was found here to be linked to AI within FMO4 (Figure S10), and SNPs listed for BAT2 were previously reported in LD with CSNK2B transcription
[37],
[38]. Do these results represent examples of synchronized transcription control, as could be suspected in the case of a gene cluster involving FMO1 and FMO4, or are they due to experimental artefacts partly caused by phasing errors?
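The sensitivity of haplotype-based tests to phasing errors discussed in this paragraph can be illustrated with a small Python sketch. It assumes, purely for illustration, that a haplotype-based binomial test counts how often a given marker allele sits on the over-expressed chromosome among AI individuals and compares that count with the 0.5 expected by chance. This form of the test, the function name and the counts are hypothetical and are not necessarily those used in this study.

```python
from math import comb


def binomial_two_sided(k, n):
    """Exact two-sided binomial p-value for k successes in n trials, p = 0.5."""
    probs = [comb(n, i) / 2 ** n for i in range(n + 1)]
    # Sum the probabilities of all outcomes no more likely than the observed one
    return sum(p for p in probs if p <= probs[k] * (1 + 1e-9))


# Hypothetical example: 20 AI individuals heterozygous at a marker SNP.
# With perfect phasing, the same marker allele is always on the
# over-expressed chromosome.
n_ai = 20
p_clean = binomial_two_sided(20, n_ai)

# If phase is switched in 5 of the 20 individuals, only 15 of them still
# show the marker allele on the over-expressed chromosome.
p_switched = binomial_two_sided(15, n_ai)

print(p_clean)     # 2 / 2**20, about 1.9e-06
print(p_switched)  # 43400 / 2**20, about 0.041
```

Under these assumptions, a handful of phase switches between a distal marker and the transcript is enough to move a highly significant signal close to the conventional 0.05 threshold, whereas a test based on unphased genotypes would be unaffected.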
In contrast to other analyses, which may consider the intensity of the ASE signal
[6], the tests introduced here are based on a categorical classification of the individuals studied as AI or non-AI. The mapping of regulatory variation thus critically depends on the quality of AI measurement as well as on the number of intergenic informative marker SNPs serving to ascertain the AI status of the sampled individuals. Detecting or confirming the presence of AI is not the same as mapping regulatory variants. For the former, it is sufficient to demonstrate that two parental copies of a gene are differentially expressed. Mapping requires sufficiently large samples in which, ideally, all individuals exhibiting AI can be detected. Power and FPR of the mapping tests depend upon the characteristics of the polymorphic sites in LD within a regulatory segment (). These characteristics, which include their allelic frequencies, genealogical positions and r² relative to the R site, change with increasing frequency of the r allele (Figure S21 and Table S2). Selecting simulations for the presence of a derived allele above a certain frequency level eliminates a portion of coalescence trees representing particular genealogical histories that cannot “accommodate” sites with a derived allele above that frequency level. While all of the 2000 simulated genealogies “carry” a site with a derived allele frequency of ~0.15, only 897 (45%) carried sites with a derived allele frequency of ~0.85 (Table S2). This leads to a progressive distortion (as compared to neutral expectation) of the allelic frequency spectra of SNPs in LD with the R site at increasing frequency of its r allele (Figure S22), which affects the proportions of significant SNPs in each position category between the tests. Knowing the number of AI individuals, we may estimate heterozygosity and thus the relative R and r allele frequencies. Knowledge of the expected genealogical positions of SNPs that are tightly linked with the R site allows us to better understand differences between the outcomes of different tests (Table S2). When combined with the analysis of the regulatory region haplotypes, it may also be useful in finding the regulatory site itself.
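As a worked example of estimating allele frequencies from the number of AI individuals, the Python sketch below assumes that every AI individual is an R/r heterozygote, that all such heterozygotes are detected, and that the regulatory site is in Hardy-Weinberg equilibrium, so that the heterozygosity H = 2q(1 − q) can be solved for q. These assumptions, the function name and the sample counts are illustrative only and are not taken from the analyses reported here.

```python
from math import sqrt


def allele_freqs_from_ai(n_ai, n_total):
    """Estimate heterozygosity and the two allele frequencies at the R site.

    Assumes AI individuals are exactly the R/r heterozygotes and
    Hardy-Weinberg proportions, i.e. H = 2q(1 - q).
    Returns (H, lower frequency, higher frequency).
    """
    h = n_ai / n_total
    if h > 0.5:
        # Under Hardy-Weinberg proportions H cannot exceed 0.5 for two alleles
        raise ValueError("heterozygosity above 0.5 is incompatible with H = 2q(1 - q)")
    root = sqrt(1 - 2 * h)
    return h, (1 - root) / 2, (1 + root) / 2


# Hypothetical sample: 16 AI individuals among 50
h, q_low, q_high = allele_freqs_from_ai(16, 50)
print(h, q_low, q_high)  # H = 0.32, frequencies close to 0.2 and 0.8
```

Heterozygosity alone yields the pair of frequencies but not which of them belongs to r and which to R; that assignment requires additional information, such as the regulatory region haplotypes.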
Systematic use of genotype-based tests in concert with haplotype-based tests may be the most advisable mapping strategy. Unfortunately, haplotype-based tests will always suffer from the phasing uncertainty inherent in the data, especially when the number of samples precludes the use of computationally slow but more reliable phasing algorithms. With the genotype-based test, the phasing step can simply be postponed, saving time and related costs. The best outcomes can be expected with high-quality data that maximize the ascertainment of AI individuals. The utility of the genotype-based test will increase with the application of new sequencing methods that improve transcript quantification, thus providing a more reliable assessment of AI status. Being free of phasing uncertainty, the genotype-based test should pave the way to the identification of
cis-regulatory variants that could have escaped detection due to their distal location
[4],
[18]. Finally, the current trend of functional genomics based on next-generation sequencing makes it possible to interrogate allelic functional effects beyond transcription
[8]. Any heritable phenotype that can be categorized, as AI is here, so as to identify the underlying heterozygotes can be mapped using the protocols presented. The genotype test can also be extended to compare phenotyped individuals that may represent different genotype combinations. Our approaches can be generalized to map the causes of differential DNA-protein interactions or active chromatin, both shown to be heritable in recent studies
[39],
[40].