|Home | About | Journals | Submit | Contact Us | Français|
Contributed intellectually to the interpretation of results: PRB OH WH XL. Conceived and designed the experiments: PRB XL. Analyzed the data: PRB RM LL AH JL. Wrote the paper: PRB.
Gene expression quantitative trait loci (eQTL) are useful for identifying single nucleotide polymorphisms (SNPs) associated with diseases. At times, a genetic variant may be associated with a master regulator involved in the manifestation of a disease. The downstream target genes of the master regulator are typically co-expressed and share biological function. Therefore, it is practical to screen for eQTLs by identifying SNPs associated with the targets of a transcript-regulator (TR). We used a multivariate regression with the gene expression of known targets of TRs and SNPs to identify TReQTLs in European (CEU) and African (YRI) HapMap populations. A nominal p-value of <1×10−6 revealed 234 SNPs in CEU and 154 in YRI as TReQTLs. These represent 36 independent (tag) SNPs in CEU and 39 in YRI affecting the downstream targets of 25 and 36 TRs respectively. At a false discovery rate (FDR)=45%, one cis-acting tag SNP (within 1 kb of a gene) in each population was identified as a TReQTL. In CEU, the SNP (rs16858621) in Pcnxl2 was found to be associated with the genes regulated by CREM whereas in YRI, the SNP (rs16909324) was linked to the targets of miRNA hsa-miR-125a. To infer the pathways that regulate expression, we ranked TReQTLs by connectivity within the structure of biological process subtrees. One TReQTL SNP (rs3790904) in CEU maps to Lphn2 and is associated (nominal p-value=8.1×10−7) with the targets of the X-linked breast cancer suppressor Foxp3. The structure of the biological process subtree and a gene interaction network of the TReQTL revealed that tumor necrosis factor, NF-kappaB and variants in G-protein coupled receptors signaling may play a central role as communicators in Foxp3 functional regulation. The potential pleiotropic effect of the Foxp3 TReQTLs was gleaned from integrating mRNA-Seq data and SNP-set enrichment into the analysis.
Phenotypic differences between populations have been shown to be associated with variation in genes, the epigenome, the environment and quantitative traits. Gene expression has been used as a quantitative phenotypic trait to locate regions in the genome that have polymorphisms governing differential transcription within populations , , , . This type of inference termed expression quantitative trait loci (eQTL) analysis has been used in genome-wide association studies (GWAS) to map single nucleotide polymorphisms (SNPs) to regions that affect gene expression . Recently it has been shown that SNPs associated with a phenotypic trait are more likely to be eQTLs . The advantage of understanding the contribution of genetic variations on the expression of genes has major implications on the manner in which pharmaceuticals are personalized for an individual and how complex diseases are investigated.
A typical eQTL approach entails modeling the expression of a single gene as a response variable with the genotypes of a single SNP as the predictor variable. Variants of eQTL modeling take the form of a pathway, network component, sparse factor, cluster or the average of a group of co-expressed genes as the response variable and/or predict the expression according to a set of SNPs selected by LASSO, canonical correlation or interval mapping , , , , , , , . The goal is to determine if there are “eQTL hotspots”  where a SNP leads to widespread changes in the expression of genes that are coordinately regulated. Hallmark examples of the power of eQTL analysis for determination of population differences are illustrated by several recent bodies of work. For example, several investigators have demonstrated the robustness of eQTLs to discern variation in gene expression between populations due to environmental exposures or geographic ancestry , , , . Others have shown that gene expression can vary according to particular genotypes, chemical agents and factors such as tissue type, gender, genotype and age , , , , , . Also, many have successfully linked genetic variants to transcriptional patterns within ethnic groups  although batch effects and biological noise confounding the differences between the populations can distort the interpretation of the results , . For instance, the SCAN database  is a catalogue of the association of a given SNP to variations in gene expression between Yoruba in Ibadan, Nigeria (African: YRI) and CEPH-Utah residents with ancestry from northern and western Europe (European: CEU) HapMap populations . These variants can be cis- or trans-acting whereby the effect is situated proximal to the expressed gene or it is located elsewhere in the genome, respectively. Although it has been suggested that SNPs residing in transcription factors (TFs) have no significant attributable effect on gene expression variation , it is unknown whether a variant that affects the genes regulated by a TF operates through a system of regulated pathways. Therefore, a more comprehensive way to better understand the genetic component of variation in gene expression within and between populations is to address the problem on a systems biology level. In other words, on a genome-wide scale, simultaneously model the expression of genes that are downstream targets (DSTs) of a transcript-regulator (TR) (Figure 1). A TR can be a TF, a cofactor, a complex, a microRNA or combination of these which are involved in the regulation of transcription and govern signaling pathways.
Li et al.  demonstrated the practicality of utilizing pathways as a response variable to associate SNPs between two genotype categories. They identified several genome-wide pathway regulators that seem to mediate gene expression differences. The advantages of this type of TReQTL approach are that the covariance of the DSTs is used in the modeling, co-regulation of the genes is inferred and the eQTL scan is genome-wide. Hence, scanning for TReQTLs is more likely to reveal variants that confer differences in gene expression between populations through genetically-wired regulatory mechanisms. We used a multivariate linear regression to model the DSTs of TRs with SNPs typed in CEU and YRI populations. The DSTs of the TRs were obtained from the TRANSFAC® knowledgebase - a biological resource that catalogs experimentally-proven binding and regulation of genes by various TRs , . For all TRs with two or more DSTs, we detected 234 SNPs in CEU and 154 in YRI representing 36 and 39 independent (tag) SNPs as TReQTLs and affecting the DSTs of 25 and 36 TRs respectively. The expression of the DSTs of 24 TRs was associated with SNPs in both populations. Mapping to within 1 kb of a gene and controlling for multiple testing revealed one cis-acting tag SNP in each population as a TReQTL. In CEU, a TReQTL SNP was found to be associated with the DSTs of the X-linked breast cancer suppressor Foxp3 but is not significant in the YRI dataset. The Foxp3 TReQTL SNPs were overrepresented in evolutionary conserved regions (ECRs) of the genome in CEU and enriched in splice junctions (SJs) in YRI.
Typical expression quantitative trait loci (eQTL) analyses take the form of a strategy where a single gene is used as a response variable and individual single nucleotide polymorphisms (SNPs) the predictor variable to determine if there is association of a particular phenotype with a variant. The correlation of co-regulated genes is not taken into consideration. We used a multivariate approach to leverage the covariance of the gene expression of downstream targets (DSTs) of a transcript-regulator (TR) to perform genome-wide associations for SNPs that are potentially linked to changes in gene expression across genotypes. The genotype data (phase-II, release 24, forward strand, non-redundant) from the 60 Yoruba in Ibadan, Nigeria (African: YRI) and from the 60 CEPH-Utah residents with ancestry from northern and western Europe (European: CEU) populations were obtained from the International HapMap Project . Gene expression data from the profiling of Epstein-Barr virus (EBV)-transformed lymphoblastoid cell lines from the individuals in each CEU and YRI HapMap population  were obtained from the National Center for Biotechnology Information Gene Expression Omnibus (GEO) database under accession number GSE10824. Figure 1 illustrates that using this model, TReQTLs can be identified which are associated with the downstream targets of TRs. The genetic variation attributed to the association are imbedded, and therefore discovered in the network of regulatory pathways that govern the co-regulation behind the phenotypic trait. The TReQTLs may be within a single gene (cis), spread across several genes (trans) or located in regions of unknown biological function. In addition, the case may be that several TReQTLs for the DSTs of TRs may share the same variants or portions of the same variants. To investigate the regulatory component of TReQTLs, we first sort out to determine if two populations (YRI and CEU) had shared or varied signaling transduction mechanisms robust enough for a more refined association analysis.
Using the TRANSFAC ,  and TRANPATH  databases of components of signal transduction and regulatory pathways respectively, 2,743 TRs were mapped from the approximately 9,000 probe sets on the Affymetrix Human HG-Focus Target GeneChip Array, 1,438 signaling pathways were identified as comprised of at least one of the TRs and 78 TRs were mapped to one or more pathways. As shown in Figure 2, 333 TRs were determined to have two or more DSTs. These regulate 1,931 DSTs. The TRs consist of transfactors (TFs), cofactors, complexes and miRNAs. Three TFs, all stimulating proteins (Sp), regulate more than 60 DSTs. The median value for the TR DSTs is 3 and the mean is 5.8.
To determine the extent of co-regulation of genes within CEU and YRI populations, the correlation of the gene expression of pairs of the genes which are DSTs of TRs was assessed. Significance (p-value<or≈0.05) was determined from a non-parametric distribution of correlation scores obtained from 10,000 random cases. The correlation score is the sum of the squares of the Pearson correlations (r) among all pairs of genes determined to be DSTs of a TR (see the Materials and Methods section for more detail). For illustrative purposes, Table 1 reports the significance of the correlation of DSTs of only those TRs where disease-causing SNPs are located in the TR target site. The disease-causing SNPs were obtained from the NHGRI GWAS Catalog (www.genome.gov/gwastudies) and mapped to the TRANSFAC position weight matrix consensus sequence for the binding of at least one of the DSTs of the TR. Interferon-stimulated gene factor 3 (ISGF3), X-box binding protein 1 (XBP1) and hepatocyte nuclear factor 4-alpha (HNF4-α) are significant in the CEU and YRI populations. Signal transducers and activators of transcription −1 (STAT1), activating transcription factor 1 (ATF1) and peroxisome proliferator activating receptor gamma (PPAR-γ) are significant in YRI only whereas upstream transcription factor 1 (USF1), the Sp1Sp3 complex and the retinoid X receptor alpha (RXR-α):PPAR-γ complex are significant in CEU only.
To search for eQTLs that are tied to genes which are co-regulated in a given population, a multivariate linear regression was used to model the gene expression of the DSTs of TRs as response variables and the genotypes of SNPs as the predictor variables. The analysis was restricted to i) the 333 TRs which were found in the TRANSFAC database to have two or more DSTs (1,931 of the 8,399 unique UniGene transcripts represented by probe sets on the microarrays) and ii) to approximately 1.5 million SNPs on the autosomal chromosomes that passed the filtering criteria (see Materials and Methods) and were in common between CEU and YRI (common set), the 416,160 SNPs among the 1.5 million common set at a minor allele frequency (MAF)>=0.05 and linkage disequilibrium (LD) r2>0.5 (tag set) and the 184,616 independent (tag) SNPs which are within 1 kb of a gene (cis-acting set). For multiple testing correction, we used the 6.1×107 p-values from the cis-acting set to control the false discovery rate (FDR=45%). For the three TRs with 60 or more DSTs, the modified F-statistic , ,  was used to obtain the nominal p-value for the TReQTL. A preliminary analysis considered a nominal p-value less than 1×10−6 for detecting TReQTLs. Although this cut-off is extremely high and subject to many false positives, we were initially interested in the overall robustness of the method to screen for putative associations. As illustrated in Figure 3, the TReQTLs for the CEU and YRI populations are widely different with 234 and 154 SNPs detected in CEU and YRI respectively. These represent 36 tag SNPs in CEU and 39 in YRI affecting the DSTs of 25 and 36 TRs respectively (Supplemental Materials Table S1). None of the TReQTL SNPs in the two populations overlap. At an FDR of 45%, two cis-acting tag SNPs (one in each population) are considered TReQTLs. In CEU, the SNP (rs16858621) in the pecanex-like 2 (Pcnxl2) gene was highly associated with the DSTs of the cAMP responsive element modulator (CREM) transfactor whereas in YRI, the SNP (rs16909324) was linked to the targets of miRNA hsa-miR-125a.
As shown in Figure 4, there are a few cases where the SNPs are mapped relative to a TR (i.e.<2 Mb). In CEU, the DSTs of TRs alpha-CBF (T00081) and ENKTF-1 (T00255) possessed 13 and 4 TReQTL SNPs respectively but are not displayed as these DST genes have not been characterized and hence, have no genomic location. Four tag SNPs were associated with the DSTs of the HIF2A:arnt complex (T10852) in CEU whereas 2 tag SNPs were associated with the DSTs of miRNA hsa-miR-125a (T09819) in YRI. Interestingly, in CEU, one tag SNP (rs16858621) was associated (p-value<5×10−7) with the DSTs of miRNA has-miR-15a (T09712) and TF CREM (T01803) both of which regulate Ccnd1 , . Thus, presumably, this represents a case where a SNP may affect a master regulator that controls not only a TF but a miRNA as well both of which share the role of regulating a common gene. In YRI, there are several cases where a tag SNP is associated with the DSTs of more than one TR. Although several sets of DSTs of TRs were found to have a fair number of significant tag SNPs mapped to them, two miRNAs (hsa-let-7e (T09710) in CEU and hsa-miR-200a (T09837) in YRI) have hotspots (SNPs in a region affecting multiple transcripts ) associated with the variation of expression of their DSTs according to the genotypes at the alleles. When restricting the comparison of the populations to the ~1.5 million SNPs in common, several of the TReQTL overlapped between CEU and YRI. The expression of the DSTs of 24 TRs was associated with SNPs in both populations (Table 2). These were SNPs mapped within or in proximity to genes involved in transcription regulation, cell communication, transport, kinase activity, growth and development. Interestingly, several of the TReQTL SNPs in CEU are mapped to pseudogenes.
Each TReQTL is comprised of a SNP, a TR and a set of DSTs. Each constituent is associated with a gene. We mapped the constituents, except for cases where the TR is a miRNAs, to genes and then determined the GO biological process term each was annotated to. The collection of terms was then used to construct a GO biological process subtree. All ancestors of a term were included in the subtree. Our adjusted cohesion score (ACS) is an ad hoc way to i) measure the amount of connectivity between terms, ii) account for the significance of the TReQTL and iii) consider the average number of paths per term. As listed in Table 3, the top ranked TReQTLs have the more cohesive subtrees and are more descriptive with respect to the term with most paths associated with it (The full list is in supplemental material Table S2 (CEU) Table S3 (YRI)). These associations within the subtree can lead to new insight into the possible role of the TReQTL SNP in the pathophysiology of diseases. For instance, in YRI, the rs12258754 allelic variant is associated with the DSTs of activating transcription factor 3 (Atf3) and produced a subtree with vascular smooth muscle cell (VSMC) contraction as the granular biological process node (Figure 5A). Interestingly, in CEU, four SNPs associated with the variation of expression for the DSTs of miRNAs hsa-mir-181b-1 (MI0000270) and hsa-mir-181b-2 (MI0000683) are mapped to the peptidyl-prolyl cis-trans isomerase A-like pseudo-gene and generated a subtree with synaptic transmission as the most descriptive biological process term (Figure 5B).
Many of the variants map to Online Mendelian Inheritance in Man (OMIM) associated disease genes (data not shown). For instance, one TReQTL in the CEU population is associated with the DSTs of the X-linked breast cancer suppressor gene Foxp3 (T04280) transfactor . Foxp3 belongs to the Forkhead box family of genes, is located on chromosome X and is essential for the production and normal function of regulatory T-cells. As shown in Figure 6, interleukin 2 (Il2) and colony stimulating factor 2 (CSF2), the DSTs of Foxp3, are two cytokines whose gene expression co-regulation (correlation=0.56) is significantly associated with the variants of tag SNP rs3790904 (p-value=8.1 10−7) which maps to the latrophilin homolog 1 (Lphh1/Lphn2) G-protein couple receptor (GPCR) gene. This association is not significant in YRI (p-value=0.89). Other significant SNPs in CEU that are linked to the DST of Foxp3 map to an additional GPCR gene (Lphn3) as well as a membrane ion channel (KCNJ1), a phosphatidic acid phosphatase type 2 enzyme (Ppapdc1a) and an uncharacterized gene. An interaction network of Foxp3, its DSTs and genes that the TReQTLs map to revealed tumor necrosis factor (TNF) and NF-kB as central hubs along with carcinoembryonic antigen-related cell adhesion molecule 3 (CEACAM3), Tgf-beta, and hepatocyte growth factor (HGF) as connectors (Figure 7). Interesting enough, negative regulation of NF-kappaB TF activity, cAMP response element-binding (CREB) activity and T-cell cytokine production/positive regulation of regulatory T-cell differentiation are biological processes within the Foxp3 TReQTL subtree that are highly connected (Data not shown).
SNPs in multi-species/evolutionary conserved regions (ECRs) may imply similarity of function across species  and those within splice junctions (SJs) or splicing enhancers may play a role in gene expression regulation through exon splicing ,  . In other words, genetic variants that are related to gene expression differences between populations are more likely to be in genomic regions conserved across species and/or possibly involved in regulating transcription by altering splice forms of transcript messages. As an anecdotal example, we used 472 SNPs from the union of the Foxp3 TReQTL SNPs between CEU (n=233) and YRI (n=239) with a more liberal nominal p-value threshold <1×10−4 to determine over-representation within ECRs and SJs. As displayed in Table 4, using SNP-set enrichment analysis , the SNPs are significantly over-represented in 5-way ECRs for CEU (p-value=0.006) but not YRI (p-value=0.9) and enriched in SJs for YRI (p-value=1×10−4) but not CEU (p-value=0.9). Fisher exact tests confirm significance of the over-representation of these TReQTL SNPs in 5-way ECRs for CEU (p-value=0.0079) and in SJs for YRI (p-value=0.0001).
Genetic and transcriptional variations are important key factors in the evolution of biology and the dispensation of diseases. Single nucleotide polymorphisms (SNPs) are one type of DNA sequence alteration that is commonly used as a marker for tracking genetic variation. The allelic frequency of a SNP at a given locus can vary between populations and the genotype may code for a SNP that results in a particular phenotype, trait or disease , , , . Within populations and under certain biological conditions genes are coordinately regulated by transcript-regulators (TRs) such as transcription factors (TFs), cofactors, complexes of TFs and miRNAs (Table 1). These co-expressed genes often times share biological functions and work in concert to mediate cellular events such as biological processes and molecular pathways. Although it has been shown that TFs do not harbor trans-acting variants , coupling coordinately regulated genes as a quantitative trait for a loci (eQTL) with the genotype of SNPs as a genome-wide association study (GWAS) can presumably help to elucidate variation in gene expression (TReQTLs) on a genomic and systems biology scale that code for particular phenotypes and complex diseases .
Tailoring the GWAS eQTL analysis by considering genes with coordinated expression is of added value to reveal master regulators of transcriptional genetic variation (Figure 1). We used a multivariate linear regression with the gene expression of known downstream targets (DSTs) of TRs (Figure 2) as the response variable and individual SNPs as predictor variables to identify TReQTLs in European (CEU) and African (YRI) HapMap populations. At a nominal p-value threshold of <1×10−6 we discovered 234 SNPs in CEU and 154 in YRI as putative TReQTLs (Figure 4). These represent 36 and 39 independent (tag) SNPs in CEU and YRI affecting the DSTs of 25 and 36 TRs respectively. Two SNPs (one in each population) are cis-acting TReQTLs (within 1 kb of a gene) at a false discovery rate (FDR) of 45%. One of them, a SNP in the pecanex-like 2 (Pcnxl2) gene was found in CEU to be highly associated with the DSTs of the cAMP responsive element modulator (CREM) transfactor whereas in the YRI dataset, a SNP was linked to the DSTs of miRNA hsa-miR-125a. Although the FDR may seem abnormally high and one would expect at least one if not both of the TReQTLs to be false positives, it can be misleading as others have demonstrated that adjusting for biases which arise from correlations in eQTL analysis is a major challenge and a substantial overestimation of the number of false positives , , .
Interestingly enough, the gene expression of the DSTs of 24 TRs was associated with SNPs (albeit different ones) in both populations (Table 2) but the majority differed (Figures 3 and and4).4). The overlap in the TReQTLs probably reflects the ubiquity of certain basic biological processes such as transcription regulation, cell communication, transport, kinase activity, growth and development. On the otherhand, one TReQTL tag SNP (rs3790904) in the CEU population is associated (p-value=8.1×10−7) with the DSTs of the X-linked breast cancer suppressor gene Foxp3 (Figure 6) but is not significant in YRI (p-value=0.89). The interaction network of the Foxp3 TReQTL in CEU revealed that tumor necrosis factor (TNF), NF-kappaB and variants in G-protein coupled receptors (GPCR) signaling may play a central role as communicators in Foxp3 functional regulation (Figure 7). Although the Foxp3 tumor suppressor is biologically relevant in the pathogenesis of breast cancer, some have shown that SNPs in the germline of the gene are not associated with the risk of the disease . Our TReQTL analysis reveals other potentially interesting loci which might be causative in the etiology of complex diseases.
Another difference between the two populations based on the TReQTLs was the connectivity of the underlining Gene Ontology (GO) biological processes that the genes of the TReQTL represent (Figure 5). In CEU, several SNPs associated with the variation of expression for the DSTs of two miRNAs (hsa-mir-181b-1 (MI0000270) and hsa-mir-181b-2 (MI0000683)) are mapped to the peptidyl-prolyl cis-trans isomerase pseudo-gene and yields a subtree with synaptic transmission as the more cohesive descriptive GO term (Table 3). The activity of this enzyme has been suggested to be necessary for memory formation and may be involved in complex neurodegenerations such as Alzheimer's disease . In YRI, a SNP (rs12258754) controlling the variation of expression for the DSTs of activating transcription factor 3 (Atf3) yielded a subtree with vascular smooth muscle cell (VSMC) contraction as the more descriptive GO term (Table 3). Although much is not currently known about the function of Atf3 in VSMCs , mutations in the actin, alpha 2 (Acta2) smooth muscle gene have been shown to result in a variety of vascular diseases . Transcriptional networks such as these have been recently shown to be hubs with high connectivity and association with controlling higher-ordered biological function such as lipogenesis, lipid trafficking and surfactant homeostasis . Our approach embraces this strategy by using the SNPs within the TReQTLs as an adjudicator for the identification of master regulators of these genetic networks. Although it is expected that a TR and its DSTs will share a common signaling pathway, what is not certain is that the SNP associated with the eQTL from the TR and DSTs will reside near or in a gene with biological functionality that forms a cohesive GO biological process subtree. Bear in mind that it is not known where the true regulating TR associated with a candidate TReQTL actually exerts its biological functionality and to date, there is no independent data set with gene expression and genotype calls from another sample of the YRI and CEU populations to replicate our results. However, once the genotype data from Idaghdour et al.  are made publicly available, we will be able to use it to determine if our TReQTLs can discern between Moroccan populations according to geographical locations, regional differences and ancestry. Furthermore, in depth functional analyses on TR targets will presumably shed light on these TReQTL regulatory networks and perhaps biologically confirm our results.
McCauley et al.  reported that SNPs in multi-species conserved sequences (MCS) are useful as markers linking to complex diseases. Recent evidence suggests that SNPs that influence alternative splicing are enriched within splice junctions (SJs) or disrupt splicing enhancers , . Our analysis of Foxp3 TReQTLs revealed SNPs overrepresented within 5-way (human, mouse, chimp, rhesus monkey and dog) evolutionary conserved regions (ECRs) in CEU and in SJs of YRI defined by RNA-Seq mapping (Table 4). These results support the notion that genomics, genetics and transcriptomics play an intricate role in sustaining population diversity and structure . It would be interesting to determine how environmental factors, population structure and geographical differences affect transcript abundance as a quantitative trait when co-regulation of gene expression is considered.
Although the identification of TReQTLs is useful for determining genetic variants regulating gene expression, there are limitations to the approach and guidelines with interpretation of the results. First, there is a paucity of information about the genes which TRs control. We restricted our analysis to only 333 TRs with two or more DSTs known at a given time to be regulated by TRs. This does not capture the full array of genetic variants which might contribute to the gene expression differences between the two populations. However, as advances in functional genomics leads to improved knowledge about gene regulation and biological function on a genome-wide scale, the discovery of TReQTLs should advance and be more informative. In addition, the study of the transcript-regulation of genes by miRNA is in its infancy and there is a small number of miRNAs known to regulate genes. Furthermore, our analysis only tested the association of a single SNP with sets of coordinately expressed genes. It is very likely that the variation in expression is due to the synergistic effect of two or more SNPs. In fact, there may be other mediators of complex diseases other than SNPs acting alone or symbiotically. Finally, our work relied on samples from immortalized lymphoblastoid cell lines (LCLs) and not from a disease state. Therefore, it is debatable whether or not the genetic associations of SNPs with gene expression in LCLs will carry over to tissue samples from organs . However, there is some indication, albeit a paucity of evidence, that the DNA repair capacity of LCLs from breast cancer samples is significantly lower than control subjects , that tumor-infiltrating Foxp3+ regulatory T cells can distinguish between high-risk breast cancer patients and those at risk of a late relapse  and that a fraction of eQTLs derived from the analysis of UK Adult Twin registry LCLs gene expression and genotype data overlap with those identified in a HapMap population . Despite the caveats noted above, the advantages of associating genetic markers such as SNPs to quantitative traits such as co-regulated genes is promising and of value as an additional strategy when investigating the role of a genetic variant and master regulators in the etiology of a complex diseases.
Genotype data (phase-II, release 24, forward strand, non-redundant) from the 60 Yoruba in Ibadan, Nigeria (African: YRI) and from the 60 CEPH-Utah residents with ancestry from northern and western Europe (European: CEU) populations were obtained from the International HapMap Project . SNPs with a call rate <95%, minor allele frequency (MAF)<0.05, or Hardy-Weinberg equilibrium  p-value<0.05 within each population separately were removed and we restricted our analysis to autosomal markers only. About 2 million SNPs in CEU and ~2.2 million SNPs in YRI were retained after filtering. The approximately 1.5 million SNPs in common between the two populations after filtering (common set) were used for TReQTL preliminary analysis using a nominal p-value threshold of <1×10−6. To account for correlated SNPs, we used the LRTag approach ,  with linkage disequilibrium (LD) correlation (r2)>0.5 and MAF>=0.05 to tag 416,160 independent SNPs (tag set) of the 1.5 million in the common set. In addition, for multiple testing correction, we used the p-values of the tag SNPs that are within 1 kb of a gene (cis-acting set) to obtain an FDR. For the SNP association portion of the study, we focused the sample set on the 60 CEU and 55 YRI individuals that had corresponding gene expression data.
Gene expression data from the profiling of Epstein-Barr virus (EBV)-transformed lymphoblastoid cell lines from the individuals in each CEU and YRI HapMap population were acquired from assaying total RNA on Affymetrix Human HG-Focus Target GeneChip Array  containing about 9,000 probe sets and representing about 8,600 UniGenes (NetAffx array annotation release 29, March 30, 2009 UniGene build 219). The data are available in the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) database ,  under accession number GSE10824. The data were log2 transformed and normalized using the robust multi-array average method . The gene expression data from probe sets that mapped to the same UniGene cluster were averaged resulting in 8,399 unique UniGene transcripts represented by probe sets on the array.
We obtained raw sequence reads from the whole transcriptome sequencing of the CEU and YRI populations , . These reads were mapped to the reference human genome (hg18, NCBI 36) using the Bowtie program . Splice junctions (SJs) were mapped using the Tophat program . Transcript abundance was calculated by the Cufflinks program  and normalized using the FPKM (Fragment Per Kilobase of Exon per Million reads) method . At a nominal p-value<1×10−4 there are a total of 7,149 and 7,040 TReQTL SNPs in the CEU and YRI populations respectively. Each SNP was evaluated for its presence in SJs in each individual from both populations. The total number of SNPs in each population was calculated based on their presence or absence in any individual of the population.
The Affymetrix GeneChip array probe sets were collapsed into approximately 8,600 UniGene transcript clusters using the March 30, 2009 release of UniGene (build 219). The gene expression data from probe sets that mapped to the same UniGene cluster were averaged resulting in 8,399 UniGene transcripts. The UniGene downstream targets (DSTs) of transcript-regulators (TRs: transcription factors (TFs), miRNAs, cofactors and complexes) were obtained from the March 26, 2010 release (version 2010.1) of the TRANSFAC® database , . TRs were mapped to signaling pathways using the June 26, 2009 release (version 2009.2) of the TRANSPATH® database . Significance of signal transduction pathway profiling was determined as previously described . Briefly, for each population individually, significant TRs were based on a Group Correlation Score
defined as the sum of the squares of the Pearson correlations (r) among all pairs of genes i and j determined to be DSTs of the TR. Significant pathways made up of TRs are based on an Exclusive Group Correlation Score
defined as the sum of the squares of r over all pairs of genes i and j in a pathway that do not share any TR. This eliminates the contribution of the co-expression of DSTs that share TRs. The p-value for a score was determined from a non-parametric distribution of correlation scores obtained from random cases (B=10,000 reshuffles of the genes) and the number of times (n) one of these permuted scores is greater than the observed correlation score. Thus, For both correlation scores, GCS and EGCS, this null hypothesis keeps the structure and overlap of all pathways fixed, but changes the identity of the genes.
For each population and TR, we performed a genome-wide scan by regressing the log2 expression levels of the DSTs on each SNP genotype (Z) (coded 0, 1, and 2 representing the number of minor alleles) separately across the genome. The following multivariate linear regression (MVR) model was used
where Yij denotes the log2 expression levels of the DST j (j=1,…, m) for a TR for subject i (i=1,…,n), m is the number of DSTs of the TR, Zi is a SNP genotype, åij is an error and εi=(εi1,…, εim) follows a multivariate normal distribution with mean 0 and covariance Σ. To test for the null hypothesis of association between a SNP and a TR, we performed the likelihood ratio test for testing the null hypothesis H0: β11=…=β1m=0, which follows a chi-square distribution with m degrees of freedom for m<n. Let the chi-square test statistic D=−2(ln(likelihood null model)–ln(likelihood full model)) where the null model is the MVR model without the genotypes corresponding to the SNP and the full model is the MVR model with the genotypes corresponding to the SNP. The p-value for each association of a SNP and set of DSTs for a TR was obtained from the distribution of D≈χ2 with degrees of freedom=m. For cases where m>=n, an approximate F-statistic , ,  was used in order to avoid situations where the covariance matrix from the MVR model is not full rank. In these cases, the p-value for statistical significance of each association of a SNP and a set of DSTs for a TR was assessed by permuting the n rows and n columns of the F-statistic G (Gower's centered) matrix (1×106 times) and determining the number of times one of these bootstrapped scores is greater than the observed score. We fit model (3) by regressing the DSTs of a TR on each SNP separately across the genome. For multiple testing correction, we used the 6.1×107 p-values from the regression of the DSTs of the 333 TRs on the 184,616 independent (tag) SNPs that are within 1 kb of a gene (cis-acting set) to control the FDR  at 45%.
To determine whether a set of SNPs representing TReQTLs are enriched within the genome, a variation of gene set enrichment analysis  was used. Rather than enrich for SNPs within pathways as previously described , we test for enrichment of SNPs within particular genomic regions. The regions of interest are either evolutionary conserved regions (ECRs) or splice junctions (SJs). The rationale is that variants which are related to gene expression differences between populations are more likely to be in genomic regions conserved across species and\or possibly involved in regulating transcription by altering splice forms of transcript messages. The ECRs are from 5-way (human, mouse, chimp, rhesus monkey and dog) and 17-way (human, chimp, macaque, mouse, rat, rabbit, dog, cow, armadillo, elephant, tenrec, opossum, chicken, frog, zebrafish, Teraodon, and Fugu) conservation scores for the +/− 20 kbp flanking regions of the genes. Overlapping chromosomal location intervals for phastCon scores  were segmented and the highest conservation score within the interval was obtained. SJs were mapped using Tophat . Transcript abundance was calculated by Cufflinks  and normalized using FPKM  (see the RNA-Seq methods section). For each TR, given a combined set L of SNPs associated with significant TReQTLs within the CEU or YRI population, their corresponding p-values and the +1 or −1 indication (flag) of the assignment of the SNP either within or not within genomic region set Si, an enrichment score (ES) is obtained by the ranking of the SNPs in ascending order (most significant to least significant), and then summing up the assignment flags. The ES is calculated by screening this list from the top to the bottom and increasing (decreasing) a running sum Kolmogorov–Smirnov-like statistic  when encountering or not encountering a SNP in a genomic region. A normalized ES (NES) is obtained by accounting for the size of the genomic region set Si. The p-value for enrichment was determined from a non-parametric distribution of NESs obtained from random cases (10,000 reshuffles of the assignment of the SNP (either within or not within a genomics region)) and the number of times one of these permuted scores is greater than the observed NES. Significance of enrichment was also confirmed using a two-tailed Fisher's exact (parametric) test.
The ranking of TReQTL SNPs was performed by first measuring the cohesion of GO biological process terms in TReQTL i:
where Ni is the number of nodes (biological process terms) represented in TReQTLi and μi is the number of the edges between nodes. The edges were derived from the structure of GO subtree for each TReQTLi created from the biological process terms of the gene that the SNP maps to or is in close proximity, those of the TRs (excluding miRNAs) and the DST genes. The cohesion measure αi is then weighted by the p-value of TReQTLi to give a weighted rank. The weight is computed as −2log10(p-value). Finally, the number of paths and the number of biological process terms within the subtree were used to derive of the adjusted cohesion score
where for the ith TReQTL, Mi is the maximum number of paths of a biological process term and Pi is the number of paths.
Ingenuity Pathway Analysis (IPA) software version 8.8 and canonical pathway content version 3204 were used to build gene interactions from Foxp3, its DSTs (CSF2 and Il2) and the significant TReQTL genes from CEU and mapped on chromosome 1 (KCNJ1, LPHN2 and LPHN3).
The TReQTLs for the CEU and YRI populations. Tab-delimited text file.
Gene Ontology biological process subtrees from the CEU TReQTLs. Tab-delimited text file.
Gene Ontology biological process subtrees from the YRI TReQTLs. Tab-delimited text file.
We would like to thank Yuehua Cui and Shaoyu Li for pseudo F-statistics details, Dmitri Zaykin for discussions about combining SNPs with gene expression for genetical genomics and Frank Dai at the NIEHS Computational Biology Facility for multiprocesser computing resources. We are extremely grateful to Lan Liu and Tao Jiang for providing us with the LRTag genome-wide and across populations SNP tagging application. We also thank Steve Kleeberger, Min Shi and Greg Gibson for their critical review of the manuscript.
Competing Interests: Jun Lun and Ahsan Huda are employeed by SRA International Inc. as contractors for NIEHS. This does not alter the authors' adherence to all the PLoS ONE policies on sharing data and materials.
Funding: This research was supported in part by the Intramural Research Program of the National Institutes of Health (NIH) and National Institute of Environmental Health Sciences (NIEHS) [Z01 ES102345-04] and in part by the National Cancer Institute Grants R37CA76404 and P01CA134294. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.