Annotating and interpreting the results of genome-wide association studies (GWAS) remains challenging. Assigning function to genetic variants as expression quantitative trait loci is an expanding and useful approach, but focuses exclusively on mRNA rather than protein levels. Many variants remain without annotation. To address this problem, we measured the steady state abundance of 441 human signaling and transcription factor proteins from 68 Yoruba HapMap lymphoblastoid cell lines to identify novel relationships between inter-individual protein levels, genetic variants, and sensitivity to chemotherapeutic agents. Proteins were measured using micro-western and reverse phase protein arrays from three independent cell line thaws to permit mixed effect modeling of protein biological replicates. We observed enrichment of protein quantitative trait loci (pQTLs) for cellular sensitivity to two commonly used chemotherapeutics: cisplatin and paclitaxel. We functionally validated the target protein of a genome-wide significant trans-pQTL for its relevance in paclitaxel-induced apoptosis. GWAS overlap results of drug-induced apoptosis and cytotoxicity for paclitaxel and cisplatin revealed unique SNPs associated with the pharmacologic traits (at p<0.001). Interestingly, GWAS SNPs from various regions of the genome implicated the same target protein (p<0.0001) that correlated with drug induced cytotoxicity or apoptosis (p≤0.05). Two genes were functionally validated for association with drug response using siRNA: SMC1A with cisplatin response and ZNF569 with paclitaxel response. This work allows pharmacogenomic discovery to progress from the transcriptome to the proteome and offers potential for identification of new therapeutic targets. This approach, linking targeted proteomic data to variation in pharmacologic response, can be generalized to other studies evaluating genotype-phenotype relationships and provide insight into chemotherapeutic mechanisms.
The central dogma of biology explains that DNA is transcribed to mRNA that is further translated into protein. Many genome-wide studies have implicated genetic variation that influences gene expression and that ultimately affect downstream complex traits including response to drugs. However, because of technical limitations, few studies have evaluated the contribution of genetic variation on protein expression and ensuing effects on downstream phenotypes. To overcome this challenge, we used a novel technology to simultaneously measure the baseline expression of 441 proteins in lymphoblastoid cell lines and compared them with publicly available genetic data. To further illustrate the utility of this approach, we compared protein-level measurements with chemotherapeutic induced apoptosis and cell-growth inhibition data. This study demonstrates the importance of using protein information to understand the functional consequences of genetic variants identified in genome-wide association studies. This protein data set will also have broad utility for understanding the relationship between other genome-wide studies of complex traits.
Exome sequencing has been widely used in detecting pathogenic nonsynonymous single nucleotide variants (SNVs) for human inherited diseases. However, traditional statistical genetics methods are ineffective in analyzing exome sequencing data, due to such facts as the large number of sequenced variants, the presence of non-negligible fraction of pathogenic rare variants or de novo mutations, and the limited size of affected and normal populations. Indeed, prevalent applications of exome sequencing have been appealing for an effective computational method for identifying causative nonsynonymous SNVs from a large number of sequenced variants. Here, we propose a bioinformatics approach called SPRING (Snv PRioritization via the INtegration of Genomic data) for identifying pathogenic nonsynonymous SNVs for a given query disease. Based on six functional effect scores calculated by existing methods (SIFT, PolyPhen2, LRT, MutationTaster, GERP and PhyloP) and five association scores derived from a variety of genomic data sources (gene ontology, protein-protein interactions, protein sequences, protein domain annotations and gene pathway annotations), SPRING calculates the statistical significance that an SNV is causative for a query disease and hence provides a means of prioritizing candidate SNVs. With a series of comprehensive validation experiments, we demonstrate that SPRING is valid for diseases whose genetic bases are either partly known or completely unknown and effective for diseases with a variety of inheritance styles. In applications of our method to real exome sequencing data sets, we show the capability of SPRING in detecting causative de novo mutations for autism, epileptic encephalopathies and intellectual disability. We further provide an online service, the standalone software and genome-wide predictions of causative SNVs for 5,080 diseases at http://bioinfo.au.tsinghua.edu.cn/spring.
The detection of causative nonsynonymous single nucleotide variants (SNVs) is essential for the understanding of the pathogenesis of human inherited diseases. In this paper, we propose a statistical method called SPRING (Snv PRioritization via the INtegration of Genomic data) to combine six functional effect scores calculated by existing methods and five association scores derived from multiple genomic data sources to estimate the statistical significance that a nonsynonymous SNV is pathogenic for a query disease. We find that SPRING is effective in identifying disease-causing SNVs for diseases whose genetic bases are either partly known or completely unknown across a variety of inheritance styles. With real exome sequencing data, we show the qualified potential of SPRING in not only the detection of causative SNVs in simulation studies but also the identification of pathogenic de novo mutations for autism, epileptic encephalopathies and intellectual disability.
Phenotypes proximal to gene action generally reflect larger genetic effect sizes than those that are distant. The human metabolome, a result of multiple cellular and biological processes, are functional intermediate phenotypes proximal to gene action. Here, we present a genome-wide association study of 308 untargeted metabolite levels among African Americans from the Atherosclerosis Risk in Communities (ARIC) Study. Nineteen significant common variant-metabolite associations were identified, including 13 novel loci (p<1.6×10−10). These loci were associated with 7–50% of the difference in metabolite levels per allele, and the variance explained ranged from 4% to 20%. Fourteen genes were identified within the nineteen loci, and four of them contained non-synonymous substitutions in four enzyme-encoding genes (KLKB1, SIAE, CPS1, and NAT8); the other significant loci consist of eight other enzyme-encoding genes (ACE, GATM, ACY3, ACSM2B, THEM4, ADH4, UGT1A, TREH), a transporter gene (SLC6A13) and a polycystin protein gene (PKD2L1). In addition, four potential disease-associated paths were identified, including two direct longitudinal predictive relationships: NAT8 with N-acetylornithine, N-acetyl-1-methylhistidine and incident chronic kidney disease, and TREH with trehalose and incident diabetes. These results highlight the value of using endophenotypes proximal to gene function to discover new insights into biology and disease pathology.
Most contemporary GWAS studies have achieved increased power by increasing the size of the discovery sample to tens of thousands of individuals. An alternative approach for detecting the effects of novel loci is to measure phenotypes that more immediately reflect the effects of gene function. The metabolome consists of a collection of small molecules resulting from a variety of cellular and biologic processes, which can be considered intermediate phenotypes proximal to gene function. Here, we report a genome-wide association study identifying nineteen genetic loci influencing untargeted metabolomes traits among African Americans in the Atherosclerosis Risk in Communities (ARIC) Study. Fourteen genes mapped within nineteen loci, including twelve enzyme-encoding genes (KLKB1, SIAE, CPS1, NAT8, ACE, GATM, ACY3, ACSM2B, THEM4, ADH4, UGT1A and TREH), a transporter gene (SLC6A13) and a polycystin protein gene (PKD2L1). In addition, four potential disease-associated paths were identified, including two direct longitudinal predictive relationships: NAT8 with N-acetylornithine, N-acetyl-1-methylhistidine and incident chronic kidney disease, and TREH with trehalose and incident diabetes. These results highlight the value of using phenotypes proximal to gene function to promote novel gene discovery.
Cross-sectional studies have associated short telomere length with smoking, body weight, physical activity, and possibly alcohol intake; however, whether these associations are due to confounding is unknown. We tested these hypotheses in 4,576 individuals from the general population cross-sectionally, and with repeat measurement of relative telomere length 10 years apart. We also tested whether change in telomere length is associated with mortality and morbidity in the general population. Relative telomere length was measured with quantitative polymerase chain reaction. Cross-sectionally at the first examination, short telomere length was associated with increased age (P for trend across quartiles = 3×10−77), current smoking (P = 8×10−3), increased body mass index (P = 7×10−14), physical inactivity (P = 4×10−17), but not with increased alcohol intake (P = 0.10). At the second examination 10 years later, 56% of participants had lost and 44% gained telomere length with a mean loss of 193 basepairs. Change in leukocyte telomere length during 10 years was associated inversely with baseline telomere length (P<1×10−300) and age at baseline (P = 1×10−27), but not with baseline or 10-year inter-observational tobacco consumption, body weight, physical activity, or alcohol intake. Prospectively during a further 10 years follow-up after the second examination, quartiles of telomere length change did not associate with risk of all-cause mortality, cancer, chronic obstructive pulmonary disease, diabetes mellitus, ischemic cerebrovascular disease, or ischemic heart disease. In conclusion, smoking, increased body weight, and physical inactivity were associated with short telomere length cross-sectionally, but not with telomere length change during 10 years observation, and alcohol intake was associated with neither. Also, change in telomere length did not associate prospectively with mortality or morbidity in the general population.
Human chromosomes are capped by protective ends called telomeres. These ends are shortened during renewal of tissue and eventually become critically short, causing cells to become senescent or die. It is widely believed that lifestyle features such as smoking, obesity, physical inactivity, and possibly alcohol intake enhance shortening of telomeres. However, strong evidence to support such an interpretation is hard to find. We therefore tested whether these lifestyle factors are associated with telomere length change in 4,576 healthy individuals from the general population. Individuals had relative telomere length measured twice with a 10-year interval, and were then followed for mortality and morbidity for a further 10 years after the second measurement. We found change in telomere length to be more dynamic than previously believed, as we observed both shortening (in 56%) and lengthening (in 44%) among participants. Contrary to previous beliefs, we found telomere length change to be unaffected by lifestyle factors. Instead, we found the strongest association between past telomere length and age with change in telomere length over 10 years. Also, we found no association between change in telomere length and risk of all-cause mortality, cancer, chronic obstructive lung disease, diabetes mellitus, ischemic cerebrovascular disease, or ischemic heart disease.
Genetic variation in the major histocompatibility complex (MHC) affects CD4∶CD8 lineage commitment and MHC expression. However, the contribution of specific genes in this gene-dense region has not yet been resolved. Nor has it been established whether the same genes regulate MHC expression and T cell selection. Here, we assessed the impact of natural genetic variation on MHC expression and CD4∶CD8 lineage commitment using two genetic models in the rat. First, we mapped Quantitative Trait Loci (QTLs) associated with variation in MHC class I and II protein expression and the CD4∶CD8 T cell ratio in outbred Heterogeneous Stock rats. We identified 10 QTLs across the genome and found that QTLs for the individual traits colocalized within a region spanning the MHC. To identify the genes underlying these overlapping QTLs, we generated a large panel of MHC-recombinant congenic strains, and refined the QTLs to two adjacent intervals of ∼0.25 Mb in the MHC-I and II regions, respectively. An interaction between these intervals affected MHC class I expression as well as negative selection and lineage commitment of CD8 single-positive (SP) thymocytes. We mapped this effect to the transporter associated with antigen processing 2 (Tap2) in the MHC-II region and the classical MHC class I gene(s) (RT1-A) in the MHC-I region. This interaction was revealed by a recombination between RT1-A and Tap2, which occurred in 0.2% of the rats. Variants of Tap2 have previously been shown to influence the antigenicity of MHC class I molecules by altering the MHC class I ligandome. Our results show that a restricted peptide repertoire on MHC class I molecules leads to reduced negative selection of CD8SP cells. To our knowledge, this is the first study showing how a recombination between natural alleles of genes in the MHC influences lineage commitment of T cells.
Peptides from degraded cytoplasmic proteins are transported via TAP into the endoplasmic reticulum for loading onto MHC class I molecules. TAP is encoded by Tap1 and Tap2, which in rodents are located close to the MHC class I genes. In the rat, genetic variation in Tap2 gives rise to two different transporters: a promiscuous A variant (TAP-A) and a more restrictive B variant (TAP-B). It has been proposed that the class I molecule in the DA rat (RT1-Aa) has co-evolved with TAP-A and it has been shown that RT1-Aa antigenicity is changed when co-expressed with TAP-B. To study the contribution of different allelic combinations of RT1-A and Tap2 to the variation in MHC expression and T cell selection, we generated DA rats with either congenic or background alleles in the RT1-A and Tap2 loci. We found increased numbers of mature CD8SP cells in the thymus of rats which co-expressed RT1-Aa and TAP-B. This increase of CD8 cells could be explained by reduced negative selection, but did not correlate with RT1-Aa expression levels on thymic antigen presenting cells. Thus, our results identify a crucial role of the TAP and the quality of the MHC class I repertoire in regulating T cell selection.
Metabolic traits are molecular phenotypes that can drive clinical phenotypes and may predict disease progression. Here, we report results from a metabolome- and genome-wide association study on 1H-NMR urine metabolic profiles. The study was conducted within an untargeted approach, employing a novel method for compound identification. From our discovery cohort of 835 Caucasian individuals who participated in the CoLaus study, we identified 139 suggestively significant (P<5×10−8) and independent associations between single nucleotide polymorphisms (SNP) and metabolome features. Fifty-six of these associations replicated in the TasteSensomics cohort, comprising 601 individuals from São Paulo of vastly diverse ethnic background. They correspond to eleven gene-metabolite associations, six of which had been previously identified in the urine metabolome and three in the serum metabolome. Our key novel findings are the associations of two SNPs with NMR spectral signatures pointing to fucose (rs492602, P = 6.9×10−44) and lysine (rs8101881, P = 1.2×10−33), respectively. Fine-mapping of the first locus pinpointed the FUT2 gene, which encodes a fucosyltransferase enzyme and has previously been associated with Crohn's disease. This implicates fucose as a potential prognostic disease marker, for which there is already published evidence from a mouse model. The second SNP lies within the SLC7A9 gene, rare mutations of which have been linked to severe kidney damage. The replication of previous associations and our new discoveries demonstrate the potential of untargeted metabolomics GWAS to robustly identify molecular disease markers.
The concentrations of small molecules known as metabolites, are subject to tight regulation in all organisms. Collectively, the metabolite concentrations make up the metabolome, which differs amongst individuals as a function of their environment and genetic makeup. In our study, we have further developed an untargeted approach to identify genetic factors affecting human metabolism. In this approach, we first identify all genetic variants that correlate with any of the measured metabolome features in a large set of individuals. For these variants, we then compute a profile of significance for association with all features, generating a signature that facilitates the expert or computational identification of the metabolite whose concentration is most likely affected by the genetic variant at hand. Our study replicated many of the previously reported genetically driven variations in human metabolism and revealed two new striking examples of genetic variations with a sizeable effect on the urine metabolome. Interestingly, in these two gene-metabolite pairs both the gene and the affected metabolite are related to human diseases – Crohn's disease in the first case, and kidney disease in the second. This highlights the connection between genetic predispositions, affected metabolites, and human health.
Transcription factors (TFs) are fundamental controllers of cellular regulation that function in a complex and combinatorial manner. Accurate identification of a transcription factor's targets is essential to understanding the role that factors play in disease biology. However, due to a high false positive rate, identifying coherent functional target sets is difficult. We have created an improved mapping of targets by integrating ChIP-Seq data with 423 functional modules derived from 9,395 human expression experiments. We identified 5,002 TF-module relationships, significantly improved TF target prediction, and found 30 high-confidence TF-TF associations, of which 14 are known. Importantly, we also connected TFs to diseases through these functional modules and identified 3,859 significant TF-disease relationships. As an example, we found a link between MEF2A and Crohn's disease, which we validated in an independent expression dataset. These results show the power of combining expression data and ChIP-Seq data to remove noise and better extract the associations between TFs, functional modules, and disease.
Transcription factors (TFs) are crucial to the precise regulation of many cellular processes and thus, are responsible for many human phenotypes and diseases. Now that the ENCODE project has mapped hundreds of TFs to their genomic binding locations, extracting functional biological signals is the next step in understanding their role in disease. In this paper, we present a novel approach to identifying TF targets and use these targets to find regulatory relationships between TFs and diseases. We present a large open dataset of putative TF-TF interactions and TF-disease associations which includes known connections as well as novel ones. We validate the association of one of our novel TF-disease associations, MEF2A and Crohn's disease, suggesting that our approach generates testable disease association hypotheses. Integrating these datasets will be crucial for understanding phenotypes and complex diseases.
Identifying environmentally-specific genetic effects is a key challenge in understanding the structure of complex traits. Model organisms play a crucial role in the identification of such gene-by-environment interactions, as a result of the unique ability to observe genetically similar individuals across multiple distinct environments. Many model organism studies examine the same traits but under varying environmental conditions. For example, knock-out or diet-controlled studies are often used to examine cholesterol in mice. These studies, when examined in aggregate, provide an opportunity to identify genomic loci exhibiting environmentally-dependent effects. However, the straightforward application of traditional methodologies to aggregate separate studies suffers from several problems. First, environmental conditions are often variable and do not fit the standard univariate model for interactions. Additionally, applying a multivariate model results in increased degrees of freedom and low statistical power. In this paper, we jointly analyze multiple studies with varying environmental conditions using a meta-analytic approach based on a random effects model to identify loci involved in gene-by-environment interactions. Our approach is motivated by the observation that methods for discovering gene-by-environment interactions are closely related to random effects models for meta-analysis. We show that interactions can be interpreted as heterogeneity and can be detected without utilizing the traditional uni- or multi-variate approaches for discovery of gene-by-environment interactions. We apply our new method to combine 17 mouse studies containing in aggregate 4,965 distinct animals. We identify 26 significant loci involved in High-density lipoprotein (HDL) cholesterol, many of which are consistent with previous findings. Several of these loci show significant evidence of involvement in gene-by-environment interactions. An additional advantage of our meta-analysis approach is that our combined study has significantly higher power and improved resolution compared to any single study thus explaining the large number of loci discovered in the combined study.
Identifying gene-by-environment interactions is important for understand the architecture of a complex trait. Discovering gene-by-environment interaction requires the observation of the same phenotype in individuals under different environments. Model organism studies are often conducted under different environments. These studies provide an unprecedented opportunity for researchers to identify the gene-by-environment interactions. A difference in the effect size of a genetic variant between two studies conducted in different environments may suggest the presence of a gene-by-environment interaction. In this paper, we propose to employ a random-effect-based meta-analysis approach to identify gene-by-environment interaction, which assumes different or heterogeneous effect sizes between studies. Our approach is motivated by the observation that methods for discovering gene-by-environment interactions are closely related to random effects models for meta-analysis. We show that interactions can be interpreted as heterogeneity and can be detected without utilizing the traditional approaches for discovery of gene-by-environment interactions, which treats the gene-by-environment interactions as covariates in the analysis. We provide a intuitive way to visualize the results of the meta-analysis at a locus which allows us to obtain the biological insights of gene-by-environment interactions. We demonstrate our method by searching for gene-by-environment interactions by combining 17 mouse genetic studies totaling 4,965 distinct animals.
Recent high-throughput efforts such as ENCODE have generated a large body of genome-scale transcriptional data in multiple conditions (e.g., cell-types and disease states). Leveraging these data is especially important for network-based approaches to human disease, for instance to identify coherent transcriptional modules (subnetworks) that can inform functional disease mechanisms and pathological pathways. Yet, genome-scale network analysis across conditions is significantly hampered by the paucity of robust and computationally-efficient methods. Building on the Higher-Order Generalized Singular Value Decomposition, we introduce a new algorithmic approach for efficient, parameter-free and reproducible identification of network-modules simultaneously across multiple conditions. Our method can accommodate weighted (and unweighted) networks of any size and can similarly use co-expression or raw gene expression input data, without hinging upon the definition and stability of the correlation used to assess gene co-expression. In simulation studies, we demonstrated distinctive advantages of our method over existing methods, which was able to recover accurately both common and condition-specific network-modules without entailing ad-hoc input parameters as required by other approaches. We applied our method to genome-scale and multi-tissue transcriptomic datasets from rats (microarray-based) and humans (mRNA-sequencing-based) and identified several common and tissue-specific subnetworks with functional significance, which were not detected by other methods. In humans we recapitulated the crosstalk between cell-cycle progression and cell-extracellular matrix interactions processes in ventricular zones during neocortex expansion and further, we uncovered pathways related to development of later cognitive functions in the cortical plate of the developing brain which were previously unappreciated. Analyses of seven rat tissues identified a multi-tissue subnetwork of co-expressed heat shock protein (Hsp) and cardiomyopathy genes (Bag3, Cryab, Kras, Emd, Plec), which was significantly replicated using separate failing heart and liver gene expression datasets in humans, thus revealing a conserved functional role for Hsp genes in cardiovascular disease.
Complex biological interactions and processes can be modelled as networks, for instance metabolic pathways or protein-protein interactions. The growing availability of large high-throughput data in several experimental conditions now permits the full-scale analysis of biological interactions and processes. However, no reliable and computationally efficient methods for simultaneous analysis of multiple large-scale interaction datasets (networks) have been developed to date. To overcome this shortcoming, we have developed a new computational framework that is parameter-free, computationally efficient and highly reliable. We showed how these distinctive properties make it a useful tool for real genomic data exploration and analyses. Indeed, in extensive simulation studies and real-data analyses we have demonstrated that our method outperformed existing approaches in terms of efficiency and, most importantly, reproducibility of the results. Beyond the computational advantages, we illustrated how our method can be effectively applied to leverage the vast stream of genome-scale transcriptional data that has risen exponentially over the last years. In contrast with existing approaches, using our method we were able to identify and replicate multi-tissue gene co-expression networks that were associated with specific functional processes relevant to phenotypic variation and disease in rats and humans.
Personal genome analysis is now being considered for evaluation of disease risk in healthy individuals, utilizing both rare and common variants. Multiple scores have been developed to predict the deleteriousness of amino acid substitutions, using information on the allele frequencies, level of evolutionary conservation, and averaged structural evidence. However, agreement among these scores is limited and they likely over-estimate the fraction of the genome that is deleterious.
This study proposes an integrative approach to identify a subset of homozygous non-synonymous single nucleotide polymorphisms (nsSNPs). An 8-level classification scheme is constructed from the presence/absence of deleterious predictions combined with evidence of association with disease or complex traits. Detailed literature searches and structural validations are then performed for a subset of homozygous 826 mis-sense mutations in 575 proteins found in the genomes of 12 healthy adults.
Implementation of the Association-Adjusted Consensus Deleterious Scheme (AACDS) classifies 11% of all predicted highly deleterious homozygous variants as most likely to influence disease risk. The number of such variants per genome ranges from 0 to 8 with no significant difference between African and Caucasian Americans. Detailed analysis of mutations affecting the APOE, MTMR2, THSB1, CHIA, αMyHC, and AMY2A proteins shows how the protein structure is likely to be disrupted, even though the associated phenotypes have not been documented in the corresponding individuals.
The classification system for homozygous nsSNPs provides an opportunity to systematically rank nsSNPs based on suggestive evidence from annotations and sequence-based predictions. The ranking scheme, in-depth literature searches, and structural validations of highly prioritized mis-sense mutations compliment traditional sequence-based approaches and should have particular utility for the development of individualized health profiles. An online tool reporting the AACDS score for any variant is provided at the authors’ website.
Homozygous variant; Non-synonymous single nucleotide polymorphism; Personal genome interpretation; Variant prioritization; Protein structure analysis
Human telomeres are maintained by the shelterin protein complex in which TRF1 and TRF2 bind directly to duplex telomeric DNA. How these proteins find telomeric sequences among a genome of billions of base pairs and how they find protein partners to form the shelterin complex remains uncertain. Using single-molecule fluorescence imaging of quantum dot-labeled TRF1 and TRF2, we study how these proteins locate TTAGGG repeats on DNA tightropes. By virtue of its basic domain TRF2 performs an extensive 1D search on nontelomeric DNA, whereas TRF1’s 1D search is limited. Unlike the stable and static associations observed for other proteins at specific binding sites, TRF proteins possess reduced binding stability marked by transient binding (∼9–17 s) and slow 1D diffusion on specific telomeric regions. These slow diffusion constants yield activation energy barriers to sliding ∼2.8–3.6 κBT greater than those for nontelomeric DNA. We propose that the TRF proteins use 1D sliding to find protein partners and assemble the shelterin complex, which in turn stabilizes the interaction with specific telomeric DNA. This ‘tag-team proofreading’ represents a more general mechanism to ensure a specific set of proteins interact with each other on long repetitive specific DNA sequences without requiring external energy sources.
The major histocompatibility complex (MHC) region is strongly associated with multiple sclerosis (MS) susceptibility. HLA-DRB1*15:01 has the strongest effect, and several other alleles have been reported at different levels of validation. Using SNP data from genome-wide studies, we imputed and tested classical alleles and amino acid polymorphisms in 8 classical human leukocyte antigen (HLA) genes in 5,091 cases and 9,595 controls. We identified 11 statistically independent effects overall: 6 HLA-DRB1 and one DPB1 alleles in class II, one HLA-A and two B alleles in class I, and one signal in a region spanning from MICB to LST1. This genomic segment does not contain any HLA class I or II genes and provides robust evidence for the involvement of a non-HLA risk allele within the MHC. Interestingly, this region contains the TNF gene, the cognate ligand of the well-validated TNFRSF1A MS susceptibility gene. The classical HLA effects can be explained to some extent by polymorphic amino acid positions in the peptide-binding grooves. This study dissects the independent effects in the MHC, a critical region for MS susceptibility that harbors multiple risk alleles.
Multiple sclerosis (MS) is an inflammatory and neurodegenerative disease with a heritable component. Although it has been known for a long time that the strongest MS risk factor maps to the major histocompatibility complex (MHC) on chromosome 6, there are still many unresolved questions as to the identity and the nature of the risk variants within the MHC. Because the MHC has a complex structure, systematic investigation across this region has been challenging. In this study, we used state-of-the-art imputation methods coupled to statistical regression to query variants in the human leukocyte antigen (HLA) class I and II genes for a role in MS risk. Starting from available SNP genotype data, we replicated the strongest risk factor, the HLA-DRB1*15:01 allele, and were able to identify 11 independent effects in total. Functional studies are now needed to understand their mechanism in MS etiology.
Interactions between HLA class I molecules and killer-cell immunoglobulin-like receptors (KIR) control natural killer cell (NK) functions in immunity and reproduction. Encoded by genes on different chromosomes, these polymorphic ligands and receptors correlate highly with disease resistance and susceptibility. Although studied at low-resolution in many populations, high-resolution analysis of combinatorial diversity of HLA class I and KIR is limited to Asian and Amerindian populations with low genetic diversity. At the other end of the spectrum is the West African population investigated here: we studied 235 individuals, including 104 mother-child pairs, from the Ga-Adangbe of Ghana. This population has a rich diversity of 175 KIR variants forming 208 KIR haplotypes, and 81 HLA-A, -B and -C variants forming 190 HLA class I haplotypes. Each individual we studied has a unique compound genotype of HLA class I and KIR, forming 1–14 functional ligand-receptor interactions. Maintaining this exceptionally high polymorphism is balancing selection. The centromeric region of the KIR locus, encoding HLA-C receptors, is highly diverse whereas the telomeric region encoding Bw4-specific KIR3DL1, lacks diversity in Africans. Present in the Ga-Adangbe are high frequencies of Bw4-bearing HLA-B*53:01 and Bw4-lacking HLA-B*35:01, which otherwise are identical. Balancing selection at key residues maintains numerous HLA-B allotypes having and lacking Bw4, and also those of stronger and weaker interaction with LILRB1, a KIR-related receptor. Correspondingly, there is a balance at key residues of KIR3DL1 that modulate its level of cell-surface expression. Thus, capacity to interact with NK cells synergizes with peptide binding diversity to drive HLA-B allele frequency distribution. These features of KIR and HLA are consistent with ongoing co-evolution and selection imposed by a pathogen endemic to West Africa. Because of the prevalence of malaria in the Ga-Adangbe and previous associations of cerebral malaria with HLA-B*53:01 and KIR, Plasmodium falciparum is a candidate pathogen.
Natural killer cells are white blood cells with critical roles in human health that deliver front-line immunity against pathogens and nurture placentation in early pregnancy. Controlling these functions are cell-surface receptors called KIR that interact with HLA class I ligands expressed on most cells of the body. KIR and HLA are both products of complex families of variable genes, but present on separate chromosomes. Many HLA and KIR variants and their combinations associate with resistance to specific infections and pregnancy syndromes. Previously we identified basic components of the system necessary for individual and population survival. Here, we explore the system at its most genetically diverse by studying the Ga-Adangbe population from Ghana in West Africa. Co-evolution of KIR receptors with their HLA targets is ongoing in the Ga-Adangbe, with every one of 235 individuals studied having a unique set of KIR receptors and HLA class I ligands. In addition, one critical combination of receptor and ligand maintains alternative forms that either can or cannot interact with their ‘partner.’ This balance resembles that induced by malfunctioning variants of hemoglobin that confer resistance to malaria, a candidate disease for driving diversity and co-evolution of KIR and HLA class I in the Ga-Adangbe.
The improved characterisation of risk factors for rheumatoid arthritis (RA) suggests they could be combined to identify individuals at increased disease risks in whom preventive strategies may be evaluated. We aimed to develop an RA prediction model capable of generating clinically relevant predictive data and to determine if it better predicted younger onset RA (YORA). Our novel modelling approach combined odds ratios for 15 four-digit/10 two-digit HLA-DRB1 alleles, 31 single nucleotide polymorphisms (SNPs) and ever-smoking status in males to determine risk using computer simulation and confidence interval based risk categorisation. Only males were evaluated in our models incorporating smoking as ever-smoking is a significant risk factor for RA in men but not women. We developed multiple models to evaluate each risk factor's impact on prediction. Each model's ability to discriminate anti-citrullinated protein antibody (ACPA)-positive RA from controls was evaluated in two cohorts: Wellcome Trust Case Control Consortium (WTCCC: 1,516 cases; 1,647 controls); UK RA Genetics Group Consortium (UKRAGG: 2,623 cases; 1,500 controls). HLA and smoking provided strongest prediction with good discrimination evidenced by an HLA-smoking model area under the curve (AUC) value of 0.813 in both WTCCC and UKRAGG. SNPs provided minimal prediction (AUC 0.660 WTCCC/0.617 UKRAGG). Whilst high individual risks were identified, with some cases having estimated lifetime risks of 86%, only a minority overall had substantially increased odds for RA. High risks from the HLA model were associated with YORA (P<0.0001); ever-smoking associated with older onset disease. This latter finding suggests smoking's impact on RA risk manifests later in life. Our modelling demonstrates that combining risk factors provides clinically informative RA prediction; additionally HLA and smoking status can be used to predict the risk of younger and older onset RA, respectively.
Rheumatoid arthritis (RA) is a common, incurable disease with major individual and health service costs. Preventing its development is therefore an important goal. Being able to predict who will develop RA would allow researchers to look at ways to prevent it. Many factors have been found that increase someone's risk of RA. These are divided into genetic and environmental (such as smoking) factors. The risk of RA associated with each factor has previously been reported. Here, we demonstrate a method that combines these risk factors in a process called “prediction modelling” to estimate someone's lifetime risk of RA. We show that firstly, our prediction models can identify people with very high-risks of RA and secondly, they can be used to identify people at risk of developing RA at a younger age. Although these findings are an important first step towards preventing RA, as only a minority of people tested had substantially increased disease risks our models could not be used to screen the general population. Instead they need testing in people already at risk of RA such as relatives of affected patients. In this context they could identify enough numbers of high-risk people to allow preventive methods to be evaluated.
A multi-ethnic study demonstrates that the extrapolation of genetic disease risk models from European populations to other ethnicities is compromised more strongly by genetic structure than by environmental or global genetic background in differential genetic risk associations across ethnicities.
The vast majority of genome-wide association study (GWAS) findings reported to date are from populations with European Ancestry (EA), and it is not yet clear how broadly the genetic associations described will generalize to populations of diverse ancestry. The Population Architecture Using Genomics and Epidemiology (PAGE) study is a consortium of multi-ancestry, population-based studies formed with the objective of refining our understanding of the genetic architecture of common traits emerging from GWAS. In the present analysis of five common diseases and traits, including body mass index, type 2 diabetes, and lipid levels, we compare direction and magnitude of effects for GWAS-identified variants in multiple non-EA populations against EA findings. We demonstrate that, in all populations analyzed, a significant majority of GWAS-identified variants have allelic associations in the same direction as in EA, with none showing a statistically significant effect in the opposite direction, after adjustment for multiple testing. However, 25% of tagSNPs identified in EA GWAS have significantly different effect sizes in at least one non-EA population, and these differential effects were most frequent in African Americans where all differential effects were diluted toward the null. We demonstrate that differential LD between tagSNPs and functional variants within populations contributes significantly to dilute effect sizes in this population. Although most variants identified from GWAS in EA populations generalize to all non-EA populations assessed, genetic models derived from GWAS findings in EA may generate spurious results in non-EA populations due to differential effect sizes. Regardless of the origin of the differential effects, caution should be exercised in applying any genetic risk prediction model based on tagSNPs outside of the ancestry group in which it was derived. Models based directly on functional variation may generalize more robustly, but the identification of functional variants remains challenging.
The number of known associations between human diseases and common genetic variants has grown dramatically in the past decade, most being identified in large-scale genetic studies of people of Western European origin. But because the frequencies of genetic variants can differ substantially between continental populations, it's important to assess how well these associations can be extended to populations with different continental ancestry. Are the correlations between genetic variants, disease endpoints, and risk factors consistent enough for genetic risk models to be reliably applied across different ancestries? Here we describe a systematic analysis of disease outcome and risk-factor–associated variants (tagSNPs) identified in European populations, in which we test whether the effect size of a tagSNP is consistent across six populations with significant non-European ancestry. We demonstrate that although nearly all such tagSNPs have effects in the same direction across all ancestries (i.e., variants associated with higher risk in Europeans will also be associated with higher risk in other populations), roughly a quarter of the variants tested have significantly different magnitude of effect (usually lower) in at least one non-European population. We therefore advise caution in the use of tagSNP-based genetic disease risk models in populations that have a different genetic ancestry from the population in which original associations were first made. We then show that this differential strength of association can be attributed to population-dependent variations in the correlation between tagSNPs and the variant that actually determines risk—the so-called functional variant. Risk models based on functional variants are therefore likely to be more robust than tagSNP-based models.
Genome-wide association studies and follow-up meta-analyses in Crohn's disease (CD) and ulcerative colitis (UC) have recently identified 163 disease-associated loci that meet genome-wide significance for these two inflammatory bowel diseases (IBD). These discoveries have already had a tremendous impact on our understanding of the genetic architecture of these diseases and have directed functional studies that have revealed some of the biological functions that are important to IBD (e.g. autophagy). Nonetheless, these loci can only explain a small proportion of disease variance (∼14% in CD and 7.5% in UC), suggesting that not only are additional loci to be found but that the known loci may contain high effect rare risk variants that have gone undetected by GWAS. To test this, we have used a targeted sequencing approach in 200 UC cases and 150 healthy controls (HC), all of French Canadian descent, to study 55 genes in regions associated with UC. We performed follow-up genotyping of 42 rare non-synonymous variants in independent case-control cohorts (totaling 14,435 UC cases and 20,204 HC). Our results confirmed significant association to rare non-synonymous coding variants in both IL23R and CARD9, previously identified from sequencing of CD loci, as well as identified a novel association in RNF186. With the exception of CARD9 (OR = 0.39), the rare non-synonymous variants identified were of moderate effect (OR = 1.49 for RNF186 and OR = 0.79 for IL23R). RNF186 encodes a protein with a RING domain having predicted E3 ubiquitin-protein ligase activity and two transmembrane domains. Importantly, the disease-coding variant is located in the ubiquitin ligase domain. Finally, our results suggest that rare variants in genes identified by genome-wide association in UC are unlikely to contribute significantly to the overall variance for the disease. Rather, these are expected to help focus functional studies of the corresponding disease loci.
Genetic studies of common diseases have seen tremendous progress in the last half-decade primarily due to recent technologies that enable a systematic examination of genetic markers across the entire genome in large numbers of patients and healthy controls. The studies, while identifying genomic regions that influence a person's risk for developing disease, often do not pinpoint the actual gene or gene variants that account for this risk (called a causal gene/variant). A prime example of this can be seen with the 163 genetic risk factors that have recently been associated with the chronic inflammatory bowel diseases known as Crohn's disease and ulcerative colitis. For less than a handful of these 163 is the causative change in the genetic code known. The current study used an approach to directly look at the genetic code for a subset of these and identified a causative change in the genetic code for eight risk factors for ulcerative colitis. This finding is particularly important because it directs biological studies to understand the mechanisms that lead to this chronic life-long inflammatory disease.
Genome-wide association studies (GWAS) yielded significant advances in defining the genetic architecture of complex traits and disease. Still, a major hurdle of GWAS is narrowing down multiple genetic associations to a few causal variants for functional studies. This becomes critical in multi-phenotype GWAS where detection and interpretability of complex SNP(s)-trait(s) associations are complicated by complex Linkage Disequilibrium patterns between SNPs and correlation between traits. Here we propose a computationally efficient algorithm (GUESS) to explore complex genetic-association models and maximize genetic variant detection. We integrated our algorithm with a new Bayesian strategy for multi-phenotype analysis to identify the specific contribution of each SNP to different trait combinations and study genetic regulation of lipid metabolism in the Gutenberg Health Study (GHS). Despite the relatively small size of GHS (n = 3,175), when compared with the largest published meta-GWAS (n>100,000), GUESS recovered most of the major associations and was better at refining multi-trait associations than alternative methods. Amongst the new findings provided by GUESS, we revealed a strong association of SORT1 with TG-APOB and LIPC with TG-HDL phenotypic groups, which were overlooked in the larger meta-GWAS and not revealed by competing approaches, associations that we replicated in two independent cohorts. Moreover, we demonstrated the increased power of GUESS over alternative multi-phenotype approaches, both Bayesian and non-Bayesian, in a simulation study that mimics real-case scenarios. We showed that our parallel implementation based on Graphics Processing Units outperforms alternative multi-phenotype methods. Beyond multivariate modelling of multi-phenotypes, our Bayesian model employs a flexible hierarchical prior structure for genetic effects that adapts to any correlation structure of the predictors and increases the power to identify associated variants. This provides a powerful tool for the analysis of diverse genomic features, for instance including gene expression and exome sequencing data, where complex dependencies are present in the predictor space.
Nowadays, the availability of cheaper and accurate assays to quantify multiple (endo)phenotypes in large population cohorts allows multi-trait studies. However, these studies are limited by the lack of flexible models integrated with efficient computational tools for genome-wide multi SNPs-traits analyses. To overcome this problem, we propose a novel Bayesian analysis strategy and a new algorithmic implementation which exploits parallel processing architecture for fully multivariate modeling of groups of correlated phenotypes at the genome-wide scale. In addition to increased power of our algorithm over alternative Bayesian and well-established non-Bayesian multi-phenotype methods, we provide an application to a real case study of several blood lipid traits, and show how our method recovered most of the major associations and is better at refining multi-trait polygenic associations than alternative methods. We reveal and replicate in independent cohorts new associations with two phenotypic groups that were not detected by competing multivariate approaches and not noticed by a large meta-GWAS. We also discuss the applicability of the proposed method to large meta-analyses involving hundreds of thousands of individuals and to diverse genomic datasets where complex dependencies in the predictor space are present.
Genetic variants in cis-regulatory elements or trans-acting regulators frequently influence the quantity and spatiotemporal distribution of gene transcription. Recent interest in expression quantitative trait locus (eQTL) mapping has paralleled the adoption of genome-wide association studies (GWAS) for the analysis of complex traits and disease in humans. Under the hypothesis that many GWAS associations tag non-coding SNPs with small effects, and that these SNPs exert phenotypic control by modifying gene expression, it has become common to interpret GWAS associations using eQTL data. To fully exploit the mechanistic interpretability of eQTL-GWAS comparisons, an improved understanding of the genetic architecture and causal mechanisms of cell type specificity of eQTLs is required. We address this need by performing an eQTL analysis in three parts: first we identified eQTLs from eleven studies on seven cell types; then we integrated eQTL data with cis-regulatory element (CRE) data from the ENCODE project; finally we built a set of classifiers to predict the cell type specificity of eQTLs. The cell type specificity of eQTLs is associated with eQTL SNP overlap with hundreds of cell type specific CRE classes, including enhancer, promoter, and repressive chromatin marks, regions of open chromatin, and many classes of DNA binding proteins. These associations provide insight into the molecular mechanisms generating the cell type specificity of eQTLs and the mode of regulation of corresponding eQTLs. Using a random forest classifier with cell specific CRE-SNP overlap as features, we demonstrate the feasibility of predicting the cell type specificity of eQTLs. We then demonstrate that CREs from a trait-associated cell type can be used to annotate GWAS associations in the absence of eQTL data for that cell type. We anticipate that such integrative, predictive modeling of cell specificity will improve our ability to understand the mechanistic basis of human complex phenotypic variation.
When interpreting genome-wide association studies showing that specific genetic variants are associated with disease risk, scientists look for a link between the genetic variant and a biological mechanism behind that disease. One functional mechanism is that the genetic variant may influence gene transcription via a co-localized genomic regulatory element, such as a transcription factor binding site within an open chromatin region. Often this type of regulation occurs in some cell types but not others. In this study, we look across eleven gene expression studies with seven cell types and consider how genetic transcription regulators, or eQTLs, replicate within and between cell types. We identify pervasive allelic heterogeneity, or transcriptional control of a single gene by multiple, independent eQTLs. We integrate extensive data on cell type specific regulatory elements from ENCODE to identify general methods of transcription regulation through enrichment of eQTLs within regulatory elements. We also build a classifier to predict eQTL replication across cell types. The results in this paper present a path to an integrative, predictive approach to improve our ability to understand the mechanistic basis of human phenotypic variation.
Coronary heart disease (CHD) is the leading cause of mortality in both developed and developing countries worldwide. Genome-wide association studies (GWAS) have now identified 46 independent susceptibility loci for CHD, however, the biological and disease-relevant mechanisms for these associations remain elusive. The large-scale meta-analysis of GWAS recently identified in Caucasians a CHD-associated locus at chromosome 6q23.2, a region containing the transcription factor TCF21 gene. TCF21 (Capsulin/Pod1/Epicardin) is a member of the basic-helix-loop-helix (bHLH) transcription factor family, and regulates cell fate decisions and differentiation in the developing coronary vasculature. Herein, we characterize a cis-regulatory mechanism by which the lead polymorphism rs12190287 disrupts an atypical activator protein 1 (AP-1) element, as demonstrated by allele-specific transcriptional regulation, transcription factor binding, and chromatin organization, leading to altered TCF21 expression. Further, this element is shown to mediate signaling through platelet-derived growth factor receptor beta (PDGFR-β) and Wilms tumor 1 (WT1) pathways. A second disease allele identified in East Asians also appears to disrupt an AP-1-like element. Thus, both disease-related growth factor and embryonic signaling pathways may regulate CHD risk through two independent alleles at TCF21.
As much as half of the risk of developing coronary heart disease is genetically predetermined. Genome-wide association studies in human populations have now uncovered multiple sites of common genetic variation associated with heart disease. However, the biological mechanisms responsible for linking the disease associations with changes in gene expression are still underexplored. One of these variants occurs within the vascular developmental factor, TCF21, leading to dysregulated gene expression. Using various in silico and molecular approaches, we identify an intricate allele-specific regulatory mechanism underlying altered expression of TCF21. Notably, we observe that two apparently independent risk alleles identified in distinct populations function through a similar regulatory mechanism. Together these data suggest that conserved upstream pathways may organize the complex genetic etiology of coronary heart disease and potentially lead to new treatment opportunities.
Whole genome sequencing is poised to revolutionize personalized medicine, providing the capacity to classify individuals into risk categories for a wide range of diseases. Here we begin to explore how whole genome sequencing (WGS) might be incorporated alongside traditional clinical evaluation as a part of preventive medicine. The present study illustrates novel approaches for integrating genotypic and clinical information for assessment of generalized health risks and to assist individuals in the promotion of wellness and maintenance of good health.
Whole genome sequences and longitudinal clinical profiles are described for eight middle-aged Caucasian participants (four men and four women) from the Center for Health Discovery and Well Being (CHDWB) at Emory University in Atlanta. We report multivariate genotypic risk assessments derived from common variants reported by genome-wide association studies (GWAS), as well as clinical measures in the domains of immune, metabolic, cardiovascular, musculoskeletal, respiratory, and mental health.
Polygenic risk is assessed for each participant for over 100 diseases and reported relative to baseline population prevalence. Two approaches for combining clinical and genetic profiles for the purposes of health assessment are then presented. First we propose conditioning individual disease risk assessments on observed clinical status for type 2 diabetes, coronary artery disease, hypertriglyceridemia and hypertension, and obesity. An approximate 2:1 ratio of concordance between genetic prediction and observed sub-clinical disease is observed. Subsequently, we show how more holistic combination of genetic, clinical and family history data can be achieved by visualizing risk in eight sub-classes of disease. Having identified where their profiles are broadly concordant or discordant, an individual can focus on individual clinical results or genotypes as they develop personalized health action plans in consultation with a health partner or coach.
The CHDWB will facilitate longitudinal evaluation of wellness-focused medical care based on comprehensive self-knowledge of medical risks.
genetic prediction; risk assessment; preventive medicine; clinical profiling
Autism Spectrum Disorders (ASD) are highly heritable and characterised by impairments in social interaction and communication, and restricted and repetitive behaviours. Considering four sets of de novo copy number variants (CNVs) identified in 181 individuals with autism and exploiting mouse functional genomics and known protein-protein interactions, we identified a large and significantly interconnected interaction network. This network contains 187 genes affected by CNVs drawn from 45% of the patients we considered and 22 genes previously implicated in ASD, of which 192 form a single interconnected cluster. On average, those patients with copy number changed genes from this network possess changes in 3 network genes, suggesting that epistasis mediated through the network is extensive. Correspondingly, genes that are highly connected within the network, and thus whose copy number change is predicted by the network to be more phenotypically consequential, are significantly enriched among patients that possess only a single ASD-associated network copy number changed gene (p = 0.002). Strikingly, deleted or disrupted genes from the network are significantly enriched in GO-annotated positive regulators (2.3-fold enrichment, corrected p = 2×10−5), whereas duplicated genes are significantly enriched in GO-annotated negative regulators (2.2-fold enrichment, corrected p = 0.005). The direction of copy change is highly informative in the context of the network, providing the means through which perturbations arising from distinct deletions or duplications can yield a common outcome. These findings reveal an extensive ASD-associated molecular network, whose topology indicates ASD-relevant mutational deleteriousness and that mechanistically details how convergent aetiologies can result extensively from CNVs affecting pathways causally implicated in ASD.
Autism Spectrum Disorders (ASD) are characterised by impairments in social interaction and communication, and restricted and repetitive behaviours. ASD are highly heritable and many different stretches of DNA have been found to be duplicated or deleted in individuals with ASD. We found that an unusually high number of genes affected by these DNA deletions/duplications are associated with the functioning of synaptic transmission between nerve cells. The proteins made by many of these genes are known to interact with each other and, together with proteins from other deleted/duplicated genes, form a large interlinked biological network. This network was affected by almost 50% of the deletions/duplications in the ASD patients considered. Many individual ASD patients had deletions or duplications of multiple genes within this network, but for those patients with just a single gene from the network changed, that single gene appeared to play an important role. Furthermore, the network predicts that the effects arising from the genes in the deletions are similar to the effects arising from the genes in the duplications. Thus, the way that this ASD-associated network is wired together contributes to the understanding of the impact of these DNA deletions and duplications.
An under-appreciated aspect of the genetic analysis of gene expression is the impact of post-probe level normalization on biological inference. Here we contrast nine different methods for normalization of an Illumina bead-array gene expression profiling dataset consisting of peripheral blood samples from 189 individual participants in the Center for Health Discovery and Well Being study in Atlanta, quantifying differences in the inference of global variance components and covariance of gene expression, as well as the detection of variants that affect transcript abundance (eSNPs). The normalization strategies, all relative to raw log2 measures, include simple mean centering, two modes of transcript-level linear adjustment for technical factors, and for differential immune cell counts, variance normalization by interquartile range and by quantile, fitting the first 16 Principal Components, and supervised normalization using the SNM procedure with adjustment for cell counts. Robustness of genetic associations as a consequence of Pearson and Spearman rank correlation is also reported for each method, and it is shown that the normalization strategy has a far greater impact than correlation method. We describe similarities among methods, discuss the impact on biological interpretation, and make recommendations regarding appropriate strategies.
microarray analysis; normalization; variance component analysis; eSNP
There is increasing evidence that heritable variation in gene expression underlies genetic variation in susceptibility to disease. Therefore, a comprehensive understanding of the similarity between relatives for transcript variation is warranted—in particular, dissection of phenotypic variation into additive and non-additive genetic factors and shared environmental effects. We conducted a gene expression study in blood samples of 862 individuals from 312 nuclear families containing MZ or DZ twin pairs using both pedigree and genotype information. From a pedigree analysis we show that the vast majority of genetic variation across 17,994 probes is additive, although non-additive genetic variation is identified for 960 transcripts. For 180 of the 960 transcripts with non-additive genetic variation, we identify expression quantitative trait loci (eQTL) with dominance effects in a sample of 339 unrelated individuals and replicate 31% of these associations in an independent sample of 139 unrelated individuals. Over-dominance was detected and replicated for a trans association between rs12313805 and ETV6, located 4MB apart on chromosome 12. Surprisingly, only 17 probes exhibit significant levels of common environmental effects, suggesting that environmental and lifestyle factors common to a family do not affect expression variation for most transcripts, at least those measured in blood. Consistent with the genetic architecture of common diseases, gene expression is predominantly additive, but a minority of transcripts display non-additive effects.
Gene expression levels are known to influence common disease susceptibility in humans, with GWAS significant SNPs frequently found in regulatory regions. The expression levels of most genes are influenced by genetic variants, often located close to the gene itself. Expression Quantitative Trait Loci (eQTL) mapping studies have been very successful in identifying SNPs associated with expression levels; however, little is currently known about the extent of additive and non-additive genetic variance and the role of common environment on gene expression. Here we report a comprehensive study of the sources of genetic and non-genetic variation for gene expression levels using both pedigree and genotype information. We show that the majority of transcripts exhibit only additive genetic variance with congruence from independent methods using pedigree and genotype approaches. However, there are a small number of probes whose expression levels are influenced by non-additive genetic variance. For some of these probes we identify SNPs acting in a dominant and over-dominant manner that replicate in an independent sample. Surprisingly, only 17 probes exhibit significant levels of common environmental effects, suggesting that environmental and lifestyle factors common to a family do not affect expression variation for most transcripts, at least those measured in blood.
Mapping expression Quantitative Trait Loci (eQTLs) represents a powerful and widely adopted approach to identifying putative regulatory variants and linking them to specific genes. Up to now eQTL studies have been conducted in a relatively narrow range of tissues or cell types. However, understanding the biology of organismal phenotypes will involve understanding regulation in multiple tissues, and ongoing studies are collecting eQTL data in dozens of cell types. Here we present a statistical framework for powerfully detecting eQTLs in multiple tissues or cell types (or, more generally, multiple subgroups). The framework explicitly models the potential for each eQTL to be active in some tissues and inactive in others. By modeling the sharing of active eQTLs among tissues, this framework increases power to detect eQTLs that are present in more than one tissue compared with “tissue-by-tissue” analyses that examine each tissue separately. Conversely, by modeling the inactivity of eQTLs in some tissues, the framework allows the proportion of eQTLs shared across different tissues to be formally estimated as parameters of a model, addressing the difficulties of accounting for incomplete power when comparing overlaps of eQTLs identified by tissue-by-tissue analyses. Applying our framework to re-analyze data from transformed B cells, T cells, and fibroblasts, we find that it substantially increases power compared with tissue-by-tissue analysis, identifying 63% more genes with eQTLs (at FDR = 0.05). Further, the results suggest that, in contrast to previous analyses of the same data, the majority of eQTLs detectable in these data are shared among all three tissues.
Genetic variants that are associated with gene expression are known as expression Quantitative Trait Loci, or eQTLs. Many studies have been conducted to identify eQTLs, and they have proven an effective tool for identifying putative regulatory variants and linking them to specific genes. Up to now most studies have been conducted in a single tissue or cell type, but moving forward this is changing, and ongoing studies are collecting data aimed at mapping eQTLs in dozens of tissues. Current statistical methods are not able to fully exploit the richness of these kinds of data, taking account of both the sharing and differences in eQTLs among tissues. In this paper we develop a statistical framework to address this problem, to improve power to detect eQTLs when they are shared among multiple tissues, and to allow for differences among tissues to be estimated. Applying these methods to data from three tissues suggests that sharing of eQTLs among tissues may be substantially more common than it appeared in previous analyses of the same data.