As custom arrays are cheaper than generic GWAS arrays, larger sample sizes are achievable for gene discovery. Custom arrays can tag more variants through denser genotyping of SNPs at associated loci, but at the cost of losing genome-wide coverage. Balancing this trade-off is important for optimizing experimental designs. We quantified both the gain in captured SNP-heritability at known candidate regions and the loss due to imperfect genome-wide coverage for inflammatory bowel disease using immunochip (iChip) and imputed GWAS data on 61 251 and 38 550 samples, respectively. For Crohn's disease (CD), the iChip and GWAS data explained 19 and 26% of variation in liability, respectively, and SNPs in the densely genotyped iChip regions explained 13% of the SNP-heritability for both the iChip and GWAS data. For ulcerative colitis (UC), the iChip and GWAS data explained 15 and 19% of variation in liability, respectively, and the dense iChip regions explained 10 and 9% of the SNP-heritability in the iChip and the GWAS data. From bivariate analyses, estimates of the genetic correlation in risk between CD and UC were 0.75 (SE 0.017) and 0.62 (SE 0.042) for the iChip and GWAS data, respectively. We also quantified the SNP-heritability of genomic regions that did or did not contain the previous 163 GWAS hits for CD and UC, and SNP-heritability of the overlapping loci between the densely genotyped iChip regions and the 163 GWAS hits. For both diseases, over different genomic partitioning, the densely genotyped regions on the iChip tagged at least as much variation in liability as the corresponding regions in the GWAS data; however, a certain amount of tagged SNP-heritability in the GWAS data was lost using the iChip due to the low coverage at unselected regions. These results imply that custom arrays with a GWAS backbone will facilitate more gene discovery, both at associated and novel loci.
We expand our previous deterministic power calculations by calculating the required sample size to detect C in ACE models. The theoretical expected value of the maximum log-likelihood for the AE model was derived using two optimisation methods and these gave near-identical results. Theoretical predictions were verified by computer simulation and the results agreed very well. We have developed a user-friendly web-based tool, TwinPower, to perform power calculations to detect either A or C for the classical twin design. This new tool can be found at http://genepi.qimr.edu.au/cgi-bin/twinpower.cgi
Research on genetic influences on human fertility outcomes such as number of children ever born (NEB) or the age at first childbirth (AFB) has been based solely on twin and family designs that suffer from problematic assumptions and practical limitations. The current study exploits recent advances in the field of molecular genetics by applying the genomic-relationship-matrix based restricted maximum likelihood (GREML) methods to quantify for the first time the extent to which common genetic variants influence the NEB and the AFB of women. Using data from the UK and the Netherlands (N = 6,758), results show significant additive genetic effects on both traits explaining 10% (SE = 5) of the variance in the NEB and 15% (SE = 4) in the AFB. We further find a significant negative genetic correlation between AFB and NEB in the pooled sample of –0.62 (SE = 0.27, p-value = 0.02). This finding implies that individuals with genetic predispositions for an earlier AFB had a reproductive advantage and that natural selection operated not only in historical, but also in contemporary populations. The observed postponement in the AFB across the past century in Europe contrasts with these findings, suggesting an evolutionary override by environmental effects and underscoring that evolutionary predictions in modern human societies are not straightforward. It emphasizes the necessity for an integrative research design from the fields of genetics and social sciences in order to understand and predict fertility outcomes. Finally, our results suggest that we may be able to find genetic variants associated with human fertility when conducting GWAS meta-analyses with sufficient sample size.
Variation in body iron is associated with or causes diseases, including anaemia and iron overload. Here we analyse genetic association data on biochemical markers of iron status from eleven European-population studies, with replication in eight additional cohorts (total up to 48,972 subjects). We find eleven genome-wide-significant (p < 5 × 10⁻⁸) loci, some including known iron-related genes (HFE, SLC40A1, TF, TFR2, TFRC, TMPRSS6) and others novel (ABO, ARNTL, FADS2, NAT2, TEX14). SNPs at ARNTL, TF, and TFR2 affect iron markers in HFE C282Y homozygotes at risk for hemochromatosis. There is substantial overlap between our iron loci and loci affecting erythrocyte and lipid phenotypes. These results will facilitate investigation of the roles of iron in disease.
Gene discovery, estimation of heritability captured by SNP arrays, inference on genetic architecture and prediction analyses of complex traits are usually performed using different statistical models and methods, leading to inefficiency and loss of power. Here we use a Bayesian mixture model that simultaneously allows variant discovery, estimation of genetic variance explained by all variants and prediction of unobserved phenotypes in new samples. We apply the method to simulated data of quantitative traits and Wellcome Trust Case Control Consortium (WTCCC) data on disease and show that it provides accurate estimates of SNP-based heritability, produces unbiased estimators of risk in new samples, and that it can estimate genetic architecture by partitioning variation across hundreds to thousands of SNPs. We estimated that, depending on the trait, 2,633 to 9,411 SNPs explain all of the SNP-based heritability in the WTCCC diseases. The majority of those SNPs (>96%) had small effects, confirming a substantial polygenic component to common diseases. The proportion of the SNP-based variance explained by large effects (each SNP explaining 1% of the variance) varied markedly between diseases, ranging from almost zero for bipolar disorder to 72% for type 1 diabetes. Prediction analyses demonstrate that for diseases with major loci, such as type 1 diabetes and rheumatoid arthritis, Bayesian methods outperform profile scoring or mixed model approaches.
Most genome-wide association studies performed to date have focused on testing individual genetic markers for associations with phenotype. Recently, methods that analyse the joint effects of multiple markers on genetic variation have provided further insights into the genetic basis of complex human traits. In addition, there is increasing interest in using genotype data for genetic risk prediction of disease. Often disparate analytical methods are used for each of these tasks. We propose a flexible novel approach that simultaneously performs identification of susceptibility loci, inference on the genetic architecture and provides polygenic risk prediction in the same statistical model. We illustrate the broad applicability of the approach by considering both simulated and real data. In the analysis of seven common diseases we show large differences in the proportion of genetic variation due to loci with different effect sizes and differences in prediction accuracy between complex traits. These findings are important for future studies and the understanding of the complex genetic architecture of common diseases.
Genome-wide association analysis on monozygotic twin pairs offers a route to discovery of gene–environment interactions through testing for variability loci associated with sensitivity to individual environment/lifestyle. We present a genome-wide scan of loci associated with intra-pair differences in serum lipid and apolipoprotein levels. We report data for 1,720 monozygotic female twin pairs from the GenomEUtwin project with 2.5 million SNPs, imputed or genotyped, and measured serum lipid fractions for both twins. We found one locus associated with intra-pair differences in high density lipoprotein (HDL) cholesterol, rs2483058 in an intron of SRGAP2, where twins carrying the C allele are more sensitive to environmental factors (p = 3.98 × 10⁻⁸). We followed up the association in further genotyped monozygotic twins (N = 1,261), which showed a moderate association for the variant (p = 0.002, same direction of effect). In addition, we report a new association on the level of apolipoprotein A-II (p = 4.03 × 10⁻⁸).
twins; association; lipids; apolipoproteins; interaction
DNA methylation levels change with age. Recent studies have identified biomarkers of chronological age based on DNA methylation levels. It is not yet known whether DNA methylation age captures aspects of biological age.
Here we test whether the difference between a person’s DNA methylation age (their estimated age) and their chronological age predicts all-cause mortality in later life. This difference (Δage) was calculated in four longitudinal cohorts of older people. Meta-analysis of proportional hazards models from the four cohorts was used to determine the association between Δage and mortality. A 5-year higher Δage is associated with a 21% higher mortality risk, adjusting for age and sex. After further adjustments for childhood IQ, education, social class, hypertension, diabetes, cardiovascular disease, and APOE e4 status, there is a 16% increased mortality risk for those with a 5-year higher Δage. A pedigree-based heritability analysis of Δage was conducted in a separate cohort. The heritability of Δage was 0.43.
DNA methylation-derived measures of accelerated aging are heritable traits that predict mortality independently of health status, lifestyle factors, and known genetic factors.
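The per-5-year risk figures quoted above can be re-expressed on other scales. The following is a minimal sketch, assuming the log-hazard is linear in Δage (as in a standard proportional hazards model with Δage as a continuous covariate); the function name `rescale_hazard_ratio` is hypothetical and not taken from the study.

```python
import math


def rescale_hazard_ratio(hr, from_units, to_units):
    """Convert a hazard ratio quoted per `from_units` of a covariate
    into the hazard ratio per `to_units`.

    Assumption: the log-hazard is linear in the covariate, so hazard
    ratios multiply across equal increments.
    """
    return math.exp(math.log(hr) * to_units / from_units)


# Values taken from the abstract: 21% (basic) and 16% (fully adjusted)
# higher mortality risk per 5-year higher delta-age.
hr_basic_5yr = 1.21
hr_adjusted_5yr = 1.16

hr_basic_1yr = rescale_hazard_ratio(hr_basic_5yr, 5, 1)
hr_basic_10yr = rescale_hazard_ratio(hr_basic_5yr, 5, 10)
hr_adjusted_1yr = rescale_hazard_ratio(hr_adjusted_5yr, 5, 1)

print(f"Basic model, per 1 year:    HR = {hr_basic_1yr:.3f}")
print(f"Basic model, per 10 years:  HR = {hr_basic_10yr:.3f}")
print(f"Adjusted model, per 1 year: HR = {hr_adjusted_1yr:.3f}")
```

Under this linearity assumption, a 21% higher risk per 5 years corresponds to a little under 4% per year of Δage, not 21/5 = 4.2%, because the effect compounds multiplicatively.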
Electronic supplementary material
The online version of this article (doi:10.1186/s13059-015-0584-6) contains supplementary material, which is available to authorized users.
Schizophrenia is a highly heritable disorder. Genetic risk is conferred by a large number of alleles, including common alleles of small effect that might be detected by genome-wide association studies. Here, we report a multi-stage schizophrenia genome-wide association study of up to 36,989 cases and 113,075 controls. We identify 128 independent associations spanning 108 conservatively defined loci that meet genome-wide significance, 83 of which have not been previously reported. Associations were enriched among genes expressed in brain providing biological plausibility for the findings. Many findings have the potential to provide entirely novel insights into aetiology, but associations at DRD2 and multiple genes involved in glutamatergic neurotransmission highlight molecules of known and potential therapeutic relevance to schizophrenia, and are consistent with leading pathophysiological hypotheses. Independent of genes expressed in brain, associations were enriched among genes expressed in tissues that play important roles in immunity, providing support for the hypothesized link between the immune system and schizophrenia.
A balanced t(1;11) translocation which transects the Disrupted in schizophrenia 1 (DISC1) gene shows genome-wide significant linkage for schizophrenia and recurrent major depressive disorder in a single large Scottish family, but genome-wide and exome sequencing-based association studies have not supported a role for DISC1 in psychiatric illness. To explore DISC1 in more detail, we sequenced 528 kb of the DISC1 locus in 653 cases and 889 controls. We report 2,718 validated single nucleotide polymorphisms of which 2,010 have a minor allele frequency of less than 1%. Only 38% of these variants are reported in the 1000 Genomes Project European subset. This suggests that many DISC1 SNPs remain undiscovered and are essentially private. Rare coding variants identified exclusively in patients were found in likely functional protein domains. Significant region-wide association was observed between rs16856199 and recurrent major depressive disorder (P = 0.026, unadjusted P = 6.3 × 10⁻⁵, OR = 3.48). This was not replicated in additional recurrent major depression samples (replication P = 0.11). Combined analysis of both the original and replication set supported the original association (P = 0.0058, OR = 1.46). Evidence for segregation of this variant with disease in families was limited to those of rMDD individuals referred from primary care. Burden analysis for coding and non-coding variants gave nominal associations with diagnosis and measures of mood and cognition. Together, these observations are likely to generalise to other candidate genes for major mental illness and may thus provide guidelines for the design of future studies.
The main genetic determinant of soluble IL-6R levels is the missense variant rs2228145, which maps to the cleavage site of IL-6R. For each Ala allele, sIL-6R serum levels increase by ~20 ng/ml and asthma risk by 1.09-fold. However, this variant does not explain the total heritability for sIL-6R levels. Additional independent variants in IL6R may therefore contribute to variation in sIL-6R levels and influence asthma risk. We imputed 471 variants in IL6R and tested these for association with sIL-6R serum levels in 360 individuals. An intronic variant (rs12083537) was associated with sIL-6R levels independently of rs4129267 (P = 0.0005), a proxy SNP for rs2228145. A significant and consistent association for rs12083537 was observed in a replication panel of 354 individuals (P = 0.033). Each rs12083537:A allele increased sIL-6R serum levels by 2.4 ng/ml. Analysis of mRNA levels in two cohorts did not identify significant associations between rs12083537 and IL6R transcription levels. On the other hand, results from 16 705 asthmatics and 30 809 controls showed that the rs12083537:A allele increased asthma risk by 1.04-fold (P = 0.0419). Genetic risk scores based on IL6R regulatory variants may prove useful in explaining variation in clinical response to tocilizumab, an anti-IL-6R monoclonal antibody.
allergy; eQTL; expression; disease
Heritability is a population parameter of importance in evolution, plant and animal breeding, and human medical genetics. It can be estimated using pedigree designs and, more recently, using relationships estimated from markers. We derive the sampling variance of the estimate of heritability for a wide range of experimental designs, assuming that estimation is by maximum likelihood and that the resemblance between relatives is solely due to additive genetic variation. We show that well-known results for balanced designs are special cases of a more general unified framework. For pedigree designs, the sampling variance is inversely proportional to the variance of relationship in the pedigree and it is proportional to 1/N, whereas for population samples it is approximately proportional to 1/N², where N is the sample size. Variation in relatedness is a key parameter in the quantification of the sampling variance of heritability. Consequently, the sampling variance is high for populations with large recent effective population size (e.g., humans) because this causes low variation in relationship. However, even using human population samples, low sampling variance is possible with high N.
experimental design; genomic relationship; heritability; maximum likelihood; sampling variance
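The 1/N² scaling for population samples described above can be illustrated numerically. This is a minimal sketch, not the derivation from the paper: it assumes the approximation var(ĥ²) ≈ 2 / (N² · var(r)), where var(r) is the variance of the off-diagonal genomic relationships, which is consistent with the stated scaling but is not given in the abstract. The value var(r) = 2 × 10⁻⁵ is an illustrative figure for nominally unrelated humans, and the function name is hypothetical.

```python
import math


def se_h2_population(n, var_rel):
    """Approximate standard error of a heritability estimate from a
    population sample of size `n`.

    Assumed approximation (not stated in the abstract):
        var(h2_hat) ~= 2 / (n^2 * var_rel)
    where `var_rel` is the variance of pairwise genomic relationships.
    """
    return math.sqrt(2.0 / (n ** 2 * var_rel))


# Illustrative assumption: low variation in relatedness, as expected
# for a population with large recent effective population size.
VAR_REL = 2e-5

for n in (2_500, 5_000, 10_000, 20_000):
    print(f"N = {n:>6}: SE(h2) ~= {se_h2_population(n, VAR_REL):.4f}")
```

Because the variance scales with 1/N², the standard error scales with 1/N: doubling the sample size halves the standard error, whereas under the 1/N variance scaling of pedigree designs it would shrink only by a factor of √2.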
Epistasis is the phenomenon whereby one polymorphism’s effect on a trait depends on other polymorphisms present in the genome. The extent to which epistasis influences complex traits [1] and contributes to their variation [2,3] is a fundamental question in evolution and human genetics. Though often demonstrated in artificial gene manipulation studies in model organisms [4,5], and some examples have been reported in other species [6], few examples exist for epistasis amongst natural polymorphisms in human traits [7,8]. Its absence from empirical findings may simply be due to low incidence in the genetic control of complex traits [2,3], but an alternative view is that it has previously been too technically challenging to detect due to statistical and computational issues [9]. Here we show that, using advanced computation [10] and a gene expression study design, many instances of epistasis are found between common single nucleotide polymorphisms (SNPs). In a cohort of 846 individuals with 7339 gene expression levels measured in peripheral blood, we found 501 significant pairwise interactions between common SNPs influencing the expression of 238 genes (p < 2.91 × 10⁻¹⁶). Replication of these interactions in two independent data sets [11,12] showed both concordance of direction of epistatic effects (p = 5.56 × 10⁻³¹) and enrichment of interaction p-values, with 30 being significant at a conservative threshold of p < 0.05/501. Forty-four of the genetic interactions are located within 2 Mb of regions of known physical chromosome interactions [13] (p = 1.8 × 10⁻¹⁰). Epistatic networks of three SNPs or more influence the expression levels of 129 genes, whereby one cis-acting SNP is modulated by several trans-acting SNPs. For example MBNL1 is influenced by an additive effect at rs13069559 which itself is masked by trans-SNPs on 14 different chromosomes, with nearly identical genotype-phenotype (GP) maps for each cis-trans interaction.
This study presents the first evidence for multiple instances of segregating common polymorphisms interacting to influence human traits.
In Mendelian randomization (MR) studies, where genetic variants are used as proxy measures for an exposure trait of interest, obtaining adequate statistical power is frequently a concern due to the small amount of variation in a phenotypic trait that is typically explained by genetic variants. A range of power estimates based on simulations and specific parameters for two-stage least squares (2SLS) MR analyses based on continuous variables has previously been published. However, there are presently no specific equations or software tools one can implement for calculating power of a given MR study. Using asymptotic theory, we show that in the case of continuous variables and a single instrument, for example a single-nucleotide polymorphism (SNP) or multiple-SNP predictor, statistical power for a fixed sample size is a function of two parameters: the proportion of variation in the exposure variable explained by the genetic predictor and the true causal association between the exposure and outcome variable. We demonstrate that power for 2SLS MR can be derived using the non-centrality parameter (NCP) of the statistical test that is employed to test whether the 2SLS regression coefficient is zero. We show that the previously published power estimates from simulations can be represented theoretically using this NCP-based approach, with similar estimates observed when the simulation-based estimates are compared with our NCP-based approach. General equations for calculating statistical power for 2SLS MR using the NCP are provided in this note, and we implement the calculations in a web-based application.
Power; Mendelian randomization; non-centrality parameter; instrumental variable
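The NCP-based power calculation described above can be sketched as follows. This is an illustration under stated assumptions, not the equations from the note itself: it uses the standard asymptotic variance of the 2SLS estimator with a single instrument, var(β̂) ≈ σ²ᵤ / (N · σ²ₓ · R²), where R² is the proportion of exposure variance explained by the genetic predictor and σ²ᵤ is the residual variance of the outcome given the exposure. The function name, parameter names and default variances of 1 are hypothetical.

```python
from statistics import NormalDist


def mr_2sls_power(n, r2_xz, beta_yx, var_x=1.0, var_resid=1.0, alpha=0.05):
    """Approximate power of a two-sided test of beta = 0 in 2SLS MR
    with one instrument and continuous exposure and outcome.

    Assumption: the test statistic is chi-square with 1 df and
    non-centrality parameter
        NCP = n * r2_xz * beta_yx^2 * var_x / var_resid.
    A 1-df non-central chi-square is the square of a normal variable
    with mean sqrt(NCP), so power can be computed from normal tails.
    """
    ncp = n * r2_xz * beta_yx ** 2 * var_x / var_resid
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = ncp ** 0.5
    # P(|Z + shift| > z_crit) for Z ~ N(0, 1)
    power = (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)
    return ncp, power


# Illustrative inputs: the instrument explains 1% of exposure variance
# and the true causal effect is 0.2 on standardised scales.
for n in (5_000, 10_000, 20_000, 40_000):
    ncp, power = mr_2sls_power(n, r2_xz=0.01, beta_yx=0.2)
    print(f"N = {n:>6}: NCP = {ncp:5.1f}, power = {power:.3f}")
```

Using only the standard library keeps the sketch self-contained. With both variances fixed at 1, the NCP depends on the design only through N, R² and the causal effect, which mirrors the two-parameter description in the abstract for a fixed sample size.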
genome-wide association study; epidemiology; Mendelian randomization; interaction; polygene score
A major challenge in human genetics is to devise a systematic strategy to integrate disease-associated variants with diverse genomic and biological datasets to provide insight into disease pathogenesis and guide drug discovery for complex traits such as rheumatoid arthritis (RA) [1]. Here, we performed a genome-wide association study (GWAS) meta-analysis in a total of >100,000 subjects of European and Asian ancestries (29,880 RA cases and 73,758 controls), by evaluating ~10 million single nucleotide polymorphisms (SNPs). We discovered 42 novel RA risk loci at a genome-wide level of significance, bringing the total to 101 [2–4]. We devised an in silico pipeline using established bioinformatics methods based on functional annotation [5], cis-acting expression quantitative trait loci (cis-eQTL) [6], and pathway analyses [7–9] – as well as novel methods based on genetic overlap with human primary immunodeficiency (PID), hematological cancer somatic mutations and knock-out mouse phenotypes – to identify 98 biological candidate genes at these 101 risk loci. We demonstrate that these genes are the targets of approved therapies for RA, and further suggest that drugs approved for other indications may be repurposed for the treatment of RA. Together, this comprehensive genetic study sheds light on fundamental genes, pathways and cell types that contribute to RA pathogenesis, and provides empirical evidence that the genetics of RA can provide important information for drug discovery.
Mixed linear models are emerging as a method of choice for conducting genetic association studies in humans and other organisms. The advantages of mixed linear model association (MLMA) include preventing false-positive associations due to population or relatedness structure, and increasing power by applying a correction that is specific to this structure. An underappreciated point is that MLMA can also increase power in studies without sample structure, by implicitly conditioning on associated loci other than the candidate locus. Numerous variations on the standard MLMA approach have recently been published, with a focus on reducing computational cost. These advances provide researchers applying MLMA methods with many options to choose from, but we caution that MLMA methods are still subject to potential pitfalls. Here, we describe and quantify the advantages and pitfalls of MLMA methods as a function of study design, and provide recommendations for the application of these methods in practical settings.
Family studies are consistent with genetic effects making substantial contributions to risk of psychiatric disorders such as schizophrenia, yet robust identification of specific genetic variants that explain variation in population risk had been disappointing until the advent of technologies that assay the entire genome in large samples. We highlight recent progress that has led to a better understanding of the number of risk variants in the population and the interaction of allele frequency and effect size. The emerging genetic architecture implies a large number of contributing loci (that is, a high genome-wide mutational target) and suggests that genetic risk to psychiatric disorders involves the combined effects of many common variants of small effect, as well as rare and de novo variants of large effect. The capture of a substantial proportion of genetic risk facilitates new study designs to investigate the combined effects of genes and the environment.
The success of genome-wide association studies has led to increasing interest in making predictions of complex trait phenotypes including disease from genotype data. Rigorous assessment of the value of predictors is critical before implementation. Here we discuss some of the limitations and pitfalls of prediction analysis and show how naïve implementations can lead to severe bias and misinterpretation of results.
Despite the important role DNA methylation plays in transcriptional regulation, the transgenerational inheritance of DNA methylation is not well understood. The genetic heritability of DNA methylation has been estimated using twin pairs, although concern has been expressed about whether the underlying assumption of equal common environmental effects is applicable, given intrauterine differences between monozygotic and dizygotic twins. We estimate the heritability of DNA methylation in peripheral blood leukocytes, measured with the Illumina HumanMethylation450 array, in a family-based sample of 614 people from 117 families, allowing comparison both within and across generations.
The correlations from the various available relative pairs indicate that on average the similarity in DNA methylation between relatives is predominantly due to genetic effects, with any common environmental or zygotic effects being limited. The average heritability of DNA methylation measured at probes with no known SNPs is estimated as 0.187. The ten most heritable methylation probes were investigated with a genome-wide association study, all showing highly statistically significant cis mQTLs. Further investigation of one of these cis mQTLs, found in the MHC region of chromosome 6, showed that the most significantly associated SNP was also associated with over 200 other DNA methylation probes in this region and with the expression levels of 9 genes.
The majority of transgenerational similarity in DNA methylation is attributable to genetic effects, and approximately 20% of individual differences in DNA methylation in the population are caused by DNA sequence variation that is not located within CpG sites.
Understanding genetic variation of complex traits in human populations has moved from the quantification of the resemblance between close relatives to the dissection of genetic variation into the contributions of individual genomic loci. But major questions remain unanswered: how much phenotypic variation is genetic, how much of the genetic variation is additive and what is the joint distribution of effect size and allele frequency at causal variants? We review and compare three whole-genome analysis methods that use mixed linear models (MLM) to estimate genetic variation, using the relationship between close or distant relatives based on pedigree or SNPs. We discuss theory, estimation procedures, bias and precision of each method and review recent advances in the dissection of additive genetic variation of complex traits in human populations that are based upon the application of MLM. Using genome wide data, SNPs account for far more of the genetic variation than the highly significant SNPs associated with a trait, but they do not account for all of the genetic variance estimated by pedigree based methods. We explain possible reasons for this ‘missing’ heritability.
Quantitative traits; whole genome methods; additive variance; genomic relationship; mixed linear model; genetic architecture
Education, socioeconomic status, and intelligence are commonly used as predictors of health outcomes, social environment, and mortality. Education and socioeconomic status are typically viewed as environmental variables although both correlate with intelligence, which has a substantial genetic basis. Using data from 6815 unrelated subjects from the Generation Scotland study, we examined the genetic contributions to these variables and their genetic correlations. Subjects underwent genome-wide testing for common single nucleotide polymorphisms (SNPs). DNA-derived heritability estimates and genetic correlations were calculated using the ‘Genome-wide Complex Trait Analyses’ (GCTA) procedures. 21% of the variation in education, 18% of the variation in socioeconomic status, and 29% of the variation in general cognitive ability was explained by variation in common SNPs (SEs ~ 5%). The SNP-based genetic correlations of education and socioeconomic status with general intelligence were 0.95 (SE 0.13) and 0.26 (0.16), respectively. There are genetic contributions to intelligence and education with near-complete overlap between common additive SNP effects on these traits (genetic correlation ~ 1). Genetic influences on socioeconomic status are also associated with the genetic foundations of intelligence. The results are also compatible with substantial environmental contributions to socioeconomic status.
• Generation Scotland is a large family-based cohort of ~ 24,000 people.
• We investigate the genetic influences on education, SES, and intelligence.
• Both DNA-based (subset of ~ 6500) and pedigree-based analyses are used.
• Genetic effects on SES and education are linked to the genetic basis of intelligence.
• There are also substantial environmental effects on all three traits.
Generation Scotland; Intelligence; Education; Socioeconomic status; Genetics