|Home | About | Journals | Submit | Contact Us | Français|
The study was designed by L.P., N.B.F., C.S., M.I.M., M.J.D., P.E., M.-R.J. and S.K.S. Information on traits was collected and maintained by M.-R.J., A.-L.H., A.P., A.R. and J.L. Genotyping was supervised by S.G. Database support was provided by H.T., U.S. and M.K. Statistical analysis was performed by C.S., S.K.S., A.C., C.G.J., J.B., N.A.Z., and M.J.D. S.R., E.J. and T.V. contributed to the analysis of the Finnish genetic signature. L.C., C.H. and P.E. participated in discussion of results. The manuscript was written by C.S., S.K.S., N.B.F. and L.P. All authors reviewed the manuscript.
Genome-wide association studies (GWAS) of longitudinal birth cohorts enable joint investigation of environmental and genetic influences on complex traits. We report GWAS results for nine quantitative metabolic traits (triglycerides, high-density lipoprotein, low-density lipoprotein, glucose, insulin, C-reactive protein, body mass index, and systolic and diastolic blood pressure) in the Northern Finland Birth Cohort 1966 (NFBC1966), drawn from the most genetically isolated Finnish regions. We replicate most previously reported associations for these traits and identify nine new associations, several of which highlight genes with metabolic functions: high-density lipoprotein with NR1H3 (LXRA), low-density lipoprotein with AR and FADS1-FADS2, glucose with MTNR1B, and insulin with PANK1. Two of these new associations emerged after adjustment of results for body mass index. Gene-environment interaction analyses suggested additional associations, which will require validation in larger samples. The currently identified loci, together with quantified environmental exposures, explain little of the trait variation in NFBC1966. The association observed between low-density lipoprotein and an infrequent variant in AR suggests the potential of such a cohort for identifying associations with both common, low-impact and rarer, high-impact quantitative trait loci.
Genetic variation and environmental exposures together determine the distribution of common diseases and disease-related traits within a population. Genome-wide association studies of cohort samples offer an opportunity to dissect the complex etiology and epidemiology of such traits. We report here a GWAS of metabolic traits in the Northern Finland Birth Cohort 1966 (NFBC1966)1,2, a sample that enrolled almost all individuals born in 1966 in the two northernmost Finnish provinces.
Several features advantageous for GWAS characterize NFBC1966: (i) participants derive from a genetic isolate that is relatively homogeneous in genetic background and environmental exposures, and that has more extensive linkage disequilibrium (LD) than most other populations3,4; (ii) cohort ascertainment reflects the composition of the population from which individuals are drawn, facilitating evaluation of effect sizes of identified genetic variants and comparison with environmental effects5; (iii) such ascertainment avoids selection and survival biases characteristic of case-control studies5; (iv) participants were born in the same year, eliminating age as a potential confounder and minimizing the effects of interindividual variability due to secular changes in environmental exposures; (v) longitudinal data collected over several decades 2,6-9 facilitate investigation of relationships between genetic variation, early life events and progression of disease risks.
The NFBC1966 study design enables GWAS of diverse traits. We focus here on nine quantitative traits that are heritable10-13 risk factors for cardiovascular disease (CVD) or type 2 diabetes (T2D): body mass index (BMI), fasting serum concentrations of lipids (triglycerides (TG), high-density lipoproteins (HDL) and low-density lipoproteins (LDL)), indicators of glucose homeostasis (glucose (GLU), and insulin (INS)) and inflammation (CRP), and systolic (SBP) and diastolic (DBP) blood pressure. Extreme values of these traits, in combination, identify a ‘metabolic syndrome’, hypothesized to increase risks for both CVD and T2D14,15.
Although environmental exposures such as diet influence these traits, the magnitude of their effects is controversial, largely because of uncertainty regarding critical time points for measuring exposures. For example, few datasets contain the longitudinal assessments needed to test the hypothesis that early life growth patterns may, through gene-environment interactions, affect values, in adulthood, of the traits comprising metabolic syndrome. NFBC1966 not only provides an ideal dataset for investigating interactions between genetic and environmental variables contributing to metabolic syndrome and its trait components, it offers the opportunity to prospectively investigate such interactions as participants attain ages characterized by high prevalence of CVD and T2D.
Prior GWAS results from case-control samples for these traits16-22 can be compared with findings from NFBC1966, in relation to differences in study design and sample composition. Specifically, NFBC1966 affords the opportunity for GWA analysis of the components of metabolic syndrome in a single cohort of young adults. We replicate most previously reported findings for these traits and report several newly identified associations, most of which are located in or near genes with postulated roles in metabolism. Our results suggest the value of cohorts such as NFBC1966 for elucidating genetic and environmental influences on quantitative traits.
After applying exclusion criteria and quality control procedures (see Methods), we obtained Illumina Infinium genotypes on 329,091 SNPs in 4,763 individuals (for further information on the final sample, see Supplementary Note and Supplementary Table 1 online).
On the basis of literature results and available measurements in NFBC1966 and their patterns of missing values, we focused on six covariates that influence quantitative metabolic traits: alcohol use, smoking, BMI, sex, oral contraceptive use and pregnancy status. All were assessed contemporaneously with and highly associated to the nine traits (Table 1)2,23. Given current interest in the relationship between fetal growth and risk for CVD and T2D24, and as previous investigations in NFBC1966 and other Finnish cohorts have identified such associations2,25,26, we also considered effects of gestational age and birth BMI on the nine traits (see Methods, Table 1 and Supplementary Table 2 online).
Recent studies using genome-wide SNPs have identified fine-scale population substructure that could affect association analyses27. We investigated the presence of such substructure in NFBC1966 by examining patterns of genetic similarity between different individuals (see Methods).
This analysis failed to identify the clearly separated clusters expected with strong subpopulation structure, but the genetic similarity between individuals correlates well with geographical and linguistic subdivisions of Northern Finland (Fig. 1, Supplementary Note and Supplementary Fig. 1 online). The information plotted in Figure 1 provides a measure of population of origin, which is available for the entire sample and which we use to correct our association analyses.
We analyzed the association between each trait and genomic locations using multiple approaches, with the aim of identifying genetic signals distinct from the effects of different covariates, correcting for possible population stratification, maximizing genomic coverage and investigating gene-environment interactions. Our primary genome-wide analysis tested the additive effect of each SNP on the traits, adjusted for sex, use of oral contraceptives and pregnancy (see Methods).
The quantile-quantile plot (Fig. 2) of the tails of the distribution of observed P values (on the log10 scale) from the GWA tests departed from the expected quantiles only at the extreme, suggesting that population structure does not substantially affect the results. Indeed, the estimated genomic control parameter28 for each trait only modestly deviated from the null value of 1.0 (TG: 1.0203, HDL: 1.0419, LDL: 1.077, CRP: 1.009, GLU: 1.040, INS: 1.019, BMI: 1.031, SBP: 1.056, DBP: 1.020). Furthermore, association tests adjusted for population structure produced markedly similar P values (Table 2). An analysis of the entire P-value distribution allowed us to reject the global null of no association for most of the nine traits (see Methods, Supplementary Note and Supplementary Fig. 2 online).
Identification of specific loci associated with each of nine traits requires adoption of a criterion to adjust for multiple comparisons, with a resulting significance threshold. Controlling false discovery rate (FDR) across the nine traits maximizes the power to detect relevant loci with our relatively small sample by allowing a small percentage of false findings. Our rationale, based on an FDR procedure, for adopting a threshold of 5 × 10−7 is detailed in the Methods (Supplementary Note and Supplementary Fig. 3 online).
We identified in our main analysis (Table 2 and Fig. 3) 21 associations with a P value <5 × 10−7 to one or more traits (four for TG, five for HDL, six for LDL, three for CRP and three for GLU), in 20 genome regions; one region, APOB, shows association to both TG and LDL. In 14 of these regions, previous studies have identified association at P <5 × 10−7, whereas seven represent new observations. Genome-wide analysis adjusting for BMI identified two additional regions, one for LDL and one for INS (Table 2, in italics), for a total of 23 association findings.
Imputation analysis (see Methods) identified additional associated SNPs in most regions highlighted by directly genotyped SNPs (Supplementary Fig. 4 online) and, in several cases, provided significant evidence for replication of known loci for which genotyped SNPs had shown less significant results. This analysis, however, did not uncover additional new regions (ignoring imputation signals found in the absence of even nominal association to genotyped SNPs, see Supplementary Note).
The study design offered the opportunity to test whether these association results depended on either epidemiologic variables known to affect these traits or the geographic origin of NFBC1966 participants. All locations in Table 2 continued to show association when we used multivariate regression, which modeled the trait of interest as a linear function of all relevant epidemiological and geographical variables (Supplementary Table 3 online).
For most associations in Table 2, multiple SNPs provided association evidence. We identified genomic ‘boundaries’ for each finding, using an LD map29 to evaluate whether associations with multiple SNPs within a region likely represent distinct findings. Figure 4 presents, for the nine newly identified loci, information describing the extent of the signal and its relation to local LD (in LD-map distances surrounding the SNP with the smallest P value) as well as its position relative to genes (for analogous information for previously reported associations, see Supplementary Fig. 5 online).
For most of the newly identified loci, the effect size of the individual associations is similar in magnitude to those observed in other GWAS of these traits (Tables 2 2 and and33 and Supplementary Fig. 6 online). An exception is the association observed between LDL and a variant in AR on chromosome X, which has a much larger effect size than other associations observed in NFBC1966 or in previous GWAS of lipid traits17,21.
Previous GWAS found 37 loci (represented by 42 SNPs) associated with six of the traits (Table 3). In most cases the causal variant remains unidentified, and several of these SNPs are located in genes known to have substantial allelic heterogeneity30. We therefore evaluated the evidence for replication in two ways, assessing the association signal for the specific SNP in question as well as for the region it implicates.
To investigate SNP replication (Table 3), we test for association with the same SNP (genotyped or imputed) or to a proxy SNP (r2 ≥0.8 with the originally reported SNP) (see Methods). SNP imputations as well as identification of the best proxy SNP rely on correlations estimated in the HapMap CEU sample, which may in some instances be imprecise. Results of this analysis, including comparisons of effect sizes, are in the first columns of Table 3 and summarized below by trait (Supplementary Note). In all instances, when the same SNP or a perfect proxy was analyzed in NFBC1966, the direction of effect was the same as the reported direction, and in most cases the effect sizes were of a similar magnitude.
We defined gene regions on the basis of the LD pattern around the previously reported SNP (a window of 4 LD-map units, see Methods). We tested for association in these windows using the strongest association signal among the SNPs genotyped in NFBC1966 within the window, evaluating its significance through permutations (see Methods; last two columns of Table 3). In several cases where SNP replication was equivocal, we obtained strong evidence of association with the gene region (for example, TG with APOB and APOA1/C3/A4/A5, HDL with LIPC, and CRP with CRP; Table 3).
For each trait we summarize our findings and present newly identified candidate genes (Table 2 and Fig. 4). Additionally, to evaluate the portion of explained variance for each trait, we incorporate into a multivariate regression model environmental variables hypothesized to contribute to total trait variability (see Methods) together with genetic associations (previously known and newly discovered, Tables 22 and and3):3): the results of this analysis are in Figure 5 and Supplementary Table 4 online. The effect of the multiple associated SNPs in these models is additive across loci and associations in Table 2 remain significant in models that include all associated SNPs. As expected, associated SNPs in different genetic regions are not correlated with each other.
We replicated SNP associations with GCKR and LPL at P < 5 × 10−7, and with ANGPTL3-DOCK7-ATG4C and BCL7B-TBL2-MLXIPL at 5 × 10−7 < P < 0.001. We did not observe any SNP association with GALNT2, LIPC and NCAN-CILP2-PBX4 (P > 0.05). The remainder of SNP associations for TG showed equivocal replication evidence in NFBC1966 (0.001 < P < 0.05). A newly identified TG association on chromosome 15 depends solely on evidence from rs2624265 (Table 2 and Fig. 4), and its significance decreases considerably when covariates are taken into account (Table 2). This SNP shows modest association to TG (P = 0.0027), with similar effect sizes, in the ENGAGE consortium meta-analysis of >10,000 European samples31. Together, the associated loci explain 4.2% of trait variability.
We replicated SNP associations with CETP and LCAT at P < 5 × 10−7, and with GALNT2, LPL and ABCA1 at 5 × 10−7 < P < 0.001. We did not observe any SNP association with APOA1/C3/A4/A5, ZNF259 or BUD13 (P > 0.05). The remainder of SNP associations for HDL showed equivocal replication evidence in NFBC1966 (0.001 < P < 0.05). We identified association in two newly identified gene regions. An associated region on chromosome 11 includes NR1H3, also termed LXRA, a transcriptional regulator of cholesterol metabolism32. A second associated region is on chromosome 17. In this region only one additional genotyped SNP approached the 5 × 10−7 threshold, and imputation analysis identified several SNPs showing weaker association at this location (Fig. 4). Both locations showed associations with HDL to the same SNPs as in the ENGAGE consortium meta-analysis31 (P = 6.96 × 10−8 and P = 0.003, respectively). Together, the associated loci explain 6% of trait variability.
We replicated SNP associations with CELSR2-PSRC1-SORT1, APOB and LDLR at P < 5 × 10−7, and with HMGCR at 5 × 10−7 < P < 0.001. We did not observe any SNP association with B3GALT4 and NCAN-CILP2-PBX4 (P > 0.05). For two SNPs, we observed convincing association; however, our proxy SNP is not in strong LD with the reported SNP (located in PCSK9 and the APO cluster).
We identified three new LDL associations. SNP rs4844614 on chromosome 1 lies in an intron of CR1L, which encodes a complement receptor protein not known to have a metabolic function. On chromosome X, association was observed for rs5031002 located in intron 6 of AR, which encodes a ligand-dependent transcription factor with several functions, including control of circulating androgen levels; alterations in such levels are associated with sex-specific dyslipidemias33. This low-frequency (minor allele frequency (MAF) = 0.017) variant is associated with markedly increased LDL, primarily among 38 males possessing it. Only one female subject is homozygous for the minor allele, and female heterozygotes show modest elevations in LDL (Supplementary Fig. 7 online). Association using females only identifies a similar effect size (0.3547 mmol/l for males and 0.325 mmol/l for females), albeit with a less significant P value (males-only P = 4 × 10−7, females-only P = 0.0039). Evidence for association with AR comes entirely from rs5031002, which shows low levels of LD with surrounding SNPs. Imputed HapMap SNPs in this region also showed relatively little association with LDL (Supplementary Fig. 6).
A locus on chromosome 11, including FADS1-FADS2, is associated at P < 5 × 10−7 when we adjust for BMI. These genes encode desaturases that have previously demonstrated strong association with various fatty acids present in serum phospholipids34. Association in this region with LDL, of a comparable magnitude, was also found by the ENGAGE consortium meta-analysis31. The entire collection of associated loci explains 6% of total variability.
We replicated SNP associations with LEPR and LEF1 at P < 5 × 10−7 and with IL6R at 5 × 10−7 < P < 0.001. We did not observe any SNP association with GCKR (P > 0.05). For one SNP in CRP, we observed convincing association, but our proxy SNP is not in strong LD with the reported SNP. The remainder of SNP associations for CRP showed equivocal replication evidence in NFBC1966 (0.001 < P < 0.05). These loci explained 4% of trait variability.
We replicated with P < 5 × 10−7 the previously reported SNP association of glucose with G6PC2-ABCB1 and observed associations with new loci on chromosomes 7 and 11. The latter association (P = 5.9 × 10−8) with variants in MTNR1B has also been identified in a larger sample35. MTNR1B is transcribed in human islets and rodent insulinoma cell lines, and the translated receptor is thought to mediate the inhibitory effect of melatonin on insulin secretion. The two associated SNPs on chromosome 7 travel on a single haplotype, as do the two associated SNPs on chromosome 11. These loci explained 1.6% of trait variability.
Adjusting for BMI, we identified an INS association on chromosome 10 at rs11185790, which lies in an intron of PANK1. In imputation analyses, when INS was not adjusted for BMI, this region showed association at similar levels. PANK1 encodes panthothenate kinase, a critical enzyme in the synthesis of coenzyme A that is induced by bezafibrate, a hypolipidemic agent36. Additionally, mouse chemical knockout studies of panthothenate kinase resulted in a hypoglycemic phenotype, providing functional evidence supporting the role of this gene in glucose metabolism37; PANK1 alone explains 0.56% of total variability.
No individual loci attained genome-wide significance in NFBC1966, including two established BMI loci, FTO (NFBC1966 was part of the sample used in the original FTO finding) and MC4R, which we replicate at significance levels of P = 2.4 × 10−4 and P = 1.78 × 10−3, respectively. FTO and MC4R together explain 0.55% of the trait variability.
No individual locus achieved genome-wide significance in NFBC1966.
The genome-wide association analyses described above identify loci accounting for a limited fraction of total trait variance in NFBC1966. Additional loci may be identified by analyzing interactions between genotypes and environment. Interaction analyses address whether the effect of a SNP on a trait depends on the level of an environmental covariate. We evaluated interaction with factors hypothesized to interact specifically with metabolic traits: measures of early growth (birth BMI and growth in the first six months of life), BMI at age 31, oral-contraceptive use and sex.
We consider genome-wide interaction analysis in NFBC1966 to be exploratory for two reasons: (i) given the level of multiple testing in these additional analyses, our sample is underpowered to achieve confirmatory significance levels for effect sizes similar to those reported above; (ii) replication of NFBC1966 interaction results may be difficult, as not all GWAS of these traits have data on the same covariates. Nevertheless, we present findings to generate hypotheses for future studies, focusing on tests resulting in P < 5 × 10−7 (Supplementary Note and Supplementary Table 5a-d online). In brief, all environmental factors except growth in the first six months of life interact with at least one SNP to influence the mean level of at least one trait.
This GWAS identifies variants influencing risk for metabolic traits in a population cohort of young adults. We replicate most prior findings and identify nine previously unreported associations. Five of these associations—HDL with NR1H3 (LXRA), LDL with AR and FADS1-FADS2, glucose with MTNR1B and insulin with PANK1—implicate genes with known or postulated roles in metabolism and, therefore, support heightened investigation of pathways through which they are hypothesized to function.
We highlight 23 findings passing a significance threshold (5 × 10−7) accounting for the multiple comparisons involved in GWAS of several traits. Choosing a threshold involves judging the degree of dependence among the tests. This dependence varies across studies, for both phenotypes and genotypes. We investigated component traits for metabolic syndrome, which are relatively dependent. More importantly, LD structure, which varies substantially between populations, determines the degree of dependence of SNPs in a GWAS; a ‘one size fits all’ threshold is overconservative in some situations and under-conservative in others. The extensive LD in Northern Finland4 supports the threshold we selected.
More important than the significance threshold in any one study is the accumulation of evidence that particular loci contribute to trait variation. Replication most precisely refers to a specific SNP (or another SNP in strong LD with that SNP) with the same direction of effect38. The 12 instances in which we replicate previously reported SNP associations (at P < 5 × 10−7, Table 3) are the most statistically significant findings in our study, with generally the largest effect sizes (which are roughly similar to those observed in the GWAS initially reporting these associations).
At four additional genes we observed associations (<5 × 10−7 threshold), but not for SNP(s) for which association was previously reported (Table 3). SNPs in different studies may each be strongly associated with a trait and in strong LD with an unknown causal variant but not with one another, resulting in nonreplication at the SNP level. Additionally, association with different SNPs across studies may reflect multiple causal variants within the same gene.
Our study differs from previous investigations in power and study design, each possibly accounting for our nonreplication of previously reported associations, and our identification of nine previously unknown loci. Previously reported associations with these traits derive largely from meta-analyses of samples much larger (up to ~20,000 individuals17) than that of NFBC1966; however, published data do not indicate obvious differences that could reflect sample size (for example, level of statistical support or effect size) between associations that we replicate and those that we do not replicate (Supplementary Note).
Cohort studies such as NFBC1966, unlike case-control studies, investigate individuals drawn from the full distribution of disease-associated quantitative traits. Differences between these study designs in the strength of association, in terms of significance level, at particular loci may provide clues regarding their function; stronger association in cohort studies may characterize loci with broader phenotypic effects, whereas stronger association in case-control studies may characterize loci that more directly influence the diseases that define cases.
Birth cohorts, unlike other study designs, factor out age-specific effects, both age-varying associations and age-gene interactions39. New associations identified in NFBC1966 may indicate loci that particularly influence metabolic traits in early adulthood, when individuals are mostly free of late-onset diseases. Additionally, variability in metabolic traits in younger adults may be more genetically determined than in older adults, simply because of less accumulation of environmental exposures. At the same time, the impact of other loci, potentially reflecting genetically mediated responses to such accumulations, may not yet be discernable in NFBC1966, for example previous associations observed for CRP in older individuals19.
The metabolic syndrome concept assumes that correlations between the traits assessed here identify an important source of risk for CVD and T2D. We confirmed overlapping associations between TG and HDL at LPL, GALNT2 and APOA1/C3/A4/A5, and between TG and LDL at NCAN-CILP2-PBX4 and APOB, and we observed HDL association at APOB, which to our knowledge has not been previously reported. No additional gene regions showed significant association with multiple traits, suggesting that common variants with sizable effects do not substantially account for trait correlations that characterize metabolic syndrome. More powerful methods might identify pleiotropic effects among these traits. Principal Component of Heritability (PCH), a technique to identify the linear combination of phenotypic values that maximizes heritability with a genotyped SNP40, seems powerful for uncovering pleiotropy, but in its current form it is computationally impractical for GWAS.
The longitudinal study design facilitates additional types of analyses that may illuminate the relationships between multiple influences on trait variation. For example, NFBC1966 has, from birth through age 31, assessed participants’ BMI. Elevated adult BMI, a core component of metabolic syndrome, is a well-established risk factor for CVD and T2D, and abnormal early-life BMI is also hypothesized to predispose to these disorders2,24,26.
In GWAS one can either adjust traits for such a covariate or condition on them in interaction analyses. By adjusting traits for adult BMI (in addition to sex, pregnancy status and oral-contraceptive use), we discovered two new associations (for LDL and INS) not identified in the primary analyses. In an interaction analysis, where individuals were stratified as having normal or elevated BMI, three loci not previously highlighted showed associations at P < 5 × 10−7 (Supplementary Table 5c and Supplementary Fig. 8a online). A similar analysis identified a locus for which the effect on CRP depended on the level of BMI at birth (Supplementary Note and Supplementary Fig. 8b). We consider these interaction analyses exploratory, as power is limited by several factors: the small effects characterizing the relationship between covariates and the traits, the relatively small sample of NFBC1966 given such effect sizes, and the requirement for a smaller alpha level given the multiple testing involved in interaction analyses of several covariates.
The currently identified loci, singly and cumulatively, explain little of the trait variability in NFBC1966 (at most ~6% based on multivariate regression). The estimated heritability for each trait is in the range of 50% or greater10,11,13,41,42. Our analyses indicate that environmental variables account for less than ~30% of trait variability. Even assuming that these analyses do not include all relevant exposures, the discrepancy between estimated heritability and the percentage of trait variance now accounted for by genetic associations suggests that loci contributing the largest proportion of trait variance remain undiscovered. Such loci could include additional common variants (with even smaller effects), require interaction with environmental variables (also presumably with small effects) or include multiple infrequent variants. Such rare variants could have substantial effects in small numbers of individuals and, in aggregate, represent a substantial proportion of trait variance in the population43.
The relative importance of common and rare variants in the genetic architecture of complex traits is a topic of intense current discussion, but so far with little data. Our results suggest that cohorts from population isolates, such as NFBC1966, may be a valuable source of such data. Current GWAS are designed to investigate association with common variants. Most such variants predate the separation of modern European populations, and, therefore, it is predictable that most of the associations detected in NFBC1966 with common SNPs reflect the actions of loci with roughly comparable effect sizes across such populations.
Rare variants, by contrast, likely arose more recently, and their frequency may vary substantially between populations, particularly considering isolates such as Northern Finland in which repeated bottlenecks resulted in extensive genetic drift at many loci. This phenomenon likely explains the association (not found in other studies) of LDL with an infrequent SNP (MAF < 0.02) in AR. In a sample of almost 5,000 individuals, such an infrequent variant must have a large effect to be statistically detectable, as was the case in NFBC1966 (males with the minor allele have a mean LDL elevation of 28 mg/dl).
Current GWAS designs are poorly powered to detect association with infrequent variants, and our finding for LDL and AR probably reflects luck as well as distinct genetic features of Northern Finland. Genome-wide resequencing efforts now underway, such as the 1000 Genomes project, will likely permit investigation of associations with infrequent variants for a wide range of complex traits. As with the LDL-AR association, these findings may prove difficult to replicate, but may immediately suggest biological pathways for further investigation.
Study subjects were members of the Northern Finnish Birth Cohort of 1966 (NFBC1966) (see Supplementary Note for more information on NFBC1966). Informed consent from all study subjects was obtained using protocols approved by the Ethical Committee of the Northern Ostrobothnia Hospital District.
All traits were measured at the 31-year examination. Blood samples were drawn after overnight fasting in the morning (between 0800 and 1100 h). The concentrations of INS were analyzed by radioimmuno-assay (Pharmacia Diagnostics), and GLU by a glucose dehydrogenase method (Granutest 250, Diagnostica Merck). Serum CRP concentrations were determined by immunoenzymometric assay (Medix Biochemica). Serum GLU, total cholesterol (TC), HDL and TG were determined by enzymatic methods using a Hitachi 911 Clinical Chemistry Analyzer (Boehringer Mannheim). Additional detail on all assays can be found in the Supplementary Note.
Height and body weight were measured using a standardized height measure and scale. BMI was calculated as kg m−2. Weight was not measured for pregnant women. SBP and DBP were measured by trained nurses with a mercury sphygmomanometer using a standardized procedure and ongoing quality control2. Measurements were made on the right arm, after the subject had rested in a sitting position for 15 min. The average of duplicate measures was used as the trait measure. Blood pressure of individuals who were on blood pressure medication (1.9% of study subjects) was adjusted by adding 15 mm Hg to SBP and 10 mm Hg to the DBP44.
For association analyses, TG, BMI, INS and GLU were all natural log transformed. One-half of the detection limit (0.002 mg/l) of CRP was added to 0 values, and then all values were natural log transformed for association analysis. Results use SI units for all traits.
Individuals were excluded from analysis of specific traits on the basis of criteria that were established separately for each trait (Supplementary Note). Individuals were excluded from analysis of lipid traits (TG, HDL, LDL) if they had not fasted before blood collection, or if they were diabetic. Individuals were excluded from analysis of GLU and INS if the blood sample was nonfasting, if they were diabetic, on diabetic medication, pregnant, or if their glucose/insulin measurement (after correction for sex, oral-contraceptive use, and pregnancy status) was in excess of three s.d. from the mean. Individuals were excluded from analysis of BMI if their weight was not directly measured. No exclusion criteria were applied to CRP or to SBP and DBP.
Alcohol consumption was measured in the absolute amount of alcohol (grams per day), as reported in a questionnaire administered at age 31. Smoking habits were measured in the same questionnaire by using the answer to question: “Have you ever smoked in your life?” Use of oral contraceptive and pregnancy was determined on the basis of self-reported data at age 31.
Birth BMI was regressed on gestational age, mother’s parity, smoking status, height and weight before pregnancy and by the infant’s sex. Residuals from this regression were ordered and placed into five quantiles, defining the ordered factor we use in the analysis. Early growth was examined by taking the residuals from a regression of BMI at six months on birth BMI, as done previously26. Gestational age of the subject was dichotomized into >37 weeks (full term) and ≤37 weeks (preterm).
The NFBC1966 database records the municipality of origin of both parents of each subject. Most parents of NFBC1966 individuals were born in municipalities that can be assigned to one of six linguistic and geographic regions that can be distinguished in Northern Finland: East Lapland, Central Lapland, West Lapland, North Oulu, South Oulu and Kainuu. A subject in NFBC1966 is assigned to one of these groups when both parents belong to the same group (see Supplementary Note).
Genomic DNA was extracted from whole blood using standard methods. All DNA samples for the Illumina Infinium 370cnvDuo array were prepared for genotyping by the Broad Institute Biological Sample Repository (BSP). Details on genotyping procedures can be found in the Supplementary Note.
All individuals in the study were genotyped with call rates >95%. Individuals with discrepancy between their reported sex and the sex determined from the X chromosome were excluded from analysis. We used the identity-by-descent (IBD) analysis option of PLINK45 to determine possible relatedness among our sample subjects and to identify sample duplications and sample contamination (the latter identified as individuals who seemed to be related to nearly everyone in the sample). If the sample duplication issue could not be resolved by external means, both samples were excluded. All apparently contaminated samples were excluded. We identified individuals related at the level of half-sibs or closer with the IBD analysis and excluded one subject from each pair (the subject with less complete genotyping). Subsequent to this overall exclusion, individuals may have been excluded from analysis of specific traits, as detailed above.
Markers were excluded from analysis if the call rate in the final sample was <95%, if the P value from a test of Hardy-Weinberg Equilibrium (HWE) was <0.0001, or if the MAF was <1%. A complete description of the individuals available for analysis can be found in the Supplementary Note.
Using individuals and markers that passed inclusion criteria, we assessed population structure using classical multidimensional scaling on the matrix of identity-by-state sharing of all pairs of individuals, as estimated by the program PLINK45. The first two dimensions of the result were plotted, color coding individuals known to belong to different linguistic/geographic groups. Unlike geographical/linguistic groups that can be assigned on the basis of parental birth records only for a subset of the sample, multidimensional scaling (MDS) components are available for the entire sample. We adjusted for ancestry using the first two components of MDS, as they allow us to distinguish the different linguistic/geographic groups, and are significantly associated with a large fraction of our traits.
We constructed a genome-wide LD map in LD units (LDU) from the SNP data to evaluate the extent of LD on each chromosome. Map distances are an LD analog of the centimorgan scale of linkage maps, with one LD unit corresponding to the average distance over which LD declines to ‘background’ levels. The LD maps were used to gauge the independence of signals in nearby genomic regions. Multiple SNPs located within two LDUs of each other were (conservatively) considered likely to represent the same signal. LD maps were constructed from weighted pairwise association data following Maniatis et al.29 as implemented in the LDMAP program. The details of LDU map construction precisely followed the methods described by Service et al.4. Fine-scale genetic map information for the CEU population was downloaded from the HapMap website.
Following a standard genetic approach, we selected a main model that maximizes the power to detect any genetic association, even when the pathway connecting gene and trait may be complex and involve other traits. That is, we selected a model in which the traits of interest are minimally adjusted: we correct for sex, pregnancy status and use of oral contraceptive. Adjusting for sex is standard practice. Oral contraceptive (OC) use has been previously associated with metabolic traits in this population23 and, together with pregnancy status (PG), is important in our comparatively young sample. The cumulative effect of these variables is captured by a factor (SexOCPG in Table 1) that has one level for males and nine levels for females corresponding to all combinations of reported use of oral contraceptives (yes, no, unknown) and pregnancy status (yes, no, unknown). As noted from Table 1, sex, oral-contraceptive use, pregnancy status and BMI are, by far, the covariates with strongest effects on the metabolic traits.
In our main analysis (analysis A), the residuals from the regression of each trait on SexOCPG were used as the outcome in a main effects model that tested the association between SNP markers and trait. SNP genotypes were coded as 0, 1 or 2 copies of the minor allele. An additive effect of genotype was assumed where the effect on the trait of the heterozygote was estimated to be midway between the levels of the two homozygotes. Association testing proceeded via regression analysis, and was done in PLINK45. On the basis of the strong effect of BMI on each of the traits documented in Table 1, we also carried out a genome-wide scan adjusting traits for BMI at age 31, in addition to SexOCPG (analysis B). To rule out the possibility of spurious associations due to population stratification, we corrected the P values of analysis A using the genomic-control parameter λ28, estimated as the median of the χ2 statistics and divided by 0.456. Furthermore, we conducted a genome-wide scan adjusting for the first two components of MDS (analysis C).
To maximize coverage, it is common to test for association at nongenotyped SNPs whose genotype can be imputed on the basis of haplotypes of nearby markers. We used a custom modification of weighted haplotype association (WHAP)46 that can be directly applied to quantitative traits and implements a novel quality control procedure (analysis D). Details can be found in the Supplementary Note. Haplotype phasing was done with Beagle47.
We established a threshold of significance of 5 × 10−7, conservatively applying the Benjamini Hochberg48 procedure to control FDR at the 0.05 level across the P value from analysis A of the nine traits (see Supplementary Note). P values from the other analysis are highly correlated with those obtained in our main model and treating them as independent would artificially reduce power. To validate this theoretically established threshold, we ran 1,000 genome scans under the complete null hypothesis, using permutations. Across the nine traits, we obtained an average of 1.6 SNPs that lead to smaller P values than 5 × 10−7; this can be taken as the expected number of false positives among the results we report.
Known loci associated with the nine traits were identified through Internet and Pubmed search, beginning with a catalog of published genome-wide association studies compiled by the National Human Genome Research Institute (see URLs section below). Literature searches were conducted for the names of the nine traits together with key words ‘genome-wide association’ and ‘genetic association’. Loci were considered ‘known’ if they had a P value <5 × 10−7. We were able to identify 42 previously reported associations.
Quantile-quantile plots were produced plotting the ordered observed P values (on the −log10 scale) for each trait against the expected value of the corresponding ordered statistics for 329,091 independent observations from the uniform distribution. The 0.025 and 0.975 quantiles of the beta approximation for the distribution of these order statistics under the hypothesis of independence was used to obtain point-wise confidence bounds. We used a set of 1,000 permutations for each of the traits to obtain empirical point-wise confidence intervals for the ordered P values.
To assess the amount of genetic signal remaining in our data, besides the signals corresponding either to our new genome-wide hits or to known locations, we excluded all SNPs with P < 0.001 that are within 4 LD units (2 LDU on either side) of either a new hit or a known location and compared the distribution of the P values for remaining SNPs with that expected under the complete null. We tested the global null hypothesis of no genetic effect using the Higher Criticism49 statistic (Supplementary Note).
To identify which of the covariates in Table 1 best contributes to explaining each of the nine traits, we used a step-wise model selection procedure as implemented in the R function step (). Results are listed in Supplementary Table 3. For each of the associations reported in Table 2, we investigate if the SNP remains significant in a multivariate regression that includes all the covariates identified as above. P values from the corresponding F test are listed in Table 2. To assess the overall impact of all genetic association and the relative importance of each locus, we carried out a multivariate regression that includes the covariates above and the SNPs carrying the strongest signal from each of the regions in Table 2 and each of the previously reported regions for which we obtained an empirical P value >0.05 in Table 3.
To investigate SNP replication, we examined the P value in our study of the originally reported SNP (either directly genotyped or imputed). When this was neither genotyped nor reliably imputable, we looked at the best proxy SNP, defined as the SNP in HapMap that is in highest LD with the original SNP (r2 of at least 0.8) and that is either genotyped or imputable in our sample.
To investigate gene-region replication, we identified all SNPs in our sample that reside within 2 LD units to either side (4 LDU total) of the known location. We identified the smallest association P value among these SNPs (NFBC min P in Table 3). We evaluated the probability that the smallest P value among all the SNPs in each of these regions is smaller then NFBC min P under the null hypothesis using 5,000 permutations. The proportion of permutations that lead to a smaller min P defines the empirical P in Table 3.
We considered gene-environment interactions with six variables: sex, use of oral contraceptives, an indicator variable for overweight (BMI > 25) individuals, subject’s gestational age (dichotomized as pre-term or term), birth BMI and early growth (see above). These epidemiological covariates, population structure covariates, and early life covariates were shown to affect mean trait levels in NFBC1966 (Table 1 and Supplementary Tables 2 and 3). Interaction of genotype with sex was analyzed in a sample that excluded women on oral contraceptives and pregnant women. Interaction of genotype with oral contraceptive use was analyzed in a sample of women that excluded pregnant women and women whose oral-contraceptive use was unknown. Interaction of genotype with the overweight indicator, gestational age, birth BMI and early growth was analyzed in the full sample, where traits had been adjusted for sex, oral-contraceptive use and pregnancy status. When analyzing interaction with binary variables (sex, oral-contraceptive use, overweight indicator, preterm or full-term gestational age), we evaluated interaction by comparing the effect size of the loci in the two groups, as implemented in the PLINK ‘gxe’ procedure45. Continuous covariates (subject’s birth BMI and early growth) were evaluated in a linear regression model that included a term for the additive effect of genotype, a term for the continuous covariate, and a term for the interaction of genotype and covariate.
We acknowledge the support of US National Heart, Lung, and Blood Institute grant HL087679 through the STAMPEED program, grants MH083268, GM053275-14 and U54 RR020278 from the US National Institutes of Health, grant DMS-0239427 from the National Science Foundation, the Medical Research Council of the UK, EURO-BLCS, QLG1-CT-2000-01643 and the European Community’s Seventh Framework Programme (FP7/2007-2013), ENGAGE project and grant agreement HEALTH-F4-2007-201413. The authors would like to thank the Center of Excellence in Common Disease Genetics of the Academy of Finland and Nordic Center of Excellence in Disease Genetics, the Sydantautisaatio (Finnish Foundation of Heart Diseases), the Broad Genotyping Center, D. Mirel, H. Hobbs, J. DeYoung, P. Rantakallio, M. Koiranen and M. Isohanni for advice and assistance.
Note: Supplementary information is available on the Nature Genetics website.
Published online at http://www.nature.com/naturegenetics/
Reprints and permissions information is available online at http://npg.nature.com/reprintsandpermissions