The electronic MEdical Records and GEnomics (eMERGE) network brings together DNA biobanks linked to electronic health records (EHRs) from multiple institutions. Approximately 51,000 DNA samples from distinct individuals have been genotyped using genome-wide SNP arrays across the nine sites of the network. The eMERGE Coordinating Center and the Genomics Workgroup developed a pipeline to impute and merge genomic data across the different SNP arrays to maximize sample size and power to detect associations with a variety of clinical endpoints. The 1000 Genomes cosmopolitan reference panel was used for imputation. Imputation results were evaluated using the following metrics: accuracy of imputation, allelic R2 (estimated correlation between the imputed and true genotypes), and the relationship between allelic R2 and minor allele frequency. Computation time and memory resources required by two different software packages (BEAGLE and IMPUTE2) were also evaluated. A number of challenges were encountered due to the complexity of using two different imputation software packages, multiple ancestral populations, and many different genotyping platforms. We present lessons learned and describe the pipeline implemented here to impute and merge genomic data sets. The eMERGE imputed dataset will serve as a valuable resource for discovery, leveraging the clinical data that can be mined from the EHR.
imputation; genome-wide association; eMERGE; electronic health records
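The evaluation metrics named in the abstract above can be illustrated with a minimal sketch. It assumes that true genotypes are available for comparison (for example, genotypes masked before imputation) and computes allelic R2 directly as the squared Pearson correlation between imputed dosages and true genotypes. BEAGLE and IMPUTE2 report their own estimated quality measures, which are not reproduced here. The function names `allelic_r2` and `minor_allele_frequency` are hypothetical.

```python
def allelic_r2(imputed_dosages, true_genotypes):
    """Squared Pearson correlation between imputed allele dosages (0..2)
    and true genotypes coded as allele counts (0, 1, 2)."""
    n = len(imputed_dosages)
    if n != len(true_genotypes) or n < 2:
        raise ValueError("need two equal-length vectors with at least 2 samples")
    mean_x = sum(imputed_dosages) / n
    mean_y = sum(true_genotypes) / n
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(imputed_dosages, true_genotypes))
    sxx = sum((x - mean_x) ** 2 for x in imputed_dosages)
    syy = sum((y - mean_y) ** 2 for y in true_genotypes)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined for a monomorphic marker
    return (sxy * sxy) / (sxx * syy)


def minor_allele_frequency(genotypes):
    """Minor allele frequency from genotypes coded as allele counts (0, 1, 2)."""
    freq = sum(genotypes) / (2 * len(genotypes))
    return min(freq, 1 - freq)
```

Pairing the two values per marker gives the relationship between allelic R2 and minor allele frequency that the abstract describes.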
Custom genotyping of markers in families with Familial Idiopathic Scoliosis (FIS) was used to fine-map candidate regions on chromosomes 9 and 16 in order to identify candidate genes that contribute to this disorder and prioritize them for next generation sequence analysis.
Candidate regions on 9q and 16p–16q, previously identified as linked to FIS in a study of 202 families, were genotyped with a high-density map of single nucleotide polymorphisms (SNPs). Tests of linkage for fine-mapping and intra-familial tests of association, including tiled regression, were performed on scoliosis as both a qualitative and quantitative trait.
Results and Conclusions
Nominally significant linkage results were found for markers in both candidate regions. Results from intra-familial tests of association and tiled regression corroborated the linkage findings and identified possible candidate genes suitable for follow-up with next generation sequencing in these same families. Candidate genes that met our prioritization criteria included FAM129B and CERCAM on chromosome 9 and SYT1, GNAO1, and CDH3 on chromosome 16.
idiopathic scoliosis; chromosome 9q; chromosome 16; genetic heterogeneity; genetics; association; family-based association study; complex disease
Background: B vitamins play an important role in homocysteine metabolism, with vitamin deficiencies resulting in increased levels of homocysteine and increased risk for stroke. We performed a genome-wide association study (GWAS) in 2,100 stroke patients from the Vitamin Intervention for Stroke Prevention (VISP) trial, a clinical trial designed to determine whether the daily intake of high-dose folic acid, vitamin B6, and vitamin B12 reduces recurrent cerebral infarction.
Methods: Extensive quality control (QC) measures resulted in a total of 737,081 SNPs for analysis. Genome-wide association analyses for baseline quantitative measures of folate, Vitamins B12, and B6 were completed using linear regression approaches, implemented in PLINK.
Results: Six associations met or exceeded genome-wide significance (P ≤ 5 × 10−08). For baseline Vitamin B12, the strongest association was observed with a non-synonymous SNP (nsSNP) located in the CUBN gene (P = 1.76 × 10−13). Two additional CUBN intronic SNPs demonstrated strong associations with B12 (P = 2.92 × 10−10 and 4.11 × 10−10), while a second nsSNP, located in the TCN1 gene, also reached genome-wide significance (P = 5.14 × 10−11). For baseline measures of Vitamin B6, we identified genome-wide significant associations for SNPs at the ALPL locus (rs1697421; P = 7.06 × 10−10 and rs1780316; P = 2.25 × 10−08). In addition to the six genome-wide significant associations, nine SNPs (two for Vitamin B6, six for Vitamin B12, and one for folate measures) provided suggestive evidence for association (P ≤ 10−07).
Conclusion: Our GWAS has identified six genome-wide significant associations, nine suggestive associations, and successfully replicated 5 of 16 SNPs previously reported to be associated with measures of B vitamins. The six genome-wide significant associations are located in gene regions that have shown previous associations with measures of B vitamins; however, four of the nine suggestive associations represent novel findings and warrant further investigation in additional populations.
VISP; association; GWAS; one-carbon metabolism; B12; B6; folate
With white blood cell count emerging as an important risk factor for chronic inflammatory diseases, genetic associations of differential leukocyte types, specifically monocyte count, are providing novel candidate genes and pathways to further investigate. Circulating monocytes play a critical role in vascular diseases, for example in the formation of atherosclerotic plaque. We performed joint and ancestry-stratified genome-wide association analyses to identify variants specifically associated with monocyte count in 11 014 subjects in the electronic Medical Records and Genomics Network. In the joint and European ancestry samples, we identified novel associations in the chromosome 16 interferon regulatory factor 8 (IRF8) gene (P-value = 2.78×10−16, β = −0.22). Other monocyte associations include novel missense variants in the chemokine-binding protein 2 (CCBP2) gene (P-value = 1.88×10−7, β = 0.30) and a region of replication found in ribophorin I (RPN1) (P-value = 2.63×10−16, β = −0.23) on chromosome 3. The CCBP2 and RPN1 region is located near the GATA binding protein 2 gene, which has been previously shown to be associated with coronary heart disease. On chromosome 9, we found a novel association in the prostaglandin reductase 1 gene (P-value = 2.29×10−7, β = 0.16), which is downstream from lysophosphatidic acid receptor 1. This region has previously been shown to be associated with monocyte count. We also replicated monocyte associations of genome-wide significance (P-value = 5.68×10−17, β = −0.23) at the integrin, alpha 4 gene on chromosome 2. The novel IRF8 results and further replications provide supporting evidence of genetic regions associated with monocyte count.
Nicotine dependence is a highly heritable disorder associated with severe medical morbidity and mortality. Recent meta-analyses have found novel genetic loci associated with cigarettes per day (CPD), a proxy for nicotine dependence. The aim of this paper is to evaluate the importance of phenotype definition (i.e. CPD versus Fagerström Test for Cigarette Dependence (FTCD) score as a measure of nicotine dependence) on genome-wide association studies of nicotine dependence.
Genome-wide association study
A total of 3,365 subjects who had smoked at least one cigarette were selected from the Study of Addiction: Genetics and Environment (SAGE). Of the participants, 2,267 were European Americans and 999 were African Americans.
Nicotine dependence defined by FTCD score ≥4, CPD
The genetic locus most strongly associated with nicotine dependence was rs1451240 on chromosome 8 in the region of CHRNB3 (OR=0.65, p=2.4×10−8). This association was further strengthened in a meta-analysis with a previously published dataset (combined p=6.7×10−16, total n=4,200). When CPD was used as an alternate phenotype, the association no longer reached genome-wide significance (β=−0.08, p=0.0007).
Daily cigarette consumption and the Fagerström Test for Cigarette Dependence (FTCD) show different associations with polymorphisms in genetic loci.
Microarray single-nucleotide polymorphism genotyping, combined with imputation of untyped variants, has been widely adopted as an efficient means to interrogate variation across the human genome. “Genomic coverage” is the total proportion of genomic variation captured by an array, either by direct observation or through an indirect means such as linkage disequilibrium or imputation. We have performed imputation-based genomic coverage assessments of eight current genotyping arrays that assay from ~0.3 to ~5 million variants. Coverage was determined separately in each of the four continental ancestry groups in the 1000 Genomes Project phase 1 release. We used the subset of 1000 Genomes variants present on each array to impute the remaining variants and assessed coverage based on correlation between imputed and observed allelic dosages. More than 75% of common variants (minor allele frequency > 0.05) are covered by all arrays in all groups except for African ancestry, and up to ~90% in all ancestries for the highest density arrays. In contrast, less than 40% of less common variants (0.01 < minor allele frequency < 0.05) are covered by low density arrays in all ancestries and 50–80% in high density arrays, depending on ancestry. We also calculated genome-wide power to detect variant-trait association in a case-control design, across varying sample sizes, effect sizes, and minor allele frequency ranges, and compared these array-based power estimates with those of a hypothetical array that would type all variants in 1000 Genomes. These imputation-based genomic coverage and power analyses are intended as a practical guide to researchers planning genetic studies.
genome-wide association study; genomic coverage; power; SNP microarrays
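The coverage definition in the abstract above can be sketched as the fraction of variants in a minor allele frequency bin whose imputed-versus-observed dosage correlation passes a cutoff. The abstract does not state the cutoff, so the default `r2_threshold=0.8` below is an assumption, as are the function name `genomic_coverage` and the choice to treat the upper bin edge as inclusive.

```python
def genomic_coverage(r2_by_variant, maf_by_variant, maf_low, maf_high,
                     r2_threshold=0.8):
    """Fraction of variants with maf_low < MAF <= maf_high whose dosage
    correlation (r2) is at least r2_threshold.

    r2_threshold=0.8 is an assumed cutoff, not a value from the abstract.
    """
    in_bin = [r2 for r2, maf in zip(r2_by_variant, maf_by_variant)
              if maf_low < maf <= maf_high]
    if not in_bin:
        return float("nan")  # no variants fall in this frequency bin
    covered = sum(1 for r2 in in_bin if r2 >= r2_threshold)
    return covered / len(in_bin)


# Toy per-variant values, invented purely for illustration.
r2_values = [0.95, 0.85, 0.40, 0.90, 0.30]
maf_values = [0.20, 0.10, 0.30, 0.03, 0.02]
common = genomic_coverage(r2_values, maf_values, 0.05, 0.50)
less_common = genomic_coverage(r2_values, maf_values, 0.01, 0.05)
```

Running the same function once per ancestry group and per array would give a coverage table of the kind the abstract summarizes.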
Insulin secretion plays a critical role in glucose homeostasis, and failure to secrete sufficient insulin is a hallmark of type 2 diabetes. Genome-wide association studies (GWAS) have identified loci contributing to insulin processing and secretion; however, a substantial fraction of the genetic contribution remains undefined. To examine low-frequency (minor allele frequency (MAF) 0.5% to 5%) and rare (MAF<0.5%) nonsynonymous variants, we analyzed exome array data in 8,229 non-diabetic Finnish males. We identified low-frequency coding variants associated with fasting proinsulin levels at the SGSM2 and MADD GWAS loci and three novel genes with low-frequency variants associated with fasting proinsulin or insulinogenic index: TBC1D30, KANK1, and PAM. We also demonstrate that the interpretation of single-variant and gene-based tests needs to consider the effects of noncoding SNPs nearby and megabases (Mb) away. This study demonstrates that exome array genotyping is a valuable approach to identify low-frequency variants that contribute to complex traits.
Preterm delivery (PTD) is the leading cause of neonatal morbidity and mortality. Epidemiologic studies indicate that recurrence of PTD is maternally inherited, creating a strong possibility that mitochondrial variants contribute to its etiology. This study examines the association of mitochondrial genotypes with PTD and related outcomes.
This study combined, through meta-analysis, two case-control, genome-wide association studies (GWAS); one from the Danish National Birth Cohort (DNBC) Study and one from the Norwegian Mother and Child Cohort Study (MoBa) conducted by the Norwegian Institute of Public Health. The outcomes of PTD (≤36 weeks), very PTD (≤32 weeks) and preterm prelabor rupture of membranes (PPROM) were examined. 135 individual SNP associations were tested using the combined genome from mothers and neonates (case vs. control) in each population and then pooled via meta-analysis.
After meta-analysis, four SNPs had p≤0.10 for the outcome of PTD, and two had p≤0.05. For the additional outcomes of very PTD and PPROM, three and four SNPs respectively had p≤0.10.
Given the number of tests, no single SNP reached study-wide significance (p=0.0006). Our study does not support the hypothesis that mitochondrial genetics contributes to the maternal transmission of PTD and related outcomes.
Background and Purpose
Ischemic stroke (IS) shares many common risk factors with coronary artery disease (CAD). We hypothesized that genetic variants associated with myocardial infarction (MI) or CAD may be similarly involved in the etiology of IS. To test this hypothesis, we evaluated whether single-nucleotide polymorphisms (SNPs) at 11 different loci recently associated with MI or CAD through genome-wide association studies were associated with IS.
Meta-analyses of the associations between the 11 MI-associated SNPs and IS were performed using 6865 cases and 11 395 control subjects recruited from 9 studies. SNPs were either genotyped directly or imputed; in a few cases a surrogate SNP in high linkage disequilibrium was chosen. Logistic regression was performed within each study to obtain study-specific βs and standard errors. Meta-analysis was conducted using an inverse variance weighted approach assuming a random-effects model.
Despite having power to detect odds ratios of 1.09–1.14 for overall IS and 1.20–1.32 for major stroke subtypes, none of the SNPs were significantly associated with overall IS and/or stroke subtypes after adjusting for multiple comparisons.
Our results suggest that the major common loci associated with MI risk do not have effects of similar magnitude on overall IS but do not preclude moderate associations restricted to specific IS subtypes. Disparate mechanisms may be critical in the development of acute ischemic coronary and cerebrovascular events.
cerebral infarct; genetics; ischemia
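The inverse variance weighted random-effects meta-analysis described in the methods above can be sketched as follows. The abstract does not say which between-study variance estimator was used, so the DerSimonian-Laird estimator here is an assumption, and the function name `random_effects_meta` is hypothetical. Inputs are the study-specific βs (log odds ratios) and their standard errors.

```python
from math import sqrt


def random_effects_meta(betas, ses):
    """Inverse-variance weighted random-effects pooling.

    Uses the DerSimonian-Laird estimate of between-study variance (tau2),
    an assumed choice. Returns (pooled_beta, pooled_se, tau2).
    """
    k = len(betas)
    if k == 0 or k != len(ses):
        raise ValueError("need equal-length, non-empty betas and ses")
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    fixed = sum(w * b for w, b in zip(weights, betas)) / total
    # Cochran's Q measures heterogeneity around the fixed-effect estimate.
    q = sum(w * (b - fixed) ** 2 for w, b in zip(weights, betas))
    c = total - sum(w ** 2 for w in weights) / total
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random-effects weights add tau2 to each study's sampling variance.
    re_weights = [1.0 / (se ** 2 + tau2) for se in ses]
    re_total = sum(re_weights)
    pooled = sum(w * b for w, b in zip(re_weights, betas)) / re_total
    return pooled, sqrt(1.0 / re_total), tau2
```

When the study estimates agree closely, tau2 is truncated at zero and the result coincides with the fixed-effect estimate; heterogeneous studies inflate the pooled standard error.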
Asthma is a complex disease characterized by striking ethnic disparities not explained entirely by environmental, social, cultural, or economic factors. Of the limited genetic studies performed on populations of African descent, notable differences in susceptibility allele frequencies have been observed.
To test the hypothesis that some genes may contribute to the profound disparities in asthma.
We performed a genome-wide association study in two independent populations of African ancestry (935 African American asthma cases and controls from the Baltimore-Washington, D.C. area, and 929 African Caribbean asthmatics and their family members from Barbados) to identify single-nucleotide polymorphisms (SNPs) associated with asthma.
Meta-analysis combining these two African-ancestry populations yielded three SNPs with a combined P-value <10−5 in genes of potential biological relevance to asthma and allergic disease: rs10515807, mapping to the alpha-1B-adrenergic receptor (ADRA1B) gene on chromosome 5q33 (P = 3.57×10−6); rs6052761, mapping to the prion-related protein (PRNP) gene on chromosome 20pter-p12 (P = 2.27×10−6); and rs1435879, mapping to the dipeptidyl peptidase 10 (DPP10) gene on chromosome 2q12.3-q14.2. The generalizability of these findings was tested in family and case-control panels of UK and German origin, respectively, but none of the associations observed in the African groups were replicated in these European studies.
Evidence for association was also examined in four additional case-control studies of African Americans; however, none of the SNPs implicated in the discovery population were replicated. This study illustrates the complexity of identifying true associations for a complex and heterogeneous disease such as asthma in admixed populations, especially populations of African descent.
Asthma; GWAS; ADRA1B; PRNP; DPP10; African ancestry; ethnicity; polymorphism; genetic association
Genome-wide association studies (GWAS) are a useful approach in the study of the genetic components of complex phenotypes. Aside from large cohorts, GWAS have generally been limited to the study of one or a few diseases or traits. The emergence of biobanks linked to electronic medical records (EMRs) allows the efficient re-use of genetic data to yield meaningful genotype-phenotype associations for multiple phenotypes or traits. Phase I of the electronic MEdical Records and GEnomics (eMERGE-I) Network is a National Human Genome Research Institute (NHGRI)-supported consortium composed of five sites to perform various genetic association studies using DNA repositories and EMR systems. Each eMERGE site has developed EMR-based algorithms to comprise a core set of fourteen phenotypes for extraction of study samples from each site’s DNA repository. Each eMERGE site selected samples for a specific phenotype, and these samples were genotyped at either the Broad Institute or at the Center for Inherited Disease Research (CIDR) using the Illumina Infinium BeadChip technology. In all, approximately 17,000 samples from across the five sites were genotyped. A unified quality control (QC) pipeline was developed by the eMERGE Genomics Working Group and used to ensure thorough cleaning of the data. This process includes examination of sample quality, marker quality, and various batch effects. Upon completion of the genotyping and QC analyses for each site’s primary study, the eMERGE Coordinating Center merged the datasets from all five sites. This larger merged dataset re-entered the established eMERGE QC pipeline. Based on lessons learned during the process, additional analyses and QC checkpoints were added to the pipeline to ensure proper merging. Here we explore the challenges associated with combining datasets from different genotyping centers and describe the expansion to the eMERGE QC pipeline for merged datasets. 
These additional steps will be useful as the eMERGE project expands to include additional sites in eMERGE-II and also serve as a starting point for investigators merging multiple genotype data sets accessible through the National Center for Biotechnology Information (NCBI) in the database of Genotypes and Phenotypes (dbGaP). Our experience demonstrates that merging multiple datasets after additional QC can be an efficient use of genotype data despite new challenges that appear in the process.
quality control; genome-wide association (GWAS); eMERGE; dbGaP; merging datasets
To examine associations in a preterm population between rs9883204 in ADCY5 and rs900400 near LEKR1 and CCNL1 with birth weight. Both markers were associated with birth weight in a term population in a recent genome-wide association (GWA) study by Freathy et al.
A meta-analysis of mother and infant samples was performed for associations of rs900400 and rs9883204 with birth weight in 393 families from the U.S., 265 families from Argentina and 735 mother-infant pairs from Denmark. Z scores adjusted for infant sex and gestational age were generated for each population separately and regressed on allele counts. Association evidence was combined across sites by inverse-variance weighted meta-analysis.
Each additional C allele of rs900400 (LEKR1/CCNL1) in infants was marginally associated with a 0.069 standard deviation (SD) lower birth weight (95% CI = −0.159 – 0.022, P = 0.068). This result was slightly more pronounced after adjusting for smoking (P = 0.036). There were no significant associations identified with rs9883204 or in maternal samples.
These results indicate the potential importance of this marker on birth weight irrespective of gestational age.
Genetic; association; single nucleotide polymorphism
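The two analysis steps described in the methods above (regressing adjusted birth weight Z scores on allele counts within each population, then combining sites by inverse-variance weighting) can be sketched as follows. This is a simplified illustration that assumes unrelated samples and ordinary least squares with a single predictor; the function names `slope_and_se` and `fixed_effect_meta` are hypothetical, and the two-sided P value uses a normal approximation.

```python
from math import erfc, sqrt


def slope_and_se(allele_counts, z_scores):
    """Ordinary least squares slope of Z score on allele count (0, 1, 2),
    with its standard error."""
    n = len(allele_counts)
    if n != len(z_scores) or n < 3:
        raise ValueError("need equal-length vectors with at least 3 samples")
    mean_x = sum(allele_counts) / n
    mean_y = sum(z_scores) / n
    sxx = sum((x - mean_x) ** 2 for x in allele_counts)
    if sxx == 0:
        raise ValueError("marker is monomorphic in this sample")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(allele_counts, z_scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2
              for x, y in zip(allele_counts, z_scores))
    return slope, sqrt(rss / (n - 2) / sxx)


def fixed_effect_meta(betas, ses):
    """Inverse-variance weighted pooling across sites.

    Returns (pooled_beta, pooled_se, two_sided_p)."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, betas)) / total
    pooled_se = sqrt(1.0 / total)
    p_value = erfc(abs(pooled / pooled_se) / sqrt(2.0))
    return pooled, pooled_se, p_value
```

Each site contributes one slope and standard error, so a site with more informative data (smaller standard error) receives proportionally more weight in the pooled estimate.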
Over 90% of adults aged 20 years or older with permanent teeth have suffered from dental caries leading to pain, infection, or even tooth loss. Although caries prevalence has decreased over the past decade, about 23% of dentate adults in the US still have untreated carious lesions. Dental caries is a complex disorder affected by both individual susceptibility and environmental factors. Approximately 35-55% of caries phenotypic variation in the permanent dentition is attributable to genes, though few specific caries genes have been identified. Therefore, we conducted the first genome-wide association study (GWAS) to identify genes affecting susceptibility to caries in adults.
Five independent cohorts were included in this study, totaling more than 7000 participants. For each participant, dental caries was assessed and genetic markers (single nucleotide polymorphisms, SNPs) were genotyped or imputed across the entire genome. Due to the heterogeneity among the five cohorts regarding age, genotyping platform, quality of dental caries assessment, and study design, we first conducted genome-wide association (GWA) analyses on each of the five independent cohorts separately. We then performed three meta-analyses to combine results for: (i) the comparatively younger, Appalachian cohorts (N = 1483) with well-assessed caries phenotype, (ii) the comparatively older, non-Appalachian cohorts (N = 5960) with inferior caries phenotypes, and (iii) all five cohorts (N = 7443). Top ranking genetic loci within and across meta-analyses were scrutinized for biologically plausible roles on caries.
Different sets of genes were nominated across the three meta-analyses, especially between the younger and older age cohorts. In general, we identified several suggestive loci (P-value ≤ 10E-05) within or near genes with plausible biological roles for dental caries, including RPS6KA2 and PTK2B, involved in p38-dependent MAPK signaling, and RHOU and FZD1, involved in the Wnt signaling cascade. Both of these pathways have been implicated in dental caries. ADAMTS3 and ISL1 are involved in tooth development, and TLR2 is involved in immune response to oral pathogens.
As the first GWAS for dental caries in adults, this study nominated several novel caries genes for future study, which may lead to better understanding of cariogenesis, and ultimately, to improved disease predictions, prevention, and/or treatment.
Dental caries; Genetics; Genome wide association; Permanent dentition; Genomics
We performed a multistage genome-wide association study of melanoma. In a discovery cohort of 1804 melanoma cases and 1026 controls, we identified loci at chromosomes 15q13.1 (HERC2/OCA2 region) and 16q24.3 (MC1R) that reached genome-wide significance within this study and also found strong evidence for genetic effects on susceptibility to melanoma from markers on chromosome 9p21.3 in the p16/ARF region and on chromosome 1q21.3 (ARNT/LASS2/ANXA9 region). The most significant single-nucleotide polymorphisms (SNPs) in the 15q13.1 locus (rs1129038 and rs12913832) lie within a genomic region that has profound effects on eye and skin color; notably, 50% of variability in eye color is associated with variation in the SNP rs12913832. Because eye and skin colors vary across European populations, we further evaluated the associations of the significant SNPs after carefully adjusting for European substructure. We also evaluated the top 10 most significant SNPs by using data from three other genome-wide scans. Additional in silico data provided replication of the findings for the most significant SNP in the chromosome 1q21.3 region, rs7412746 (P = 6 × 10−10). Together, these data identified several candidate genes for additional studies to identify causal variants predisposing to increased risk for developing melanoma.
Clonal mosaicism for large chromosomal anomalies (duplications, deletions and uniparental disomy) was detected using SNP microarray data from over 50,000 subjects recruited for genome-wide association studies. This detection method requires a relatively high frequency of cells (>5–10%) with the same abnormal karyotype (presumably of clonal origin) in the presence of normal cells. The frequency of detectable clonal mosaicism in peripheral blood is low (<0.5%) from birth until 50 years of age, after which it rises rapidly to 2–3% in the elderly. Many of the mosaic anomalies are characteristic of those found in hematological cancers and identify common deleted regions that pinpoint the locations of genes previously associated with hematological cancers. Although only 3% of subjects with detectable clonal mosaicism had any record of hematological cancer prior to DNA sampling, those without a prior diagnosis have an estimated 10-fold higher risk of a subsequent hematological cancer (95% confidence interval = 6–18).
Non-syndromic cleft palate (CP) is a common birth defect with a complex and heterogeneous etiology involving both genetic and environmental risk factors. We conducted a genome wide association study (GWAS) using 550 case-parent trios, ascertained through a CP case collected in an international consortium. Family based association tests of single nucleotide polymorphisms (SNP) and three common maternal exposures (maternal smoking, alcohol consumption and multivitamin supplementation) were used in a combined 2 df test for gene (G) and gene-environment (G×E) interaction simultaneously, plus a separate 1 df test for G×E interaction alone. Conditional logistic regression models were used to estimate effects on risk to exposed and unexposed children. While no SNP achieved genome wide significance when considered alone, markers in several genes attained or approached genome wide significance when G×E interaction was included. Among these, MLLT3 and SMC2 on chromosome 9 showed multiple SNPs resulting in increased risk if the mother consumed alcohol during the peri-conceptual period (3 months prior to conception through the first trimester). TBK1 on chr. 12 and ZNF236 on chr. 18 showed multiple SNPs associated with higher risk of CP in the presence of maternal smoking. Additional evidence of reduced risk due to G×E interaction in the presence of multivitamin supplementation was observed for SNPs in BAALC on chr. 8. These results emphasize the need to consider G×E interaction when searching for genes influencing risk to complex and heterogeneous disorders, such as non-syndromic CP.
Despite twin studies showing that 50–70% of variation in DSM-IV cannabis dependence is attributable to heritable influences, little is known of specific genotypes that influence vulnerability to cannabis dependence. We conducted a genomewide association study of DSM-IV cannabis dependence. Association analyses of 708 DSM-IV cannabis dependent cases with 2,346 cannabis exposed nondependent controls were conducted using logistic regression in PLINK. None of the 948,142 SNPs met genomewide significance (p < E−8). The lowest p-values were obtained for polymorphisms on chromosome 17 (rs1019238 and rs1431318, p-values at E−7) in the ANKFN1 gene. While replication is required, this study represents an important first step towards clarifying the biological underpinnings of cannabis dependence.
With the advent of novel sequencing technologies, interest in the identification of rare variants that influence common traits has increased rapidly. Standard statistical methods, such as the Cochran-Armitage trend test or logistic regression, fail in this setting for the analysis of unrelated subjects because of the rareness of the variants. Recently, various alternative approaches have been proposed that circumvent the rareness problem by collapsing rare variants in a defined genetic region or sets of regions. We provide an overview of these collapsing methods for association analysis and discuss the use of permutation approaches for significance testing of the data-adaptive methods.
association; collapsing methods; collection of rare variants; common disease/rare variant hypothesis; contingency table; generalized linear model; next-generation sequencing; pooling methods
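The general idea of collapsing followed by permutation testing, as described in the abstract above, can be sketched in a few lines. This is one simple indicator-style collapsing scheme chosen for illustration, not any specific published method from the overview: each subject is scored as a carrier if they hold at least one rare allele in the region, and the case-control difference in carrier frequency is tested by permuting phenotype labels. All function names, the 999-permutation default, and the fixed seed are assumptions.

```python
import random


def collapse_region(genotype_rows):
    """Collapse each subject's rare-variant allele counts across a region
    into a single carrier indicator (1 if any rare allele is present)."""
    return [1 if any(count > 0 for count in row) else 0
            for row in genotype_rows]


def carrier_frequency_difference(carriers, is_case):
    """Absolute difference in carrier frequency between cases and controls."""
    cases = [c for c, s in zip(carriers, is_case) if s]
    controls = [c for c, s in zip(carriers, is_case) if not s]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))


def permutation_p_value(carriers, is_case, n_permutations=999, seed=0):
    """Empirical P value from shuffling case/control labels."""
    rng = random.Random(seed)
    observed = carrier_frequency_difference(carriers, is_case)
    labels = list(is_case)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(labels)
        if carrier_frequency_difference(carriers, labels) >= observed:
            hits += 1
    # Adding 1 to numerator and denominator keeps the estimate above zero.
    return (hits + 1) / (n_permutations + 1)
```

Permutation is useful here because the collapsed statistic may be chosen adaptively from the data, in which case its null distribution is not known in closed form.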
Genome-wide association studies (GWAS) are being conducted at an unprecedented rate in population-based cohorts and have increased our understanding of the pathophysiology of complex disease. The recent application of GWAS to clinic-based cohorts has also yielded genetic predictors of clinical outcomes. Regardless of context, the practical utility of this information will ultimately depend upon the quality of the original data. Quality control (QC) procedures for GWAS are computationally intensive, operationally challenging, and constantly evolving. With each new dataset, new realities are discovered about GWAS data and best practices continue to be developed. The Genomics Workgroup of the National Human Genome Research Institute (NHGRI) funded electronic Medical Records and Genomics (eMERGE) network has invested considerable effort in developing strategies for QC of these data. The lessons learned by this group will be valuable for other investigators dealing with large scale genomic datasets. Here we enumerate some of the challenges in QC of GWAS data and describe the approaches that the eMERGE network is using for quality assurance in GWAS data, thereby minimizing potential bias and error in GWAS results. In this protocol we discuss common issues associated with QC of GWAS data, including data file formats, software packages for data manipulation and analysis, sex chromosome anomalies, sample identity, sample relatedness, population substructure, batch effects, and marker quality. We propose best practices and discuss areas of ongoing and future research.
As part of Genetic Analysis Workshop 17 (GAW17), our group considered the application of novel and standard approaches to the analysis of genotype-phenotype association in next-generation sequencing data. Our group identified a major issue in the analysis of the GAW17 next-generation sequencing data: type I error and false-positive report probability rates higher than those expected based on empirical type I error levels (as high as 90%). Two main causes emerged: population stratification and long-range correlation (gametic phase disequilibrium) between rare variants. Population stratification was expected because of the diverse sample. Correlation between rare variants was attributable to both random causes (e.g., nearly 10,000 of 25,000 markers were private variants, and the sample size was small [n = 697]) and nonrandom causes (more correlation was observed than was expected by random chance). Principal components analysis was used to control for population structure and helped to minimize type I errors, but this was at the expense of identifying fewer causal variants. A novel multiple regression approach showed promise to handle correlation between markers. Further work is needed, first, to identify best practices for the control of type I errors in the analysis of sequencing data and then to explore and compare the many promising new aggregating approaches for identifying markers associated with disease phenotypes.
population structure; correlated markers; next-generation sequencing
Genetic Analysis Workshop 17 (GAW17) provided a platform for evaluating existing statistical genetic methods and for developing novel methods to analyze rare variants that modulate complex traits. In this article, we present an overview of the 1000 Genomes Project exome data and simulated phenotype data that were distributed to GAW17 participants for analyses, the different issues addressed by the participants, and the process of preparation of manuscripts resulting from the discussions during the workshop.
Genotyping experiments are widely used in clinical and basic research laboratories to identify associations between genetic variations and normal/abnormal phenotypes. Genotyping assay techniques vary from single genomic regions that are interrogated using PCR reactions to high throughput assays examining genome-wide sequence and structural variation. The resulting genotype data may include millions of markers for thousands of individuals, requiring various statistical, modeling or other data analysis methodologies to interpret the results. To date, there are no standards for reporting genotyping experiments. Here we present the Minimum Information about a Genotyping Experiment (MIGen) standard, defining the minimum information required for reporting genotyping experiments. The MIGen standard covers experimental design, subject description, genotyping procedure, quality control and data analysis. MIGen is a registered project under MIBBI (Minimum Information for Biological and Biomedical Investigations) and is being developed by an interdisciplinary group of experts in basic biomedical science, clinical science, biostatistics and bioinformatics. To accommodate the wide variety of techniques and methodologies applied in current and future genotyping experiments, MIGen leverages foundational concepts from the Ontology for Biomedical Investigations (OBI) for the description of the various types of planned processes and implements a hierarchical document structure. The adoption of MIGen by the research community will facilitate consistent genotyping data interpretation and independent data validation. MIGen can also serve as a framework for the development of data models for capturing and storing genotyping results and experiment metadata in a structured way, to facilitate the exchange of metadata.
Identifying the genetic variants that increase the risk of type 2 diabetes (T2D) in humans has been a formidable challenge. Adopting a genome-wide association strategy, we genotyped 1161 Finnish T2D cases and 1174 Finnish normal glucose-tolerant (NGT) controls with >315,000 single-nucleotide polymorphisms (SNPs) and imputed genotypes for an additional >2 million autosomal SNPs. We carried out association analysis with these SNPs to identify genetic variants that predispose to T2D, compared our T2D association results with the results of two similar studies, and genotyped 80 SNPs in an additional 1215 Finnish T2D cases and 1258 Finnish NGT controls. We identify T2D-associated variants in an intergenic region of chromosome 11p12, contribute to the identification of T2D-associated variants near the genes IGF2BP2 and CDKAL1 and the region of CDKN2A and CDKN2B, and confirm that variants near TCF7L2, SLC30A8, HHEX, FTO, PPARG, and KCNJ11 are associated with T2D risk. This brings the number of T2D loci now confidently identified to at least 10.
Ischemic stroke (IS) is among the leading causes of death in Western countries. There is a significant genetic component to IS susceptibility, especially among young adults. To date, research to identify genetic loci predisposing to stroke has met only with limited success. We performed a genome-wide association (GWA) analysis of early-onset IS to identify potential stroke susceptibility loci. The GWA analysis was conducted by genotyping 1 million SNPs in a biracial population of 889 IS cases and 927 controls, ages 15–49 years. Genotypes were imputed using the HapMap3 reference panel to provide 1.4 million SNPs for analysis. Logistic regression models adjusting for age, recruitment stages, and population structure were used to determine the association of IS with individual SNPs. Although no single SNP reached genome-wide significance (P < 5 × 10⁻⁸), we identified two SNPs in chromosome 2q23.3, rs2304556 (in FMNL2; P = 1.2 × 10⁻⁷) and rs1986743 (in ARL6IP6; P = 2.7 × 10⁻⁷), strongly associated with early-onset stroke. These data suggest that a novel locus on human chromosome 2q23.3 may be associated with IS susceptibility among young adults.
epidemiology; genetics; brain infarction; FMNL2
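The significance statement in the stroke abstract above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' analysis code: the threshold and the two P values are taken from the abstract, while the function names are hypothetical and the reading of 5 × 10⁻⁸ as a Bonferroni correction of 0.05 for roughly one million tests is a common convention assumed here, not something the abstract states.

```python
import math

# Genome-wide significance threshold quoted in the abstract.
GENOME_WIDE_ALPHA = 5e-8

# P values reported in the abstract for the two 2q23.3 SNPs.
REPORTED = {"rs2304556": 1.2e-7, "rs1986743": 2.7e-7}


def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def is_genome_wide_significant(p_value, threshold=GENOME_WIDE_ALPHA):
    """True if the P value falls strictly below the threshold."""
    return p_value < threshold


def neg_log10(p_value):
    """-log10(P), the scale usually plotted on a Manhattan plot."""
    return -math.log10(p_value)


if __name__ == "__main__":
    for snp, p in REPORTED.items():
        print(snp, p, round(neg_log10(p), 2), is_genome_wide_significant(p))
```

Both reported SNPs fall short of the threshold, which matches the abstract's wording that no single SNP reached genome-wide significance.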
Genome-wide scans of nucleotide variation in human subjects are providing an increasing number of replicated associations with complex disease traits. Most of the variants detected have small effects and, collectively, they account for a small fraction of the total genetic variance. Very large sample sizes are required to identify and validate findings. In this situation, even small sources of systematic or random error can cause spurious results or obscure real effects. The need for careful attention to data quality has been appreciated for some time in this field, and a number of strategies for quality control and quality assurance (QC/QA) have been developed. Here we extend these methods and describe a system of QC/QA for genotypic data in genome-wide association studies. This system includes some new approaches that (1) combine analysis of allelic probe intensities and called genotypes to distinguish gender misidentification from sex chromosome aberrations, (2) detect autosomal chromosome aberrations that may affect genotype calling accuracy, (3) infer DNA sample quality from relatedness and allelic intensities, (4) use duplicate concordance to infer SNP quality, (5) detect genotyping artifacts from dependence of Hardy-Weinberg equilibrium (HWE) test p-values on allelic frequency, and (6) demonstrate sensitivity of principal components analysis (PCA) to SNP selection. The methods are illustrated with examples from the ‘Gene Environment Association Studies’ (GENEVA) program. The results suggest several recommendations for QC/QA in the design and execution of genome-wide association studies.
GWAS; DNA sample quality; genotyping artifact; Hardy-Weinberg equilibrium; chromosome aberration
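Two of the QC quantities named in the abstract above, the Hardy-Weinberg equilibrium test p-value and duplicate concordance, are simple to compute for a single SNP. The sketch below is illustrative only and is not the GENEVA implementation: it assumes a textbook one-degree-of-freedom chi-square HWE test (production pipelines often use an exact test instead), and the function names and the use of `None` for a missing genotype call are hypothetical choices.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square HWE test (1 df) from genotype counts; returns (chi2, p)."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    # Skip classes with zero expectation (monomorphic SNPs).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def duplicate_concordance(calls_a, calls_b):
    """Fraction of matching calls between two duplicate samples.

    Positions where either call is None (missing) are ignored.
    Returns None when no position has both calls present.
    """
    pairs = [(a, b) for a, b in zip(calls_a, calls_b)
             if a is not None and b is not None]
    if not pairs:
        return None
    return sum(a == b for a, b in pairs) / len(pairs)
```

For example, `hwe_chi_square(25, 50, 25)` describes a SNP exactly at equilibrium, whereas counts with no heterozygotes at all, such as `(50, 0, 50)`, give a very small p-value of the kind that would flag a possible genotyping artifact.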