Genome-wide association studies (GWAS) are being conducted at an unprecedented rate in population-based cohorts and have increased our understanding of the pathophysiology of complex disease. The recent application of GWAS to clinic-based cohorts has also yielded genetic predictors of clinical outcomes. Regardless of context, the practical utility of this information will ultimately depend upon the quality of the original data. Quality control (QC) procedures for GWAS are computationally intensive, operationally challenging, and constantly evolving. With each new dataset, new realities are discovered about GWAS data and best practices continue to be developed. The Genomics Workgroup of the National Human Genome Research Institute (NHGRI)-funded electronic Medical Records and Genomics (eMERGE) network has invested considerable effort in developing strategies for QC of these data. The lessons learned by this group will be valuable for other investigators dealing with large-scale genomic datasets. Here we enumerate some of the challenges in QC of GWAS data and describe the approaches that the eMERGE network is using for quality assurance in GWAS data, thereby minimizing potential bias and error in GWAS results. In this protocol we discuss common issues associated with QC of GWAS data, including data file formats, software packages for data manipulation and analysis, sex chromosome anomalies, sample identity, sample relatedness, population substructure, batch effects, and marker quality. We propose best practices and discuss areas of ongoing and future research.
The methyl-CpG-binding domain (MBD) gene family was first linked to autism over a decade ago when Rett syndrome, which falls under the umbrella of autism spectrum disorders (ASDs), was revealed to be predominantly caused by MECP2 mutations. Since that time, MECP2 alterations have been recognized in idiopathic ASD patients by us and others. Individuals with deletions across the MBD5 gene also present with ASDs, impaired speech, intellectual difficulties, repetitive behaviors, and epilepsy. These findings suggest that further investigations of the MBD gene family may reveal additional associations related to autism. We now describe the first study evaluating individuals with ASD for rare variants in four autosomal MBD family members, MBD5, MBD6, SETDB1, and SETDB2, and expand our initial screening in the MECP2 gene. Each gene was sequenced over all coding exons and evaluated for copy number variations in 287 patients with ASD and an equal number of ethnically matched control individuals. We identified 186 alterations through sequencing, approximately half of which were novel (96 variants, 51.6%). We identified seventeen ASD specific, nonsynonymous variants, four of which were concordant in multiplex families: MBD5 Tyr1269Cys, MBD6 Arg883Trp, MECP2 Thr240Ser, and SETDB1 Pro1067del. Furthermore, a complex duplication spanning the MECP2 gene was identified in two brothers who presented with developmental delay and intellectual disability. From our studies, we provide the first examples of autistic patients carrying potentially detrimental alterations in MBD6 and SETDB1, thereby demonstrating that the MBD gene family potentially plays a significant role in rare and private genetic causes of autism.
autism spectrum disorders (ASDs); copy number variation (CNV); methyl-CpG-binding domain (MBD); Rett syndrome; single nucleotide polymorphism (SNP)
Mitochondrial DNA (mtDNA) variation can affect phenotypic variation; therefore, knowing its distribution within and among individuals is of importance to understanding many human diseases. Intra-individual mtDNA variation (heteroplasmy) has been generally assumed to be random. We used massively parallel sequencing to assess heteroplasmy across ten tissues and demonstrate that in unrelated individuals there are tissue-specific, recurrent mutations. Certain tissues, notably kidney, liver and skeletal muscle, displayed the identical recurrent mutations that were undetectable in other tissues in the same individuals. Using RFLP analyses we validated one of the tissue-specific mutations in the two sequenced individuals and replicated the patterns in two additional individuals. These recurrent mutations all occur within or in very close proximity to sites that regulate mtDNA replication, strongly implying that these variations alter the replication dynamics of the mutated mtDNA genome. These recurrent variants are all independent of each other and do not occur in the mtDNA coding regions. The most parsimonious explanation of the data is that these frequently repeated mutations experience tissue-specific positive selection, probably through replication advantage.
DNA mutations are expected to be formed randomly, thus any reproducible pattern of DNA somatic mutations across multiple individuals or even across organs within each individual is highly unexpected. Using next generation sequencing of multiple tissues from the same individuals we found several somatic mutations in mitochondrial DNA that appear in a heteroplasmic state in all individuals examined, but only in particular tissues. These mutations were only found in known regions of replication control for the mitochondrial DNA. These data imply the presence of tissue-specific positive selection for these variants.
Alzheimer’s disease (AD) is a progressive neurodegenerative disorder characterized by memory and cognitive impairment and is the leading cause of dementia in the elderly. A number of genome wide association studies and subsequent replication studies have been published recently on late onset AD (LOAD). These studies identified several new susceptibility genes including phosphatidylinositol-binding clathrin assembly protein (PICALM) on chromosome 11. The aim of our study was to examine the entire coding sequence of PICALM to determine if the association could be explained by any previously undetected sequence variation. Therefore, we sequenced 48 cases and 48 controls homozygous for the risk allele in the signal SNP rs3851179. We did not find any new variants; however, rs592297, a known coding synonymous SNP that is part of an exonic splice enhancer region in exon 5, is in strong linkage disequilibrium with rs3851179 and should be examined for functional significance in Alzheimer pathophysiology.
Alzheimer; Neurodegenerative disease; PICALM; Sequencing; Exonic splicing
The clinical course of multiple sclerosis (MS) is highly variable, and research data collection is costly and time consuming. We evaluated natural language processing techniques applied to electronic medical records (EMR) to identify MS patients and the key clinical traits of their disease course.
Materials and methods
We used four algorithms based on ICD-9 codes, text keywords, and medications to identify individuals with MS from a de-identified, research version of the EMR at Vanderbilt University. Using a training dataset of the records of 899 individuals, algorithms were constructed to identify and extract detailed information regarding the clinical course of MS from the text of the medical records, including clinical subtype, presence of oligoclonal bands, year of diagnosis, year and origin of first symptom, Expanded Disability Status Scale (EDSS) scores, timed 25-foot walk scores, and MS medications. Algorithms were evaluated on a test set validated by two independent reviewers.
We identified 5789 individuals with MS. For all clinical traits extracted, precision was at least 87% and specificity was greater than 80%. Recall values for clinical subtype, EDSS scores, and timed 25-foot walk scores were greater than 80%.
Discussion and conclusion
This collection of clinical data represents one of the largest databases of detailed, clinical traits available for research on MS. This work demonstrates that detailed clinical information is recorded in the EMR and can be extracted for research purposes with high reliability.
Multiple sclerosis; electronic health records
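The precision, recall, and specificity figures reported for the extraction algorithms above follow the standard confusion-matrix definitions. A minimal sketch, assuming counts of true/false positives and negatives from a reviewer-validated test set; the function name and the counts are hypothetical and are not taken from the study:

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (precision, recall, specificity) from confusion-matrix counts."""
    precision = tp / (tp + fp)      # fraction of extracted values that are correct
    recall = tp / (tp + fn)         # fraction of true values that were extracted
    specificity = tn / (tn + fp)    # fraction of true negatives correctly left out
    return precision, recall, specificity


# Hypothetical counts for one extracted trait (illustration only).
precision, recall, specificity = classification_metrics(tp=87, fp=13, tn=80, fn=20)
print(f"precision={precision:.3f} recall={recall:.3f} specificity={specificity:.3f}")
```

Reporting all three together matters because a keyword-based extractor can reach high precision while still missing many true mentions, which only recall reveals.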
Successful aging (SA) is a multi-dimensional phenotype involving preservation of cognitive ability, physical function, and social engagement throughout life. Multiple components of SA are heritable, supporting a genetic component. The Old Order Amish are genetically and socially isolated with homogeneous lifestyles, making them a suitable population for studying the genetics of SA. DNA and measures of SA were collected on 214 cognitively intact Amish individuals over age 80. Individuals were grouped into a 13-generation pedigree using the Anabaptist Genealogy Database. A linkage screen of 5,944 single nucleotide polymorphisms (SNPs) was performed using 12 informative sub-pedigrees with an affected-only 2-point and multipoint linkage analysis. Eleven SNPs produced 2-point LOD scores >2, suggestive of linkage. Multipoint linkage analyses, allowing for heterogeneity, detected significant LOD scores on chromosomes 6 (HLOD = 4.50), 7 (LOD* = 3.11), and 14 (HLOD = 4.17), suggesting multiple new loci underlying SA.
Amish; longevity; genetic epidemiology; family-based study; population isolate
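For readers unfamiliar with the LOD scores quoted above, the textbook two-point LOD compares the likelihood of the data at a recombination fraction θ against free recombination (θ = 0.5). A minimal sketch, assuming phase-known, fully informative meioses; this is a simplification for illustration and not the affected-only or heterogeneity (HLOD) analysis used in the study, and the function name is hypothetical:

```python
from math import log10


def two_point_lod(recombinants, meioses, theta):
    """Two-point LOD score for phase-known meioses at recombination fraction theta."""
    non_recombinants = meioses - recombinants
    # Likelihood under linkage at theta versus no linkage (theta = 0.5).
    likelihood_linked = (theta ** recombinants) * ((1 - theta) ** non_recombinants)
    likelihood_unlinked = 0.5 ** meioses
    return log10(likelihood_linked / likelihood_unlinked)


# Hypothetical data: 1 recombinant among 10 informative meioses.
print(round(two_point_lod(recombinants=1, meioses=10, theta=0.1), 3))
```

In practice the score is maximized over θ, and pedigree software handles unknown phase and missing genotypes, which this sketch does not attempt.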
Genome-wide association studies (GWAS) are a useful approach in the study of the genetic components of complex phenotypes. Aside from large cohorts, GWAS have generally been limited to the study of one or a few diseases or traits. The emergence of biobanks linked to electronic medical records (EMRs) allows the efficient re-use of genetic data to yield meaningful genotype-phenotype associations for multiple phenotypes or traits. Phase I of the electronic MEdical Records and GEnomics (eMERGE-I) Network is a National Human Genome Research Institute (NHGRI)-supported consortium composed of five sites to perform various genetic association studies using DNA repositories and EMR systems. Each eMERGE site has developed EMR-based algorithms to comprise a core set of fourteen phenotypes for extraction of study samples from each site’s DNA repository. Each eMERGE site selected samples for a specific phenotype, and these samples were genotyped at either the Broad Institute or at the Center for Inherited Disease Research (CIDR) using the Illumina Infinium BeadChip technology. In all, approximately 17,000 samples from across the five sites were genotyped. A unified quality control (QC) pipeline was developed by the eMERGE Genomics Working Group and used to ensure thorough cleaning of the data. This process includes examination of sample quality, marker quality, and various batch effects. Upon completion of the genotyping and QC analyses for each site’s primary study, the eMERGE Coordinating Center merged the datasets from all five sites. This larger merged dataset re-entered the established eMERGE QC pipeline. Based on lessons learned during the process, additional analyses and QC checkpoints were added to the pipeline to ensure proper merging. Here we explore the challenges associated with combining datasets from different genotyping centers and describe the expansion to the eMERGE QC pipeline for merged datasets. 
These additional steps will be useful as the eMERGE project expands to include additional sites in eMERGE-II and also serve as a starting point for investigators merging multiple genotype data sets accessible through the National Center for Biotechnology Information (NCBI) in the database of Genotypes and Phenotypes (dbGaP). Our experience demonstrates that merging multiple datasets after additional QC can be an efficient use of genotype data despite new challenges that appear in the process.
quality control; genome-wide association (GWAS); eMERGE; dbGaP; merging datasets
To evaluate the association of risk and age at onset (AAO) of Alzheimer disease (AD) with single-nucleotide polymorphisms (SNPs) in the chromosome 19 region including apolipoprotein E (APOE) and a repeat-length polymorphism in TOMM40 (poly-T, rs10524523).
Conditional logistic regression models and survival analysis.
Fifteen genome-wide association study data sets assembled by the Alzheimer's Disease Genetics Consortium.
Eleven thousand eight hundred forty AD cases and 10 931 cognitively normal elderly controls.
Main Outcome Measures
Association of AD risk and AAO with genotyped and imputed SNPs located in an 800-Mb region including APOE in the entire Alzheimer's Disease Genetics Consortium data set and with the TOMM40 poly-T marker genotyped in a subset of 1256 cases and 1605 controls.
In models adjusting for APOE ε4, no SNPs in the entire region were significantly associated with AAO at P<.001. Rs10524523 was not significantly associated with AD or AAO in models adjusting for APOE genotype or within the subset of ε3/ε3 subjects.
APOE alleles ε2, ε3, and ε4 account for essentially all the inherited risk of AD associated with this region. Other variants including a poly-T track in TOMM40 are not independent risk or AAO loci.
Health information technologies facilitate the collection of massive quantities of patient-level data. A growing body of research demonstrates that such information can support novel, large-scale biomedical investigations at a fraction of the cost of traditional prospective studies. While healthcare organizations are being encouraged to share these data in a de-identified form, there is hesitation over concerns that doing so will allow the corresponding patients to be re-identified. Currently proposed technologies to anonymize clinical data may make unrealistic assumptions with respect to the capabilities of a recipient to ascertain a patient's identity. We show that more pragmatic assumptions enable the design of anonymization algorithms that permit the dissemination of detailed clinical profiles with provable guarantees of protection. We demonstrate this strategy with a dataset of over one million medical records and show that 192 genotype-phenotype associations can be discovered with fidelity equivalent to non-anonymized clinical data.
Autism is a common neurodevelopmental disorder with genetic and environmental components. Though unproven, genetic susceptibility to high mercury (Hg) body burden has been suggested as an autism risk factor in a subset of children. We hypothesized that exposure to “safe” Hg levels could be implicated in the etiology of autism if genetic susceptibility altered Hg's metabolism or intracellular compartmentalization. Genetic sequences of four genes implicated in the transport and response to Hg were screened for variation and association with autism. LAT1 and DMT1 function in Hg transport, and Hg exposure induces MTF1 and MT1a. We identified and characterized 74 variants in MT1a, DMT1, LAT1 and MTF1. Polymorphisms identified through screening 48 unrelated individuals from the general and autistic populations were evaluated for differences in allele frequencies using Fisher's exact test. Three variants with suggestive p-values <0.1 and four variants with significant p-values <0.05 were followed-up with TaqMan genotyping in a larger cohort of 204 patients and 323 control samples. The pedigree disequilibrium test was used to examine linkage and association. Analysis failed to show association with autism for any variant evaluated in both the initial screening set and the expanded cohort, suggesting that variations in the ability of the four genes studied to process and transport Hg may not play a significant role in the etiology of autism.
Mercury; Autism; LAT1; DMT1; MTF1; MT1a
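The allele-frequency comparison described above relies on Fisher's exact test for a 2x2 table of allele counts. A minimal pure-Python sketch of the two-sided test, assuming rows are groups (patients, controls) and columns are alleles (minor, major); the function name and the counts in the example are hypothetical and are not data from the study:

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum over all tables no more probable than the observed one
    # (small tolerance guards against floating-point ties).
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Hypothetical allele counts: 10 minor / 38 major in patients, 4 / 44 in controls.
print(fisher_exact_two_sided(10, 38, 4, 44))
```

The exact test is a natural choice for a small screening set because it does not depend on the large-sample approximation behind the chi-square test.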
Primary open-angle glaucoma (POAG) is a genetically complex common disease characterized by progressive optic nerve degeneration that results in irreversible blindness. Recently, a genome-wide association study (GWAS) for POAG in an Icelandic population identified significant associations with single nucleotide polymorphisms (SNPs) between the CAV1 and CAV2 genes on chromosome 7q31. In this study, we confirm that the identified SNPs are associated with POAG in our Caucasian US population and that specific haplotypes located in the CAV1/CAV2 intergenic region are associated with the disease. We also present data suggesting that associations with several CAV1/CAV2 SNPs are significant mostly in women.
Genome-wide association studies (GWAS) are a standard approach for large-scale common variation characterization and for identification of single loci predisposing to disease. However, due to issues of moderate sample sizes and particularly multiple testing correction, many variants of smaller effect size are not detected within a single allele analysis framework. Thus, small main effects and potential epistatic effects are not consistently observed in GWAS using standard analytical approaches that consider only single SNP alleles. Here we propose a unique methodology that aggregates variants of interest (for example, genes in a biological pathway) using GWAS results. Multiple testing and type I error concerns are minimized using empirical genomic randomization to estimate significance. Randomization corrects for common pathway-based analysis biases such as SNP coverage and density, linkage disequilibrium, gene size and pathway size. PARIS (Pathway Analysis by Randomization Incorporating Structure) applies this randomization and in doing so directly accounts for linkage disequilibrium effects. PARIS is independent of association analysis method and is thus applicable to GWAS datasets of all study designs. Using the KEGG database as an example, we apply PARIS to the publicly available Autism Genetic Resource Exchange (AGRE) GWA dataset, revealing pathways with a significant enrichment of positive association results.
pathway analysis; genomic randomization; gene set; enrichment
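The idea of estimating pathway significance by genomic randomization can be illustrated with a simple empirical p-value. The sketch below is not the PARIS implementation: it assumes each feature is summarized by a single association p-value and draws random feature sets of the same size from the genome, without the matching on linkage disequilibrium structure that the abstract describes. All names and parameters are hypothetical:

```python
import random


def empirical_p_value(pathway_pvalues, genome_pvalues,
                      n_permutations=1000, alpha=0.05, seed=0):
    """Empirical p-value for the count of significant features in a pathway."""
    rng = random.Random(seed)
    size = len(pathway_pvalues)
    observed = sum(p < alpha for p in pathway_pvalues)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Draw a random feature set of the same size from the whole genome.
        sample = rng.sample(genome_pvalues, size)
        if sum(p < alpha for p in sample) >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical inputs: a 5-feature pathway against 1,000 genome-wide features.
genome = [i / 1000 for i in range(1, 1001)]
print(empirical_p_value([0.001, 0.01, 0.02, 0.3, 0.7], genome))
```

Because significance is judged against random sets drawn from the same dataset, the result does not depend on which single-SNP association test produced the input p-values.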
The authors show that genetic variants associated with an optic nerve quantitative trait, vertical cup-to-disc ratio, are associated with primary open angle glaucoma and that this effect can be modified by factors controlling optic nerve area.
Genetically complex disorders, such as primary open angle glaucoma (POAG), may include highly heritable quantitative traits as part of the overall phenotype, and mapping genes influencing the related quantitative traits may effectively identify genetic risk factors predisposing to the complex disease. Recent studies have identified SNPs associated with optic nerve area and vertical cup-to-disc ratio (VCDR). The purpose of this study was to evaluate the association between these SNPs and POAG in a US Caucasian case-control sample.
Five SNPs previously associated with optic disc area, or VCDR, were genotyped in 539 POAG cases and 336 controls. Genotype data were analyzed for single SNP associations and SNP interactions with VCDR and POAG.
SNPs associated with VCDR rs1063192 (CDKN2B) and rs10483727 (SIX1/SIX6) were also associated with POAG (P = 0.0006 and P = 0.0043 for rs1063192 and rs10483727, respectively). rs1063192, associated with smaller VCDR, had a protective effect (odds ratio [OR] = 0.73; 95% confidence interval [CI], 0.58–0.90), whereas rs10483727, associated with larger VCDR, increased POAG risk (OR = 1.33; 95% CI, 1.08–1.65). POAG risk associated with increased VCDR was significantly influenced by the C allele of rs1900004 (ATOH7), associated with increased optic nerve area (P-interaction = 0.025; OR = 1.89; 95% CI, 1.22–2.94).
Genetic variants influencing VCDR are associated with POAG in a US Caucasian population. Variants associated with optic nerve area are not independently associated with disease but can influence the effects of VCDR variants suggesting that increased optic disc area can significantly contribute to POAG risk when coupled with risk factors controlling VCDR.
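The odds ratios and 95% confidence intervals reported above have a simple closed form when computed from a 2x2 table of counts. A minimal sketch using the Woolf (log-scale) approximation; this is an assumption for illustration, since the abstract does not state how the intervals were derived, and the function name and counts are hypothetical:

```python
from math import exp, log, sqrt


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Woolf confidence interval for the table [[a, b], [c, d]].

    a, b: carriers and non-carriers among cases
    c, d: carriers and non-carriers among controls
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the Woolf approximation.
    standard_error = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * standard_error)
    upper = exp(log(odds_ratio) + z * standard_error)
    return odds_ratio, lower, upper


# Hypothetical counts (illustration only).
print(odds_ratio_ci(20, 80, 10, 90))
```

An interval that excludes 1, as for both SNPs reported above, corresponds to a nominally significant protective (OR < 1) or risk (OR > 1) effect.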
Through extensive linkage and association analyses in multiple independent datasets, this study identified CACNG3 as the most likely AMD susceptibility gene on 16p12.
Age-related macular degeneration (AMD) is a complex disorder of the retina, characterized by drusen, geographic atrophy, and choroidal neovascularization. Cigarette smoking and the genetic variants CFH Y402H, ARMS2 A69S, CFB R32Q, and C3 R102G have been strongly and consistently associated with AMD. Multiple linkage studies have found evidence suggestive of another AMD locus on chromosome 16p12 but the gene responsible has yet to be identified.
In the initial phase of the study, single-nucleotide polymorphisms (SNPs) across chromosome 16 were examined for linkage and/or association in 575 Caucasian individuals from 148 multiplex and 77 singleton families. Additional variants were tested in an independent dataset of unrelated cases and controls. According to these results, in combination with gene expression data and biological knowledge, five genes were selected for further study: CACNG3, HS3ST4, IL4R, Q7Z6F8, and ITGAM.
After genotyping additional tagging SNPs across each gene, the strongest evidence for linkage and association was found within CACNG3 (rs757200 nonparametric LOD* = 3.3, APL (association in the presence of linkage) P = 0.06, and rs2238498 MQLS (modified quasi-likelihood score) P = 0.006 in the families; rs2283550 P = 1.3 × 10−6, and rs4787924 P = 0.002 in the case–control dataset). After adjusting for known AMD risk factors, rs2283550 remained strongly associated (P = 2.4 × 10−4). Furthermore, the association signal at rs4787924 was replicated in an independent dataset (P = 0.035) and in a joint analysis of all the data (P = 0.001).
These results suggest that CACNG3 is the best candidate for an AMD risk gene within the 16p12 linkage peak. More studies are needed to confirm this association and clarify the role of the gene in AMD pathogenesis.
Sorting mechanisms that cause the amyloid precursor protein (APP) and the β-secretases and γ-secretases to colocalize in the same compartment play an important role in the regulation of Aβ production in Alzheimer’s disease (AD). We and others have reported that genetic variants in the Sortilin-related receptor (SORL1) increased the risk of AD, that SORL1 is involved in trafficking of APP, and that underexpression of SORL1 leads to overproduction of Aβ. Here we explored the role of one of its homologs, the sortilin-related VPS10 domain containing receptor 1 (SORCS1), in AD.
We analyzed the genetic associations between AD and 16 SORCS1–single nucleotide polymorphisms (SNPs) in 6 independent data sets (2,809 cases and 3,482 controls). In addition, we compared SorCS1 expression levels of affected and unaffected brain regions in AD and control brains in microarray gene expression and real-time polymerase chain reaction (RT-PCR) sets, explored the effects of significant SORCS1-SNPs on SorCS1 brain expression levels, and explored the effect of suppression and overexpression of the common SorCS1 isoforms on APP processing and Aβ generation.
Inherited variants in SORCS1 were associated with AD in all datasets (0.001 < p < 0.049). In addition, SorCS1 influenced APP processing. While overexpression of SorCS1 reduced γ-secretase activity and Aβ levels, the suppression of SorCS1 increased γ-secretase processing of APP and the levels of Aβ.
These data suggest that inherited or acquired changes in SORCS1 expression or function may play a role in the pathogenesis of AD.
A recent genome-wide association study and follow-up shows significant association with the protocadherin 11 X-linked (PCDH11X) gene. Carrasquillo et al. (2009) show statistical association with four PCDH11X polymorphisms (rs5984894, rs2573905, rs5941047, rs4568761) in five of seven cohorts. The combined analysis of 2,356 cases and 2,384 controls showed the strongest association with a p-value of 2.2 × 10−7 with an allele specific odds ratio of 1.30 (95% CI, 1.18–1.43) at the rs5984894 polymorphism. We tested for association at these four SNPs in two independent datasets and then performed a joint analysis. Though we had adequate power to detect effect sizes with the reported odds ratios, we did not detect association between LOAD and the PCDH11X polymorphisms in our dataset of 889 cases and 850 controls, indicating that the PCDH11X association, if not a false positive, is not as strong or generalized as previously hypothesized.
Alzheimer disease; genetic association study; PCDH11X
Recent genome-wide association studies (GWAS) using selected community populations have identified genomic signals in SCN10A influencing PR duration. The extent to which this can be demonstrated in cohorts derived from electronic medical records is unknown.
Methods and Results
We performed a GWAS on 2,334 European-American patients with normal ECGs without evidence of prior heart disease from the Vanderbilt DNA databank, BioVU, which accrues subjects from routine patient care. Subjects were identified using combinations of natural language processing, laboratory, and billing code queries of de-identified medical record data. Subjects were 58% female, mean (±SD) age 54±15 years, and had mean PR intervals of 158±18 milliseconds. Genotyping was performed using the Illumina Human660W-Quad platform. Our results identify four single nucleotide polymorphisms (rs6800541, rs6795970, rs6798015, rs7430477) linked to SCN10A associated with PR interval (p=5.73×10−7 to 1.78×10−6).
This GWAS confirms a gene heretofore unimplicated in cardiac pathophysiology as a modulator of PR interval in humans. This study is one of the first replication GWAS performed using an electronic medical record-derived cohort, supporting the further use of such cohorts for genotype-phenotype analyses.
electronic medical records; atrioventricular conduction; genome-wide association study; natural language processing
The Vanderbilt DNA Databank (BioVU) is a biorepository that currently contains >80,000 DNA samples linked to electronic medical records. While BioVU is a valuable source of samples and phenotypes for genetic association studies, it is unclear whether the administratively assigned race/ethnicity in BioVU can accurately describe and be used as a proxy for genetic ancestry.
We genotyped 360 SNPs on the Illumina DNA Test Panel containing ancestry informative markers (AIMs) in 1910 BioVU samples with observer-reported ancestry and 384 samples from the Multiple Sclerosis Genetics Group with self-reported ancestry. Genetic ancestry was inferred for all individuals using STRUCTURE 2.2.
More than 98% of observer-reported European Americans (EA) were genetically inferred to have at least 60% European ancestry. Ninety-three percent of observer-reported African Americans (AA) were genetically inferred to be predominantly of African ancestry. We determined that the concordance of observer-reported race/ethnicity and inferred genetic ancestry was not significantly different from that of self-reported race/ethnicity in either population (p=0.09 and 0.94 in European Americans and African Americans, respectively).
Observer-reported race/ethnicity for European Americans and African Americans approximates genetic ancestry as well as self-reported race/ethnicity, making biorepositories linked to EMRs such as BioVU a viable source of DNA samples for future large-scale genetic association studies.
biorepositories; admixture; ancestry; electronic medical record; population stratification
In this study, LOXL1 promoter haplotypes were identified that are significantly associated with exfoliation syndrome in a U.S. Caucasian population. These results suggest that promoter region SNPs can influence LOXL1 gene expression, potentially causing a reduction of enzyme activity that may predispose to disease.
LOXL1 is a major genetic risk factor for exfoliation syndrome (ES) and exfoliation glaucoma (EG). Recent evidence documenting reversal of risk alleles for the disease-associated missense variants R141L and G153D suggests that these variants are not causative and that they may be proxies for other unknown functional LOXL1 variants. The purpose of this study was to investigate the disease association of LOXL1 variants spanning the gene region, including the 5′ and 3′ regulatory regions, in a U.S. Caucasian case–control sample.
Twenty-five LOXL1 single-nucleotide polymorphisms (SNPs), distributed throughout the gene, were genotyped in 196 Caucasian patients with ES/EG and 201 matched controls. Genotype data were analyzed for single SNP associations, SNP interactions, and haplotype associations.
Promoter region haplotypes that included the risk alleles for rs12914489, a SNP located in the distal promoter region and independently associated with ES, and rs16958477, a SNP previously shown to affect gene transcription, were associated with increased disease risk (P = 0.0008; odds ratio [OR], 2.34; 95% confidence interval [CI], 1.42–3.85) and with protective effects (P = 2.3 × 10−6; OR, 0.38; 95% CI, 0.25–0.57). Haplotypes containing rs12914489 and rs16958477 risk and protective alleles also significantly influenced the disease risk associated with missense alleles R141L and G153D.
LOXL1 promoter haplotypes were identified that are significantly associated with ES/EG in a U.S. Caucasian population. These results suggest that promoter region SNPs can influence LOXL1 gene expression, potentially causing a reduction of enzyme activity that may predispose to disease.
Genetic studies have identified thousands of variants associated with complex traits. However, most association studies are limited to populations of European descent and a single phenotype. The Population Architecture using Genomics and Epidemiology (PAGE) Study was initiated in 2008 by the National Human Genome Research Institute to investigate the epidemiologic architecture of well-replicated genetic variants associated with complex diseases in several large, ethnically diverse population-based studies. Combining DNA samples and hundreds of phenotypes from multiple cohorts, PAGE is well-suited to address generalization of associations and variability of effects in diverse populations; identify genetic and environmental modifiers; evaluate disease subtypes, intermediate phenotypes, and biomarkers; and investigate associations with novel phenotypes. PAGE investigators harmonize phenotypes across studies where possible and perform coordinated cohort-specific analyses and meta-analyses. PAGE researchers are genotyping thousands of genetic variants in up to 121,000 DNA samples from African-American, white, Hispanic/Latino, Asian/Pacific Islander, and American Indian participants. Initial analyses will focus on single nucleotide polymorphisms (SNPs) associated with obesity, lipids, cardiovascular disease, type 2 diabetes, inflammation, various cancers, and related biomarkers. PAGE SNPs are also assessed for pleiotropy using the “phenome-wide association study” approach, testing each SNP for associations with hundreds of phenotypes. PAGE data will be deposited into the National Center for Biotechnology Information's Database of Genotypes and Phenotypes and made available via a custom browser.
cardiovascular diseases; cohort studies; genome-wide association study; multifactorial inheritance; neoplasms; obesity; population characteristics; reproducibility of results
Multiple sclerosis (MS) is a complex autoimmune disease of the central nervous system with a prominent genetic component. The primary genetic risk factor is the human leukocyte antigen (HLA)-DRB1*1501 allele; however, much of the remaining genetic contribution to MS has not been elucidated. The authors investigated the relation between variation in DNA repair pathway genes and risk of MS. Single-locus association testing, epistatic tests of interactions, logistic regression modeling, and nonparametric Random Forests analyses were performed by using genotypes from 1,343 MS cases and 1,379 healthy controls of European ancestry. A total of 485 single nucleotide polymorphisms within 72 genes related to DNA repair pathways were investigated, including base excision repair, nucleotide excision repair, and double-strand breaks repair. A single nucleotide polymorphism variant within the general transcription factor IIH, polypeptide 4 gene, GTF2H4, on chromosome 6p21.33 was significantly associated with MS (odds ratio = 0.7, P = 3.5 × 10−5) after accounting for multiple testing and was not due to linkage disequilibrium with HLA-DRB1*1501. Although other candidate genes examined here warrant further follow-up studies, collectively, these results derived from a well-powered study do not support a strong role for common variation within DNA repair pathway genes in MS.
decision trees; DNA repair; epistasis, genetic; genetic variation; multiple sclerosis
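The statement that the GTF2H4 association survives multiple-testing correction can be checked with simple arithmetic. The abstract does not name the correction method; the sketch below assumes a Bonferroni correction over the 485 SNPs tested at a family-wise α of 0.05, which is an illustrative assumption only:

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


# 485 SNPs and P = 3.5e-5 are from the abstract; alpha = 0.05 is assumed.
threshold = bonferroni_threshold(0.05, 485)
passes = 3.5e-5 < threshold
print(f"threshold={threshold:.2e} passes={passes}")
```

Under this assumption the per-test threshold is roughly 1.0 × 10−4, so the reported P value would remain significant.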