|Home | About | Journals | Submit | Contact Us | Français|
Conceived and designed the experiments: JCB KDT IBB LAC MF VG SK DL YL BMP SSR DSS AVS AU CMVD JGW CJO JIR EB. Performed the experiments: MLG BJC TH MH YL RK FR. Analyzed the data: MLG BY EB. Contributed reagents/materials/analysis tools: KDT IBB MF VG SK YL GMP BMP SSR DSS AVS AU CMVD JGW CJO JIR EB. Wrote the paper: MLG BY BJC TH JCB KDT MH IBB LAC MF VG TBH SK RK LJL DL YL TM GMP BMP SSR FR DSS AVS AU CMVD JGW CJO JIR EB.
Genotyping arrays are a cost effective approach when typing previously-identified genetic polymorphisms in large numbers of samples. One limitation of genotyping arrays with rare variants (e.g., minor allele frequency [MAF] <0.01) is the difficulty that automated clustering algorithms have to accurately detect and assign genotype calls. Combining intensity data from large numbers of samples may increase the ability to accurately call the genotypes of rare variants. Approximately 62,000 ethnically diverse samples from eleven Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium cohorts were genotyped with the Illumina HumanExome BeadChip across seven genotyping centers. The raw data files for the samples were assembled into a single project for joint calling. To assess the quality of the joint calling, concordance of genotypes in a subset of individuals having both exome chip and exome sequence data was analyzed. After exclusion of low performing SNPs on the exome chip and non-overlap of SNPs derived from sequence data, genotypes of 185,119 variants (11,356 were monomorphic) were compared in 530 individuals that had whole exome sequence data. A total of 98,113,070 pairs of genotypes were tested and 99.77% were concordant, 0.14% had missing data, and 0.09% were discordant. We report that joint calling allows the ability to accurately genotype rare variation using array technology when large sample sizes are available and best practices are followed. The cluster file from this experiment is available at www.chargeconsortium.com/main/exomechip.
Exome- and whole-genome sequencing is becoming increasingly affordable and allows for detection and genotyping of rare variants in the human genome. Yet, genotyping arrays remain a cost-effective approach when investigating genetic polymorphisms previously identified in large populations. A limitation of using arrays to genotype rare variants is the difficulty that automated clustering algorithms have to accurately detect and assign accurate genotype calls , . Large sample sizes increase the number of occurrences of rare variants and, therefore, should facilitate automated clustering and genotyping.
An array focused on rare and low frequency coding variation, hereafter referred to as the exome chip, has been developed by querying the exomes sequenced in ~12,000 individuals and aggregating the variation that is seen in more than two individuals in more than two sequencing efforts (http://genome.sph.umich.edu/wiki/Exome_Chip_Design). Participating studies in the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium  consented to have their Illumina Infinium HumanExome BeadChip intensity data analyzed collectively (n=62,266) in order to increase the accuracy of rare variant genotype calls. The resulting cluster file (.egt) is publically available and we show that its use, along with best practices, increase genotype accuracy compared to other methods alone.
Genotypes were obtained for 238,876 successful variants in accordance with our best practices (96.4% SNP pass rate) which were converted to PLINK format  by cohort and combined into a single aggregate file for further analyses. Of the 62,266 samples genotyped, 1,380 (2.2%) had a GenCall quality score in the lower 10th percentile of the distribution across all variants genotyped (p10GC) <0.38 or call rate <0.97 and were excluded from allele frequency calculations. Because founder effects and unique population structure have been previously observed in Icelandic samples , , the Age, Gene/Environment, Susceptibility-Reykjavik study was excluded from subsequent steps. Known duplicated samples, individuals without self-reported race, and the HapMap controls were also removed. After excluding duplicate variants (n=811), the minor allele frequencies (MAF) for 238,065 successful SNPs and 56,407 samples by self-reported race are described in Table 1. There were 10,693 monomorphic SNPs (4.5%), and 78.6% of the variants on the exome chip have a MAF <0.005. Allele frequencies for each variant by race group are reported in the SNP information file (see Methods and Data Access sections). Ethnicity specific HapMap allele frequencies for the 96 controls (48 CEU and 48 YRI) and genotypes are also available.
To evaluate the performance of the rare variant calling approach (see Methods), we compared exome chip genotypes derived from three calling methods to available exome sequencing data in 530 ARIC individuals. First, exome chip genotypes were called with the Illumina issued cluster file HumanExome-12v1.egt (see Data Access section for file location) (Dataset I). Second, we used zCall (threshold set to 7)  to determine genotypes for the missing variant calls in Dataset I to create Dataset Z. Third, we used the CHARGE best practices (see Data Access) and joint calling approach described to ascertain exome chip genotypes (Dataset C). A total of 185,119 variants that were present in the exome sequence dataset and passed our best practices were compared using genotype concordance and uncertainty coefficient tests. Results are presented in Table 2. The uncertainty coefficients indicate that we can predict 86.4% of the information (entropy) in the exome sequence data when using the Illumina cluster file, 91.2% when using the zCall algorithm, and 93.4% when the CHARGE clustering method was utilized.
These data demonstrate the importance of implementing stringent laboratory quality control measures in addition to the clustering algorithms and rare variant calling approaches tested. The complete list of 8,994 failing SNPs identified in the jointly called exome chip project are available for download on the CHARGE public website. Genotypes ascertained with the CHARGE jointly called exome chip cluster file (Dataset C) were 99.77% concordant with sequence data, 0.14% were missing in exome chip data, in the exome sequence data, or both, and 0.09% were discordant (Figure 1). Heterozygotes in Dataset C were most often misclassified when compared to the common allele homozygote, and mismatches were attributed equally to both sequencing and genotyping (Table 2).
We also tested the ability of the CHARGE exome chip cluster file to accurately assign genotypes in the three rarest variant bins: singletons (minor allele count=1), doubletons (minor allele count=2), and tripletons (minor allele count=3). We observed high concordance between exome chip singletons (99.99%), doubletons (99.98%), and tripletons (99.97%) when compared to their respective sequence genotypes in the same 530 ARIC individuals previously described (data not shown). These results are consistent with the global concordance tests which suggest we are able to accurately call very rare variants.
The results presented here demonstrate that rare variants on the exome chip can be accurately called when using a large, combined cluster file and best practices described when compared to existing clustering algorithms and rare variant calling methods. The joint calling protocol, accompanying cluster file, list of poor performing variants on the chip, and annotation data are a valuable resource for the scientific community and will be of great utility to those having smaller sample sets where the calling of rare variants is problematic. All new projects will require user decisions based on their own cohort data and the metrics and best practices presented here should be updated accordingly.
Data from 62,266 participants from the following eleven studies in the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium  were included in this joint calling experiment and study descriptions were published previously: Age, Gene/Environment, Susceptibility-Reykjavik (AGES) Study , Atherosclerosis Risk in Communities (ARIC) Study , Cardiac Arrest Blood Study (CABS) , Cardiovascular Health Study (CHS) , , Coronary Artery Risk Development in Young Adults (CARDIA) , , Multi-Ethnic Study of Atherosclerosis (MESA) , Family Heart Study (FamHS) , Framingham Heart Study (FHS) , Health, Aging, and Body Composition (HABC) Study , Jackson Heart Study (JHS) , and the Rotterdam Study (RS) –. In addition, we genotyped 96 unrelated HapMap samples (48 CEU and 48 YRI) with each cohort and the list of sample IDs are available as a reference on the CHARGE exome chip public website.
All subjects provided written and informed consent to participate in genetic studies, and all study sites received approval to conduct this research from their local respective Institutional Review Boards (IRB) as follows: „The National Bioethics Committee“ and „The Data Protection Authority“ (AGES); University of Mississippi Medical Center IRB (ARIC – Jackson Field Center), Wake Forest University Health Sciences IRB (ARIC – Forsyth County Field Center), University of Minnesota IRB (ARIC – Minnesota Field Center), and Johns Hopkins University (Bloomberg School of Public Health) IRB (ARIC – Washington County Field Center); University of Washington IRB (CABS); Wake Forest University Health Sciences IRB (CHS – Forsyth County Field Center), University of California, Davis IRB (CHS – Sacramento County Field Center), Johns Hopkins University (Bloomberg School of Public Health) IRB (CHS – Washington County Field Center), and University of Pittsburgh IRB (CHS – Pittsburgh Field Center); University of Alabama at Birmingham (CARDIA – Birmingham Field Center), Northwestern University IRB (CARDIA – Chicago Field Center), University of Minnesota IRB (CARDIA – Minneapolis Field Center), and Kaiser Permanente IRB (CARDIA – Oakland Field Center); Washington University IRB (FamHS); Boston University IRB (FHS); Wake Forest University Health Sciences IRB (HABC); University of Mississippi Medical Center IRB (JHS); Columbia University IRB (MESA – New York Field Center), Johns Hopkins University IRB (MESA – Baltimore Field Center), Northwestern University IRB (MESA – Chicago Field Center), University of California IRB (MESA – Los Angeles Field Center), University of Minnesota IRB (MESA – Twin Cities Field Center), Wake Forest University Health Sciences IRB (MESA – Winston-Salem Field Center) and the National Heart, Lung, and Blood Institute; Medisch Ethische Toetsings Commissie (METC) at the Erasmus Medical Center, and the Netherlands Ministry of Health, Welfare and Sport (VWS) (RS). Joint calling of the array data was approved by the Committee for the Protection of Human Subjects (CPHS) which serves as the IRB for the University of Texas Health Science Center at Houston.
Study samples were processed on the HumanExome BeadChip v1.0 (Illumina, Inc., San Diego, CA) querying 247,870 variable sites described elsewhere (see Data Access) using standard protocols suggested by the manufacturer at the following seven genotyping centers: Broad Institute (JHS), Cedars-Sinai Medical Center (CHS, FamHS and MESA), Erasmus Medical Center (RS), Illumina Fast Track Services (FHS), University of Texas Health Science Center at Houston (AGES, ARIC and CARDIA), University of Washington (CABS), and Wake Forest University (HABC). Each center genotyped a common set of 96 HapMap samples to be utilized for quality control and determination of batch effects. The two channel raw data files (.idat) for all samples were transferred to a central location and assembled into a single project for joint calling. A summary of the samples genotyped within each cohort by race and gender is described in Table 3. The following variables were provided for each sample included in the project: study specific sample ID, cohort name, sample type (DNA or WGA), race (self-reported), gender, sample plate, sample well, chip barcode, chip position, replicate ID, father and mother IDs (if applicable).
The Illumina GenomeStudio v2011.1 software was utilized with the GenTrain 2.0 clustering algorithm. Genomic DNA study samples and HapMap controls with call rates >99% (n=55,142) were used to define genotype clusters with races combined and reruns excluded. The no-call threshold was set to 0.15 and we excluded female Y SNPs when calculating SNP statistics. The genotype quality score, representing the 10th percentile of the distribution of GenCall scores across all SNPs genotyped (p10GC), was visually examined in a scatter plot across all samples (Index vs. p10GC). Samples with an empirically determined p10GC <0.38 were identified as outliers and flagged for exclusion. The SNP parameters “Expected Number of Clusters of Y SNPs” and “Expected Number of Clusters of mtSNPs” were set to 2. Following automated clustering, all variants meeting the criteria provided in Table 4 (n=107,175) were visually inspected and manually clustered, if possible, by two independent laboratory technicians. AA and BB theta deviation cutoffs were determined empirically. Variants removed from the HumanExome BeadChip v1.1 (n=4,969) and cautious sites, as defined by the exome chip design committee (n=333) (ftp://share.sph.umich.edu/exomeChip/IlluminaDesigns/cautiousSites/cautiousSite.sorted.sites), were also inspected. Samples with a call rate between 0.95 and 0.99 that had been previously excluded were brought back in to the project and re-inspected based on the criteria listed in Table 4. This additional review was necessary as the CHARGE exome chip project contains samples from multiple DNA sources and ethnicities that were genotyped at several centers. SNPs exhibiting obvious batch effects were excluded. After joint calling, reproducibility and heritability statistics, SNP statistics and sample statistics were updated and the SNP-level quality control criteria described in Table 5 were implemented. SNPs with reproducibility (rep) errors >2, parent-parent-child (PPC) error >1, or parent-child (PC) error >1 were not excluded, but were flagged and reported back to the participating studies for further investigation. A list of the 8,994 excluded variants is provided on the CHARGE exome chip website as cluster positions for these sites are zeroed out in the.egt file (note: all SNP statistics for these sites will be converted to zero when the cluster file is imported into the Genome Studio project). A portion of excluded SNPs may be recoverable in projects with a homogenous population substructure, and we recommend clustering and reviewing the subset of variants with the user’s high quality samples. Table 6 describes the exome chip content and number of variants excluded by functional category (see Annotation). Importantly, the v1.0 cluster file should not be used for calling the Illumina v1.1 exome chip as the two versions were manufactured with different bead pools.
Genotypes derived from available exome sequencing of 540 ARIC participants were used as the comparison dataset to test the performance of the exome chip. We excluded 10 individuals from the sequencing dataset due to a high missing data rate <0.90, or non-overlap of individuals with existing exome chip data. Exome sequencing data is accessible via dbGaP as part of the National Heart Lung and Blood Institute (NHLBI) GO-ESP: Heart Cohorts Component of the Exome Sequencing Project (ARIC) (Study Accession: phs000398.v1.p1).
The following variants were excluded from the exome chip dataset as they were not available in the genotype data derived from exome sequencing results: replicate sites that were determined as triallelic or duplicates on opposite strands, short insertion/deletions, XY chromosome SNPs, Y chromosome SNPs, mitochondrial SNPs, or sites not identified in the exome sequencing dataset (n=56,042). Poor performing variants identified by our best practices criteria were removed if not previously excluded (n=6,709), thus a total of 185,119 variants were available for concordance analyses in 530 individuals.
Since concordance results are potentially high due to rare variation on the exome chip, we also calculated uncertainty coefficients  to determine the degree of association between each of the exome chip calling methods and exome sequence data. The uncertainty coefficient is a measure of association that is based on information entropy , or the uncertainty in a random variable, that is, a variable subject to chance variations. Uncertainty coefficients are useful when evaluating results obtained from clustering algorithms since genotype classification is usually random (all minor alleles are not classified as either AA or BB), thus the algorithm is not susceptible to rare variation bias in which the more common genotype could have been called by chance alone. See Press et al. (1992), pp. 758–762, for further clarification of the uncertainty coefficient metric .
Annotation of the v1.0 exome chip variants was performed with dbNSFP . The dbNSFP v2.0 annotations are available on the CHARGE exome chip public website in the SNP information file. dbSNP rs information has been curated and a look up table with the associated Illumina SNP name is also available. The reason for inclusion of the variant on the exome chip by the design team is also provided in the SNP info file (ftp://share.sph.umich.edu/exomeChip/IlluminaDesigns/annotatedList.txt).
The following CHARGE supporting documents are located at chargeconsortium.com/main/exomechip: CHARGE_ExomeChip_Best_Practices.pdf, CHARGE_ExomeChip_v1.0_Cluster_File.egt (cluster file for v1.0 chip), CHARGE_ExomeChip_v1.0_Excluded_Variants.txt (list of 8,994 zeroed out variants in cluster file), CHARGE_ExomeChip_SNP_Info_File.tsv.txt and Read Me file includes Illumina annotation, dbNSFP annotation, dbSNP rs numbers, overlapping sites between the HumanExome BeadChip v1.0 and v1.1, reason for inclusion, and race specific allele frequencies for each variant, including HapMap controls. Sample identifiers (CHARGE_ExomeChip_HapMap96_Control_List.csv) and genotypes for the 96 unrelated HapMap controls (CHARGE_ExomeChip_HapMap96_Genotype_Data.csv) are also available.
The Illumina genotyping protocol (Infinium_Best_Practices_370-2009-010.pdf) and cluster file (HumanExome-12v1.egt) are available with a MyIllumina login at https://icom.illumina.com/. The exome chip content data sheet is publicly available at http://www.illumina.com/documents/products/datasheets/datasheet_humanexome_beadchips.pdf.
zCall is a rare variant caller for array-based genotyping provided by Goldstein et al. and available for download at github.com/jigold/zCall . PLINK is a freely available analysis toolset at http://pngu.mgh.harvard.edu/purcell/plink/ .
We thank all participating cohorts and institutions for their collaboration in this large-scale effort, and acknowledge the important role of the CHARGE (Cohorts for Heart and Aging Research in Genome Epidemiology) consortium in the development and support of this manuscript. We would also like to recognize the following individuals at UT for their expertise and participation in the genotype calling, data management and analyses, respectively: Irina Strelets, Genesis Williams, and Kim Lawson. Also, we are thankful to Jennifer Brody at the University of Washington for her contributions to the exome chip supporting documentation.
The CHARGE investigators request that publications resulting from these exome chip data also cite their original publication: Psaty BM, O’Donnell CJ, Gudnason V, Lunetta KL, Folsom AR, Rotter JI, Uitterlinden AG, Harris TB, Witteman JC, Boerwinkle E; CHARGE Consortium. Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium: Design of prospective meta-analyses of genome-wide association studies from five cohorts. Circ Cardiovasc Genet 273-80, 2009.
Support for the centralized work was provided by Building on GWAS for NHLBI-diseases: the U.S. CHARGE consortium through the National Institutes of Health (NIH) American Recovery and Reinvestment Act of 2009 (ARRA) (5RC2HL102419) (PI: E. Boerwinkle). Funding support of the individual cohorts is detailed below. The Age, Gene/Environment, Susceptibility-Reykjavik (AGES) study is funded by NIH contract N01-AG-12100, the NIA Intramural Research Program, the Icelandic Heart Association, and the Icelandic Parliament. The Atherosclerosis Risk in Communities (ARIC) Study is carried out as a collaborative study supported by National Heart, Lung, and Blood Institute (NHLBI) contracts (HHSN268201100005C, HHSN268201100006C, HHSN268201100007C, HHSN268201100008C, HHSN268201100009C, HHSN2682011000010C, HHSN2682011000011C, and HHSN2682011000012C). The authors thank the staff and participants of the ARIC study for their important contributions. The research reported in this manuscript on behalf of the Cardiac Arrest Blood Study (CABS) was supported by grants from the National Heart, Lung, and Blood Institute (HL41993 and HL091244); the University of Washington Clinical Nutrition Research Unit (DK-35816); and the Medic One Foundation, Seattle, WA. The authors gratefully acknowledge the Coronary Artery Risk Development in Young Adults (CARDIA) study participants and staff for their valuable contributions. The CARDIA study is funded by contracts N01-HC-95095, N01-HC-48047, N01-HC-48048, N01-HC-48049, N01-HC-48050, N01-HC-45134, N01-HC-05187, N01-HC-45205, and N01-HC-45204 from the National Heart, Lung, and Blood Institute to the CARDIA investigators. Genotyping of the CARDIA participants was supported by grants U01-HG-004729 and R01-HL-084099. This manuscript has been reviewed by CARDIA for scientific content and consistency of data interpretation with previous CARDIA publications. This CHS research was supported by NHLBI contracts N01-HC-85239, N01-HC-85079 through N01-HC-85086; N01-HC-35129, N01 HC-15103, N01 HC-55222, N01-HC-75150, N01-HC-45133, HHSN268201200036C and NHLBI grants HL080295, HL087652, HL105756 with additional contribution from NINDS. Additional support was provided through AG-023629, AG-15928, AG-20098, and AG-027058 from the NIA. See also http://www.chs-nhlbi.org/pi. DNA handling and genotyping was supported in part by National Center of Advancing Translational Technologies CTSI grant UL1TR000124, the National Institute of Diabetes and Digestive and Kidney Diseases grant DK063491 to the Southern California Diabetes Endocrinology Research Center, and Cedars-Sinai Board of Governors’ Chair in Medical Genetics (JIR). The Rotterdam Study is supported by Erasmus Medical Center and Erasmus University, Rotterdam, Netherlands Organization for the Health Research and Development (ZonMw), the Research Institute for Diseases in the Elderly (RIDE), the Ministry of Education, Culture and Science, the Ministry for Health, Welfare and Sports, the European Commission (DG XII), and the Municipality of Rotterdam. The authors are grateful to the study participants, the staff from the Rotterdam Study and the participating general practitioners and pharmacists. The NHLBI’s Framingham Heart Study is a joint project of the National Institutes of Health and Boston University School of Medicine and was supported by contract N01-HC-25195 and its contract with Illumina, Inc. for genotyping services. The Family Heart Study (FamHS) was supported by grants from the National Heart, Lung, & Blood Institute (U01 HL56563, U01 HL56564, U01 HL56565, U01 HL56566, U01 HL56567, U01 HL56568, U01 HL56569, and K01-HL70444). The Health, Aging, and Body Composition (HABC) Study is supported by NIA contracts N01AG62101, N01AG62103, and N01AG62106. The genome-wide association study was funded by NIA grant 1R01AG032098-01A1 to Wake Forest University Health Sciences. This research was supported in part by the Intramural Research Program of the NIH, National Institute on Aging. The Jackson Heart Study (JHS) is supported by the National Heart, Lung, and Blood Institute (N01 HC-95170, N01 HC-95171 and N01 HC-95172) and the National Center on Minority Health and Health Disparities. The authors thank the staff, interns and participants in JHS for their long-term commitment and important contributions to the study. MESA is supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with MESA investigators. Support for MESA is provided by contracts N01-HC-95159 through N01-HC-95169 and UL1-RR-024156. Funding for MESA Family is provided by grants R01- HL-071051, R01-HL-071205, R01-HL-071250, R01-HL-071251, R01-HL-071252, R01-HL-071258, R01-HL-071259, UL1-RR-025005. The authors thank the other investigators, the staff, and the participants of the MESA study for their valuable contributions. A full list of participating MESA investigators and institutions can be found at http://www.mesa-nhlbi.org. Gina Peloso was supported by Award Number T32HL007208 from the National Heart, Lung, And Blood Institute. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Heart, Lung, and Blood Institute or the National Institutes of Health. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.