Conceived and designed the experiments: BJK ST SSM TB TSP LG JCB DNF GJP PdB MF CWKC RRF EB JGW MIM SG MR JCE DAN JH. Performed the experiments: BJK MH SDB AM EF CEK WK HH SSA JCE. Analyzed the data: BJK ST SSM TB JTG LG JCB SFAG HC MH WK HH JCE DAN JH. Contributed reagents/materials/analysis tools: BJK ST SSM TB LG HC MH GJP SA YG ML SD PdB SDB AM ACE KDT XG SSW TS LCG MB ASH AH NP CWKC WHO ALP PM MC TAD DR ASW TC NJS AJL EES WK MIM SK HH SSA MR JCE DAN DJR JH. Wrote the paper: BJK ST TSP SFAG DNF GJP YG EB JGW MIM HH JCE DAN DJR JH GAF.
A wealth of genetic associations for cardiovascular and metabolic phenotypes in humans has been accumulating over the last decade, in particular a large number of loci derived from recent genome wide association studies (GWAS). True complex disease-associated loci often exert modest effects, so their delineation currently requires integration of diverse phenotypic data from large studies to ensure robust meta-analyses. We have designed a gene-centric 50 K single nucleotide polymorphism (SNP) array to assess potentially relevant loci across a range of cardiovascular, metabolic and inflammatory syndromes. The array utilizes a “cosmopolitan” tagging approach to capture the genetic diversity across ~2,000 loci in populations represented in the HapMap and SeattleSNPs projects. The array content is informed by GWAS of vascular and inflammatory disease, expression quantitative trait loci implicated in atherosclerosis, pathway based approaches and comprehensive literature searching. The custom flexibility of the array platform facilitated interrogation of loci at differing stringencies, according to a gene prioritization strategy that allows saturation of high priority loci with a greater density of markers than the existing GWAS tools, particularly in African HapMap samples. We also demonstrate that the IBC array can be used to complement GWAS, increasing coverage in high priority CVD-related loci across all major HapMap populations. DNA from over 200,000 extensively phenotyped individuals will be genotyped with this array with a significant portion of the generated data being released into the academic domain facilitating in silico replication attempts, analyses of rare variants and cross-cohort meta-analyses in diverse populations. These datasets will also facilitate more robust secondary analyses, such as explorations with alternative genetic models, epistasis and gene-environment interactions.
Cardiovascular disease (CVD), the leading cause of death in the developed world, has been shown to have significant heritability. The pattern of CVD in developed countries has changed as the detection and management of risk factors such as hypertension, hypercholesterolemia and predisposition to thrombosis has coincided with a decline in the incidence of myocardial infarction (MI) and stroke. Efforts to discover genetic determinants of complex disease have included analyses of genetic variation, using SNPs, between populations of individuals differing in incident or prevalent disease traits and/or clinical events. However, many apparent associations have not replicated for reasons including inadequate sample size, imprecise or inaccurate phenotyping, insufficiently stringent statistical thresholds, genuine heterogeneity of causality and population stratification. The International HapMap Project, combined with advances in genotyping technologies, has led to the generation of multiple array-based SNP genotyping products for GWAS. These developments enable reasonably dense and unbiased global scans of the human genome, which have already identified novel loci associated with CVD. Despite the value of the GWAS approach, a number of limitations exist, including cost and incomplete coverage in the HapMap samples. GWAS also have relatively low power to detect subtle, but potentially important, effects in studies of “typical” sample sizes. For example, calculations of the general power to detect a primary effect using an array with >500 K SNPs are depicted in Figure 1.
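To give a feel for the power limitation described above, the following Python sketch approximates the power of an allelic case-control test with a normal approximation. It is not the calculation behind Figure 1: the function name, the example allele frequency and odds ratio, and the Bonferroni-style threshold of 0.05 divided by 500,000 tests are all illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def allelic_test_power(maf, odds_ratio, n_cases, n_controls, alpha):
    """Approximate power of a two-sided two-proportion z-test on allele counts."""
    # Assumption: the allele frequency in controls equals the stated MAF.
    p0 = maf
    # Allele frequency in cases implied by the allelic odds ratio.
    odds1 = odds_ratio * p0 / (1 - p0)
    p1 = odds1 / (1 + odds1)
    # Each individual contributes two alleles.
    alleles_cases, alleles_controls = 2 * n_cases, 2 * n_controls
    se = sqrt(p1 * (1 - p1) / alleles_cases + p0 * (1 - p0) / alleles_controls)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    z_effect = abs(p1 - p0) / se
    # Probability of exceeding the critical value in the direction of the effect.
    return 1 - NormalDist().cdf(z_crit - z_effect)


# Hypothetical scenario: MAF 0.2, odds ratio 1.15, ~500 K independent tests.
alpha_gwas = 0.05 / 500_000
for n in (1_000, 5_000, 20_000):
    print(n, round(allelic_test_power(0.2, 1.15, n, n, alpha_gwas), 3))
```

Under these assumptions, power for a modest effect rises steeply with sample size, which is the motivation for pooling cohorts on a common platform.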
Array-based genotyping technologies that have enabled GWAS also permit flexibility in choosing the scope and density of SNPs for candidate gene studies. For example, they allow improved exploitation of recent deep resequencing data, enabling more accurate capture of genetic diversity across populations. Such custom platforms, at scale, allow inclusion of large numbers of plausible candidate loci with a marginal increase in cost.
We describe here the design and implementation of a custom 50 K SNP genotyping array, primarily aimed at assaying SNPs in candidate genes and pathways for cardiovascular, inflammatory and metabolic phenotypes. Design of this genotyping array was led by investigators from the Institute of Translational Medicine and Therapeutics (ITMAT), the Broad Institute and by the National Heart Lung and Blood Institute (NHLBI) supported Candidate-gene Association Resource (CARe) Consortium. The custom SNP array is hereafter referred to as the “IBC array” (ITMAT-Broad-CARe array). A consortium of international academic and industrial partners have committed to using the IBC array to genotype DNA from more than 200,000 individuals who have been extensively phenotyped for risk factors and clinical evidence of vascular disease. The objectives of forming this consortium were (i) to pool expertise for selection of both loci and SNPs; (ii) to reduce costs by producing a standardized genotyping platform; and (iii) to facilitate cross cohort meta-analyses for a large set of SNPs in high priority candidate loci. Here we formally describe the resource and assess coverage of the genetic variants from prioritized loci generated on the SNP panel with the HapMap populations. We also evaluate the coverage of the array with the major GWAS products.
We used the following search string in PubMed ‘(genotype OR snp OR allele* OR polymorphism OR variant) AND (coronary OR heart OR myocardial OR cardiac OR ischemic OR hypertension OR thrombosis) AND (linkage OR association OR control OR randomized OR trial)’ covering publications from 1978 to May 2007 for version 1 of the IBC array (IBCv1), and to October 1st 2007 for version 2 of the IBC array (IBCv2). Key information was collated including: PMID number, publishing journal, size and population examined, loci and SNPs studied (including the respective rs numbers, where retrievable) and functional evidence. Over 2,400 published studies were systematically analyzed. Emphasis was placed on sample size, data quality and strength of the described associations. Genes with known or putative association with phenotypes for sleep, lung, and blood diseases were also nominated. Input was also solicited directly from investigators within and outside the consortium.
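For readers who wish to rerun a comparable search programmatically, the sketch below builds a request URL for the NCBI E-utilities `esearch` endpoint using the search string quoted above. The text does not say the search was run this way; the helper name `build_esearch_url`, the `retmax` value and the use of publication date (`pdat`) as the date field are assumptions. No network request is made.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Search string as quoted in the text.
QUERY = (
    "(genotype OR snp OR allele* OR polymorphism OR variant) "
    "AND (coronary OR heart OR myocardial OR cardiac OR ischemic "
    "OR hypertension OR thrombosis) "
    "AND (linkage OR association OR control OR randomized OR trial)"
)


def build_esearch_url(term, min_date, max_date, retmax=100):
    """Return an esearch URL restricted to a publication-date range."""
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",   # assumption: filter on publication date
        "mindate": min_date,  # accepts YYYY, YYYY/MM or YYYY/MM/DD
        "maxdate": max_date,
        "retmax": retmax,     # hypothetical page size
    }
    return ESEARCH + "?" + urlencode(params)


ibcv1_url = build_esearch_url(QUERY, "1978", "2007/05")
ibcv2_url = build_esearch_url(QUERY, "1978", "2007/10/01")
print(ibcv1_url)
```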
Several pathway-based tools were used to identify additional biologically plausible candidate genes: Kyoto Encyclopedia of Genes and Genomes (KEGG); Protein ANalysis THrough Evolutionary Relationships (PANTHER) (www.pantherdb.org); and BioCarta (www.biocarta.com). These tools were employed to collate additional genes from key pathways including lipid metabolism, thrombogenesis, circulation and gas exchange, insulin resistance, metabolism, inflammation, oxidative stress and apoptosis.
Early access was provided to a number of unpublished mouse atherosclerosis expression quantitative trait loci (eQTL) datasets. Genes predicted to be causal for atherosclerotic lesion size in genetic crosses of mice with differing susceptibility to atherosclerosis were identified  based on (i) the correlation between transcript levels and lesion size, (ii) the overlap of expression and atherosclerosis QTLs and (iii) the likelihood of a causal rather than a dependent or reactive relationship based on Bayesian modeling.
Early access was provided to a number of key findings emerging from five GWAS:
A three-way meta-analysis of the WTCCC, Broad-Novartis-Lund and FUSION studies led to the generation of stronger T2D candidate loci for inclusion on the custom array. We also included SNPs reaching genome-wide significance from the WTCCC Rheumatoid Arthritis, Crohn's Disease and Type-1 Diabetes studies.
Over 2,400 of the collated loci were placed on a database (http://bmic.upenn.edu/cvdsnp) along with key information displayed for each respective gene: the number of SNPs required to tag the four HapMap representative populations at various minor allele frequencies (MAFs) and r2 thresholds; SymAtlas® expression profiling for over 70 specific human tissues and cell-types; links to National Center for Biotechnology Information (NCBI), Online Mendelian Inheritance in Man (OMIM) and other reference databases; public resequencing information; Jackson Lab Mouse (http://jaxmice.jax.org) and other phenotypic data. A voting system built into this database facilitated consensus amongst the consortium investigators for ranking genes proposed for inclusion on the IBC array. Over 2,000 of these loci were prioritized into three density criteria for tagging, as described below, based on voting by the participating investigators.
Group 1 (n=435 loci); genes and regions with a high likelihood of functional significance, including established mediators of vascular disease, loci derived from GWAS and those shown to be associated with phenotypes of interest. Tag SNPs for these loci were selected to capture known variation with MAF>0.02 and an r2 of at least 0.8 in HapMap populations and SeattleSNPs where available (for formal description, see Calculation of Coverage section below).
Group 2 (n=1,349 loci); candidate loci that are potentially involved in phenotypes of interest or established loci that required very large numbers of tagging SNPs. SNPs for these loci were selected for MAFs>0.05 with an r2 of at least 0.5 in HapMap populations and SeattleSNPs where available.
Group 3 (n=232 loci); comprised mainly of the larger genes (>100 kb) which were of lower interest a priori to the consortium investigators. Only non-synonymous SNPs (nsSNPs) and known functional variants of MAF>0.01 were captured for these loci.
Assays for specific SNPs of known or putative functionality and those shown to be highly associated with vascular disease from literature searching were directly ‘forced’ into the array content, with the aim of facilitating more powerful downstream meta-analyses with previously published data. nsSNPs and known functional variants of MAF>0.01 were selected where possible for all genes of interest.
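The three density tiers and the rule for functional variants can be summarized as a small lookup table. The Python sketch below encodes the thresholds stated above; the class, dictionary and function names are hypothetical, and the decision logic is a simplified reading of the text rather than the consortium's actual selection pipeline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaggingRule:
    min_maf: float            # variants must have MAF strictly above this value
    min_r2: Optional[float]   # r2 threshold for LD tagging; None = no LD tagging
    functional_only: bool     # True if only nsSNPs/functional variants are taken


# Thresholds as stated in the text for Groups 1, 2 and 3.
GROUP_RULES = {
    1: TaggingRule(min_maf=0.02, min_r2=0.8, functional_only=False),
    2: TaggingRule(min_maf=0.05, min_r2=0.5, functional_only=False),
    3: TaggingRule(min_maf=0.01, min_r2=None, functional_only=True),
}


def is_target(group, maf, is_functional):
    """Return True if a variant is a candidate for capture in its locus group."""
    rule = GROUP_RULES[group]
    # nsSNPs and known functional variants with MAF>0.01 are targeted in all groups.
    if is_functional and maf > 0.01:
        return True
    if rule.functional_only:
        return False
    return maf > rule.min_maf


print(is_target(2, 0.03, False), is_target(2, 0.03, True))
```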
SNPs from Group 1 and 2 loci were first chosen using the TAGGER software. Assays for SNPs in Group 1 loci were designed to be inclusive of the intronic and exonic regions, untranslated regions (UTRs) and 5 kb of the proximal promoter regions derived from NCBI build 35, with intronic, exonic and flanking UTRs covered for the Group 2 loci. This approach generated a set of tag SNPs and multimarker predictors that capture variation in the four HapMap populations (CEU, Centre d'Etude du Polymorphisme Humain collection; CHB, Han Chinese in Beijing, China; JPT, Japanese individuals from Tokyo; YRI, Yoruba from Ibadan, Nigeria; HapMap Data release 21/phase II July 2006 on NCBI build 35, dbSNP build 125). Where available, we also employed SeattleSNPs (http://pga.gs.washington.edu) and Environmental Genome Project (EGP) (http://egp.gs.washington.edu) resequencing data to identify additional tags, not represented in the HapMap populations, using ldSelect. We chose SNPs that were observed at least twice in unrelated individuals.
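The idea of pairwise tagging can be illustrated with a simplified greedy scheme: repeatedly pick the SNP that captures the most remaining SNPs at the chosen r2 threshold. This Python sketch is not the TAGGER or ldSelect implementation (it ignores multimarker predictors, for example), and the rs identifiers and r2 values are made up for illustration.

```python
def greedy_tags(snps, r2, threshold):
    """Pick tag SNPs so that every SNP has r2 >= threshold with at least one tag.

    snps: list of SNP identifiers.
    r2: dict mapping (snp_a, snp_b) pairs to pairwise r2; missing pairs count as 0.
    """
    def ld(a, b):
        if a == b:
            return 1.0
        return r2.get((a, b), r2.get((b, a), 0.0))

    untagged = set(snps)
    tags = []
    while untagged:
        # Choose the SNP capturing the most untagged SNPs; sorting first makes
        # ties resolve to the alphabetically first identifier.
        best = max(
            sorted(untagged),
            key=lambda s: sum(ld(s, t) >= threshold for t in untagged),
        )
        tags.append(best)
        untagged -= {t for t in untagged if ld(best, t) >= threshold}
    return tags


# Hypothetical locus with four SNPs and made-up pairwise r2 values.
example_snps = ["rs1", "rs2", "rs3", "rs4"]
example_r2 = {
    ("rs1", "rs2"): 0.9,
    ("rs2", "rs3"): 0.85,
    ("rs1", "rs3"): 0.6,
    ("rs3", "rs4"): 0.3,
}
print(greedy_tags(example_snps, example_r2, 0.8))
print(greedy_tags(example_snps, example_r2, 0.5))
```

A stricter threshold (as used for Group 1) generally requires at least as many tags per locus as a looser one (as used for Group 2), which is why the tiers trade marker density against the number of loci that fit on the array.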
SNPs were categorized by their assay design scores for the Infinium genomic platform technology (Illumina, CA), based on a theoretical algorithm and all previously attempted wet-lab Infinium assays. In an attempt to reduce the proportion of failed assays on the final product, we pre-filtered most SNPs for Infinium design scores >=0.6, finding appropriate proxies where possible.
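The pre-filtering step can be sketched as follows: keep a SNP if its design score meets the cutoff, otherwise substitute the best-scoring proxy that does. The function name, the shape of the inputs and the choice of the highest-scoring proxy are assumptions made for illustration; the text states only the 0.6 cutoff and that proxies were used where possible.

```python
def choose_assay(snp, design_scores, proxies, min_score=0.6):
    """Return the SNP to assay: the SNP itself, a proxy, or None if neither qualifies.

    design_scores: dict of SNP id -> assay design score.
    proxies: dict of SNP id -> list of SNP ids in strong LD with it (hypothetical input).
    """
    if design_scores.get(snp, 0.0) >= min_score:
        return snp
    usable = [p for p in proxies.get(snp, []) if design_scores.get(p, 0.0) >= min_score]
    if not usable:
        return None
    # Assumption: prefer the proxy with the highest design score.
    return max(usable, key=lambda p: design_scores[p])


# Made-up scores and proxy lists.
scores = {"rs10": 0.9, "rs11": 0.4, "rs12": 0.7, "rs13": 0.95, "rs14": 0.2}
proxy_map = {"rs11": ["rs12", "rs13"], "rs14": ["rs11"]}
print(choose_assay("rs10", scores, proxy_map))
print(choose_assay("rs11", scores, proxy_map))
print(choose_assay("rs14", scores, proxy_map))
```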
Two panels of ~1,500 and ~400 admixture and Ancestry Informative SNP markers (AIMs) were included for African versus European ancestry, and regional European (e.g. Northern versus Southern) ancestral populations, respectively, to enable admixture mapping and adjustment for population stratification in studies comprising individuals from these ancestries. These SNPs were based on panels generated previously, excluding SNPs that deviated from Hardy-Weinberg equilibrium (i.e. retaining those with P>0.01). The AIMs panels are listed within the IBC resource site (http://bmic.upenn.edu/cvdsnp/updates/ancestry_informative_markers-ibc-v1.xls). The incorporation of admixture and AIM panels enables admixture mapping in African Americans and adjustment for population stratification in African Americans and European Americans.
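A Hardy-Weinberg filter of the kind mentioned above can be sketched with a one-degree-of-freedom chi-square test on genotype counts. The text does not state which test was used, so the chi-square form is an assumption (exact tests are a common alternative). A marker is flagged here when its P value falls below 0.01, which is the usual convention and is an assumption about how the stated cutoff was applied.

```python
from math import erfc, sqrt


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """P value of a 1-df chi-square test for Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation (monomorphic markers).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 df.
    return erfc(sqrt(chi2 / 2))


def fails_hwe(n_aa, n_ab, n_bb, cutoff=0.01):
    """Flag a marker whose genotype counts depart from HWE at the given cutoff."""
    return hwe_chi2_pvalue(n_aa, n_ab, n_bb) < cutoff


print(fails_hwe(25, 50, 25), fails_hwe(50, 0, 50))
```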
Genomic regions demonstrating ultra-high conservation across species were identified as previously described. Briefly, regions were identified with at least 98% sequence similarity, with a minimum length of 200 nucleotides within human-mouse-dog, human-mouse-rat or human-chicken alignments. In addition, conserved regions with sequence identity of at least 95% near the Group 1, 2 and 3 loci were selected. All variants within these regions (n=1023 SNPs), as evident in at least one HapMap population, were included on the IBC array.
Assays for 49,234 SNPs were attempted using the Infinium technology for IBCv1, which became available to consortium members in October 2007. Assays for an additional 4,050 SNPs were added to the initial content to comprise the IBCv2 array, to be released in the summer of 2008. The additional IBCv2 SNP content was mainly derived from the following:
MAFs were assessed across the IBCv1 arrays in 6067 DNA samples collated from three studies with five populations of self described ethnicity, screened for cardiovascular traits; Caucasians (n=4244 European and n=1054 US Caucasians); African Americans (n=384) and South Asians (n=385). All samples described were genotyped following approval by the relevant institutional review boards. In each respective population, the minor allele frequency for each SNP on the IBCv1 array was determined. Histograms were generated with various allele frequency bins to determine the distribution of allele frequencies in each population.
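The per-population MAF calculation and binning described above can be sketched in a few lines of Python. Genotypes are coded as 0, 1 or 2 copies of one allele with `None` for a missing call; this coding, the function names and the bin edges are assumptions chosen for illustration and are not the bins used in Figure 2.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """MAF from genotypes coded as 0/1/2 copies of allele B; None = missing call."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    freq_b = sum(called) / (2 * len(called))
    return min(freq_b, 1 - freq_b)


def maf_histogram(mafs, edges):
    """Count MAFs into bins [edges[i], edges[i+1]); the last bin includes its upper edge."""
    counts = Counter()
    for maf in mafs:
        for lo, hi in zip(edges, edges[1:]):
            if lo <= maf < hi or (hi == edges[-1] and maf == hi):
                counts[(lo, hi)] += 1
                break
    return counts


# Hypothetical genotype vectors for four SNPs in one population.
population = {
    "snp_a": [0, 1, 2, None],
    "snp_b": [0, 0, 0, 1],
    "snp_c": [2, 2, 2, 1],
    "snp_d": [0, 0, 0, 0],
}
mafs = [minor_allele_frequency(g) for g in population.values()]
bin_edges = [0.0, 0.005, 0.01, 0.05, 0.5]  # illustrative bins only
print(maf_histogram(mafs, bin_edges))
```

A SNP with a MAF of exactly zero in a population, such as `snp_d` here, corresponds to a monomorphic assay in that population.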
We used previously described methods to calculate coverage of HapMap SNPs. Briefly, pairwise r2 values were calculated using the expectation algorithm based on the genotypes from HapMap release 22. Maximum r2 values were calculated for each SNP list (HapMap release 22) with each SNP on the array being tested. All pair-wise combinations were considered within 200 kb. For chromosome X, only female individuals were used; otherwise, all unrelated individuals were used.
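The coverage statistic can be sketched as the fraction of reference SNPs whose maximum r2 to any array SNP within 200 kb meets a threshold. For brevity, the Python sketch below computes r2 from phased 0/1 haplotypes rather than estimating haplotype frequencies from unphased genotypes as in the procedure above; that simplification, the data layout and all names are assumptions.

```python
def r_squared(hap_a, hap_b):
    """Pairwise r2 between two SNPs from phased haplotypes coded 0/1."""
    n = len(hap_a)
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # at least one SNP is monomorphic
    d = p_ab - p_a * p_b
    return d * d / denom


def coverage(reference, array_snps, r2_threshold, window=200_000):
    """Fraction of reference SNPs with max r2 >= threshold to an array SNP within window.

    reference: dict of SNP id -> (position in bp, list of 0/1 haplotype alleles).
    array_snps: ids of SNPs on the array; ids absent from the reference are ignored.
    """
    on_array = [s for s in array_snps if s in reference]
    covered = 0
    for pos, hap in reference.values():
        best = 0.0
        for tag in on_array:
            tag_pos, tag_hap = reference[tag]
            if abs(tag_pos - pos) <= window:
                best = max(best, r_squared(hap, tag_hap))
        if best >= r2_threshold:
            covered += 1
    return covered / len(reference)


# Made-up reference panel of four SNPs over eight haplotypes.
panel = {
    "snpA": (1_000, [1, 1, 0, 0, 1, 0, 0, 0]),
    "snpB": (5_000, [1, 1, 0, 0, 1, 0, 0, 0]),
    "snpC": (9_000, [1, 0, 1, 0, 1, 0, 1, 0]),
    "snpD": (1_000_000, [1, 1, 0, 0, 1, 0, 0, 0]),
}
print(coverage(panel, ["snpA"], 0.8))
```

Combined coverage of two products can be approximated in the same way by passing the union of their SNP lists as `array_snps`.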
Assays for 49,234 SNPs were attempted during manufacturing with 45,237 SNPs successfully passing the manufacturer's criteria. Reasons for failures included sub-optimal probe synthesis and insufficient resolving of assay traces, potentially due to nearby hidden SNPs or copy number variants (CNVs). Table 1 outlines the type of genetic variants contained on IBCv1. Table S1 shows the expected and observed conversion rates across the passing SNPs. Over 1,300 more SNPs failed than had been predicted by the theoretical conversion scores.
DNA samples from 117 HapMap individuals were genotyped on the IBCv1. We have made these genotype files available for the community (http://bmic.upenn.edu/cvdsnp/updates/hapmap_qc_samples-illumina.xls). 37,431 (82.7%) SNPs are evident in HapMap. Approximately 99.5% concordance was observed against respective HapMap data across the 117 samples and 52 Mendelian inconsistencies were observed in 25 HapMap trios (Table S2). All inconsistencies were attributable to a deleted region from chromosome 1 in a proband which may be caused by a bona fide de novo micro-deletion event or an artifact of the DNA derived from the EBV-immortalised cell-lines. Complete reproducibility was observed across six replicate samples (Table S3).
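The two quality checks reported above, genotype concordance and Mendelian consistency in trios, can be sketched as follows. Genotypes are coded as 0, 1 or 2 copies of one allele with `None` for a missing call; this coding and the function names are assumptions, and the check applies to autosomal markers only.

```python
def concordance(calls_a, calls_b):
    """Fraction of matching genotype calls among positions called in both sets."""
    pairs = [(a, b) for a, b in zip(calls_a, calls_b) if a is not None and b is not None]
    if not pairs:
        return None
    return sum(a == b for a, b in pairs) / len(pairs)


def transmittable(genotype):
    """Alleles (0 = A, 1 = B) that a parent carrying 0/1/2 copies of B can pass on."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]


def mendelian_consistent(father, mother, child):
    """True if the child's autosomal genotype can arise from the parents' genotypes."""
    return any(
        f + m == child
        for f in transmittable(father)
        for m in transmittable(mother)
    )


print(concordance([0, 1, 2, None], [0, 1, 1, 2]))
print(mendelian_consistent(0, 2, 1), mendelian_consistent(0, 2, 2))
```

The second call illustrates one way a deletion can produce an apparent inconsistency: if the child carries only the maternal copy of a region, a hemizygous genotype may be called as homozygous, which cannot be explained by the parental genotypes.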
MAFs were assessed across the IBCv1 arrays in 6067 DNA samples collated from three studies with five populations of self described ethnicity, screened for cardiovascular traits; Caucasians (n=4244 European and n=1054 US Caucasians); African Americans (n=384); and South Asians (n=385). Some 1415 assays across the complete dataset were monomorphic. 2705 and 2566 assays were, respectively, monomorphic across self-described Caucasians and African Americans. Figure 2 illustrates the distribution of MAFs in the Caucasian, African American and South Asian populations, respectively. The various bins for MAFs>0.01 were comparable across all populations examined. Significant variability was evident for variants with MAFs<0.005 which is expected, given the frequency of observations, the varying number of individuals in each ethnic group studied and the natural allele frequency differences of such variants across populations.
The average number of SNPs across the Group 1 and Group 2 loci of IBCv1 was compared with several GWAS products (Figure 3). The average coverage for Group 1 loci is ~36.5 SNPs per locus in IBCv1. The Illumina Human1M and Affymetrix 6.0 platforms, for comparison, have an average of ~28.0 and ~17.4 SNPs respectively across the equivalent IBC loci. The average number of SNPs observed for the Group 2 loci is ~16.3 SNPs, which is comparable with the current GWAS products.
The coverage of HapMap SNPs was evaluated for all Group 1 loci against the various GWAS products in the HapMap individuals. The maximum r2 value was calculated between each HapMap SNP in the region and a SNP in each respective product. Figures 4 (a) through (f) show the composite coverage of Group 1 loci from IBCv1 versus several GWAS products for CEU, YRI and CHB+JPT HapMap individuals using MAF cutoffs of >0.02 and >0.05 across the spectrum of r2 thresholds. The coverage using CEUs and CHB+JPT is comparable across all products, although the IBCv1 coverage for YRI is greater. A number of the GWAS products and the IBC array are strongly biased towards HapMap SNPs in their composition and will therefore show skewed coverage when directly compared. Over 20% of IBCv1 Group 1 loci SNPs have not been assayed directly in HapMap, with the majority of these additional SNPs derived from SeattleSNPs and the literature. Thus, the IBC array is likely to be more representative of broader population allelic architecture.
The combined coverage of IBCv1 with a number of the GWAS products was assessed for Group 1 loci. The coverage of the IBCv1 alone, with both of the 500 K SNP GWAS and with the one million SNP array products across the Group 1 loci is illustrated in Figure 5 under varying MAF thresholds across the HapMap populations. The combined coverage using IBCv1 with both 1 M SNP products is similar for Caucasians and Asian HapMap samples. The increase in coverage is more pronounced in African HapMap samples, reflecting the dense marker tagging for YRI in the IBC array.
We have produced a custom SNP array designed to capture genetic variation in prioritized loci known or postulated to increase risk of cardiovascular, metabolic and inflammatory diseases. Custom SNP selection allowed us to: (a) ensure selective and consistent coverage for a range of prioritized loci across multiple ancestries; (b) provide additional representative coverage to HapMap in loci of major interest, using SNP content from various sources including recent resequencing efforts; and (c) assay directly specific SNPs of interest such as those derived from previously published studies and known non-synonymous SNPs with MAF>0.01. The IBC array provides greater depth of coverage than GWAS products with respect to information content and haplotype diversity in the high priority regions. This is particularly true of coverage for African HapMap representative samples. A modest fraction of tagging SNPs from the Group 1 loci on the IBC array are derived from SeattleSNPs analyses and were not assayed directly in HapMap; it is therefore likely that the cumulative coverage of variation in these regions is underestimated in the current results. Because HapMap was predominantly used for the design of the IBC array (as well as many of the commercial products), additional densely genotyped or sequenced populations, outside those covered in the original HapMap, would be required for a completely unbiased assessment of coverage.
Despite the recent reductions in price of whole genome SNP arrays, GWAS remain expensive endeavors, and the balance of power against cost is an important factor in study design. When a two-stage GWAS design is employed, the need for custom genotyping in the second stage can increase costs per individual to a substantial fraction of the cost of the initial stage. GWAS are limited because the cost prohibits acquisition of the sample size needed to overcome the multiple testing problems inherent in gene-gene analyses. Generating a consistent set of genotypes in candidate genes within a large sample may in the short term provide a better balance between sample size and number of testable hypotheses than can be provided by the more expensive and extensive GWAS, and will likely permit a better-powered assessment of the contribution of epistasis to complex traits. Furthermore, the rational selection and greater density of coverage in these prioritized loci bias the IBC array towards detection of disease-causing loci, which complements the discovery nature of an unbiased GWAS strategy. As the IBC array is available as a standard tool to the community, the cost is greatly reduced with respect to custom genotyping. The IBC array can be used in conjunction with GWAS products to increase coverage in the high priority regions, permitting greater exploration of gene interactions and other secondary analyses for the collated high priority loci.
The HapMap project had a bias towards discovery and genotyping of variants with MAFs>0.05, but over 40% of SNPs were observed to have MAFs<0.05, and the ENCyclopedia Of DNA Elements (ENCODE) project indicates ~60% of SNPs have MAFs<0.05. Many case-control association studies of complex diseases have tended to use MAFs>0.05 due to the power constraints of typical sample sizes. Gorlov and colleagues recently postulated that SNPs that are potentially deleterious are subjected to weak purifying selection and may represent significant contributors to genetic components of common disease. Indeed, potentially damaging nsSNPs are skewed toward rarer distribution in the HapMap project, ENCODE and SeattleSNPs. In a recent study comparing the Illumina 14.5 K nsSNP array with GWAS tools (Affy 5.0, Illumina HumanHap300 and HumanHap550), Evans and colleagues found that the GWAS products failed to capture most of the rare variants present on the nsSNP platform. The major nsSNP studies attempted thus far have had modest sample sizes of ~1500 cases and controls. All nsSNPs with MAF>0.01 have been targeted in the design of the IBC array using information from both HapMap and SeattleSNPs, and a large number of key loci related to vascular diseases have been tagged down to MAFs>0.02. Analyses of such lower frequency variants will be facilitated by the formation of an international consortium of investigators committed to using this platform. This will permit collaborative meta-analyses across a broad range of phenotypes. The CARe Consortium, for example, will make their IBCv2 genotype data (n~50,000 samples) and most related phenotypic data available to the academic community.
The IBC array is one of the first disease-specific custom arrays with highly focused content to be used on a large scale. We anticipate further generations of the IBC array and that future aggregation of large cohorts and studies with similar disease traits will become commonplace, affording significant cost reductions and increased power to detect effects of modest size.
Table S1. Bins of SNPs with observed and expected Infinium conversion scores. The distribution of SNPs binned according to Infinium score from 0.1 to 1, where a score of 0.8 indicates an 80% likelihood of conversion to a successful assay, 1.0 indicates a ~100% theoretical likelihood, and so on. A value of 1.1 indicates that an Infinium assay for the SNP has previously been successful in manufacture and analyses. Percentages are indicated in brackets.
(0.05 MB DOC)
Table S2. Observed IBCv1 Mendelian consistency across 25 HapMap trios. Observed Parent-Parent-Child (PPC) heritability errors across the IBC version 1 array using 25 HapMap trios, where NA number denotes the official HapMap identifier.
(0.08 MB DOC)
Table S3. Observed replicate consistency using six HapMap individuals. Observed IBC version 1 array genotyping errors for six replicate HapMap samples, where NA number denotes the official HapMap identifier.
(0.05 MB DOC)
The CARe Consortium wishes to acknowledge the support of the National Heart, Lung, and Blood Institute and the contributions of the research institutions, study investigators, field staff and study participants in creating this resource for biomedical research. The following nine parent studies have contributed parent study data, ancillary study data, and DNA samples through the Massachusetts Institute of Technology - Broad Institute to create this genotype/phenotype database for wide dissemination to the biomedical research community: the Atherosclerosis Risk in Communities (ARIC) study, the Cardiovascular Health Study (CHS), the Cleveland Family Study (CFS), the Cooperative Study of Sickle Cell Disease (CSSCD), the Coronary Artery Risk Development in Young Adults (CARDIA) study, the Framingham Heart Study (FHS), the Jackson Heart Study (JHS), the Multi-Ethnic Study of Atherosclerosis (MESA), and the Sleep Heart Health Study (SHHS). The authors also acknowledge key contributions from the WTCCC, FUSION and DGI studies. We thank Mary Leonard for preparation of figures and Gonzalo Abecasis for contributions of loci towards IBCv2.
Competing Interests: None of the authors of this paper has a commercial interest in the product aside from the two listed authors from Illumina.
Funding: Supported by a National Institutes of Health Clinical and Translational Research Award (RR U54 RR023567) to the University of Pennsylvania and National Heart, Lung and Blood Institute (N01-HC-65226). Tushar Bhangale's work was supported by the Program for Genomic Applications supported by the NHLBI (U01 HL66642).