Four custom Axiom genotyping arrays were designed for a genome-wide association (GWA) study of 100,000 participants from the Kaiser Permanente Research Program on Genes, Environment and Health. The array optimized for individuals of European race/ethnicity was previously described. Here we detail the development of three additional microarrays optimized for individuals of East Asian, African American, and Latino race/ethnicity. For these arrays, we decreased redundancy of high-performing SNPs to increase SNP capacity. The East Asian array was designed using greedy pairwise SNP selection. However, removing SNPs from the target set based on imputation coverage is more efficient than pairwise tagging. Therefore, we developed a novel hybrid SNP selection method for the African American and Latino arrays utilizing rounds of greedy pairwise SNP selection, followed by removal from the target set of SNPs covered by imputation. The arrays provide excellent genome-wide coverage and are valuable additions for large-scale GWA studies.
Genome-wide association (GWA) studies have produced a large number of replicated novel genetic variants [1–4] for many diseases for which no variants had been previously found. The success of these studies has been a result of high-throughput genotyping platforms assaying hundreds of thousands to a million SNPs, with large sample sizes leading to an increased number of replicated associations [5,6]. Many of these studies have focused on common genetic variation (minor allele frequency (MAF) of 0.10 or greater), based on the HapMap catalog. Sequencing projects, particularly the 1000 Genomes Project (KGP) (http://www.1000genomes.org), are developing larger catalogs which can be leveraged to design arrays that assay lower frequency variants, further enabling discovery of disease-associated genetic variation.
Here we describe the development of three new microarrays for the Axiom Genotyping Solution tailored to individuals of East Asian, African American, and Latino race/ethnicity. These are the remaining three of four custom microarrays developed for the genome-wide genotyping analysis of 100,000 participants in the Kaiser Permanente Research Program on Genes, Environment and Health (RPGEH). A description of the genotyping project and RPGEH cohort is given elsewhere. Axiom arrays are limited to approximately 700,000 SNPs when SNPs are tiled with two replicates, which is the standard. Budget constraints for this project allowed for the genotyping of either a single array on 100,000 individuals or two arrays (up to 1.4 million SNPs) on 50,000 individuals. We opted to genotype 100,000 individuals with a single array. As a consequence, we chose to design four different arrays to maximize genome-wide coverage, especially for lower frequency variants, in each of the major US race/ethnicity groups (African Americans, East Asians, Latinos and Whites) represented in the RPGEH cohort.
The design of the first array in the series, optimized for US whites (designated EUR), has been described . The East Asian (EAS) array was designed for individuals of East Asian ancestry, although we also included SNPs to provide coverage of European-specific variants to accommodate some RPGEH subjects with mixed East Asian/European ancestry. The target set for the African American (AFR) array included both West African and European variants, recognizing the mixed ancestry of African Americans. Because Latinos have ancestry from three continents, we targeted SNPs common and specific to Europeans, West Africans and Native Americans for the Latino (LAT) array. These arrays were developed to maximize the number of high resolution SNPs for genome-wide coverage; to saturate regions previously identified as disease associated from prior GWA studies for both replication and fine mapping; to improve coverage of both common and uncommon variants by making use of data from the low pass and high pass phases of the KGP; and to incorporate redundant coverage of SNPs with known strong disease associations . For the EAS, AFR and LAT arrays, we used several approaches to enhance the overall genome-wide coverage, including modification to the SNP selection algorithm and reduction of the number of replicates for some SNPs on the array to create more space for additional SNPs.
Several methods have been proposed for SNP selection, starting with a greedy pairwise correlation (“tagging”) algorithm. There have also been efforts to extend pairwise tagging to tagging using multi-marker correlations to increase efficiency. However, to our knowledge, less has been done with imputation for tagging, aside from using it to tag singleton SNPs.
Imputation has played a major role in the analysis of genome-wide association data ; here we explore its use in the design of genotyping microarrays. Imputation of missing SNPs using HapMap reference samples can lead to an overall increase in power of up to 10% , and is becoming possible with larger sequenced reference panels, e.g., from the KGP. Simulations show that imputation is potentially the most beneficial for rare variants, which are harder to tag with a single marker . Several papers that imputed all variants in HapMap found significant associations with imputed SNPs that would not have been found by analyzing only the SNPs on the GWA array . Motivated by this analysis strategy of imputing all variants from a reference panel, in this paper, we describe a novel hybrid design method for selection of SNPs for genotype microarrays. The method uses alternating rounds of SNP selection based on pairwise tagging followed by rounds of target set coverage calculations based on imputation r2 values, which enables removal from the target set of SNPs that can be covered by imputation but were not covered by pairwise tagging. Using this approach, we were able to increase genome-wide coverage with the same fixed number of SNPs on the designed array.
The three new custom arrays described here utilize the Axiom Genotyping Solution (http://media.affymetrix.com/support/technical/datasheets/axiom_genotyping_solution_datasheet.pdf). Briefly, it is a two-color ligation-based assay utilizing 30-mer oligonucleotide probes synthesized in situ on a microarray substrate with automated parallel processing of 96 samples per plate, with a total of ~1.38 million features available for experimental content. In the design of the EUR array, every SNP was represented by at least 2 features (2-rep); some high-value SNPs that had poor resolution were tiled on the array with more than two representations, and hence required more than 2 features (e.g., 4 features or 8 features). As a consequence, the EUR array contains a total of 674,518 SNPs. At the time of design of the EAS, AFR and LAT arrays, it became apparent through analysis of the two representations on the EUR array that the highest resolution SNPs could be tiled on the array with a single feature with only a very small reduction in call rate. We therefore increased the genome-wide coverage of these arrays by tiling some of the highest resolution SNPs with only a single feature (1-rep), enabling greater SNP content on the arrays.
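The effect of single-feature tiling on SNP capacity can be illustrated with a rough calculation. The sketch below is only a back-of-the-envelope model: it assumes every SNP uses either one or two features, ignores SNPs tiled with four or eight features, and treats the ~1.38 million feature count as fully available for SNP content. The function name `max_snps` and the 10% single-feature fraction are illustrative choices, not values taken from the array manifests.

```python
def max_snps(total_features: int, frac_one_rep: float, default_reps: int = 2) -> int:
    """Approximate number of SNPs that fit on an array when a fraction of
    them is tiled with one feature and the rest with `default_reps` features.

    Simplifying assumption: no SNP uses more than `default_reps` features.
    """
    if not 0.0 <= frac_one_rep <= 1.0:
        raise ValueError("frac_one_rep must be between 0 and 1")
    # Average number of features consumed per SNP.
    features_per_snp = frac_one_rep * 1 + (1.0 - frac_one_rep) * default_reps
    return int(total_features // features_per_snp)


TOTAL_FEATURES = 1_380_000  # approximate figure quoted in the text

all_two_rep = max_snps(TOTAL_FEATURES, 0.0)    # every SNP tiled twice
ten_pct_one_rep = max_snps(TOTAL_FEATURES, 0.10)  # 10% of SNPs tiled once

print(all_two_rep, ten_pct_one_rep)
```

Under these assumptions, moving even a modest share of the highest resolution SNPs to a single feature frees space for tens of thousands of additional SNPs.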
At the time of design of the AFR and LAT arrays, Affymetrix introduced a new reagent kit, Axiom Reagent Kit 2.0. An increased number of SNPs were validated by Affymetrix on the new kit, providing a larger sample of candidate SNPs for the design of these two arrays. The benefits were two-fold: more of the primary, secondary and tertiary SNPs could be directly tiled onto the arrays, and a wider choice of high resolution SNPs were available for selection for genome-wide coverage.
Results in Fig. 1 for the HapMap sample African Ancestry in Southwest USA (ASW) and Fig. 2 for the Luhya in Webuye, Kenya (LWK) compare coverage for a hypothetical array designed in the Yoruba in Ibadan population (YRI) by hybrid SNP selection to one designed by pairwise tagging (on chromosome 21), and show that the hybrid SNP selection algorithm outperforms the pairwise tagging selection algorithm by an average of about 5% on all coverage curves. During the course of hybrid SNP selection in creating this hypothetical array, we noted that the number of SNPs marked as covered by the imputation piece of the algorithm was 28,788 (87%), compared with 19,293 (58%) marked as covered by simple pairwise tagging. A separate analysis of chromosome 20 produced similar results. As a consequence, we proceeded to design both the AFR and LAT arrays using the hybrid SNP selection strategy described in Section 4.1. The EAS array, which required fewer SNPs for genome wide coverage, was designed by traditional pairwise tagging SNP selection.
The four arrays developed for the genotyping project on the Kaiser Permanente RPGEH were optimized for individuals of varying ancestries. The design of the EUR array is given elsewhere ; the design of the remaining 3 arrays is given below in Section 4.3.2. The collection of SNPs on the four arrays differed, although there was considerable overlap. It was part of the design algorithm to maximize the overlap of SNP content between the arrays. Table 1 provides a description of SNP content for the four arrays, including a breakdown by type (autosomal, X-linked, Y-linked or mitochondrial), the number of 1-rep SNPs on each of the arrays (the SNPs tiled with one representation were only those selected for genome-wide coverage), and the number of overlapping SNPs between the different arrays. Among the four arrays, 804,385 SNPs were unique to a single array; 403,981 were shared by two arrays; 156,270 were shared by three arrays; and 254,438 were shared by all four arrays. In total, 1,619,074 unique SNPs were included on at least one array.
The design of each array, as described for the EUR array , involved selection of SNPs from a preselect set that was prioritized for inclusion and a target set for which SNPs were selected for coverage. The preselect set consisted of three tiers of the most important SNPs described below in Section 4.3.1 (e.g., related to disease) that were directly tiled on the array before other rounds of SNP selection began . The EAS array had 258 SNPs in the primary tier; 9764 in the secondary; and 43,908 in the tertiary. The AFR array had 270 primary; 16,669 secondary; and 43,398 tertiary. Lastly, the LAT array had 279 primary; 20,020 secondary; and 43,398 tertiary. The increasing number of secondary SNPs on the AFR and LAT arrays is primarily a result of the availability of more validated SNPs at the time the array was designed, due to the greater numbers of SNPs available for use with the new Affymetrix Axiom Reagent Kit 2.0 and to the increasing SNP requirement due to decreased LD in African ancestry populations for imputing missing SNPs.
Coverage was computed for each array against an appropriate target population by calculating imputation r2 values. To obtain an unbiased estimate of coverage, we used chromosome 2 sequence data of the 1000 Genomes Project interim June 2011 release data (KG2011) (http://1000genomes.org) consisting of 1094 individuals of 14 race/ethnicities: 61 ASW, 87 Utah residents with ancestry from Northern and Western Europe from Centre d'Etude du Polymorphisme Humain (CEU), 97 Han Chinese in Beijing (CHB), 100 Han Chinese South (CHS), 60 Colombian in Medellin, Colombia (CLM), 93 Finnish individuals from Finland (FIN), 89 British individuals from England and Scotland (GBR), 14 Iberians in Spain (IBS), 89 Japanese (JPT), 97 LWK, 66 HapMap Mexican individuals from Los Angeles, California (MXL), 55 Puerto Ricans in Puerto Rico (PUR), 98 Toscani in Italia (TSI), and 88 YRI. To compute coverage, all subjects other than the target group were included in the reference set. For example, for the 97 target Chinese in Beijing (CHB) individuals, we used all other populations except CHB in the reference sample. Imputation accuracy is affected by the size of the reference sample, among other things . Our reference and target panels vary slightly amongst the different populations; however, the sizes are sufficiently large and similar that results are comparable.
The EAS array was designed to cover SNPs from the ASI population (up to 90 CHB and 89 JPT unrelated HapMap 3 individuals  using HapMap and Axiom validated dbSNP and KGP SNPs) with MAF ≥ 0.02, and SNPs from the CEU population (up to 116 unrelated HapMap 3 CEU individuals , using HapMap and Axiom validated dbSNP and KGP SNPs) with MAF ≥ 0.10. To obtain an unbiased estimate of genome-wide coverage, we used the KG2011 data for the imputation calculation, as described above. Results are given in Fig. 3 for CHB. This dataset was sequenced at a low (average ~5×) coverage [17,18], and some genotype calls for these subjects were improved through imputation from HapMap 3 data . Hence, because of potential noise in the low pass sequencing phase of the KGP, in Fig. 3 we also display coverage of the subset of SNPs also found in the 1000 Genomes High Pass (KGHP) data (coverage of 20–60×). Because these high quality SNPs were derived from sequencing only two trios, they are biased towards more common allele frequencies. However, this set still contains low frequency variants, and we have stratified the results based on MAF ranges found in the ASI. As can be seen in the figure, coverage is excellent down to a MAF of 0.01. We note that coverage of the subset of KGHP SNPs is considerably better than for the KG2011 SNPs as a whole. As we have reported before , this is likely due, at least in part, to the low coverage sequence data containing inaccurate genotype calls and false positive SNPs due to the low pass sequencing. Results were nearly identical for imputation coverage in CHS using all other individuals except CHS, as well as in JPT using all individuals but JPT (results not shown).
The AFR array was designed to cover SNPs from the YRI population (up to 116 unrelated HapMap 3 individuals  using HapMap and Axiom validated dbSNP and KGP SNPs) with MAF ≥ 0.02, and SNPs from the CEU population (same genotypes as in EAS array) with MAF ≥ 0.10. Coverage results for the 61 KG2011 ASW individuals are given in Fig. 4. Coverage is excellent for MAF of 0.04 or greater at an r2 of 0.8, and still good for MAF of 0.01 or greater. This reflects the increased genetic variation and decreased linkage disequilibrium observed in the YRI population.
The LAT array was designed to cover YRI SNPs with a MAF ≥ 0.10 and CEU SNPs with a MAF ≥ 0.03, in addition to a set of projected Native American-specific SNPs (see the description of the LAT array design). Results are shown for MXL and PUR in Fig. 5 and Fig. 6, respectively. The LAT array was designed for Latino populations with higher amounts of African ancestry, such as the PUR individuals; individuals from Mexico have only a modest degree of African ancestry, on average. Coverage, however, is excellent for both populations for a MAF greater than 0.01.
Finally, we previously reported the coverage of CEU subjects by the EUR array , using the 1000 Genomes Pilot Phase I data, and cross validation imputation coverage using 60 individuals. Because of the small reference sample size, coverage for the array was underestimated. Fig. 7 shows the coverage of the EUR array on the CEU population using the much larger reference sample described above. The coverage is excellent down to a MAF of 0.01. Results were nearly identical for imputation coverage of FIN, GBR, and TSI (results not shown).
While genotyping arrays with millions of SNPs that offer universal coverage are clearly optimal for GWA studies of multi-racial and multi-ethnic cohorts, the production time and expense associated with such arrays were prohibitive for a very large scale project such as ours. Also, we felt that a single array platform with up to 700,000 SNPs universally applied to individuals of all racial/ethnic backgrounds was not optimal, because it would provide less coverage of lower frequency variation overall. Hence, our compromise was to design race/ethnicity specific arrays, which could provide coverage of both common and rare variation in multiple racial/ethnic groups. Several advances during the design of the four arrays in this project led to enhanced coverage. First, the reagent kits developed by Affymetrix had improved by the time of design of the AFR and LAT arrays, affording us a wider choice of Axiom validated SNPs to tile onto those arrays. Second, we developed a novel hybrid SNP selection scheme which enhanced the ultimate coverage of the AFR and LAT arrays over what it would have been had SNP selection been based simply on pairwise tagging. Third, we determined that high performing SNPs could be tiled with a single representation on the Axiom arrays without significant loss of genotype quality. These three developments were most critical for the AFR and LAT arrays, where African ancestry required both a larger number of SNPs and improved SNP selection.
One priority for SNP selection on all three arrays described here was overlap with the first designed array, the EUR array . As a consequence, over 250,000 SNPs are overlapping on all four arrays. These SNPs represent common variation found in all race/ethnicity groups. By contrast, across all four arrays, there are over 1.6 million SNPs represented. This large number reflects both common and lower frequency variation that is race/ethnicity specific. Many of the SNPs in this collection of 1.6 million are polymorphic or high frequency in only one or a few race/ethnicity groups, and monomorphic in others. One disadvantage of a universal array is that for a given race/ethnicity group, many of the SNPs on that array will be monomorphic (in particular if the cost in time or money is a factor). On the other hand, for SNPs that are polymorphic in two or more race/ethnicity groups, non-overlap of SNPs on the various arrays means that imputation must be used to create a set of SNPs common to the arrays. While imputation may be accurate for many of the SNPs, it may not be accurate for all.
Each array demonstrates good to excellent genome-wide coverage for the populations it was designed to cover. As expected, coverage of the EAS, LAT, and EUR arrays is very high, with coverage of the AFR array modestly lower. While coverage is substantially greater for SNPs that appeared in the KG high pass as well as low pass data in general, the difference is more dramatic for comparisons based on the EAS array and East Asian populations. This is likely due to the number of minor alleles observed in the reference sample, which can have a strong influence on imputation coverage for a SNP. The high pass data were derived from one trio of European ancestry and one trio of African ancestry. Hence, SNPs found in the high pass data are likely to occur in the imputation reference sample at higher frequency than SNPs found in the low pass data that were not found in the high pass data (e.g., SNPs that are specific to East Asians). This bias has less impact on other arrays and populations. Although our present coverage estimates use only chromosome 2, we expect them to be similar to genome-wide coverage, as we have seen in other datasets.
We believe that the cost and throughput of next-generation genotyping arrays in conjunction with imputation from dense next generation sequencing data, will aid in the discovery of novel common and low frequency disease-associated variants, especially when used in large scale, well phenotyped populations. It is likely that genome-wide genotyping arrays will continue to be higher throughput and less expensive than whole genome sequencing. In particular, we look forward to the identification of novel variants associated with a variety of diseases and traits using the data from these arrays in the Kaiser Permanente RPGEH.
The novel hybrid SNP selection algorithm is based on cycles that alternate greedy SNP selection by pairwise tagging with imputation coverage calculations. The candidate set of SNPs corresponds to SNPs validated by Affymetrix for use with the Axiom Genotyping Solution that are available for tiling on the array. The target set of SNPs refers to the set of SNPs for which coverage is attempted (typically larger than the candidate set and limited to SNPs passing a MAF cutoff). The selected set of SNPs is the collection of SNPs chosen for tiling on the array after each cycle of SNP selection. The selected set of SNPs increases with each cycle, and the target set of SNPs is reduced after each cycle.
At each cycle of the hybrid SNP selection, a certain number of SNPs are selected from the candidate set based on greedy pairwise coverage and added to the selected set. The selection step is then followed by a coverage step, wherein the selected set of SNPs is used to calculate imputation-based coverage for each SNP remaining in the target set. SNPs in the target set with an imputation r2 greater than a given threshold are removed from the target set. The two steps constitute a single cycle. For computational reasons, imputation was done without cross validation. Imputation coverage was calculated as the correlation between dosages of true genotypes and expected dosages from imputed genotype probabilities [8,15].
The number of SNPs selected was determined based on the total number of SNPs that could be covered and the total number of rounds of SNP selection. In general, more iterations resulted in more efficient SNP selection (smaller number of selected SNPs to reach the same coverage), but with greatly increased computational time (primarily due to the imputation coverage step).
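The cycle structure described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation used for the arrays: real imputation (e.g., with Beagle) is replaced here by a stand-in, the in-sample R² of a least-squares fit of the target SNP on the selected SNPs; a target is dropped if it is either tagged pairwise or reaches the stand-in threshold; and all function and parameter names (`hybrid_select`, `per_cycle`, `pair_cut`, `imp_cut`) are hypothetical. Genotypes are assumed to be a samples-by-SNPs matrix of polymorphic SNPs coded 0/1/2.

```python
import numpy as np


def pairwise_r2(G):
    """Squared Pearson correlation between all pairs of SNP columns."""
    return np.corrcoef(G, rowvar=False) ** 2


def standin_imputation_r2(G, selected, j):
    """Stand-in for imputation r2 (an assumption, not real imputation):
    in-sample R^2 from regressing SNP j on the selected SNPs."""
    y = G[:, j].astype(float)
    sst = float(np.sum((y - y.mean()) ** 2))
    if sst == 0.0:
        return 0.0
    X = np.column_stack([np.ones(len(y)), G[:, selected]])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = float(np.sum((y - X @ beta) ** 2))
    return 1.0 - sse / sst


def hybrid_select(G, candidates, targets, per_cycle=1, pair_cut=0.8, imp_cut=0.9):
    """Alternate greedy pairwise selection with coverage-based pruning."""
    r2 = pairwise_r2(G)
    selected, pool, remaining = [], set(candidates), set(targets)
    while pool and remaining:
        # Selection step: greedily add the candidates that tag the most
        # still-untagged targets at the pairwise threshold.
        untagged, added = set(remaining), 0
        for _ in range(per_cycle):
            if not pool or not untagged:
                break
            gain = {c: sum(r2[c, t] >= pair_cut for t in untagged)
                    for c in sorted(pool)}
            best = max(gain, key=gain.get)
            if gain[best] == 0:
                break
            selected.append(best)
            pool.discard(best)
            untagged = {t for t in untagged if r2[best, t] < pair_cut}
            added += 1
        if added == 0:
            break  # no candidate tags anything further
        # Coverage step: drop targets already covered by the stand-in.
        remaining = {t for t in untagged
                     if standin_imputation_r2(G, selected, t) < imp_cut}
    return selected, remaining


# Toy data: SNP 1 duplicates SNP 0; SNP 3 is the sum of SNPs 0 and 2, so no
# single SNP tags it at r2 >= 0.8, but SNPs 0 and 2 together predict it.
G = np.array([[0, 0, 0, 0], [1, 1, 0, 1], [0, 0, 1, 1], [1, 1, 1, 2],
              [0, 0, 0, 0], [1, 1, 0, 1], [0, 0, 1, 1], [1, 1, 1, 2]])
selected, remaining = hybrid_select(G, candidates=[0, 2], targets=[0, 1, 2, 3])
print(selected, remaining)
```

In the toy example, SNP 3 is removed from the target set by the coverage step even though neither selected SNP tags it pairwise, which is the behavior the hybrid method relies on to save array space.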
We compared the novel hybrid SNP selection strategy to the standard greedy SNP selection algorithm, which is based on pairwise linkage disequilibrium, in terms of expected genome-wide coverage for the design of the AFR array. We created a hypothetical array under each strategy to tag 31,119 SNPs from HapMap 3 and some Affymetrix internal screens with MAF ≥ 0.02 in YRI on chromosome 21, using an imputation r2 cutoff of 0.9 and a pairwise r2 cutoff of 0.8. For the hybrid array, we allowed the hybrid SNP selection algorithm to run until all markers that could be covered were covered, which resulted in the selection of 7544 markers. To make the comparison at equal array size, the second hypothetical array consisted of the first 7544 markers chosen by the greedy pairwise algorithm.
We then compared genome-wide coverage of the two hypothetical arrays by imputing all genotypes for two HapMap samples: the ASW and the LWK. For each hypothetical array, imputation was based on the SNPs present on that array. The reference genotype sample for imputation was the HapMap YRI. Imputation coverage for a SNP was calculated as the square of the correlation (r2) of the expected dosages derived from imputation to the dosages derived from the true genotypes using Beagle version 3.3.0 . Only SNPs with at least 50 genotypes available in both the reference and target sample were included in these analyses. Genome-wide coverage was calculated as the proportion of SNPs in the target sample with a given imputation r2 value or greater.
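The two quantities defined in this paragraph, per-SNP imputation r2 and genome-wide coverage, can be written compactly. The following is a minimal sketch, assuming imputed genotypes are available as an n × 3 array of probabilities for 0, 1, and 2 copies of the counted allele; the function names `dosage_r2` and `coverage` are hypothetical, and the imputation itself (Beagle in the text) is not reproduced.

```python
import numpy as np


def dosage_r2(true_genotypes, genotype_probs):
    """Squared correlation between true dosages (0/1/2) and the expected
    dosages derived from imputed genotype probabilities."""
    true = np.asarray(true_genotypes, dtype=float)
    probs = np.asarray(genotype_probs, dtype=float)  # columns: P(0), P(1), P(2)
    expected = probs @ np.array([0.0, 1.0, 2.0])
    if true.std() == 0 or expected.std() == 0:
        return 0.0  # correlation undefined for a constant vector
    return float(np.corrcoef(true, expected)[0, 1] ** 2)


def coverage(r2_values, threshold):
    """Proportion of target SNPs with imputation r2 at or above a threshold."""
    r2 = np.asarray(r2_values, dtype=float)
    return float(np.mean(r2 >= threshold))


# Perfectly confident, correct imputation gives r2 = 1.
truth = [0, 1, 2, 1]
probs = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
print(dosage_r2(truth, probs))

# Two of four SNPs reach r2 >= 0.8.
print(coverage([0.95, 0.85, 0.50, 0.79], threshold=0.8))
```

Evaluating `coverage` over a grid of thresholds yields a coverage curve of the kind compared between the two hypothetical arrays.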
Cluster separation for a SNP with alleles A and B was assessed by a Fisher's Linear Discriminant-related Score (FLD Score) [8,22], which is defined as the minimum of two linear discriminants as follows:

$$\mathrm{FLD} = \min\left( \frac{\left| M_{AB} - M_{AA} \right|}{S_{AA,AB,BB}},\ \frac{\left| M_{AB} - M_{BB} \right|}{S_{AA,AB,BB}} \right)$$

where $M_{AB}$ is the center of the heterozygous cluster in the log ratio dimension, $M_{AA}$ and $M_{BB}$ are the centers of the two respective homozygous clusters in the log ratio dimension, and $S_{AA,AB,BB}$ is the standard deviation of the clusters pooled across all three distributions. Higher FLD Score values are strongly correlated with tighter clusters and higher call rates.
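As an illustration, the FLD Score can be computed from per-cluster log ratio values as the smaller of the two distances between the heterozygous cluster center and each homozygous cluster center, divided by the pooled standard deviation. The sketch below makes one assumption that the text does not specify: the pooled standard deviation is taken as the square root of the within-cluster sum of squares divided by n − 3. The function name `fld_score` and the example values are hypothetical.

```python
import numpy as np


def fld_score(aa, ab, bb):
    """FLD Score from log ratio values of the AA, AB and BB clusters.

    Assumption: pooled SD = sqrt(within-cluster sum of squares / (n - 3)).
    Each cluster must be non-empty and n must exceed 3.
    """
    aa, ab, bb = (np.asarray(x, dtype=float) for x in (aa, ab, bb))
    n = len(aa) + len(ab) + len(bb)
    within_ss = sum(float(((x - x.mean()) ** 2).sum()) for x in (aa, ab, bb))
    pooled_sd = np.sqrt(within_ss / (n - 3))
    # Distance from the heterozygous center to the nearer homozygous center.
    nearest = min(abs(ab.mean() - aa.mean()), abs(ab.mean() - bb.mean()))
    return float(nearest / pooled_sd)


# Well-separated toy clusters: the AB center is 2 units from AA, 3 from BB.
score = fld_score(aa=[1.9, 2.1], ab=[-0.1, 0.1], bb=[-3.1, -2.9])
print(round(score, 2))
```

Because the score takes the minimum over the two homozygous clusters, a SNP scores well only if the heterozygous cluster is clearly separated from both.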
Many of the initial strategies for designing the EAS, AFR, and LAT arrays were the same as those used for the EUR array, explained in detail elsewhere . We describe these again briefly here, with an extended discussion of the modifications to them.
SNPs that were considered for tiling on the array (candidate set) were selected based on having good cluster separation (high FLD Score), a minimum of 3 observed examples of the minor allele (unless in the primary set as described below), and good accuracy (concordance with HapMap when possible, reproducibility, and consistency with Mendelian inheritance). SNP selection proceeded progressively through tiers of importance; SNPs comprising the tiers were updated during each successive array design. All SNPs, aside from those in the primary set, were filtered to have allele frequencies above a certain threshold (discussed below on an array-wise basis).
Primary SNPs were based on strongly confirmed disease associations from literature and online databases [23,24]. Most were directly tiled on the array with redundant coverage based on SNPs chosen for imputation coverage. When adding coverage/redundant coverage to the primary SNPs, we first selected tag SNPs based on the population of greatest relevance. If a single tag SNP with an r2 greater than 0.8 with the target SNP was not available, we selected coverage SNPs by imputation. This entailed greedily adding SNPs so long as it improved the imputation r2 by more than 0.03. Then, additional SNPs were selected, if necessary, for tagging the same target SNP in other relevant populations. Redundant coverage of the same target SNP was obtained by repeating the imputation tagging process with a new set of candidate SNPs.
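The greedy addition rule for primary SNPs (keep adding coverage SNPs while each addition improves the imputation r2 by more than 0.03) can be sketched as follows. As a loudly stated assumption, imputation is again replaced by a stand-in, the in-sample R² of a least-squares fit of the target on the chosen tag SNPs; the names `greedy_imputation_tags`, `min_gain` and `max_tags` are hypothetical, and `max_tags` is a safety cap that is not mentioned in the text.

```python
import numpy as np


def standin_r2(G, tags, j):
    """Stand-in for imputation r2 (an assumption): in-sample R^2 from
    regressing SNP j on the tag SNPs."""
    y = G[:, j].astype(float)
    sst = float(np.sum((y - y.mean()) ** 2))
    if sst == 0.0:
        return 0.0
    X = np.column_stack([np.ones(len(y)), G[:, tags]])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1.0 - float(np.sum((y - X @ beta) ** 2)) / sst


def greedy_imputation_tags(G, target, candidates, min_gain=0.03, max_tags=10):
    """Add the candidate that most improves r2 for `target`; stop when the
    best improvement is not larger than `min_gain`."""
    tags, current = [], 0.0
    pool = [c for c in candidates if c != target]
    while pool and len(tags) < max_tags:
        scores = {c: standin_r2(G, tags + [c], target) for c in pool}
        best = max(scores, key=scores.get)
        if scores[best] - current <= min_gain:
            break  # improvement too small to justify another SNP
        tags.append(best)
        pool.remove(best)
        current = scores[best]
    return tags, current


# Toy data: SNP 1 duplicates SNP 0; SNP 3 equals SNP 0 + SNP 2.
G = np.array([[0, 0, 0, 0], [1, 1, 0, 1], [0, 0, 1, 1], [1, 1, 1, 2],
              [0, 0, 0, 0], [1, 1, 0, 1], [0, 0, 1, 1], [1, 1, 1, 2]])
tags, r2 = greedy_imputation_tags(G, target=3, candidates=[0, 1, 2])
print(tags, round(r2, 3))
```

In the toy data, two tag SNPs suffice: the duplicate SNP adds no information, so the stopping rule rejects it as a third tag. Redundant coverage, as described above, would correspond to rerunning the function with the already chosen SNPs removed from `candidates`.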
The secondary set consisted of SNPs that were suggestive of association with disease or traits of interest but were not as strongly replicated as the primary SNPs. This group derived from a variety of sources [23–26]. When these SNPs could not be directly tiled, coverage was obtained by selection of tagging SNPs based on imputation. This group was not provided with redundant coverage.
The tertiary set consisted of SNPs that were mined from various database sources for potential functional significance (e.g., miRNA, splice site, MHC, coding, etc., SNPs). When possible (i.e. an Affymetrix Axiom validated probeset was available), these SNPs were directly tiled on the array, and they were also included in the first target set for greedy pairwise SNP selection. This first target set also included “gene-enrichment” SNPs in coding regions, adjacent introns and upstream and downstream UTR regions of approximately 5000 genes of interest .
The genome-wide coverage algorithm differed amongst the four arrays. The EAS array followed a simple greedy SNP selection paradigm similar to that used for the EUR array, whereas the AFR and LAT arrays utilized hybrid SNP selection. All 3 of the new arrays described here tiled some SNPs with a single representation to increase the total number of SNPs on the array, although their numbers varied among arrays. Only the highest resolution SNPs not in the preselect set were tiled with a single representation.
The EAS array was designed primarily to cover common and rare polymorphisms in East Asians. However, because some individuals in the RPGEH have mixed East Asian and European ancestry, we also wanted to optimize this array for such mixed-ancestry individuals. Therefore, coverage included SNPs not polymorphic in East Asians but polymorphic in Europeans down to a frequency of 0.10. There were 3 rounds of coverage, as follows:
In round 3, the total number of selected SNPs and features was greater than could fit on a single array. Therefore, in order to fit all selected SNPs into a single array, 10% of the SNPs were included using 1 feature instead of 2 features. These SNPs all have a FLD Score ≥ 8.5 and were chosen from the end of the ranked selected SNP list.
Design of the AFR array took into account the mixed continental ancestry of African Americans, so that both African and European SNPs were considered. However, the lower MAF threshold for the two ancestries was different. We assumed that for an African American population with approximately 20% European ancestry , it was sufficient to include European-specific SNPs with a MAF of 0.10 or greater, as this would translate into a MAF of 0.02 or greater in an African American sample. There were two rounds of coverage, as follows:
Design of the LAT array was the most complex, because it needed to take into account three different continental ancestries—African, European and Native American. Adding to the complexity, Latino populations differ considerably in their relative proportions of these 3 ancestries . Therefore, we started with coverage in the YRI population, assuming the target Latino population had up to about 40% African ancestry, on average, for example as has been observed in Dominicans . We started with SNPs that had been selected for the AFR array, and then removed and added additional SNPs according to coverage characteristics for the other two ancestries. Coverage of European SNPs was based on recent sequence data. Coverage of Native American SNPs, i.e., those polymorphic in Native Americans but absent or of low frequency in other race/ethnicity groups, was complicated by the fact that at the time of design, no such sequence data for Native Americans or Latinos was available. Therefore, for coverage of Native American variation, we needed to rely instead on sources of genotype data for SNPs previously identified. One of these sources was a sample of 92 Latinos from Kaiser Permanente Northern California that was genotyped for approximately 5 million SNPs by Affymetrix specifically to assist SNP selection for the LAT array. There were 5 rounds of SNP selection, as follows:
Rounds 3–5 produced some overlapping SNPs. Removing the overlap, the 3 rounds resulted in a total of 28,047 unique “Native American” SNPs added to the LAT array. All SNPs carried over from the AFR array for the LAT array were tiled with the same number of features as in the AFR array. Native American SNPs (described above) were tiled at half the original number of features when their FLD Score was at least 7.5. SNPs selected during hybrid SNP selection were also tiled with a single feature instead of 2 features when their FLD Score was at least 7.5.
Genome-wide coverage of the EAS, AFR and LAT arrays was evaluated by calculating imputation r2 values for all SNPs in the appropriate target set, as described previously  and above. For each array, the target set included SNPs obtained in the KGHP sequencing effort, but using the KG2011 genotype data derived for those SNPs from sequencing the samples. When computing the coverage of specific racial/ethnic groups for a given array, we used one population from that racial/ethnic group in the target, and all other individuals from populations of that racial/ethnic group plus other racial/ethnic groups in the reference. As described above, we used the program Beagle version 3.3.0  when designing the array, but final coverage estimates for the array were calculated using the program Impute2 version 2.1.2  which we found had slightly higher accuracy, as has been shown before .
This work was supported by grant RC2 AG036607 from the National Institutes of Health, grants from the Robert Wood Johnson Foundation, the Ellison Medical Foundation, the Wayne and Gladys Valley Foundation, Kaiser Permanente, and NIH postdoctoral training grant R25 CA112355. We are grateful to the Kaiser Permanente Northern California members who have generously agreed to participate in the Kaiser Permanente Research Program on Genes, Environment and Health.