Aims: To develop a panel of markers able to extract full haplotype information for candidate genes in alcoholism, other addictions and disorders of mood and anxiety. Methods: A total of 130 genes were haplotype tagged and genotyped in 7 case/control populations and 51 reference populations using Illumina GoldenGate SNP genotyping technology, determining haplotype coverage. We also constructed and determined the efficacy of a panel of 186 ancestry informative markers. Results: An average of 1465 loci were genotyped at an average completion rate of 91.3%, with an average call rate of 98.3% and replication rate of 99.7%. Completion and call rates were lowered by the performance of two datasets, highlighting the importance of the DNA quality in high throughput assays. A comparison of haplotypes captured by the Addictions Array tagging SNPs and commercially available whole-genome arrays from Illumina and Affymetrix shows comparable performance of the tag SNPs to the best whole-genome array in all populations for which data are available. Conclusions: Arrays of haplotype-tagged candidate genes, such as this addictions-focused array, represent a cost-effective approach to generate high-quality SNP genotyping data useful for the haplotype-based analysis of panels of genes such as these 130 genes of interest to alcohol and addictions researchers. The inclusion of the 186 ancestry informative markers allows for the detection and correction for admixture and further enhances the utility of the array.
Unraveling the underlying mechanisms behind genetically complex traits remains one of the principal goals in psychiatric neurogenetics. The challenges associated with identifying the underlying causes of complex diseases are well illustrated by alcoholism, addictions and other psychiatric diseases. These are complex disorders with moderate to high heritability (approximate range 0.4–0.6) (Goldman et al., 2005). The high incidence and complex inheritance patterns suggest that the elucidation of the roles of common genetic variations in vulnerability might be critical for a better understanding of the pathophysiologies and for the improvement in diagnostic specificity. Whilst several functional loci have been identified (e.g. ADH1B His47Arg and ALDH2 Glu487 in alcoholism (Quertemont, 2004), the MAOA VNTR in dyscontrol behaviors (Popova, 2006; Craig, 2007) and HTTLPR in anxiety/dysphoria (Heinz et al., 2001)), the underlying origins of the genetic variance in vulnerability to addictions and other major psychiatric diseases remain largely unknown.
Analysis of markers throughout the genome has shown that alleles of single nucleotide polymorphisms (SNPs) are often linked to each other in stretches that can range in size from <5 Kb up to >100 Kb (Gabriel et al., 2002). These combinations of linked alleles (haplotypes) allow the entire genome (or portions thereof) to be analyzed using a relatively small number of SNPs. Disease causing SNPs will therefore be linked to other markers and can be identified through their association with other markers even if the causative SNP itself is not assayed (Risch and Merikangas, 1996; reviewed in Kruglyak, 2008).
Until recently, researchers were limited in their options for genetic analysis by the limited number of available markers, coupled with a comparatively high cost for each genotype obtained. Classical genetic linkage approaches could only be applied when families could be recruited. With the rapid increase in marker information from the HapMap (http://www.hapmap.org/) and GenBank (www.ncbi.nlm.nih.gov/Genbank/) databases and the availability of high-density SNP genotyping platforms, researchers now have the possibility of comprehensively interrogating candidate genes and entire biosynthetic/physiological pathways (Perlis et al., 2008) for their genetic contribution to a disorder or phenotype, as well as of performing genome-wide scans to identify new candidate genes.
Whole-genome association studies have shown promise in the identification of causative genes in disease (Wellcome Trust Case Control Consortium, 2007; Easton et al., 2007; Hunter et al., 2007; Frayling et al., 2007; Rioux et al., 2007). However, several problems remain with widespread use of this technology. Published whole-genome association studies have demonstrated that common vulnerability alleles often lead to odds ratios of less than 2. Because of the genome-wide nature of these analyses and the resulting need for statistical correction (Risch and Merikangas, 1996; Hirschhorn and Daly, 2005) (although the required degree of correction for multiple testing remains uncertain), large sample sizes in excess of several thousand cases and controls are needed to detect loci influencing risk (Wang et al., 2005). Furthermore, in the case of bipolar disorder a recent whole-genome association study that compared 2000 cases to 3000 controls identified only a single association signal that survived criteria for genome-wide significance, and this locus accounts for only a small part of the variance in vulnerability attributable to genetic factors. The relatively high per-sample cost and the requirement for large numbers of cases and control subjects to identify alleles of modest effect size with associations that are able to withstand correction for multiple testing make the widespread use of this approach impractical and financially burdensome for many research groups unless pooling approaches are adopted (Shifman et al., 2002; Liu et al., 2006; Johnson et al., 2006).
The complexity of neuropsychiatric and behavioral disorders, coupled with the fact that phenotype can be modulated by environmental factors and that clinical diagnostic criteria likely miss possible etiological heterogeneity only detectable by biologic measures, has prompted researchers to use so-called endophenotypes as surrogates for disease states. These endophenotypes are heritable, quantitatively measurable traits that are inherited in a stable manner, are more frequently observed in both cases and their first-degree relatives, and potentially confer vulnerability to a disorder (Gottesman and Gould, 2003; Flint and Munafo, 2007; Frederick and Iacono, 2006; Enoch et al., 2003). Often these endophenotypes are measured by the use of imaging technologies (MRI and PET) (Martinez et al., 2001; Meyer-Lindenberg and Weinberger, 2006) or by EEG measures (Yoon et al., 2006), techniques which, due to their cost, invasive nature, requirement for expensive, specialized equipment and length of time required for data acquisition, are impractical to use on large cohorts. The practicality of the whole-genome association approach to the study of quantitative imaging traits is being assessed, and although no studies are currently published, the data appear to be promising.
Although candidate gene studies have their own inherent limitations (reviewed in Tabor et al., 2002), the use of smaller focused arrays possibly represents a more practical approach for many studies. These focused arrays are able to overcome the issues of inadequate gene coverage and ethnic stratification by providing full coverage for a limited number of candidate genes and by the inclusion of ancestry informative markers (AIMs). Such focused arrays offer the advantages of lower cost and lower false discovery rate, especially in situations where a dataset may have inadequate power for WGA either because of size or other reasons. In the future it also appears likely that such arrays will be required for follow-up on genomic regions identified by linkage and association studies. Studies on individual candidate genes or small groups of such genes have led to the discovery of functional loci such as the ones cited earlier, but on the other hand these studies have been hampered in other ways. Many linkage and association studies on the role of candidate genes in complex disorders have used single non-functional markers that do not capture sufficient information or do not evaluate all genes in the functional domain of interest. In many instances different markers are selected by groups to interrogate a single gene, making the comparison of data difficult. An additional confound in these single gene studies has been the general failure to control for unrecognized ethnic stratification within the cohort that can lead to the generation of both false positive and false negative signals (Schork et al., 2001; Rosenberg and Nordborg, 2006). Such unrecognized stratification is problematic for genetic studies and can also confound studies relating phenotype to phenotype or risk variable to outcome. In such instances ethnicity can represent a hidden variable.
Recent advances in the neurobiology of addiction, mood disorders and psychoses have established the importance of several mechanisms, including reward, stress resiliency and executive cognitive control (reviewed in Goldman et al., 2005). These studies thereby implicate several molecular networks that are integral to those processes and genes necessary for their function. These molecular pathways include signaling networks, stress/endocrine genes, and key neurotransmitter systems including dopamine, serotonin, glutamate, GABA and acetylcholine. In several instances, particular genes and molecules have also been specifically implicated in addiction liability or in addictions-related phenotypes by whole-genome or candidate-gene-focused linkage results.
We have designed a 1536 SNP array, implemented on the Illumina GoldenGate assay platform. This array includes 1350 SNPs selected for 130 genes and 186 ancestry informative markers (AIMs). The 130 candidate genes were selected on the basis of their roles in functional domains important in the addictions and in the related phenotypes of anxiety and depression. Figure 1 lists the 130 candidate genes organized into one somewhat arbitrary scheme of functional categorization. The candidate genes included a limited number involved in the pharmacokinetic domain (e.g. several genes in the ADH gene cluster, and ALDH genes). The majority of the genes represent the domains of vulnerability to drug use and pharmacodynamic response. These include dopamine, serotonin, glutamate, GABA, and opioid neurotransmitter genes, signaling genes, and genes modulating stress resiliency and behavioral dyscontrol domains. There is a high degree of overlap between functional gene categories because of pleiotropic actions of molecules on behavior.
A total of 1350 SNPs (Table 1) from 130 candidate genes (Fig. 1) were selected for inclusion on the array. Tagging SNPs were identified for these genes using the following design pipeline:
A panel of 186 SNPs was selected as genomic controls based on the following criteria: (i) reference allele frequency (RFA) of pairwise SNPs from the HapMap Project was at least 0.75; (ii) the minimum distance between SNPs was 80 kb; (iii) the absolute value of log10(RFA1/RFA2) was greater than 1 (i.e. there was a 10-fold difference). The selected SNPs represent a subset of a larger 204 SNP AIMs panel (Enoch et al., 2006) previously tested on the Illumina platform, from which failed or uninformative assays have been removed. AIMs data were analyzed using structure 2.1 to generate population assignments for all individuals (Pritchard et al., 2000). For the CEPH (Centre d'Etude du Polymorphisme Humain) diversity panel, the run parameters used were 1051 individuals, 179 loci, 51 populations assumed, a burn-in period of 100,000 and 200,000 reps. For the test populations, the same run parameters were used, with 5 populations assumed for the 564 samples and 159 loci. The output was graphically represented using the distruct program (Rosenberg, 2004; http://rosenberglab.bioinformatics.med.umich.edu/distruct.html).
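The three selection criteria can be illustrated with a short sketch. This is not the pipeline used to build the panel. It assumes that criterion (i) applies to the higher of the two population frequencies, that the logarithm is base 10, and that spacing is enforced greedily along each chromosome; the function names and marker tuples are hypothetical.

```python
import math


def passes_frequency_criteria(freq_1, freq_2, min_high=0.75, min_abs_log10=1.0):
    """Check criteria (i) and (iii) for one SNP in two populations.

    Assumption: criterion (i) is read as "the higher of the two
    frequencies is at least min_high".
    """
    if freq_1 <= 0 or freq_2 <= 0:
        return False  # the log ratio is undefined when a frequency is zero
    high = max(freq_1, freq_2)
    log_ratio = abs(math.log10(freq_1 / freq_2))
    return high >= min_high and log_ratio > min_abs_log10


def thin_by_distance(markers, min_gap=80_000):
    """Criterion (ii): keep markers at least min_gap bp apart per chromosome.

    markers is a list of (chromosome, position, name) tuples. Markers are
    visited in positional order and kept greedily (an assumption).
    """
    kept = []
    last_kept_pos = {}
    for chrom, pos, name in sorted(markers):
        if chrom not in last_kept_pos or pos - last_kept_pos[chrom] >= min_gap:
            kept.append(name)
            last_kept_pos[chrom] = pos
    return kept


# Hypothetical candidates: frequencies in two populations and map positions.
candidates = [("1", 100_000, "snpA"), ("1", 150_000, "snpB"), ("1", 200_000, "snpC")]
spaced = thin_by_distance(candidates)
informative = passes_frequency_criteria(0.80, 0.05)
```

With these inputs the middle marker is dropped because it lies only 50 kb from the first one kept.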
All samples used were collected under protocols approved by the relevant institutional IRB, with participants providing written informed consent for use of their samples in genetic studies.
Genotyping was performed using the Illumina GoldenGate genotyping protocols on 96-well format Sentrix® arrays. Five hundred nanograms of sample DNA was used per assay. All pre-PCR processing was performed using a TECAN liquid handling robot running Illumina protocols. Arrays were imaged using an Illumina Beadstation GX500 and the data analyzed using GenCall v18.104.22.168 and GTS Reports software v22.214.171.124 (Illumina). Genotype clusters were determined for a test dataset and this template was applied to all subsequent datasets. Data for each dataset were polished by manual adjustment of the clustering for each marker to correct for differences between datasets arising from sample integrity and concentration. Loci for which three distinct clusters could not be resolved were assigned zero scores. Data were further polished as follows: genotypes with low GenCall scores (<0.25) were called as undetermined. The GenCall score is a value between 0 and 1 that gives a confidence score for a genotype call (the higher the score, the higher the confidence in the call); it is derived from the tightness of the clusters for a given locus and the position of the sample relative to its cluster.
Loci with a call rate <90% were determined to have failed and were excluded. At this point deviation from Hardy–Weinberg equilibrium was not used as an exclusion criterion since all datasets contained both case and control samples and, in general, were of mixed ethnic composition.
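The genotype-level and locus-level filters can be sketched as follows. This is a minimal illustration rather than the vendor software's behavior. It assumes that a genotype counts as called when its GenCall score is at least 0.25, and that loci whose call rate falls below 90% are the ones treated as failed; the function names and input layout are hypothetical.

```python
def locus_call_rate(gencall_scores, min_gencall=0.25):
    """Fraction of samples with a determined genotype at one locus.

    gencall_scores holds one score per sample; None marks a missing call.
    Scores below min_gencall are treated as undetermined.
    """
    if not gencall_scores:
        return 0.0
    called = sum(1 for s in gencall_scores if s is not None and s >= min_gencall)
    return called / len(gencall_scores)


def passing_loci(scores_by_locus, min_call_rate=0.90):
    """Return the names of loci whose call rate meets the threshold."""
    return [
        locus
        for locus, scores in scores_by_locus.items()
        if locus_call_rate(scores) >= min_call_rate
    ]


# Hypothetical scores for two loci across ten samples.
example = {
    "locus_1": [0.9] * 10,
    "locus_2": [0.9] * 8 + [0.1, None],  # two undetermined genotypes
}
kept = passing_loci(example)
```

Here the second locus has a call rate of 0.8 and is excluded, while the first is retained.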
A total of 8309 unique samples were genotyped from seven different datasets. DNA samples were excluded using the following criteria. The GenCall scores for a sample across all loci were used to determine the 10th percentile GenCall score (%10 GC) for that sample. The sample exclusion threshold was set separately for each project and was calculated by taking the 90th percentile of the %10 GC scores for all samples in the project and multiplying it by 0.85. Any sample with a %10 GC value below that threshold was classified as failed and removed from the analysis.
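The sample exclusion rule reduces to two percentile calculations. The sketch below assumes linear interpolation between ranked values, which may differ from the percentile definition used by the genotyping software; the function names and the example scores are hypothetical.

```python
def percentile(values, pct):
    """Percentile by linear interpolation between ranked values (an assumption)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of an empty sequence")
    rank = (len(ordered) - 1) * pct / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def failed_samples(gencall_by_sample, factor=0.85):
    """Apply the project-level threshold to per-sample 10th percentile scores.

    Returns the threshold and the sorted names of the samples below it.
    """
    p10 = {name: percentile(scores, 10) for name, scores in gencall_by_sample.items()}
    threshold = factor * percentile(list(p10.values()), 90)
    failed = sorted(name for name, value in p10.items() if value < threshold)
    return threshold, failed


# Hypothetical project with two good samples and one poor one.
project = {
    "sample_1": [0.8] * 5,
    "sample_2": [0.8] * 5,
    "sample_3": [0.5] * 5,
}
threshold, failed = failed_samples(project)
```

In this example the threshold is 0.85 × 0.8 = 0.68, so only the third sample is excluded.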
Genotyping accuracy was determined based on genotype concordance between DNA replicates. The level of sample replication varied between datasets averaging 16% across all seven datasets.
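Concordance between replicates can be computed as the proportion of matching genotypes among the loci called in both replicates. The following sketch assumes that genotypes are stored as two-letter strings, that allele order is irrelevant, and that loci undetermined in either replicate are skipped; these conventions are illustrative and not taken from the analysis software.

```python
def replicate_concordance(calls_a, calls_b):
    """Proportion of identical genotypes between two replicate samples.

    calls_a and calls_b are equal-length lists of genotype strings such
    as "AG", with None for undetermined calls. Returns None when no
    locus was called in both replicates.
    """
    compared = 0
    matched = 0
    for a, b in zip(calls_a, calls_b):
        if a is None or b is None:
            continue  # undetermined in at least one replicate
        compared += 1
        if sorted(a) == sorted(b):  # "AG" and "GA" are the same genotype
            matched += 1
    return matched / compared if compared else None


# Hypothetical replicate pair genotyped at four loci.
rate = replicate_concordance(["AA", "AG", None, "GG"], ["AA", "GA", "AA", "AG"])
```

Three loci are compared here and two agree, giving a concordance of 2/3.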
Haplotypes were derived using the program Phase 2.0 (Li and Stephens, 2003).
Five of the seven datasets (sets A, B, C, D and G) averaged 1481 passing loci, with an average completion rate of 97.60% for those loci (Table 3). Datasets E and F had fewer passing loci, 1351 and 1387 respectively, and greatly reduced completion rates, 86% and 67%. Once all failing DNAs were removed, the average call rate per sample for the datasets was 99.31%, with all but dataset F having a call rate of 90.4%. The reduced performance of the array for datasets E and F is likely due to issues of DNA concentration and quality since the average replication rate for all seven datasets was 99.7% and datasets E and F recorded replication rates of 99.5% (99.95% if one pair were excluded) and 99.6%, respectively, indicating the high quality of genotyping generated for these two datasets.
One of the datasets was derived from a Finnish population, which allowed us to estimate the genotyping accuracy by comparing the minor allele frequency (MAF) for all passing loci in this dataset to the MAF (where known) for the HapMap Caucasian population. The similarity in MAF for the 1440+ loci (Fig. 2) suggests that the genotyping clusters were correctly assigned. Only one marker showed a deviation in MAF >±0.25. This marker, rs4824001, is one of the 186 AIMs and was originally selected for its high MAF in the Yoruban population (MAF = 0.833), intermediate frequency in Asian populations (MAF = 0.471) and low MAF (0.017) in Caucasians. The observed MAF (0.498) was confirmed by inspection of the cluster file, which showed clear cluster separation (data not shown). This suggests that this marker, in conjunction with others, may have utility for identifying population stratification in Caucasian populations.
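The frequency comparison can be sketched as a per-locus difference check. The code below assumes that allele frequencies are derived from genotype counts and that observed and reference frequencies refer to the same allele; the function names are hypothetical, and the second locus in the example is invented for illustration.

```python
def minor_allele_frequency(n_aa, n_ab, n_bb):
    """MAF from the counts of the three genotype classes at one locus."""
    total_alleles = 2 * (n_aa + n_ab + n_bb)
    if total_alleles == 0:
        raise ValueError("no genotypes supplied")
    freq_a = (2 * n_aa + n_ab) / total_alleles
    return min(freq_a, 1 - freq_a)


def flag_deviations(observed, reference, max_diff=0.25):
    """Loci whose observed frequency differs from the reference by more than max_diff.

    observed and reference map locus names to frequencies; loci without
    a reference value are skipped.
    """
    return sorted(
        locus
        for locus, freq in observed.items()
        if locus in reference and abs(freq - reference[locus]) > max_diff
    )


# rs4824001 uses the frequencies quoted in the text; "locus_x" is hypothetical.
observed = {"rs4824001": 0.498, "locus_x": 0.30}
reference = {"rs4824001": 0.017, "locus_x": 0.28}
flagged = flag_deviations(observed, reference)
```

Only rs4824001 exceeds the 0.25 difference in this example.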
The array was designed to allow haplotype analysis. Tagging SNPs were selected to be able to detect haplotypes present at a haplotype frequency of 0.006 or higher. However, subsequent to the design of the oligo pool, additional SNPs have been identified and genotyped in the HapMap populations, resulting in an increase in the number of possible haplotypes. The haplotype coverage offered by the Addictions Array tagging SNPs for alcohol dehydrogenase 6 (ADH6) was compared to haplotypes calculated for data from HapMap release 21 (Fig. 3) for the combined Asian and Caucasian samples. To facilitate the analysis, haplotypes for Nigeria (YRI) and Utah (CEU) samples in the chromosome region were downloaded from HapMap project release 21 (http://www.hapmap.org/). Based on Manhattan distances weighted by minor allele frequency and marker average LD, haplotypes were clustered hierarchically using R (http://www.r-project.org). Haplotype coverage was determined by dividing the number of haplotypes correctly identified by the tag SNP set by the total number of SNPs within the corresponding cluster. As shown in Fig. 3, the majority of the haplotypes could be correctly called in the combined Asian sample, with only three minor haplotypes not being determined by the tag SNP set. Overall in the Asian population haplotype coverage averaged 0.98. In Caucasians the overall haplotype coverage remained at 0.94; however, of the 11 minor haplotypes not detected, the majority (9) were cladistically related, arising in the H3 cluster.
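The distance underlying the clustering can be illustrated with a weighted Manhattan distance between haplotypes coded as 0/1 allele vectors. The exact weighting scheme is not specified here, so the sketch takes the per-marker weights as given; how they would be derived from minor allele frequency and average LD is left open, and the function names and example values are hypothetical. The resulting matrix is the kind of input a hierarchical clustering routine would accept.

```python
def weighted_manhattan(hap_a, hap_b, weights):
    """Sum of per-marker allele differences, each scaled by a marker weight."""
    if not (len(hap_a) == len(hap_b) == len(weights)):
        raise ValueError("haplotypes and weights must have equal length")
    return sum(w * abs(a - b) for a, b, w in zip(hap_a, hap_b, weights))


def distance_matrix(haplotypes, weights):
    """Symmetric matrix of pairwise weighted Manhattan distances."""
    n = len(haplotypes)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = weighted_manhattan(haplotypes[i], haplotypes[j], weights)
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Three hypothetical four-marker haplotypes and placeholder weights.
haps = [[0, 1, 1, 0], [1, 1, 0, 0], [0, 1, 1, 1]]
marker_weights = [0.5, 0.2, 0.25, 0.1]
distances = distance_matrix(haps, marker_weights)
```

Taking the weights as an input keeps the sketch neutral about how frequency and LD are combined.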
The average haplotype coverage for the genes analyzed by the Addictions Array was compared to the coverage provided by the Illumina HumanHap 550®, the Affymetrix Human-Wide SNP Array 5.0® whole-genome association array and the Affymetrix Human-Wide SNP Array 6.0® (Fig. 4). Only 121 of the 130 genes represented on the Addictions Array were analyzed because X-linked phased haplotypes carried a discrepancy warning from HapMap and because in the case of several smaller genes only two markers had been genotyped in HapMap. The subsets of genes analyzed for the Illumina and Affymetrix arrays were not completely overlapping. Out of 121 Addictions Array genes only 113 were represented on the Illumina array, 112 on the Affymetrix 6.0 array and 103 on the Affymetrix 5.0 array. The whole-genome arrays on average used more than twice the number of SNPs (averaging 18, 31 and 20 SNPs per gene for the Affymetrix 5.0, Affymetrix 6.0 and Illumina 550 arrays, respectively) to cover each gene compared to the Addictions Array (average 9 SNPs per gene). Despite the reduced number of SNPs per gene, the average haplotype coverage (HCM–haplotype coverage mean) for the Addictions Array was consistently higher than that of the Affymetrix 5.0 Array for all three HapMap populations. The superior performance of the Addictions Array over the Affymetrix 5.0 array product was also confirmed by the coverage median values in all three populations. The Addictions Array performed comparably to the Illumina HumanHap 550 array and the Affymetrix 6.0 array for the Caucasian and Asian HapMap populations, with an HCM of 0.76 for the Caucasian and Asian groups, compared to the 0.80 and 0.79 values for the Illumina 550 array, and 0.77 and 0.78 for the Affymetrix 6.0 array. 
The Addictions Array produced a higher HCM (0.74) and coverage median (0.76) for the Yoruban population than the Illumina array (HCM 0.67, median coverage 0.69) and comparable results to the Affymetrix 6.0 array (HCM 0.73; median coverage 0.76).
The ability of the AIMs panel to detect differences between populations that were not originally used in the design of the panel was tested by genotyping the CEPH diversity panel (Cann et al., 2002). Genotyping data were analyzed using structure 2.2 for a six-population solution (Fig. 5a). Using the combined global data, the AIMs panel is able to distinguish six distinct populations that segregate along continental lines. This solution is similar to that obtained by Rosenberg et al. using a panel of 377 micro-satellite markers. Additionally, the two samples previously shown to be misidentified (Rosenberg et al., 2002) as members of the Biaka pygmy and Japanese cohorts were assigned by this AIMs panel to their correct continental groups (Europe/Middle East and the Americas, respectively). The ability of the AIMs panel to detect admixture was then tested by analyzing the combined data from five populations: Finnish Caucasians (n = 85), African Americans from New Jersey (n = 83), Native Americans from the Midwest (n = 86), Han Chinese (n = 83) and Mexican Americans from California (n = 228). The analysis was performed using the assumption of five populations using data for 159 loci, and the results are shown in Fig. 5b. All individuals were correctly assigned to their ethnic cluster, although individuals can be seen to vary in their degree of admixture. The admixture contribution of a cluster to each population is shown as a percentage of the inferred clusters (Fig. 5). As expected, the African American and Mexican American populations showed higher degrees of admixture than the Finnish and Han Chinese samples, both of which had been previously shown to be relatively homogeneous groups (Enoch et al., 2006).
Technologies for genotyping have increased genotyping throughput whilst at the same time decreasing the cost per genotype. At present up to 1 million SNPs can be interrogated simultaneously in an individual, allowing for whole-genome association studies. Such studies have successfully identified susceptibility loci for obesity (Frayling et al., 2007) and breast cancer (Hunter et al., 2007; Easton et al., 2007) as well as for bipolar disorder, coronary heart disease, Crohn's disease, rheumatoid arthritis, and type 1 and type 2 diabetes (Wellcome Trust Case Control Consortium, 2007; Rioux et al., 2007). The cost of this approach remains prohibitively high for generalized use, and it requires large datasets to obtain the necessary power to detect association to a phenotype. This is particularly problematic for those datasets where PET or MRI imaging is performed since the high cost of the scans coupled with the time required to acquire the data makes the collection of large datasets impractical. Pooling of samples has been successfully used to reduce the overall number of arrays required for a study; however, this approach has not gained widespread acceptance or use due to the practical issues of sample normalization, statistical testing and the loss of individual haplotype information needed, amongst other reasons, to validate the homogeneity of the phenotypic groups. Although the cost per sample of the whole-genome arrays is constantly falling, and the data could be used for haplotype-based analysis of individual genes, these arrays are likely to remain inappropriate for candidate gene analysis due to issues of sample throughput. Certainly the use of these arrays would allow for more fine-tuned control and correction of population stratification due to the higher number of markers. 
Currently, however, the use of more focused arrays represents a more appropriate approach for many studies where the number of subjects is limited, and where the investigators wish to study a specific hypothesis where candidate genes are selected on the basis of function or where individual SNPs are known to alter the expression or biological activity of the gene product. Additionally in future, once a number of large whole-genome association studies have been completed, it may be more appropriate to use focused arrays to interrogate genomic regions identified as potential candidate regions in a large number of smaller datasets. In this context where there are convergences of whole-genome association data to previously identified candidate genes, the two approaches act synergistically as cross-validation of each positive association finding.
The SNP tagging pipeline for this array used the HapMap data for the Yoruban population as its basis. Whilst it would be preferred that a tag set was used for each unique population, it has been shown that tagging SNP sets have high portability across populations (de Bakker et al., 2006; Conrad et al., 2006; Gonzalez-Neira et al., 2006). The discovery of additional SNPs subsequent to the array design has resulted in a reduction in the haplotype capture or coverage, but it remains at levels comparable to the high-density arrays available. Use of clustering on a cladistic basis allows the grouping of related haplotypes, particularly those with low frequency. It might be considered desirable to generate an array capable of universally high haplotype capture in all populations; however, such a goal is unlikely to be achieved. For complete haplotype capture in all three HapMap populations the number of tagged SNPs for each gene would have to be increased, with a concomitant reduction in the number of genes that can be interrogated. That reduction in the number of genes is likely to make any array less attractive to researchers as it increases the likelihood that one or more genes of interest will be absent from the array.
In this array we have focused on genes of particular interest to alcohol researchers, which are also of interest to the general neuropsychiatric community. The use of SNP-tagging allows the reduction in the number of SNPs required to successfully interrogate each gene, and maximizes the utility of any array design by increasing the number of candidates that can be incorporated into the design. This is an important consideration for custom designs, the cost of which falls as the number of samples screened increases. In addition we have been able to include a large panel of AIMs for the detection and correction for population stratification. To genotype such a large SNP panel on a SNP-by-SNP basis would be uneconomic and take a considerable time to accomplish. Since one of the possible confounds in association studies is false positive (and negative) findings arising from differences in the makeup of the control and case groups, the detection of any stratification is of the highest importance. Usually this problem has been handled by careful selection and matching of the case and control groups, often resulting in increased costs and longer recruitment of study participants. Such selection of participants usually results in the exclusion of minorities, represents a contributory factor in racial disparities in healthcare and is obviously undesirable both scientifically and socially. Even when the issue of population stratification was addressed by using genotypes from markers unlinked to each other and to the gene of interest, it was rarely demonstrated that the markers used were in fact capable of detecting it. By genotyping the AIMs in the CEPH reference populations a canonical dataset was created, enabling the computation of ethnic factor scores anchored against worldwide genetic diversity and allowing direct dataset-to-dataset comparisons. 
Fixed solutions for admixture correction can be performed using individual ethnic factor scores as covariates, or alternatively association data can be corrected using programs such as STRAT (Pritchard et al., 2000) that directly use the output of STRUCTURE 2.0 to correct for any detected population stratification.
Comparisons between the results from published candidate gene studies have been hampered in the past by the use of different sets of markers for the interrogation of the same gene. Whilst often this results from studies being performed contemporaneously or due to constraints of a particular genotyping platform, clearly it is desirable to be able to easily correlate data from different studies. This issue has clearly been seen in the study of DISC1 as a candidate gene for schizophrenia where multiple groups have performed association studies using many different markers for their analysis (Hwu et al., 2003; Hennah et al., 2003; Hodgkinson et al., 2004; Thompson et al., 2005). Although the studies have provided supportive evidence for each other, the identification of the functional loci has been hampered. Similarly the general region in which the GABAA subunit gene cluster is located on chromosome 4p was implicated in alcohol dependence by family linkage scans and a series of more recent studies (beginning Long et al., 1998; Porjesz et al., 2002; Song et al., 2003; Edenberg et al., 2004; Lappalainen et al., 2005; Prescott et al., 2006; Drgon et al., 2006) have now demonstrated linkage disequilibrium within the GABAA subunit gene cluster itself including the same alleles and haplotypes as determined by analysis of the data from the partially overlapping loci evaluated in these studies. Frequently, a SNP or multilocus haplotype can be used to impute a different SNP (Wellcome Trust); however, the ability to compare across studies is made considerably more challenging by the genotyping of different markers in different studies. In this context and others, the use of genotyping tools, including commercially available arrays that access common sets of markers, is highly advantageous. 
Although the International HapMap Project provides valuable information about linkage disequilibrium between markers that can assist in cross-study comparisons, the process is clearly inefficient, time-consuming and not without error, as only four populations are currently represented in the database.
Widespread use of haplotype capture arrays such as this addictions array would greatly facilitate cross-study comparisons and use of large panels of AIMs might permit the data from different studies to be combined and analyzed by allowing for population admixture to be controlled for in the analysis.
We would like to thank Amy Doebber and Rema Paudel for technical assistance. This work was supported by 1RO1 AA13640, 5P50 AA11998, 1RO1 NS43762, The Blanche F. Ittleson Endowment Fund (RDT), NIH Grants R01 DA 12422, K02 DA 15766 (JFC) and an unrestricted research grant from Glaxo-Smith-Kline (EBB), grants AA06420 and AA10201 to CLE and NIH grants MH062185, MH048514 and MH056390 to J.J.M. This project was also supported by the National Institute on Alcohol Abuse and Alcoholism Intramural Research Program.