Hum Mol Genet. Author manuscript; available in PMC 2008 April 1.
Published in final edited form as:
PMCID: PMC2278047

Novel Genes Identified in a High Density Genome Wide Association Study for Nicotine Dependence


Tobacco use is a leading contributor to disability and death worldwide, and genetic factors contribute in part to the development of nicotine dependence. To identify novel genes for which natural variation contributes to the development of nicotine dependence, we performed a comprehensive genome wide association study using nicotine dependent smokers as cases and non-dependent smokers as controls. To allow efficient, rapid, and cost-effective screening of the genome, the study was carried out using a two-stage design. In the first stage, genotyping of over 2.4 million SNPs was completed in case and control pools. In the second stage, we selected SNPs for individual genotyping based on the most significant allele frequency differences between cases and controls from the pooled results. Individual genotyping was performed in 1050 cases and 879 controls using 31,960 selected SNPs. The primary analysis, a logistic regression model with covariates of age, gender, genotype and gender by genotype interaction, identified 35 SNPs with p-values less than 10⁻⁴ (minimum p-value 1.53 × 10⁻⁶). Although none of the individual findings is statistically significant after correcting for multiple tests, additional statistical analyses support the existence of true findings in this group. Our study nominates several novel genes, such as Neurexin 1 (NRXN1), in the development of nicotine dependence while also identifying a known candidate gene, the β3 nicotinic cholinergic receptor. This work anticipates the future directions of large-scale genome wide association studies with state-of-the-art methodological approaches and sharing of data with the scientific community.

Tobacco use, primarily through cigarette smoking, is responsible for about 5 million deaths annually, making it the largest cause of preventable mortality in the world (1), and nicotine is the component in tobacco that is responsible for the maintenance of smoking. Because of increasing tobacco use in developing nations, it is predicted that the death toll worldwide will rise to more than 10 million per year by 2020.

In the United States, 21% of adults were current smokers in 2004, with 23% of men and 19% of women smoking (2). Each year, approximately 440,000 people die of a smoking related illness (3). The economic burden of smoking is correspondingly high. Annual costs are estimated at $75 billion in direct medical expenses and $92 billion in lost productivity. The prevalence of cigarette smoking has decreased over the last 30 years in the U.S., primarily through smokers’ successful efforts to quit. Yet the rate of smoking cessation among adults has been slowing since the mid-1990s, underscoring the limitations of current treatments for smoking. In addition, adolescents continue to initiate cigarette use, with 21% of high school students reporting cigarette smoking in the last month (4).

Smoking behaviors, including onset of smoking, smoking persistence (current smoking versus past smoking), and nicotine dependence, cluster in families (5), and large twin studies indicate that this clustering reflects genetic factors (6-10). Previous approaches have used genetic linkage studies (11-14) and candidate gene tests (15-17) to identify chromosomal regions and specific genetic variants suspected to be involved in smoking and nicotine dependence. We have extended the search for genetic factors by performing a high-density whole genome association study using a case-control design in unrelated individuals to identify common genetic variants that contribute to the transition from cigarette smoking to the development of nicotine dependence.


The final sample of 1,050 nicotine dependent case subjects and 879 non-dependent controls who smoked was examined for population stratification, and no evidence of admixture was observed. Quality control measures were applied to the individually genotyped SNPs and 31,960 SNPs were available for analysis.

The most significant findings are presented in Table 1 for those SNPs with a p-value of less than 10⁻⁴. Several genes not previously implicated in the development of nicotine dependence are listed and their hypothesized mechanism of involvement is discussed below. The most significant result was observed with rs2836823 (p-value = 1.53 × 10⁻⁶). This SNP is intergenic, as are several of the top findings. A SNP was defined as “intergenic” if it was not physically in a gene or within 10kb of a known transcribed region. See Figure 1 for an overview of the individual genotyping results.

Figure 1
P values of genome-wide association scan for genes that affect the risk of developing nicotine dependence. -log10 (p) is plotted for each SNP in chromosomal order. The spacing between SNPs on the plot is based on physical map length. The horizontal lines ...
Table 1
SNPs with primary model p-value < 0.0001. Listed genes are within 10kb of the SNP position

Because of the dense genome-wide scope of our study, the interpretation of these p-values was complicated by the large number of statistical tests. Approximately 2.4 million SNPs were examined in the pooled screening stage. Although this is a large sample with nearly 2,000 subjects, no SNP showed a genome-wide significant p-value after Bonferroni correction for multiple tests. Yet, several independent lines of evidence provided support that true genetic associations were identified in this top group of SNPs.

We used the agreement of direction of effect for the top SNPs in the Stage I samples (those included in the pooled genotyping, N=948) as compared with those samples added in Stage II (N=981) as a measure of evidence for real associations within the dataset. If there were no true associations in the data, the expectation would be a random assortment of effect direction between the two sample sets. In contrast, 30 of the top 35 SNPs in the Stage I samples showed the same direction of effect in the additional Stage II sample set. This level of agreement was highly significant, with a p-value of 1.1 × 10⁻⁵ from the binomial distribution indicating the error rate associated with rejecting the hypothesis of chance agreement. Thus, our top SNPs were enriched for real and reproducible allele frequency differences between cases and controls.
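The directional agreement test described above can be checked with a short calculation. The sketch below assumes a one-sided test with a 50% chance of agreement per SNP under the null hypothesis; the function name `binomial_upper_tail` is ours and not part of the original analysis.

```python
from math import comb


def binomial_upper_tail(successes: int, trials: int, prob: float = 0.5) -> float:
    """Return P(X >= successes) for X ~ Binomial(trials, prob)."""
    return sum(
        comb(trials, k) * prob**k * (1 - prob) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# 30 of the top 35 SNPs agreed in direction between Stage I and Stage II samples.
p_agreement = binomial_upper_tail(30, 35)
print(f"{p_agreement:.2e}")
```

Under these assumptions the tail probability comes out near 1.1 × 10⁻⁵, in line with the value reported in the text.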

Further evidence for the presence of true associations came from comparison of these results with a candidate gene study conducted simultaneously [described in the companion paper by Saccone et al]. The β3 nicotinic receptor candidate gene, CHRNB3, the most significant finding in the candidate gene study, was also tagged by SNPs identified in the genome wide association study. This gene has a strong prior probability of a relationship with nicotine dependence, and the likelihood of any of the candidate genes in the study by Saccone and colleagues being selected in the top group of SNPs in the genome wide association study is less than 5%.

To investigate the accuracy of pooled genotyping estimates of the allele frequency differences between cases and controls, we examined the relationship between the pooled and individual genotyping results. The pooled genotyping indeed enriched the selected set of SNPs for sizable allele frequency differences between cases and controls included in the pooled study. When p-values were computed from individual genotypes using only Stage I samples, there is a strong enrichment of small p-values (see Figure 2a). If the pooled genotyping was not at all successful, the distribution of p-values would be uniform, and if the pooling was completely accurate, then only small p-values would be present in the individual genotyping stage assessed in this sample subset. As seen in Figure 2a, our results lie between these extremes. We also examined the p-values of the samples added into the Stage II that were not in the pooling step. Because these Stage II samples are an independent random sample from the case and control populations, they are not expected to show the same allele frequency differences as Stage I samples where those differences are due to sampling error. Thus, their p-values should be uniformly distributed except for possible real associations, which would be consistent between the two sets of samples. This is seen in Figure 2b. The graph is fairly uniform with only a slight increase in small p-values.

Figure 2
a) Distribution of p-values from the Stage I sample of the 31,960 individually genotyped SNPs that were selected from pooled genotyping stage. The distribution shows that the pooled genotyping produced an enrichment of SNPs with small p-values. A uniform ...

In addition, we directly compared allele frequency estimates based on the pooled genotyping with those based on individual genotyping. As seen in Figure 3, the majority of the allele frequency estimates from the pooled and individual genotyping results lie along the diagonal. A similar finding is seen if case or control samples are examined separately. We computed a correlation of 87% between allele frequencies estimated from the case pooled genotyping and allele frequencies computed in the individual genotyping sample of cases from Stage I (case subjects N=482). Similarly, there was an 84% correlation of allele frequencies seen in the comparison of the pooled and individual genotyping in the control sample from Stage I (control subjects N=466). When we compared the allele frequency differences between cases and controls in pools (which are implicitly large because the SNPs were selected for individual genotyping) with the differences between cases and controls in the individual genotyping, we found a 58% correlation. This indicates a high level of concordance between the pooled and individual genotyping results; thus, the pooled genotyping was successful in identifying SNPs that would show allele frequency differences in individually genotyped case and control subjects.

Figure 3
Scatter plot of the allele frequencies from pooling and individual genotyping from the Stage I sample.

Lastly, we examined potential differences between the U.S. and Australian samples. A comparison of cases and controls from the two populations did not show any significant differences by gender or stratification results.


Smoking contributes to the morbidity and mortality of a large component of the population, and twin studies provide strong evidence that genetic factors contribute substantially to the risk of developing nicotine dependence. This is the first high density, genome wide association study with the goal to identify common susceptibility or resistance gene variants for nicotine dependence.

Several novel genes were identified in this study as potential contributors to the development of nicotine dependence, such as Neurexin 1 (NRXN1). There were at least two signals in NRXN1 (see Table 2). The SNP rs10490162 is weakly correlated with the other two SNPs that were genotyped in the gene (maximum pairwise correlation is r² = 0.45 with the other two SNPs, which were found to be in strong disequilibrium with each other). Interestingly, another neurexin gene, Neurexin 3 (NRXN3), was reported as a susceptibility gene for polysubstance addiction in a pooled genome wide association study by Uhl and colleagues (18). In addition, the most significant SNP in NRXN3 in our study, rs2221299, had a p-value of 0.0034. While there was substantially less evidence for association with NRXN3 in our study, the fact that two independent studies of substance dependence found evidence of association with neurexin genes merits further investigation.

Table 2
All SNPs individually genotyped in the genes NRXN1 and VPS13A

The neurexin gene family is a group of polymorphic cell surface proteins expressed primarily in neurons that function in cell-cell interactions and are required for normal neurotransmitter release (19). Neurexins are important factors in GABAergic and glutamatergic synapse genesis and are the only known factors reported to induce GABAergic postsynaptic differentiation. NRXN1 and NRXN3 are among the largest known human genes, and they utilize at least two promoters and alternatively spliced exons to produce thousands of distinct mRNA transcripts and protein isoforms. It is hypothesized that differential expression of neurexin isoforms by GABAergic and glutamatergic neurons contributes to the local induction of postsynaptic specialization. Because substance dependence is modeled as a relative imbalance of excitatory and inhibitory neurotransmission (or related to “disinhibition”) (20), the neurexin genes are plausible new candidate genes that contribute to the neurobiology of dependence through the regulated choice between excitatory or inhibitory pathways. Biological characterization of these genes may define a role of neural development or neurotransmitter release in dependence.

This study also identified a vacuolar sorting protein, VPS13A, as a potential contributor to nicotine dependence. Interestingly, three independent genetic linkage studies of smoking (11-13) identified a region on chromosome 9 near this gene. This gene appears to control the cycling of proteins through the cell membrane, and there are numerous alternative transcripts. Variants in the VPS13A gene cause progressive neurodegeneration and red cell acanthocytosis (21). Another novel gene for further study is TRPC7, a transient receptor potential canonical (TRPC) channel gene that encodes a subunit of multimeric calcium channels (22). A recent study using an animal model indicated that TRPC channels can functionally regulate nicotine-induced neuronal activity in the locomotion circuitry (23).

There are several other genes tagged by the top SNPs. An alpha catenin gene, CTNNA3, inhibits Wnt signaling and has variants that affect the levels of plasma amyloid beta protein (Abeta42) in Alzheimer’s disease families (24), though other reports fail to find an association with Alzheimer’s disease (25). The CLCA1 gene encodes a calcium-activated chloride channel that may contribute to the pathogenesis of asthma (26) and chronic obstructive pulmonary disease (27). While none of these genes has a known relationship to nicotine metabolism or mechanism of action, they are involved in brain and lung function and therefore have plausible biological relationships to smoking behavior and dependence. Replication of these findings and additional biological characterization of these variants and genes may solidify these proposed links.

In addition to the novel genes implicated in the genome wide association study, a classic candidate gene, the β3 nicotinic receptor (CHRNB3), is among the top group. The nicotinic receptors are a family of ligand-gated ion channels that mediate fast signal transmission at synapses. Nicotine is an agonist of these receptors, which produce physiological responses.

The SNPs were tested for varying gender effects as part of the primary analytic model. Several of the top SNPs had significantly different odds ratios for men and women (Table 1). It is clear from epidemiological data that there are significant gender differences in the risk for the development of dependence, and this study provides evidence that separate genes may contribute to the development of nicotine dependence in men and women. Following the primary analyses, we further analyzed the top ranked SNPs to determine if there was evidence for other modes of transmission, such as recessive or dominant models. There was no evidence for improvement in the fit for either of these models for any of the SNPs in the top group.

The maximum effect size for these top associated SNPs is an odds ratio of 2.53. These estimates are likely to be overestimates of the true population values due to the “jackpot effect” of many multiple comparisons. Several alternatives exist for correction of these estimates, but have not been applied to these data. The effect size estimates are consistent with multiple genes of modest effect contributing to the development of dependence.

This genome wide association study is a first step in a large-scale genetic examination of nicotine dependence. Our analytic plan was determined a priori so that we would be able to interpret the results most clearly. We purposefully chose to examine the entire sample as the primary analysis, rather than use a split sample design because we felt that this had the greatest power to detect true findings (28). Though we have evidence of true results in this study, confirmation in an independent sample is crucial.

Many other issues will need to be addressed in the future examination of these data. For example, smoking and nicotine dependence are correlated with many other disorders, such as alcohol dependence and major depressive disorder (29-32). Preliminary analyses of our sample have confirmed that this clustering of other disorders with nicotine dependence is present in our sample. In addition, nicotine dependence can be defined by other measures, such as the American Psychiatric Association criteria in the Diagnostic and Statistical Manual, Version IV (DSM-IV) (33). Previous work has shown that though different measures of nicotine dependence are correlated, there is not perfect overlap because the FTND and DSM-IV definitions focus on different features of dependence (34). The FTND is a measure that focuses on physiological dependence, whereas the DSM-IV dependence includes cognitive and behavioral aspects of dependence. Different classification by FTND and DSM-IV nicotine dependence is also seen in our sample with 75% of our cases (FTND ≥ 4) and 24% of our controls (FTND=0) affected with DSM-IV nicotine dependence. As we move forward with additional analyses, which will include comorbid disorders and varying definitions of nicotine dependence, we hope to explicate some of the individual features that contribute to these findings of association.

In summary, efforts to understand nicotine dependence are important so that new approaches can be developed to reduce tobacco use, especially cigarette smoking. This systematic survey of the genome nominates novel genes, such as NRXN1, that increase an individual’s risk of transitioning from smoking to nicotine dependence. The continued genetic and biological characterization of these genes will help in understanding the underlying causality of nicotine dependence and may provide novel drug development targets for smoking cessation. These variants also may be involved in addictive behavior in general. The current pharmacological treatments for nicotine dependence continue to produce only limited abstinence success, and the tailoring of medications to promote smoking cessation to an individual’s genetic background may significantly increase the efficacy of treatment. Our work is part of an emerging body of knowledge that may facilitate personalized approaches in the practice of medicine through large-scale study of genetic variants. Novel targets can now be studied and hopefully will facilitate the development of improved treatment options to alleviate this major health burden and reduce smoking related deaths.

Materials and Methods

The purpose of this study was to identify genes contributing to the progression from smoking to the development of nicotine dependence. As a result, the study examined the phenotypic contrast between nicotine dependent subjects and individuals who smoked but never developed nicotine dependence.


All subjects (1050 cases and 879 controls) were selected from two ongoing studies: the Collaborative Genetic Study of Nicotine Dependence, a United States based sample (St. Louis, Detroit, and Minneapolis), and the Nicotine Addiction Genetics study, an Australian based, European-Ancestry sample. The United States sample was recruited through telephone screening of community based subjects to determine eligibility for recruitment as case (current FTND ≥ 4) or control status. Qualifying subjects were invited to participate in the genetic study. The Australian participants were enrolled at the Queensland Institute of Medical Research as families and spouses of the Australian Twin Panel.

The Institutional Review Board approved both studies, and all subjects provided informed consent to participate. Blood samples were collected from each subject for DNA analysis and submitted together with electronic phenotypic data to the NIDA Center for Genetic Studies, which manages the sharing of research data in accordance with NIH guidelines. All subjects were self-identified as being of European descent. See Table 3 for further demographic details.

Table 3
Distribution of sex, age, FTND score, and recruitment site in cases and controls

Phenotype Data

Equivalent assessments were performed at both sites. A personal interview that comprehensively assessed nicotine dependence using several different criteria such as the Fagerström Test for Nicotine Dependence (35) and the Diagnostic and Statistical Manual of Mental Disorders-IV (33) was administered.

Case Definitions of Nicotine Dependence

The focus of this study was a case-control design of unrelated individuals for a genetic association study of nicotine dependence. Cases were defined by a commonly used definition of nicotine dependence, a Fagerström Test for Nicotine Dependence (FTND) score of 4 or more when smoking the most (maximum score of 10) (35). No significant difference was observed in FTND score between the U.S. and Australian samples (mean FTND: 6.43 for U.S. and 6.06 for Australian cases).

Control Definitions

Control subject status was defined as an individual who smoked (defined by smoking at least 100 cigarettes during their lifetime), yet never became dependent (lifetime FTND=0). Historically, the threshold of smoking 100 or more cigarettes has been used in survey research as a definition of a “smoker”. With the selection of controls who smoked, the study focused on those genetic effects related to the transition from smoking to the development of nicotine dependence. Additional data from the Australian twin panels support this designation of control status. Among monozygotic twins who smoked, the rate of nicotine dependence, defined as a score of 4 or more using the Heavy Smoking Index (HSI, an abbreviated version of the FTND) (36), was lowest in those whose co-twin had an HSI score of 0; lower even than in those whose co-twin had experimented with cigarettes, but never became a smoker, or those whose co-twin had never smoked even a single cigarette (see Table 4).

Table 4
Prevalence of nicotine dependence in monozygotic twins

DNA Preparation

DNA was extracted from whole blood and EBV transformed cell lines and was aliquoted and stored frozen at -80°C until distributed to the genotyping labs.

Study Design

To allow the efficient, rapid, and cost-effective screening of over 2.4 million SNPs, we performed a whole genome association study using a two-stage design.

Stage I - Pooled Genotyping High-density Oligonucleotide Genotyping Arrays

In Stage I, 482 case and 466 control DNA samples from U.S. and Australian subjects of European ancestry were selected for study. To examine potential population stratification, we performed a STRUCTURE analysis (37) using 295 individually genotyped SNPs. The selected SNPs were roughly evenly spaced across the autosomes and were selected for stratification analyses (38). The STRUCTURE program identifies subpopulations of individuals who are genetically similar through a Markov chain Monte Carlo sampling procedure using markers selected across the genome. There was no evidence of population admixture. Cases and controls were then placed in pools for genotyping of 2.4 million SNPs, and estimates of allele frequency differences between case and control pools were determined.

Pooled genotyping was performed using 8 case and 8 control pools. DNA was quantified using Pico Green. The concentrations were normalized and verified to within a coefficient of variation of < 10%. Equimolar amounts of DNA from approximately 60 individuals were placed into each of the 16 pools. An individual’s sample was included in only one pool. The 16 pools were hybridized to 49 chip designs to interrogate 2,427,354 SNPs across the whole genome.

Determination of Pooled Allele Frequency Estimates

Allele frequencies were approximated using the intensities collected from the high-density oligonucleotide arrays. A SNP’s allele frequency p was a ratio of the relative amount of the DNA with reference allele to the total amount of DNA, and thus can have values between 0 and 1:

$$p = \frac{C_{Ref}}{C_{Ref} + C_{Alt}}$$

where C_Ref and C_Alt are the concentrations of reference allele and alternate allele, respectively. As probe intensities were directly related to the concentrations of the SNP alleles, the p^ computed from the intensities of reference and alternate features was a good approximation of the true allele frequency p. The p^ value was computed from the trimmed mean intensities of perfect match features, after subtracting a measure of background computed from trimmed means of intensities of mismatch features:




I_TM was the trimmed mean of perfect match or mismatch intensities for a given allele and strand denoted by the subscript. The trimmed mean disregarded the highest and the lowest intensity from the 5 perfect match intensities and also from the 5 mismatch intensities in the 40-feature tilings before computing the arithmetic mean.
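As an illustration of the trimmed mean and the background-subtracted frequency estimate, the following minimal sketch handles a single strand. The exact way the original analysis combined strands and background terms is not recoverable from the text, so the form of `estimate_p_hat` (reference signal divided by the sum of reference and alternate signals, each floored at zero) is an assumption, as are all function and argument names.

```python
def trimmed_mean(intensities):
    """Arithmetic mean after dropping the single highest and single lowest value."""
    if len(intensities) < 3:
        raise ValueError("need at least 3 intensities to trim both extremes")
    kept = sorted(intensities)[1:-1]
    return sum(kept) / len(kept)


def estimate_p_hat(pm_ref, mm_ref, pm_alt, mm_alt):
    """Assumed form: background-subtracted reference fraction for one strand.

    Each argument is a list of 5 feature intensities (perfect match or
    mismatch, reference or alternate allele). Returns None when no signal
    remains after background subtraction.
    """
    ref_signal = max(trimmed_mean(pm_ref) - trimmed_mean(mm_ref), 0.0)
    alt_signal = max(trimmed_mean(pm_alt) - trimmed_mean(mm_alt), 0.0)
    total = ref_signal + alt_signal
    if total == 0:
        return None
    return ref_signal / total


# Made-up intensities: one outlier in each perfect match set is trimmed away.
example_p_hat = estimate_p_hat(
    pm_ref=[900, 1000, 1100, 1200, 5000],
    mm_ref=[100, 100, 100, 100, 100],
    pm_alt=[300, 350, 400, 450, 50],
    mm_alt=[100, 100, 100, 100, 100],
)
```

Trimming keeps a single saturated or failed feature from dominating the estimate, which is presumably why it was preferred over a plain mean.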

Three quality control metrics were developed to assess the reliability of the intensities for a SNP on an array scan. The first metric, concordance, evaluated the presence of a target for a SNP. The second metric, signal to background ratio, related the amount of specific and non-specific binding, estimated from the intensities of perfect match and mismatch features. The third metric tracked the number of features in each SNP tiling that had saturated intensities. Cutoffs were applied to all three metrics, and SNP feature sets that did not pass were discarded from further evaluation.

Concordance was computed independently for both reference and alternate allele feature sets, then a maximum was taken of the two values. For each allele at each offset for both the forward and reverse strand feature sets, the identity of the brightest feature was noted. The concordance for a particular allele was computed as a ratio of the number of times the perfect match feature was the brightest to the total number of offsets over the forward and reverse strands. In the 40 feature SNP tiling each allele was represented by 20 features, distributed along 5 offsets and forward and reverse strands. If N_PM^X was the number of times for allele X when the perfect match feature was brighter than the mismatch feature over all offsets and both strands, then:

$$\text{concordance} = \max\left(\frac{N_{PM}^{Ref}}{10}, \frac{N_{PM}^{Alt}}{10}\right)$$

SNP feature sets with concordance < 0.9 were discarded from further evaluation.

Signal to background ratio was the ratio between the amplitude of signal, computed from trimmed means of perfect match feature intensities, and amplitude of background, computed from trimmed means of mismatch feature intensities. The signal and background were computed as follows:


The trimmed mean intensities I_TM for both the perfect match and mismatch feature sets were obtained as described above. SNP feature sets with signal/background < 1.5 were discarded from further evaluations.

The number of saturated features was computed as the number of features that reached the highest intensity possible for the digitized numeric intensity value. SNPs with number of saturated features > 0 were discarded from further evaluations.
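The three cutoffs above can be sketched as a simple filter. The thresholds (concordance below 0.9, signal/background below 1.5, any saturated feature) come from the text; the function names and the pairing of perfect match and mismatch intensities by position are illustrative assumptions, and the signal-to-background ratio is taken as a precomputed input because its formula is not given here.

```python
def allele_concordance(pm, mm):
    """Fraction of PM/MM pairs (one per offset and strand) where PM is brighter."""
    if len(pm) != len(mm) or not pm:
        raise ValueError("pm and mm must be non-empty and of equal length")
    return sum(p > m for p, m in zip(pm, mm)) / len(pm)


def snp_concordance(pm_ref, mm_ref, pm_alt, mm_alt):
    """Maximum of the reference and alternate allele concordances."""
    return max(allele_concordance(pm_ref, mm_ref), allele_concordance(pm_alt, mm_alt))


def passes_qc(concordance, signal_to_background, n_saturated):
    """Apply the three cutoffs stated in the text to one SNP feature set."""
    return concordance >= 0.9 and signal_to_background >= 1.5 and n_saturated == 0


# Made-up example: 10 PM/MM pairs per allele (5 offsets x 2 strands).
ref_pm = [500] * 10
ref_mm = [100] * 9 + [600]  # PM brighter in 9 of 10 pairs
alt_pm = [120] * 10
alt_mm = [100] * 5 + [200] * 5  # PM brighter in 5 of 10 pairs
conc = snp_concordance(ref_pm, ref_mm, alt_pm, alt_mm)
keep = passes_qc(conc, signal_to_background=2.3, n_saturated=0)
```

Taking the maximum over the two alleles means a SNP where the pools are nearly fixed for one allele is not penalized for a weak signal from the other.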

Stage II SNP Selection

Test to identify conforming Linkage Disequilibrium blocks

Linkage disequilibrium (LD) blocks comprising SNPs with significantly consistent effect were identified using a t-test. The t-test p-values were computed from t-score values that were the average delta p^ across all SNPs in an LD block divided by the standard error of the delta p^. The null hypothesis was that all delta p^ values across an LD block equaled 0. A significant two-sided t-test indicated that the delta p^ values were different from 0, and thus the SNPs in the LD block showed consistent effect. This test was used to bias the SNP selection towards SNPs for which the estimated allele frequency differences conform to the known SNP correlational structure (37) (see SNP selection criterion 1, below).
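A minimal sketch of the block-level t-score follows. The text does not say how the standard error of delta p^ was obtained, so this example assumes the ordinary standard error of the mean across the SNPs in a block; `ld_block_t_score` is a hypothetical name, and converting the score to a p-value (which needs the t distribution with n - 1 degrees of freedom) is left out.

```python
from math import sqrt
from statistics import mean, stdev


def ld_block_t_score(delta_p_hats):
    """Mean case-control delta p^ across a block divided by its standard error.

    Assumes the standard error is the sample standard deviation of the
    per-SNP delta p^ values divided by sqrt(number of SNPs).
    """
    n = len(delta_p_hats)
    if n < 2:
        raise ValueError("need at least 2 SNPs in the block")
    standard_error = stdev(delta_p_hats) / sqrt(n)
    if standard_error == 0:
        raise ValueError("delta p^ values are identical; t-score undefined")
    return mean(delta_p_hats) / standard_error


# Differences that agree in sign and size give a large score.
t_consistent = ld_block_t_score([0.04, 0.05, 0.06])
# Differences that alternate in sign cancel out.
t_inconsistent = ld_block_t_score([0.05, -0.05, 0.05, -0.05])
```

The contrast between the two calls shows why the test favors blocks whose SNPs point the same way: sampling noise at single SNPs tends to average out, while a shared signal does not.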

Computation of empirical p-values to evaluate each SNP’s association independently

Corrected t-test p-values were computed similarly to regular t-test p-values. For testing of the difference between average case p^ and average control p^, the standard error was corrected by a chip design-specific additive constant. The additive constant was obtained by minimizing the coefficient of variation of the t-tests for each chip design. This standard error additive constant ensured that SNP selection was not biased to low or high standard errors, as there was no prior evidence that SNPs with low or high standard errors were more or less likely to be associated with the phenotype. The empirical p-values were computed from ranks of the corrected t-test p-values for each chip design by dividing the rank by the total number of passing SNPs on the chip design. See Figure 3 for a distribution of standard errors.
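The rank-based empirical p-value described above can be sketched in a few lines. The function name is ours, and ties are broken by input order here, which is an assumption since the text does not say how ties were handled.

```python
def empirical_p_values(corrected_p):
    """Rank each corrected p-value within its chip design and divide by the count.

    The smallest corrected p-value gets rank 1, so its empirical p-value is
    1 / N, where N is the number of passing SNPs on the chip design.
    """
    n = len(corrected_p)
    order = sorted(range(n), key=lambda i: corrected_p[i])
    empirical = [0.0] * n
    for rank, index in enumerate(order, start=1):
        empirical[index] = rank / n
    return empirical


# Four passing SNPs on one hypothetical chip design.
example_empirical = empirical_p_values([0.20, 0.001, 0.05, 0.50])
```

Because the result depends only on ranks within a chip design, a cutoff such as 0.018 selects the same fraction of SNPs from every design regardless of how that design's t-tests are scaled.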

SNP selection criteria

The SNPs were selected from among SNPs that had at least 2 passing p^ values for cases and controls. Selected SNPs uniquely mapped onto human genome build 35 and had successfully designed assays. An initial significance threshold was set for the selection of SNPs where p^ differences were consistent with LD structure. For SNPs with lower consistency measures, we required a more significant p^ difference between cases and controls for entry into the individual genotyping stage of the study. The following criteria were used:

  1. Conforming LD blocks that contained SNPs with delta p^ values consistent with the LD structure were selected with p-value cutoff 0.05. A representative SNP with the most significant empirical p-value was selected from each of these conforming LD blocks. In addition, representative SNPs that had an empirical p-value < 0.020 were selected.
  2. This selection criterion was applied only to those SNPs that passed our internal quality control criteria in an independent study (39). From this set we selected SNPs that had an empirical p-value < 0.018, and were not selected using criterion 1. The empirical p-value cutoff 0.018 that was used for this selection corresponded to a cutoff that would select 40,000 SNPs if applied to all passing SNPs, and therefore represented a neutral significance cutoff.
  3. Only SNPs that did not pass our internal QC criteria in an independent study (39) were included in this selection. From this set we selected SNPs using an empirical p-value cutoff of 0.014. This slightly more stringent cutoff than criterion 2 was used because the pooled genotyping estimates of allele frequency differences were expected to be less reliable for this category of SNPs. In addition, the relative cutoffs for categories 2 and 3 preserve a similar ratio between these SNP sets as seen in the other study (39).
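The threshold logic of criteria 2 and 3 can be sketched as below. Criterion 1 depends on LD block membership and representative-SNP selection, which this sketch does not model; it is passed in as a precomputed flag. The function and parameter names are hypothetical.

```python
def selection_criterion(empirical_p, passed_external_qc, selected_by_ld_block=False):
    """Return 1, 2, or 3 for the criterion that selects a SNP, or None.

    selected_by_ld_block: True if the SNP was already chosen under criterion 1.
    passed_external_qc: True if the SNP passed QC in the independent study (39).
    """
    if selected_by_ld_block:
        return 1
    if passed_external_qc:
        # Criterion 2: SNPs with a good QC history use the neutral cutoff.
        return 2 if empirical_p < 0.018 else None
    # Criterion 3: SNPs with a poor QC history need a more stringent cutoff.
    return 3 if empirical_p < 0.014 else None
```

For example, a SNP with an empirical p-value of 0.016 would be selected under criterion 2 if it passed QC in the independent study, but not selected if it had failed.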

Stage II Individual genotyping

For individual genotyping, we designed a custom array to interrogate 41,402 SNPs that included SNPs selected from the pooled genotyping (39,213) and stratification and quality control SNPs (2,189). In Stage II, we performed individual genotyping on the original case and control samples and additional case and control subjects of European descent, for a final sample size of 1,929 individuals (1,050 cases and 879 controls).

Individual genotypes were determined by clustering all SNP scans in the 2-dimensional space defined by reference and alternate perfect match trimmed mean intensities. Trimmed mean intensities were computed as described above in section “Determination of Pooled Allele Frequency Estimates”. The genotype clustering procedure was an iterative algorithm developed as a combination of K-means and constrained multiple linear regressions. The K-means at each step reevaluated the cluster membership representing distinct diploid genotypes. The multiple linear regressions minimized the variance in p^ within each cluster while optimizing the regression lines’ common intersect. The common intersect defined a measure of common background that was used to adjust the allele frequencies for the next step of K-means. The K-means and multiple linear regression steps were iterated until the cluster membership and background estimates converged. The best number of clusters was selected by maximizing the total likelihood over the possible cluster counts of 1, 2 and 3 (representing the combinations of the 3 possible diploid genotypes). The total likelihood was composed of data likelihood and model likelihood. The data likelihood was determined using a normal mixture model for the distribution of p^ around the cluster means. The model likelihood was calculated using a prior distribution of expected cluster positions, resulting in optimal p^ positions of 0.8 for the homozygous reference cluster, 0.5 for the heterozygous cluster and 0.2 for the homozygous alternate cluster.
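
As a rough illustration of the K-means component only, the sketch below clusters per-sample p^ values in one dimension, starting from the prior cluster positions of 0.8, 0.5 and 0.2 given above. It is a deliberate simplification: the constrained regressions, the background adjustment, and the likelihood-based choice among 1, 2 or 3 clusters are all omitted, and the function and constant names are hypothetical.

```python
PRIOR_CENTERS = (0.8, 0.5, 0.2)          # hom. reference, heterozygous, hom. alternate
GENOTYPES = ("hom_ref", "het", "hom_alt")


def kmeans_1d(values, centers=PRIOR_CENTERS, max_iter=100):
    """Plain 1-D K-means on p^ values; returns (labels, final centers)."""
    centers = list(centers)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each sample joins the nearest cluster center.
        new_labels = [
            min(range(len(centers)), key=lambda k: abs(v - centers[k]))
            for v in values
        ]
        # Update step: each center moves to the mean of its members;
        # an empty cluster keeps its previous position.
        new_centers = []
        for k in range(len(centers)):
            members = [v for v, lab in zip(values, new_labels) if lab == k]
            new_centers.append(sum(members) / len(members) if members else centers[k])
        converged = new_labels == labels and new_centers == centers
        labels, centers = new_labels, new_centers
        if converged:
            break
    return labels, centers


# Example: seven samples for one SNP.
labels, centers = kmeans_1d([0.82, 0.79, 0.51, 0.48, 0.22, 0.18, 0.81])
calls = [GENOTYPES[lab] for lab in labels]
```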

A genotyping quality metric was compiled for each genotype from 15 input metrics that described the quality of the SNP and the genotype. The genotyping quality metric correlated with a probability of having a discordant call between the Perlegen platform and outside genotyping platforms (i.e., non-Perlegen HapMap project genotypes). A system of 10 bootstrap aggregated regression trees was trained using an independent data set of concordance data between Perlegen genotypes and HapMap project genotypes. The trained predictor was then used to predict the genotyping quality for each of the genotypes in this data set.
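
The bootstrap aggregation idea can be illustrated with a small pure-Python sketch. It is not the Perlegen predictor: each "tree" here is a depth-1 regression stump on a single hypothetical input metric, whereas the text describes 15 input metrics and does not specify the tree depth. All names and the toy data are assumptions made for illustration.

```python
import random


def fit_stump(xs, ys):
    """Depth-1 regression tree: choose the split minimizing squared error."""
    best = None
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # all x identical: predict the overall mean
        mean = sum(ys) / len(ys)
        return (float("inf"), mean, mean)
    return best[1:]


def fit_bagged_stumps(xs, ys, n_trees=10, seed=0):
    """Train n_trees stumps, each on a bootstrap resample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees


def predict(trees, x):
    """Average the predictions of all stumps."""
    return sum(lm if x < t else rm for t, lm, rm in trees) / len(trees)


# Toy data: discordance (1 = discordant call) is more common at low metric values.
metric = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
discordant = [1, 1, 1, 0, 0, 0, 0, 0]
model = fit_bagged_stumps(metric, discordant)
```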

Hardy Weinberg Equilibrium

Hardy Weinberg Equilibrium (HWE) was tested separately for cases and controls. SNPs that deviated from HWE at a p-value < 10⁻¹⁵ in either cases or controls were discarded. There were 859 and 797 autosomal SNPs excluded because of this extreme disequilibrium in cases and controls, respectively, and 765 of these SNPs were common to both groups. This level of deviation from HWE indicates issues with SNP genotyping and clustering. Because association with the phenotype can result in SNPs not being in HWE, SNPs with HWE p-values between 10⁻⁴ and 10⁻¹⁵ were visually inspected, and where problems with clustering were detected, the SNP was discarded from further analysis. This resulted in 31,960 SNPs available for analysis.
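
The filtering logic can be sketched as follows. The text does not state which HWE test was used, so this example assumes a standard 1-degree-of-freedom chi-square goodness-of-fit test; the function names and the "discard" / "inspect" / "keep" labels are illustrative only.

```python
import math

DISCARD_BELOW = 1e-15   # extreme deviation: SNP discarded outright
INSPECT_BELOW = 1e-4    # intermediate deviation: clusters inspected visually


def hwe_pvalue(n_ref_hom, n_het, n_alt_hom):
    """Chi-square (1 df) goodness-of-fit p-value for HWE from genotype counts."""
    n = n_ref_hom + n_het + n_alt_hom
    p = (2 * n_ref_hom + n_het) / (2 * n)      # reference allele frequency
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_ref_hom, n_het, n_alt_hom)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def hwe_action(case_counts, control_counts):
    """Apply the thresholds to the worse of the case and control p-values."""
    worst = min(hwe_pvalue(*case_counts), hwe_pvalue(*control_counts))
    if worst < DISCARD_BELOW:
        return "discard"
    if worst < INSPECT_BELOW:
        return "inspect"
    return "keep"
```

For example, `hwe_action((250, 500, 250), (300, 400, 300))` falls in the "inspect" range, because the control counts show a heterozygote deficit with a chi-square statistic of 40.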

Population Stratification

To avoid false positive results due to cryptic population stratification in the larger sample, we repeated the STRUCTURE analysis (37) in the expanded sample of 1929 subjects using genotype data for 289 well-performing SNPs (38). This again revealed no evidence of population admixture. Additionally, the non-inflated Q-Q plot of test statistics in the Stage II only samples (Figure 5) indicates a lack of population admixture correlated with case control status.

Figure 5
Q-Q plot of logistic regression ANOVA deviance produced from samples added to Stage I samples at Stage II. Because these samples are independent of Stage I samples used for the SNP selection from pooled genotyping the test statistic is expected to largely ...

Covariate analysis

The covariates available for individuals were sex, age, site (U.S. or Australia) and sample (first or second). Prior to performing genetic analyses, inspection of the data indicated that gender and recruitment site were important predictors of case and control status; both were therefore included as covariates in the logistic regression model.

Genetic association

We developed an a priori analytic strategy so that we could then interpret our results and avoid issues of multiple testing from using varying methods of analysis. We chose to examine the total sample of 1929 individuals in the primary analysis because this had the greatest power to detect true findings (28). For our primary single SNP association analyses, we used logistic regression to incorporate the significant covariates sex and site (U.S., Australia), and tested the effect of genotype together with a genotype-by-sex interaction term using a standard likelihood-ratio chi-squared statistic with 2 degrees of freedom. This approach allowed us to detect SNPs having gender-specific effects as well as SNPs with similar effects in males and females. For these primary analyses, we coded genotype according to the number of “risk” alleles (0, 1 or 2) where the risk allele was defined to be the allele having higher frequency in cases than in controls. This coding was additive on the log scale and thus corresponded to a multiplicative genetic model. Specifically, the full model was compared to a reduced model including gender and recruitment site only. The resulting p-values were used to rank the SNPs.
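
The genotype coding and the 2-degree-of-freedom likelihood-ratio test can be sketched as below. This is a minimal illustration, not the authors' code: genotypes are assumed to be two-character strings such as "AG", sex and site are assumed to be coded 0/1, the helper names are hypothetical, and the maximized log-likelihoods of the full and reduced models are assumed to come from any standard logistic regression routine.

```python
import math


def allele_frequency(genotypes, allele):
    """Frequency of `allele` among diploid genotypes such as 'AG'."""
    return sum(g.count(allele) for g in genotypes) / (2 * len(genotypes))


def pick_risk_allele(case_genotypes, control_genotypes, alleles):
    """Risk allele = the allele with higher frequency in cases than in controls."""
    a, b = alleles
    diff = allele_frequency(case_genotypes, a) - allele_frequency(control_genotypes, a)
    return a if diff > 0 else b


def design_row(genotype, risk_allele, sex, site):
    """Full-model predictors: intercept, sex, site, risk allele count, count x sex."""
    g = genotype.count(risk_allele)
    return [1.0, float(sex), float(site), float(g), float(g * sex)]


def lrt_pvalue_2df(loglik_full, loglik_reduced):
    """Likelihood-ratio statistic and its chi-square p-value with 2 df.

    For 2 degrees of freedom the chi-square survival function is exp(-x / 2).
    """
    stat = max(2.0 * (loglik_full - loglik_reduced), 0.0)
    return stat, math.exp(-stat / 2.0)
```

The reduced model uses only the first three entries of each design row (intercept, sex, site), so the two extra parameters in the full model give the 2 degrees of freedom.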

Following these primary analyses, we further analyzed the top ranked SNPs to determine if there was significant evidence for alternative modes of transmission such as dominant or recessive models.
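
The alternative models differ only in how the risk allele count enters the regression. A minimal sketch using the conventional definitions of dominant and recessive coding (the function name is hypothetical, and the text does not spell out the exact coding used):

```python
def recode_genotype(risk_allele_count, model):
    """Recode a risk allele count (0, 1 or 2) for a given genetic model."""
    if risk_allele_count not in (0, 1, 2):
        raise ValueError("risk allele count must be 0, 1 or 2")
    if model == "additive":      # primary analysis: additive on the log scale
        return risk_allele_count
    if model == "dominant":      # one or two risk alleles carry the same effect
        return 1 if risk_allele_count >= 1 else 0
    if model == "recessive":     # only two risk alleles carry an effect
        return 1 if risk_allele_count == 2 else 0
    raise ValueError("unknown model: " + model)
```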

Figure 4
Plot of distributions of standard errors of SNPs selected using different criteria. The plot illustrates that delta p^ cutoff selects preferentially SNPs with high standard errors of delta p^, regular t-test preferentially selects SNPs with low standard ...


The authors wish to acknowledge the contributions of advisors to this project. The NIDA Genetics Consortium, with Jonathan Pollock, and NICSNP committees were vital to the success of the research. The Data Analysis Committee helped oversee analyses for the genome wide association studies and investigated methodological issues in association analyses. Further, the committee assisted in data management and data sharing functions. In addition to the authors, committee members included Andrew Bergen, Gerald Dunn, Mary Jeanne Kreek, Huijun Ring, Lei Yu, and Hongyu Zhao. At Perlegen Sciences, we would like to acknowledge the work of Laura Stuve, Curtis Kautzer, the genotyping laboratory, Laura Kamigaki, the sample group, and John Blanchard, Geoff Nilsen, and the bioinformatics and data quality groups for excellent technical and infrastructural support for this work performed under NIDA Contract HHSN271200477471C. This work is supported by NIH grants CA89392 from the National Cancer Institute, DA12854 and DA015129 from the National Institute on Drug Abuse, and the contract N01DA-0-7079 from NIDA. We greatly appreciate the assistance in manuscript preparation from Sherri Fisher. In memory of Theodore Reich, founding Principal Investigator of COGEND; we are indebted to his leadership in the establishment of COGEND, and acknowledge his seminal scientific contributions to the field.

Data Access: Phenotypes and genotypes are available through the NIDA Genetics Consortium to the scientific community at the time of publication.

Conflict of Interest Statement

Dennis G. Ballinger and Karel Konvicka are employed by Perlegen Sciences, Inc. With the exception of D. Ballinger and K. Konvicka, none of the authors or their immediate families are currently involved with, or have been involved with, any companies, trade associations, unions, litigants, or other groups with a direct financial interest in the subject matter or materials discussed in this manuscript in the past five years.


1. WHO. The facts about smoking and health. 2006.
2. CDC. Annual smoking-attributable mortality, years of potential life lost, and productivity losses--United States, 1997-2001. Morbidity & Mortality Weekly Report. 2005;54:625–628.
3. CDC. Cigarette smoking among adults--United States, 2004. Morbidity & Mortality Weekly Report. 2005;54:1121–1124.
4. CDC. Cigarette use among high school students--United States, 1991-2003. Morbidity & Mortality Weekly Report. 2004;53:499.
5. Bierut LJ, Dinwiddie SH, Begleiter H, Crowe RR, Hesselbrock V, Nurnberger JI, Jr., Porjesz B, Schuckit MA, Reich T. Familial transmission of substance dependence: alcohol, marijuana, cocaine, and habitual smoking: a report from the Collaborative Study on the Genetics of Alcoholism. Arch. Gen. Psychiatry. 1998;55:982–988.
6. Carmelli D, Swan GE, Robinette D, Fabsitz R. Genetic influence on smoking--a study of male twins. N. Engl. J. Med. 1992;327:829–833.
7. Heath AC, Martin NG. Genetic models for the natural history of smoking: evidence for a genetic influence on smoking persistence. Addict. Behav. 1993;18:19–34.
8. True WR, Xian H, Scherrer JF, Madden PA, Bucholz KK, Heath AC, Eisen SA, Lyons MJ, Goldberg J, Tsuang M. Common genetic vulnerability for nicotine and alcohol dependence in men. Arch. Gen. Psychiatry. 1999;56:655–661.
9. Madden PA, Heath AC, Pedersen NL, Kaprio J, Koskenvuo MJ, Martin NG. The genetics of smoking persistence in men and women: a multicultural study. Behav. Genet. 1999;29:423–431.
10. Lessov CN, Martin NG, Statham DJ, Todorov AA, Slutske WS, Bucholz KK, Heath AC, Madden PA. Defining nicotine dependence for genetic research: evidence from Australian twins. Psychol. Med. 2004;34:865–879.
11. Li MD, Ma JZ, Cheng R, Dupont RT, Williams NJ, Crews KM, Payne TJ, Elston RC. A genome-wide scan to identify loci for smoking rate in the Framingham Heart Study population. BMC Genet. 2003;4(Suppl 1):S103.
12. Bierut LJ, Rice JP, Goate A, Hinrichs AL, Saccone NL, Foroud T, Edenberg HJ, Cloninger CR, Begleiter H, Conneally PM, et al. A genomic scan for habitual smoking in families of alcoholics: common and specific genetic factors in substance dependence. Am. J. Med. Genet. A. 2004;124:19–27.
13. Gelernter J, Liu X, Hesselbrock V, Page GP, Goddard A, Zhang H. Results of a genomewide linkage scan: support for chromosomes 9 and 11 loci increasing risk for cigarette smoking. Am. J. Med. Genet. B Neuropsychiatr. Genet. 2004;128:94–101.
14. Swan GE, Hops H, Wilhelmsen KC, Lessov-Schlaggar CN, Cheng LS, Hudmon KS, Amos CI, Feiler HS, Ring HZ, Andrews JA, et al. A genome-wide screen for nicotine dependence susceptibility loci. Am. J. Med. Genet. B Neuropsychiatr. Genet. 2006;141:354–360.
15. Li MD, Beuten J, Ma JZ, Payne TJ, Lou XY, Garcia V, Duenes AS, Crews KM, Elston RC. Ethnic- and gender-specific association of the nicotinic acetylcholine receptor alpha4 subunit gene (CHRNA4) with nicotine dependence. Hum. Mol. Genet. 2005;14:1211–1219.
16. Beuten J, Ma JZ, Payne TJ, Dupont RT, Crews KM, Somes G, Williams NJ, Elston RC, Li MD. Single- and multilocus allelic variants within the GABA(B) receptor subunit 2 (GABAB2) gene are significantly associated with nicotine dependence. Am. J. Hum. Genet. 2005;76:859–864.
17. Feng Y, Niu T, Xing H, Xu X, Chen C, Peng S, Wang L, Laird N. A common haplotype of the nicotine acetylcholine receptor alpha 4 subunit gene is associated with vulnerability to nicotine addiction in men. Am. J. Hum. Genet. 2004;75:112–121.
18. Liu QR, Drgon T, Walther D, Johnson C, Poleskaya O, Hess J, Uhl GR. Pooled association genome scanning: validation and use to identify addiction vulnerability loci in two samples. Proc. Natl. Acad. Sci. U. S. A. 2005;102:11864–11869.
19. Craig AM, Graf ER, Linhoff MW. How to build a central synapse: clues from cell culture. Trends Neurosci. 2006;29:8–20.
20. Iacono WG, Carlson SR, Malone SM, McGue M. P3 event-related potential amplitude and the risk for disinhibitory disorders in adolescent boys. Arch. Gen. Psychiatry. 2002;59:750–757.
21. Dobson-Stone C, Danek A, Rampoldi L, Hardie RJ, Chalmers RM, Wood NW, Bohlega S, Dotti MT, Federico A, Shizuka M, et al. Mutational spectrum of the CHAC gene in patients with chorea-acanthocytosis. Eur. J. Hum. Genet. 2002;10:773–781.
22. Zagranichnaya TK, Wu X, Villereal ML. Endogenous TRPC1, TRPC3, and TRPC7 proteins combine to form native store-operated channels in HEK-293 cells. J. Biol. Chem. 2005;280:29559–29569.
23. Feng Z, Li W, Ward A, Piggott BJ, Larkspur ER, Sternberg PW, Xu XZ. A C. elegans model of nicotine-dependent behavior: Regulation by TRP-family channels. Cell. 2006;127:621–633.
24. Ertekin-Taner N, Ronald J, Asahara H, Younkin L, Hella M, Jain S, Gnida E, Younkin S, Fadale D, Ohyagi Y, et al. Fine mapping of the alpha-T catenin gene to a quantitative trait locus on chromosome 10 in late-onset Alzheimer’s disease pedigrees. Hum. Mol. Genet. 2003;12:3133–3143.
25. Busby V, Goossens S, Nowotny P, Hamilton G, Smemo S, Harold D, Turic D, Jehu L, Myers A, Womick M, et al. Alpha-T-catenin is expressed in human brain and interacts with the Wnt signaling pathway but is not responsible for linkage to chromosome 10 in Alzheimer’s disease. Neuromolecular Med. 2004;5:133–146.
26. Jeulin C, Guadagnini R, Marano F. Oxidant stress stimulates Ca2+-activated chloride channels in the apical activated membrane of cultured nonciliated human nasal epithelial cells. Am. J. Physiol. Lung Cell. Mol. Physiol. 2005;289:L636–L646.
27. Hegab AE, Sakamoto T, Uchida Y, Nomura A, Ishii Y, Morishima Y, Mochizuki M, Kimura T, Saitoh W, Massoud HH, et al. CLCA1 gene polymorphisms in chronic obstructive pulmonary disease. J. Med. Genet. 2004;41:e27.
28. Skol AD, Scott LJ, Abecasis GR, Boehnke M. Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat. Genet. 2006;38:209–213.
29. Breslau N, Novak SP, Kessler RC. Daily smoking and the subsequent onset of psychiatric disorders. Psychol. Med. 2004;34:323–333.
30. Breslau N, Novak SP, Kessler RC. Psychiatric disorders and stages of smoking. Biol. Psychiatry. 2004;55:69–76.
31. Grant BF, Hasin DS, Chou SP, Stinson FS, Dawson DA. Nicotine dependence and psychiatric disorders in the United States: results from the national epidemiologic survey on alcohol and related conditions. Arch. Gen. Psychiatry. 2004;61:1107–1115.
32. Lasser K, Boyd JW, Woolhandler S, Himmelstein DU, McCormick D, Bor DH. Smoking and mental illness: A population-based prevalence study. JAMA. 2000;284:2606–2610.
33. American Psychiatric Association. Diagnostic and statistical manual of mental disorders. 4th ed. American Psychiatric Association; Washington DC: 1994.
34. Breslau N, Johnson EO. Predicting smoking cessation and major depression in nicotine-dependent smokers. Am. J. Public Health. 2000;90:1122–1127.
35. Heatherton TF, Kozlowski LT, Frecker RC, Fagerström KO. The Fagerström Test for Nicotine Dependence: a revision of the Fagerström Tolerance Questionnaire. Br. J. Addict. 1991;86:1119–1127.
36. Heatherton TF, Kozlowski LT, Frecker RC, Rickert W, Robinson J. Measuring the heaviness of smoking: using self-reported time to the first cigarette of the day and number of cigarettes smoked per day. Br. J. Addict. 1989;84:791–799.
37. Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000;155:945–959.
38. Hinds DA, Stokowski RP, Patil N, Konvicka K, Kershenobich D, Cox DR, Ballinger DG. Matching strategies for genetic association studies in structured populations. Am. J. Hum. Genet. 2004;74:317–325.
39. Hinds DA, Stuve LL, Nilsen GB, Halperin E, Eskin E, Ballinger DG, Frazer KA, Cox DR. Whole-genome patterns of common DNA variation in three human populations. Science. 2005;307:1072–1079.