The study population of this population-based case–control study has been described previously (10). Briefly, all residents of Xuan Wei, China, from March 1995 to March 1996 were eligible for inclusion. Lung cancer cases with clinical symptoms and X-ray confirmation were identified at one of five hospitals servicing Xuan Wei County. Of the 135 eligible cases, 133 (99%) agreed to participate. To be enrolled, cases had to be histologically (n = 14) or cytologically (n = 91) confirmed or have died within 1 year of diagnosis (n = 17), since previous studies in Xuan Wei suggest that death within 1 year of clinical diagnosis of lung cancer is a strong indicator of lung cancer diagnosis (13). Based on these criteria, 122 of 133 consenting cases (92%) were enrolled into the study.
Controls were selected from the Xuan Wei general population and were individually matched by sex, age (±2 years), village and type of fuel used for in-home cooking and heating at time of interview. The participation rate for controls was 100%. A detailed questionnaire evaluating smoking history, domestic fuel use history and other demographic information was administered by trained interviewers to cases and controls. This research protocol was approved by a United States Environmental Protection Agency Human Subjects Research Review Official, and informed consent was obtained from all study subjects.
Genotyping was performed on DNA extracted from sputum samples via phenol–chloroform extraction (14). Candidate SNPs were identified through the SNP500Cancer database (http://snp500cancer.nci.nih.gov/) and genotyped if they were potentially relevant for cancer or other human diseases, had possible functional significance or expanded gene coverage of previously identified candidate genes. High-throughput genotyping was successful for 122 (100%) cases and 111 (91%) controls with an Oligo Pool by the Illumina GoldenGate Assay (http://www.illumina.com) at the National Cancer Institute’s Core Genotyping Facility (Gaithersburg, MD). Ten controls did not have ample DNA for genotyping. Duplicate samples (n = 21) of both cases and controls were randomly distributed throughout study plates to ensure quality control and determine the intra-subject concordance rate for all assays (>98%). Initially, 1442 SNPs were genotyped. Hardy–Weinberg equilibrium for each SNP was tested in controls with a Pearson χ² test or a Fisher’s exact test if any of the cell counts were less than five. After exclusion of 166 SNPs with low minor allele frequency (<0.01) and 16 SNPs with substantial deviation from Hardy–Weinberg equilibrium (P < 0.001), 1260 SNPs in 380 genes were left for analysis.
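The SNP quality-control filter described above can be illustrated with a minimal sketch. It assumes genotype counts among controls are available as three integers; the function names and the example counts are hypothetical and are not taken from the study. Only the Pearson χ² branch is shown; the Fisher's exact test used for cell counts below five is omitted.

```python
import math


def minor_allele_frequency(n_aa, n_ab, n_bb):
    """Minor allele frequency from genotype counts (AA, AB, BB)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    return min(p, 1.0 - p)


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Pearson chi-square test (1 df) for Hardy-Weinberg equilibrium.

    Returns (chi2, p_value) computed from observed genotype counts.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def passes_qc(n_aa, n_ab, n_bb, maf_min=0.01, hwe_p_min=0.001):
    """Keep a SNP if MAF >= maf_min and HWE P >= hwe_p_min (thresholds from the text)."""
    if minor_allele_frequency(n_aa, n_ab, n_bb) < maf_min:
        return False
    _, p_value = hwe_chi_square(n_aa, n_ab, n_bb)
    return p_value >= hwe_p_min


# Hypothetical control genotype counts, for illustration only.
print(passes_qc(25, 50, 25))  # counts in exact equilibrium
print(passes_qc(50, 0, 50))   # no heterozygotes: strong deviation
```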
First, unconditional logistic regression was used to estimate the odds ratio and 95% confidence interval for the association between each SNP and lung cancer risk, using the homozygote of the common allele as the reference group and adjusting for age (<55 and ≥55 years), sex, smoking (0 pack years, >0 and <25 pack years and ≥25 pack years) and lifetime smoky coal exposure (<130 and ≥130 tons). Gene–dose effects for each SNP were estimated by a linear trend test in which genotypes were coded by the number of variant alleles present (0, 1 and 2). Interactions between genotype (dominant model) and lifetime smoky coal exposure were tested on the multiplicative scale for significant SNPs in the four significant cell cycle genes while adjusting for age, sex and smoking.
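As a rough illustration of the genotype coding and of an odds ratio with its confidence interval, the sketch below computes a crude (unadjusted) odds ratio with a Wald 95% interval from a 2×2 table. This is a simplification and not the adjusted logistic regression used in the study, which was run in SAS with covariates. The function names, the two-character genotype strings and the counts are hypothetical.

```python
import math


def additive_code(genotype, variant):
    """Number of variant alleles (0, 1 or 2) in a two-character genotype such as 'AG'."""
    return genotype.count(variant)


def dominant_code(genotype, variant):
    """1 if at least one variant allele is present, else 0 (dominant model)."""
    return 1 if variant in genotype else 0


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Crude odds ratio and Wald confidence interval from a 2x2 table.

    a: carrier cases        b: carrier controls
    c: reference cases      d: reference controls
    The reference group is the homozygote of the common allele.
    All four cells must be non-zero.
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, for illustration only.
print(additive_code("AG", "G"), dominant_code("GG", "G"))
print(odds_ratio_ci(20, 10, 10, 20))
```

Adjusting for age, sex, smoking and smoky coal exposure would require fitting a multivariable logistic model, which this sketch deliberately leaves out.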
Second, gene-based analyses were performed on 380 genes. To assess the significance of the association between each gene and lung cancer, we used MATLAB to perform a minP test that assesses the significance of the minimal P-value in each gene using a permutation-based resampling procedure (1000 permutations) that takes into account the number of SNPs genotyped in each gene and their underlying linkage disequilibrium (LD) structure (15). A gene was significantly associated with lung cancer if it had a minP ≤0.05, after adjustment for age (<55 and ≥55 years), sex, smoking (0 pack years, >0 and <25 pack years and ≥25 pack years) and lifetime smoky coal exposure (<130 and ≥130 tons). False discovery rates (FDRs) were calculated using the Benjamini–Hochberg method to evaluate the significance of the minP results within the cell cycle pathway (16).
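The logic of the minP permutation test and the Benjamini–Hochberg adjustment can be sketched as follows. This is a minimal illustration under stated assumptions, not the MATLAB implementation used in the study. The per-SNP statistic here is an unadjusted Cochran–Armitage trend χ² (computed as N·r²), swapped in for the covariate-adjusted test. Genotypes are assumed to be coded 0, 1 or 2, and all function names and data are hypothetical. Shuffling case–control labels while leaving each subject's genotype row intact is what preserves the LD structure among SNPs in a gene.

```python
import math
import random


def trend_p(x, y):
    """P-value of the Cochran-Armitage trend test, computed as chi2 = N * r^2 (1 df)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 1.0  # monomorphic SNP or a single outcome class: no evidence
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    chi2 = n * sxy * sxy / (sxx * syy)
    return math.erfc(math.sqrt(chi2 / 2.0))


def min_p(genotypes, status):
    """Smallest per-SNP P-value in a gene; genotypes is a list of per-subject rows."""
    n_snps = len(genotypes[0])
    return min(trend_p([row[j] for row in genotypes], status) for j in range(n_snps))


def gene_minp_test(genotypes, status, n_perm=1000, seed=0):
    """Permutation P-value for the observed minimal P-value in one gene."""
    rng = random.Random(seed)
    observed = min_p(genotypes, status)
    labels = list(status)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # permute case/control labels only
        if min_p(genotypes, labels) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest P-value, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In use, `gene_minp_test` would be called once per gene, and the resulting gene-level P-values for the genes in a pathway would be passed together to `benjamini_hochberg`.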
Third, haplotype blocks and structure were determined with Haploview using data from controls for the four significant genes in the cell cycle pathway with more than one SNP (17). Haplotype frequencies were estimated using the expectation–maximization algorithm (18). Haplotypes with frequencies <1% were excluded. The overall difference in haplotype frequencies between cases and controls was assessed using a global score test (19). Haplotype odds ratios and 95% confidence intervals were calculated and adjusted for age (<55 and ≥55 years), sex, smoking (0 pack years, >0 and <25 pack years and ≥25 pack years) and lifetime smoky coal exposure (<130 and ≥130 tons). A sliding window 3-SNP haplotype approach was also performed for the two significant genes in the cell cycle pathway with more than three SNPs to comprehensively evaluate potential disease loci in small genetic regions that may have been overlooked with the single-locus analysis (20).
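To make the expectation–maximization step concrete, the sketch below estimates haplotype frequencies for the simplest case of two biallelic SNPs from unphased genotypes, where only double heterozygotes have ambiguous phase. It is an illustrative toy under stated assumptions (genotypes coded as counts of the variant allele, Hardy–Weinberg proportions for haplotype pairs), not the software used in the study. The names `em_haplotype_frequencies` and `sliding_windows` are hypothetical.

```python
HAPLOTYPES = [(0, 0), (0, 1), (1, 0), (1, 1)]


def compatible_pairs(genotype):
    """Unordered haplotype pairs whose allele counts add up to the genotype."""
    pairs = []
    for i, h1 in enumerate(HAPLOTYPES):
        for h2 in HAPLOTYPES[i:]:
            if all(a + b == g for a, b, g in zip(h1, h2, genotype)):
                pairs.append((h1, h2))
    return pairs


def em_haplotype_frequencies(genotypes, n_iter=100, tol=1e-10):
    """EM estimate of two-SNP haplotype frequencies from unphased genotypes."""
    freqs = {h: 0.25 for h in HAPLOTYPES}  # uniform starting values
    for _ in range(n_iter):
        counts = {h: 0.0 for h in HAPLOTYPES}
        for g in genotypes:
            pairs = compatible_pairs(g)
            # E-step: probability of each compatible pair given current frequencies.
            weights = [(1 if h1 == h2 else 2) * freqs[h1] * freqs[h2]
                       for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected haplotype counts divided by the number of chromosomes.
        new = {h: c / (2 * len(genotypes)) for h, c in counts.items()}
        converged = max(abs(new[h] - freqs[h]) for h in HAPLOTYPES) < tol
        freqs = new
        if converged:
            break
    return freqs


def sliding_windows(snps, size=3):
    """Consecutive windows of `size` SNPs, as used in a sliding-window scan."""
    return [snps[i:i + size] for i in range(len(snps) - size + 1)]
```

Extending this to 3-SNP windows means enumerating eight haplotypes per window instead of four; the E-step and M-step are otherwise unchanged.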
Finally, pathway-based analysis was performed on all 1260 SNPs available for analysis. Genes were categorized into biological pathways using the GoMiner software (http://discover.nci.nih.gov/gominer/), which utilizes the Gene Ontology database (http://www.geneontology.org) to identify the biological processes and functions of the genes and classify them into biologically coherent categories. To test the significance of each pathway, the proportion of statistically significant genes versus non-significant genes for each pathway was compared with the proportion of significant genes versus non-significant genes in all the remaining pathways using the one-sample test for proportions (21). Exact methods were used when cell counts were less than five. In addition, we used the rank truncated product method to evaluate the excess of highly significant SNPs within each pathway (22). Since the rank truncated product test yielded results similar to those of the one-sample test for proportions, only the results of the one-sample test for proportions are reported.
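The pathway comparison can be illustrated with a minimal sketch of a one-sample z-test for proportions, in which the proportion of significant genes in all remaining pathways serves as the null value. The two-sided alternative, the function names and the counts are assumptions made for illustration; the exact methods used for small cell counts are not shown.

```python
import math


def one_sample_proportion_test(k, n, p0):
    """Two-sided z-test of H0: proportion = p0, given k successes out of n."""
    p_hat = k / n
    se = math.sqrt(p0 * (1.0 - p0) / n)
    z = (p_hat - p0) / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


def pathway_test(sig_in, total_in, sig_out, total_out):
    """Compare a pathway's share of significant genes with that of all other pathways."""
    p0 = sig_out / total_out  # reference proportion from the remaining pathways
    return one_sample_proportion_test(sig_in, total_in, p0)


# Hypothetical counts: 5 of 20 genes significant in the pathway,
# 30 of 300 genes significant in the remaining pathways.
print(pathway_test(5, 20, 30, 300))
```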
All statistical analyses were performed using SAS software, version 9.1 (SAS Institute, Cary, NC), unless otherwise stated.