Primary analysis of case–control studies focuses on the relationship between disease D and a set of covariates of interest (Y, X). A secondary application of the case–control study, which is often invoked in modern genetic epidemiologic association studies, is to investigate the interrelationship between the covariates themselves. The task is complicated owing to the case–control sampling, where the regression of Y on X is different from what it is in the population. Previous work has assumed a parametric distribution for Y given X and derived semiparametric efficient estimation and inference without any distributional assumptions about X. We take up the issue of estimation of a regression function when Y given X follows a homoscedastic regression model, but otherwise the distribution of Y is unspecified. The semiparametric efficient approaches can be used to construct semiparametric efficient estimates, but they suffer from a lack of robustness to the assumed model for Y given X. We take an entirely different approach. We show how to estimate the regression parameters consistently even if the assumed model for Y given X is incorrect, and thus the estimates are model robust. For this we make the assumption that the disease rate is known or well estimated. The assumption can be dropped when the disease is rare, which is typically so for most case–control studies, and the estimation algorithm simplifies. Simulations and empirical examples are used to illustrate the approach.
Biased samples; Homoscedastic regression; Secondary data; Secondary phenotypes; Semiparametric inference; Two-stage samples
A recent genome-wide association study (GWAS) of subjects from Japan and South Korea reported a novel association between the TP63 locus on chromosome 3q28 and risk of lung adenocarcinoma (p = 7.3 × 10^-12); however, this association did not achieve genome-wide significance (p < 10^-7) among never-smoking males or females. To determine if this association with lung cancer risk is independent of tobacco use, we genotyped the TP63 SNPs reported by the previous GWAS (rs10937405 and rs4488809) in 3,467 never-smoking female lung cancer cases and 3,787 never-smoking female controls from 10 studies conducted in Taiwan, Mainland China, South Korea, and Singapore. Genetic variation in rs10937405 was associated with risk of lung adenocarcinoma [n = 2,529 cases; p = 7.1 × 10^-8; allelic risk = 0.80, 95% confidence interval (CI) = 0.74–0.87]. There was also evidence of association with squamous cell carcinoma of the lung (n = 302 cases; p = 0.037; allelic risk = 0.82, 95% CI = 0.67–0.99). Our findings provide strong evidence that genetic variation in TP63 is associated with the risk of lung adenocarcinoma among Asian females in the absence of tobacco smoking.
We present a Bayesian approach to modeling dynamic smoking addiction behavior processes when cure is not directly observed due to censoring. Subject-specific probabilities model the stochastic transitions among three behavioral states: smoking, transient quitting, and permanent quitting (absorbent state). A multivariate normal distribution for random effects is used to account for the potential correlation among the subject-specific transition probabilities. Inference is conducted using a Bayesian framework via Markov chain Monte Carlo simulation. This framework provides various measures of subject-specific predictions, which are useful for policy-making, intervention development, and evaluation. Simulations are used to validate our Bayesian methodology and assess its frequentist properties. Our methods are motivated by, and applied to, the Alpha-Tocopherol, Beta-Carotene Lung Cancer Prevention study, a large (29,133 individuals) longitudinal cohort study of smokers from Finland.
Cure model; MCMC; Mixed-effects model; Prediction; Recurrent events; Smoking cessation
There has been a long-standing controversy in epidemiology with regard to an appropriate risk scale for testing interactions between genes (G) and environmental exposure (E). Although interaction tests based on the logistic model, which approximates the multiplicative risk for rare diseases, have been more widely applied because of their convenience in statistical modeling, interactions under additive risk models have been regarded as closer to true biologic interactions and more useful in intervention-related decision-making processes in public health. It is well known that exploiting a natural assumption of G-E independence for the underlying population can dramatically increase statistical power for detecting multiplicative interactions in case-control studies. However, the implication of the independence assumption for tests for additive interaction has not been previously investigated. In this article, the authors develop a likelihood ratio test for detecting additive interactions for case-control studies that incorporates the G-E independence assumption. Numerical investigation of power suggests that incorporation of the independence assumption can enhance the efficiency of the test for additive interaction by 2- to 2.5-fold. The authors illustrate their method by applying it to data from a bladder cancer study.
additive risk model; case-control studies; gene-environment independence; gene-environment interaction; multiplicative risk model
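To make the two risk scales concrete, the following minimal sketch contrasts the multiplicative interaction ratio with the relative excess risk due to interaction (RERI), a commonly used measure of departure from additivity. It is an illustration only and not the authors' likelihood ratio test: the function names and the example odds ratios are hypothetical, and odds ratios are assumed to approximate relative risks (rare disease).

```python
def multiplicative_interaction(or_g, or_e, or_ge):
    """Ratio OR_GE / (OR_G * OR_E); 1.0 means no multiplicative interaction."""
    return or_ge / (or_g * or_e)


def reri(or_g, or_e, or_ge):
    """Relative excess risk due to interaction; 0.0 means no additive interaction."""
    return or_ge - or_g - or_e + 1.0


# Hypothetical odds ratios, each relative to unexposed non-carriers:
# gene only, exposure only, and both together.
or_g, or_e, or_ge = 1.5, 2.0, 3.0

print(multiplicative_interaction(or_g, or_e, or_ge))  # 1.0
print(reri(or_g, or_e, or_ge))  # 0.5
```

In this made-up example the joint effect is exactly multiplicative, yet it exceeds what additivity would predict, which is why the choice of scale matters when testing for interaction.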
Pulmonary inflammation may contribute to lung cancer etiology. We conducted a broad evaluation of the association of single nucleotide polymorphisms (SNPs) in innate immunity and inflammation pathways with lung cancer risk, and conducted comparisons with a lung cancer genome wide association study (GWAS).
We included 378 lung cancer cases and 450 controls from the Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial. An Illumina GoldenGate oligonucleotide pool assay was used to genotype 1,429 SNPs. Odds ratios (ORs) and 95% confidence intervals (CIs) were estimated for each SNP, and p-values for trend were calculated. For statistically significant SNPs (p-trend<0.05), we replicated our results with genotyped or imputed SNPs in the GWAS, and adjusted p-values for multiple testing.
In our PLCO analysis, we observed a significant association between 81 SNPs located in 44 genes and lung cancer (p-trend<0.05). Of these 81 SNPs, there was evidence for confirmation in the GWAS for 10 SNPs. However, after adjusting for multiple comparisons, the only SNP that remained significantly associated with lung cancer in the replication phase was rs4648127 (NFKB1; multiple testing adjusted p-trend=0.02). The CT/TT genotype of NFKB1 was associated with reduced odds of lung cancer in the PLCO study (OR=0.56; 95% CI 0.37–0.86) and the GWAS (OR=0.79; 95% CI 0.69–0.90).
We found a significant association between a variant in the NFKB1 gene and lung cancer risk. Our findings add to evidence implicating inflammation and immunity in lung cancer etiology.
lung cancer; genetics; inflammation; immunity; epidemiology
Interest in gene-environment interaction studies has grown markedly with the rise of advanced molecular genetics techniques. In practice, it has become possible to investigate the role of environmental factors in disease risk and hence their role as genetic effect modifiers. The understanding that genetics is important in the uptake and metabolism of toxic substances is an example of how genetic profiles can modify important environmental risk factors for disease. Several rationales exist for setting up gene-environment interaction studies, and the technical challenges related to these studies, when the number of environmental or genetic risk factors is relatively small, have been described before.
In the post-genomic era, it is now possible to study thousands of genes and their interaction with the environment. This brings along a whole range of new challenges and opportunities. Despite a continuing effort in developing efficient methods and optimal bioinformatics infrastructures to deal with the available wealth of data, the challenge remains how to best present and analyze Genome-Wide Environmental Interaction (GWEI) studies involving multiple genetic and environmental factors. Since GWEIs are performed at the intersection of statistical genetics, bioinformatics and epidemiology, they usually face problems similar to those of Genome-Wide Association gene-gene Interaction (GWAI) studies. However, additional complexities need to be considered; these are typical for large-scale epidemiological studies, but are also related to “joining” two heterogeneous types of data in explaining complex disease trait variation or for prediction purposes.
Genome-wide association studies; gene-environment interaction; post-GWAS analysis; association tests; exploratory methods
We report a new model to project the predictive performance of polygenic models based on the number and distribution of effect sizes for the underlying susceptibility alleles and the size of the training dataset. Using estimates of effect-size distribution and heritability derived from current studies, we project that while 45% of the variance of height has been attributed to common tagging single nucleotide polymorphisms (SNPs), a model trained on one million people may only explain 33.4% of variance of the trait. Current studies can identify 3.0%, 1.1%, and 7.0% of the populations who are at two-fold or higher than average risk for Type 2 diabetes, coronary artery disease and prostate cancer, respectively. Tripling of sample sizes could elevate the percentages to 18.8%, 6.1%, and 12.2%, respectively. The utility of future polygenic models will depend on achievable sample sizes, underlying genetic architecture and information on other risk-factors, including family history.
Recent studies have identified common genetic variants that are unequivocally associated with central adiposity, BMI, and/or fasting plasma glucose among individuals of European descent. Our objective was to evaluate these associations in a population of Asian-Indians. We examined 16 single-nucleotide polymorphisms (SNPs) from loci previously linked to waist circumference, BMI, or fasting glucose in 1,129 Asian-Indians from New Delhi and Trivandrum. Trained medical staff measured waist circumference, height, and weight. Fasting plasma glucose was measured from collected blood specimens. Genotype–phenotype associations were evaluated using linear regression, with adjustments for age, gender, religion, and study region. For gene–environment interaction tests, total physical activity (PA) during the past 7 days was assessed by the International Physical Activity Questionnaire (IPAQ). The T allele at the FTO rs3751812 locus was associated with increased waist circumference (per allele effect of +1.58 cm, Ptrend = 0.0015) after Bonferroni adjustment for multiple testing (Padj = 0.04). We also found a nominally statistically significant FTO–PA interaction (Pinteraction = 0.008). Among participants with <81 metabolic equivalent (MET)-h/wk of PA, the rs3751812 variant was associated with increased waist size (+2.68 cm; 95% confidence interval (CI) = 1.24, 4.12), but not among those with 212+ MET-h/wk (−1.79 cm; 95% CI = −4.17, 0.58). No other variant had statistically significant associations, although statistical power was modest. In conclusion, we confirmed that an FTO variant associated with central adiposity in European populations is associated with central adiposity among Asian-Indians and corroborated prior reports indicating that high PA attenuates FTO-related genetic susceptibility to adiposity.
Background: Some, but not all, observational studies have suggested that taller stature is associated with a significantly increased risk of glioma. In a pooled analysis of observational studies, we investigated the strength and consistency of this association, overall and for major sub-types, and investigated effect modification by genetic susceptibility to the disease.
Methods: We standardized and combined individual-level data on 1354 cases and 4734 control subjects from 13 prospective and 2 case–control studies. Pooled odds ratios (ORs) and 95% confidence intervals (CIs) for glioma and glioma sub-types were estimated using logistic regression models stratified by sex and adjusted for birth cohort and study. Pooled ORs were additionally estimated after stratifying the models according to seven recently identified glioma-related genetic variants.
Results: Among men, we found a positive association between height and glioma risk (≥190 vs 170–174 cm, pooled OR = 1.70, 95% CI: 1.11–2.61; P-trend = 0.01), which was slightly stronger after restricting to cases with glioblastoma (pooled OR = 1.99, 95% CI: 1.17–3.38; P-trend = 0.02). Among women, these associations were less clear (≥175 vs 160–164 cm, pooled OR for glioma = 1.06, 95% CI: 0.70–1.62; P-trend = 0.22; pooled OR for glioblastoma = 1.36, 95% CI: 0.77–2.39; P-trend = 0.04). In general, we did not observe evidence of effect modification by glioma-related genotypes on the association between height and glioma risk.
Conclusion: An association of taller adult stature with glioma, particularly for men and stronger for glioblastoma, should be investigated further to clarify the role of environmental and genetic determinants of height in the etiology of this disease.
Height; brain cancer; glioma; cancer; epidemiology
Epidemiological studies have yielded inconsistent associations between vitamin D status and prostate cancer risk, and few studies have evaluated whether the associations vary by disease aggressiveness. We investigated the association between vitamin D status, as determined by serum 25-hydroxy-vitamin D [25(OH)D] level, and risk of prostate cancer in a case–control study nested within the Prostate, Lung, Colorectal, and Ovarian (PLCO) Cancer Screening Trial.
The study included 749 case patients with incident prostate cancer who were diagnosed 1 to 8 years after blood draw and 781 control subjects who were frequency-matched by age at cohort entry, time since initial screening, and calendar year of cohort entry. All study participants were selected from the trial screening arm (which includes annual standardized prostate cancer screening). Conditional logistic regression was used to estimate odds ratios (ORs) with 95% confidence intervals (CIs) by quintile of 25(OH)D. Statistical tests were two-sided.
No statistically significant trend in overall prostate cancer risk was observed with increasing serum season-standardized 25(OH)D level. However, serum 25(OH)D concentrations greater than the lowest quintile (Q1) were associated with increased risk of aggressive (Gleason sum ≥7 or clinical stage III or IV) disease (ORs for Q2 vs Q1 = 1.20, 95% CI = 0.80 to 1.81, for Q3 vs Q1 = 1.96, 95% CI = 1.34 to 2.87, for Q4 vs Q1 = 1.61, 95% CI = 1.09 to 2.38, and for Q5 vs Q1 = 1.37, 95% CI = 0.92 to 2.05; Ptrend = .05). The rates of aggressive prostate cancer for increasing quintiles of serum 25(OH)D were 406, 479, 780, 633, and 544 per 100,000 person-years. In exploratory analyses, these associations with aggressive disease were consistent across subgroups defined by age, family history of prostate cancer, diabetes, body mass index, vigorous physical activity, calcium intake, study center, season of blood collection, and assay batch.
The findings of this large prospective study do not support the hypothesis that vitamin D is associated with decreased risk of prostate cancer; indeed, higher circulating 25(OH)D concentrations may be associated with increased risk of aggressive disease.
25-hydroxy-vitamin D; prostate cancer
We show how to use reports of cancer in family members to discover additional genetic associations or confirm previous findings in genome-wide association (GWA) studies conducted in case-control, cohort, or cross-sectional studies. Our novel family-history-based approach allows economical association studies for multiple cancers, without genotyping of relatives (as required in family studies), follow-up of participants (as required in cohort studies), or oversampling of specific cancer cases (as required in case-control studies). We empirically evaluate the performance of the proposed family-history-based approach in studying associations with prostate and ovarian cancers, using data from GWA studies previously conducted within the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial. The family-history-based method may be particularly useful for investigating genetic susceptibility to rare diseases, for which accruing cases may be very difficult, by using disease information from non-genotyped relatives of participants in multiple case-control and cohort studies designed primarily for other purposes.
To estimate the likely number and predictive strength of cancer-associated single nucleotide polymorphisms (SNPs) that are yet to be discovered for seven common cancers.
From the statistical power of published genome-wide association studies, we estimated the number of undetected susceptibility loci and the distribution of effect sizes for all cancers. Assuming a log-normal model for risks and multiplicative relative risks for SNPs, family history (FH), and known risk factors, we estimated the area under the receiver operating characteristic curve (AUC) and the proportion of patients with risks above risk thresholds for screening. From additional prevalence data, we estimated the positive predictive value and the ratio of non–patient cases to patient cases (false-positive ratio) for various risk thresholds.
Age-specific discriminatory accuracy (AUC) for models including FH and foreseeable SNPs ranged from 0.575 for ovarian cancer to 0.694 for prostate cancer. The proportions of patients in the highest decile of population risk ranged from 16.2% for ovarian cancer to 29.4% for prostate cancer. The corresponding false-positive ratios were 241 for colorectal cancer, 610 for ovarian cancer, and 138 or 280 for breast cancer in women age 50 to 54 or 40 to 44 years, respectively.
Foreseeable common SNP discoveries may not permit identification of small subsets of patients that contain most cancers. Usefulness of screening could be diminished by many false positives. Additional strong risk factors are needed to improve risk discrimination.
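The link between a log-normal risk model, the AUC, and the share of patients found in a high-risk group can be sketched in a few lines. The sketch below uses a standard result under two assumptions that go beyond the abstract: log relative risk is normally distributed in the population with standard deviation sigma, and the disease is rare enough that non-patients have essentially the population distribution. Under those assumptions the log-risk of patients is shifted upward by sigma squared, so AUC = Phi(sigma / sqrt(2)) and the fraction of patients above population quantile q is 1 - Phi(z_q - sigma). The function names are hypothetical, and this is not necessarily the exact calculation the authors used.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def auc_lognormal(sigma):
    """AUC when log relative risk is N(mu, sigma^2) and the disease is rare."""
    return STD_NORMAL.cdf(sigma / sqrt(2.0))


def sigma_from_auc(auc):
    """Invert auc_lognormal: the sigma implied by a given AUC (0 < auc < 1)."""
    return sqrt(2.0) * STD_NORMAL.inv_cdf(auc)


def case_fraction_above(sigma, population_quantile):
    """Fraction of patients whose risk exceeds the given population quantile."""
    z = STD_NORMAL.inv_cdf(population_quantile)
    return 1.0 - STD_NORMAL.cdf(z - sigma)


# Example: share of patients in the top decile of population risk
# for a model whose AUC is 0.70 (an illustrative value).
sigma = sigma_from_auc(0.70)
print(case_fraction_above(sigma, 0.90))
```

With sigma = 0 the model carries no information: the AUC is 0.5 and exactly 10% of patients fall in the top decile, which is a useful sanity check on the formulas.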
A recent genome-wide association study of bladder cancer identified the UGT1A gene cluster on chromosome 2q37.1 as a novel susceptibility locus. The UGT1A cluster encodes a family of UDP-glucuronosyltransferases (UGTs), which facilitate cellular detoxification and removal of aromatic amines. Bioactivated forms of aromatic amines found in tobacco smoke and industrial chemicals are the main risk factors for bladder cancer. The association within the UGT1A locus was detected by a single nucleotide polymorphism (SNP) rs11892031. Here, we performed detailed resequencing, imputation and genotyping in this region. We clarified the original genetic association detected by rs11892031 and identified an uncommon SNP rs17863783 that explained and strengthened the association in this region (allele frequency 0.014 in 4035 cases and 0.025 in 5284 controls, OR = 0.55, 95% CI = 0.44–0.69, P = 3.3 × 10^-7). Rs17863783 is a synonymous coding variant Val209Val within the functional UGT1A6.1 splicing form, strongly expressed in the liver, kidney and bladder. We found the protective T allele of rs17863783 to be associated with increased mRNA expression of UGT1A6.1 in in-vitro exontrap assays and in human liver tissue samples. We suggest that rs17863783 may protect from bladder cancer by increasing the removal of carcinogens from bladder epithelium by the UGT1A6.1 protein. Our study provides an example of the genetic and functional role of an uncommon protective genetic variant in a complex human disease, such as bladder cancer.
In follow-up of a recent genome-wide association study (GWAS) that identified a locus in chromosome 2p21 associated with risk for renal cell carcinoma (RCC), we conducted a fine mapping analysis of a 120 kb region that includes EPAS1. We genotyped 59 tagged common single-nucleotide polymorphisms (SNPs) in 2278 RCC cases and 3719 controls of European background and observed a novel signal for rs9679290 [P = 5.75 × 10^-8, per-allele odds ratio (OR) = 1.27, 95% confidence interval (CI): 1.17–1.39]. Imputation of common SNPs surrounding rs9679290 using HapMap 3 and 1000 Genomes data yielded two additional signals, rs4953346 (P = 4.09 × 10^-14) and rs12617313 (P = 7.48 × 10^-12), both highly correlated with rs9679290 (r^2 > 0.95), but interestingly not correlated with the two SNPs reported in the GWAS: rs11894252 and rs7579899 (r^2 < 0.1 with rs9679290). Genotype analysis of rs12617313 confirmed an association with RCC risk (P = 1.72 × 10^-9, per-allele OR = 1.28, 95% CI: 1.18–1.39). In conclusion, we report that chromosome 2p21 harbors a complex genetic architecture for common RCC risk variants.
The question of which statistical approach is the most effective for investigating gene-environment (G-E) interactions in the context of genome-wide association studies (GWAS) remains unresolved. By using 2 case-control GWAS (the Nurses’ Health Study, 1976–2006, and the Health Professionals Follow-up Study, 1986–2006) of type 2 diabetes, the authors compared 5 tests for interactions: standard logistic regression-based case-control; case-only; semiparametric maximum-likelihood estimation of an empirical-Bayes shrinkage estimator; and 2-stage tests. The authors also compared 2 joint tests of genetic main effects and G-E interaction. Elevated body mass index was the exposure of interest and was modeled as a binary trait to avoid an inflated type I error rate that the authors observed when the main effect of continuous body mass index was misspecified. Although both the case-only and the semiparametric maximum-likelihood estimation approaches assume that the tested markers are independent of exposure in the general population, the authors did not observe any evidence of inflated type I error for these tests in their studies with 2,199 cases and 3,044 controls. Both joint tests detected markers with known marginal effects. Loci with the most significant G-E interactions using the standard, empirical-Bayes, and 2-stage tests were strongly correlated with the exposure among controls. Study findings suggest that methods exploiting G-E independence can be efficient and valid options for investigating G-E interactions in GWAS.
case-control studies; case study; diabetes mellitus, type 2; epidemiologic methods; genome-wide association study; genotype-environment interaction
Several methods for screening gene-environment interaction have recently been proposed that address the issue of using gene-environment independence in a data-adaptive way. In this report, the authors present a comparative simulation study of power and type I error properties of 3 classes of procedures: 1) the standard 1-step case-control method; 2) the case-only method that requires an assumption of gene-environment independence for the underlying population; and 3) a variety of hybrid methods, including empirical-Bayes, 2-step, and model averaging, that aim at gaining power by exploiting the assumption of gene-environment independence and yet can protect against false positives when the independence assumption is violated. These studies suggest that, although the case-only method generally has maximum power, it has the potential to create substantial false positives in large-scale studies even when a small fraction of markers are associated with the exposure under study in the underlying population. All the hybrid methods perform well in protecting against such false positives and yet can retain substantial power advantages over standard case-control tests. The authors conclude that, for future genome-wide scans for gene-environment interactions, major power gain is possible by using alternatives to standard case-control analysis. Whether a case-only type scan or one of the hybrid methods should be used depends on the strength and direction of gene-environment interaction and association, the level of tolerance for false positives, and the nature of replication strategies.
case-control studies; efficiency; familywise error rate; genome-wide association study; profile likelihood; robustness; shrinkage
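The contrast between the standard case-control estimator and the case-only estimator can be illustrated for a binary gene and a binary exposure. This is a minimal sketch under simplifying assumptions (binary G and E, rare disease, no covariates); the function names and all counts are hypothetical, and the hybrid procedures compared in the abstract (empirical-Bayes, 2-step, model averaging) are not implemented here.

```python
def odds_ratio(table):
    """G-E odds ratio for a 2x2 table ((n00, n01), (n10, n11)) indexed [g][e]."""
    (n00, n01), (n10, n11) = table
    return (n00 * n11) / (n01 * n10)


def case_control_interaction(cases, controls):
    """Multiplicative interaction OR: G-E odds ratio in cases over that in controls."""
    return odds_ratio(cases) / odds_ratio(controls)


def case_only_interaction(cases):
    """Case-only estimate; valid only if G and E are independent in the population."""
    return odds_ratio(cases)


# Hypothetical counts. In the first control table G and E are independent.
cases = ((200, 100), (100, 100))
independent_controls = ((400, 200), (200, 100))
print(case_control_interaction(cases, independent_controls))  # 2.0
print(case_only_interaction(cases))  # 2.0

# If G and E are associated among controls, the case-only estimate
# still reports 2.0 although the case-control estimate is 1.0.
associated_controls = ((400, 200), (200, 200))
print(case_control_interaction(cases, associated_controls))  # 1.0
```

The second scenario shows the mechanism behind the false positives described above: the case-only method never looks at controls, so a population-level G-E association is read as an interaction.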
Over the past several years, genome-wide association studies (GWAS) have succeeded in identifying hundreds of genetic markers associated with common diseases. However, most of these markers confer relatively small increments of risk and explain only a small proportion of familial clustering. To identify obstacles to future progress in genetic epidemiology research and provide recommendations to NIH for overcoming these barriers, the National Cancer Institute sponsored a workshop entitled “Next Generation Analytic Tools for Large-Scale Genetic Epidemiology Studies of Complex Diseases” on September 15–16, 2010. The goal of the workshop was to facilitate discussions on (1) statistical strategies and methods to efficiently identify genetic and environmental factors contributing to the risk of complex disease; and (2) how to develop, apply, and evaluate these strategies for the design, analysis, and interpretation of large-scale complex disease association studies in order to guide NIH in setting the future agenda in this area of research. The workshop was organized as a series of short presentations covering scientific (gene-gene and gene-environment interaction, complex phenotypes, and rare variants and next generation sequencing) and methodological (simulation modeling and computational resources and data management) topic areas. Specific needs to advance the field were identified during each session and are summarized.
gene-gene interactions; gene-environment interactions; rare variants; next generation sequencing; complex phenotypes; simulations; computational resources
In an analysis of 31,717 cancer cases and 26,136 cancer-free controls drawn from 13 genome-wide association studies (GWAS), we observed large chromosomal abnormalities in a subset of clones from DNA obtained from blood or buccal samples. Mosaic chromosomal abnormalities, either aneuploidy or copy-neutral loss of heterozygosity, of size >2 Mb were observed in autosomes of 517 individuals (0.89%) with abnormal cell proportions between 7% and 95%. In cancer-free individuals, the frequency increased with age; 0.23% under 50 and 1.91% between 75 and 79 (p=4.8×10^-8). Mosaic abnormalities were more frequent in individuals with solid tumors (0.97% versus 0.74% in cancer-free individuals, OR=1.25, p=0.016), with a stronger association for cases who had DNA collected prior to diagnosis or treatment (OR=1.45, p=0.0005). Detectable clonal mosaicism was common in individuals for whom DNA was collected at least one year prior to diagnosis of leukemia compared to cancer-free individuals (OR=35.4, p=3.8×10^-11). These findings underscore the importance of the role and time-dependent nature of somatic events in the etiology of cancer and other late-onset diseases.
We conducted a genome-wide association study (GWAS) of breast cancer by genotyping 528,173 single nucleotide polymorphisms (SNPs) in 1,145 cases of invasive breast cancer among postmenopausal white women, and 1,142 controls. We identified a set of four SNPs in intron 2 of FGFR2, a tyrosine kinase receptor previously shown to be amplified and/or over-expressed in some breast cancers, as highly associated with breast cancer and we confirmed this association in 1,776 cases and 2,072 controls from three additional studies. In both association testing and ancestral recombination graph analysis, FGFR2 haplotypes were associated with risk of breast cancer. Across the four studies the association with all four SNPs was highly statistically significant (Ptrend for the most strongly associated SNP, rs1219648 = 1.1 × 10^-10; population attributable risk = 16%). Four SNPs at other chromosomal loci most strongly associated with breast cancer in the initial GWAS were not associated with risk in the three replication studies. Our summary results from the GWAS are freely available online in a form that should speed the identification of additional loci conferring risk.
We introduce an innovative multilocus test for disease association. It is an extension of an existing score test that gains power over alternative methods by incorporating a parsimonious one-degree-of-freedom model for interaction. We use our method in applications designed to detect interactions that generate hypotheses about the functionality of prostate cancer (PRCA) susceptibility regions.
Our proposed score test is designed to gain additional power through the use of a retrospective likelihood that exploits an assumption of independence between unlinked loci in the underlying population. Its performance is validated through simulation. The method is used in conditional scans with data from stage II of the Cancer Genetic Markers of Susceptibility PRCA genome-wide association study.
Our proposed method increases power to detect susceptibility loci in diverse settings. It identified two high-ranking, biologically interesting interactions: (1) rs748120 of NR2C2 and subregions of 8q24 that contain independent susceptibility loci specific to PRCA and (2) rs4810671 of SULF2 and both JAZF1 and HNF1B that are associated with PRCA and type 2 diabetes.
Our score test is a promising multilocus tool for genetic epidemiology. The results of our applications suggest functionality for poorly understood PRCA susceptibility regions and motivate replication studies.
Gene-gene interaction; Score test; Prostate cancer
Genome-wide and candidate-gene association studies of bladder cancer have identified 10 susceptibility loci thus far. We conducted a meta-analysis of two previously published genome-wide scans (4501 cases and 6076 controls of European background) and followed up the most significant association signals [17 single nucleotide polymorphisms (SNPs) in 10 genomic regions] in 1382 cases and 2201 controls from four studies. A combined analysis adjusted for study center, age, sex, and smoking status identified a novel susceptibility locus that mapped to a region of 18q12.3, marked by rs7238033 (P = 8.7 × 10^-9; allelic odds ratio 1.20 with 95% CI: 1.13–1.28) and two highly correlated SNPs, rs10775480/rs10853535 (r^2 = 1.00; P = 8.9 × 10^-9; allelic odds ratio 1.16 with 95% CI: 1.10–1.22). The signal localizes to the solute carrier family 14 member 1 gene, SLC14A1, a urea transporter that regulates cellular osmotic pressure. In the kidney, SLC14A1 regulates urine volume and concentration whereas in erythrocytes it determines the Kidd blood groups. Our findings suggest that genetic variation in SLC14A1 could provide new etiological insights into bladder carcinogenesis.
Genetic association studies, thus far, have focused on the analysis of individual main effects of SNP markers. Nonetheless, there is a clear need for modeling epistasis or gene-gene interactions to better understand the biologic basis of existing associations. Tree-based methods have been widely studied as tools for building prediction models based on complex variable interactions. An understanding of the power of such methods for the discovery of genetic associations in the presence of complex interactions is of great importance. Here, we systematically evaluate the power of three leading algorithms: random forests (RF), Monte Carlo logic regression (MCLR), and multifactor dimensionality reduction (MDR).
We use the algorithm-specific variable importance measures (VIMs) as statistics and employ permutation-based resampling to generate the null distribution and associated p values. The power of the three is assessed via simulation studies. Additionally, in a data analysis, we evaluate the associations between individual SNPs in pro-inflammatory and immunoregulatory genes and the risk of non-Hodgkin lymphoma.
The power of RF is highest in all simulation models, that of MCLR is similar to RF in half, and that of MDR is consistently the lowest.
Our study indicates that the power of RF VIMs is most reliable. However, in addition to tuning parameters, the power of RF is notably influenced by the type of variable (continuous vs. categorical) and the chosen VIM.
Genetic associations; Power; Random forests; SNP; Variable importance measure
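The permutation step described above, in which a variable importance measure serves as the test statistic and its null distribution is generated by resampling, can be sketched generically. The sketch below is an assumption-laden illustration: the statistic is a simple stand-in (absolute difference in mean genotype between cases and controls) rather than an RF, MCLR, or MDR importance measure, and all names and data are hypothetical.

```python
import random


def permutation_p_value(statistic, genotypes, labels, n_permutations=999, seed=0):
    """P value of statistic(genotypes, labels) under random relabeling of subjects."""
    rng = random.Random(seed)
    observed = statistic(genotypes, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any genotype-disease association
        if statistic(genotypes, shuffled) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (exceed + 1) / (n_permutations + 1)


def mean_difference(genotypes, labels):
    """Stand-in importance measure: |mean genotype in cases - mean in controls|."""
    cases = [g for g, y in zip(genotypes, labels) if y == 1]
    controls = [g for g, y in zip(genotypes, labels) if y == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))


# Hypothetical data: minor-allele counts for 20 cases, then 20 controls.
genotypes = [2] * 20 + [0] * 20
labels = [1] * 20 + [0] * 20
print(permutation_p_value(mean_difference, genotypes, labels))
```

Any importance measure with the same call signature could be passed in place of `mean_difference`; because only the labels are shuffled, the correlation structure among SNPs is preserved under the null.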
A prospective study of diet and cancer has not been conducted in India; consequently, little is known regarding follow-up rates or the completeness and accuracy of cancer case ascertainment.
We assessed follow-up in the India Health Study (IHS; 4,671 participants aged 35–69 residing in New Delhi, Mumbai, or Trivandrum). We evaluated the impact of medical care access and relocation, re-contacted the IHS participants to estimate follow-up rates, and conducted separate studies of cancer cases to evaluate registry coverage (604 cases in Trivandrum) and the accuracy of self- and proxy-reporting (1600 cases in New Delhi and Trivandrum).
Over 97% of people reported seeing a doctor and 85% had lived in their current residence for over six years. The 2-year follow-up rate was 91% for Trivandrum and 53% for New Delhi. No cancer cases were missed among public institutions participating in the surveillance program in Trivandrum during 2003–04; but there are likely to be unmatched cases (ranging from 5 to 13% of total cases) from private hospitals in the Trivandrum registry, as there are no mandatory reporting requirements. Vital status was obtained for 36% of cancer cases in New Delhi as compared to 78% in Trivandrum after a period of 4 years.
A prospective cohort study of cancer may be feasible in some centers in India with active follow-up to supplement registry data. Inclusion of cancers diagnosed at private institutions, unique identifiers for individuals, and computerized medical information would likely improve cancer registries.
Cancer; end-point; follow-up; registry; prospective cohort; India
Next Generation Sequencing represents a powerful tool for detecting genetic variation associated with human disease. Because of the high cost of this technology, it is critical that we develop efficient study designs that consider the trade-off between the number of subjects (n) and the coverage depth (μ). How we divide our resources between the two can greatly impact study success, particularly in pilot studies. We propose a strategy for selecting the optimal combination of n and μ for studies aimed at detecting rare variants and for studies aimed at detecting associations between rare or uncommon variants and disease. For detecting rare variants, we find the optimal coverage depth to be between 2 and 8 reads when using the likelihood ratio test. For association studies, we find the strategy of sequencing all available subjects to be preferable. In deriving these combinations, we provide a detailed analysis describing the distribution of depth across a genome and the depth needed to identify a minor allele in an individual. The optimal coverage depth depends on the aims of the study, and the chosen depth can have a large impact on study success.
next generation sequencing; sequencing depth; study design; rare variants
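The trade-off between the number of subjects n and the coverage depth μ can be made concrete with a toy calculation. The sketch below rests on assumptions that are not stated in the abstract: read depth at a site is Poisson with mean μ, each read from a heterozygous carrier shows the minor allele with probability 1/2, and sequencing error is ignored. Under those assumptions the number of minor-allele reads is Poisson with mean μ/2. The function names, the `min_reads` threshold, and the fixed-budget rule n = budget / μ are hypothetical simplifications, not the authors' design procedure.

```python
from math import exp, factorial


def prob_detect_minor_allele(mean_depth, min_reads=1):
    """P(at least min_reads minor-allele reads) in a heterozygote at Poisson depth."""
    lam = mean_depth / 2.0  # Poisson thinning: half the reads carry the minor allele
    below = sum(exp(-lam) * lam**j / factorial(j) for j in range(min_reads))
    return 1.0 - below


def subjects_for_budget(total_depth_budget, mean_depth):
    """Subjects affordable when cost is proportional to n times mean depth."""
    return int(total_depth_budget // mean_depth)


# With a fixed budget, deeper coverage means fewer subjects.
for depth in (2, 4, 8, 16):
    n = subjects_for_budget(1000, depth)
    print(depth, n, round(prob_detect_minor_allele(depth, min_reads=2), 3))
```

The loop shows the tension the abstract describes: per-individual detection probability rises with depth while the number of sequenced subjects falls, so the best compromise depends on the aim of the study.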