In studies of case-parent trios, we define copy number variants (CNVs) in the offspring that differ from the parental copy numbers as de novo and of interest for their potential functional role in disease. Among the leading array-based methods for discovery of de novo CNVs in case-parent trios is the joint hidden Markov model (HMM) implemented in the PennCNV software. However, the computational demands of the joint HMM are substantial and the extent to which false positive identifications occur in case-parent trios has not been well described. We evaluate these issues in a study of oral cleft case-parent trios.
Our analysis of the oral cleft trios reveals that genomic waves represent a substantial source of false positive identifications in the joint HMM, despite a wave-correction implementation in PennCNV. In addition, the noise of low-level summaries of relative copy number (log R ratios) is strongly associated with batch and correlated with the frequency of de novo CNV calls. Exploiting the trio design, we propose a univariate statistic for relative copy number referred to as the minimum distance that can reduce technical variation from probe effects and genomic waves. We use circular binary segmentation to segment the minimum distance and maximum a posteriori estimation to infer de novo CNVs from the segmented genome. Compared to PennCNV on simulated data, MinimumDistance identifies fewer false positives on average and is comparable to PennCNV with respect to false negatives. Genomic waves contribute to discordance of PennCNV and MinimumDistance for high coverage de novo calls, while highly concordant calls on chromosome 22 were validated by quantitative PCR. Computationally, MinimumDistance provides a nearly 8-fold increase in speed relative to the joint HMM in a study of oral cleft trios.
Our results indicate that batch effects and genomic waves are important considerations for case-parent studies of de novo CNV, and that the minimum distance is an effective statistic for reducing technical variation contributing to false de novo discoveries. Coupled with segmentation and maximum a posteriori estimation, our algorithm compares favorably to the joint HMM with MinimumDistance being much faster.
Trios; Oral cleft; Copy number variants; de novo; High-throughput arrays; Segmentation; batch effects; Genomic waves
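As an illustration of the minimum distance described in the abstract above, the following is a minimal Python sketch. It assumes the statistic is computed marker by marker as the signed offspring-minus-parent difference in log R ratio that is smaller in absolute value. The sign convention, the function name `minimum_distance`, and the log R ratio values are assumptions made for illustration and are not taken from the MinimumDistance software; segmentation and maximum a posteriori calling are not shown.

```python
import numpy as np

def minimum_distance(offspring, father, mother):
    """Per-marker signed parent-offspring difference with the smaller magnitude.

    Assumed convention: offspring minus parent, so a negative value suggests
    lower copy number in the offspring than in both parents.
    """
    offspring = np.asarray(offspring, dtype=float)
    d_father = offspring - np.asarray(father, dtype=float)
    d_mother = offspring - np.asarray(mother, dtype=float)
    # Keep whichever of the two differences is closer to zero at each marker.
    return np.where(np.abs(d_father) <= np.abs(d_mother), d_father, d_mother)

# Hypothetical log R ratios at four markers (not real data).
offspring = [0.02, -0.55, -0.60, 0.30]
father = [0.00, 0.01, -0.58, 0.28]
mother = [0.03, -0.02, 0.02, 0.31]

md = minimum_distance(offspring, father, mother)
print(np.round(md, 2))
```

In this toy example only the second marker keeps a large negative value. At the third marker the drop in the offspring is shared with the father, so it largely cancels in the difference, which is how a trio-based statistic can suppress inherited signal and shared technical artifacts.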
Clonal mosaicism for large chromosomal anomalies (duplications, deletions and uniparental disomy) was detected using SNP microarray data from over 50,000 subjects recruited for genome-wide association studies. This detection method requires a relatively high frequency of cells (>5–10%) with the same abnormal karyotype (presumably of clonal origin) in the presence of normal cells. The frequency of detectable clonal mosaicism in peripheral blood is low (<0.5%) from birth until 50 years of age, after which it rises rapidly to 2–3% in the elderly. Many of the mosaic anomalies are characteristic of those found in hematological cancers and identify common deleted regions that pinpoint the locations of genes previously associated with hematological cancers. Although only 3% of subjects with detectable clonal mosaicism had any record of hematological cancer prior to DNA sampling, those without a prior diagnosis have an estimated 10-fold higher risk of a subsequent hematological cancer (95% confidence interval = 6–18).
Nucleotide excision repair (NER) is responsible for protecting DNA in skin cells against ultraviolet radiation-induced damage. Using a candidate pathway approach, a matched case-control study nested within a prospective, community-based cohort was carried out to test the hypothesis that single nucleotide polymorphisms (SNPs) in NER genes are associated with susceptibility to non-melanoma skin cancer (NMSC). Histologically-confirmed cases of NMSC (n=900) were matched to controls (n=900) on age, gender, and skin type. Associations were measured between NMSC and 221 SNPs in 26 NER genes. Using the additive model, two tightly linked functional SNPs in ERCC6 were significantly associated with increased risk of NMSC: rs2228527 (odds ratio (OR) 1.57, 95% confidence interval (CI) 1.20 – 2.05), and rs2228529 (OR 1.57, 95% CI 1.20 – 2.05). These associations were confined to basal cell carcinoma of the skin (BCC) (rs2228529, OR 1.78, 95% CI 1.30 – 2.44; rs2228527 OR 1.78, 95% CI 1.31 – 2.43). These hypothesis-generating findings suggest functional variants in ERCC6 may be associated with an increased risk of NMSC that may be specific to BCC.
Case–parent trio studies concerned with children affected by a disease and their parents aim to detect single nucleotide polymorphisms (SNPs) showing a preferential transmission of alleles from the parents to their affected offspring. A popular statistical test for detecting such SNPs associated with disease in this study design is the genotypic transmission/disequilibrium test (gTDT) based on a conditional logistic regression model, which usually needs to be fitted by an iterative procedure. In this article, we derive exact closed-form solutions for the parameter estimates of the conditional logistic regression models when testing for an additive, a dominant, or a recessive effect of a SNP, and show that such analytic parameter estimates also exist when considering gene–environment interactions with binary environmental variables. Because the genetic model underlying the association between a SNP and a disease is typically unknown, it might further be beneficial to use the maximum over the gTDT statistics for the possible effects of a SNP as test statistic. We therefore propose a procedure enabling a fast computation of the test statistic and the permutation-based p-value of this MAX gTDT. All these methods are applied to whole-genome scans of the case–parent trios from the International Cleft Consortium. These applications show that our procedures dramatically reduce the required computing time compared to conventional iterative methods, allowing, for example, the analysis of hundreds of thousands of SNPs in a few minutes instead of several hours.
Conditional logistic regression; Family-based design; Genome-wide association studies; Genotypic transmission/disequilibrium test; International Cleft Consortium; MAX test
Long chain polyunsaturated fatty acids (LC-PUFAs) are essential for brain structure, development, and function, and adequate dietary quantities of LC-PUFAs are thought to have been necessary for both brain expansion and the increase in brain complexity observed during modern human evolution. Previous studies conducted in largely European populations suggest that humans have limited capacity to synthesize brain LC-PUFAs such as docosahexaenoic acid (DHA) from plant-based medium chain (MC) PUFAs due to limited desaturase activity. Population-based differences in LC-PUFA levels and their product-to-substrate ratios can, in part, be explained by polymorphisms in the fatty acid desaturase (FADS) gene cluster, which have been associated with increased conversion of MC-PUFAs to LC-PUFAs. Here, we show evidence that these high efficiency converter alleles in the FADS gene cluster were likely driven to near fixation in African populations by positive selection ∼85 kya. We hypothesize that selection at FADS variants, which increase LC-PUFA synthesis from plant-based MC-PUFAs, played an important role in allowing African populations obligatorily tethered to marine sources for LC-PUFAs in isolated geographic regions, to rapidly expand throughout the African continent 60–80 kya.
Non-syndromic cleft palate (CP) is a common birth defect with a complex and heterogeneous etiology involving both genetic and environmental risk factors. We conducted a genome wide association study (GWAS) using 550 case-parent trios, ascertained through a CP case collected in an international consortium. Family based association tests of single nucleotide polymorphisms (SNPs) and three common maternal exposures (maternal smoking, alcohol consumption and multivitamin supplementation) were used in a combined 2 df test for gene (G) and gene-environment (G×E) interaction simultaneously, plus a separate 1 df test for G×E interaction alone. Conditional logistic regression models were used to estimate effects on risk to exposed and unexposed children. While no SNP achieved genome wide significance when considered alone, markers in several genes attained or approached genome wide significance when G×E interaction was included. Among these, MLLT3 and SMC2 on chromosome 9 showed multiple SNPs resulting in increased risk if the mother consumed alcohol during the peri-conceptual period (3 months prior to conception through the first trimester). TBK1 on chr. 12 and ZNF236 on chr. 18 showed multiple SNPs associated with higher risk of CP in the presence of maternal smoking. Additional evidence of reduced risk due to G×E interaction in the presence of multivitamin supplementation was observed for SNPs in BAALC on chr. 8. These results emphasize the need to consider G×E interaction when searching for genes influencing risk to complex and heterogeneous disorders, such as non-syndromic CP.
The receptor tyrosine kinase-like orphan receptor 2 (ROR2) gene has recently been shown to play important roles in palatal development in animal models and resides in a chromosomal region linked to non-syndromic cleft lip with or without cleft palate in humans. The aim of this study was to investigate the possible association between the ROR2 gene and non-syndromic oral clefts.
Here we tested 38 eligible single-nucleotide polymorphisms (SNPs) in the ROR2 gene in 297 non-syndromic cleft lip with or without cleft palate and in 82 non-syndromic cleft palate case-parent trios recruited from Asia and Maryland. The Family Based Association Test was used to test for deviation from Mendelian inheritance. PLINK software was used to test for potential parent-of-origin effects. Possible maternally mediated in utero effects were assessed using the TRIad Multi-Marker approach under an assumption of mating symmetry in the population.
Significant evidence of linkage and association was shown for 3 SNPs (rs7858435, rs10820914 and rs3905385) among 57 Asian non-syndromic cleft palate trios in Family Based Association Tests. P values for these 3 SNPs were 0.000068, 0.000115 and 0.000464, respectively, all below the significance level adjusted by strict Bonferroni correction (0.05/38 = 0.0013). Relevant odds ratios for the risk allele were 3.42 (1.80–6.50), 3.45 (1.75–6.67) and 2.94 (1.56–5.56), respectively. Statistical evidence of linkage and association was not shown for study groups other than non-syndromic cleft palate. No evidence of parent-of-origin or maternal genotypic effects was shown for any of the ROR2 markers in any study group.
Our results provided evidence of linkage and association between the ROR2 gene and a gene controlling risk to non-syndromic cleft palate.
receptor tyrosine kinase-like orphan receptor 2; cleft lip; cleft palate; association; transmission disequilibrium test
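The Bonferroni threshold quoted in the ROR2 abstract above (0.05/38 = 0.0013) can be checked with a few lines of Python. This is only a worked arithmetic sketch using the three P values reported in the abstract; the variable names are illustrative, and the adjusted values are simply each P value multiplied by the number of tests, capped at 1.

```python
# Worked check of the Bonferroni correction described in the abstract.
n_tests = 38          # number of SNPs tested in ROR2
alpha = 0.05          # family-wise significance level
threshold = alpha / n_tests

# P values as reported in the abstract.
p_values = {
    "rs7858435": 0.000068,
    "rs10820914": 0.000115,
    "rs3905385": 0.000464,
}

# A SNP passes if its raw P value is below the per-test threshold.
significant = {snp: p < threshold for snp, p in p_values.items()}

# Equivalent view: Bonferroni-adjusted P values compared against alpha.
adjusted = {snp: min(1.0, p * n_tests) for snp, p in p_values.items()}

print(f"per-test threshold: {threshold:.4f}")
for snp in p_values:
    print(snp, significant[snp], f"{adjusted[snp]:.4f}")
```

Comparing raw P values to 0.05/38 and comparing adjusted P values to 0.05 are the same decision rule, so both views agree for all three SNPs.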
Genotyping platforms such as Affymetrix can be used to assess genotype-phenotype as well as copy number-phenotype associations at millions of markers. While genotyping algorithms are largely concordant when assessed on HapMap samples, tools to assess copy number changes are more variable and often discordant. One explanation for the discordance is that copy number estimates are susceptible to systematic differences between groups of samples that were processed at different times or by different labs. Analysis algorithms that do not adjust for batch effects are prone to spurious measures of association. The R package crlmm implements a multilevel model that adjusts for batch effects and provides allele-specific estimates of copy number. This paper illustrates a workflow for the estimation of allele-specific copy number and integration of the marker-level estimates with complementary Bioconductor software for inferring regions of copy number gain or loss. All analyses are performed in the statistical environment R.
copy number; batch effects; robust; multilevel model; high-throughput; oligonucleotide array
Motivation: Changes in the copy number of chromosomal DNA segments [copy number variants (CNVs)] have been implicated in human variation, heritable diseases and cancers. Microarray-based platforms are the current established technology of choice for studies reporting these discoveries and constitute the benchmark against which emergent sequence-based approaches will be evaluated. Research that depends on CNV analysis is rapidly increasing, and systematic platform assessments that distinguish strengths and weaknesses are needed to guide informed choice.
Results: We evaluated the sensitivity and specificity of six platforms, provided by four leading vendors, using a spike-in experiment. NimbleGen and Agilent platforms outperformed Illumina and Affymetrix in accuracy and precision of copy number dosage estimates. However, Illumina and Affymetrix algorithms that leverage single nucleotide polymorphism (SNP) information make up for this disadvantage and perform well at variant detection. Overall, the NimbleGen 2.1M platform outperformed others, but only with the use of an alternative data analysis pipeline to the one offered by the manufacturer.
Availability: The data is available from http://rafalab.jhsph.edu/cnvcomp/.
Supplementary information: Supplementary data are available at Bioinformatics online.
The Bone Morphogenetic Protein 4 gene (BMP4) is located in chromosome 14q22-q23, which has shown evidence of linkage for isolated nonsyndromic cleft lip with or without cleft palate (NSCL/P) in a genome wide linkage analysis of human multiplex families. BMP4 has been shown to play crucial roles in lip and palatal development in animal models. Several candidate gene association analyses also supported its potential role in risk of NSCL/P; however, results across these association studies have been inconsistent. The aim of the current study was to test for possible association between markers in and around the BMP4 gene and NSCL/P in Asian and Maryland trios.
The Family Based Association Test was used to test for deviation from Mendelian assortment for 12 SNPs in and around BMP4. Nominally significant evidence of linkage and association was seen for three SNPs (rs10130587, rs2738265 and rs2761887) in 221 Asian trios and for one SNP (rs762642) in 76 Maryland trios. Statistical significance still held for rs10130587 after Bonferroni correction (corrected p = 0.019) among the Asian group. The estimated odds ratio for carrying the apparent high risk allele at this SNP was 1.61 (95%CI = 1.20, 2.18).
Our results provided further evidence of association between BMP4 and NSCL/P.
A major goal of genetic association studies concerned with single nucleotide polymorphisms (SNPs) is the detection of SNPs exhibiting an impact on the risk of developing a disease. Typically, this problem is approached by testing each of the SNPs individually. This, however, can lead to an inaccurate measurement of the influence of the SNPs on the disease risk, in particular, if SNPs only show an effect when interacting with other SNPs, as the multivariate structure of the data is ignored. In this article, we propose a testing procedure based on logic regression that takes this structure into account and therefore enables a more appropriate quantification of importance and ranking of the SNPs than marginal testing. Since even SNP interactions often exhibit only a moderate effect on the disease risk, it can be helpful to also consider sets of SNPs (e.g. SNPs belonging to the same gene or pathway) to borrow strength across these SNP sets and to identify those genes or pathways comprising SNPs that are most consistently associated with the response. We show how the proposed procedure can be adapted for testing SNP sets, and how it can be applied to blocks of SNPs in linkage disequilibrium (LD) to overcome problems caused by LD.
Feature selection; GENICA; Importance measure; logicFS; Logic regression
Submicroscopic changes in chromosomal DNA copy number dosage are common and have been implicated in many heritable diseases and cancers. Recent high-throughput technologies have a resolution that permits the detection of segmental changes in DNA copy number that span thousands of base pairs in the genome. Genomewide association studies (GWAS) may simultaneously screen for copy number phenotype and single nucleotide polymorphism (SNP) phenotype associations as part of the analytic strategy. However, genomewide array analyses are particularly susceptible to batch effects as the logistics of preparing DNA and processing thousands of arrays often involves multiple laboratories and technicians, or changes over calendar time to the reagents and laboratory equipment. Failure to adjust for batch effects can lead to incorrect inference and requires inefficient post hoc quality control procedures to exclude regions that are associated with batch. Our work extends previous model-based approaches for copy number estimation by explicitly modeling batch and using shrinkage to improve locus-specific estimates of copy number uncertainty. Key features of this approach include the use of biallelic genotype calls from experimental data to estimate batch-specific and locus-specific parameters of background and signal without the requirement of training data. We illustrate these ideas using a study of bipolar disease and a study of chromosome 21 trisomy. The former has batch effects that dominate much of the observed variation in the quantile-normalized intensities, while the latter illustrates the robustness of our approach to a data set in which approximately 27% of the samples have altered copy number. Locus-specific estimates of copy number can be plotted on the copy number scale to investigate mosaicism and guide the choice of appropriate downstream approaches for smoothing the copy number as a function of physical position. 
The software is open source and implemented in the R package crlmm at Bioconductor (http://www.bioconductor.org).
Bioinformatics; Hierarchical models; DNA copy number variations; Single nucleotide polymorphism array
Ensemble methods (such as Bagging and Random Forests) take advantage of unstable base learners (such as decision trees) to improve predictions, and offer measures of variable importance useful for variable selection. LogicFS (Schwender & Ickstadt, 2008) has been proposed as such an ensemble learner for case-control studies when interactions of single nucleotide polymorphisms (SNPs) are of particular interest. LogicFS uses bootstrap samples of the data and employs the Boolean trees derived in logic regression (Ruczinski et al., 200) as base learners to create ensembles of models that allow for the quantification of the contributions of epistatic interactions to the disease risk. In this article, we propose an extension of logicFS suitable for case-parent trio data, and derive an additional importance measure that is much less influenced by linkage disequilibrium between SNPs than the measure originally used in logicFS. We illustrate the performance of the novel procedure in simulation studies and in a case study of 461 case-parent trios with autistic children.
Family-based association study; gene-gene interaction; epistatic interaction; trio logic regression; logicFS; autism
We sought to replicate the association between the kinesin-like protein 6 (KIF6) Trp719Arg polymorphism (rs20455) and clinical coronary artery disease (CAD).
Recent prospective studies suggest that carriers of the 719Arg allele in KIF6 are at increased risk of clinical CAD compared with non-carriers.
The KIF6 Trp719Arg polymorphism (rs20455) was genotyped in nineteen case-control studies of non-fatal CAD either as part of a genome-wide association study or in a formal attempt to replicate the initial positive reports.
Over 17 000 cases and 39 000 controls of European descent as well as a modest number of South Asians, African Americans, Hispanics, East Asians, and admixed cases and controls were successfully genotyped. None of the nineteen studies demonstrated an increased risk of CAD in carriers of the 719Arg allele compared with non-carriers. Regression analyses and fixed effect meta-analyses ruled out with a high degree of confidence an increase of ≥2% in the risk of CAD among European 719Arg carriers. We also observed no increase in the risk of CAD among 719Arg carriers in the subset of Europeans with early onset disease (<50 years of age for males and <60 years for females) compared with similarly aged controls, or in any of the non-European subgroups.
The KIF6 Trp719Arg polymorphism was not associated with the risk of clinical CAD in this large replication study.
kinesin-like protein 6; KIF6; coronary artery disease; myocardial infarction; polymorphism
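The KIF6 abstract above refers to fixed effect meta-analyses across studies. A minimal sketch of one common form, inverse-variance weighting of per-study log odds ratios, is given below. The abstract does not state which weighting was used, so this choice is an assumption, and the function name and the three study estimates are hypothetical values for illustration only.

```python
import math

def fixed_effect_meta(log_ors, standard_errors):
    """Inverse-variance fixed-effect pooling of per-study log odds ratios.

    Returns the pooled log odds ratio and its standard error.
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, log_ors)) / total
    pooled_se = math.sqrt(1.0 / total)
    return pooled, pooled_se

# Hypothetical per-study odds ratios and standard errors (not real data).
log_ors = [math.log(1.02), math.log(0.97), math.log(1.00)]
standard_errors = [0.05, 0.04, 0.08]

pooled, pooled_se = fixed_effect_meta(log_ors, standard_errors)
pooled_or = math.exp(pooled)
ci_low = math.exp(pooled - 1.96 * pooled_se)
ci_high = math.exp(pooled + 1.96 * pooled_se)
print(f"pooled OR {pooled_or:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f})")
```

Under this scheme, the upper confidence limit of the pooled odds ratio is what bounds how large an increase in risk remains compatible with the data, which is the sense in which a large replication study can rule out an effect above a given size.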
Due predominantly to cigarette smoking, lung cancer is the leading cancer-related cause of death worldwide. Cruciferous vegetables may reduce lung cancer risk. The association between intake of cruciferous vegetables and lung cancer risk was investigated in the CLUE II study, a community-based cohort established in 1989.
We matched 274 incident cases of lung cancer diagnosed from 1990–2005 to 1089 cancer-free controls on age, sex, and cigarette smoking. Dietary information was collected at baseline. Multivariable odds ratios (ORs) and 95% confidence intervals (CIs) were calculated using conditional logistic regression.
Intake of cruciferous vegetables was inversely associated with lung cancer risk (highest-versus-lowest fourth: ORQ4vsQ1=0.57; 95%CI = 0.38–0.85; p-trend=0.01). The inverse associations held true for former smokers (ORQ4vsQ1=0.49; 95%CI = 0.27–0.92; p-trend=0.05) and current smokers (ORQ4vsQ1=0.52; 95%CI = 0.29–0.95; p-trend=0.02).
After careful control of cigarette smoking, higher intake of cruciferous vegetables was associated with lower risk of lung cancer.
The observed inverse association, coupled with the accumulating evidence, suggests that cruciferous vegetable intake is inversely associated with lung cancer risk and that this association seems to hold true beyond the confounding effects of cigarette smoking.
Cruciferous vegetables; isothiocyanates; lung cancer risk
A 58kb region on chromosome 9p21.3 has consistently shown strong association with coronary artery disease (CAD) in multiple genome-wide association studies in populations of European and East Asian ancestry. In this study we sought to further characterize the role of genetic variants in 9p21.3 in African American individuals.
Methods and Results
Apparently healthy African American siblings (n=548) of patients with documented CAD <60 years of age were genotyped and followed for incident CAD for up to 17 years. Tests of association for 86 SNPs across the 9p21.3 region in a GEE logistic framework under an additive model adjusting for traditional risk factors, family, follow-up time, and population stratification were performed. A single SNP within the CDKN2B gene met stringent criteria for statistical significance, including permutation-based evaluations. This variant, rs3217989, was common (minor allele [G] frequency 0.242), conveyed protection against CAD (OR=0.19, 95% CI: 0.07 to 0.50, p=0.0008) and was replicated in a combined analysis of two additional case/control studies of prevalent CAD/MI in African Americans (n=990, p=0.024, OR= 0.779, 95% CI: 0.626-0.968).
This is the first report of a CAD association signal in a population of African ancestry with a common variant within the CDKN2B gene, independent from previous findings in European and East Asian ancestry populations. The findings demonstrate a significant protective effect against incident CAD in African American siblings of persons with premature CAD, with replication in a combination of two additional African American cohorts.
African American; CDKN2B; Coronary Artery Disease; Genetics; 9p21
Arachidonic acid (AA) is a long-chain omega-6 polyunsaturated fatty acid (PUFA) synthesized from the precursor dihomo-gamma-linolenic acid (DGLA) that plays a vital role in immunity and inflammation. Variants in the Fatty Acid Desaturase (FADS) family of genes on chromosome 11q have been shown to play a role in PUFA metabolism in populations of European and Asian ancestry; no work has been done in populations of African ancestry to date.
In this study, we report that African Americans have significantly higher circulating levels of plasma AA (p = 1.35 × 10^-48) and lower DGLA levels (p = 9.80 × 10^-11) than European Americans. Tests for association in N = 329 individuals across 80 single nucleotide polymorphisms (SNPs) in the Fatty Acid Desaturase (FADS) locus revealed significant association with AA, DGLA and the AA/DGLA ratio, a measure of enzymatic efficiency, in both racial groups (peak signal p = 2.85 × 10^-16 in African Americans, 2.68 × 10^-23 in European Americans). Ancestry-related differences were observed at an upstream marker previously associated with AA levels (rs174537), wherein 79-82% of African Americans carry two copies of the G allele compared to only 42-45% of European Americans. Importantly, the allelic effect of the G allele, which is associated with enhanced conversion of DGLA to AA, on enzymatic efficiency was similar in both groups.
We conclude that the impact of FADS genetic variants on PUFA metabolism, specifically AA levels, is likely more pronounced in African Americans due to the larger proportion of individuals carrying the genotype associated with increased FADS1 enzymatic conversion of DGLA to AA.
While genome-wide association studies (GWAS) have primarily examined populations of European ancestry, more recent studies often involve additional populations, including admixed populations such as African Americans and Latinos. In admixed populations, linkage disequilibrium (LD) exists both at a fine scale in ancestral populations and at a coarse scale (admixture-LD) due to chromosomal segments of distinct ancestry. Disease association statistics in admixed populations have previously considered SNP association (LD mapping) or admixture association (mapping by admixture-LD), but not both. Here, we introduce a new statistical framework for combining SNP and admixture association in case-control studies, as well as methods for local ancestry-aware imputation. We illustrate the gain in statistical power achieved by these methods by analyzing data of 6,209 unrelated African Americans from the CARe project genotyped on the Affymetrix 6.0 chip, in conjunction with both simulated and real phenotypes, as well as by analyzing the FGFR2 locus using breast cancer GWAS data from 5,761 African-American women. We show that, at typed SNPs, our method yields an 8% increase in statistical power for finding disease risk loci compared to the power achieved by standard methods in case-control studies. At imputed SNPs, we observe an 11% increase in statistical power for mapping disease loci when our local ancestry-aware imputation framework and the new scoring statistic are jointly employed. Finally, we show that our method increases statistical power in regions harboring the causal SNP in the case when the causal SNP is untyped and cannot be imputed. Our methods and our publicly available software are broadly applicable to GWAS in admixed populations.
This paper presents improved methodologies for the analysis of genome-wide association studies in admixed populations, which are populations that came about by the mixing of two or more distant continental populations over a few hundred years (e.g., African Americans or Latinos). Studies of admixed populations offer the promise of capturing additional genetic diversity compared to studies over homogeneous populations such as Europeans. In admixed populations, correlation between genetic variants exists both at a fine scale in the ancestral populations and at a coarse scale due to chromosomal segments of distinct ancestry. Disease association statistics in admixed populations have previously considered either one or the other type of correlation, but not both. In this work we develop novel statistical methods that account for both types of genetic correlation, and we show that the combined approach attains greater statistical power than that achieved by applying either approach separately. We provide analysis of simulated and real data from major studies performed in African-American men and women to show the improvement obtained by our methods over the standard methods for analyzing association studies in admixed populations.
Chronic kidney disease (CKD) has a heritable component and is an important global public health problem because of its high prevalence and morbidity.1 We conducted genome-wide association studies (GWAS) to identify susceptibility loci for glomerular filtration rate estimated by serum creatinine (eGFRcrea), cystatin C (eGFRcys), and CKD (eGFRcrea < 60 ml/min/1.73 m^2) in European-ancestry participants of four population-based cohorts (ARIC, CHS, FHS, RS; n=19,877, 2,388 CKD cases), and tested for external replication in 21,466 participants (1,932 CKD cases). Significant associations (p < 5×10^-8) were identified for SNPs with CKD at the UMOD locus; eGFRcrea at the UMOD, SHROOM3, and GATM/SPATA5L1 loci; eGFRcys at the CST and STC1 loci. UMOD encodes the most common protein in human urine, Tamm-Horsfall protein,2 and rare mutations in UMOD cause Mendelian forms of kidney disease.3 Our findings provide new insights into CKD pathogenesis and underscore the importance of common genetic variants influencing renal function and disease.
chronic kidney disease; renal function; epidemiology; genetics; genome-wide association study; single nucleotide polymorphism
Simulation-based assessment is a popular and frequently necessary approach to evaluation of statistical procedures. Sometimes overlooked is the ability to take advantage of underlying mathematical relations and we focus on this aspect. We show how to take advantage of large-sample theory when conducting a simulation using the analysis of genomic data as a motivating example. The approach uses convergence results to provide an approximation to smaller-sample results, results that are available only by simulation. We consider evaluating and comparing a variety of ranking-based methods for identifying the most highly associated SNPs in a genome-wide association study, derive integral equation representations of the pre-posterior distribution of percentiles produced by three ranking methods, and provide examples comparing performance. These results are of interest in their own right and set the framework for a more extensive set of comparisons.
Efficient simulation; ranking procedures; SNP identification
Single nucleotide polymorphisms (SNPs) in thymic stromal lymphopoietin (TSLP) have been associated with IgE (in girls) and asthma (in general). We sought to determine whether TSLP SNPs are associated with asthma in a sex-specific fashion.
We conducted regular and sex-stratified analyses of association between SNPs in TSLP and asthma in families of asthmatic children in Costa Rica. Significant findings were replicated in white and African-American participants in the Childhood Asthma Management Program, in African Americans in the Genomic Research on Asthma in the African Diaspora study, in whites and Hispanics in the Children’s Health Study, and in whites in the Framingham Heart Study (FHS).
Two SNPs in TSLP (rs1837253 and rs2289276) were significantly associated with a reduced risk of asthma in combined analyses of all cohorts (p values of 2×10^-5 and 1×10^-5, respectively). In a sex-stratified analysis, the T allele of rs1837253 was significantly associated with a reduced risk of asthma in males only (p = 3×10^-6). Conversely, the T allele of rs2289276 was significantly associated with a reduced risk of asthma in females only (p = 2×10^-4). Findings for rs2289276 were consistent in all cohorts except the FHS.
TSLP variants are associated with asthma in a sex-specific fashion.
asthma; genetic association; sex-specific; thymic stromal lymphopoietin; TSLP
Model-based approaches for combining gene expression data from multiple high-throughput platforms can be sensitive to technological artifacts when the number of samples in each platform is small. This paper proposes simple tools for quantifying concordance in a small study of pancreatic cancer cell lines, with an emphasis on visualizations that uncover intra- and inter-platform variation. Using this approach, we identify several transcripts from the integrative analysis whose over- or under-expression in pancreatic cancer cell lines was validated by qPCR.
microarrays; cross-platform; rank statistics; differential gene expression
Although multiple genes have been identified as genetic risk factors for isolated, non-syndromic cleft lip with/without cleft palate (CL/P), a complex and heterogeneous birth defect, the interferon regulatory factor 6 gene (IRF6) is one of the best documented. In this study, we tested for association between markers in IRF6 and CL/P in 326 Chinese case–parent trios, considering parent-of-origin effects and gene–environment interaction for two common maternal exposures. CL/P case–parent trios from three sites in mainland China and Taiwan were genotyped for 22 single nucleotide polymorphisms (SNPs) in IRF6. The transmission disequilibrium test was used to test for marginal effects of individual SNPs. We used PBAT to screen the SNPs and haplotypes for gene–environment (G × E) interaction and conditional logistic regression models to quantify effect sizes for SNP–environment interaction. After Bonferroni correction, 14 SNPs showed statistically significant association with CL/P. Evidence of G × E interaction was found for both maternal exposures, multivitamin supplementation and environmental tobacco smoke (ETS). Two SNPs showed evidence of interaction with multivitamin supplementation in conditional logistic regression models (rs2076153 nominal P = 0.019, rs17015218 nominal P = 0.012). In addition, rs1044516 yielded evidence for interaction with maternal ETS (nominal P = 0.041). Haplotype analysis using PBAT also suggested interaction between SNPs in IRF6 and both multivitamin supplementation and ETS. However, no evidence for maternal genotypic effects or significant parent-of-origin effects was seen in these data. These results suggest that the IRF6 gene may influence risk of CL/P through interaction with multivitamin supplementation and ETS in the Chinese population.
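For readers unfamiliar with the transmission disequilibrium test used in the IRF6 study above, the following is a minimal sketch of the standard biallelic TDT statistic, (b − c)² / (b + c), referred to a chi-squared distribution with 1 degree of freedom. It is not the study's analysis code: the function names `tdt` and `bonferroni_threshold`, the example counts, and the family-wise level of 0.05 are all assumptions made for illustration. Only the number of SNPs tested (22) comes from the abstract.

```python
import math


def tdt(b: int, c: int) -> tuple[float, float]:
    """Standard biallelic TDT (hypothetical helper, not the study's code).

    b: heterozygous parents who transmitted the candidate allele to the case
    c: heterozygous parents who transmitted the other allele
    Returns (chi-squared statistic, p-value on 1 degree of freedom).
    """
    if b < 0 or c < 0 or b + c == 0:
        raise ValueError("need non-negative counts with at least one transmission")
    stat = (b - c) ** 2 / (b + c)
    # For 1 df, P(chi2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance threshold; alpha = 0.05 is an assumed level."""
    return alpha / n_tests


if __name__ == "__main__":
    # Illustrative counts only; these are not data from the study.
    stat, p = tdt(60, 40)
    threshold = bonferroni_threshold(22)  # 22 SNPs were genotyped in IRF6
    print(f"TDT chi-squared = {stat:.2f}, p = {p:.4f}")
    print(f"Bonferroni threshold for 22 tests = {threshold:.5f}")
    print("significant after correction:", p < threshold)
```

Only heterozygous parents contribute to the statistic, which is why the test is robust to population stratification: each transmission is compared against the non-transmitted allele from the same parent.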
Candidate gene association studies (CGAS) are a useful epidemiologic approach to drawing inferences about relations between genes and disease, especially when experimental data support the involvement of specific biochemical pathways. The value of CGAS is apparent when allele frequencies are low, effect sizes are small, or the study population is limited or unique. CGAS is also valuable for validating previous reports of genetic associations with disease in different populations. Despite the many advantages, the information generated from CGAS is sometimes compromised because of either inefficient study design or suboptimal analytical approaches. Here the authors discuss issues related to the study design and statistical analyses of CGAS that can help to optimize their usefulness and information content. These issues include judicious hypothesis-driven selection of biochemical pathways, genes, and single nucleotide polymorphisms, as well as appropriate quality control and analytical procedures for measuring main effects and for evaluating environmental exposure modifications and interactions. A study design algorithm using the example of DNA repair genes and cancer is presented for purposes of illustration.
cancer; data analysis; DNA repair; genetic epidemiology; genome-wide association study; haplotypes; polymorphism, single nucleotide; research design