1.  Unsupervised gene set testing based on random matrix theory 
BMC Bioinformatics  2016;17:442.
Gene set testing, or pathway analysis, is a bioinformatics technique that performs statistical testing on biologically meaningful sets of genomic variables. Although originally developed for supervised analyses, i.e., to test the association between gene sets and an outcome variable, gene set testing also has important unsupervised applications, e.g., p-value weighting. For unsupervised testing, however, few effective gene set testing methods are available, and support is especially poor for several biologically relevant use cases.
In this paper, we describe two new unsupervised gene set testing methods based on random matrix theory, the Marc̆enko-Pastur Distribution Test (MPDT) and the Tracy-Widom Test (TWT), that support both self-contained and competitive null hypotheses. For the self-contained case, we contrast our proposed tests with the classic multivariate test based on a modified likelihood ratio criterion. For the competitive case, we compare the new tests against a competitive version of the classic test and our recently developed Spectral Gene Set Enrichment (SGSE) method. Evaluation of the TWT and MPDT methods is based on both simulation studies and a weighted p-value analysis of two real gene expression data sets using gene sets drawn from MSigDB collections.
The MPDT and TWT methods are novel and effective tools for unsupervised gene set analysis with superior statistical performance relative to existing techniques and the ability to generate biologically important results on real genomic data sets.
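The abstract does not give the MPDT or TWT test statistics, so the following is only a minimal sketch of the random-matrix result such tests build on: under a pure-noise null, the eigenvalues of a sample covariance matrix concentrate inside the Marchenko-Pastur support, and eigenvalues above its upper edge suggest structure. The function names, the unit-variance default, and the restriction to no more variables than samples are assumptions made for illustration, not the authors' implementation.

```python
import math


def mp_support(gamma, sigma2=1.0):
    """Return the (lower, upper) edges of the Marchenko-Pastur support.

    gamma is the ratio of variables to samples (p / n); sigma2 is the
    variance of the noise entries. This sketch assumes 0 < gamma <= 1.
    """
    if not 0 < gamma <= 1:
        raise ValueError("this sketch assumes 0 < gamma <= 1")
    root = math.sqrt(gamma)
    return sigma2 * (1 - root) ** 2, sigma2 * (1 + root) ** 2


def mp_density(x, gamma, sigma2=1.0):
    """Marchenko-Pastur density at x; zero outside the open support."""
    lower, upper = mp_support(gamma, sigma2)
    if x <= lower or x >= upper:
        return 0.0
    return math.sqrt((upper - x) * (x - lower)) / (2 * math.pi * sigma2 * gamma * x)


def eigenvalues_above_edge(eigenvalues, n_genes, n_samples, sigma2=1.0):
    """Return the eigenvalues that exceed the upper Marchenko-Pastur edge."""
    _, upper = mp_support(n_genes / n_samples, sigma2)
    return [ev for ev in eigenvalues if ev > upper]
```

For a hypothetical gene set of 25 genes measured on 100 samples, `eigenvalues_above_edge(eigs, 25, 100)` would keep only eigenvalues larger than the upper edge for gamma = 0.25. A formal test would also need a null distribution for the largest eigenvalue, which is where the Tracy-Widom law enters.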
Electronic supplementary material
The online version of this article (doi:10.1186/s12859-016-1299-8) contains supplementary material, which is available to authorized users.
PMCID: PMC5096314  PMID: 27809777
Gene set testing; Pathway analysis; Random matrix theory; Tracy-Widom; Marc̆enko-Pastur
2.  Identification of shared and unique susceptibility pathways among cancers of the lung, breast, and prostate from genome-wide association studies and tissue-specific protein interactions 
Human Molecular Genetics  2015;24(25):7406-7420.
Results from genome-wide association studies (GWAS) have indicated that strong single-gene effects are the exception, not the rule, for most diseases. We assessed the joint effects of germline genetic variations through a pathway-based approach that considers the tissue-specific contexts of GWAS findings. From GWAS meta-analyses of lung cancer (12 160 cases/16 838 controls), breast cancer (15 748 cases/18 084 controls) and prostate cancer (14 160 cases/12 724 controls) in individuals of European ancestry, we determined the tissue-specific interaction networks of proteins expressed from genes that are likely to be affected by disease-associated variants. Reactome pathways exhibiting enrichment of proteins from each network were compared across the cancers. Our results show that pathways associated with all three cancers tend to be broad cellular processes required for growth and survival. Significant examples include the nerve growth factor (P = 7.86 × 10−33), epidermal growth factor (P = 1.18 × 10−31) and fibroblast growth factor (P = 2.47 × 10−31) signaling pathways. However, within these shared pathways, the genes that influence risk largely differ by cancer. Pathways found to be unique for a single cancer focus on more specific cellular functions, such as interleukin signaling in lung cancer (P = 1.69 × 10−15), apoptosis initiation by Bad in breast cancer (P = 3.14 × 10−9) and cellular responses to hypoxia in prostate cancer (P = 2.14 × 10−9). We present the largest comparative cross-cancer pathway analysis of GWAS to date. Our approach can also be applied to the study of inherited mechanisms underlying risk across multiple diseases in general.
PMCID: PMC4664175  PMID: 26483192
3.  The role of haplotype in 15q25.1 locus in lung cancer risk: results of scanning chromosome 15 
Carcinogenesis  2015;36(11):1275-1283.
The rs588765-rs16969968 haplotype modifies lung cancer risk more than effects from individual variations at rs16969968 or rs588765, and therefore may be a marker of genetic susceptibility to lung cancer even among never-smokers. This knowledge may facilitate our understanding of lung cancer etiology.
The role of haplotypes and the interaction of haplotypes and smoking in lung cancer risk have not been well characterized. We analyzed data from an Italian population-based, case–control study with 1815 lung cancer patients and 1959 healthy controls in discovery, and performed a validation using a case–control study with 2983 lung cancer patients and 3553 healthy controls of European ancestry for replication. Sliding window haplotype analysis within chromosome 15, evaluating 4722250 haplotypes, and pair-wise haplotype analysis identified CHRNA5 rs588765-rs16969968 as the most significant haplotype associated with lung cancer risk (omnibus P = 8.35×10−15 in discovery and 7.26×10−14 in replication). This haplotype improved the prediction of case status over that provided by the individual SNPs rs16969968 or rs588765 (likelihood ratio test P = 0.006 for rs16969968 and 3.83×10−14 for rs588765 in discovery, 0.009 for rs16969968 and 4.62×10−13 for rs588765 in replication, compared with rs588765-rs16969968). Compared with the wild-type homozygous diplotype, the CA/CA homozygote exhibited an approximately 2-fold increased risk for lung cancer (OR = 2.12; 95% CI 1.46–3.07 in discovery, and OR = 2.01; 95% CI 1.51–2.67 in replication). Even among never-smokers, the CA/CA homozygote showed an increased risk of lung cancer with borderline significance in discovery (adjusted OR = 1.75, 95% CI 0.96–3.19) and statistical significance in replication (adjusted OR = 2.10, 95% CI 1.12–3.96), compared with combined genotypes (CG/CG + CG/TG). Accordingly, rs588765-rs16969968 may be a genetic marker for lung cancer risk, even among never-smokers.
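The odds ratios and 95% confidence intervals quoted above come from adjusted haplotype models that the abstract does not spell out. As a rough illustration of what such an estimate means, here is a minimal sketch of the unadjusted odds ratio from a 2×2 table with a Wald (Woolf) interval on the log scale. The function name and the counts in the usage note are hypothetical and are not taken from the study.

```python
import math


def odds_ratio_ci(exposed_cases, unexposed_cases,
                  exposed_controls, unexposed_controls, z=1.96):
    """Unadjusted odds ratio with a Wald confidence interval.

    All four cell counts must be positive. z = 1.96 gives an
    approximate 95% interval.
    """
    cells = (exposed_cases, unexposed_cases, exposed_controls, unexposed_controls)
    if min(cells) <= 0:
        raise ValueError("all cell counts must be positive")
    odds_ratio = (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)
    # Standard error of log(OR): square root of the summed reciprocal counts.
    se_log_or = math.sqrt(sum(1.0 / c for c in cells))
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper
```

With made-up counts, `odds_ratio_ci(20, 10, 10, 20)` gives an odds ratio of 4 with an interval that is symmetric around it on the log scale. An interval that includes 1, such as the discovery result in never-smokers (0.96–3.19), is why that finding is described as borderline.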
PMCID: PMC4635666  PMID: 26282330
4.  RNA-seq analysis of lung adenocarcinomas reveals different gene expression profiles between smoking and nonsmoking patients 
Lung adenocarcinoma is caused by the combination of genetic and environmental effects, and smoking plays an important role in the disease development. Exploring the gene expression profile and identifying genes that are shared or vary between smokers and nonsmokers with lung adenocarcinoma will provide insights into the etiology of this complex cancer. We obtained RNA-seq data from paired normal and tumor tissues from 34 nonsmoking and 34 smoking patients with lung adenocarcinoma (GEO: GSE40419). The R/Bioconductor package edgeR was used to conduct differential gene expression analysis between paired normal and tumor tissues. A generalized linear model was applied to identify genes that were differentially expressed in nonsmoker and smoker patients as well as genes that varied between these two groups. We identified 2273 genes that showed differential expression with FDR < 0.05 and |logFC| > 1 in nonsmoker tumor versus normal tissues and 3030 such genes in the smoking group; 1967 genes were common to both groups. Sixty-eight percent and 70% of the identified genes were downregulated in the nonsmoking and smoking groups, respectively. The 20 genes with the largest fold changes in smokers, such as SPP1, SPINK1, and FAM83A, also showed similarly large and highly significant fold changes in nonsmokers and vice versa, showing commonalities in expression changes for adenocarcinomas in both smokers and nonsmokers for these genes. We also identified 175 genes that were significantly differentially expressed between tumor samples from nonsmoker and smoker patients. Gene expression profiles varied substantially between smoker and nonsmoker patients with lung adenocarcinoma. Smoking patients overall showed far more complicated disease mechanisms and more dysregulation in their gene expression profiles. Our study reveals pathogenetic differences between smoking and nonsmoking patients with lung adenocarcinoma from transcriptome analysis.
We provided a list of candidate genes for further study for disease detection and treatment in both smoking and nonsmoking patients with lung adenocarcinoma.
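The selection rule stated in the abstract (FDR below 0.05 and absolute log fold change above 1) can be sketched as a simple filter over a results table. This is an illustration only: the study used edgeR in R, the record layout below is an assumption, and the gene names and values are made-up placeholders rather than results from GSE40419.

```python
def select_de_genes(results, fdr_cutoff=0.05, logfc_cutoff=1.0):
    """Return records passing both the FDR and the |logFC| thresholds."""
    return [
        row for row in results
        if row["fdr"] < fdr_cutoff and abs(row["logFC"]) > logfc_cutoff
    ]


def fraction_downregulated(selected):
    """Share of selected genes with a negative log fold change."""
    if not selected:
        return 0.0
    return sum(1 for row in selected if row["logFC"] < 0) / len(selected)


# Hypothetical tumor-versus-normal results (placeholder values).
toy_results = [
    {"gene": "GENE_A", "logFC": 2.3, "fdr": 0.001},
    {"gene": "GENE_B", "logFC": -1.8, "fdr": 0.02},
    {"gene": "GENE_C", "logFC": 0.4, "fdr": 0.0001},  # significant but small change
    {"gene": "GENE_D", "logFC": 3.0, "fdr": 0.20},    # large change but not significant
]

selected = select_de_genes(toy_results)
print([row["gene"] for row in selected], fraction_downregulated(selected))
```

Requiring both conditions is what keeps GENE_C and GENE_D out of the toy list: each passes one threshold but not the other.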
PMCID: PMC4674426  PMID: 26081616
RNA-seq; Expression analysis; Smoking; Lung cancer; Lung adenocarcinoma
5.  Focused Analysis of Exome Sequencing Data for Rare Germline Mutations in Familial and Sporadic Lung Cancer 
The association between smoking induced chronic obstructive pulmonary disease (COPD) and lung cancer (LC) is well documented. Recent genome-wide association studies (GWAS) have identified 28 susceptibility loci for LC, 10 for COPD, 32 for smoking behavior (SM), and 63 for pulmonary function (PF), totaling 107 non-overlapping loci. Given that common variants have been found to be associated with LC in GWAS, exome sequencing of these high-priority regions has great potential to identify novel rare causal variants.
Patients and Methods
Using a variation of the extreme phenotype approach, we selected 48 sporadic LC patients reporting heavy smoking histories, 37 of whom also exhibited carefully documented severe COPD (in whom smoking is considered the overwhelming determinant), and 54 unique familial LC cases from families with at least three first-degree relatives with LC (who are likely enriched for genomic effects), to search for disease-causing rare germline mutations.
By focusing on exome profiles of the 107 target loci, we identified two key rare mutations. A heterozygous p.Arg696Cys variant in the Coiled-Coil Domain Containing 147 (CCDC147) gene at 10q25.1 was identified in one sporadic and two familial cases. The minor allele frequency (MAF) of this variant in the 1000 Genomes (TG) database is 0.0026. The p.Val26Met variant in Dopamine Beta-Hydroxylase (DBH) gene at 9q34.2 was identified in two sporadic cases; MAF of this mutation is 0.0034 from the TG database. We also observed three suggestive rare mutations on 15q25.1 IREB2/CHRNA5/CHRNB4.
Our results demonstrated highly disruptive risk-conferring CCDC147 and DBH mutations.
PMCID: PMC4714038  PMID: 26762739
Exome sequencing; Single nucleotide variants (SNV); Lung cancer (LC); Chronic obstructive pulmonary disease (COPD); Familial and Sporadic
6.  Fine mapping of chromosome 5p15.33 based on a targeted deep sequencing and high density genotyping identifies novel lung cancer susceptibility loci 
Carcinogenesis  2015;37(1):96-105.
Based on deep targeted sequencing and Axiom data in 10 lung cancer studies, our fine mapping analysis identified multiple novel lung cancer susceptibility variants in 5p15.33 region. It also demonstrated that telomere length is a key mechanism of these associations.
Chromosome 5p15.33 has been identified as a lung cancer susceptibility locus; however, the underlying causal mechanisms have not been fully elucidated. Previous fine-mapping studies of this locus have relied on imputation or investigated a small number of known, common variants. This study represents a significant advance over previous research by investigating a large number of novel, rare variants, as well as their underlying mechanisms through telomere length. Variants for this fine-mapping study were identified through a targeted deep sequencing (average depth of coverage greater than 4000×) of 576 individuals. Subsequently, 4652 SNPs, including 1108 novel SNPs, were genotyped in 5164 cases and 5716 controls of European ancestry. After adjusting for known risk loci, rs2736100 and rs401681, we identified a new, independent lung cancer susceptibility variant in LPCAT1: rs139852726 (OR = 0.46, P = 4.73×10–9), and three new adenocarcinoma risk variants in TERT: rs61748181 (OR = 0.53, P = 2.64×10–6), rs112290073 (OR = 1.85, P = 1.27×10–5), rs138895564 (OR = 2.16, P = 2.06×10–5; among young cases, OR = 3.77, P = 8.41×10–4). In addition, we found that rs139852726 (P = 1.44×10–3) was associated with telomere length in a sample of 922 healthy individuals. The gene-based SKAT-O analysis implicated TERT as the most relevant gene in the 5p15.33 region for adenocarcinoma (P = 7.84×10–7) and lung cancer (P = 2.37×10–5) risk. In this largest fine-mapping study to investigate a large number of rare and novel variants within 5p15.33, we identified novel lung and adenocarcinoma susceptibility loci with large effects and provided support for the role of telomere length as the potential underlying mechanism.
PMCID: PMC4715236  PMID: 26590902
7.  Component-wise gradient boosting and false discovery control in survival analysis with high-dimensional covariates 
Bioinformatics  2015;32(1):50-57.
Motivation: Technological advances that allow routine identification of high-dimensional risk factors have led to high demand for statistical techniques that enable full utilization of these rich sources of information for genetics studies. Variable selection for censored outcome data as well as control of false discoveries (i.e. inclusion of irrelevant variables) in the presence of high-dimensional predictors present serious challenges. This article develops a computationally feasible method based on boosting and stability selection. Specifically, we modified the component-wise gradient boosting to improve the computational feasibility and introduced random permutation in stability selection for controlling false discoveries.
Results: We have proposed a high-dimensional variable selection method by incorporating stability selection to control false discovery. Comparisons between the proposed method and the commonly used univariate and Lasso approaches for variable selection reveal that the proposed method yields fewer false discoveries. The proposed method is applied to study the associations of 2339 common single-nucleotide polymorphisms (SNPs) with overall survival among cutaneous melanoma (CM) patients. The results have confirmed that BRCA2 pathway SNPs are likely to be associated with overall survival, as reported by previous literature. Moreover, we have identified several new Fanconi anemia (FA) pathway SNPs that are likely to modulate survival of CM patients.
Availability and implementation: The related source code and documents are freely available at
PMCID: PMC4757968  PMID: 26382192
8.  Obesity Early in Adulthood Increases Risk but Does Not Affect Outcomes of Hepatocellular Carcinoma 
Gastroenterology  2015;149(1):119-129.
Despite the significant association between obesity and several cancers, it has been difficult to establish an association between obesity and hepatocellular carcinoma (HCC). Patients with HCC often have ascites, making it a challenge to accurately determine body mass index (BMI), and many factors contribute to the development of HCC. We performed a case–control study to investigate whether obesity early in adulthood affects risk, age of onset, or outcomes of patients with HCC.
We interviewed 622 patients newly diagnosed with HCC from January 2004 through December 2013, along with 660 healthy controls (frequency-matched by age and sex) to determine weights, heights, and body sizes (self-reported) at various ages before HCC development or enrollment as controls. Multivariable logistic and Cox regression analyses were performed to determine the independent effects of early obesity on risk for HCC and patient outcomes, respectively. BMI was calculated, and patients with a BMI ≥30 kg/m2 were considered obese.
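The obesity definition used here follows directly from the BMI formula: weight in kilograms divided by the square of height in metres, with 30 kg/m2 or more classed as obese. A minimal sketch, in which the function names and the example measurements in the usage note are illustrative assumptions:

```python
OBESITY_THRESHOLD = 30.0  # kg/m^2, the cutoff stated in the abstract


def body_mass_index(weight_kg, height_m):
    """BMI = weight (kg) divided by height (m) squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def is_obese(weight_kg, height_m):
    """True when BMI is at or above the obesity threshold."""
    return body_mass_index(weight_kg, height_m) >= OBESITY_THRESHOLD
```

For instance, a hypothetical self-reported 90 kg at 1.70 m gives a BMI just above 31 and counts as obese, whereas 70 kg at 1.75 m does not.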
Obesity in early adulthood (age, mid-20s to mid-40s) is a significant risk factor for HCC. The estimated odds ratios (OR) and 95% confidence intervals (CI) were 2.6 (1.4–4.4), 2.3 (1.2–4.4), and 3.6 (1.5–8.9) for the entire population, men, and women, respectively. Each unit increase in BMI at early adulthood was associated with a 3.89-month decrease in age at HCC diagnosis (P<.001). Moreover, there is a synergistic interaction between obesity and hepatitis virus infection. However, we found no effect of obesity on the overall survival of patients with HCC.
Early adulthood obesity is associated with increased risk of developing HCC at a young age in the absence of major HCC risk factors, with no effect on outcomes of patients with HCC.
PMCID: PMC4778392  PMID: 25836985
obesity; HCC; case-control; risk factor
9.  Association between Adult Height and Risk of Colorectal, Lung, and Prostate Cancer: Results from Meta-analyses of Prospective Studies and Mendelian Randomization Analyses 
PLoS Medicine  2016;13(9):e1002118.
Observational studies examining associations between adult height and risk of colorectal, prostate, and lung cancers have generated mixed results. We conducted meta-analyses using data from prospective cohort studies and further carried out Mendelian randomization analyses, using height-associated genetic variants identified in a genome-wide association study (GWAS), to evaluate the association of adult height with these cancers.
Methods and Findings
A systematic review of prospective studies was conducted using the PubMed, Embase, and Web of Science databases. Using meta-analyses, results obtained from 62 studies were summarized for the association of a 10-cm increase in height with cancer risk. Mendelian randomization analyses were conducted using summary statistics obtained for 423 genetic variants identified from a recent GWAS of adult height and from a cancer genetics consortium study of multiple cancers that included 47,800 cases and 81,353 controls. For a 10-cm increase in height, the summary relative risks derived from the meta-analyses of prospective studies were 1.12 (95% CI 1.10, 1.15), 1.07 (95% CI 1.05, 1.10), and 1.06 (95% CI 1.02, 1.11) for colorectal, prostate, and lung cancers, respectively. Mendelian randomization analyses showed increased risks of colorectal (odds ratio [OR] = 1.58, 95% CI 1.14, 2.18) and lung cancer (OR = 1.10, 95% CI 1.00, 1.22) associated with each 10-cm increase in genetically predicted height. No association was observed for prostate cancer (OR = 1.03, 95% CI 0.92, 1.15). Our meta-analysis was limited to published studies. The sample size for the Mendelian randomization analysis of colorectal cancer was relatively small, thus affecting the precision of the point estimate.
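The abstract does not say which pooling model produced the summary relative risks, so the following is only a sketch of one common choice: fixed-effect inverse-variance pooling on the log scale, with each study's standard error recovered from its reported 95% confidence interval. The function names and the study values in the usage note are hypothetical.

```python
import math

Z95 = 1.96  # normal quantile for a two-sided 95% interval


def log_rr_and_se(rr, ci_low, ci_high):
    """Log relative risk and its standard error, recovered from a 95% CI."""
    return math.log(rr), (math.log(ci_high) - math.log(ci_low)) / (2 * Z95)


def pool_fixed_effect(studies):
    """Inverse-variance pooled RR and 95% CI from (rr, ci_low, ci_high) tuples."""
    weighted_sum = 0.0
    total_weight = 0.0
    for rr, ci_low, ci_high in studies:
        log_rr, se = log_rr_and_se(rr, ci_low, ci_high)
        weight = 1.0 / se ** 2  # more precise studies get more weight
        weighted_sum += weight * log_rr
        total_weight += weight
    pooled_log_rr = weighted_sum / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return (
        math.exp(pooled_log_rr),
        math.exp(pooled_log_rr - Z95 * pooled_se),
        math.exp(pooled_log_rr + Z95 * pooled_se),
    )
```

Pooling two identical hypothetical studies, `pool_fixed_effect([(1.10, 1.00, 1.21), (1.10, 1.00, 1.21)])`, returns the same point estimate with a narrower interval, which is the basic gain from combining studies. A random-effects model would additionally allow for between-study heterogeneity.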
Our study provides evidence for a potential causal association of adult height with the risk of colorectal and lung cancers and suggests that certain genetic factors and biological pathways affecting adult height may also affect the risk of these cancers.
In a Mendelian randomisation study Pierce and colleagues show a genetic association between adult height and increased risk of colorectal and lung cancer.
Author Summary
Why Was This Study Done?
Several previous observational studies have examined the association between adult height and risk of cancers of the lung, colon/rectum, and prostate; however, it remains unclear whether adult height is indeed related to the risk of these cancers.
What Did the Researchers Do and Find?
We conducted a systematic review and meta-analysis of prospective cohort studies that examined the association between adult height and the risk of colorectal, lung, and prostate cancers.
To overcome inherent limitations of observational study designs, we conducted Mendelian randomization analyses using genetic data generated from a large multi-center consortium study including 47,800 cases and 81,353 controls.
In the meta-analysis of the prospective observational studies, we found a 12% increased risk of colorectal cancer, a 7% increased risk of prostate cancer, and a 6% increased risk of lung cancer for every ten-centimeter increase in height, and this increased risk was corroborated in the Mendelian randomization analyses for colorectal (58%) and lung cancer (10%).
What Do These Findings Mean?
Our study provides strong evidence for an association between adult height and risk of colorectal and lung cancer, and suggests that certain genetic and biological factors that affect height may also affect the risk of these cancers.
However, our meta-analysis was limited to published studies, and the sample size for the Mendelian randomization analysis for colorectal cancer was relatively small, affecting the precision of the risk estimate.
PMCID: PMC5012582  PMID: 27598322
10.  Deciphering associations for lung cancer risk through imputation and analysis of 12 316 cases and 16 831 controls 
European Journal of Human Genetics  2015;23(12):1723-1728.
Recent genome-wide association studies have identified common variants at multiple loci influencing lung cancer risk. To decipher the genetic basis of the association signals at 3q28, 5p15.33, 6p21.33, 9p21 and 12p13.33, we performed a meta-analysis of data from five genome-wide association studies in populations of European ancestry totalling 12 316 lung cancer cases and 16 831 controls using imputation to recover untyped genotypes. For four of the regions, it was possible to refine the association signal, identifying a smaller region of interest likely to harbour the functional variant. Our analysis did not provide evidence that any of the associations at these loci is a consequence of synthetic association rather than of linkage disequilibrium with a common risk variant.
PMCID: PMC4795209  PMID: 25804397
11.  A global test for gene‐gene interactions based on random matrix theory 
Genetic Epidemiology  2016;40(8):689-701.
Statistical interactions between markers of genetic variation, or gene‐gene interactions, are believed to play an important role in the etiology of many multifactorial diseases and other complex phenotypes. Unfortunately, detecting gene‐gene interactions is extremely challenging due to the large number of potential interactions and ambiguity regarding marker coding and interaction scale. For many data sets, there is insufficient statistical power to evaluate all candidate gene‐gene interactions. In these cases, a global test for gene‐gene interactions may be the best option. Global tests have much greater power relative to multiple individual interaction tests and can be used on subsets of the markers as an initial filter prior to testing for specific interactions. In this paper, we describe a novel global test for gene‐gene interactions, the global epistasis test (GET), that is based on results from random matrix theory. As we show via simulation studies based on previously proposed models for common diseases including rheumatoid arthritis, type 2 diabetes, and breast cancer, our proposed GET method has superior performance characteristics relative to existing global gene‐gene interaction tests. A glaucoma GWAS data set is used to demonstrate the practical utility of the GET method.
PMCID: PMC5132142  PMID: 27386793
gene‐gene interaction; random matrix theory; global test
12.  TERT Polymorphism rs2736100-C Is Associated with EGFR Mutation-Positive Non-Small Cell Lung Cancer 
Epidermal growth factor receptor (EGFR) mutation-positive (EGFRmut+) non-small cell lung cancer (NSCLC) may be a unique orphan disease. Previous studies suggested that the telomerase reverse transcriptase (TERT) gene polymorphism is associated with demographic and clinical features strongly associated with EGFR mutations, e.g. adenocarcinoma histology, never-smoking history and female gender. We aim to test the association between TERT polymorphism and EGFRmut+ NSCLC.
Experimental Design
We conducted a genetic association study in Chinese NSCLC patients (n=714) and healthy controls (n=2,520), between the rs2736100 polymorphism and EGFRmut+ NSCLC. We further tested the association between the EGFR mutation status and mean leukocyte telomere length (LTL). The potential function of rs2736100 in lung epithelial cells was also explored.
The rs2736100-C allele was significantly associated with EGFRmut+ NSCLC (OR=1.52, 95%CI=1.28–1.80, p=1.6×10−6) but not EGFRmut− NSCLC (OR=1.07, 95%CI=0.92–1.24, p=0.4). While NSCLC patients as a whole have significantly longer LTL compared to healthy controls (p≤10−13), the EGFRmut+ patients have even longer LTL compared to EGFRmut− patients (p=0.008). Meanwhile, rs2736100 was significantly associated with TERT mRNA expression in both normal and tumor lung tissues. All results remained significant after controlling for age, gender, smoking status and histology (p<0.05 for all tests). Moreover, the rs2736100 DNA sequence has an allele-specific affinity to nuclear proteins extracted from lung epithelial cells, which led to an altered enhancer activity of the sequence in vitro.
Our study suggests that telomerase and telomere function may be essential for carcinogenesis of EGFRmut+ NSCLC. Further investigation for the underlying mechanism is warranted.
PMCID: PMC4644673  PMID: 26149460
EGFR mutation; NSCLC; TERT; rs2736100; genetic association
13.  Identification of lung cancer histology-specific variants applying Bayesian framework variant prioritization approaches within the TRICL and ILCCO consortia 
Brenner, Darren R. | Amos, Christopher I. | Brhane, Yonathan | Timofeeva, Maria N. | Caporaso, Neil | Wang, Yufei | Christiani, David C. | Bickeböller, Heike | Yang, Ping | Albanes, Demetrius | Stevens, Victoria L. | Gapstur, Susan | McKay, James | Boffetta, Paolo | Zaridze, David | Szeszenia-Dabrowska, Neonilia | Lissowska, Jolanta | Rudnai, Peter | Fabianova, Eleonora | Mates, Dana | Bencko, Vladimir | Foretova, Lenka | Janout, Vladimir | Krokan, Hans E. | Skorpen, Frank | Gabrielsen, Maiken E. | Vatten, Lars | Njølstad, Inger | Chen, Chu | Goodman, Gary | Lathrop, Mark | Vooder, Tõnu | Välk, Kristjan | Nelis, Mari | Metspalu, Andres | Broderick, Peter | Eisen, Timothy | Wu, Xifeng | Zhang, Di | Chen, Wei | Spitz, Margaret R. | Wei, Yongyue | Su, Li | Xie, Dong | She, Jun | Matsuo, Keitaro | Matsuda, Fumihiko | Ito, Hidemi | Risch, Angela | Heinrich, Joachim | Rosenberger, Albert | Muley, Thomas | Dienemann, Hendrik | Field, John K. | Raji, Olaide | Chen, Ying | Gosney, John | Liloglou, Triantafillos | Davies, Michael P.A. | Marcus, Michael | McLaughlin, John | Orlow, Irene | Han, Younghun | Li, Yafang | Zong, Xuchen | Johansson, Mattias | Liu, Geoffrey | Tworoger, Shelley S. | Le Marchand, Loic | Henderson, Brian E. | Wilkens, Lynne R. | Dai, Juncheng | Shen, Hongbing | Houlston, Richard S. | Landi, Maria T. | Brennan, Paul | Hung, Rayjean J.
Carcinogenesis  2015;36(11):1314-1326.
Using information including variant physical and functional properties, we applied multiple variant prioritization techniques in 13 lung cancer genomic studies. We identified and validated novel regions highlighting the utility of using prioritization analyses to search for robust signals.
Large-scale genome-wide association studies (GWAS) have likely uncovered all common variants at the GWAS significance level. Additional variants within the suggestive range (0.0001> P > 5×10−8) are, however, still of interest for identifying causal associations. This analysis aimed to apply novel variant prioritization approaches to identify additional lung cancer variants that may not reach the GWAS level. Effects were combined across studies with a total of 33456 controls and 6756 adenocarcinoma (AC; 13 studies), 5061 squamous cell carcinoma (SCC; 12 studies) and 2216 small cell lung cancer cases (9 studies). Based on prior information such as variant physical properties and functional significance, we applied stratified false discovery rates, hierarchical modeling and Bayesian false discovery probabilities for variant prioritization. We conducted a fine mapping analysis as validation of our methods by examining top-ranking novel variants in six independent populations with a total of 3128 cases and 2966 controls. Three novel loci in the suggestive range were identified based on our Bayesian framework analyses: KCNIP4 at 4p15.2 (rs6448050, P = 4.6×10−7) and MTMR2 at 11q21 (rs10501831, P = 3.1×10−6) with SCC, as well as GAREM at 18q12.1 (rs11662168, P = 3.4×10−7) with AC. Use of our prioritization methods validated two of the top three loci associated with SCC (P = 1.05×10−4 for KCNIP4, represented by rs9799795) and AC (P = 2.16×10−4 for GAREM, represented by rs3786309) in the independent fine mapping populations. This study highlights the utility of using prior functional data for sequence variants in prioritization analyses to search for robust signals in the suggestive range.
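The "suggestive range" defined above (0.0001 > P > 5×10−8) amounts to a simple three-way classification of association P values. A minimal sketch follows; the function name, the labels, and the decision to count a P value of exactly 5×10−8 as genome-wide significant are assumptions made for illustration.

```python
GWAS_THRESHOLD = 5e-8     # conventional genome-wide significance level
SUGGESTIVE_UPPER = 1e-4   # upper bound of the suggestive range in the abstract


def classify_p_value(p):
    """Label a P value as genome-wide, suggestive, or below suggestive."""
    if not 0 <= p <= 1:
        raise ValueError("a P value must lie between 0 and 1")
    if p <= GWAS_THRESHOLD:  # boundary handling is an assumption
        return "genome-wide"
    if p < SUGGESTIVE_UPPER:
        return "suggestive"
    return "below suggestive"


# Discovery P values reported in the abstract for the three novel loci.
reported = {"rs6448050": 4.6e-7, "rs10501831": 3.1e-6, "rs11662168": 3.4e-7}
for snp, p in reported.items():
    print(snp, classify_p_value(p))
```

All three reported variants fall in the suggestive range, which is why prioritization based on prior information, rather than the genome-wide threshold alone, was needed to single them out.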
PMCID: PMC4635669  PMID: 26363033
15.  Cross Cancer Genomic Investigation of Inflammation Pathway for Five Common Cancers: Lung, Ovary, Prostate, Breast, and Colorectal Cancer 
Inflammation has been hypothesized to increase the risk of cancer development as an initiator or promoter, yet no large-scale study of inherited variation across cancer sites has been conducted.
We conducted a cross-cancer genomic analysis for the inflammation pathway based on 48 genome-wide association studies within the National Cancer Institute GAME-ON Network across five common cancer sites, with a total of 64 591 cancer patients and 74 467 control patients. Subset-based meta-analysis was used to account for possible disease heterogeneity, and hierarchical modeling was employed to estimate the effect of the subcomponents within the inflammation pathway. The network was visualized by enrichment map. All statistical tests were two-sided.
We identified three pleiotropic loci within the inflammation pathway, including one novel locus in Ch12q24 encoding SH2B3 (rs3184504), which reached GWAS significance with a P value of 1.78×10−8, and it showed an association with lung cancer (P = 2.01×10−6), colorectal cancer (GECCO P = 6.72×10−6; CORECT P = 3.32×10−5), and breast cancer (P = .009). We also identified five key subpathway components with genetic variants that are relevant for the risk of these five cancer sites: inflammatory response for colorectal cancer (P = .006), inflammation related cell cycle gene for lung cancer (P = 1.35×10−6), and activation of immune response for ovarian cancer (P = .009). In addition, sequence variations in immune system development played a role in breast cancer etiology (P = .001) and innate immune response was involved in the risk of both colorectal (P = .022) and ovarian cancer (P = .003).
Genetic variations in inflammation and its related subpathway components are keys to the development of lung, colorectal, ovary, and breast cancer, including SH2B3, which is associated with lung, colorectal, and breast cancer.
PMCID: PMC4675100  PMID: 26319099
16.  FastPop: a rapid principal component derived method to infer intercontinental ancestry using genetic data 
BMC Bioinformatics  2016;17:122.
Identifying subpopulations within a study and inferring intercontinental ancestry of the samples are important steps in genome wide association studies. Two software packages are widely used in analysis of substructure: Structure and Eigenstrat. Structure assigns each individual to a population by using a Bayesian method with multiple tuning parameters. It requires considerable computational time when dealing with thousands of samples and lacks the ability to create scores that could be used as covariates. Eigenstrat uses a principal component analysis method to model all sources of sampling variation. However, it does not readily provide information directly relevant to ancestral origin; the eigenvectors generated by Eigenstrat are sample specific and thus cannot be generalized to other individuals.
We developed FastPop, an efficient R package that fills the gap between Structure and Eigenstrat. It can (1) generate PCA scores that identify ancestral origins and can be used for multiple studies, and (2) infer ancestry information for data arising from two or more intercontinental origins. We demonstrate the use of FastPop using 2318 SNP markers selected from the genome based on high variability among European, Asian and West African (African) populations. We conducted an analysis of 505 HapMap samples with European, African or Asian ancestry along with 19661 additional samples of unknown ancestry. The results from FastPop are highly consistent with those obtained by Structure across the 19661 samples we studied. The correlations of the results between FastPop and Structure are 0.99, 0.97 and 0.99 for European, African and Asian ancestry scores, respectively. FastPop is also more efficient than Structure: it finished ancestry inference for the 19661 samples in 16 min, compared with the 21–24 h required by Structure. FastPop also provides scores based on SNP weights, so the scores of the reference population can be applied to other studies provided the same set of markers is used. We also present an application of the method for studying four continental populations (European, Asian, African, and Native American).
We developed an algorithm that can infer ancestries on data involving two or more intercontinental origins. It is efficient for analyzing large datasets. Additionally, the PCA-derived scores can be applied to multiple data sets to ensure the same ancestry analysis is applied to all studies.
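The idea of reusing SNP weights from a reference panel to score new samples can be illustrated with a minimal sketch. FastPop itself is an R package, and the code below is not its API: the function names (`fit_snp_weights`, `project`), the 0/1/2 genotype coding, the per-SNP standardization, and the simulated two-population data are all assumptions made for illustration. The sketch fits a PCA on reference genotypes, keeps the per-SNP weights (loadings), and applies those same weights to samples that were not part of the fit.

```python
import numpy as np

def fit_snp_weights(ref_genotypes, n_components=2):
    """Fit PCA on reference genotypes (samples x SNPs, coded 0/1/2).

    Returns per-SNP means, per-SNP scales, and SNP weights (loadings).
    Hypothetical helper for illustration; not part of FastPop.
    """
    G = np.asarray(ref_genotypes, dtype=float)
    mean = G.mean(axis=0)
    scale = G.std(axis=0)
    scale[scale == 0] = 1.0  # guard against monomorphic SNPs
    Z = (G - mean) / scale
    # Rows of Vt are the principal axes of the standardized matrix.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    weights = Vt[:n_components].T  # shape: (n_snps, n_components)
    return mean, scale, weights

def project(genotypes, mean, scale, weights):
    """Score samples with weights learned from the reference panel."""
    G = np.asarray(genotypes, dtype=float)
    return ((G - mean) / scale) @ weights

# Simulated data (an assumption): two populations with different
# allele frequencies at 200 SNPs, 50 reference samples each.
rng = np.random.default_rng(0)
n_snps = 200
freq_a = rng.uniform(0.05, 0.95, n_snps)
freq_b = rng.uniform(0.05, 0.95, n_snps)
ref_a = rng.binomial(2, freq_a, size=(50, n_snps))
ref_b = rng.binomial(2, freq_b, size=(50, n_snps))
reference = np.vstack([ref_a, ref_b])

mean, scale, weights = fit_snp_weights(reference, n_components=1)
ref_scores = project(reference, mean, scale, weights)[:, 0]

# New samples drawn from population A, scored with the stored weights.
new_a = rng.binomial(2, freq_a, size=(10, n_snps))
new_scores = project(new_a, mean, scale, weights)[:, 0]
```

Because the weights are fixed once the reference panel is fitted, any later data set genotyped on the same markers can be placed on the same axes, which is the property the abstract highlights. The sign of a principal component is arbitrary, so scores should be interpreted relative to the reference populations rather than by sign.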
Electronic supplementary material
The online version of this article (doi:10.1186/s12859-016-0965-1) contains supplementary material, which is available to authorized users.
PMCID: PMC4784403  PMID: 26961892
Population structure; Principal component; Ancestry; Genome-wide association study
17.  Integrated pathway and epistasis analysis reveals interactive effect of genetic variants at TERF1 and AFAP1L2 loci on melanoma risk 
Genome-wide association studies (GWASs) have characterized 13 loci associated with melanoma, which only account for a small part of melanoma risk. To identify new genes with too small an effect to be detected individually but which collectively influence melanoma risk and/or show interactive effects, we used a two-step analysis strategy including pathway analysis of genome-wide SNP data, in a first step, and epistasis analysis within significant pathways, in a second step. Pathway analysis, using the gene-set enrichment analysis (GSEA) approach and the gene ontology (GO) database, was applied to the outcomes of MELARISK (3,976 subjects) and MDACC (2,827 subjects) GWASs. Cross-gene SNP-SNP interaction analysis within melanoma-associated GOs was performed using the INTERSNP software. Five GO categories were significantly enriched in genes associated with melanoma (FDR ≤ 5% in both studies): response to light stimulus, regulation of mitotic cell cycle, induction of programmed cell death, cytokine activity and oxidative phosphorylation. Epistasis analysis, within each of the five significant GOs, showed significant evidence for interaction for one SNP pair at TERF1 and AFAP1L2 loci (p_meta-int = 2.0 × 10^-7, which met both the pathway and overall multiple-testing corrected thresholds that are equal to 9.8 × 10^-7 and 2.0 × 10^-7, respectively) and suggestive evidence for another pair involving correlated SNPs at the same loci (p_meta-int = 3.6 × 10^-6). This interaction has important biological relevance given the key role of TERF1 in telomere biology and the reported physical interaction between TERF1 and AFAP1L2 proteins. This finding brings a novel piece of evidence for the emerging role of telomere dysfunction in melanoma development.
PMCID: PMC4566921  PMID: 25892537
Genome-wide association studies; melanoma; pathway analysis; gene-gene interaction
18.  A Novel Genetic Variant in Long Non-coding RNA Gene NEXN-AS1 is Associated with Risk of Lung Cancer 
Scientific Reports  2016;6:34234.
Lung cancer etiology is multifactorial, and growing evidence has indicated that long non-coding RNAs (lncRNAs) are important players in lung carcinogenesis. We performed a large-scale meta-analysis of 690,564 SNPs in 15,531 autosomal lncRNAs by using datasets from six previously published genome-wide association studies (GWASs) from the Transdisciplinary Research in Cancer of the Lung (TRICL) consortium in populations of European ancestry. Previously unreported significant SNPs (P value < 1 × 10^-7) were further validated in two additional independent lung cancer GWAS datasets from Harvard University and deCODE. In the final meta-analysis of all eight GWAS datasets with 17,153 cases and 239,337 controls, a novel risk SNP rs114020893 in the lncRNA NEXN-AS1 region at 1p31.1 remained statistically significant (odds ratio = 1.17; 95% confidence interval = 1.11–1.24; P = 8.31 × 10^-9). In further in silico analysis, rs114020893 was predicted to change the secondary structure of the lncRNA. Our finding indicates that SNP rs114020893 of NEXN-AS1 at 1p31.1 may contribute to lung cancer susceptibility.
PMCID: PMC5054367  PMID: 27713484
19.  Genetic Risk Can Be Decreased: Quitting Smoking Decreases and Delays Lung Cancer for Smokers With High and Low CHRNA5 Risk Genotypes — A Meta-Analysis 
EBioMedicine  2016;11:219-226.
Recent meta-analyses show that individuals with high risk variants in CHRNA5 on chromosome 15q25 are likely to develop lung cancer earlier than those with low-risk genotypes. The same high-risk genetic variants also predict nicotine dependence and delayed smoking cessation. It is unclear whether smoking cessation confers the same benefits in terms of lung cancer risk reduction for those who possess CHRNA5 risk variants versus those who do not.
Meta-analyses examined the association between smoking cessation and lung cancer risk in 15 studies of individuals with European ancestry who possessed varying rs16969968 genotypes (N = 12,690 ever smokers, including 6988 cases of lung cancer and 5702 controls) in the International Lung Cancer Consortium.
Smoking cessation (former vs. current smokers) was associated with a lower likelihood of lung cancer (OR = 0.48, 95% CI = 0.30–0.75, p = 0.0015). Among lung cancer patients, smoking cessation was associated with a 7-year delay in median age of lung cancer diagnosis (HR = 0.68, 95% CI = 0.61–0.77, p = 4.9 × 10^-10). The CHRNA5 rs16969968 risk genotype (AA) was associated with increased risk and earlier diagnosis for lung cancer, but the beneficial effects of smoking cessation were very similar in those with and without the risk genotype.
We demonstrate that quitting smoking is highly beneficial in reducing lung cancer risks for smokers regardless of their CHRNA5 rs16969968 genetic risk status. Smokers with high-risk CHRNA5 genotypes, on average, can largely eliminate their elevated genetic risk for lung cancer by quitting smoking, cutting their risk of lung cancer in half and delaying its onset by 7 years for those who develop it. These results (1) underscore the potential value of smoking cessation for all smokers, (2) suggest that CHRNA5 rs16969968 genotype affects lung cancer diagnosis through its effects on smoking, and (3) have potential value for framing preventive interventions for those who smoke.
• CHRNA5 rs16969968 confers risk for earlier lung cancer diagnosis, but quitting produces benefit regardless of genotype.
• Smokers can cut their risk of lung cancer in half and delay its onset by 7 years among those diagnosed.
• Precision prevention allows clinicians to provide personalized health benefits of smoking cessation.
This is a report on whether smoking cessation confers the same benefits in terms of lung cancer risk reduction for those who possess CHRNA5 risk variants versus those who do not. We determined that quitting smoking is highly beneficial in reducing lung cancer risk levels for smokers regardless of their CHRNA5 rs16969968 genetic risk status. Although CHRNA5 rs16969968 increases risk for earlier lung cancer by 4 years, quitting produces essentially the same benefit for smokers with either high or low genetic risks. Smokers can cut their risk of lung cancer in half and delay its onset by 7 years among those diagnosed. These results are important for smokers to prevent cancer. On average, smokers at all genetic risk levels can largely eliminate their elevated risk for lung cancer by quitting smoking.
PMCID: PMC5049934  PMID: 27543155
Smoking cessation; Genetics; Meta-analysis; Lung cancer
20.  Molecular profiling of intrahepatic and extrahepatic cholangiocarcinoma using next generation sequencing 
Cholangiocarcinoma is a heterogeneous malignant process, which is further classified into intrahepatic cholangiocarcinoma (ICC) and extrahepatic cholangiocarcinoma (ECC). The poor prognosis of the disease is partly due to the lack of understanding of the disease mechanism. Multiple gene alterations identified by various molecular techniques have been described recently. As a result, multiple targeted therapies for ICC and ECC are being developed. In this study, we identified and compared somatic mutations in ICC and ECC patients using next generation sequencing (NGS) (Ampliseq Cancer Hotspot Panel v2 and Ion Torrent 318v2 chips). Eleven of 16 samples passed internal quality control established for NGS testing. ICC cases (n = 3) showed IDH1 (33.3%) and NRAS (33.3%) mutations. Meanwhile, TP53 (75%), KRAS (50%), and BRAF (12.5%) mutations were identified in ECC cases (n = 8). Our study confirmed the molecular heterogeneity of ICC and ECC using NGS. This information will be important for individual patients as targeted therapies for ICC and ECC become available in the future.
PMCID: PMC4591249  PMID: 26189129
Intrahepatic cholangiocarcinoma; Extrahepatic cholangiocarcinoma; Next generation sequencing; Somatic mutations; Targeted therapy
22.  Mutations of HNRNPA0 and WIF1 predispose members of a large family to multiple cancers 
Familial cancer  2015;14(2):297-306.
We studied a large family that presented a strong familial susceptibility to multiple early onset cancers including prostate, breast, colon, and several other uncommon cancers. Through targeted gene, linkage, and whole genome sequencing analyses, we show that the presence of a variant in the regulatory region of HNRNPA0 is associated with elevated cancer incidence in this family (hazard ratio = 7.20, p = 0.0004). Whole genome sequencing identified a second rare protein-changing mutation of WIF1 that interacted with the HNRNPA0 variant, resulting in extremely high risk for cancer in carriers of mutations in both genes (p = 1.98 × 10^-13). Analysis of downstream targets of the mutations in these two genes showed that the HNRNPA0 mutation affected expression patterns in the PI3 kinase and ERK/MAPK signaling pathways, while the WIF1 variant influenced expression of genes that play a role in NAD biosynthesis. This is a first report of variation in HNRNPA0 influencing common cancers or of a striking interaction between rare variants coexisting in an extended pedigree and jointly affecting cancer risk.
PMCID: PMC4589301  PMID: 25716654
Whole genome sequencing; Expression analysis; Linkage analysis; Complex disease; Colon cancer; Prostate cancer
23.  Clinical Genotyping of Non–Small Cell Lung Cancers Using Targeted Next-Generation Sequencing: Utility of Identifying Rare and Co-mutations in Oncogenic Driver Genes 
Neoplasia (New York, N.Y.)  2016;18(9):577-583.
Detection of somatic mutations in non–small cell lung cancers (NSCLCs), especially adenocarcinomas, is important for directing patient care when targeted therapy is available. Here, we present our experience with genotyping NSCLC using the Ion Torrent Personal Genome Machine (PGM) and the AmpliSeq Cancer Hotspot Panel v2. We tested 453 NSCLC samples from 407 individual patients using the 50 gene AmpliSeq Cancer Hotspot Panel v2 from May 2013 to July 2015. Using 10 ng of DNA, up to 11 samples were simultaneously sequenced on the Ion Torrent PGM (316 and 318 chips). We identified variants with the Ion Torrent Variant Caller Plugin, and Golden Helix's SVS software was used for annotation and prediction of the significance of the variants. Three hundred ninety-eight samples were successfully sequenced (12.1% failure rate). In all, 633 variants in 41 genes were detected with a median of 2 (range of 0 to 7) variants per sample. Mutations detected in BRAF, EGFR, ERBB2, KRAS, NRAS, and PIK3CA were considered potentially actionable and were identified in 237 samples, most commonly in KRAS (37.9%), EGFR (11.1%), BRAF (4.8%), and PIK3CA (4.3%). In our patient population, all mutations in EGFR, KRAS, and BRAF were mutually exclusive. The Ion Torrent Ampliseq technology can be utilized on small biopsy and cytology specimens, requires very little input DNA, and can be applied in clinical laboratories for genotyping of NSCLC. This targeted next-generation sequencing approach allows for detection of common and also rare mutations that are clinically actionable in multiple patients simultaneously.
PMCID: PMC5031899  PMID: 27659017
24.  C-Reactive Protein As a Marker of Melanoma Progression 
Journal of Clinical Oncology  2015;33(12):1389-1396.
To investigate the association between blood levels of C-reactive protein (CRP) in patients with melanoma and overall survival (OS), melanoma-specific survival (MSS), and disease-free survival.
Patients and Methods
Two independent sets of plasma samples from a total of 1,144 patients with melanoma (587 initial and 557 confirmatory) were available for CRP determination. Kaplan-Meier method and Cox regression were used to evaluate the relationship between CRP and clinical outcome. Among 115 patients who underwent sequential blood draws, we evaluated the relationship between change in disease status and change in CRP using nonparametric tests.
Elevated CRP level was associated with poorer OS and MSS in the initial, confirmatory, and combined data sets (combined data set: OS hazard ratio, 1.44 per unit increase of logarithmic CRP; 95% CI, 1.30 to 1.59; P < .001; MSS hazard ratio, 1.51 per unit increase of logarithmic CRP; 95% CI, 1.36 to 1.68; P < .001). These findings persisted after multivariable adjustment. As compared with CRP < 10 mg/L, CRP ≥ 10 mg/L conferred poorer OS in patients with any-stage, stage I/II, or stage III/IV disease and poorer disease-free survival in those with stage I/II disease. In patients who underwent sequential evaluation of CRP, an association was identified between an increase in CRP and melanoma disease progression.
CRP is an independent prognostic marker in patients with melanoma. CRP measurement should be considered for incorporation into prospective studies of outcome in patients with melanoma and clinical trials of systemic therapies for those with melanoma.
PMCID: PMC4397281  PMID: 25779565
25.  The causal relevance of body mass index in different histological types of lung cancer: A Mendelian randomization study 
Scientific Reports  2016;6:31121.
Body mass index (BMI) is inversely associated with lung cancer risk in observational studies, even though it increases the risk of several other cancers, which could indicate confounding by tobacco smoking or reverse causality. We used the two-sample Mendelian randomization (MR) approach to circumvent these limitations of observational epidemiology by constructing a genetic instrument for BMI, based on results from the GIANT consortium, which was evaluated in relation to lung cancer risk using GWAS results on 16,572 lung cancer cases and 21,480 controls. Results were stratified by histological subtype, smoking status and sex. An increase of one standard deviation (SD) in BMI (4.65 kg/m^2) raised the risk for lung cancer overall (OR = 1.13; P = 0.10). This was driven by associations with squamous cell (SQ) carcinoma (OR = 1.45; P = 1.2 × 10^-3) and small cell (SC) carcinoma (OR = 1.81; P = 0.01). An inverse trend was seen for adenocarcinoma (AD) (OR = 0.82; P = 0.06). In stratified analyses, a 1 SD increase in BMI was inversely associated with overall lung cancer in never smokers (OR = 0.50; P = 0.02). These results indicate that higher BMI may increase the risk of certain types of lung cancer, in particular SQ and SC carcinoma.
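To make the two-sample MR idea concrete, the sketch below shows one commonly used way to combine per-SNP summary statistics into a single causal estimate: the fixed-effect inverse-variance weighted (IVW) estimator. The abstract does not state which estimator the authors used, so this is an illustration of the general technique rather than their method. The function name `ivw_estimate` and all numeric inputs are hypothetical; they are not results from the GIANT consortium or the lung cancer GWAS.

```python
import numpy as np

def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance weighted MR estimate.

    beta_exposure: per-SNP effects on the exposure (e.g. SD units of BMI)
    beta_outcome:  per-SNP effects on the outcome (log odds ratios)
    se_outcome:    standard errors of beta_outcome
    Returns the causal log odds ratio per unit of exposure and its
    first-order standard error.
    """
    bx = np.asarray(beta_exposure, dtype=float)
    by = np.asarray(beta_outcome, dtype=float)
    se = np.asarray(se_outcome, dtype=float)
    weights = bx**2 / se**2          # inverse variance of each ratio by/bx
    ratios = by / bx                 # per-SNP Wald ratio estimates
    beta = np.sum(weights * ratios) / np.sum(weights)
    se_beta = np.sqrt(1.0 / np.sum(weights))
    return beta, se_beta

# Hypothetical summary statistics for four SNPs (illustration only).
bx = [0.05, 0.08, 0.03, 0.10]
by = [0.010, 0.012, 0.002, 0.025]
se = [0.010, 0.010, 0.010, 0.010]

beta, se_beta = ivw_estimate(bx, by, se)
odds_ratio = np.exp(beta)                      # OR per 1 SD of exposure
ci_low = np.exp(beta - 1.96 * se_beta)         # approximate 95% CI
ci_high = np.exp(beta + 1.96 * se_beta)
print(f"OR per SD = {odds_ratio:.2f} ({ci_low:.2f} to {ci_high:.2f})")
```

Exponentiating the pooled estimate gives an odds ratio per 1 SD increase in the exposure, which is the scale on which the abstract reports its results. The estimator assumes that every SNP is a valid instrument and that SNP-exposure effects are measured without error.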
PMCID: PMC4973233  PMID: 27487993

