1.  LPEseq: Local-Pooled-Error Test for RNA Sequencing Experiments with a Small Number of Replicates 
PLoS ONE  2016;11(8):e0159182.
RNA-Sequencing (RNA-Seq) provides valuable information for characterizing the molecular nature of the cells, in particular, identification of differentially expressed transcripts on a genome-wide scale. Unfortunately, cost and limited specimen availability often lead to studies with small sample sizes, and hypothesis testing on differential expression between classes with a small number of samples is generally limited. The problem is especially challenging when only one sample per each class exists. In this case, only a few methods among many that have been developed are applicable for identifying differentially expressed transcripts. Thus, the aim of this study was to develop a method able to accurately test differential expression with a limited number of samples, in particular non-replicated samples. We propose a local-pooled-error method for RNA-Seq data (LPEseq) to account for non-replicated samples in the analysis of differential expression. Our LPEseq method extends the existing LPE method, which was proposed for microarray data, to allow examination of non-replicated RNA-Seq experiments. We demonstrated the validity of the LPEseq method using both real and simulated datasets. By comparing the results obtained using the LPEseq method with those obtained from other methods, we found that the LPEseq method outperformed the others for non-replicated datasets, and showed a similar performance with replicated samples; LPEseq consistently showed high true discovery rate while not increasing the rate of false positives regardless of the number of samples. Our proposed LPEseq method can be effectively used to conduct differential expression analysis as a preliminary design step or for investigation of a rare specimen, for which a limited number of samples is available.
PMCID: PMC4988759  PMID: 27532300
3.  Validation of Prediction Models for Mismatch Repair Gene Mutations in Koreans 
Lynch syndrome, the commonest hereditary colorectal cancer syndrome, is caused by germline mutations in mismatch repair (MMR) genes. Three recently developed prediction models for MMR gene mutations based on family history and clinical features (MMRPredict, PREMM1,2,6, and MMRPro) have been validated only in Western countries. In this study, we propose validating these prediction models in the Korean population.
Materials and Methods
We collected MMR gene analysis data from 188 individuals in the Korean Hereditary Tumor Registry. The probability of gene mutation was calculated using three prediction models, and the overall diagnostic value of each model was compared using receiver operating characteristic (ROC) curves and the area under the ROC curve (AUC). Quantitative test characteristics were calculated at sensitivities of 90%, 95%, and 98%.
Of the individuals analyzed, 101 satisfied Amsterdam criteria II, and 87 were suspected hereditary nonpolyposis colorectal cancer. MMR mutations were identified in 62 of the 188 subjects (33.0%). All three prediction models showed a poor predictive value of AUC (MMRPredict, 0.683; PREMM1,2,6, 0.709; MMRPro, 0.590). Within the range of acceptable sensitivity (> 90%), PREMM1,2,6 demonstrated higher specificity than the other models.
In the Korean population, overall predictive values of the three models (MMRPredict, PREMM1,2,6, MMRPro) for MMR gene mutations are poor, compared with their performance in Western populations. A new prediction model is therefore required for the Korean population to detect MMR mutation carriers, reflecting ethnic differences in genotype-phenotype associations.
PMCID: PMC4843726  PMID: 26044159
Hereditary nonpolyposis colorectal neoplasms; Prediction model; Genetic testing; Mismatch repair gene
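The AUC used to compare the three models above can be illustrated with a minimal sketch. It assumes each model returns one predicted probability per individual and that mutation status is coded 1 (carrier) or 0 (non-carrier). The function name `auc` and the toy numbers are hypothetical and are not taken from the study.

```python
def auc(scores, labels):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen carrier scores higher than a randomly chosen non-carrier.
    Ties count as half a win (the Mann-Whitney formulation)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one carrier and one non-carrier")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): two carriers, two non-carriers.
toy_scores = [0.9, 0.4, 0.6, 0.2]
toy_labels = [1, 1, 0, 0]
toy_auc = auc(toy_scores, toy_labels)  # 3 of 4 carrier/non-carrier pairs ranked correctly
```

An AUC of 0.5 corresponds to ranking carriers no better than chance, which is why values such as 0.590 are read as poor discrimination.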
4.  Alcohol intake and cardiovascular risk factors: A Mendelian randomisation study 
Scientific Reports  2015;5:18422.
Mendelian randomisation studies from Asia suggest detrimental influences of alcohol on cardiovascular risk factors, but such associations are observed mainly in men. The absence of associations of genetic variants (e.g. rs671 in ALDH2) with such risk factors in women – who drank little in these populations – provides evidence that the observations are not due to genetic pleiotropy. Here, we present a Mendelian randomisation study in a South Korean population (3,365 men and 3,787 women) that 1) provides robust evidence that alcohol consumption adversely affects several cardiovascular disease risk factors, including blood pressure, waist to hip ratio, fasting blood glucose and triglyceride levels. Alcohol also increases HDL cholesterol and lowers LDL cholesterol. Our study also 2) replicates sex differences in associations which suggests pleiotropy does not underlie the associations, 3) provides further evidence that association is not due to pleiotropy by showing null effects in male non-drinkers, and 4) illustrates a way to measure population-level association where alcohol intake is stratified by sex. In conclusion, population-level instrumental variable estimation (utilizing interaction of rs671 in ALDH2 and sex as an instrument) strengthens causal inference regarding the largely adverse influence of alcohol intake on cardiovascular health in an Asian population.
PMCID: PMC4685310  PMID: 26687910
5.  A Significant Increase in the Incidence of Central Precocious Puberty among Korean Girls from 2004 to 2010 
PLoS ONE  2015;10(11):e0141844.
Few studies have explored the trends in central precocious puberty (CPP) in Asian populations. This study assessed the prevalence and annual incidence of CPP among Korean children.
Using data from the Korean Health Insurance Review Agency from 2004 to 2010, we reviewed the records of 21,351 children, including those registered with a diagnosis of CPP for the first time and those diagnosed with CPP who were treated with gonadotropin-releasing hormone analogs.
The prevalence of CPP was 55.9 per 100,000 girls and 1.7 per 100,000 boys, respectively. The overall incidence of CPP was 15.3 per 100,000 girls, and 0.6 per 100,000 boys. The annual incidence of CPP in girls significantly increased from 3.3 to 50.4 per 100,000 girls; whereas in boys, it gradually increased from 0.3 to 1.2 per 100,000 boys. The annual incidence of CPP in girls consistently increased at all ages year by year, with greater increases at older ages (≥6 years of age), and smaller increases in girls aged < 6 years. In contrast, the annual incidence remained relatively constant in boys aged < 8 years, while a small increase was observed only in boys aged 8 years. The increase of annual incidence showed significant differences depending on age and gender (P <0.0001).
The annual incidence of CPP has substantially increased among Korean girls over the past 7 years. Continued monitoring of CPP trends among Korean children will be informative.
PMCID: PMC4634943  PMID: 26539988
6.  Adjusting heterogeneous ascertainment bias for genetic association analysis with extended families 
BMC Medical Genetics  2015;16:62.
In family-based association analysis, each family is typically ascertained from a single proband, which renders the effects of ascertainment bias heterogeneous among family members. This is contrary to case–control studies, and may introduce sample or ascertainment bias. Statistical efficiency is affected by ascertainment bias, and careful adjustment can lead to substantial improvements in statistical power. However, genetic association analysis has often been conducted using family-based designs, without addressing the fact that each proband in a family has had a great influence on the probability for each family member to be affected.
We propose a powerful and efficient statistic for genetic association analysis that considers the heterogeneity of ascertainment bias among family members, under the assumption that both prevalence and heritability of disease are available. With extensive simulation studies, we showed that the proposed method performed better than the existing methods, particularly for diseases with large heritability.
We applied the proposed method to the genome-wide association analysis of Alzheimer’s disease. Four significant associations with the proposed method were found.
Our significant findings illustrated the practical importance of this new analysis method.
Electronic supplementary material
The online version of this article (doi:10.1186/s12881-015-0198-6) contains supplementary material, which is available to authorized users.
PMCID: PMC4593209  PMID: 26286599
Family-based association analysis; Ascertainment; Liability model
7.  Evaluation of Penalized and Nonpenalized Methods for Disease Prediction with Large-Scale Genetic Data 
BioMed Research International  2015;2015:605891.
Owing to recent improvements in genotyping technology, large-scale genetic data can be utilized to identify disease susceptibility loci, and these findings have substantially improved our understanding of complex diseases. However, in spite of these successes, most of the genetic effects for many complex diseases were found to be very small, which has been a major hurdle to building disease prediction models. Recently, many statistical methods based on penalized regressions have been proposed to tackle the so-called “large P and small N” problem. Penalized regressions, including the least absolute shrinkage and selection operator (LASSO) and ridge regression, limit the space of parameters, and this constraint enables the estimation of effects for a very large number of SNPs. Various extensions have been suggested, and, in this report, we compare their accuracy by applying them to several complex diseases. Our results show that penalized regressions are usually robust and provide better accuracy than the existing methods, at least for the diseases under consideration.
PMCID: PMC4539442  PMID: 26346893
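How the two penalties constrain the parameter space can be sketched in the simplest textbook setting: an orthonormal design, where each penalized coefficient is a closed-form function of its ordinary least-squares estimate. This is an illustrative assumption and not the setup used in the report. The function names and the penalty parameter `lam` are hypothetical.

```python
import math


def lasso_shrink(beta_ols, lam):
    """Soft-thresholding: the LASSO solution for one coefficient under an
    orthonormal design with objective 0.5*RSS + lam*|beta|.
    Small effects are set exactly to zero."""
    return math.copysign(max(abs(beta_ols) - lam, 0.0), beta_ols)


def ridge_shrink(beta_ols, lam):
    """Ridge solution for one coefficient under an orthonormal design with
    objective 0.5*RSS + 0.5*lam*beta**2.
    Every effect is scaled toward zero but none is removed."""
    return beta_ols / (1.0 + lam)


# Hypothetical per-SNP least-squares effects.
ols_effects = [2.0, -0.3, 0.05]
lasso_effects = [lasso_shrink(b, 0.5) for b in ols_effects]
ridge_effects = [ridge_shrink(b, 0.5) for b in ols_effects]
```

The contrast is the design point: LASSO performs variable selection by zeroing small effects, while ridge keeps every SNP in the model with a reduced effect.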
8.  On the Estimation of Heritability with Family-Based and Population-Based Samples 
BioMed Research International  2015;2015:671349.
For a family-based sample, the phenotypic variance-covariance matrix can be parameterized to include the variance of a polygenic effect that has then been estimated using a variance component analysis. However, with the advent of large-scale genomic data, the genetic relationship matrix (GRM) can be estimated and can be utilized to parameterize the variance of a polygenic effect for population-based samples. Therefore narrow sense heritability, which is both population and trait specific, can be estimated with both population- and family-based samples. In this study we estimate heritability from both family-based and population-based samples, collected in Korea, and the heritability estimates from the pooled samples were, for height, 0.60; body mass index (BMI), 0.32; log-transformed triglycerides (log TG), 0.24; total cholesterol (TCHL), 0.30; high-density lipoprotein (HDL), 0.38; low-density lipoprotein (LDL), 0.29; systolic blood pressure (SBP), 0.23; and diastolic blood pressure (DBP), 0.24. Furthermore, we found differences in how heritability is estimated—in particular the amount of variance attributable to common environment in twins can be substantial—which indicates heritability estimates should be interpreted with caution.
PMCID: PMC4538414  PMID: 26339629
9.  Family-based association analysis: a fast and efficient method of multivariate association analysis with multiple variants 
BMC Bioinformatics  2015;16(1):46.
Many disease phenotypes are outcomes of the complicated interplay between multiple genes, and multiple phenotypes are affected by a single or multiple genotypes. Therefore, joint analysis of multiple phenotypes and multiple markers has been considered as an efficient strategy for genome-wide association analysis, and in this work we propose an omnibus family-based association test for the joint analysis of multiple genotypes and multiple phenotypes.
The proposed test can be applied for both quantitative and dichotomous phenotypes, and it is robust under the presence of population substructure, as long as large-scale genomic data is available. Using simulated data, we showed that our method is statistically more efficient than the existing methods, and the practical relevance is illustrated by application of the approach to obesity-related phenotypes.
The proposed method may be more statistically efficient than the existing methods. The application was developed in C++ and is available at the following URL:
PMCID: PMC4339744  PMID: 25887481
Family-based association analysis; Multiple variants; Multiple phenotypes
10.  On the Analysis of a Repeated Measure Design in Genome-Wide Association Analysis 
Longitudinal data enable detection of the effect of aging/time, and a repeated measures design is statistically more efficient than a cross-sectional design if the correlations between repeated measurements are not large. In particular, when genotyping is more expensive than phenotyping, the collection of longitudinal data can be an efficient strategy for genetic association analysis. However, in spite of these advantages, genome-wide association studies (GWAS) with longitudinal data have rarely been analyzed taking this into account. In this report, we calculate the required sample size to achieve 80% power at the genome-wide significance level for both longitudinal and cross-sectional data, and compare their statistical efficiency. Furthermore, we analyzed the GWAS of eight phenotypes with three observations on each individual in the Korean Association Resource (KARE). A linear mixed model allowing for the correlations between observations for each individual was applied to analyze the longitudinal data, and linear regression was used to analyze the first observation on each individual as cross-sectional data. We found 12 novel genome-wide significant disease susceptibility loci that were then confirmed in the Health Examinee (HEXA) cohort, as well as some significant interactions between age/sex and SNPs.
PMCID: PMC4276614  PMID: 25464127
longitudinal data; cross-sectional data; Korean Association Resource (KARE) cohort; Health Examinee (HEXA) cohort
11.  Targeted exon sequencing fails to identify rare coding variants with large effect in rheumatoid arthritis 
Although it has been suggested that rare coding variants could explain the substantial missing heritability, very few sequencing studies have been performed in rheumatoid arthritis (RA). We aimed to identify novel functional variants with rare to low frequency using targeted exon sequencing of RA in Korea.
We analyzed targeted exon sequencing data of 398 genes selected from a multifaceted approach in Korean RA patients (n = 1,217) and controls (n = 717). We conducted a single-marker association test and a gene-based analysis of rare variants. For meta-analysis or enrichment tests, we also used ethnically matched independent samples of Korean genome-wide association studies (GWAS) (n = 4,799) or immunochip data (n = 4,722).
After stringent quality control, we analyzed 10,588 variants of 398 genes from 1,934 Korean RA cases and controls. We identified 13 nonsynonymous variants with nominal association in single-variant association tests. In a meta-analysis, we did not find any novel variant with genome-wide significance for RA risk. Using a gene-based approach, we identified 17 genes with nominal burden signals. Among them, VSTM1 showed the greatest association with RA (P = 7.80 × 10^−4). In the enrichment test using Korean GWAS, although the significant signal appeared to be driven by total genic variants, we found no evidence for enriched association of coding variants only with RA.
We were unable to identify rare coding variants with large effect to explain the missing heritability for RA in the current targeted resequencing study. Our study raises skepticism about exon sequencing of targeted genes for complex diseases like RA.
Electronic supplementary material
The online version of this article (doi:10.1186/s13075-014-0447-7) contains supplementary material, which is available to authorized users.
PMCID: PMC4203956  PMID: 25267259
12.  Relationship of Vitamin D Binding Protein Polymorphisms and Lung Function in Korean Chronic Obstructive Pulmonary Disease 
Yonsei Medical Journal  2014;55(5):1318-1325.
Multiple genetic factors are associated with chronic obstructive pulmonary disease (COPD). The association of the gene encoding vitamin D binding protein (VDBP, GC) with COPD has been controversial. We sought to investigate the types of GC variants in the Korean population and determine the association of GC variants with COPD and lung function in this population.
Materials and Methods
The study cohort consisted of 203 COPD patients and 157 control subjects. GC variants were genotyped by the restriction fragment-length polymorphism method. Repeated measures of lung function data were analyzed using a linear mixed model including sex, age, height, and pack-years of smoking to investigate the association of GC genetic factors and lung function.
GC1F variant was most frequently observed in COPD (46.1%) and controls (42.0%). GC1S variant (29.0% vs. 21.4%; p=0.020) and genotype 1S-1S (8.3% vs. 3.4%; p=0.047) were more commonly detected in control than COPD. According to linear mixed model analysis including controls and COPD, subjects with genotype 1S-1S had 0.427 L higher forced expiratory volume in 1 second (FEV1) than those with other genotypes (p=0.029). However, interaction between the genotype and smoking pack-year was found to be particularly significant among subjects with genotype 1S-1S; FEV1 decreased by 0.014 L per smoking pack-year (p=0.001).
This study suggested that GC polymorphism might be associated with lung function and risk of COPD in the Korean population. The GC1S variant and genotype 1S-1S were more frequently observed in controls than in COPD patients. Moreover, the GC1S variant was more common in non-decliners than in rapid decliners among COPD patients.
PMCID: PMC4108818  PMID: 25048491
Vitamin D binding protein; polymorphism; lung function; chronic obstructive pulmonary disease
13.  Identification of a genetic variant at 2q12.1 associated with blood pressure in East-Asians by genome-wide scan including gene-environment interactions 
BMC Medical Genetics  2014;15:65.
Genome-wide association studies have identified many genetic loci associated with blood pressure (BP). Genetic effects on BP can be altered by environmental exposures via multiple biological pathways. In particular, obesity is an important environmental risk factor that can have a considerable effect on BP, and it may interact with genetic factors. We therefore aimed to test whether genetic factors and obesity may jointly influence BP.
We performed meta-analyses of genome-wide association data for systolic blood pressure (SBP) and diastolic blood pressure (DBP) that included analyses of interaction between single nucleotide polymorphisms (SNPs) and the obesity-related anthropometric measures, body mass index (BMI), height, weight, and waist/hip ratio (WHR) in East-Asians (n = 12,030).
We found that rs13390641 on 2q12.1 was significantly associated with SBP when the interaction between SNPs and BMI was considered (P < 5 × 10^−8). The gene located nearest to rs13390641, TMEM182, encodes transmembrane protein 182. In stratified analyses, the effect of rs13390641 on BP was much stronger in obese individuals (BMI ≥ 30) than in non-obese individuals, and the effect of BMI on BP was strongest in individuals homozygous for the A allele of rs13390641.
Our analyses that included interactions between SNPs and environmental factors identified a genetic variant associated with BP that was overlooked in standard analyses in which only genetic factors were included. This result also revealed a potential mechanism that integrates genetic factors and obesity related traits in the development of high BP.
PMCID: PMC4059884  PMID: 24903457
Blood pressure; Genome-wide scan; Gene-environment interaction; Meta-analysis; Obesity
14.  On the Meta-Analysis of Genome-Wide Association Studies: A Robust and Efficient Approach to Combine Population and Family-Based Studies 
Human Heredity  2012;73(1):35-46.
For the meta-analysis of genome-wide association studies, we propose a new method to adjust for the population stratification and a linear mixed approach that combines family-based and unrelated samples. The proposed approach achieves similar power levels as a standard meta-analysis which combines the different test statistics or p values across studies. However, by virtue of its design, the proposed approach is robust against population admixture and stratification, and no adjustments for population admixture and stratification, even in unrelated samples, are required. Using simulation studies, we examine the power of the proposed method and compare it to standard approaches in the meta-analysis of genome-wide association studies. The practical features of the approach are illustrated with a meta-analysis of three genome-wide association studies for Alzheimer's disease. We identify three single nucleotide polymorphisms showing significant genome-wide association with affection status. Two single nucleotide polymorphisms are novel and will be verified in other populations in our follow-up study.
PMCID: PMC3322629  PMID: 22261799
Meta-analysis; Genome-wide study; Population stratification
15.  ‘Location, Location, Location’: a spatial approach for rare variant analysis and an application to a study on non-syndromic cleft lip with or without cleft palate 
Bioinformatics  2012;28(23):3027-3033.
Motivation: For the analysis of rare variants in sequence data, numerous approaches have been suggested. Fixed and flexible threshold approaches collapse the rare variant information of a genomic region into a test statistic with reduced dimensionality. Alternatively, the rare variant information can be combined in statistical frameworks that are based on suitable regression models, machine learning, etc. Although the existing approaches provide powerful tests that can incorporate information on allele frequencies and prior biological knowledge, differences in the spatial clustering of rare variants between cases and controls cannot be incorporated. Based on the assumption that deleterious variants and protective variants cluster or occur in different parts of the genomic region of interest, we propose a testing strategy for rare variants that builds on spatial cluster methodology and that guides the identification of the biologically relevant segments of the region. Our approach does not require any assumption about the directions of the genetic effects.
Results: In simulation studies, we assess the power of the clustering approach and compare it with existing methodology. Our simulation results suggest that the clustering approach for rare variants is well powered, even in situations that are ideal for standard methods. The efficiency of our spatial clustering approach is not affected by the presence of rare variants that have opposite effect size directions. An application to a sequencing study for non-syndromic cleft lip with or without cleft palate (NSCL/P) demonstrates its practical relevance. The proposed testing strategy is applied to a genomic region on chromosome 15q13.3 that was implicated in NSCL/P etiology in a previous genome-wide association study, and its results are compared with standard approaches.
Availability: Source code and documentation for the implementation in R will be provided online. Currently, the R-implementation only supports genotype data. We currently are working on an extension for VCF files.
PMCID: PMC3516147  PMID: 23044548
16.  Genome-Wide Association Analysis of Body Mass in Chronic Obstructive Pulmonary Disease 
Cachexia, whether assessed by body mass index (BMI) or fat-free mass index (FFMI), affects a significant proportion of patients with chronic obstructive pulmonary disease (COPD), and is an independent risk factor for increased mortality, increased emphysema, and more severe airflow obstruction. The variable development of cachexia among patients with COPD suggests a role for genetic susceptibility. The objective of the present study was to determine genetic susceptibility loci involved in the development of low BMI and FFMI in subjects with COPD. A genome-wide association study (GWAS) of BMI was conducted in three independent cohorts of European descent with Global Initiative for Chronic Obstructive Lung Disease stage II or higher COPD: Evaluation of COPD Longitudinally to Identify Predictive Surrogate End-Points (ECLIPSE; n = 1,734); Norway-Bergen cohort (n = 851); and a subset of subjects from the National Emphysema Treatment Trial (NETT; n = 365). A genome-wide association of FFMI was conducted in two of the cohorts (ECLIPSE and Norway). In the combined analyses, a significant association was found between rs8050136, located in the first intron of the fat mass and obesity–associated (FTO) gene, and BMI (P = 4.97 × 10^−7) and FFMI (P = 1.19 × 10^−7). We replicated the association in a fourth, independent cohort consisting of 502 subjects with COPD from COPDGene (P = 6 × 10^−3). Within the largest contributing cohort of our analysis, lung function, as assessed by forced expiratory volume at 1 second, varied significantly by FTO genotype. Our analysis suggests a potential role for the FTO locus in the determination of anthropometric measures associated with COPD.
PMCID: PMC3266061  PMID: 21037115
chronic obstructive pulmonary disease genetics; chronic obstructive pulmonary disease epidemiology; chronic obstructive pulmonary disease metabolism; genome-wide association study
17.  On the genome-wide analysis of copy number variants in family-based designs: Methods for combining family-based and population based information for testing dichotomous or quantitative traits, or completely ascertained samples 
Genetic Epidemiology  2010;34(6):582-590.
We propose a new approach for the analysis of copy number variants (CNVs) for genome-wide association studies in family-based designs. Our new overall association test combines the between-family component and the within-family component of the data so that the new test statistic is fully efficient and, at the same time, achieves the same complete robustness against population admixture and stratification as classical family-based association tests that are based only on the within-family component. Although all data are incorporated into the test statistic, an adjustment for genetic confounding is not needed, not even for the between-family component. The new test statistic is valid for testing either quantitative or dichotomous phenotypes. If external CNV data are available, the approach can also be used in completely ascertained samples. Similar to the approach by Ionita-Laza et al.(1), the proposed test statistic does not require a CNV-calling algorithm and is based directly on the CNV probe intensity data. We show, via simulation studies, that our methodology increases the power of the FBAT statistic to levels comparable to those of population-based designs. The advantages of the approach in practice are demonstrated by an application to a genome-wide association study for body mass index (BMI).
PMCID: PMC3349936  PMID: 20718041
18.  Maximizing the Power of Genome-Wide Association Studies: A Novel Class of Powerful Family-Based Association Tests 
Statistics in biosciences  2009;1(2):125-143.
For genome-wide association studies in family-based designs, a new, universally applicable approach is proposed. Using a modified Liptak’s method, we combine the p-value of the family-based association test (FBAT) statistic with the p-value for the Van Steen-statistic. The Van Steen-statistic is independent of the FBAT-statistic and utilizes information that is ignored by traditional FBAT-approaches. The new test statistic takes advantage of all available information about the genetic association, while, by virtue of its design, it achieves complete robustness against confounding due to population stratification. The approach is suitable for the analysis of almost any trait type for which FBATs are available, e.g. binary, continuous, time-to-onset, multivariate, etc. The efficiency and the validity of the new approach depend on the specification of a nuisance/tuning parameter and the weight parameters in the modified Liptak’s method. For different trait types and ascertainment conditions, we discuss general guidelines for the optimal specification of the tuning parameter and the weight parameters. Our simulation experiments and an application to an Alzheimer study show the validity and the efficiency of the new method, which achieves power levels that are comparable to those of population-based approaches.
PMCID: PMC3349940  PMID: 22582089
FBAT; Liptak’s method; Tuning parameter
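The idea of combining two independent p-values with weights can be sketched with the standard weighted Liptak (weighted Z-score) method. The abstract above does not spell out the authors' modification or their weight choices, so this is the unmodified textbook version, assuming one-sided, independent p-values strictly between 0 and 1. The function name `liptak_combine` and the example inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def liptak_combine(p_values, weights):
    """Standard weighted Liptak combination of independent one-sided p-values.
    Each p-value is mapped to a normal quantile, the quantiles are combined as
    a weighted sum, and the sum is rescaled to unit variance.
    Each p-value must lie strictly between 0 and 1."""
    nd = NormalDist()  # standard normal
    z_scores = [nd.inv_cdf(1.0 - p) for p in p_values]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = sqrt(sum(w * w for w in weights))
    return 1.0 - nd.cdf(numerator / denominator)


# Hypothetical inputs: one p-value per component test, equal weights.
combined_p = liptak_combine([0.05, 0.05], [1.0, 1.0])
```

With equal weights, two independent p-values of 0.05 combine to roughly 0.01, which shows how two moderate signals can reinforce each other.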
19.  Duplex-specific nuclease efficiently removes rRNA for prokaryotic RNA-seq 
Nucleic Acids Research  2011;39(20):e140.
Next-generation sequencing has great potential for application in bacterial transcriptomics. However, unlike eukaryotes, bacteria have no clear mechanism to select mRNAs over rRNAs; therefore, rRNA removal is a critical step in sequencing-based transcriptomics. Duplex-specific nuclease (DSN) is an enzyme that, at high temperatures, degrades duplex DNA in preference to single-stranded DNA. DSN treatment has been successfully used to normalize the relative transcript abundance in mRNA-enriched cDNA libraries from eukaryotic organisms. In this study, we demonstrate the utility of this method to remove rRNA from prokaryotic total RNA. We evaluated the efficacy of DSN to remove rRNA by comparing it with the conventional subtractive hybridization (Hyb) method. Illumina deep sequencing was performed to obtain transcriptomes from Escherichia coli grown under four growth conditions. The results clearly showed that our DSN treatment was more efficient at removing rRNA than the Hyb method was, while preserving the original relative abundance of mRNA species in bacterial cells. Therefore, we propose that, for bacterial mRNA-seq experiments, DSN treatment should be preferred to Hyb-based methods.
PMCID: PMC3203590  PMID: 21880599
20.  Using the Optimal Robust Receiver Operating Characteristic (ROC) Curve for Predictive Genetic Tests 
Biometrics  2009;66(2):586-593.
Current ongoing genome-wide association studies represent a powerful approach to uncover common unknown genetic variants causing common complex diseases. The discovery of these genetic variants offers an important opportunity for early disease prediction, prevention and individualized treatment. We describe here a method of combining multiple genetic variants for early disease prediction, based on the optimality theory of the likelihood ratio. Such theory simply shows that the receiver operating characteristic (ROC) curve based on the likelihood ratio (LR) has maximum performance at each cutoff point and that the area under the ROC curve (AUC) so obtained is highest among that of all approaches. Through simulations and a real data application, we compared it with the commonly used logistic regression and classification tree approaches. The three approaches show similar performance if we know the underlying disease model. However, for most common diseases we have little prior knowledge of the disease model and in this situation the new method has an advantage over logistic regression and classification tree approaches. We applied the new method to the Type 1 diabetes genome-wide association data from the Wellcome Trust Case Control Consortium. Based on five single nucleotide polymorphisms (SNPs), the test reaches medium level classification accuracy. With more genetic findings to be discovered in the future, we believe a predictive genetic test for Type 1 diabetes can be successfully constructed and eventually implemented for clinical use.
PMCID: PMC3039874  PMID: 19508241
Backward clustering; Classification tree; Cross validation; Logistic regression
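The optimality property described in the abstract above can be illustrated with a minimal sketch. It assumes a single marker with three genotypes and made-up genotype probabilities among cases and controls; the function names `roc_auc` and `lr_order` are hypothetical and are not taken from the paper. Ranking genotypes by their likelihood ratio P(g | case) / P(g | control) should give an AUC at least as large as any other ranking.

```python
from itertools import permutations

def roc_auc(order, p_case, p_ctrl):
    """AUC of the ROC curve obtained by calling genotypes 'positive' in the given order."""
    points = [(0.0, 0.0)]
    fpr = tpr = 0.0
    for g in order:
        fpr += p_ctrl[g]  # false-positive rate grows by the control probability
        tpr += p_case[g]  # true-positive rate grows by the case probability
        points.append((fpr, tpr))
    # Trapezoidal rule over consecutive ROC points.
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def lr_order(p_case, p_ctrl):
    """Genotypes sorted from highest to lowest likelihood ratio (assumes p_ctrl[g] > 0)."""
    return sorted(p_case, key=lambda g: p_case[g] / p_ctrl[g], reverse=True)

# Illustrative (made-up) genotype distributions for one SNP, coded 0/1/2.
p_case = {0: 0.2, 1: 0.5, 2: 0.3}
p_ctrl = {0: 0.4, 1: 0.4, 2: 0.2}

best = roc_auc(lr_order(p_case, p_ctrl), p_case, p_ctrl)
others = [roc_auc(o, p_case, p_ctrl) for o in permutations(p_case)]
print(round(best, 4), round(max(others), 4))
```

With several markers, the same idea would apply to multi-locus genotype combinations, at the cost of estimating many more probabilities from the data.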
21.  Single-Marker and Two-Marker Association Tests for Unphased Case-Control Genotype Data, with a Power Comparison 
Genetic epidemiology  2010;34(1):67-77.
In case-control Single Nucleotide Polymorphism (SNP) data, the Allele frequency, Hardy Weinberg Disequilibrium (HWD) and Linkage Disequilibrium (LD) contrast tests are three distinct sources of information about genetic association. While all three tests are typically developed in a retrospective context, we show that prospective logistic regression models may be developed that correspond conceptually to the retrospective tests. This approach provides a flexible framework for conducting a systematic series of association analyses using unphased genotype data and any number of covariates. For a single stage study, two single-marker tests and four two-marker tests are discussed. The true association models are derived and they allow us to understand why a model with only a linear term will generally fit well for a SNP in weak LD with a causal SNP, whatever the disease model, but not for a SNP in high LD with a non-additive disease SNP. We investigate the power of the association tests using real LD parameters from chromosome 11 in the HapMap CEU population data. Among the single-marker tests, the allelic test has on average the most power in the case of an additive disease; but, for dominant, recessive and heterozygote disadvantage diseases, the genotypic test has the most power. Among the six two-marker tests, the Allelic-LD contrast test, which incorporates linear terms for two markers and their interaction term, provides the most reliable power overall for the cases studied. Therefore, our result supports incorporating an interaction term as well as linear terms in multi-marker tests.
PMCID: PMC2796706  PMID: 19557751
Allele frequency contrast test; LD contrast test; HWD contrast test; Genome-wide Association
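As a rough illustration of the two single-marker tests compared in the abstract above, the sketch below uses the classic contingency-table (Pearson chi-square) forms of the allelic and genotypic tests rather than the prospective logistic regression formulation the authors develop. The genotype counts are made up, and the function names are hypothetical. The allelic test treats the two alleles of each person as independent, which is an assumption of this simplified version.

```python
import math

def pearson_chi2(table):
    """Pearson chi-square statistic for an r x c table of counts."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (obs - expected) ** 2 / expected
    return stat

def allelic_test(cases, controls):
    """1-df test on allele counts; inputs are (n_AA, n_Aa, n_aa) genotype counts."""
    def alleles(g):
        return [2 * g[0] + g[1], g[1] + 2 * g[2]]
    stat = pearson_chi2([alleles(cases), alleles(controls)])
    # Chi-square survival function with 1 degree of freedom.
    return stat, math.erfc(math.sqrt(stat / 2))

def genotypic_test(cases, controls):
    """2-df test on the 2 x 3 genotype table."""
    stat = pearson_chi2([list(cases), list(controls)])
    # Chi-square survival function with 2 degrees of freedom.
    return stat, math.exp(-stat / 2)

# Made-up genotype counts (AA, Aa, aa) for 100 cases and 100 controls.
cases, controls = (30, 50, 20), (50, 40, 10)
print(allelic_test(cases, controls))
print(genotypic_test(cases, controls))
```

In this additive-looking example the allelic test gives the smaller p-value; under dominant or recessive patterns the genotypic test would be expected to do better, in line with the abstract's conclusion.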
22.  Variants in FAM13A are associated with chronic obstructive pulmonary disease 
Nature genetics  2010;42(3):200-202.
Substantial evidence suggests that there is genetic susceptibility to chronic obstructive pulmonary disease (COPD). To identify common genetic risk variants, we performed a genome-wide association study in 2940 cases and 1380 smoking controls with normal lung function. We demonstrate a novel susceptibility locus at 4q22.1 in FAM13A (rs7671167, OR=0.76, P=8.6×10^−8) and provide evidence of replication in one case-control and two family-based cohorts (for all studies, combined P=1.2×10^−11).
PMCID: PMC2828499  PMID: 20173748
23.  Phase uncertainty in case-control association studies 
Genetic epidemiology  2009;33(6):463-478.
The possible evidence for association comprises three types of information: differences between cases and controls in allele frequencies, in parameters for Hardy Weinberg disequilibrium (HWD), and in parameters for linkage disequilibrium (LD). LD between marker and disease alleles results in a difference in at least one of the three types of parameters [Won and Elston, 2008]. However, the parameters for LD require knowledge about phase, which is usually unknown, making the LD contrast test without modification infeasible in practice. Methods for handling phase uncertainty are: (1) the most probable haplotype pair for each individual can be considered as the true phase; (2) a weighted average of haplotypes can be used; (3) we can consider the composite LD, which does not require any information about phase. We compare these methods to handle phase uncertainty in terms of validity and efficiency, and the effect on them of HWD in the population, at the same time confirming results for the three types of information. When the LD between markers is high, the LD contrast test that uses a weighted average of haplotypes or the most probable haplotypes to calculate the LD is recommended, but otherwise the LD contrast test that uses the composite LD is recommended. We conclude that, even though the difference in allele frequencies is usually the most informative test except in the case of a recessive disease, the LD contrast test can be more powerful if the markers are dense enough.
PMCID: PMC2838926  PMID: 19194981
linkage disequilibrium; haplotype phase; self replication
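The composite LD mentioned in the abstract above needs only unphased genotypes. A minimal sketch, assuming genotypes are coded as the number of copies (0, 1, 2) of a reference allele at each marker: the composite LD coefficient equals half the covariance between the two allele counts. The helper names and the toy data are hypothetical, and the contrast below is only the raw difference between cases and controls, not the test statistic used in the paper.

```python
def composite_ld(geno_a, geno_b):
    """Composite LD coefficient from unphased genotypes coded 0/1/2.

    Computed as half the (population) covariance between the allele counts
    at the two markers; no haplotype phase is required.
    """
    n = len(geno_a)
    mean_a = sum(geno_a) / n
    mean_b = sum(geno_b) / n
    cov = sum(a * b for a, b in zip(geno_a, geno_b)) / n - mean_a * mean_b
    return cov / 2

def ld_contrast(cases, controls):
    """Difference in composite LD between cases and controls.

    Each argument is a pair (genotypes at marker A, genotypes at marker B).
    """
    return composite_ld(*cases) - composite_ld(*controls)

# Toy data: markers perfectly correlated in cases, independent in controls.
cases = ([0, 1, 2, 2], [0, 1, 2, 2])
controls = ([0, 0, 2, 2], [0, 2, 0, 2])
print(composite_ld(*cases), composite_ld(*controls), ld_contrast(cases, controls))
```

A formal LD contrast test would also need the sampling variance of this difference, which is omitted here.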
24.  The effect of multiple genetic variants in predicting the risk of type 2 diabetes 
BMC Proceedings  2009;3(Suppl 7):S49.
While recently performed genome-wide association studies have advanced the identification of genetic variants predisposing to type 2 diabetes (T2D), the potential application of these novel findings for disease prediction and prevention has not been well studied. Diabetes prediction and prevention have become urgent issues owing to the rapidly increasing prevalence of diabetes and its associated mortality, morbidity, and health care cost. New prediction approaches using genetic markers could facilitate early identification of high risk sub-groups of the population so that appropriate prevention methods could be effectively applied to delay, or even prevent, disease onset.
This paper assessed 18 recently identified T2D loci for their potential role in diabetes prediction. We built a new predictive genetic test for T2D using the Framingham Heart Study dataset. Using logistic regression and 15 additional loci, the new test was slightly improved over the existing test using just three loci. A formal comparison between the two tests suggests no significant improvement. We further formed a predictive genetic test for identifying early onset T2D and found higher classification accuracy for this test, not only indicating that these 18 loci have great potential for predicting early onset T2D, but also suggesting that they may play important roles in causing early-onset T2D.
To further improve the test's accuracy, we applied a newly developed nonparametric method capable of capturing high order interactions to the data, but it did not outperform a logistic regression that only considers single-locus effects. This could be explained by the absence of gene-gene interactions among the 18 loci.
PMCID: PMC2795948  PMID: 20018041
25.  On the Analysis of Genome-Wide Association Studies in Family-Based Designs: A Universal, Robust Analysis Approach and an Application to Four Genome-Wide Association Studies 
PLoS Genetics  2009;5(11):e1000741.
For genome-wide association studies in family-based designs, we propose a new, universally applicable approach. The new test statistic exploits all available information about the association, while, by virtue of its design, it maintains the same robustness against population admixture as traditional family-based approaches that are based exclusively on the within-family information. The approach is suitable for the analysis of almost any trait type, e.g. binary, continuous, time-to-onset, multivariate, etc., and combinations of those. We use simulation studies to verify all theoretically derived properties of the approach, estimate its power, and compare it with other standard approaches. We illustrate the practical implications of the new analysis method by an application to a lung-function phenotype, forced expiratory volume in one second (FEV1) in 4 genome-wide association studies.
Author Summary
In genome-wide association studies, the multiple testing problem and confounding due to population stratification have been intractable issues. Family-based designs have considered only the transmission of genotypes from founder to nonfounder to prevent sensitivity to population stratification, which leads to a loss of information. Here we propose a novel analysis approach that combines mutually independent FBAT and screening statistics in a robust way. The proposed method is more powerful than any other, while it preserves the complete robustness of family-based association tests, which on their own achieve a much lower power level. Furthermore, the proposed method is virtually as powerful as population-based approaches/designs, even in the absence of population stratification. By nature of the proposed method, it is always robust as long as FBAT is valid, and it achieves the optimal efficiency if our linear model for the screening test reasonably explains the observed data in terms of covariance structure and population admixture. We illustrate the practical relevance of the approach by an application in 4 genome-wide association studies.
PMCID: PMC2777973  PMID: 19956679
