Affymetrix Axiom single nucleotide polymorphism (SNP) arrays provide a cost-effective, high-density, and high-throughput genotyping solution for population-optimized analyses. However, no public software is available for the integrated genomic analysis of hybridization intensities and genotypes for this new-generation population-optimized genotyping platform.
A set of statistical methods was developed for an integrated analysis of allele frequency (AF), allelic imbalance (AI), loss of heterozygosity (LOH), long contiguous stretch of homozygosity (LCSH), and copy number variation or alteration (CNV/CNA) on the basis of SNP probe hybridization intensities and genotypes. This study analyzed 3,236 samples that were genotyped using different SNP platforms. The proposed AF adjustment method considerably increased the accuracy of AF estimation. The proposed quick circular binary segmentation algorithm for segmenting copy number reduced the computation time of the original segmentation method by 30–67%. The proposed CNV/CNA detection, which integrates AI and LOH/LCSH detection, had a promising true positive rate and well-controlled false positive rate in simulation studies. Moreover, our real-time quantitative polymerase chain reaction experiments successfully validated the CNVs/CNAs that were identified in the Axiom data analyses using the proposed methods; some of the validated CNVs/CNAs were not detected in the Affymetrix Array 6.0 data analysis using the Affymetrix Genotyping Console. All the analysis functions are packaged into the ALICE (AF/LOH/LCSH/AI/CNV/CNA Enterprise) software.
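As a rough illustration of the segmentation step, the sketch below computes the statistic that standard circular binary segmentation maximizes: the standardized difference between the mean inside an arc of probes and the mean outside it. It is a brute-force version for clarity only. It is not the quick algorithm proposed in the study, it omits the permutation test that decides whether a change is significant, and the function name `best_circular_segment` is hypothetical.

```python
import math

def best_circular_segment(values):
    """Return (statistic, start, end) for the half-open index range
    [start, end) whose mean differs most from the mean of the rest."""
    n = len(values)
    prefix = [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)  # prefix sums give O(1) segment sums
    total = prefix[-1]
    best = (0.0, 0, 0)
    for i in range(n):
        for j in range(i + 1, n + 1):
            k = j - i
            if k == n:
                continue  # the whole sequence has no complement
            s_in = prefix[j] - prefix[i]
            diff = s_in / k - (total - s_in) / (n - k)
            stat = abs(diff) / math.sqrt(1.0 / k + 1.0 / (n - k))
            if stat > best[0]:
                best = (stat, i, j)
    return best

# Toy log-ratio profile: a gain spanning indices 5..8 on a flat background.
profile = [0.0] * 5 + [1.0] * 4 + [0.0] * 6
stat, start, end = best_circular_segment(profile)
print(start, end)  # the detected range is [5, 9)
```

In a full implementation the statistic would be scaled by an estimate of the noise standard deviation and the search would recurse on each resulting segment.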
ALICE and the genomic reference databases it uses, which can be downloaded from http://hcyang.stat.sinica.edu.tw/software/ALICE.html, are useful resources for analyzing genomic data from the Axiom and other SNP arrays.
Electronic supplementary material
The online version of this article (doi:10.1186/s12864-016-2478-8) contains supplementary material, which is available to authorized users.
Microarray; Single-nucleotide polymorphism (SNP); Fluorescence intensity; Allele frequency (AF); Allelic imbalance (AI); Loss of heterozygosity (LOH); Long contiguous stretch of homozygosity (LCSH); Copy number variation or alteration (CNV/CNA); Circular binary segmentation (CBS); AF/LOH/LCSH/AI/CNV/CNA Enterprise (ALICE)
Homozygosity disequilibrium (HD) describes a nonrandom pattern of sizable runs of homozygosity (ROH) that deviates from a random distribution of homozygotes and heterozygotes in the genome. In this study, we developed a double-weight local polynomial model for estimating homozygosity intensity. This new estimation method makes it possible to consider the local property and genetic information of homozygosity in the human genome when detecting regions of HD. By using this new method, we estimated whole-genome homozygosity intensities by analyzing real whole-genome sequencing data of 959 related individuals from 20 large pedigrees provided by Genetic Analysis Workshop 19 (GAW19). Through the analysis, we derived the distribution of HD in the human genome and provided evidence for the genetic component of natural variation in HD. Generalized estimating equation analysis for 855 related individuals was performed to identify regions of HD associated with diastolic blood pressure (DBP), systolic blood pressure, and hypertension (HTN), with concomitant adjustment for age and sex. We identified one DBP-associated and 2 HTN-associated regions of HD. We also studied the gene regulation of HD by analyzing the real whole-genome transcription data of 647 individuals. A set of gene expressions regulated by the DBP- and HTN-associated regions of HD was identified. Finally, we conducted simulation studies to evaluate the performance of our homozygosity association test. The results showed that the association test had a high power and that type 1 error was controlled. The methods have been integrated into our developed Loss-of-Heterozygosity Analysis Suite software, which can be downloaded at http://www.stat.sinica.edu.tw/hsinchou/genetics/loh/LOHAS.htm.
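To make the idea of a locally smoothed homozygosity intensity concrete, the following minimal sketch fits a kernel-weighted straight line around a target position and returns the fitted value there. It uses a single tricube distance weight and a degree-1 polynomial. The second weight of the double-weight model is not specified in this abstract and is therefore not reproduced; the function name, the 0/1 coding of genotypes, and the bandwidth are assumptions made for illustration.

```python
def homozygosity_intensity(positions, homozygous, x0, bandwidth):
    """Local linear estimate at x0 from 0/1 homozygosity indicators.

    positions  : marker coordinates (e.g. base pairs)
    homozygous : 1 if the genotype at that marker is homozygous, else 0
    """
    sw = swx = swy = swxx = swxy = 0.0
    for xi, yi in zip(positions, homozygous):
        u = abs(xi - x0) / bandwidth
        if u >= 1.0:
            continue  # marker lies outside the smoothing window
        w = (1.0 - u ** 3) ** 3  # tricube kernel weight
        dx = xi - x0
        sw += w
        swx += w * dx
        swy += w * yi
        swxx += w * dx * dx
        swxy += w * dx * yi
    if sw == 0.0:
        raise ValueError("no markers within the bandwidth")
    det = sw * swxx - swx * swx
    if abs(det) < 1e-12:
        return swy / sw  # fall back to a weighted mean
    # Intercept of the weighted least-squares line, i.e. the fit at x0.
    return (swxx * swy - swx * swxy) / det

positions = [100, 250, 400, 520, 700, 880, 1000]
calls = [1, 1, 1, 1, 0, 1, 0]
print(homozygosity_intensity(positions, calls, x0=400, bandwidth=500))
```

Values near 1 indicate a neighborhood dominated by homozygous calls; scanning `x0` along a chromosome yields an intensity profile.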
Heroin addiction is a complex psychiatric disorder with a chronic course and a high relapse rate, which results from the interaction between genetic and environmental factors. Heroin addiction has a substantial heritability in its etiology; hence, identification of individuals with a high genetic propensity to heroin addiction may help prevent the occurrence and relapse of heroin addiction and its complications. The study aimed to identify a small set of genetic signatures that may reliably predict the individuals with a high genetic propensity to heroin addiction. We first measured the transcript level of 13 genes (RASA1, PRKCB, PDK1, JUN, CEBPG, CD74, CEBPB, AUTS2, ENO2, IMPDH2, HAT1, MBD1, and RGS3) in lymphoblastoid cell lines in a sample of 124 male heroin addicts and 124 male control subjects using real-time quantitative PCR. Seven genes (PRKCB, PDK1, JUN, CEBPG, CEBPB, ENO2, and HAT1) showed significant differential expression between the 2 groups. Further analysis using 3 statistical methods (logistic regression analysis, support vector machine learning analysis, and the BIASLESS software) revealed that a set of 4 genes (JUN, CEBPB, PRKCB, ENO2, or CEBPG) could predict the diagnosis of heroin addiction with an accuracy of around 85% in our dataset. Our findings support the idea that it is possible to identify genetic signatures of heroin addiction using a small set of expressed genes. However, the study can only be considered a proof-of-concept study. As establishing lymphoblastoid cell lines is a laborious and lengthy process, it would be more practical in clinical settings to identify genetic signatures for heroin addiction directly from peripheral blood cells in future studies.
biomarker; diagnosis; genetic signatures; heroin addiction
Methadone maintenance treatment (MMT) is commonly used for controlling opioid dependence, preventing withdrawal symptoms, and improving the quality of life of heroin-dependent patients. A steady-state plasma concentration of methadone enantiomers, a measure of methadone metabolism, is an index of treatment response and efficacy of MMT. Although the methadone metabolism pathway has been partially revealed, no genome-wide pharmacogenomic study has been performed to identify genetic determinants and characterize genetic mechanisms for the plasma concentrations of methadone R- and S-enantiomers. This study was the first genome-wide pharmacogenomic study to identify genes associated with the plasma concentrations of methadone R- and S-enantiomers and their respective metabolites in a methadone maintenance cohort. After data quality control was ensured, a dataset of 344 heroin-dependent patients in the Han Chinese population of Taiwan who underwent MMT was analyzed. Genome-wide single-locus and haplotype-based association tests were performed to analyze four quantitative traits: the plasma concentrations of methadone R- and S-enantiomers and their respective metabolites. A significant single nucleotide polymorphism (SNP), rs17180299 (raw p = 2.24 × 10⁻⁸), was identified, accounting for 9.541% of the variation in the plasma concentration of the methadone R-enantiomer. In addition, 17 haplotypes were identified on SPON1, GSG1L, and CYP450 genes associated with the plasma concentration of methadone S-enantiomer. These haplotypes accounted for approximately one-fourth of the variation of the overall S-methadone plasma concentration. The associations of the S-methadone plasma concentration with CYP2B6, SPON1, and GSG1L were replicated in an independent study. A gene expression experiment revealed that CYP2B6, SPON1, and GSG1L can be activated concomitantly through a constitutive androstane receptor (CAR) activation pathway.
In conclusion, this study revealed new genes associated with the plasma concentration of methadone, providing insight into the genetic foundation of methadone metabolism. The results can be applied to predict treatment responses and methadone-related deaths for individualized MMTs.
Methadone maintenance treatment (MMT), among the most effective therapies for heroin-dependent patients, reduces craving and withdrawal symptoms, increases treatment compliance, and improves the quality of life of patients. The plasma concentration of methadone is a primary index for quantifying and determining therapy responses to MMT. This study was the first whole-genome pharmacogenomic study on MMT to locate genomic regions associated with the plasma concentration of methadone. The analysis identified a single nucleotide polymorphism (SNP) marker (rs17180299) and 17 haplotypes on the SPON1, GSG1L, and CYP450 genes, including CYP2B6, that were significantly associated with the plasma concentrations of methadone enantiomers. The identified genetic variations accounted for approximately 10% and 25% of the variations in plasma concentrations of methadone R- and S-enantiomers, respectively. The identified genetic variations afford insight into the genetic mechanism of methadone metabolism in MMT and have the potential to pave the way towards individualized MMTs for heroin-dependent patients.
In the analysis of current genomic data, application of machine learning and data mining techniques has become more attractive given the rising complexity of the projects. As part of the Genetic Analysis Workshop 19, approaches from this domain were explored, mostly motivated from two starting points. First, assuming an underlying structure in the genomic data, data mining might identify this and thus improve downstream association analyses. Second, computational methods for machine learning need to be developed further to efficiently deal with the current wealth of data.
In the course of discussing results and experiences from the machine learning and data mining approaches, six common messages were extracted. These depict the current state of these approaches in the application to complex genomic data. Although some challenges remain for future studies, important forward steps were taken in the integration of different data types and the evaluation of the evidence. Mining the data for underlying genetic or phenotypic structure and using this information in subsequent analyses proved to be extremely helpful and is likely to become of even greater use with more complex data sets.
Homozygosity disequilibrium (HD), a nonrandom sizable run of homozygosity in the genome, may be related to the evolution of populations and may also confer susceptibility to disease. No studies have investigated HD using whole genome sequencing (WGS) analysis. In this study, we used an enhanced version of Loss-Of-Heterozygosity Analysis Suite (LOHAS) software to investigate HD through analysis of real and simulated WGS data sets provided by Genetic Analysis Workshop 18. Using a local polynomial model, we derived whole-genome profiles of homozygosity intensities for 959 individuals and characterized the patterns of HD. Generalized estimating equation analysis for 855 related samples was performed to examine the association between patterns of HD and 3 phenotypes of interest, namely diastolic blood pressure, systolic blood pressure, and hypertension status, with covariate adjustments for age and gender. We found that 4.48% of individuals in this study carried sizable runs of homozygosity (ROHs). Distributions of the length of ROHs were derived and revealed a familial aggregation of HD. Genome-wide homozygosity association analysis identified 5 and 3 ROHs associated with diastolic blood pressure and hypertension, respectively. These regions contain genes associated with calcium channels (CACNA1S), renin catalysis (REN), blood groups (ABO), apolipoprotein (APOA5), and cardiovascular diseases (RASGRP1). Simulation studies showed that our homozygosity association tests controlled type 1 error well and had a promising power. This study provides a useful analysis tool for studying HD and allows us to gain a deeper understanding of HD in the human genome.
Pulse pressure (PP) is a risk factor for cardiovascular disease. It has been reported that ambulatory blood pressure (BP) and nighttime BP parameters are heritable traits. However, the genetic association of pulse pressure and its clinical impact remain undetermined.
Method and Results
We conducted a genome-wide association study of PP using ambulatory BP monitoring in young-onset hypertensive patients and found a significant association between nighttime PP and SNP rs897876 (p = 0.009) at chromosome 2p14, which contains the predicted gene FLJ16124. Young-onset hypertension patients carrying TT genotypes at rs897876 had higher nighttime PP than those with CT and CC genotypes (TT, 41.6±7.3 mm Hg; CT, 39.1±6.0 mm Hg; CC, 38.9±6.3 mm Hg; p<0.05). The T risk allele resulted in a cumulative increase in nighttime PP (β = 1.036 mm Hg, SE = 0.298, p<0.001 per T allele). An independent community-based cohort containing 3325 Taiwanese individuals (mean age, 50.2 years) was studied to investigate the genetic impact of rs897876 polymorphisms in determining future cardiovascular events. After an average 7.79±0.28 years of follow-up, the TT genotype of rs897876 was independently associated with an increased risk (in a recessive model) of coronary artery disease (HR, 2.20; 95% CI, 1.20–4.03; p = 0.01) and total cardiovascular events (HR, 1.99; 95% CI, 1.29–3.06; p = 0.002), suggesting that the TT genotype of rs897876C, which is associated with nighttime pulse pressure in young-onset hypertension patients, could be a genetic prognostic factor of cardiovascular events in the general cohort.
The TT genotype of rs897876C at 2p14, identified in young-onset hypertensive patients, was associated with higher nighttime PP and could be a genetic prognostic factor of cardiovascular events in the general cohort in Taiwan.
Gene-based analysis has become popular in genomic research because of its appealing biological and statistical properties compared with those of a single-locus analysis. However, only a few, if any, studies have discussed a mapping of expression quantitative trait loci (eQTL) in a gene-based framework. No study has discussed ancestry-informative eQTL or investigated their roles in pharmacogenetics by integrating single nucleotide polymorphism (SNP)-based eQTL (s-eQTL) and gene-based eQTL (g-eQTL).
In this g-eQTL mapping study, the transcript expression levels of genes (transcript-level genes; T-genes) were correlated with the SNPs of genes (sequence-level genes; S-genes) by using a method of gene-based partial least squares (PLS). Ancestry-informative transcripts were identified using a rank-score-based multivariate association test, and ancestry-informative eQTL were identified using Fisher’s exact test. Furthermore, key ancestry-predictive eQTL were selected in a flexible discriminant analysis. We analyzed SNPs and gene expression of 210 independent people of African-, Asian- and European-descent. We identified numerous cis- and trans-acting g-eQTL and s-eQTL for each population by using PLS. We observed ancestry information enriched in eQTL. Furthermore, we identified 2 ancestry-informative eQTL associated with adverse drug reactions and/or drug response. Rs1045642, located on MDR1, is an ancestry-informative eQTL (P = 2.13E-13, using Fisher’s exact test) associated with adverse drug reactions to amitriptyline and nortriptyline and drug responses to morphine. Rs20455, located in KIF6, is an ancestry-informative eQTL (P = 2.76E-23, using Fisher’s exact test) associated with the response to statin drugs (e.g., pravastatin and atorvastatin). The ancestry-informative eQTL of drug biotransformation genes were also observed; cross-population cis-acting expression regulators included SPG7, TAP2, SLC7A7, and CYP4F2. Finally, we also identified key ancestry-predictive eQTL and established classification models with promising training and testing accuracies in separating samples from close populations.
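The ancestry-informative eQTL above were identified with Fisher's exact test. As a minimal sketch of how such a test works, the code below computes a two-sided p-value for a 2×2 table (for example, allele counts in two populations) by summing the hypergeometric probabilities of all tables that are no more likely than the observed one. The study may well have used larger contingency tables covering three or more populations; the 2×2 restriction, the function name, and the toy counts are assumptions for illustration.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # A small relative tolerance keeps ties with the observed table.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))

# Toy counts: rows are two populations, columns are the two alleles.
print(fisher_exact_2x2(3, 1, 1, 3))  # approximately 0.486
```

Because the p-value is an exact sum rather than a large-sample approximation, the test remains valid when some cell counts are small.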
In summary, we developed a gene-based PLS procedure and a SAS macro for identifying g-eQTL and s-eQTL. We established data archives of eQTL for global populations. The program and data archives are accessible at http://www.stat.sinica.edu.tw/hsinchou/genetics/eQTL/HapMapII.htm. Finally, the results from our investigations regarding the interrelationship between eQTL, ancestry information, and pharmacodynamics provide rich resources for future eQTL studies and practical applications in population genetics and medical genetics.
Gene-based approach; Expression quantitative trait locus (eQTL); Partial least squares (PLS); Ancestry-informative marker (AIM); Pharmacogenetics; Adverse drug reaction; Drug response; Drug biotransformation
Although the genetic basis of androgenic alopecia has been clearly established, little is known about its non-genetic causes, such as environmental and lifestyle factors.
This study investigated blood and urine heavy metal concentrations, environmental exposure factors, personal behaviors, dietary intakes and the genotypes of related susceptibility genes in patients with androgenic alopecia (AGA).
Age, AGA level, residence area, work hours, sleep patterns, cigarette usage, alcohol consumption, betel nut usage, hair treatments, eating habits, body heavy metal concentrations and rs1998076, rs913063, rs1160312 and rs201571 SNP genotype data were collected from 354 men. Logistic regression analysis was performed to examine whether any of the factors displayed odds ratios (ORs) indicating association with moderate to severe AGA (≥IV). Subsequently, Hosmer-Lemeshow, Nagelkerke R² and accuracy tests were conducted to help establish an optimal model.
Moderate to severe AGA was associated with the AA genotype of rs1160312 (OR 22.50, 95% CI 3.99–126.83), blood vanadium concentration (OR 0.02, 95% CI 0.01–0.04), and regular consumption of soy bean drinks (OR 0.23, 95% CI 0.06–0.85), after adjustment for age. The results were corroborated by the Hosmer-Lemeshow test (P = 0.73), Nagelkerke R² (0.59), accuracy test (0.816) and area under the curve (AUC; 0.90, 0.847–0.951) analysis.
Blood vanadium and frequent soy bean drink consumption may provide protective effects against AGA. Accordingly, blood vanadium concentrations, the AA genotype of rs1160312 and frequent consumption of soy bean drinks are associated with AGA.
Schizophrenia is a highly heritable disease with a polygenic mode of inheritance. Many studies have contributed to our understanding of the genetic underpinnings of schizophrenia, but little is known about how interactions among genes affect the risk of schizophrenia. This study aimed to assess the associations and interactions among genes that confer vulnerability to schizophrenia and to examine the moderating effect of neuropsychological impairment.
We analyzed 99 SNPs from 10 candidate genes in samples from 1,512 subjects. Permutation-based single-locus and multi-locus association tests and a gene-based multifactorial dimension reduction procedure were used to examine genetic associations and interactions in schizophrenia.
We found that no single SNP was significantly associated with schizophrenia. However, a risk haplotype, namely A-T-C of the SNP triplet rsDAO7-rsDAO8-rsDAO13 of the DAO gene, was strongly associated with schizophrenia. Interaction analyses identified multiple between-gene and within-gene interactions. Between-gene interactions including DAO*DISC1, DAO*NRG1 and DAO*RASD2 and a within-gene interaction for CACNG2 were found among schizophrenia subjects with severe sustained attention deficits, suggesting a modifying effect of impaired neuropsychological functioning. Other interactions such as the within-gene interaction of DAO and the between-gene interaction of DAO and PTK2B were consistently identified regardless of stratification by neuropsychological dysfunction. Importantly, except for the within-gene interaction of CACNG2, all of the identified risk haplotypes and interactions involved SNPs from DAO.
These results suggest that DAO, which is involved in N-methyl-d-aspartate receptor regulation, signaling and glutamate metabolism, is the master gene of the genetic associations and interactions underlying schizophrenia. In addition, the interaction between DAO and RASD2 provides insight into integrating the glutamate and dopamine hypotheses of schizophrenia.
Angiotensin-converting enzyme (ACE) has been implicated in multiple biological systems, particularly in cardiovascular diseases. However, findings associating ACE insertion/deletion polymorphism with hypertension or other related traits are inconsistent. Therefore, in a two-stage approach, we aimed to fine-map ACE in order to narrow down the function-specific locations. We genotyped 31 single nucleotide polymorphisms (SNPs) of ACE from 1168 individuals from 305 young-onset (age ≤40) hypertension pedigrees, and found four linkage disequilibrium (LD) blocks. A tag-SNP, rs1800764 on LD block 2, upstream of and near the ACE promoter, was significantly associated with young-onset hypertension (p = 0.04). Tag-SNPs on all LD blocks were significantly associated with ACE activity (p-value: 10⁻¹⁶ to <10⁻³³). The two regions most associated with ACE activity were found between exon 13 and intron 18 and between intron 20 and the 3′UTR, as revealed by measured haplotype analysis. These two major QTLs of ACE activity and the moderate-effect variant upstream of the ACE promoter for young-onset hypertension were replicated by another independent association study with 842 subjects.
The plasma adiponectin level, a potential upstream and internal facet of metabolic and cardiovascular diseases, has a reasonably high heritability. Whether other novel genes influence the variation in adiponectin level, and what roles these genetic variants play in subsequent clinical outcomes, have not been thoroughly investigated. Therefore, we aimed not only to identify genetic variants modulating plasma adiponectin levels but also to investigate whether these variants are associated with adiponectin-related metabolic traits and cardiovascular diseases.
RESEARCH DESIGN AND METHODS
We conducted a genome-wide association study (GWAS) to identify quantitative trait loci (QTL) associated with high molecular weight forms of adiponectin levels by genotyping 382 young-onset hypertensive (YOH) subjects with Illumina HumanHap550 SNP chips. The culpable single nucleotide polymorphism (SNP) variants responsible for lowered adiponectin were then confirmed in another 559 YOH subjects, and the association of these SNP variants with the risk of metabolic syndrome (MS), type 2 diabetes mellitus (T2DM), and ischemic stroke was examined in an independent community–based prospective cohort, the CardioVascular Disease risk FACtors Two-township Study (CVDFACTS, n = 3,350).
The SNP (rs4783244) most significantly associated with adiponectin levels was located in intron 1 of the T-cadherin (CDH13) gene in the first stage (P = 7.57 × 10⁻⁹). We replicated and confirmed the association between rs4783244 and plasma adiponectin levels in an additional 559 YOH subjects (P = 5.70 × 10⁻¹⁷). This SNP was further associated with the risk of MS (odds ratio [OR] = 1.42, P = 0.027), T2DM in men (OR = 3.25, P = 0.026), and ischemic stroke (OR = 2.13, P = 0.002) in the CVDFACTS.
These findings indicated the role of T-cadherin in modulating adiponectin levels and the involvement of CDH13 or adiponectin in the development of cardiometabolic diseases.
Ancestry informative markers (AIMs) are a type of genetic marker that is informative for tracing the ancestral ethnicity of individuals. Application of AIMs has gained substantial attention in population genetics, forensic sciences, and medical genetics. Single nucleotide polymorphisms (SNPs), the materials of AIMs, are useful for classifying individuals from distinct continental origins but cannot discriminate individuals with subtle genetic differences from closely related ancestral lineages. Proof-of-principle studies have shown that gene expression (GE) also is a heritable human variation that exhibits differential intensity distributions among ethnic groups. GE supplies ethnic information supplemental to SNPs; this motivated us to integrate SNP and GE markers to construct AIM panels with a reduced number of required markers and provide high accuracy in ancestry inference. Few studies in the literature have considered GE in this aspect, and none have integrated SNP and GE markers to aid classification of samples from closely related ethnic populations.
We integrated a forward variable selection procedure into flexible discriminant analysis to identify key SNP and/or GE markers with the highest cross-validation prediction accuracy. By analyzing genome-wide SNP and/or GE markers in 210 independent samples from four ethnic groups in the HapMap II Project, we found that average testing accuracies for a majority of classification analyses were quite high, except for SNP-only analyses that were performed to discern study samples containing individuals from two close Asian populations. The average testing accuracies ranged from 0.53 to 0.79 for SNP-only analyses and increased to around 0.90 when GE markers were integrated together with SNP markers for the classification of samples from closely related Asian populations. Compared to GE-only analyses, integrative analyses of SNP and GE markers showed comparable testing accuracies and a reduced number of selected markers in AIM panels.
Integrative analysis of SNP and GE markers provides high-accuracy and/or cost-effective classification results for assigning samples from closely related or distantly related ancestral lineages to their original ancestral populations. User-friendly BIASLESS (Biomarkers Identification and Samples Subdivision) software was developed as an efficient tool for selecting key SNP and/or GE markers and then building models for sample subdivision. BIASLESS was programmed in R and R-GUI and is available online at http://www.stat.sinica.edu.tw/hsinchou/genetics/prediction/BIASLESS.htm.
Single nucleotide polymorphism (SNP); Allele frequency; Gene expression; HapMap; Classification analysis; Ancestry informative marker (AIM)
Rheumatoid arthritis (RA) is a chronic inflammatory disorder with a polygenic mode of inheritance. This study examined the hypothesis that runs of homozygosity (ROHs) play a recessive-acting role in the underlying RA genetic mechanism and identified RA-associated ROHs. This first genome-wide homozygosity association study for RA characterized the ROH patterns associated with RA in the genomes of 2,000 RA patients and 3,000 normal controls of the Wellcome Trust Case Control Consortium. Genome scans consistently pinpointed two regions within the human major histocompatibility complex region containing RA-associated ROHs. The first region is from 32,451,664 bp to 32,846,093 bp (−log10(p)>22.6591). RA-susceptibility genes, such as HLA-DRB1, are contained in this region. The second region ranges from 32,933,485 bp to 33,585,118 bp (−log10(p)>8.3644) and contains other genes, namely HLA-DPA1 and HLA-DPB1. These two regions are physically close but are located in different blocks of linkage disequilibrium, and ∼40% of the RA patients' genomes carry these ROHs in the two regions. By analyzing homozygote intensities, an ROH that is anchored by the single nucleotide polymorphism rs2027852 and flanked by HLA-DRB6 and HLA-DRB1 was found to be associated with increased risk for RA. The presence of this risk ROH predicts RA disease status with 62% accuracy. An independent genomic dataset from 868 RA patients and 1,194 control subjects of the North American Rheumatoid Arthritis Consortium successfully validated the results obtained using the Wellcome Trust Case Control Consortium data. In conclusion, this genome-wide homozygosity association study provides an alternative to allelic association mapping for the identification of recessive variants responsible for RA. The identified RA-associated ROHs uncover recessive components and missing heritability associated with RA and other autoimmune diseases.
Hypertension is a complex disorder with high prevalence rates all over the world. We conducted the first genome-wide gene-based association scan for hypertension in a Han Chinese population. By analyzing genome-wide single-nucleotide-polymorphism data of 400 matched pairs of young-onset hypertensive patients and normotensive controls genotyped with the Illumina HumanHap550-Duo BeadChip, 100 susceptibility genes for hypertension were identified and also validated with permutation tests. Seventeen of the 100 genes exhibited differential allelic and expression distributions between patient and control groups. These genes provided a good molecular signature for classifying hypertensive patients and normotensive controls. Among the 17 genes, IGF1, SLC4A4, WWOX, and SFMBT1 were not only identified by our gene-based association scan and gene expression analysis but were also replicated by a gene-based association analysis of the Hong Kong Hypertension Study. Moreover, cis-acting expression quantitative trait loci associated with the differentially expressed genes were found and linked to hypertension. IGF1, which encodes insulin-like growth factor 1, is associated with cardiovascular disorders, metabolic syndrome, decreased body weight/size, and changes of insulin levels in mice. SLC4A4, which encodes the electrogenic sodium bicarbonate cotransporter 1, is associated with decreased body weight/size and abnormal ion homeostasis in mice. WWOX, which encodes the WW domain-containing protein, is related to hypoglycemia and hyperphosphatemia. SFMBT1, which encodes the scm-like with four MBT domains protein 1, is a novel hypertension gene. GRB14, TMEM56 and KIAA1797 exhibited highly significant differential allelic and expression distributions between hypertensive patients and normotensive controls. GRB14 was also found relevant to blood pressure in a previous genetic association study in East Asian populations.
TMEM56 and KIAA1797 may be specific to Taiwanese populations, because they were not validated by the two replication studies. Identification of these genes enriches the collection of hypertension susceptibility genes, thereby shedding light on the etiology of hypertension in Han Chinese populations.
Quantitative trait locus (QTL) mapping using deep DNA sequencing data is a challenging task. In this study we performed region-based and pathway-based QTL mappings using a p-value combination method to analyze the simulated quantitative traits Q1 and Q4 and the exome sequencing data. The aims were to evaluate the performance of the QTL mapping approaches that were used and to suggest plausible strategies for QTL mapping of DNA sequencing data. We conducted single-locus QTL mappings using a linear regression model with adjustments for age and smoking status, and we also conducted region-based and pathway-based QTL mappings using a truncated product method for combining p-values from the single-locus QTL mapping. To account for the features of rare variants and common single-nucleotide polymorphisms (SNPs), we considered independently rare-variant-only, common-SNP-only, and combined analyses. An analysis of 200 simulated replications showed that the three region-based methods reasonably controlled type I error, whereas the combined analysis yielded the greatest statistical power. Rare-variant-only, common-SNP-only, and combined analyses were also applied to pathway-based QTL mappings. We found that pathway-based QTL mappings had a power of approximately 100% when the significance of the vascular endothelial growth factor pathway was evaluated, but type I errors were slightly inflated. Our approach complements single-locus QTL mapping. An integrated approach using single-locus, combined region-based, and combined pathway-based analyses should yield promising results for QTL mapping of DNA sequencing data.
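The truncated product method combines only those single-locus p-values that fall at or below a truncation point. The sketch below illustrates the idea: it forms the log of the truncated product and estimates its significance by Monte Carlo simulation of uniform null p-values. The truncation point of 0.05, the simulation size, the function names, and above all the assumption that the p-values are independent are illustrative choices that are not taken from the study; variants within a region are usually correlated, which calls for a permutation-based null distribution instead.

```python
import math
import random

def log_truncated_product(pvalues, tau=0.05):
    """Sum of log p over p-values at or below tau (0.0 if none qualify)."""
    return sum(math.log(p) for p in pvalues if p <= tau)

def tpm_pvalue(pvalues, tau=0.05, n_sim=5000, seed=1):
    """Monte Carlo p-value of the truncated product under independence."""
    rng = random.Random(seed)
    observed = log_truncated_product(pvalues, tau)
    k = len(pvalues)
    hits = 0
    for _ in range(n_sim):
        # 1 - random() lies in (0, 1], which avoids log(0).
        null = [1.0 - rng.random() for _ in range(k)]
        if log_truncated_product(null, tau) <= observed:
            hits += 1
    return (hits + 1) / (n_sim + 1)

region = [1e-6, 0.5, 0.8]  # toy single-locus p-values for one region
print(tpm_pvalue(region))
```

The same function applies to a pathway by passing the p-values of all variants mapped to that pathway.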
Genome-wide single-nucleotide polymorphism (SNP) arrays containing hundreds of thousands of SNPs from the human genome have proven useful for studying important human genome questions. Data quality of SNP arrays plays a key role in the accuracy and precision of downstream data analyses. However, good indices for assessing data quality of SNP arrays have not yet been developed.
We developed new quality indices to measure the quality of SNP arrays and/or DNA samples and investigated their statistical properties. The indices quantify a departure of estimated individual-level allele frequencies (AFs) from expected frequencies via standardized distances. The proposed quality indices followed lognormal distributions in several large genomic studies that we empirically evaluated. AF reference data and quality index reference data for different SNP array platforms were established based on samples from various reference populations. Furthermore, a confidence interval method based on the underlying empirical distributions of quality indices was developed to identify poor-quality SNP arrays and/or DNA samples. Analyses of authentic biological data and simulated data show that this new method is sensitive and specific for the detection of poor-quality SNP arrays and/or DNA samples.
This study introduces new quality indices, establishes references for AFs and quality indices, and develops a detection method for poor-quality SNP arrays and/or DNA samples. We have developed a new computer program that utilizes these methods called SNP Array Quality Control (SAQC). SAQC software is written in R and R-GUI and was developed as a user-friendly tool for the visualization and evaluation of data quality of genome-wide SNP arrays. The program is available online (http://www.stat.sinica.edu.tw/hsinchou/genetics/quality/SAQC.htm).
Allele frequency is one of the most important population indices and has been broadly applied to genetic/genomic studies. Estimation of allele frequency using genotypes is convenient but may lose data information and be sensitive to genotyping errors.
This study utilizes a unified intensity-measuring approach to estimating individual-level allele frequencies for 1,104 and 1,270 samples genotyped with the single-nucleotide-polymorphism arrays of the Affymetrix Human Mapping 100K and 500K Sets, respectively. Allele frequencies of all samples are estimated and adjusted by coefficients of preferential amplification/hybridization (CPA), and large ethnicity-specific and cross-ethnicity databases of CPA and allele frequency are established. The results show that using the CPA significantly improves the accuracy of allele frequency estimates; moreover, this paramount factor is insensitive to the time of data acquisition, effect of laboratory site, type of gene chip, and phenotypic status. Based on accurate allele frequency estimates, analytic methods based on individual-level allele frequencies are developed and successfully applied to discover genomic patterns of allele frequencies, detect chromosomal abnormalities, classify sample groups, identify outlier samples, and estimate the purity of tumor samples. The methods are packaged into a new analysis tool, ALOHA (Allele-frequency/Loss-of-heterozygosity/Allele-imbalance).
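A minimal sketch of how a CPA adjustment can correct an intensity-based allele frequency estimate is given below. The formula used, p = I_A / (I_A + k * I_B) with k estimated as the mean A/B intensity ratio among known heterozygotes, is one commonly used form and is an assumption here, not necessarily the exact estimator implemented in ALOHA; the function names and the intensity values are hypothetical.

```python
def estimate_cpa(het_intensities):
    """Estimate the CPA k as the mean A/B intensity ratio among heterozygotes.

    Assumption: a true heterozygote carries equal amounts of both alleles,
    so any systematic departure of A/B from 1 reflects preferential
    amplification/hybridization.
    """
    ratios = [a / b for a, b in het_intensities]
    return sum(ratios) / len(ratios)

def adjusted_allele_frequency(int_a, int_b, cpa=1.0):
    """Individual-level frequency of allele A, corrected by the CPA.

    With cpa=1.0 this reduces to the unadjusted estimate I_A / (I_A + I_B).
    """
    return int_a / (int_a + cpa * int_b)

# Hypothetical SNP where allele A hybridizes twice as strongly as allele B
k = estimate_cpa([(2.0, 1.0), (4.0, 2.0)])
print(adjusted_allele_frequency(200.0, 100.0))         # unadjusted estimate
print(adjusted_allele_frequency(200.0, 100.0, cpa=k))  # CPA-adjusted estimate
```

In the hypothetical example the unadjusted estimate for a heterozygous sample is pulled toward allele A, while the adjusted estimate returns to the expected one-half.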
This is the first time that these important genetic/genomic applications have been simultaneously conducted by the analyses of individual-level allele frequencies estimated by a unified intensity-measuring approach. We expect that additional practical applications for allele frequency analysis will be found. The developed databases and tools provide useful resources for human genome analysis via high-throughput single-nucleotide-polymorphism arrays. The ALOHA software was written in R and R GUI and can be downloaded at http://www.stat.sinica.edu.tw/hsinchou/genetics/aloha/ALOHA.htm.
The HLA region is considered to be the main genetic risk factor for rheumatoid arthritis. Previous research demonstrated that HLA-DRB1 alleles encoding the shared epitope are specific for disease that is characterized by antibodies to cyclic citrullinated peptides (anti-CCP). In the present study, we incorporated the shared epitope and either anti-CCP antibodies or rheumatoid factor into linkage disequilibrium mapping to assess the association of the shared epitope or antibodies with the disease gene identified. Incorporating the covariates into the association mapping provides a mechanism 1) to evaluate gene-gene and gene-environment interactions and 2) to dissect the pathways underlying disease induction/progress in quantitative antibodies.
Genome-wide association studies, which analyze hundreds of thousands of single-nucleotide polymorphisms to identify disease susceptibility genes, are challenging because the work involves intensive computation and complex modeling. We propose a two-stage genome-wide association scanning procedure, consisting of a single-locus association scan for the first stage and a gene-based association scan for the second stage. Marginal effects of single-nucleotide polymorphisms are examined by using the exact Armitage trend test or logistic regression, and gene effects are examined by using a p-value combination method. Compared with some existing single-locus and multilocus methods, the proposed method has the following merits: 1) convenient for definition of biologically meaningful regions, 2) powerful for detection of minor-effect genes, 3) helpful for alleviation of a multiple-testing problem, and 4) convenient for result interpretation. The method was applied to study Genetic Analysis Workshop 16 Problem 1 rheumatoid arthritis data, and strong association signals were found. The results show that the human major histocompatibility complex region is the most important genomic region associated with rheumatoid arthritis. Moreover, previously reported genes including PTPN22, C5, and IL2RB were confirmed; novel genes including HLA-DRA, BTNL2, C6orf10, NOTCH4, TAP2, and TNXB were identified by our analysis.
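The first-stage single-locus scan can be sketched with a Cochran-Armitage trend test on a 2 x 3 table of genotype counts. Note that this sketch uses the large-sample (asymptotic normal) version rather than the exact test named above, and that the additive weights (0, 1, 2), the function name, and the genotype counts are assumptions made for illustration.

```python
import math

def armitage_trend_test(cases, controls, weights=(0, 1, 2)):
    """Asymptotic Cochran-Armitage trend test.

    cases, controls: genotype counts ordered as (AA, AB, BB).
    Returns (z, two-sided p-value). This is the normal approximation,
    not the exact test.
    """
    R, S = sum(cases), sum(controls)
    if R == 0 or S == 0:
        return 0.0, 1.0
    N = R + S
    totals = [r + s for r, s in zip(cases, controls)]
    # Trend statistic T = sum_i t_i * (r_i * S - s_i * R)
    t_stat = sum(w * (r * S - s * R) for w, r, s in zip(weights, cases, controls))
    # Variance of T under the null hypothesis of no association
    var = sum(w * w * n * (N - n) for w, n in zip(weights, totals))
    var -= 2 * sum(
        weights[i] * weights[j] * totals[i] * totals[j]
        for i in range(len(totals))
        for j in range(i + 1, len(totals))
    )
    var *= R * S / N
    if var <= 0:
        return 0.0, 1.0
    z = t_stat / math.sqrt(var)
    return z, math.erfc(abs(z) / math.sqrt(2))

# Hypothetical genotype counts for one SNP
z, p = armitage_trend_test(cases=(10, 20, 30), controls=(30, 20, 10))
print(z, p)
```

The single-locus p-values produced in this way would then be passed to a p-value combination step to obtain gene-level evidence in the second stage.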
Young-onset hypertension has a stronger genetic component than its late-onset counterpart; thus, the identification of genes related to its susceptibility is a critical issue for the prevention and management of this disease. We carried out a two-stage association scan to map young-onset hypertension susceptibility genes. The first-stage analysis, a genome-wide association study, analyzed 175 matched case-control pairs; the second-stage analysis, a confirmatory association study, verified the results at the first stage based on a total of 1,008 patients and 1,008 controls. Single-locus association tests, multilocus association tests and pair-wise gene-gene interaction tests were performed to identify young-onset hypertension susceptibility genes. After considering stringent adjustments of multiple testing, gene annotation and single-nucleotide polymorphism (SNP) quality, four SNPs from two SNP triplets with strong association signals (−log10(p)>7) and 13 SNPs from 8 interactive SNP pairs with strong interactive signals (−log10(p)>8) were carefully re-examined. The confirmatory study verified the association for a SNP quartet 219 kb and 495 kb downstream of LOC344371 (a hypothetical gene) and RASGRP3 on chromosome 2p22.3, respectively. The latter has been implicated in the abnormal vascular responsiveness to endothelin-1 and angiotensin II in diabetic-hypertensive rats. Intrinsic synergy involving IMPG1 on chromosome 6q14.2-q15 was also verified. IMPG1 encodes interphotoreceptor matrix proteoglycan 1, which has cation binding capacity. The genes are novel hypertension targets identified in this first genome-wide hypertension association study of the Han Chinese population.
Association testing is a powerful tool for identifying disease susceptibility genes underlying complex diseases. Technological advances have yielded a dramatic increase in the density of available genetic markers, necessitating an increase in the number of association tests required for the analysis of disease susceptibility genes. As such, multiple-test corrections have become a critical issue. However, the conventional statistical corrections on locus-specific multiple tests usually result in lower power as the number of markers increases. As an alternative, we propose here the application of the longest significant run (LSR) method to estimate a region-specific p-value to provide an index for the most likely candidate region.
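The idea of a longest significant run can be sketched as follows: scan the ordered marker p-values, find the longest stretch of consecutive markers that pass a significance threshold, and judge how unusual that stretch is. The threshold `alpha`, the function names, and the Monte Carlo null of independent uniform p-values are assumptions made for illustration; the region-specific p-value in the proposed method may be derived differently.

```python
import random

def longest_significant_run(pvalues, alpha=0.05):
    """Length of the longest run of consecutive p-values <= alpha."""
    best = current = 0
    for p in pvalues:
        current = current + 1 if p <= alpha else 0
        best = max(best, current)
    return best

def lsr_pvalue(pvalues, alpha=0.05, n_sim=2000, seed=0):
    """Monte Carlo region-level p-value for the observed longest run.

    Assumption: under the null, marker p-values are independent and
    uniform; correlation between neighboring markers is ignored here.
    """
    rng = random.Random(seed)
    observed = longest_significant_run(pvalues, alpha)
    n = len(pvalues)
    hits = 0
    for _ in range(n_sim):
        null = [rng.random() for _ in range(n)]
        if longest_significant_run(null, alpha) >= observed:
            hits += 1
    return (hits + 1) / (n_sim + 1)

# Hypothetical ordered marker p-values with a cluster of four signals
markers = [0.6, 0.3, 0.01, 0.02, 0.004, 0.03, 0.5, 0.9] + [0.7] * 12
print(longest_significant_run(markers), lsr_pvalue(markers))
```

Because only the ordered p-values enter the calculation, the same sketch applies regardless of the study design that produced them.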
An advantage of the LSR method relative to procedures based on genotypic data is that only p-value data are needed, so the method can be applied across a wide range of study designs. In this study the proposed LSR method was compared with commonly used methods such as Bonferroni's method and the FDR-controlling method. We found that while all methods provide good control over the false positive rate, LSR has much better power and a lower false discovery rate. In analyses of real psoriasis and asthma data, the LSR method successfully identified important candidate regions and replicated the results of previous association studies.
The proposed LSR method provides an efficient exploratory tool for the analysis of sequences of dense genetic markers. Our results show that the LSR method has better power and a lower false discovery rate than locus-specific multiple tests.
Microarray-based pooled DNA experiments that combine the merits of DNA pooling and gene chip technology constitute a pivotal advance in biotechnology. This new technique uses pooled DNA, thereby reducing costs associated with the typing of DNA from numerous individuals. Moreover, use of an oligonucleotide gene chip reduces costs related to processing various DNA segments (e.g., primers, reagents). Thus, the technique provides an overall cost-effective solution for large-scale genomic/genetic research. However, few publicly shared tools are available to systematically analyze the rapidly accumulating volume of whole-genome pooled DNA data.
We propose a generalized concept of pooled DNA and present a user-friendly tool named Microarray Pooled DNA Analyzer (MPDA) that we developed to analyze hybridization intensity data from microarray-based pooled DNA experiments. MPDA enables whole-genome DNA preferential amplification/hybridization analysis, allele frequency estimation, association mapping, allelic imbalance detection, and permits integration with shared data resources online. Graphic and numerical outputs from MPDA support global and detailed inspection of large amounts of genomic data. Four whole-genome data analyses are used to illustrate the major functionalities of MPDA. The first analysis shows that MPDA can characterize genomic patterns of preferential amplification/hybridization and provide calibration information for pooled DNA data analysis. The second analysis demonstrates that MPDA can accurately estimate allele frequencies. The third analysis indicates that MPDA is cost-effective and reliable for association mapping. The final analysis shows that MPDA can identify regions of chromosomal aberration in cancer without paired-normal tissue.
MPDA, the software that integrates pooled DNA association analysis and allelic imbalance analysis, provides a convenient analysis system for extensive whole-genome pooled DNA data analysis. The software, user manual and illustrated examples are freely available online at the MPDA website listed in the Availability and requirements section.
Association mapping using abundant single nucleotide polymorphisms is a powerful tool for identifying disease susceptibility genes for complex traits and exploring possible genetic diversity. Genotyping large numbers of SNPs individually is performed routinely but is cost prohibitive for large-scale genetic studies. DNA pooling is a reliable and cost-saving alternative genotyping method. However, no software has been developed for complete pooled-DNA analyses, including data standardization, allele frequency estimation, and single/multipoint DNA pooling association tests. This motivated the development of the software, 'PDA' (Pooled DNA Analyzer), to analyze pooled DNA data.
We developed the software PDA for the analysis of pooled-DNA data. PDA was originally implemented in the MATLAB® language, but it can also be executed on a Windows system without installing MATLAB®. PDA provides estimates of the coefficient of preferential amplification and allele frequency. PDA considers an extended single-point association test, which can compare allele frequencies between two DNA pools constructed under different experimental conditions. Moreover, PDA also provides novel chromosome-wide multipoint association tests based on p-value combinations and a sliding-window concept. This new multipoint testing procedure overcomes a computational bottleneck of conventional haplotype-oriented multipoint methods in DNA pooling analyses and can handle data sets having a large pool size and/or large numbers of polymorphic markers. All of the PDA functions are illustrated in four bona fide examples.
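A sliding-window multipoint test based on p-value combination can be sketched as below. Fisher's combination method is swapped in here purely for illustration, since the specific combination rule used by PDA is not stated above; the window size, the function names, and the example p-values are likewise assumptions, and the sketch treats the single-point p-values within a window as independent.

```python
import math

def fisher_combined_pvalue(pvalues):
    """Fisher's method: X = -2 * sum(ln p) follows chi-square with 2k df.

    For even degrees of freedom the chi-square survival function has the
    closed form exp(-x/2) * sum_{j<k} (x/2)^j / j!, so no external
    library is needed. Input p-values must be strictly positive.
    """
    k = len(pvalues)
    half_x = -sum(math.log(p) for p in pvalues)
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half_x / j
        total += term
    return math.exp(-half_x) * total

def sliding_window_scan(pvalues, window=3):
    """Combined p-value for every window of consecutive markers."""
    return [
        fisher_combined_pvalue(pvalues[i:i + window])
        for i in range(len(pvalues) - window + 1)
    ]

# Hypothetical single-point p-values along a chromosome
single_point = [0.9, 0.8, 0.01, 0.02, 0.03, 0.7, 0.6]
scan = sliding_window_scan(single_point, window=3)
print(scan.index(min(scan)), min(scan))  # start index of the strongest window
```

Working on p-values rather than inferred haplotypes keeps the cost of the scan linear in the number of markers, which is consistent with the computational advantage described for the multipoint procedure.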
PDA is simple to operate and does not require that users have a strong statistical background. The software is available at .
A thorough genetic mapping study was performed to identify predisposing genes for alcohol dependence (AD) using the Collaborative Study on the Genetics of Alcoholism (COGA) data. The procedure comprised whole-genome linkage and confirmation analyses, single-locus and haplotype fine-mapping analyses, and gene × environment haplotype regression. Stratified analysis was used to reduce ethnic heterogeneity, and family-based and case-control study designs were applied simultaneously to detect potential genetic signals. By using different methods and markers, we found high linkage signals at D1S225 (253.7 cM), D1S547 (279.2 cM), D2S1356 (64.6 cM), and D7S2846 (56.8 cM) with nonparametric linkage scores of 3.92, 4.10, 4.44, and 3.55, respectively. We also conducted haplotype and odds ratio analyses, where the response was the dichotomous status of alcohol dependence, the explanatory variables were the inferred individual haplotypes, and the three statistically significant covariates were age, gender, and max drink (the maximum number of drinks consumed in a 24-hr period). The final model identified important AD-related haplotypes within a candidate region of NRXN1 at 2p21 and a few others in the inter-gene regions. The relative magnitude of the risks associated with the identified risk and protective haplotypes was elucidated.