The formation of phenotypic traits, such as biomass production, tumor volume and viral abundance, undergoes a complex process in which interactions between genes and developmental stimuli take place at each level of biological organization from cells to organisms. Traditional studies emphasize the impact of genes by directly linking DNA-based markers with static phenotypic values. Functional mapping, derived to detect genes that control developmental processes using growth equations, has proven powerful for addressing questions about the roles of genes in development. By treating phenotypic formation as a cohesive system using differential equations, a different approach, systems mapping, dissects the system into interconnected elements and then maps genes that determine a web of interactions among these elements, facilitating our understanding of the genetic machinery for phenotypic development. Here, we argue that genetic mapping can play a more important role in studying the genotype–phenotype relationship by filling the gaps in the biochemical and regulatory process from DNA to end-point phenotype. We describe a new framework, named network mapping, to study the genetic architecture of complex traits by integrating the regulatory networks that cause a high-order phenotype. Network mapping makes use of a system of differential equations to quantify the rule by which transcriptional, proteomic and metabolomic components interact with each other to organize into a functional whole. The synthesis of functional mapping, systems mapping and network mapping provides a novel avenue to decipher a comprehensive picture of the genetic landscape of complex phenotypes that underlie economically and biomedically important traits.
network mapping; complex traits; differential equations; DNA polymorphism; systems biology
As a basis of personalized medicine, pharmacogenetics and pharmacogenomics, which aim to study the genetic architecture of drug response, rely critically on dynamic modeling of how a drug is absorbed and transported to target tissues, where it interacts with body molecules to produce drug effects. Systems mapping provides a general framework for integrating systems pharmacology and pharmacogenomics through robust ordinary differential equations. In this chapter, we extend systems mapping to more complex and more heterogeneous structures of drug response by implementing stochastic differential equations (SDE). We argue that SDE-implemented systems mapping provides a computational tool for pharmacogenetic or pharmacogenomic research towards personalized medicine.
Stochastic differential equation; PK; PD; Statistical model; Genetic architecture; Drug response
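As a minimal sketch of the kind of stochastic differential equation the abstract above refers to, the following simulates a one-compartment elimination model, dC = -k C dt + σ C dW, with the Euler-Maruyama scheme. The model form, the genotype labels and every parameter value are illustrative assumptions and are not taken from the chapter.

```python
import math
import random


def simulate_pk_sde(c0, k, sigma, t_end, dt, seed=0):
    """Euler-Maruyama path of dC = -k*C dt + sigma*C dW (hypothetical PK model)."""
    rng = random.Random(seed)
    n_steps = round(t_end / dt)
    c = c0
    path = [c]
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment ~ N(0, dt)
        c = c + (-k * c) * dt + sigma * c * dw
        c = max(c, 0.0)  # a concentration cannot be negative
        path.append(c)
    return path


# Hypothetical genotype-specific elimination rates (per hour).
genotype_k = {"AA": 0.30, "Aa": 0.20, "aa": 0.10}
for genotype, k in genotype_k.items():
    path = simulate_pk_sde(c0=10.0, k=k, sigma=0.1, t_end=10.0, dt=0.01, seed=1)
    print(genotype, round(path[-1], 3))
```

Setting `sigma=0` recovers the deterministic ordinary differential equation, which makes it easy to see how much of the between-subject spread the noise term contributes.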
Genetic pleiotropy refers to the situation in which a single gene influences multiple traits, and it is therefore considered a major factor underlying genetic correlation among traits. To identify pleiotropy, an important focus in genome-wide association studies (GWAS) is on finding genetic variants that are simultaneously associated with multiple traits. On the other hand, longitudinal designs are often employed in complex disease studies, such that traits are measured repeatedly over time within the same subject. Performing genetic association analysis simultaneously on multiple longitudinal traits for detecting pleiotropic effects is interesting but challenging. In this paper, we propose a 2-step method for simultaneously testing the genetic association with multiple longitudinal traits. In the first step, a mixed effects model is used to analyze each longitudinal trait. We focus on estimation of the random effect that accounts for the subject-specific genetic contribution to the trait; fixed effects of other confounding covariates are also estimated. This first step enables separation of the genetic effect from other confounding effects for each subject and for each longitudinal trait. Then in the second step, we perform a simultaneous association test on multiple estimated random effects arising from multiple longitudinal traits. The proposed method can efficiently detect pleiotropic effects on multiple longitudinal traits and can flexibly handle traits of different data types such as quantitative, binary, or count data. We apply this method to analyze the 16th Genetic Analysis Workshop (GAW16) Framingham Heart Study (FHS) data. A simulation study is also conducted to validate this 2-step method and evaluate its performance.
pleiotropic effect; genetic association; multiple traits; longitudinal data; mixed effects model; single nucleotide polymorphisms (SNPs)
Despite the importance of such traits in biology and biomedicine, the genetic mapping of binary traits that change over time has not been well explored. In this article, we develop a statistical model for mapping quantitative trait loci (QTLs) that govern longitudinal responses of binary traits. The model is constructed within the maximum likelihood framework, in which the association between binary responses is modeled in terms of conditional log odds-ratios. With this parameterization, the maximum likelihood estimates (MLEs) of marginal mean parameters are robust to the misspecification of time dependence. We implement an iterative procedure to obtain the MLEs of QTL genotype-specific parameters that define longitudinal binary responses. The usefulness of the model was validated by analyzing a real example in rice. Simulation studies were performed to investigate the statistical properties of the model, showing that the model has power to identify and map specific QTLs responsible for the temporal pattern of binary traits.
binary trait; dynamic trait; functional mapping; maximum likelihood estimate
Next generation sequencing (NGS) has been leading the genetic study of human disease into an era of unprecedented productivity. Many bioinformatics pipelines have been developed to call variants from NGS data. The performance of these pipelines depends crucially on the variant caller used and on the calling strategies implemented. We studied the performance of four prevailing callers, SAMtools, GATK, glftools and Atlas2, using single-sample and multiple-sample variant-calling strategies. Using the same aligner, BWA, we built four single-sample and three multiple-sample calling pipelines and applied the pipelines to whole exome sequencing data taken from 20 individuals. We obtained genotypes generated by Illumina Infinium HumanExome v1.1 Beadchip for validation analysis and then used Sanger sequencing as a “gold-standard” method to resolve discrepancies for selected regions of high discordance. Finally, we compared the sensitivity of three of the single-sample calling pipelines using known simulated whole genome sequence data as a gold standard. Overall, for single-sample calling, the called variants were highly consistent across callers and the pairwise overlapping rate was about 0.9. Compared with other callers, GATK had the highest rediscovery rate (0.9969) and specificity (0.99996), and the Ti/Tv ratio from GATK was closest to the expected value of 3.02. Multiple-sample calling increased the sensitivity. Results from the simulated data suggested that GATK outperformed SAMtools and glfSingle in sensitivity, especially for low coverage data. Further, for the selected discrepant regions evaluated by Sanger sequencing, variant genotypes called by exome sequencing were more accurate than those from the exome array, although the average variant sensitivity and overall genotype consistency rate were as high as 95.87% and 99.82%, respectively. In conclusion, GATK showed several advantages over other variant callers for general purpose NGS analyses.
The GATK pipelines we developed perform very well.
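The Ti/Tv ratio mentioned above is the number of transitions (A<->G, C<->T) divided by the number of transversions among called single-nucleotide variants. The sketch below shows that calculation only; the function name and the input format (a list of reference/alternate base pairs) are assumptions made for illustration and do not correspond to any of the callers' own tooling.

```python
BASES = {"A", "C", "G", "T"}
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(snvs):
    """Return transitions / transversions for an iterable of (ref, alt) pairs."""
    ti = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        # Skip indels, ambiguous bases and non-variant records.
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions observed; Ti/Tv is undefined")
    return ti / tv


# Toy call set (hypothetical): two transitions, one transversion, one indel.
calls = [("A", "G"), ("C", "T"), ("A", "C"), ("AT", "A")]
print(ti_tv_ratio(calls))
```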
Recent advances in next-generation sequencing technologies have transformed the genetic study of human diseases; this is an era of unprecedented productivity. Exome sequencing, the targeted sequencing of the protein-coding portion of the human genome, has been shown to be a powerful and cost-effective method for detection of disease variants underlying Mendelian disorders. Increasing effort has been devoted to identifying rare variants associated with complex traits in sequencing studies. Here we provide an overview of the application fields for exome sequencing in human diseases. We describe a general framework of computation and bioinformatics for handling sequencing data. We then demonstrate data quality and agreement between exome sequencing and exome microarray (chip) genotypes using data collected on the same set of subjects in a genetic study of panic disorder. Our results show that, in sequencing data, the data quality was generally higher for variants within the exonic target regions than for those outside the target regions, owing to the target enrichment. We also compared genotype concordance for variant calls obtained by exome sequencing vs. exome genotyping microarrays. The overall consistency rate was >99.83% and the heterozygous consistency rate was >97.55%. The two platforms agree closely on low-frequency variants in the exonic regions, while exome sequencing provides much more information on variants not included on exome genotyping microarrays. The results demonstrate that exome sequencing data are of high quality and can be used to investigate the role of rare coding variants in human diseases.
exome sequencing; exome arrays; Mendelian diseases; complex traits; whole-genome sequencing
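To make the two consistency rates in the abstract above concrete, here is a small sketch of how genotype concordance between two platforms can be computed. It assumes genotypes coded as alternate-allele counts (0, 1, 2) with `None` for missing calls, and it assumes that the heterozygous rate is measured among sites the array calls heterozygous; the abstract does not state its exact definitions, so both choices are hypothetical.

```python
def concordance(seq_calls, array_calls):
    """Return (overall, heterozygous) concordance between two call lists."""
    compared = matched = het_total = het_matched = 0
    for seq, arr in zip(seq_calls, array_calls):
        if seq is None or arr is None:
            continue  # only sites called on both platforms are compared
        compared += 1
        matched += int(seq == arr)
        if arr == 1:  # array genotype is heterozygous
            het_total += 1
            het_matched += int(seq == 1)
    overall = matched / compared if compared else float("nan")
    het = het_matched / het_total if het_total else float("nan")
    return overall, het


# Toy data for one subject (hypothetical genotypes).
seq = [0, 1, 2, 1, None, 0]
arr = [0, 1, 2, 0, 1, 0]
print(concordance(seq, arr))
```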
Mutations in the GBA1 gene result in defective acid β-glucosidase and the complex phenotype of Gaucher disease (GD), which is related to the accumulation of glucosylceramide-laden macrophages. The phenotype is highly variable even among patients harboring identical GBA1 mutations. We hypothesized that modifier gene(s) underlie phenotypic diversity in GD and performed a GWAS in Ashkenazi Jewish patients with type 1 GD (GD1) who were homozygous for the N370S mutation. Patients were assigned to a mild, moderate or severe disease category using composite disease severity scoring systems. Whole-genome genotyping for >500,000 SNPs was performed to search for associations using the OQLS algorithm in 139 eligible patients. Several SNPs in linkage disequilibrium within the CLN8 gene locus were associated with GD1 severity: SNP rs11986414 was associated with GD1 severity at a p value of 1.26 × 10⁻⁶. Compared to mild disease, risk allele A at rs11986414 conferred an odds ratio of 3.72 for moderate/severe disease. Loss-of-function mutations in CLN8 cause neuronal ceroid-lipofuscinosis, but our results indicate that its increased expression may protect against severe GD1. In cultured skin fibroblasts, the relative expression of CLN8 was higher in mild GD than in severely affected patients, in whom CLN8 risk alleles were over-represented. In an in vitro cell model of GD, CLN8 expression was increased, and it was further enhanced in the presence of the bioactive substrate glucosylsphingosine. Taken together, CLN8 is a candidate modifier gene for GD1 that may function as a protective sphingolipid sensor and/or in glycosphingolipid trafficking. Future studies should explore the role of CLN8 in the pathophysiology of GD.
Gaucher disease; GWAS; genotype/phenotype correlations; phenotypic diversity; modifier genes; CLN8; N370S; GBA mutations
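For readers less familiar with the odds ratio reported above, the following sketch computes an allelic odds ratio with a Woolf (log-scale) confidence interval from a 2 × 2 table. The counts are invented purely for illustration; they are not the study's data, and the sketch does not attempt to reproduce the reported value of 3.72 or the OQLS analysis.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Woolf CI for a 2x2 table.

    a, b: risk / other allele counts in the more severe group
    c, d: risk / other allele counts in the mild group
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical allele counts, not taken from the GD1 cohort.
or_, lo, hi = odds_ratio_ci(a=20, b=10, c=10, d=20)
print(round(or_, 2), round(lo, 2), round(hi, 2))
```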
Despite our increasing recognition of the mechanisms that specify and propagate epigenetic states of gene expression, the pattern of how epigenetic modifications contribute to the overall genetic variation of a phenotypic trait remains largely elusive.
We construct a quantitative model to explore the effect of epigenetic modifications that occur at specific rates on the genome. This model, derived from, but extending beyond, the traditional quantitative genetic theory founded on Mendel’s laws, allows questions concerning the prevalence and importance of epigenetic variation to be incorporated and addressed.
It provides a new avenue for bringing chromatin inheritance into the realm of complex traits, facilitating our understanding of the means by which phenotypic variation is generated.
Mathematical models of viral dynamics in vivo provide valuable insights into the mechanisms of the nonlinear interaction between virus and host cell populations, the dynamics of viral drug resistance, and ways to eliminate viral infection from individual patients by drug treatment. The integration of these mathematical models with high-throughput genetic and genomic data within a statistical framework raises hope for effective treatment of HIV infection through the development of potent antiviral drugs based on individual patients’ genetic makeup. In this opinion article, we present a conceptual model for mapping and depicting a comprehensive picture of genetic control mechanisms for viral dynamics by incorporating a group of differential equations that quantify the emergent properties of a system.
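One widely used system of this kind is the basic target-cell model with uninfected cells T, infected cells I and free virus V: dT/dt = λ - dT - βTV, dI/dt = βTV - δI, dV/dt = pI - cV. The sketch below integrates it with a fixed-step Runge-Kutta scheme for two hypothetical host genotypes that differ only in the infection rate β. The choice of this particular model, the genotype labels and all parameter values are assumptions for illustration, not the article's own specification.

```python
def viral_rhs(state, lam, d, beta, delta, p, c):
    """Right-hand side of the basic target-cell viral dynamics model."""
    t_cells, infected, virus = state
    new_infections = beta * t_cells * virus
    return (lam - d * t_cells - new_infections,
            new_infections - delta * infected,
            p * infected - c * virus)


def rk4(state, dt, n_steps, **params):
    """Classical fourth-order Runge-Kutta integration with a fixed step."""
    for _ in range(n_steps):
        k1 = viral_rhs(state, **params)
        k2 = viral_rhs(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)), **params)
        k3 = viral_rhs(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)), **params)
        k4 = viral_rhs(tuple(s + dt * k for s, k in zip(state, k3)), **params)
        state = tuple(s + dt / 6.0 * (a + 2 * b + 2 * g + h)
                      for s, a, b, g, h in zip(state, k1, k2, k3, k4))
    return state


def basic_reproduction_number(lam, d, beta, delta, p, c):
    return lam * beta * p / (d * delta * c)


# Hypothetical parameters; only beta depends on the host genotype.
shared = dict(lam=1e4, d=0.01, delta=0.7, p=100.0, c=13.0)
genotype_beta = {"QQ": 2e-7, "qq": 5e-8}
start = (shared["lam"] / shared["d"], 0.0, 1.0)  # (T0, I0, V0)

results = {}
for genotype, beta in genotype_beta.items():
    params = dict(shared, beta=beta)
    results[genotype] = rk4(start, dt=0.01, n_steps=1000, **params)  # 10 days
    print(genotype, round(basic_reproduction_number(**params), 2),
          results[genotype][2])
```

In a mapping setting, genotype-specific parameters such as β would be estimated from data rather than fixed in advance; the point here is only that a difference in one rate constant changes whether the infection takes off or dies out.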
Population stratification is an important issue in case–control studies of disease-marker association. Failure to properly account for population structure can lead to spurious association or reduced power. In this article, we compare the performance of six methods correcting for population stratification in case–control association studies. These methods include genomic control (GC), EIGENSTRAT, principal component-based logistic regression (PCA-L), LAPSTRUCT, ROADTRIPS, and EMMAX. We also include the uncorrected Armitage test for comparison. In the simulation studies, we consider a wide range of population structure models for unrelated samples, including admixture. Our simulation results suggest that PCA-L and LAPSTRUCT perform well over all the scenarios studied, whereas GC, ROADTRIPS, and EMMAX fail to correct for population structure at single nucleotide polymorphisms (SNPs) that show strong differentiation across ancestral populations. The Armitage test does not adjust for confounding due to stratification and thus has an inflated type I error. Among all correction methods, EMMAX has the greatest power under the population structure settings considered for samples of unrelated individuals. The three methods EIGENSTRAT, PCA-L, and LAPSTRUCT are comparable, and outperform both GC and ROADTRIPS in almost all situations.
Population structure; association testing; type I error; power
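Two of the procedures compared above are simple enough to sketch directly: the uncorrected Armitage trend test and the genomic control rescaling. The code below computes the trend chi-square from genotype counts with additive weights (0, 1, 2), then estimates the inflation factor as the median statistic divided by the median of a 1-df chi-square distribution and divides every statistic by it. The function names are hypothetical, and flooring the inflation factor at 1 is a common convention assumed here rather than something stated in the abstract.

```python
import statistics

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df


def armitage_trend_chi2(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend chi-square (1 df) from genotype counts."""
    totals = [r + s for r, s in zip(cases, controls)]
    n = sum(totals)
    n_cases = sum(cases)
    sum_wr = sum(w * r for w, r in zip(weights, cases))
    sum_wn = sum(w * t for w, t in zip(weights, totals))
    sum_w2n = sum(w * w * t for w, t in zip(weights, totals))
    denom = n_cases * (n - n_cases) * (n * sum_w2n - sum_wn ** 2)
    if denom == 0:
        return 0.0  # monomorphic SNP or a single phenotype class
    return n * (n * sum_wr - n_cases * sum_wn) ** 2 / denom


def genomic_control(chi2_stats):
    """Return the inflation factor and the rescaled statistics."""
    lam = max(1.0, statistics.median(chi2_stats) / CHI2_1DF_MEDIAN)
    return lam, [x / lam for x in chi2_stats]


# Toy genotype counts (hypothetical): (aa, Aa, AA) in cases and controls.
print(armitage_trend_chi2((10, 20, 30), (30, 20, 10)))
```

Because genomic control applies one scaling factor to every SNP, it cannot fully correct markers whose allele frequencies differ unusually strongly between ancestral populations, which is consistent with the limitation the abstract reports.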
A number of studies have been conducted to investigate the predictive value of common genetic variants for complex diseases. To date, these studies have generally shown that common variants have no appreciable added predictive value over classical risk factors. New sequencing technology has enhanced the ability to identify rare variants that may have larger functional effects than common variants. One would expect rare variants to improve the discrimination power for disease risk by permitting more detailed quantification of genetic risk. Using the Genetic Analysis Workshop 17 simulated data sets for unrelated individuals, we evaluate the predictive value of rare variants by comparing prediction models built using the support vector machine algorithm with or without rare variants. Empirical results suggest that rare variants have appreciable effects on disease risk prediction.
Large-scale genome-wide association studies (GWAS) have identified many loci associated with body mass index (BMI), but few studies have focused on obesity as a binary trait. Here we report the results of a GWAS and candidate SNP genotyping study of obesity, including extremely obese cases and never-overweight controls as well as families segregating extreme obesity and thinness. We first performed a GWAS on 520 cases (BMI > 35 kg/m²) and 540 control subjects (BMI < 25 kg/m²), on measures of obesity and obesity-related traits. We subsequently followed up obesity-associated signals by genotyping the top ∼500 SNPs from the GWAS in the combined sample of cases, controls and family members totaling 2,256 individuals. For the binary trait of obesity, we found 16 genome-wide significant signals within the FTO gene (strongest signal at rs17817449, P = 2.5 × 10⁻¹²). We next examined obesity-related quantitative traits (such as total body weight, waist circumference and waist to hip ratio), and detected genome-wide significant signals between waist to hip ratio and NRXN3 (rs11624704, P = 2.67 × 10⁻⁹), previously associated with body weight and fat distribution. Our study demonstrated how a relatively small sample ascertained through extreme phenotypes can detect genuine associations in a GWAS.
We propose an incomplete-data, quasi-likelihood framework, for estimation and score tests, which accommodates both dependent and partially-observed data. The motivation comes from genetic association studies, where we address the problems of estimating haplotype frequencies and testing association between a disease and haplotypes of multiple tightly-linked genetic markers, using case-control samples containing related individuals. We consider a more general setting in which the complete data are dependent with marginal distributions following a generalized linear model. We form a vector Z whose elements are conditional expectations of the elements of the complete-data vector, given selected functions of the incomplete data. Assuming that the covariance matrix of Z is available, we form an optimal linear estimating function based on Z, which we solve by an iterative method. This approach addresses key difficulties in the haplotype frequency estimation and testing problems in related individuals: (1) dependence that is known but can be complicated; (2) data that are incomplete for structural reasons, as well as possibly missing, with different amounts of information for different observations; (3) the need for computational speed in order to analyze large numbers of markers; (4) a well-established null model, but an alternative model that is unknown and is problematic to fully specify in related individuals. For haplotype analysis, we give sufficient conditions for consistency and asymptotic normality of the estimator and asymptotic χ² null distribution of the score test. We apply the method to test for association of haplotypes with alcoholism in the GAW 14 COGA data set.
Estimating function; Association testing; Dependent data; Missing data; Generalized linear model; Score test