With the advent of affordable and comprehensive sequencing technologies, access to molecular genetics for clinical diagnostics and research applications is increasing. However, variant interpretation remains challenging, and tools that close the gap between data generation and data interpretation are urgently required. Here we present a transferable approach to help address the limitations in variant annotation.
We develop a network of Bayesian logistic regression models that integrate multiple lines of evidence to evaluate the probability that a rare variant is the cause of an individual’s disease. We present models for genes causing inherited cardiac conditions, though the framework is transferable to other genes and syndromes.
Our models report a probability of pathogenicity, rather than a categorisation into pathogenic or benign, which captures the inherent uncertainty of the prediction. We find that gene- and syndrome-specific models outperform genome-wide approaches, and that the integration of multiple lines of evidence performs better than individual predictors. The models are adaptable to incorporate new lines of evidence, and results can be combined with familial segregation data in a transparent and quantitative manner to further enhance predictions.
Though the probability scale is continuous, and innately interpretable, performance summaries based on thresholds are useful for comparisons. Using a threshold probability of pathogenicity of 0.9, we obtain a positive predictive value of 0.999 and sensitivity of 0.76 for the classification of variants known to cause long QT syndrome over the three most important genes, which represents sufficient accuracy to inform clinical decision-making. A web tool APPRAISE [http://www.cardiodb.org/APPRAISE] provides access to these models and predictions.
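The threshold-based summaries quoted above can be illustrated with a minimal sketch. The function name, the toy probabilities and the labels below are invented for illustration and are not taken from the study; treating a probability equal to the threshold as a pathogenic call is also an assumption.

```python
def threshold_summary(probs, is_pathogenic, threshold=0.9):
    """Return (ppv, sensitivity) when variants with probability >= threshold
    are called pathogenic. The >= convention is an assumption."""
    pairs = list(zip(probs, is_pathogenic))
    tp = sum(1 for p, y in pairs if p >= threshold and y)      # true positives
    fp = sum(1 for p, y in pairs if p >= threshold and not y)  # false positives
    fn = sum(1 for p, y in pairs if p < threshold and y)       # missed pathogenic variants
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    return ppv, sensitivity


# Hypothetical model outputs and known labels (made-up numbers).
toy_probs = [0.99, 0.95, 0.92, 0.60, 0.30, 0.05]
toy_labels = [True, True, False, True, False, False]
ppv, sensitivity = threshold_summary(toy_probs, toy_labels)
```

Raising the threshold generally trades sensitivity for positive predictive value, which is why a single threshold is only a summary of the underlying continuous probability.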
Our Bayesian framework provides a transparent, flexible and robust approach to the analysis and interpretation of rare genetic variants. Models tailored to specific genes outperform genome-wide approaches, and can be sufficiently accurate to inform clinical decision-making.
Electronic supplementary material
The online version of this article (doi:10.1186/s13073-014-0120-4) contains supplementary material, which is available to authorized users.
We present the analysis of a prospective multicentre study to investigate genetic effects on the prognosis of newly treated epilepsy. Patients with a new clinical diagnosis of epilepsy requiring medication were recruited and followed up prospectively. The clinical outcome was defined as freedom from seizures for a minimum of 12 months in accordance with the consensus statement from the International League Against Epilepsy (ILAE). Genetic effects on remission of seizures after starting treatment were analysed with and without adjustment for significant clinical prognostic factors, and the results from each cohort were combined using a fixed-effects meta-analysis. After quality control (QC), we analysed 889 newly treated epilepsy patients using 472 450 genotyped and 6.9 × 10⁶ imputed single-nucleotide polymorphisms. Suggestive evidence for association (defined as Pmeta < 5.0 × 10⁻⁷) with remission of seizures after starting treatment was observed at three loci: 6p12.2 (rs492146, Pmeta = 2.1 × 10⁻⁷, OR[G] = 0.57), 9p23 (rs72700966, Pmeta = 3.1 × 10⁻⁷, OR[C] = 2.70) and 15q13.2 (rs143536437, Pmeta = 3.2 × 10⁻⁷, OR[C] = 1.92). Genes of biological interest at these loci include PTPRD and ARHGAP11B (encoding functions implicated in neuronal development) and GSTA4 (a phase II biotransformation enzyme). Pathway analysis using two independent methods implicated a number of pathways in the prognosis of epilepsy, including KEGG categories ‘calcium signaling pathway’ and ‘phosphatidylinositol signaling pathway’. Through a series of power curves, we conclude that it is unlikely any single common variant explains >4.4% of the variation in the outcome of newly treated epilepsy.
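As an illustration of how per-cohort results can be combined, the following is a minimal sketch of a standard inverse-variance fixed-effects meta-analysis on the log odds ratio scale. It is an assumption that this standard form corresponds to the analysis described; the function name and the two cohort estimates are hypothetical.

```python
import math


def fixed_effects_meta(betas, standard_errors):
    """Inverse-variance weighted fixed-effects combination of per-cohort
    effect estimates (here log odds ratios) and their standard errors."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    z = beta / se
    # Two-sided p-value under a standard normal null distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, z, p_value


# Hypothetical log odds ratios and standard errors from two cohorts.
meta_beta, meta_se, meta_z, meta_p = fixed_effects_meta([0.5, 0.7], [0.2, 0.2])
meta_or = math.exp(meta_beta)  # combined odds ratio
```

With equal standard errors the combined estimate is the simple average of the cohort estimates; cohorts with smaller standard errors receive proportionally more weight.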
We observe n sequences at each of m sites and assume that they have evolved from an ancestral sequence that forms the root of a binary tree of known topology and branch lengths, but the sequence states at internal nodes are unknown. The topology of the tree and branch lengths are the same for all sites, but the parameters of the evolutionary model can vary over sites. We assume a piecewise constant model for these parameters, with an unknown number of change-points and hence a transdimensional parameter space over which we seek to perform Bayesian inference. We propose two novel ideas to deal with the computational challenges of such inference. Firstly, we approximate the model based on the time machine principle: the top nodes of the binary tree (near the root) are replaced by an approximation of the true distribution; as more nodes are removed from the top of the tree, the cost of computing the likelihood is reduced linearly in n. The approach introduces a bias, which we investigate empirically. Secondly, we develop a particle marginal Metropolis-Hastings (PMMH) algorithm that employs a sequential Monte Carlo (SMC) sampler and can use the first idea. Our time-machine PMMH algorithm copes well with one of the bottlenecks of standard computational algorithms: the transdimensional nature of the posterior distribution. The algorithm is implemented on simulated and real data examples, and we empirically demonstrate its potential to outperform competing methods based on approximate Bayesian computation (ABC) techniques.
approximate Bayesian computation; binary trees; change-point models; particle marginal Metropolis-Hastings; sequential Monte Carlo samplers; time machine
In 2012, a skeleton was excavated at the presumed site of the Grey Friars friary in Leicester, the last-known resting place of King Richard III. Archaeological, osteological and radiocarbon dating data were consistent with these being his remains. Here we report DNA analyses of both the skeletal remains and living relatives of Richard III. We find a perfect mitochondrial DNA match between the sequence obtained from the remains and one living relative, and a single-base substitution when compared with a second relative. Y-chromosome haplotypes from male-line relatives and the remains do not match, which could be attributed to a false-paternity event occurring in any of the intervening generations. DNA-predicted hair and eye colour are consistent with Richard’s appearance in an early portrait. We calculate likelihood ratios for the non-genetic and genetic data separately, and combined, and conclude that the evidence for the remains being those of Richard III is overwhelming.
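Combining likelihood ratios from separate lines of evidence can be sketched as follows, under the assumption that the lines of evidence are conditionally independent so that their likelihood ratios multiply. The likelihood ratios and the prior below are invented placeholders, not the values from the study; a likelihood ratio below 1 stands in for evidence that counts against the hypothesis, such as a haplotype mismatch.

```python
import math


def combined_log10_lr(likelihood_ratios):
    """Sum of log10 likelihood ratios, equivalent to multiplying the ratios
    when the lines of evidence are conditionally independent (an assumption)."""
    return sum(math.log10(lr) for lr in likelihood_ratios)


def posterior_probability(log10_lr, prior_probability):
    """Convert a prior probability and a combined log10 LR into a posterior
    probability using Bayes' rule on the odds scale."""
    prior_odds = prior_probability / (1.0 - prior_probability)
    posterior_odds = prior_odds * 10.0 ** log10_lr
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical likelihood ratios; the third (< 1) weakens the overall case.
example_lrs = [100.0, 50.0, 0.2]
total_log10_lr = combined_log10_lr(example_lrs)
posterior = posterior_probability(total_log10_lr, prior_probability=0.5)
```

Working on the log scale keeps the arithmetic stable when individual likelihood ratios are very large or very small.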
King Richard III was a controversial English King whose remains are presumably deposited in Grey Friars in Leicester. Here the authors sequence the mitochondrial genome and Y-chromosome DNA of the skeletal remains and living relatives of Richard III and confirm that the remains belong to King Richard III.
When evaluating the weight of evidence (WoE) for an individual to be a contributor to a DNA sample, an allele frequency database is required. The allele frequencies are needed to inform about genotype probabilities for unknown contributors of DNA to the sample. Typically databases are available from several populations, and a common practice is to evaluate the WoE using each available database for each unknown contributor. Often the most conservative WoE (most favourable to the defence) is the one reported to the court. However the number of human populations that could be considered is essentially unlimited and the number of contributors to a sample can be large, making it impractical to perform every possible WoE calculation, particularly for complex crime scene profiles. We propose instead the use of only the database that best matches the ancestry of the queried contributor, together with a substantial FST adjustment. To investigate the degree of conservativeness of this approach, we performed extensive simulations of one- and two-contributor crime scene profiles, in the latter case with, and without, the profile of the second contributor available for the analysis. The genotypes were simulated using five population databases, which were also available for the analysis, and evaluations of WoE using our heuristic rule were compared with several alternative calculations using different databases. Using FST = 0.03, we found that our heuristic gave WoE more favourable to the defence than alternative calculations in well over 99% of the comparisons we considered; on average the difference in WoE was just under 0.2 bans (orders of magnitude) per locus. The degree of conservativeness of the heuristic rule can be adjusted through the FST value. We propose the use of this heuristic for DNA profile WoE calculations, due to its ease of implementation, and efficient use of the evidence while allowing a flexible degree of conservativeness.
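The effect of an FST adjustment on a single-locus WoE can be sketched for the simplest case: one contributor, no dropout or drop-in, and an alternative contributor X from the same population as Q. The formulas are the widely used Balding-Nichols match probabilities; it is an assumption that they correspond to the adjustment used here, and the allele frequencies and function names are hypothetical.

```python
import math


def match_probability(p1, p2=None, fst=0.03):
    """Probability that X has the same single-locus genotype as Q, given Q's
    genotype, with the Balding-Nichols FST correction.
    Pass only p1 for a homozygote, p1 and p2 for a heterozygote."""
    denom = (1.0 + fst) * (1.0 + 2.0 * fst)
    if p2 is None:  # homozygous genotype
        return (2 * fst + (1 - fst) * p1) * (3 * fst + (1 - fst) * p1) / denom
    return 2 * (fst + (1 - fst) * p1) * (fst + (1 - fst) * p2) / denom


def woe_bans(match_prob):
    """WoE in bans: log10 of the likelihood ratio 1 / match probability."""
    return -math.log10(match_prob)


# Hypothetical allele frequencies for a heterozygous genotype.
mp_unadjusted = match_probability(0.1, 0.2, fst=0.0)  # reduces to 2 * p1 * p2
mp_adjusted = match_probability(0.1, 0.2, fst=0.03)
woe_unadjusted = woe_bans(mp_unadjusted)
woe_adjusted = woe_bans(mp_adjusted)
```

A larger FST raises the match probability and so lowers the WoE, which is how the FST value tunes the degree of conservativeness.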
•A heuristic rule of assuming the database of Q for all unprofiled individuals in a CSP is proposed.
•We simulate a total of 105 000 one- and two-contributor CSPs with no dropin or dropout.
•The heuristic rule is conservative compared to an alternative for the majority of simulated CSPs.
•We suggest that the use of this heuristic will allow for evaluation of complex cases with many possible databases.
WoE, Weight of Evidence; Q, Queried contributor; X, Alternate contributor that replaces Q in defence hypothesis; K, Contributor to the CSP whose reference profile is available; U, Unprofiled contributor to the CSP; Population database; DNA mixtures; Likelihood ratio; Forensic DNA
•The behaviour of multi-replicate LRs with respect to the inverse match probability is proposed as a method to validate forensic LR software.
•We perform lab-based and simulated experiments of one-, two- and three-contributor CSPs, as well as investigating a real-world CSP.
•LRs rise towards the IMP with additional replicates, while never exceeding it. Additionally, the LR from multiple low-template replicates can exceed that from a single good-quality sample.
•We validate likeLTD by demonstrating that it adheres to the expected behaviours.
To date there is no generally accepted method to test the validity of algorithms used to compute likelihood ratios (LR) evaluating forensic DNA profiles from low-template and/or degraded samples. An upper bound on the LR is provided by the inverse of the match probability, which is the usual measure of weight of evidence for standard DNA profiles not subject to the stochastic effects that are the hallmark of low-template profiles. However, even for low-template profiles the LR in favour of a true prosecution hypothesis should approach this bound as the number of profiling replicates increases, provided that the queried contributor is the major contributor. Moreover, for sufficiently many replicates the standard LR for mixtures is often surpassed by the low-template LR. It follows that multiple LTDNA replicates can provide stronger evidence for a contributor to a mixture than a standard analysis of a good-quality profile. Here, we examine the performance of the likeLTD software for up to eight replicate profiling runs. We consider simulated and laboratory-generated replicates as well as resampling replicates from a real crime case. We show that LRs generated by likeLTD usually do exceed the mixture LR given sufficient replicates, are bounded above by the inverse match probability and do approach this bound closely when this is expected. We also show good performance of likeLTD even when a large majority of alleles are designated as uncertain, and suggest that there can be advantages to using different profiling sensitivities for different replicates. Overall, our results support both the validity of the underlying mathematical model and its correct implementation in the likeLTD software.
Low-template DNA; DNA mixtures; Likelihood ratio; Replicates; Forensic; likeLTD
The current genetic makeup of Latin America has been shaped by a history of extensive admixture between Africans, Europeans and Native Americans, a process taking place within the context of extensive geographic and social stratification. We estimated individual ancestry proportions in a sample of 7,342 subjects ascertained in five countries (Brazil, Chile, Colombia, México and Perú). These individuals were also characterized for a range of physical appearance traits and for self-perception of ancestry. The geographic distribution of admixture proportions in this sample reveals extensive population structure, illustrating the continuing impact of demographic history on the genetic diversity of Latin America. Significant ancestry effects were detected for most phenotypes studied. However, ancestry generally explains only a modest proportion of total phenotypic variation. Genetically estimated and self-perceived ancestry correlate significantly, but certain physical attributes have a strong impact on self-perception and bias self-perception of ancestry relative to genetically estimated ancestry.
Latin America has a history of extensive mixing between Native Americans and people arriving from Europe and Africa. As a result, individuals in the region have a highly heterogeneous genetic background and show great variation in physical appearance. Latin America offers an excellent opportunity to examine the genetic basis of the differentiation in physical appearance between Africans, Europeans and Native Americans. The region is also an advantageous setting in which to examine the interplay of genetic, physical and social factors in relation to ethnic/racial self-perception. Here we present the most extensive analysis of genetic ancestry, physical diversity and self-perception of ancestry yet conducted in Latin America. We find significant geographic variation in ancestry across the region, this variation being consistent with demographic history and census information. We show that genetic ancestry impacts many aspects of physical appearance. We observe that self-perception is highly influenced by physical appearance, and that variation in physical appearance biases self-perception of ancestry relative to genetically estimated ancestry.
Epilepsy is highly heritable, but its genetic architecture is poorly understood. Speed et al. estimate the number of susceptibility loci, show that common variants account for the majority of heritability, and demonstrate that epilepsy consists of genetically distinct subtypes. They conclude that gene-based prediction models may have clinical utility in first-seizure settings.
Epilepsy is a disease with substantial missing heritability; despite its high genetic component, genetic association studies have had limited success detecting common variants which influence susceptibility. In this paper, we reassess the role of common variants on epilepsy using extensions of heritability analysis. Our data set consists of 1258 UK patients with epilepsy, of which 958 have focal epilepsy, and 5129 population control subjects, with genotypes recorded for over 4 million common single nucleotide polymorphisms. Firstly, we show that on the liability scale, common variants collectively explain at least 26% (standard deviation 5%) of phenotypic variation for all epilepsy and 27% (standard deviation 5%) for focal epilepsy. Secondly we provide a new method for estimating the number of causal variants for complex traits; when applied to epilepsy, our most optimistic estimate suggests that at least 400 variants influence disease susceptibility, with potentially many thousands. Thirdly, we use bivariate analysis to assess how similar the genetic architecture of focal epilepsy is to that of non-focal epilepsy; we demonstrate both significant differences (P = 0.004) and significant similarities (P = 0.01) between the two subtypes, indicating that although the clinical definition of focal epilepsy does identify a genetically distinct epilepsy subtype, there is also scope to improve the classification of epilepsy by incorporating genotypic information. Lastly, we investigate the potential value in using genetic data to diagnose epilepsy following a single epileptic seizure; we find that a prediction model explaining 10% of phenotypic variation could have clinical utility for deciding which single-seizure individuals are likely to benefit from immediate anti-epileptic drug therapy.
epilepsy; association studies; heritability analysis; complex trait prediction
Allergy is a complex disease that is likely to involve dysregulated CD4+ T cell activation. Here we propose a novel methodology to gain insight into how coordinated behaviour emerges between disease-dysregulated pathways in response to pathophysiological stimuli. Using peripheral blood mononuclear cells of allergic rhinitis patients and controls cultured with and without pollen allergens, we integrate CD4+ T cell gene expression from microarray data and genetic markers of allergic sensitisation from GWAS data at the pathway level using enrichment analysis; implicating the complement system in both cellular and systemic response to pollen allergens. We delineate a novel disease network linking T cell activation to the complement system that is significantly enriched for genes exhibiting correlated gene expression and protein-protein interactions, suggesting a tight biological coordination that is dysregulated in the disease state in response to pollen allergen but not to diluent. This novel disease network has high predictive power for the gene and protein expression of the Th2 cytokine profile (IL-4, IL-5, IL-10, IL-13) and of the Th2 master regulator (GATA3), suggesting its involvement in the early stages of CD4+ T cell differentiation. Dissection of the complement system gene expression identifies 7 genes specifically associated with atopic response to pollen, including C1QR1, CFD, CFP, ITGB2, ITGAX and confirms the role of C3AR1 and C5AR1. Two of these genes (ITGB2 and C3AR1) are also implicated in the network linking complement system to T cell activation, which comprises 6 differentially expressed genes. C3AR1 is also significantly associated with allergic sensitisation in GWAS data.
A report on the 4th International Conference on Quantitative Genetics (ICQG4), Edinburgh, UK, June 17-22, 2012.
BLUP; complex traits; epistasis; genomic architecture; prediction; quantitative genetics; selection
Niemann–Pick disease type C (NP-C) is a rare, autosomal-recessive, progressive neurological disease caused by mutations in either the NPC1 gene (in 95% of cases) or the NPC2 gene. This observational, multicentre genetic screening study evaluated the frequency and phenotypes of NP-C in consecutive adult patients with neurological and psychiatric symptoms. Diagnostic testing for NP-C involved NPC1 and NPC2 exonic gene sequencing and gene dosage analysis. When available, results of filipin staining, plasma cholestane-3β,5α,6β-triol assays and measurements of relevant sphingolipids were also collected. NPC1 and NPC2 gene sequencing was completed in 250/256 patients from 30 psychiatric and neurological reference centres across the EU and USA [median (range) age 38 (18–90) years]. Three patients had a confirmed diagnosis of NP-C; two based on gene sequencing alone (two known causal disease alleles) and one based on gene sequencing and positive filipin staining. A further 12 patients displayed either single mutant NP-C alleles (8 with NPC1 mutations and 3 with NPC2 mutations) or a known causal disease mutation and an unclassified NPC1 allele variant (1 patient). Notably, high plasma cholestane-3β,5α,6β-triol levels were observed for all NP-C cases (n = 3). Overall, the frequency of NP-C patients in this study [1.2% (95% CI; 0.3%, 3.5%)] suggests that there may be an underdiagnosed pool of NP-C patients among adults who share common neurological and psychiatric symptoms.
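The frequency estimate can be illustrated with a minimal sketch of an exact (Clopper-Pearson) binomial confidence interval for 3 confirmed cases among 250 sequenced patients. Which interval method was actually used is not stated, so the exact method is an assumption here, and the function names are hypothetical; the result should land close to the reported interval if an exact method was used.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect_increasing(func, lo=0.0, hi=1.0, iterations=100):
    """Root of a function that increases from negative to positive on [lo, hi]."""
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if func(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided confidence interval for a binomial proportion."""
    # Lower limit: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _bisect_increasing(
        lambda p: (1.0 - binom_cdf(k - 1, n, p)) - alpha / 2)
    # Upper limit: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else _bisect_increasing(
        lambda p: alpha / 2 - binom_cdf(k, n, p))
    return lower, upper


point_estimate = 3 / 250
ci_lower, ci_upper = clopper_pearson(3, 250)
```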
Prion diseases are fatal neurodegenerative diseases of humans and animals caused by the misfolding and aggregation of prion protein (PrP). Mammalian prion diseases are under strong genetic control but few risk factors are known aside from the PrP gene locus (PRNP). No genome-wide association study (GWAS) has been done aside from a small sample of variant Creutzfeldt–Jakob disease (CJD). We conducted GWAS of sporadic CJD (sCJD), variant CJD (vCJD), iatrogenic CJD, inherited prion disease, kuru and resistance to kuru despite attendance at mortuary feasts. After quality control, we analysed 2000 samples and 6015 control individuals (provided by the Wellcome Trust Case Control Consortium and KORA-gen) for 491032-511862 SNPs in the European study. Association studies were done in each geographical and aetiological group followed by several combined analyses. The PRNP locus was highly associated with risk in all geographical and aetiological groups. This association was driven by the known coding variation at rs1799990 (PRNP codon 129). No non-PRNP loci achieved genome-wide significance in the meta-analysis of all human prion disease. SNPs at the ZBTB38–RASA2 locus were associated with CJD in the UK (rs295301, P = 3.13 × 10⁻⁸; OR, 0.70) but these SNPs showed no replication evidence of association in German sCJD or in Papua New Guinea-based tests. A SNP in the CHN2 gene was associated with vCJD [P = 1.5 × 10⁻⁷; odds ratio (OR), 2.36], but not in UK sCJD (P = 0.049; OR, 1.24), in German sCJD or in PNG groups. In the overall meta-analysis of CJD, 14 SNPs were associated (P < 10⁻⁵; two at PRNP, three at ZBTB38–RASA2, nine at nine other independent non-PRNP loci), more than would be expected by chance. None of the loci recently identified as genome-wide significant in studies of other neurodegenerative diseases showed any clear evidence of association in prion diseases.
Concerning common genetic variation, it is likely that the PRNP locus contains the only strong risk factors that act universally across human prion diseases. Our data are most consistent with several other risk loci of modest overall effects which will require further genetic association studies to provide definitive evidence.
biomathematics; systems biology; bioinformatics
Despite the success of genome-wide association studies (GWAS) in identifying loci associated with common diseases, a significant proportion of the causality remains unexplained. Recent advances in genomic technologies have placed us in a position to initiate large-scale studies of human disease-associated epigenetic variation, specifically variation in DNA methylation (DNAm). Such Epigenome-Wide Association Studies (EWAS) present novel opportunities but also create new challenges that are not encountered in GWAS. We discuss EWAS study design, cohort and sample selections, statistical significance and power, confounding factors, and follow-up studies. We also discuss how integration of EWAS with GWAS can help to dissect complex GWAS haplotypes for functional analysis.
Epigenomics; Disease Genetics; DNA Methylation; Epigenetics; Quantitative Trait
Motivation: Copy number variations (CNVs) are increasingly recognized as a substantial source of individual genetic variation, and hence there is a growing interest in investigating the evolutionary history of CNVs as well as their impact on complex disease susceptibility. CNV/SNP haplotypes are critical for this research, but although many methods have been proposed for inferring integer copy number, few have been designed for inferring CNV haplotypic phase and none of these are applicable at genome-wide scale. Here, we present a method for inferring missing CNV genotypes, predicting CNV allelic configuration and for inferring CNV haplotypic phase from SNP/CNV genotype data. Our method, implemented in the software polyHap v2.0, is based on a hidden Markov model, which models the joint haplotype structure between CNVs and SNPs. Thus, the haplotypic phases of CNVs and SNPs are inferred simultaneously. A sampling algorithm is employed to obtain a measure of confidence/credibility of each estimate.
Results: We generated diploid phase-known CNV–SNP genotype datasets by pairing male X chromosome CNV–SNP haplotypes. We show that polyHap provides accurate estimates of missing CNV genotypes, allelic configuration and CNV haplotypic phase on these datasets. We applied our method to a non-simulated dataset—a region on Chromosome 2 encompassing a short deletion. The results confirm that polyHap's accuracy extends to real-life datasets.
Availability: Our method is implemented in version 2.0 of the polyHap software package and can be downloaded from http://www.imperial.ac.uk/medicine/people/l.coin
Supplementary information: Supplementary data are available at Bioinformatics online.
Fasting plasma glucose and risk of type 2 diabetes are higher among Indian Asians than among European and North American Caucasians. Few studies have investigated genetic factors influencing glucose metabolism among Indian Asians.
RESEARCH DESIGN AND METHODS
We carried out genome-wide association studies for fasting glucose in 5,089 nondiabetic Indian Asians genotyped with the Illumina Hap610 BeadChip and 2,385 Indian Asians (698 with type 2 diabetes) genotyped with the Illumina 300 BeadChip. Results were compared with findings in 4,462 European Caucasians.
We identified three single nucleotide polymorphisms (SNPs) associated with glucose among Indian Asians at P < 5 × 10⁻⁸, all near melatonin receptor MTNR1B. The most closely associated was rs2166706 (combined P = 2.1 × 10⁻⁹), which is in moderate linkage disequilibrium with rs1387153 (r² = 0.60) and rs10830963 (r² = 0.45), both previously associated with glucose in European Caucasians. Risk allele frequency and effect sizes for rs2166706 were similar among Indian Asians and European Caucasians: frequency 46.2 versus 45.0%, respectively (P = 0.44); effect 0.05 (95% CI 0.01–0.08) versus 0.05 (0.03–0.07) mmol/l, respectively, higher glucose per allele copy (P = 0.84). SNP rs2166706 was associated with type 2 diabetes in Indian Asians (odds ratio 1.21 [95% CI 1.06–1.38] per copy of risk allele; P = 0.006). SNPs at the GCK, GCKR, and G6PC2 loci were also associated with glucose among Indian Asians. Risk allele frequencies of rs1260326 (GCKR) and rs560887 (G6PC2) were higher among Indian Asians compared with European Caucasians.
Common genetic variation near MTNR1B influences blood glucose and risk of type 2 diabetes in Indian Asians. Genetic variation at the MTNR1B, GCK, GCKR, and G6PC2 loci may contribute to abnormal glucose metabolism and related metabolic disturbances among Indian Asians.
We conducted a two-stage genome-wide association study to identify common genetic variation altering risk of the metabolic syndrome and related phenotypes in Indian Asian men, who have a high prevalence of these conditions. In Stage 1, approximately 317,000 single nucleotide polymorphisms were genotyped in 2700 individuals, from which 1500 SNPs were selected to be genotyped in a further 2300 individuals. Selection for inclusion in Stage 1 was based on four metabolic syndrome component traits: HDL-cholesterol, plasma glucose and Type 2 diabetes, abdominal obesity measured by waist to hip ratio, and diastolic blood pressure. Association was tested with these four traits and a composite metabolic syndrome phenotype. Four SNPs reaching significance level p < 5 × 10⁻⁷ and with posterior probability of association >0.8 were found in genes CETP and LPL, associated with HDL-cholesterol. These associations have already been reported in Indian Asians and in Europeans. Five additional loci harboured SNPs significant at p < 10⁻⁶ and posterior probability >0.5 for HDL-cholesterol, type 2 diabetes or diastolic blood pressure. Our results suggest that the primary genetic determinants of metabolic syndrome are the same in Indian Asians as in other populations, despite the higher prevalence. Further, we found little evidence of a common genetic basis for metabolic syndrome traits in our sample of Indian Asian men.
Neuroticism is a moderately heritable personality trait considered to be a risk factor for developing major depression, anxiety disorders and dementia. We performed a genome-wide association study in 2,235 participants drawn from a population-based study of neuroticism, making this the largest association study for neuroticism to date. Neuroticism was measured by the Eysenck Personality Questionnaire. After quality control, we analysed 430,000 autosomal SNPs together with an additional 1.2 million SNPs imputed with high quality from the HapMap CEU samples. We found a very small effect of population stratification, corrected using one principal component, and some cryptic kinship that required no correction. NKAIN2 showed suggestive evidence of association with neuroticism as a main effect (p < 10⁻⁶) and GPC6 showed suggestive evidence for interaction with age (p ≈ 10⁻⁷). We found support for one previously-reported association (PDE4D), but failed to replicate other recent reports. These results suggest common SNP variation does not strongly influence neuroticism. Our study was powered to detect almost all SNPs explaining at least 2% of heritability, and so our results effectively exclude the existence of loci having a major effect on neuroticism.
Although the introduction of genome-wide association studies (GWAS) has greatly increased the number of genes associated with common diseases, only a small proportion of the predicted genetic contribution has so far been elucidated. Studying the cumulative variation of polymorphisms in multiple genes acting in functional pathways may provide a complementary approach to the more common single SNP association approach in understanding genetic determinants of common disease. We developed a novel pathway-based method to assess the combined contribution of multiple genetic variants acting within canonical biological pathways and applied it to data from 14,000 UK individuals with 7 common diseases. We tested inflammatory pathways for association with Crohn's disease (CD), rheumatoid arthritis (RA) and type 1 diabetes (T1D) with 4 non-inflammatory diseases as controls. Using a variable selection algorithm, we identified variants responsible for the pathway association and evaluated their use for disease prediction using a 10-fold cross-validation framework in order to calculate out-of-sample area under the receiver operating characteristic curve (AUC). The generalisability of these predictive models was tested on an independent birth cohort from Northern Finland. Multiple canonical inflammatory pathways showed highly significant associations (p from 10⁻³ to 10⁻²⁰) with CD, T1D and RA. Variable selection identified on average a set of 205 SNPs (149 genes) for T1D, 350 SNPs (189 genes) for RA and 493 SNPs (277 genes) for CD. The pattern of polymorphisms at these SNPs was found to be highly predictive of T1D (91% AUC) and RA (85% AUC), and weakly predictive of CD (60% AUC). The T1D model (without any parameter refitting) had good predictive ability (79% AUC) in the Finnish cohort. Our analysis suggests that the genetic contribution to common inflammatory diseases operates through multiple genes interacting in functional pathways.
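The out-of-sample AUC calculation can be sketched in two pieces: an AUC computed as the proportion of case-control pairs in which the case receives the higher score, and a split of samples into folds so that every sample is scored by a model that did not see it. This is a generic sketch rather than the pipeline used in the study; the function names, the interleaved fold assignment and the toy scores are assumptions.

```python
def auc(scores, labels):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen case scores higher than a randomly chosen control (ties count 1/2)."""
    cases = [s for s, y in zip(scores, labels) if y]
    controls = [s for s, y in zip(scores, labels) if not y]
    wins = sum((c > d) + 0.5 * (c == d) for c in cases for d in controls)
    return wins / (len(cases) * len(controls))


def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists; fold i holds out every k-th sample
    starting at index i. Interleaving is an illustrative choice."""
    for fold in range(k):
        test = list(range(fold, n_samples, k))
        train = [j for j in range(n_samples) if j % k != fold]
        yield train, test


# Hypothetical held-out risk scores (1 = case, 0 = control).
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
toy_status = [1, 1, 0, 1, 0, 0]
toy_auc = auc(toy_scores, toy_status)
folds = list(k_fold_indices(20, k=10))
```

In a full analysis, a model would be fitted on each training set, used to score the corresponding held-out samples, and the pooled held-out scores passed to `auc`.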
The power of haplotype-based methods for association studies, identification of regions under selection, and ancestral inference, is well-established for diploid organisms. For polyploids, however, the difficulty of determining phase has limited such approaches. Polyploidy is common in plants and is also observed in animals. Partial polyploidy is sometimes observed in humans (e.g. trisomy 21; Down's syndrome), and it arises more frequently in some human tissues. Local changes in ploidy, known as copy number variations (CNV), arise throughout the genome. Here we present a method, implemented in the software polyHap, for the inference of haplotype phase and missing observations from polyploid genotypes. PolyHap allows each individual to have a different ploidy, but ploidy cannot vary over the genomic region analysed. It employs a hidden Markov model (HMM) and a sampling algorithm to infer haplotypes jointly in multiple individuals and to obtain a measure of uncertainty in its inferences.
In the simulation study, we combine real haplotype data to create artificial diploid, triploid, and tetraploid genotypes, and use these to demonstrate that polyHap performs well, in terms of both switch error rate in recovering phase and imputation error rate for missing genotypes. To our knowledge, there is no comparable software for phasing a large, densely genotyped region of chromosome from triploids and tetraploids, while for diploids we found polyHap to be more accurate than fastPhase. We also compare the results of polyHap to SATlotyper on an experimentally haplotyped tetraploid dataset of 12 SNPs, and show that polyHap is more accurate.
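The two accuracy measures named above can be made concrete with a short sketch. It covers the diploid case only and assumes that the inferred genotypes are correct at heterozygous sites, so that one inferred haplotype determines the other. The function names are hypothetical, and this is not polyHap's own evaluation code; the polyploid case needs a more general definition of a switch.

```python
def switch_error_rate(true_h1, true_h2, inferred_h1):
    """Fraction of adjacent heterozygous-site pairs whose relative phase is wrong."""
    # Only heterozygous sites carry phase information
    het = [i for i in range(len(true_h1)) if true_h1[i] != true_h2[i]]
    if len(het) < 2:
        return 0.0
    # True if the inferred haplotype follows true_h1 at this site, False if true_h2
    orientation = [inferred_h1[i] == true_h1[i] for i in het]
    # A switch occurs wherever the orientation changes between neighbouring het sites
    switches = sum(a != b for a, b in zip(orientation, orientation[1:]))
    return switches / (len(het) - 1)

def imputation_error_rate(true_genotypes, imputed_genotypes, masked_sites):
    """Fraction of masked genotypes that were imputed incorrectly."""
    if not masked_sites:
        return 0.0
    wrong = sum(true_genotypes[i] != imputed_genotypes[i] for i in masked_sites)
    return wrong / len(masked_sites)
```

For example, with true haplotypes `"00000"` and `"11111"`, an inferred haplotype `"00111"` contains a single switch among four adjacent pairs of heterozygous sites.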
With the availability of large SNP data in polyploids and CNV regions, we believe that polyHap, our proposed method for inferring haplotypic phase from genotype data, will be useful in enabling researchers analysing such data to exploit the power of haplotype-based analyses.
Summary: Genetic data obtained on population samples convey information about their evolutionary history. Inference methods can extract part of this information but they require sophisticated statistical techniques that have been made available to the biologist community (through computer programs) only for simple and standard situations typically involving a small number of samples. We propose here a computer program (DIY ABC) for inference based on approximate Bayesian computation (ABC), in which scenarios can be customized by the user to fit many complex situations involving any number of populations and samples. Such scenarios involve any combination of population divergences, admixtures and population size changes. DIY ABC can be used to compare competing scenarios, estimate parameters for one or more scenarios and compute bias and precision measures for a given scenario and known values of parameters (the current version applies to unlinked microsatellite data). This article describes key methods used in the program and provides its main features. The analysis of one simulated and one real dataset, both with complex evolutionary scenarios, illustrates the main possibilities of DIY ABC.
Availability: The software DIY ABC is freely available at http://www.montpellier.inra.fr/CBGP/diyabc.
Supplementary information: Supplementary data are also available at http://www.montpellier.inra.fr/CBGP/diyabc
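The basic idea behind approximate Bayesian computation can be sketched with a rejection sampler. This is a toy illustration under stated assumptions, not the algorithm or the population-genetic models implemented in DIY ABC: the parameter is a single allele frequency with a uniform prior, the data are an allele count in 50 sampled chromosomes, and all function names and numbers are hypothetical. It covers parameter estimation only, not scenario comparison.

```python
import random

def abc_rejection(observed, simulate, sample_prior, distance, tolerance, n_draws, seed=0):
    """Keep prior draws whose simulated data fall within `tolerance` of the observation."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        theta = sample_prior(rng)           # draw a parameter from the prior
        simulated = simulate(theta, rng)    # simulate data under that parameter
        if distance(simulated, observed) <= tolerance:
            accepted.append(theta)          # accepted draws approximate the posterior
    return accepted

def simulate_allele_count(freq, rng, n_chromosomes=50):
    """Toy simulator (assumption): binomial sampling of an allele at frequency `freq`."""
    return sum(rng.random() < freq for _ in range(n_chromosomes))

posterior = abc_rejection(
    observed=15,                            # hypothetical observed allele count
    simulate=simulate_allele_count,
    sample_prior=lambda rng: rng.random(),  # Uniform(0, 1) prior on the frequency
    distance=lambda sim, obs: abs(sim - obs),
    tolerance=1,
    n_draws=20000,
)
posterior_mean = sum(posterior) / len(posterior)
```

In realistic applications the simulator would generate genetic data under a demographic scenario, and the distance would be computed on summary statistics rather than on the raw data.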
FREGENE simulates sequence-level data over large genomic regions in large populations. Because, unlike coalescent simulators, it works forwards through time, it allows complex scenarios of selection, demography, and recombination to be modelled simultaneously. Detailed tracking of sites under selection is implemented in FREGENE and provides the opportunity to test theoretical predictions and gain new insights into mechanisms of selection. We describe here the main functionalities of both FREGENE and SAMPLE, a companion program that can replicate association study datasets.
We report detailed analyses of six large simulated datasets that we have made publicly available. Three demographic scenarios are modelled: one panmictic, one substructured with migration, and one complex scenario that mimics the principal features of genetic variation in major worldwide human populations. For each scenario there is one neutral simulation, and one with a complex pattern of selection.
FREGENE and the simulated datasets will be valuable for assessing the validity of models for selection, demography and population genetic parameters, as well as the efficacy of association studies. The principal advantages of FREGENE are modelling flexibility and computational efficiency. It is open source and object-oriented. As such, it can be customised and the range of models extended.
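To illustrate what simulating forwards through time means, the following sketch tracks a single selected allele in a Wright-Fisher population. It is a toy example under simple assumptions (one biallelic site, genic selection with relative fitness 1 + s, constant population size, no recombination or mutation) and is not FREGENE's implementation, which handles sequence-level data. The function name and parameters are hypothetical.

```python
import random

def wright_fisher_forward(pop_size, init_freq, selection, generations, seed=0):
    """Simulate the frequency of a selected allele forwards in time."""
    rng = random.Random(seed)
    n_chrom = 2 * pop_size                  # diploid population
    freq = init_freq
    trajectory = [freq]
    for _ in range(generations):
        # Selection: the allele has relative fitness 1 + s (assumed genic selection)
        expected = freq * (1 + selection) / (1 + selection * freq)
        # Drift: the next generation is a binomial sample of 2N chromosomes
        count = sum(rng.random() < expected for _ in range(n_chrom))
        freq = count / n_chrom
        trajectory.append(freq)
        if freq in (0.0, 1.0):              # allele lost or fixed
            break
    return trajectory

# Example: a beneficial allele starting at 5% in a population of 500 diploids
trajectory = wright_fisher_forward(pop_size=500, init_freq=0.05, selection=0.05, generations=200)
```

Because each generation is produced explicitly from the previous one, selection and demography can be changed at any generation, which is harder to accommodate in a coalescent simulator that works backwards in time.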
Testing one SNP at a time does not fully realise the potential of genome-wide association studies to identify multiple causal variants, which is a plausible scenario for many complex diseases. We show that simultaneous analysis of the entire set of SNPs from a genome-wide study to identify the subset that best predicts disease outcome is now feasible, thanks to developments in stochastic search methods. We used a Bayesian-inspired penalised maximum likelihood approach in which every SNP can be considered for additive, dominant, and recessive contributions to disease risk. Posterior mode estimates were obtained for regression coefficients that were each assigned a prior with a sharp mode at zero. A non-zero coefficient estimate was interpreted as corresponding to a significant SNP. We investigated two prior distributions and show that the normal-exponential-gamma prior leads to improved SNP selection in comparison with single-SNP tests. We also derived an explicit approximation for type-I error that avoids the need to use permutation procedures. As well as genome-wide analyses, our method is well-suited to fine mapping with very dense SNP sets obtained from re-sequencing and/or imputation. It can accommodate quantitative as well as case-control phenotypes and covariate adjustment, and it can be extended to search for interactions. Here, we demonstrate the power and empirical type-I error of our approach using simulated case-control data sets of up to 500 K SNPs, a real genome-wide data set of 300 K SNPs, and a sequence-based dataset, each of which can be analysed in a few hours on a desktop workstation.
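Two ideas in this abstract lend themselves to a small sketch: the additive, dominant and recessive codings of a SNP, and the way a prior with a sharp mode at zero produces posterior mode estimates that are exactly zero. The second function uses a double-exponential (Laplace) prior as a stand-in, because it has a simple closed form in the one-parameter Gaussian case; it is not the normal-exponential-gamma prior investigated by the authors. Both function names are hypothetical.

```python
def code_genotype(minor_allele_count):
    """Return (additive, dominant, recessive) codes for a genotype of 0, 1 or 2."""
    additive = minor_allele_count
    dominant = int(minor_allele_count >= 1)   # at least one copy of the minor allele
    recessive = int(minor_allele_count == 2)  # two copies of the minor allele
    return additive, dominant, recessive

def soft_threshold(z, lam):
    """Posterior mode of beta minimising 0.5 * (z - beta)**2 + lam * abs(beta).

    Assumption: unit-variance Gaussian likelihood and a Laplace prior with rate lam.
    Estimates with |z| <= lam are shrunk to exactly zero.
    """
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0
```

Under this simplification, a SNP is selected only when its unpenalised estimate `z` exceeds the threshold `lam` in absolute value, which mirrors the interpretation of a non-zero coefficient as a significant SNP.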
Tests of association with disease status are normally conducted one SNP at a time, ignoring the effects of all other genotyped SNPs. We developed a computationally efficient method to simultaneously analyse all SNPs, either in a genome-wide association (GWA) study, or a fine-mapping study based on re-sequencing and/or imputation. The method selects a subset of SNPs that best predicts disease status, while controlling the type-I error of the selected SNPs. This brings many advantages over standard single-SNP approaches, because the signal from a particular SNP can be more clearly assessed when other SNPs associated with disease status are already included in the model. Thus, in comparison with single-SNP analyses, power is increased and the false positive rate is reduced because of reduced residual variation. Localisation is also greatly improved. We demonstrate these advantages over the widely used single-SNP Armitage Trend Test using GWA simulation studies, a real GWA dataset, and a sequence-based fine-mapping simulation study.
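For reference, the single-SNP Armitage trend test used as the comparator can be sketched as follows. One common form of the statistic, with additive scores 0, 1 and 2, is N times the squared correlation between genotype score and case status, compared with a chi-square distribution on one degree of freedom. The function name is hypothetical, and the sketch ignores covariates and missing data.

```python
import math

def armitage_trend_test(genotypes, status):
    """Trend test with additive scores: chi2 = N * r**2, 1 degree of freedom."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(status) / n
    cov = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, status))
    var_g = sum((g - mean_g) ** 2 for g in genotypes)
    var_y = sum((y - mean_y) ** 2 for y in status)
    if var_g == 0 or var_y == 0:
        return 0.0, 1.0                     # monomorphic SNP or a single status group
    chi2 = n * cov * cov / (var_g * var_y)  # N times the squared Pearson correlation
    # Upper tail of a 1-df chi-square: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

Each SNP is tested on its own here, so the signal at one SNP is assessed without accounting for other associated SNPs, which is the limitation the simultaneous analysis is designed to address.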
Indel rates were observed to be reduced approximately twenty-fold in exonic ENCODE regions, five-fold in sequence that exhibits high evolutionary constraint in mammals and up to two-fold in some classes of regulatory elements.
We describe the distribution of indels in the 44 Encyclopedia of DNA Elements (ENCODE) regions (about 1% of the human genome) and evaluate the potential contributions of small insertion and deletion polymorphisms (indels) to human genetic variation. We relate indels to known genomic annotation features and measures of evolutionary constraint.
Indel rates are observed to be reduced approximately 20-fold to 60-fold in exonic regions, 5-fold to 10-fold in sequence that exhibits high evolutionary constraint in mammals, and up to 2-fold in some classes of regulatory elements (for instance, formaldehyde assisted isolation of regulatory elements [FAIRE] and hypersensitive sites). In addition, some noncoding transcription and other chromatin mediated regulatory sites also have reduced indel rates. Overall indel rates for these data are estimated to be smaller than single nucleotide polymorphism (SNP) rates by a factor of approximately 2, with both rates measured as base pairs per 100 kilobases to facilitate comparison.
Indel rates exhibit a broadly similar distribution across genomic features compared with SNP density rates, with a reduction in rates in coding transcription and evolutionarily constrained sequence. However, unlike indels, SNP rates do not appear to be reduced in some noncoding functional sequences, such as pseudo-exons, and FAIRE and hypersensitive sites. We conclude that indel rates are greatly reduced in transcribed and evolutionarily constrained DNA, and discuss why indel (but not SNP) rates appear to be constrained at some regulatory sites.