Classification of rare missense substitutions observed during genetic testing for patient management is a considerable problem in clinical genetics. The Bayesian integrated evaluation of unclassified variants is a solution originally developed for BRCA1/2. Here, we take a step toward an analogous system for the mismatch repair (MMR) genes (MLH1, MSH2, MSH6, and PMS2) that confer colon cancer susceptibility in Lynch syndrome by calibrating in silico tools to estimate prior probabilities of pathogenicity for MMR gene missense substitutions. A qualitative five-class classification system was developed and applied to 143 MMR missense variants. This identified 74 missense substitutions suitable for calibration. These substitutions were scored using six different in silico tools (Align-Grantham Variation Grantham Deviation, multivariate analysis of protein polymorphisms [MAPP], Mut-Pred, PolyPhen-2.1, Sorting Intolerant From Tolerant, and Xvar), using curated MMR multiple sequence alignments where possible. The output from each tool was calibrated by regression against the classifications of the 74 missense substitutions; these calibrated outputs are interpretable as prior probabilities of pathogenicity. MAPP was the most accurate tool and MAPP + PolyPhen-2.1 provided the best combined model (R² = 0.62 and area under the receiver operating characteristic curve = 0.93). The MAPP + PolyPhen-2.1 output is sufficiently predictive to feed as a continuous variable into the quantitative Bayesian integrated evaluation for clinical classification of MMR gene missense substitutions.
mismatch repair; in silico; missense substitutions; probability of pathogenicity
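The calibration step described above can be illustrated with a minimal sketch. It assumes a logistic regression of a binary pathogenic/neutral label on a single tool score; the abstract says only "regression", so the model form, the function names (`fit_logistic`, `prior_probability`), and the toy scores and labels below are all hypothetical and are not the published calibration.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(scores: List[float], labels: List[int],
                 lr: float = 0.1, n_iter: int = 5000) -> Tuple[float, float]:
    """Fit P(pathogenic) = sigmoid(a + b * score) by gradient ascent
    on the mean log-likelihood. Returns (a, b)."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = 0.0
        grad_b = 0.0
        for x, y in zip(scores, labels):
            err = y - sigmoid(a + b * x)
            grad_a += err
            grad_b += err * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b


def prior_probability(score: float, a: float, b: float) -> float:
    """Calibrated output: the tool score read as a prior probability."""
    return sigmoid(a + b * score)


# Hypothetical tool scores and classifications (1 = pathogenic, 0 = neutral).
toy_scores = [0.1, 0.4, 0.5, 0.9, 1.2, 1.5, 1.6, 2.0]
toy_labels = [0, 0, 1, 0, 1, 0, 1, 1]
a_hat, b_hat = fit_logistic(toy_scores, toy_labels)
```

Under these assumptions, `prior_probability(score, a_hat, b_hat)` returns a value in (0, 1) that rises with the score, which is the property needed for it to serve as a continuous prior in a Bayesian evaluation.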
A major challenge in allogeneic bone marrow (BM) transplantation is overcoming engraftment resistance to avoid the clinical problem of graft rejection. Identifying gene pathways that regulate BM engraftment may reveal molecular targets for overcoming engraftment barriers. Previously, we developed a mouse model of BM transplantation that utilizes recipient conditioning with non-myeloablative total body irradiation (TBI). We defined TBI doses that lead to graft rejection, that conversely are permissive for engraftment, and mouse strain variation with regard to the permissive TBI dose. We now report gene expression analysis, using Agilent Mouse 8x60K microarrays, in spleens of mice conditioned with varied TBI doses for correlation to the expected engraftment phenotype. The spleens of mice given engrafting doses of TBI, compared with non-engrafting TBI doses, demonstrated substantially broader gene expression changes, significant at the multiple testing-corrected P < .05 level and with fold change ≥ 2. Functional analysis revealed significant enrichment for a down-regulated canonical pathway involving B-cell development. Genes enriched in this pathway suggest that suppressing donor antigen processing and presentation may be pivotal effects conferred by TBI to enable engraftment. Regardless of TBI dose and recipient mouse strain, pervasive genomic changes related to inflammation were observed and reflected by significant enrichment for canonical pathways and association with upstream regulators. These gene expression changes suggest that macrophage and complement pathways may be targeted to overcome engraftment barriers. These exploratory results highlight gene pathways that may be important in mediating BM engraftment resistance.
Bone marrow; graft rejection; gene expression
To identify novel mechanisms regulating allogeneic hematopoietic cell engraftment, we used forward genetics and previously described identification, in mice, of a bone marrow (BM) engraftment quantitative trait locus (QTL), termed Bmgr5. This QTL confers dominant and large allele effects for engraftment susceptibility. It was localized to chromosome 16 by quantitative genetic techniques in a segregating backcross bred from susceptible BALB.K and resistant B10.BR mice. We now report verification of the Bmgr5 QTL using reciprocal chromosome 16 consomic strains. The BM engraftment phenotype in these consomic mice shows that Bmgr5 susceptibility alleles are not only sufficient but also indispensable for conferring permissiveness for allogeneic BM engraftment. Using panels of congenic mice, we resolved the Bmgr5 QTL into two separate subloci, termed Bmgr5a (Chr16:14.6–15.8 Mb) and Bmgr5b (Chr16:15.8–17.6 Mb), each conferring permissiveness for the engraftment phenotype and both fine mapped to an interval amenable to positional cloning. Candidate Bmgr5 genes were then prioritized using whole exome DNA sequencing and microarray gene expression data. Further studies are warranted to elucidate the genetic interaction between the Bmgr5a and Bmgr5b QTL and identify causative genes and underlying gene variants. This may lead to new approaches for overcoming the problem of graft rejection in clinical hematopoietic cell transplantation.
Bone marrow; engraftment; quantitative trait locus; consomic mouse; congenic mouse
Many individually rare missense substitutions are encountered during deep resequencing of candidate susceptibility genes and clinical mutation screening of known susceptibility genes. BRCA1 and BRCA2 are among the most resequenced of all genes, and clinical mutation screening of these genes provides an extensive data set for analysis of rare missense substitutions. Align-GVGD is a mathematically simple missense substitution analysis algorithm, based on the Grantham difference, which has already contributed to classification of missense substitutions in BRCA1, BRCA2, and CHEK2. However, the distribution of genetic risk as a function of Align-GVGD's output variables Grantham variation (GV) and Grantham deviation (GD) has not been well characterized. Here, we used data from the Myriad Genetic Laboratories database of nearly 70,000 full-sequence tests plus two risk estimates, one approximating the odds ratio and the other reflecting strength of selection, to display the distribution of risk in the GV-GD plane as a series of surfaces. We abstracted contours from the surfaces and used the contours to define a sequence of missense substitution grades ordered from greatest risk to least risk. The grades were validated internally using a third, personal and family history-based, measure of risk. The Align-GVGD grades defined here are applicable to both the genetic epidemiology problem of classifying rare missense substitutions observed in known susceptibility genes and the molecular epidemiology problem of analyzing rare missense substitutions observed during case-control mutation screening studies of candidate susceptibility genes.
BRCA1; BRCA2; Align-GVGD; unclassified variant; missense substitution; protein multiple sequence alignment
Shared genomic segment (SGS) analysis is a method that uses dense SNP genotyping in high-risk pedigrees to identify regions of sharing between cases. Here, we illustrate the power of SGS to identify dominant rare risk variants. Using simulated pedigrees, we consider 12 disease models based on disease prevalence, minor allele frequency, and penetrance to represent disease loci that explain 0.2% to 99.8% of total disease risk. Pedigrees were required to contain ≥15 meioses between all cases and to be high-risk based on significant excess of disease (p<0.001 or p<0.00001). Across these scenarios the power for a single pedigree ranged widely. Nonetheless, fewer than 10 pedigrees were sufficient for excellent power in the majority of the models. Power increased with the risk attributable to the disease locus, penetrance, and the excess of disease in the pedigree. Sharing allowing for one sporadic case was uniformly more powerful than sharing using all cases. Further, we performed an SGS analysis using a large Attenuated Familial Adenomatous Polyposis pedigree and identified a 1.96 Mb region containing the known causal APC gene with genome-wide significance (p<5×10−7). SGS is a powerful method for detecting rare variants and offers a valuable complement to GWAS and linkage analysis.
Three predisposition genes have been identified for cutaneous malignant melanoma (CMM), but they account for only about 25% of melanoma clusters/pedigrees. Linkage analyses of melanoma pedigrees from many countries have failed to identify significant linkage evidence for the remaining predisposition genes that must exist. The Utah linkage analysis approach of using singly informative extended high-risk pedigrees combined with high-density SNP markers has successfully identified significant linkage evidence for two regions. This first genome-wide linkage analysis of the extended Utah high-risk CMM pedigrees provides confirmation of linkage for a chromosome 9q region previously reported in Danish pedigrees. This report confirms that linkage analysis for common disorders can be successful in analysis of high-density markers in sets of singly informative high-risk pedigrees.
Electronically linked datasets have become an important part of clinical research. Information from multiple sources can be used to identify comorbid conditions and patient outcomes, measure use of healthcare services, and enrich demographic and clinical variables of interest. Innovative approaches for creating research infrastructure beyond a traditional data system are necessary.
Materials and methods
Records from a large healthcare system's enterprise data warehouse (EDW) were linked to a statewide population database, and a master subject index was created. The authors evaluate the linkage, along with the impact of missing information in EDW records and the coverage of the population database. The makeup of the EDW and population database provides a subset of cancer records that exist in both resources, which allows a cancer-specific evaluation of the linkage.
About 3.4 million records (60.8%) in the EDW were linked to the population database with a minimum accuracy of 96.3%. It was estimated that approximately 24.8% of target records were absent from the population database, which enabled the effect of the amount and type of information missing from a record on the linkage to be estimated. However, 99% of the records from the oncology data mart linked; they had fewer missing fields and this correlated positively with the number of patient visits.
Discussion and conclusion
A general-purpose research infrastructure was created that allows disease-specific cohorts to be identified. Creating an index between institutions is useful because it allows each institution to maintain control and confidentiality of its own information.
Master subject index; record linking; confidentiality; cancer cohort; population database; informatics; statistics
The increased feasibility of whole-genome (or whole-exome) sequencing has led to renewed interest in using family data to find disease mutations. For clinical phenotypes that lend themselves to study in large families, this approach can be particularly effective, because it may be possible to obtain strong evidence of a causal mutation segregating in a single pedigree even under conditions of extreme locus and/or allelic heterogeneity at the population level. In this paper, we extend our capacity to carry out positional mapping in large pedigrees, using a combination of linkage analysis and within-pedigree linkage trait-variant disequilibrium analysis to fine map down to the level of individual sequence variants. To do this, we develop a novel hybrid approach to the linkage portion, combining the non-stochastic approach to integration over the trait model implemented in the software package Kelvin, with Markov chain Monte Carlo-based approximation of the marker likelihood using blocked Gibbs sampling as implemented in the McSample program in the JPSGCS package. We illustrate both the positional mapping template, as well as the efficacy of the hybrid algorithm, in application to a single large pedigree with phenotypes simulated under a two-locus trait model.
linkage analysis; linkage disequilibrium; MCMC; genome-wide association; PPL; PPLD; epistasis; whole-genome sequence
jPAP (Java Pedigree Analysis Package) performs variance components linkage analysis of either quantitative or discrete traits. Multivariate linkage analysis of two or more traits (all quantitative, all discrete, or any combination) allows the inference of pleiotropy between the traits. The inclusion of multiple quantitative trait loci in linkage analysis allows the inference of epistasis between loci. A user-friendly graphical user interface facilitates the usage of jPAP.
jPAP; Epistasis; Pleiotropy; Variance components linkage analysis; Quantitative trait loci
We applied a new weighted pairwise shared genomic segment (pSGS) analysis for susceptibility gene localization to high-density genomewide SNP data in three extended high-risk breast cancer pedigrees.
Using this method, four genomewide suggestive regions were identified on chromosomes 2, 4, 7 and 8, and a borderline suggestive region on chromosome 14. Seven additional regions with at least nominal evidence were observed. Of particular note among these twelve regions were three that were each identified in two pedigrees: chromosomes 4, 7 and 14. Follow-up two-pedigree pSGS analyses further indicated excessive genomic sharing across the pedigrees in all three regions, suggesting that the underlying susceptibility alleles in those regions may be shared in common. In general, the pSGS regions identified were quite large (average 32.2 Mb); however, the range was wide (0.3–88.2 Mb). Several of the regions identified overlapped with loci and genes that have been previously implicated in breast cancer risk, including NBS1, BRCA1 and RAD51L1.
Our analyses have provided several loci of interest to pursue in these high-risk pedigrees and illustrate the utility of the weighted pSGS method and extended pedigrees for gene mapping in complex diseases. A focused sequencing effort across these loci in the sharing individuals is the natural next step to further map the critical underlying susceptibility variants in these regions.
Breast cancer; High-risk pedigrees; Susceptibility; Germline; Genomic sharing
We applied a shared genomic segment (SGS) analysis, incorporating an error model, to identify complete, or near complete, selective sweeps in the HapMap phase II data sets. This method is based on detecting heterozygous sharing across all individuals within a population, to identify regions of sharing with at least one allele in common. We identified multiple interesting regions, many of which are concordant with positive selection regions detected by previous population genetic tests. Others are suggested to be novel regions. Our finding illustrates the utility of SGS as a method for identifying regions of selection, and some of these regions have been proposed to be candidate regions for harboring disease genes.
identity by state; identity by descent; positive selection
We develop recent work on using graphical models for linkage disequilibrium to provide efficient programs for model fitting, phasing, and imputation of missing data in large data sets. Two important features contribute to the computational efficiency: the separation of the model fitting and phasing-imputation processes into different programs, and holding in memory only the data within a moving window of loci during model fitting. Optimal parameter values were chosen by cross-validation to maximize the probability of correctly imputing masked genotypes. The best accuracy obtained is slightly below that from the Beagle program of Browning and Browning, and our fitting program is slower. However, for large data sets, it uses less storage. For a reference set of n individuals genotyped at m markers, the time and storage required for fitting a graphical model are approximately O(nm) and O(n+m), respectively. To impute the phases and missing data on n individuals using an already fitted graphical model requires O(nm) time and O(m) storage. While the times for fitting and imputation are both O(nm), the imputation process is considerably faster; thus, once a model is estimated from a reference data set, the marginal cost of phasing and imputing further samples is very low.
phasing-imputation; cross validation; SNP genotype assays
We summarize the contributions of Group 9 of Genetic Analysis Workshop 17. This group addressed the problems of linkage disequilibrium and other longer range forms of allelic association when evaluating the effects of genotypes on phenotypes. Issues raised by long-range associations, whether a result of selection, stratification, possible technical errors, or chance, were less expected but proved to be important. Most contributors focused on regression methods of various types to illustrate problematic issues or to develop adaptations for dealing with high-density genotype assays. Study design was also considered, as was graphical modeling. Although no method emerged as uniformly successful, most succeeded in reducing false-positive results either by considering clusters of loci within genes or by applying smoothing metrics that required results from adjacent loci to be similar. Two unexpected results that questioned our assumptions of what is required to model linkage disequilibrium were observed. The first was that correlations between loci separated by large genetic distances can greatly inflate single-locus test statistics, and, whether the result of selection, stratification, possible technical errors, or chance, these correlations seem overabundant. The second unexpected result was that applying principal components analysis to genome-wide genotype data can apparently control not only for population structure but also for linkage disequilibrium.
score tests; two-stage study designs; robust regression; higher criticism; principal components analysis; graphical modeling
Genetic Analysis Workshop 17 (GAW17) provided a platform for evaluating existing statistical genetic methods and for developing novel methods to analyze rare variants that modulate complex traits. In this article, we present an overview of the 1000 Genomes Project exome data and simulated phenotype data that were distributed to GAW17 participants for analyses, the different issues addressed by the participants, and the process of preparation of manuscripts resulting from the discussions during the workshop.
We generalize recent work on graphical models for linkage disequilibrium to estimate the conditional independence structure between all variables for individuals in the Genetic Analysis Workshop 17 unrelated individuals data set. Using a stepwise approach for computational efficiency and an extension of our previously described methods, we estimate a model that describes the relationships between the disease trait, all quantitative variables, all covariates, ethnic origin, and the loci most strongly associated with these variables. We performed our analysis for the first 50 replicate data sets. We found that our approach was able to describe the relationships between the outcomes and covariates and that it could correctly detect associations of disease with several loci and with a reasonable false-positive detection rate.
We applied our method of pairwise shared genomic segment (pSGS) analysis to high-risk pedigrees identified from the Genetic Analysis Workshop 17 (GAW17) mini-exome sequencing data set. The original shared genomic segment method focused on identifying regions shared by all case subjects in a pedigree; thus it can be sensitive to sporadic cases. Our new method examines sharing among all pairs of case subjects in a high-risk pedigree and then uses the mean sharing as the test statistic; in addition, the significance is assessed empirically based on the pedigree structure and linkage disequilibrium pattern of the single-nucleotide polymorphisms. Using all GAW17 replicates, we identified 18 unilineal high-risk pedigrees that contained excess disease (p < 0.01) and at least 15 meioses between case subjects. Eighteen rare causal variants were polymorphic in this set of pedigrees. Based on a significance threshold of 0.001, 72.2% (13/18) of these pedigrees were successfully identified with at least one region that contains a true causal variant. The regions identified included 4 of the possible 18 polymorphic causal variants. On average, 1.1 true positives and 1.7 false positives were identified per pedigree. In conclusion, we have demonstrated the potential of our new pSGS method for localizing rare disease causal variants in common disease using high-risk pedigrees and exome sequence data.
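The pairwise statistic and its empirical significance assessment can be sketched as follows. This is a minimal illustration, not the published pSGS implementation: it assumes genotypes coded as 0/1/2 copies of one allele, measures sharing for a pair as the length of the identical-by-state run covering each locus, takes the unweighted mean over pairs, and uses the common (exceedances + 1)/(simulations + 1) convention for the empirical p value. All function names and the toy genotypes are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence


def pair_run_lengths(g1: Sequence[int], g2: Sequence[int]) -> List[int]:
    """For each locus, the length of the run of consecutive loci
    (containing that locus) at which the pair shares at least one
    allele identically by state; 0 where the pair shares nothing."""
    n = len(g1)
    # Two people lack a common allele only when one is 0 and the other 2.
    share = [abs(a - b) < 2 for a, b in zip(g1, g2)]
    lengths = [0] * n
    i = 0
    while i < n:
        if share[i]:
            j = i
            while j < n and share[j]:
                j += 1
            for k in range(i, j):
                lengths[k] = j - i
            i = j
        else:
            i += 1
    return lengths


def mean_pairwise_sharing(cases: List[Sequence[int]]) -> List[float]:
    """Mean over all case pairs of the per-locus run length."""
    per_pair = [pair_run_lengths(a, b) for a, b in combinations(cases, 2)]
    n_loci = len(cases[0])
    return [sum(p[locus] for p in per_pair) / len(per_pair)
            for locus in range(n_loci)]


def empirical_p(observed: float, null_values: Sequence[float]) -> float:
    """Empirical p value from statistics simulated under the null."""
    exceed = sum(1 for v in null_values if v >= observed)
    return (exceed + 1) / (len(null_values) + 1)


# Hypothetical genotypes for three cases at three loci.
toy_cases = [[0, 1, 2], [2, 1, 2], [1, 1, 0]]
toy_stat = mean_pairwise_sharing(toy_cases)
```

In practice the null values would come from simulating genotypes down the actual pedigree structure under a linkage disequilibrium model, as the abstract describes; here they are simply passed in as a list.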
A result for the equivalence of conditional independence graphs of ordered and unordered vector random variables from first-order Markov models is extended to arbitrary forests. The result is relevant to estimating graphical models for linkage disequilibrium between genetic loci. It explains why, in terms of the conditional independence structure, it sometimes does not matter whether you consider haplotypes or genotypes.
graphical modelling; conditional independence graphs; Markov properties; linkage disequilibrium; allelic association
The latent p value is a recently proposed empirical method for assessing evidence against a null hypothesis in a stochastic system involving latent, unobservable variables. It is particularly applicable to genome-wide genetic linkage analysis for test statistics with poorly defined analytical distributions.
We describe an implementation of the latent p value method and its application to a linkage analysis of asthma in 81 extended pedigrees containing 1,858 people genotyped at 533 microsatellite markers. We compare the performance of the latent p value method to a more conventional p value calculation. We also compare the performance of various linkage statistics within this pedigree resource.
Using a novel linkage score referred to as the C-link statistic, our analysis provides strong evidence for a recessive gene influencing asthma on chromosome 5q13 (median latent p value = 0.03). We also demonstrate remarkable improvement in computational requirements compared to a more conventional empirical p value calculation.
The latent p value method is indeed feasible and provides a computationally efficient means to evaluate evidence for linkage regardless of the choice of linkage statistic.
Linkage analysis; Extended pedigrees; Genetic heterogeneity
We describe methods and programs for simulating the genotypes of individuals in a pedigree at large numbers of linked loci when the alleles of the founders are under linkage disequilibrium. Both simulation and estimation of linkage disequilibrium models are shown to be feasible on a genome-wide scale. The methods are applied to evaluating the statistical significance of streaks of loci at which sets of related individuals share a common allele. The effects of properly allowing for linkage disequilibrium are shown to be important as they explain many of the large observations. This is illustrated by reanalysis of a previously reported linkage of prostate cancer to chromosome 1p23.
Genetic mapping; graphical modelling; identity by descent
Both protein-truncating variants and some missense substitutions in CHEK2 confer increased risk of breast cancer. However, no large-scale study has used full open reading frame mutation screening to assess the contribution of rare missense substitutions in CHEK2 to breast cancer risk. This absence has been due in part to a lack of validated statistical methods for summarizing risk attributable to large numbers of individually rare missense substitutions.
Previously, we adapted an in silico assessment of missense substitutions used for analysis of unclassified missense substitutions in BRCA1 and BRCA2 to the problem of assessing candidate genes using rare missense substitution data observed in case-control mutation-screening studies. The method involves stratifying rare missense substitutions observed in cases and/or controls into a series of grades ordered a priori from least to most likely to be evolutionarily deleterious, followed by a logistic regression test for trends to compare the frequency distributions of the graded missense substitutions in cases versus controls. Here we used this approach to analyze CHEK2 mutation-screening data from a population-based series of 1,303 female breast cancer patients and 1,109 unaffected female controls.
We found evidence of risk associated with rare, evolutionarily unlikely CHEK2 missense substitutions. Additional findings were that (1) the risk estimate for the most severe grade of CHEK2 missense substitutions (denoted C65) is approximately equivalent to that of CHEK2 protein-truncating variants; (2) the population attributable fraction and the familial relative risk explained by the pool of rare missense substitutions were similar to those explained by the pool of protein-truncating variants; and (3) post hoc power calculations implied that scaling up case-control mutation screening to examine entire biochemical pathways would require roughly 2,000 cases and controls to achieve acceptable statistical power.
This study shows that CHEK2 harbors many rare sequence variants that confer increased risk of breast cancer and that a substantial proportion of these are missense substitutions. The study validates our analytic approach to rare missense substitutions and provides a method to combine data from protein-truncating variants and rare missense substitutions into a one degree of freedom per gene test.
Genomewide association studies have resulted in a great many genomic regions that are likely to harbor disease genes. Thorough interrogation of these specific regions is the logical next step, including regional haplotype studies to identify risk haplotypes upon which the underlying critical variants lie. Pedigrees ascertained for disease can be powerful for genetic analysis due to the cases being enriched for genetic disease. Here we present a Monte Carlo based method to perform haplotype association analysis. Our method, hapMC, allows for the analysis of full-length and sub-haplotypes, including imputation of missing data, in resources of nuclear families, general pedigrees, case-control data or mixtures thereof. Both traditional association statistics and transmission/disequilibrium statistics can be performed. The method includes a phasing algorithm that can be used in large pedigrees and optional use of pseudocontrols.
Our new phasing algorithm substantially outperformed the standard expectation-maximization algorithm that is ignorant of pedigree structure, and hence is preferable for resources that include pedigree structure. Through simulation we show that our Monte Carlo procedure maintains the correct type 1 error rates for all resource types. Power comparisons suggest that transmission-disequilibrium statistics are superior for performing association in resources of only nuclear families. For mixed structure resources, however, the newly implemented pseudocontrol approach appears to be the best choice. Results also indicated the value of large high-risk pedigrees for association analysis, which, in the simulations considered, were comparable in power to case-control resources of the same sample size.
We propose hapMC as a valuable new tool to perform haplotype association analyses, particularly for resources of mixed structure. The availability of meta-association and haplotype-mining modules in our suite of Monte Carlo haplotype procedures adds further value to the approach.
Summary: It has been argued that the missing heritability in common diseases may be in part due to rare variants and gene–gene effects. Haplotype analyses provide more power for rare variants and joint analyses across genes can address multi-gene effects. Currently, methods are lacking to perform joint multi-locus association analyses across more than one gene/region. Here, we present a haplotype-mining gene–gene analysis method, which considers multi-locus data for two genes/regions simultaneously. This approach extends our single region haplotype-mining algorithm, hapConstructor, to two genes/regions. It allows construction of multi-locus SNP sets at both genes and tests joint gene–gene effects and interactions between single variants or haplotype combinations. A Monte Carlo framework is used to provide statistical significance assessment of the joint and interaction statistics; thus, the method can also be used with related individuals. This tool provides a flexible data-mining approach to identifying gene–gene effects that is otherwise currently unavailable.
We examine the utility of high density genotype assays for predisposition gene localization using extended pedigrees. Results for the distribution of the number and length of genomic segments shared identical by descent among relatives previously derived in the context of genomic mismatch scanning are reviewed in the context of dense single nucleotide polymorphism maps. We use long runs of loci at which cases share a common allele identically by state to localize hypothesized predisposition genes. The distribution of such runs under the hypothesis of no genetic effect is evaluated by simulation. Methods are illustrated by analysis of an extended prostate cancer pedigree previously reported to show significant linkage to chromosome 1p23. Our analysis establishes that runs of simple single locus statistics can be powerful, tractable and robust for finding DNA shared between relatives, and that extended pedigrees offer powerful designs for gene detection based on these statistics.
Candidate region; identity by descent; identity by state; prostate cancer; pedigree analysis
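The run statistic described above, long runs of loci at which all cases share a common allele identically by state, can be sketched in a few lines. This is an illustrative sketch under stated assumptions: genotypes are coded as 0/1/2 copies of one allele at each SNP, there is no genotyping error model, and the function names and toy data are hypothetical rather than taken from the authors' programs.

```python
from typing import List, Sequence, Tuple


def shares_allele(genotypes: Sequence[int]) -> bool:
    """True if every genotype carries at least one allele in common.
    With 0/1/2 coding, sharing fails only when both homozygous
    classes (0 and 2) are present among the cases."""
    return not (0 in genotypes and 2 in genotypes)


def longest_shared_run(cases: List[Sequence[int]]) -> Tuple[int, int]:
    """Return (start_locus, length) of the longest run of consecutive
    loci at which all cases share an allele identically by state."""
    n_loci = len(cases[0])
    best_start, best_len = 0, 0
    run_start, run_len = 0, 0
    for locus in range(n_loci):
        column = [case[locus] for case in cases]
        if shares_allele(column):
            if run_len == 0:
                run_start = locus
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0
    return best_start, best_len


# Hypothetical genotypes for three cases at six loci.
toy_cases = [
    [0, 1, 2, 1, 0, 0],
    [2, 1, 1, 1, 2, 1],
    [1, 2, 2, 0, 1, 0],
]
print(longest_shared_run(toy_cases))  # (1, 3)
```

The null distribution of the longest run would then be obtained by applying the same function to genotypes simulated under the hypothesis of no genetic effect, as the abstract states.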
We derive methods for enumerating the distinct junction tree representations for any given decomposable graph. We discuss the relevance of the method to estimating conditional independence graphs of graphical models and give an algorithm that, given a junction tree, will generate uniformly at random a tree from the set of those that represent the same graph. Programs implementing these methods are included as supplemental material.
Chordal graphs; triangulated graphs; graphical models; Markov chain Monte Carlo methods