Many individually rare missense substitutions are encountered during deep resequencing of candidate susceptibility genes and clinical mutation screening of known susceptibility genes. BRCA1 and BRCA2 are among the most resequenced of all genes, and clinical mutation screening of these genes provides an extensive data set for analysis of rare missense substitutions. Align-GVGD is a mathematically simple missense substitution analysis algorithm, based on the Grantham difference, which has already contributed to classification of missense substitutions in BRCA1, BRCA2, and CHEK2. However, the distribution of genetic risk as a function of Align-GVGD's output variables Grantham variation (GV) and Grantham deviation (GD) has not been well characterized. Here, we used data from the Myriad Genetic Laboratories database of nearly 70,000 full-sequence tests plus two risk estimates, one approximating the odds ratio and the other reflecting strength of selection, to display the distribution of risk in the GV-GD plane as a series of surfaces. We abstracted contours from the surfaces and used the contours to define a sequence of missense substitution grades ordered from greatest risk to least risk. The grades were validated internally using a third, personal and family history-based, measure of risk. The Align-GVGD grades defined here are applicable to both the genetic epidemiology problem of classifying rare missense substitutions observed in known susceptibility genes and the molecular epidemiology problem of analyzing rare missense substitutions observed during case-control mutation screening studies of candidate susceptibility genes.
BRCA1; BRCA2; Align-GVGD; unclassified variant; missense substitution; protein multiple sequence alignment
Shared genomic segment (SGS) analysis is a method that uses dense SNP genotyping in high-risk pedigrees to identify regions of sharing between cases. Here, we illustrate the power of SGS to identify dominant rare risk variants. Using simulated pedigrees, we consider 12 disease models based on disease prevalence, minor allele frequency, and penetrance to represent disease loci that explain 0.2% to 99.8% of total disease risk. Pedigrees were required to contain ≥15 meioses between all cases and to be high-risk based on significant excess of disease (p<0.001 or p<0.00001). Across these scenarios the power for a single pedigree ranged widely. Nonetheless, fewer than 10 pedigrees were sufficient for excellent power in the majority of the models. Power increased with the risk attributable to the disease locus, penetrance, and the excess of disease in the pedigree. Sharing allowing for one sporadic case was uniformly more powerful than sharing using all cases. Further, we performed an SGS analysis using a large Attenuated Familial Adenomatous Polyposis pedigree and identified a 1.96 Mb region containing the known causal APC gene with genome-wide significance (p<5×10⁻⁷). SGS is a powerful method for detecting rare variants and offers a valuable complement to GWAS and linkage analysis.
Three predisposition genes have been identified for cutaneous malignant melanoma (CMM), but they account for only about 25% of melanoma clusters/pedigrees. Linkage analyses of melanoma pedigrees from many countries have failed to identify significant linkage evidence for the remaining predisposition genes that must exist. The Utah linkage analysis approach of using singly informative extended high-risk pedigrees combined with high-density SNP markers has successfully identified significant linkage evidence for two regions. This first genome-wide linkage analysis of the extended Utah high-risk CMM pedigrees provides confirmation of linkage for a chromosome 9q region previously reported in Danish pedigrees. This report confirms that linkage analysis for common disorders can be successful in analysis of high-density markers in sets of singly informative high-risk pedigrees.
Electronically linked datasets have become an important part of clinical research. Information from multiple sources can be used to identify comorbid conditions and patient outcomes, measure use of healthcare services, and enrich demographic and clinical variables of interest. Innovative approaches for creating research infrastructure beyond a traditional data system are necessary.
Materials and methods
Records from a large healthcare system's enterprise data warehouse (EDW) were linked to a statewide population database, and a master subject index was created. The authors evaluate the linkage, along with the impact of missing information in EDW records and the coverage of the population database. The makeup of the EDW and population database provides a subset of cancer records that exist in both resources, which allows a cancer-specific evaluation of the linkage.
About 3.4 million records (60.8%) in the EDW were linked to the population database with a minimum accuracy of 96.3%. It was estimated that approximately 24.8% of target records were absent from the population database, which enabled the effect of the amount and type of information missing from a record on the linkage to be estimated. However, 99% of the records from the oncology data mart linked; they had fewer missing fields and this correlated positively with the number of patient visits.
Discussion and conclusion
A general-purpose research infrastructure was created that allows disease-specific cohorts to be identified. Creating an index between institutions is useful because it allows each institution to maintain control and confidentiality of its own information.
Master subject index; record linking; confidentiality; cancer cohort; population database; informatics; statistics
The increased feasibility of whole-genome (or whole-exome) sequencing has led to renewed interest in using family data to find disease mutations. For clinical phenotypes that lend themselves to study in large families, this approach can be particularly effective, because it may be possible to obtain strong evidence of a causal mutation segregating in a single pedigree even under conditions of extreme locus and/or allelic heterogeneity at the population level. In this paper, we extend our capacity to carry out positional mapping in large pedigrees, using a combination of linkage analysis and within-pedigree linkage trait-variant disequilibrium analysis to fine map down to the level of individual sequence variants. To do this, we develop a novel hybrid approach to the linkage portion, combining the non-stochastic approach to integration over the trait model implemented in the software package Kelvin with Markov chain Monte Carlo-based approximation of the marker likelihood using blocked Gibbs sampling as implemented in the McSample program in the JPSGCS package. We illustrate both the positional mapping template and the efficacy of the hybrid algorithm in application to a single large pedigree with phenotypes simulated under a two-locus trait model.
linkage analysis; linkage disequilibrium; MCMC; genome-wide association; PPL; PPLD; epistasis; whole-genome sequence
jPAP (Java Pedigree Analysis Package) performs variance components linkage analysis of either quantitative or discrete traits. Multivariate linkage analysis of two or more traits (all quantitative, all discrete, or any combination) allows the inference of pleiotropy between the traits. The inclusion of multiple quantitative trait loci in linkage analysis allows the inference of epistasis between loci. A user-friendly graphical user interface facilitates the usage of jPAP.
jPAP; Epistasis; Pleiotropy; Variance components linkage analysis; Quantitative trait loci
We applied a new weighted pairwise shared genomic segment (pSGS) analysis for susceptibility gene localization to high-density genomewide SNP data in three extended high-risk breast cancer pedigrees.
Using this method, four genomewide suggestive regions were identified on chromosomes 2, 4, 7 and 8, and a borderline suggestive region on chromosome 14. Seven additional regions with at least nominal evidence were observed. Of particular note among these twelve regions were three that were each identified in two pedigrees: those on chromosomes 4, 7 and 14. Follow-up two-pedigree pSGS analyses further indicated excessive genomic sharing across the pedigrees in all three regions, suggesting that the underlying susceptibility alleles in those regions may be shared in common. In general, the pSGS regions identified were quite large (average 32.2 Mb); however, the range was wide (0.3–88.2 Mb). Several of the regions identified overlapped with loci and genes that have been previously implicated in breast cancer risk, including NBS1, BRCA1 and RAD51L1.
Our analyses have provided several loci of interest to pursue in these high-risk pedigrees and illustrate the utility of the weighted pSGS method and extended pedigrees for gene mapping in complex diseases. A focused sequencing effort across these loci in the sharing individuals is the natural next step to further map the critical underlying susceptibility variants in these regions.
Breast cancer; High-risk pedigrees; Susceptibility; Germline; Genomic sharing
We applied a shared genomic segment (SGS) analysis, incorporating an error model, to identify complete, or near complete, selective sweeps in the HapMap phase II data sets. This method is based on detecting heterozygous sharing across all individuals within a population, to identify regions of sharing with at least one allele in common. We identified multiple interesting regions, many of which are concordant with positive selection regions detected by previous population genetic tests. Others are suggested to be novel regions. Our findings illustrate the utility of SGS as a method for identifying regions of selection, and some of these regions have been proposed as candidate regions for harboring disease genes.
identity by state; identity by descent; positive selection
We develop recent work on using graphical models for linkage disequilibrium to provide efficient programs for model fitting, phasing, and imputation of missing data in large data sets. Two important features contribute to the computational efficiency: the separation of the model fitting and phasing-imputation processes into different programs, and holding in memory only the data within a moving window of loci during model fitting. Optimal parameter values were chosen by cross-validation to maximize the probability of correctly imputing masked genotypes. The best accuracy obtained is slightly below that from the Beagle program of Browning and Browning, and our fitting program is slower. However, for large data sets, it uses less storage. For a reference set of n individuals genotyped at m markers, the time and storage required for fitting a graphical model are approximately O(nm) and O(n+m), respectively. To impute the phases and missing data on n individuals using an already fitted graphical model requires O(nm) time and O(m) storage. While the times for fitting and imputation are both O(nm), the imputation process is considerably faster; thus, once a model is estimated from a reference data set, the marginal cost of phasing and imputing further samples is very low.
phasing-imputation; cross validation; SNP genotype assays
We summarize the contributions of Group 9 of Genetic Analysis Workshop 17. This group addressed the problems of linkage disequilibrium and other longer range forms of allelic association when evaluating the effects of genotypes on phenotypes. Issues raised by long-range associations, whether a result of selection, stratification, possible technical errors, or chance, were less expected but proved to be important. Most contributors focused on regression methods of various types to illustrate problematic issues or to develop adaptations for dealing with high-density genotype assays. Study design was also considered, as was graphical modeling. Although no method emerged as uniformly successful, most succeeded in reducing false-positive results either by considering clusters of loci within genes or by applying smoothing metrics that required results from adjacent loci to be similar. Two unexpected results that questioned our assumptions of what is required to model linkage disequilibrium were observed. The first was that correlations between loci separated by large genetic distances can greatly inflate single-locus test statistics, and, whether the result of selection, stratification, possible technical errors, or chance, these correlations seem overabundant. The second unexpected result was that applying principal components analysis to genome-wide genotype data can apparently control not only for population structure but also for linkage disequilibrium.
score tests; two-stage study designs; robust regression; higher criticism; principal components analysis; graphical modeling
Genetic Analysis Workshop 17 (GAW17) provided a platform for evaluating existing statistical genetic methods and for developing novel methods to analyze rare variants that modulate complex traits. In this article, we present an overview of the 1000 Genomes Project exome data and simulated phenotype data that were distributed to GAW17 participants for analyses, the different issues addressed by the participants, and the process of preparation of manuscripts resulting from the discussions during the workshop.
We generalize recent work on graphical models for linkage disequilibrium to estimate the conditional independence structure between all variables for individuals in the Genetic Analysis Workshop 17 unrelated individuals data set. Using a stepwise approach for computational efficiency and an extension of our previously described methods, we estimate a model that describes the relationships between the disease trait, all quantitative variables, all covariates, ethnic origin, and the loci most strongly associated with these variables. We performed our analysis for the first 50 replicate data sets. We found that our approach was able to describe the relationships between the outcomes and covariates and that it could correctly detect associations of disease with several loci and with a reasonable false-positive detection rate.
We applied our method of pairwise shared genomic segment (pSGS) analysis to high-risk pedigrees identified from the Genetic Analysis Workshop 17 (GAW17) mini-exome sequencing data set. The original shared genomic segment method focused on identifying regions shared by all case subjects in a pedigree; thus it can be sensitive to sporadic cases. Our new method examines sharing among all pairs of case subjects in a high-risk pedigree and then uses the mean sharing as the test statistic; in addition, the significance is assessed empirically based on the pedigree structure and linkage disequilibrium pattern of the single-nucleotide polymorphisms. Using all GAW17 replicates, we identified 18 unilineal high-risk pedigrees that contained excess disease (p < 0.01) and at least 15 meioses between case subjects. Eighteen rare causal variants were polymorphic in this set of pedigrees. Based on a significance threshold of 0.001, 72.2% (13/18) of these pedigrees were successfully identified with at least one region that contains a true causal variant. The regions identified included 4 of the possible 18 polymorphic causal variants. On average, 1.1 true positives and 1.7 false positives were identified per pedigree. In conclusion, we have demonstrated the potential of our new pSGS method for localizing rare disease causal variants in common disease using high-risk pedigrees and exome sequence data.
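The pairwise statistic described above can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the abstract: genotypes are coded as allele counts (0, 1, 2) at biallelic SNPs, a pair of cases "shares" at a locus unless they are opposite homozygotes, and the per-locus statistic is the unweighted mean, over all pairs, of the length of the sharing run that covers the locus. The function names and the toy genotypes are hypothetical, and the empirical significance assessment based on pedigree structure and linkage disequilibrium is not reproduced here.

```python
from itertools import combinations
from typing import List


def pair_run_lengths(g1: List[int], g2: List[int]) -> List[int]:
    """For each locus, return the length of the run of consecutive loci,
    containing that locus, at which the two cases can share an allele
    identically by state (0 where the pair cannot share)."""
    n = len(g1)
    # Opposite homozygotes (0 vs 2) are the only genotype pair with no shared allele.
    share = [abs(a - b) < 2 for a, b in zip(g1, g2)]
    lengths = [0] * n
    j = 0
    while j < n:
        if not share[j]:
            j += 1
            continue
        k = j
        while k < n and share[k]:
            k += 1
        for idx in range(j, k):
            lengths[idx] = k - j
        j = k
    return lengths


def mean_pairwise_sharing(genotypes: List[List[int]]) -> List[float]:
    """Average the per-pair run lengths over all pairs of cases, locus by locus."""
    per_pair = [pair_run_lengths(a, b) for a, b in combinations(genotypes, 2)]
    n_loci = len(genotypes[0])
    return [sum(p[j] for p in per_pair) / len(per_pair) for j in range(n_loci)]


# Toy data: three cases typed at seven SNPs (made-up values for illustration).
toy_cases = [
    [0, 1, 2, 1, 1, 0, 2],
    [2, 1, 1, 2, 1, 0, 0],
    [1, 0, 1, 1, 2, 1, 1],
]
pairwise_stat = mean_pairwise_sharing(toy_cases)
print([round(x, 2) for x in pairwise_stat])
```

Because the statistic averages over pairs, one case that breaks sharing lowers the mean instead of terminating the shared segment outright, which is the motivation given above for moving away from all-case sharing.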
A result for the equivalence of conditional independence graphs of ordered and unordered vector random variables from first-order Markov models is extended to arbitrary forests. The result is relevant to estimating graphical models for linkage disequilibrium between genetic loci. It explains why, in terms of the conditional independence structure, it sometimes does not matter whether you consider haplotypes or genotypes.
graphical modelling; conditional independence graphs; Markov properties; linkage disequilibrium; allelic association
The latent p value is a recently proposed empirical method for assessing evidence against a null hypothesis in a stochastic system involving latent, unobservable variables. It is particularly applicable to genome-wide genetic linkage analysis for test statistics with poorly defined analytical distributions.
We describe an implementation of the latent p value method and its application to a linkage analysis of asthma in 81 extended pedigrees containing 1,858 people genotyped at 533 microsatellite markers. We compare the performance of the latent p value method to a more conventional p value calculation. We also compare the performance of various linkage statistics within this pedigree resource.
Using a novel linkage score referred to as the C-link statistic, our analysis provides strong evidence for a recessive gene influencing asthma on chromosome 5q13 (median latent p value = 0.03). We also demonstrate remarkable improvement in computational requirements compared to a more conventional empirical p value calculation.
The latent p value method is indeed feasible and provides a computationally efficient means to evaluate evidence for linkage regardless of the choice of linkage statistic.
Linkage analysis; Extended pedigrees; Genetic heterogeneity
We describe methods and programs for simulating the genotypes of individuals in a pedigree at large numbers of linked loci when the alleles of the founders are under linkage disequilibrium. Both simulation and estimation of linkage disequilibrium models are shown to be feasible on a genome-wide scale. The methods are applied to evaluating the statistical significance of streaks of loci at which sets of related individuals share a common allele. The effects of properly allowing for linkage disequilibrium are shown to be important as they explain many of the large observations. This is illustrated by reanalysis of a previously reported linkage of prostate cancer to chromosome 1p23.
Genetic mapping; graphical modelling; identity by descent
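Simulating pedigree genotypes is commonly done by "gene dropping": founders are assigned alleles, and each child receives one randomly chosen allele from each parent. The Python sketch below shows only that core step for a single biallelic locus, so linkage between loci and linkage disequilibrium among founder alleles, which are the substance of the methods described above, do not arise. The pedigree layout, the identifiers, and the allele frequency are hypothetical, and this is not the authors' program.

```python
import random
from typing import Dict, Optional, Tuple

Pedigree = Dict[str, Tuple[Optional[str], Optional[str]]]

# Hypothetical pedigree: id -> (father, mother); founders have (None, None).
# Parents must be listed before their children.
PEDIGREE: Pedigree = {
    "grandfather": (None, None),
    "grandmother": (None, None),
    "father": ("grandfather", "grandmother"),
    "mother": (None, None),
    "child1": ("father", "mother"),
    "child2": ("father", "mother"),
}


def gene_drop(pedigree: Pedigree, allele_freq: float,
              rng: random.Random) -> Dict[str, Tuple[int, int]]:
    """Simulate one locus. Alleles are coded 0/1; allele_freq is P(allele == 1).

    Each genotype is stored as (paternal allele, maternal allele).
    """
    genotypes: Dict[str, Tuple[int, int]] = {}
    for person, (father, mother) in pedigree.items():
        if father is None and mother is None:
            # Founder: draw both alleles independently from the population.
            genotypes[person] = (int(rng.random() < allele_freq),
                                 int(rng.random() < allele_freq))
        else:
            # Non-founder: Mendelian transmission of one allele from each parent.
            genotypes[person] = (rng.choice(genotypes[father]),
                                 rng.choice(genotypes[mother]))
    return genotypes


simulated = gene_drop(PEDIGREE, allele_freq=0.2, rng=random.Random(1))
for person, genotype in simulated.items():
    print(person, genotype)
```

Repeating the drop many times and recomputing a sharing statistic on each replicate gives a null distribution against which an observed streak of sharing can be compared.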
Both protein-truncating variants and some missense substitutions in CHEK2 confer increased risk of breast cancer. However, no large-scale study has used full open reading frame mutation screening to assess the contribution of rare missense substitutions in CHEK2 to breast cancer risk. This absence has been due in part to a lack of validated statistical methods for summarizing risk attributable to large numbers of individually rare missense substitutions.
Previously, we adapted an in silico assessment of missense substitutions used for analysis of unclassified missense substitutions in BRCA1 and BRCA2 to the problem of assessing candidate genes using rare missense substitution data observed in case-control mutation-screening studies. The method involves stratifying rare missense substitutions observed in cases and/or controls into a series of grades ordered a priori from least to most likely to be evolutionarily deleterious, followed by a logistic regression test for trends to compare the frequency distributions of the graded missense substitutions in cases versus controls. Here we used this approach to analyze CHEK2 mutation-screening data from a population-based series of 1,303 female breast cancer patients and 1,109 unaffected female controls.
We found evidence of risk associated with rare, evolutionarily unlikely CHEK2 missense substitutions. Additional findings were that (1) the risk estimate for the most severe grade of CHEK2 missense substitutions (denoted C65) is approximately equivalent to that of CHEK2 protein-truncating variants; (2) the population attributable fraction and the familial relative risk explained by the pool of rare missense substitutions were similar to those explained by the pool of protein-truncating variants; and (3) post hoc power calculations implied that scaling up case-control mutation screening to examine entire biochemical pathways would require roughly 2,000 cases and controls to achieve acceptable statistical power.
This study shows that CHEK2 harbors many rare sequence variants that confer increased risk of breast cancer and that a substantial proportion of these are missense substitutions. The study validates our analytic approach to rare missense substitutions and provides a method to combine data from protein-truncating variants and rare missense substitutions into a one degree of freedom per gene test.
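The trend test over ordered grades can be illustrated with a short Python sketch. Instead of fitting a logistic regression, it uses the Cochran-Armitage trend test, which is the score test for the slope in a logistic regression on a linear grade score and likewise has one degree of freedom; this substitution is ours, made to keep the example dependency-free. The grade scores and all counts are invented for illustration and are not the CHEK2 data.

```python
import math
from typing import Sequence, Tuple


def cochran_armitage_trend(cases: Sequence[int], controls: Sequence[int],
                           scores: Sequence[float]) -> Tuple[float, float]:
    """Return (z, two-sided p) for a linear trend in the proportion of cases
    across ordered categories (for example, missense substitution grades)."""
    totals = [r + s for r, s in zip(cases, controls)]
    n_all = sum(totals)
    p_bar = sum(cases) / n_all  # overall proportion of cases
    # Score statistic: score-weighted excess of cases over expectation.
    t_stat = sum(x * (r - n * p_bar) for x, r, n in zip(scores, cases, totals))
    sum_nx = sum(n * x for n, x in zip(totals, scores))
    sum_nx2 = sum(n * x * x for n, x in zip(totals, scores))
    variance = p_bar * (1.0 - p_bar) * (sum_nx2 - sum_nx ** 2 / n_all)
    z = t_stat / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return z, p_value


# Invented counts: grade 0 = non-carriers or least severe, grade 2 = most severe.
grade_scores = [0, 1, 2]
case_counts = [10, 20, 30]
control_counts = [30, 20, 10]
z_trend, p_trend = cochran_armitage_trend(case_counts, control_counts, grade_scores)
print(f"z = {z_trend:.3f}, p = {p_trend:.2e}")
```

A positive z indicates that the proportion of cases rises with grade. In practice a logistic regression would be preferred when covariates need adjustment.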
Genomewide association studies have resulted in a great many genomic regions that are likely to harbor disease genes. Thorough interrogation of these specific regions is the logical next step, including regional haplotype studies to identify risk haplotypes upon which the underlying critical variants lie. Pedigrees ascertained for disease can be powerful for genetic analysis due to the cases being enriched for genetic disease. Here we present a Monte Carlo based method to perform haplotype association analysis. Our method, hapMC, allows for the analysis of full-length and sub-haplotypes, including imputation of missing data, in resources of nuclear families, general pedigrees, case-control data or mixtures thereof. Both traditional association statistics and transmission/disequilibrium statistics can be performed. The method includes a phasing algorithm that can be used in large pedigrees and optional use of pseudocontrols.
Our new phasing algorithm substantially outperformed the standard expectation-maximization algorithm that is ignorant of pedigree structure, and hence is preferable for resources that include pedigree structure. Through simulation we show that our Monte Carlo procedure maintains the correct type 1 error rates for all resource types. Power comparisons suggest that transmission-disequilibrium statistics are superior for performing association in resources of only nuclear families. For mixed structure resources, however, the newly implemented pseudocontrol approach appears to be the best choice. Results also indicated the value of large high-risk pedigrees for association analysis, which, in the simulations considered, were comparable in power to case-control resources of the same sample size.
We propose hapMC as a valuable new tool to perform haplotype association analyses, particularly for resources of mixed structure. The availability of meta-association and haplotype-mining modules in our suite of Monte Carlo haplotype procedures adds further value to the approach.
Summary: It has been argued that the missing heritability in common diseases may be in part due to rare variants and gene–gene effects. Haplotype analyses provide more power for rare variants and joint analyses across genes can address multi-gene effects. Currently, methods are lacking to perform joint multi-locus association analyses across more than one gene/region. Here, we present a haplotype-mining gene–gene analysis method, which considers multi-locus data for two genes/regions simultaneously. This approach extends our single region haplotype-mining algorithm, hapConstructor, to two genes/regions. It allows construction of multi-locus SNP sets at both genes and tests joint gene–gene effects and interactions between single variants or haplotype combinations. A Monte Carlo framework is used to provide statistical significance assessment of the joint and interaction statistics, thus the method can also be used with related individuals. This tool provides a flexible data-mining approach to identifying gene–gene effects that otherwise is currently unavailable.
We examine the utility of high density genotype assays for predisposition gene localization using extended pedigrees. Results for the distribution of the number and length of genomic segments shared identical by descent among relatives previously derived in the context of genomic mismatch scanning are reviewed in the context of dense single nucleotide polymorphism maps. We use long runs of loci at which cases share a common allele identically by state to localize hypothesized predisposition genes. The distribution of such runs under the hypothesis of no genetic effect is evaluated by simulation. Methods are illustrated by analysis of an extended prostate cancer pedigree previously reported to show significant linkage to chromosome 1p23. Our analysis establishes that runs of simple single locus statistics can be powerful, tractable and robust for finding DNA shared between relatives, and that extended pedigrees offer powerful designs for gene detection based on these statistics.
Candidate region; identity by descent; identity by state; prostate cancer; pedigree analysis
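The run statistic described above, long runs of loci at which all cases share a common allele identically by state, can be sketched in a few lines of Python. The sketch assumes biallelic SNPs coded as allele counts (0, 1, 2) with no missing data or genotyping error; under that coding, sharing across all cases fails at a locus only when both opposite homozygotes are present. The function names and toy genotypes are hypothetical, and the null distribution, which the abstract evaluates by simulation, is not computed here.

```python
from typing import List, Tuple


def all_cases_share(genotypes_at_locus: List[int]) -> bool:
    """True if every case could carry a common allele at this locus.

    With genotypes coded as allele counts, a common allele exists unless
    both homozygote classes (0 and 2) appear among the cases.
    """
    return not (0 in genotypes_at_locus and 2 in genotypes_at_locus)


def longest_shared_run(genotypes: List[List[int]]) -> Tuple[int, int]:
    """Return (start_index, length) of the longest run of consecutive loci
    at which all cases share an allele; genotypes[i][j] is case i at locus j."""
    n_loci = len(genotypes[0])
    best_start, best_len = 0, 0
    run_start, run_len = 0, 0
    for j in range(n_loci):
        column = [case[j] for case in genotypes]
        if all_cases_share(column):
            if run_len == 0:
                run_start = j
            run_len += 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_len = 0
    return best_start, best_len


# Toy data: three cases typed at seven SNPs (made-up values for illustration).
example_cases = [
    [0, 1, 2, 1, 1, 0, 2],
    [2, 1, 1, 2, 1, 0, 0],
    [1, 0, 1, 1, 2, 1, 1],
]
print(longest_shared_run(example_cases))  # prints (1, 5)
```

Only the first and last loci contain both homozygote classes, so the longest run covers the five loci in between.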
We derive methods for enumerating the distinct junction tree representations for any given decomposable graph. We discuss the relevance of the method to estimating conditional independence graphs of graphical models and give an algorithm that, given a junction tree, will generate uniformly at random a tree from the set of those that represent the same graph. Programs implementing these methods are included as supplemental material.
Chordal graphs; triangulated graphs; graphical models; Markov chain Monte Carlo methods
Graft-versus-host disease (GVHD) is the major cause of morbidity and mortality after allogeneic hematopoietic cell transplantation. From a genetic perspective, GVHD is a complex phenotypic trait. While it is understood that susceptibility results from interacting polymorphisms of genes encoding histocompatibility antigens and immune regulatory molecules, a detailed and integrative understanding of the genetic background underlying GVHD remains lacking. To gain insight regarding these issues, we performed a forward genetic study. An MHC-matched mouse model was utilized in which irradiated recipient BALB.K and B10.BR mice demonstrate differential susceptibility to lethal GVHD when transplanted using AKR/J donors. Assessment of GVHD in (B10.BR x BALB.K)F1 mice revealed that susceptibility is a dominant trait and conferred by deleterious alleles from the BALB.K strain. In order to identify the alleles responsible for GVHD susceptibility, a genome scanning approach was taken using (B10.BR x BALB.K)F1 × B10.BR backcross mice as recipients. A major susceptibility locus, termed the Gvh1 locus, was identified on chromosome 16 using linkage analysis (LOD = 9.1). A second locus was found on chromosome 13, named Gvh2, which had additive but protective effects. Further identification of Gvh genes by positional cloning may yield new insight into genetic control mechanisms regulating GVHD and potentially reveal novel approaches for effective GVHD therapy.
Transplantation; Graft Versus Host Disease; Rodent
Motivation: Efficient models for genetic linkage disequilibrium (LD) are needed to enable appropriate statistical analysis of the dense, genome-wide single nucleotide polymorphism assays currently available.
Results: Estimation of graphical models for LD within a restricted class of decomposable models is shown to be possible using computer time and storage that scale linearly with the number of loci. Programs for estimation and for simulating from these models on a whole-genome basis are described and provided.
Availability: Java classes and source code for IntervalLD and GeneDrops are freely available over the internet at http://bioinformatics.med.utah.edu/~alun.
A recent genome-wide association (GWA) study suggested seven new loci as associated with prostate cancer (PRCA) susceptibility. The strongest associated SNP in each region was identified (rs2660753, rs9364554, rs6465657, rs10993994, rs7931342, rs2735839, rs5945619). We studied these seven SNPs in a replication study consisting of 169 familial PRCA cases selected from Utah high-risk PRCA pedigrees and 805 controls. We performed subset analyses for aggressive and early onset PRCA. At a nominal significance level, two SNPs were found to be associated with PRCA: rs10993994 on chromosome 10q11 (odds ratio (OR)=1.42 [95% confidence interval (CI), 1.05–1.90], p=0.022); and rs5945619 on chromosome Xp11 (OR=1.54 [95% CI, 1.03–2.31], p=0.035). Restricting analysis to familial PRCA cases with aggressive disease yielded very similar risk estimates at both SNPs. However, subset analysis for familial, early onset disease indicated highly significant association evidence and substantially higher risk estimates for rs10993994 (OR=2.20 [95% CI, 1.48–3.27], p<0.0001). This result suggests that the higher risk estimates from the stage 1 cohort in the original study for rs10993994 may have been due to the early-onset and familial nature of the PRCA cases in that cohort. In conclusion, in a small case-control study of PRCA cases from Utah high-risk pedigrees, we have significantly replicated association of PRCA with rs10993994 (10q11) upon study-wide correction for multiple comparisons. We also nominally replicated the association of PRCA with rs5945619 (Xp11). In particular, it appears that the susceptibility locus at 10q11 may be involved in familial, early onset disease.
Prostate Cancer; Genetic Risk
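For readers less familiar with how an odds ratio and its 95% confidence interval are obtained from case-control counts, the Python sketch below uses the standard Woolf (log odds ratio) interval on a 2×2 table. The counts are invented and are not the study data. The sketch also treats all subjects as independent, which would not hold for cases drawn from the same pedigrees, so it is unlikely to match the analysis actually used in the study.

```python
import math
from typing import Tuple


def odds_ratio_ci(exposed_cases: int, unexposed_cases: int,
                  exposed_controls: int, unexposed_controls: int,
                  z: float = 1.96) -> Tuple[float, float, float]:
    """Return (odds ratio, lower, upper) using the Woolf interval:
    log(OR) +/- z * sqrt(1/a + 1/b + 1/c + 1/d)."""
    a, b = exposed_cases, unexposed_cases
    c, d = exposed_controls, unexposed_controls
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Invented counts: 20 of 100 cases and 10 of 100 controls carry the risk allele.
or_est, ci_low, ci_high = odds_ratio_ci(20, 80, 10, 90)
print(f"OR = {or_est:.2f} [95% CI, {ci_low:.2f}-{ci_high:.2f}]")
```

An interval that excludes 1 corresponds to nominal significance at the 5% level. In this invented example the lower bound falls just below 1, so the association would not be declared nominally significant.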