Hum Mutat. Author manuscript; available in PMC 2010 September 10.
Published in final edited form as:
PMCID: PMC2936773

Genetic Evidence and Integration of Various Data Sources for Classifying Uncertain Variants Into a Single Model

David E. Goldgar, Douglas F. Easton, Graham B. Byrnes, Amanda B. Spurdle, Edwin S. Iversen, Marc S. Greenblatt, and IARC Unclassified Genetic Variants Working Group


Genetic testing often results in the finding of a variant whose clinical significance is unknown. A number of different approaches have been employed in the attempt to classify such variants. For some variants, case-control, segregation, family history, or other statistical studies can provide strong evidence of direct association with cancer risk. For most variants, other evidence is available that relates to properties of the protein or gene sequence. In this work we propose a Bayesian method for assessing the likelihood that a variant is pathogenic. We discuss the assessment of prior probability, and how to combine the various sources of data into a statistically valid integrated assessment with a posterior probability of pathogenicity. In particular, we propose the use of a two-component mixture model to integrate these various sources of data and to estimate the parameters related to sensitivity and specificity of specific kinds of evidence. Further, we discuss some of the issues involved in this process and the assumptions that underpin many of the methods used in the evaluation process.

Keywords: likelihood models, unclassified variants, integrated model, classification


Clinicians and scientists are accustomed to forming conclusions based on multiple lines of evidence, and to dealing with degrees of uncertainty in these conclusions. Genetic testing for cancer predisposition syndromes often identifies variants whose pathogenicity is uncertain, which poses difficult issues for risk communication in the genetic counseling setting. In the literature these have been designated variants of uncertain significance (VUSs), or unclassified variants (UVs or UCVs). Here we shall use the term “variant” to refer to them. As detailed in companion articles in this issue of Human Mutation [Couch et al., 2008; Hofstra et al., 2008; Spurdle et al., 2008a; Tavtigian et al., 2008b], multiple methods have been used in attempts to classify whether a given variant is pathogenic (also termed “disease-associated”) or of little clinical significance (also termed neutral, benign, or “polymorphism,” although most such variants do not meet the criteria of a 1% allele frequency to be true polymorphisms in the genetic sense). A summary of these approaches and some perceived advantages and disadvantages are shown in Table 1 (revised from Table 1 in Goldgar et al. [2004]). The present work discusses the details of trying to integrate all methods to reach a conclusion for each variant and, where possible, providing a quantitative assessment of the evidence related to that conclusion. These conclusions can then be used to classify variants for clinical purposes, for example using the five-category scheme proposed in Plon et al. [2008]. At present, our discussions pertain mainly to inherited cancer predisposition syndromes, and certain kinds of evidence are even more limited to tumor suppressor genes. However, in vitro and in silico studies have also been done in noncancer Mendelian genetic syndromes [Chan et al., 2007; Bayrak-Toydemir et al., 2008], and we hope that our approach is tested in the wider genetics community.

Table 1
Data Relevant for Sequence Variant Classification as Pathogenic/Benign for High-Risk Cancer Genes

An analogous process exists for classifying carcinogens. The International Agency for Research on Cancer (IARC), in its monographs program [IARC, 2006], uses epidemiologic data from humans, animal data, and in vitro mechanistic data to classify carcinogens as: definitely carcinogenic, probably carcinogenic, possibly carcinogenic, likely not carcinogenic, or uncertain. The process relies on a panel of outside experts and is built around consensus. Like clinical decision-making and the use of multiple scientific experiments, the integration process is generally qualitative rather than quantitative. Importantly, substances are periodically reevaluated as new data become available, and classifications may change in either direction.

In the case of classifying genetic variants, some types of evidence (e.g., case-control studies, cosegregation) measure a more direct association of the variant with disease, whereas other evidence relies on the effect of the variant (observed or predicted) on aspects of the gene in question that are surrogates for disease risk and are, in that sense, a less direct measure of the clinical outcome. In some cases a particular assay may incorporate aspects of both kinds of evidence. Here we will provide some specific details on the genetic lines of evidence, because the other lines of evidence are covered in other articles in this issue.


Most direct evidence involves clinical observations of disease occurrence, and relies on statistical genetics methods. Each of these methods can be used to determine a likelihood ratio (LR) of association of the variant with the condition, which then provides a method for combining the evidence from different approaches. Four common types of analysis fall into this category: 1) cosegregation of the variant with disease; 2) comparison of allele frequency between cases and controls; 3) association of allele frequency with personal or family history of the disease; and 4) co-occurrence of the disease with other genetic variants.

Cosegregation Analysis

The most straightforward genetic evidence is cosegregation of a variant with the cancer phenotype in pedigrees. This approach is described in detail by Thompson et al. [2003], but briefly, the LR is derived by comparing the likelihood of affected individuals sharing the variant with that under the null hypothesis (that the variant is neutral with respect to risk, in which case the variant will segregate randomly within a pedigree). Essentially, the approach is similar to genetic linkage analysis, except that one is interested in the segregation of the variant itself rather than linked markers, and the likelihood can be derived in a similar manner to linkage LOD scores. The likelihood is a function of an assumed penetrance function incorporating the risk of disease for each genotype at the disease locus (which may be age- and/or sex-specific) and the frequency of the disease allele (usually assumed to be very rare). These penetrance parameters are typically assumed to be those estimated from (hopefully) large studies of families segregating known “high-risk” pathogenic mutations. However, the LRs are not very sensitive to minor misspecification of the penetrance, and the method otherwise depends only on Mendel’s laws, so the method can be considered very robust. A major advantage of this approach is that it depends only on the availability of DNA samples from multiple individuals from families with the variant, which will often be collected during the counseling process. It also examines the question most directly relevant to genetic counseling, and is the method that clinical geneticists are most comfortable with. The main disadvantage is that, like linkage analysis, its power depends on the number of informative meioses. Most disease pedigrees are small and obtaining samples from multiple affected individuals can be difficult. For this reason, it is rarely possible to categorize variants as pathogenic on the basis of segregation alone.

Other models assessing cosegregation can be employed [Petersen et al., 1998; Zhou et al., 2005] and, given sufficient data, the penetrance of a variant (or group of variants) could be estimated to maximize the LR by comparing the hypotheses that the variant has a specified penetrance function to that under the null hypothesis that the variant confers no increased disease risk.

Case-Control Analysis

A second direct approach is to compare the variant frequency in series of cases and controls. For example, case-control studies have helped to classify a problematic MSH2 variant (c.965G>A; p.G322D) as neutral [Barnetson et al., 2008]. The case-control approach can also include related individuals and missing genotypes using the weighted-score test method [Thornton and McPeek, 2007]. LRs for such data can be derived straightforwardly using standard case-control approaches. While this is the standard approach for evaluating common genetic variants, the main disadvantage in the current context is that most VUSs are rare (frequencies typically <1 in 1,000). Thus, prohibitively large sample sizes will usually be required to demonstrate that a variant is pathogenic. In addition, the variants of interest are often specific to a single geographical region or ethnic group, making it even more difficult to obtain the relevant samples. In practice, the approach is more often used as a rapid method to screen out probable neutral variants. If a variant is genotyped in a few hundred controls and shown to have a frequency of 1% or more, it is highly unlikely to be a high-risk variant (of course, it remains possible that the variant is associated with a more modest risk of the disease, but this falls outside the current classification problem). Mitchell et al. [2005] considered the statistical problem of the probability that a newly discovered neutral variant would be found in a series of controls. For example, they calculate that for complete resequencing of 10 kb (e.g., BRCA2) of coding sequence, the probability that a sequence variant found in a patient would not be found in 200 controls is ~0.025.
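The screening argument above can be illustrated with a short calculation. The sketch below assumes Hardy-Weinberg proportions and independent sampling of two chromosomes per control (assumptions that are not stated in the text), and the function name is illustrative. It does not reproduce the resequencing calculation of Mitchell et al. [2005].

```python
def prob_unobserved_in_controls(allele_freq, n_controls):
    """Probability that a variant with the given allele frequency is seen
    in none of n_controls diploid controls (2 * n_controls chromosomes).

    Assumes independent chromosomes (Hardy-Weinberg proportions).
    """
    if not 0.0 <= allele_freq <= 1.0:
        raise ValueError("allele_freq must lie in [0, 1]")
    return (1.0 - allele_freq) ** (2 * n_controls)


# A 1% allele is very unlikely to be missed in 200 controls, whereas a
# variant with frequency 1 in 2,000 is usually missed.
common = prob_unobserved_in_controls(0.01, 200)
rare = prob_unobserved_in_controls(0.0005, 200)
print(f"P(unobserved | q = 0.01)   = {common:.4f}")
print(f"P(unobserved | q = 0.0005) = {rare:.4f}")
```

Under these assumptions, absence from a few hundred controls says little about a rare variant, which is why the approach is mainly useful for screening out variants that turn out to be common.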

Personal and Family History

This approach is essentially an extension of the case-control approach. It relies on the fact that a variant that is associated with disease will tend to occur in families with a stronger history of disease. Therefore, a variant that is pathogenic will tend to have a family history similar to that of known pathogenic mutations, while a neutral variant will tend to have a weaker family history comparable to individuals without a mutation. This allows an LR to be constructed, providing one has available the family histories of all the individuals who have been tested in the population in which the variant was identified. The simplest approach to deriving the LR is to categorize features of the family history, by numbers of affected individuals, age at onset, and so forth. One can then derive a risk score, either from an external model such as the Manchester score [Evans et al., 2004, 2005], or by fitting an empirical model to the dataset of interest [Easton et al., 2007]. Alternatively, one could use a specific model and use the entire pedigree structure information [Antoniou et al., 2004; Berry et al., 2002]. Unlike the cosegregation approach, this approach only requires genotyping of a single individual per family, making it informative for a much larger number of variants. It does, however, require detailed data on the set of families on whom mutation testing is being done, not just the family with the variant of interest. The potential power of this approach has been demonstrated by the analysis of variants in BRCA1 and BRCA2 in the Myriad dataset [Easton et al., 2007]. Since the method depends on there being variation in the strength of family history, the approach works best when mutation screening has been extended to include relatively low-risk individuals.
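As a minimal sketch of the simplest version described above, the family history of each tested family can be assigned to a category, and the LR for a variant taken as the product, over the families carrying it, of the ratio of category probabilities among carriers of known pathogenic mutations to those among mutation-negative families. The category names and probabilities below are hypothetical, and the families are assumed to be unrelated so that their contributions multiply.

```python
# Hypothetical distribution of family-history categories among families with
# a known pathogenic mutation and among mutation-negative families.
P_GIVEN_PATHOGENIC = {"strong": 0.6, "moderate": 0.3, "weak": 0.1}
P_GIVEN_NEUTRAL = {"strong": 0.2, "moderate": 0.3, "weak": 0.5}


def family_history_lr(observed_categories):
    """LR for pathogenicity from the family-history categories of the
    (assumed unrelated) families in which the variant was observed."""
    lr = 1.0
    for category in observed_categories:
        lr *= P_GIVEN_PATHOGENIC[category] / P_GIVEN_NEUTRAL[category]
    return lr


# A variant seen in two strong-history families and one moderate-history family.
print(family_history_lr(["strong", "strong", "moderate"]))
```

A variant seen mostly in weak-history families would instead yield an LR below 1, that is, evidence in favor of neutrality.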

It is worth noting that this LR is independent of the LR determined by segregation within a pedigree [Easton et al., 2007], since the latter reflects the distribution of genotypes (carriers and noncarriers) of the variant among the pedigree members conditional on the disease phenotypes, while the former compares the distribution of phenotypes under the hypotheses of causality and neutrality.

This approach can be extended to include subtypes of disease, if pathogenic mutations are known to be associated with particular subtypes. This use of disease pathology is described in Hofstra et al. [2008].


The general principle of co-occurrence is that, if a pathogenic high-risk mutation is found in a family, one would expect this to be the main cause of the disease aggregation in that family, and would not expect there to be a second mutation. Therefore, the co-occurrence of another known pathogenic mutation potentially reduces the likelihood that a UV is truly pathogenic.

The detailed implementation of this approach will, however, vary from gene to gene, depending in particular on the phenotypic consequences of carrying two pathogenic mutations. For example, in the case of BRCA1, organisms with two nonfunctional alleles are believed not to be viable. “Double-knockout” of BRCA1 in transgenic mice is an embryonic lethal phenotype [Liu et al., 1996], and no human has been found to carry two pathogenic BRCA1 alleles despite the presence of relatively common founder mutations in some populations such as Ashkenazi Jews [Judkins et al., 2005]. Therefore, co-occurrence of a novel BRCA1 variant with a known pathogenic BRCA1 variant is taken as evidence for the lack of pathogenicity of the novel variant. An LR for pathogenicity can be assessed and used in quantitative assessment of pathogenicity using multifactorial likelihood modeling [Goldgar et al., 2004; Easton et al., 2007]. On the other hand, carriers of two BRCA2 mutations manifest Fanconi anemia, an autosomal recessive condition characterized by bone marrow failure, congenital malformations, predisposition to cancer, and cellular hypersensitivity to DNA cross-linking agents [Howlett et al., 2002]. Thus, co-occurrence in trans of a novel BRCA2 variant with a known mutation may be interpreted by assessing the presence of a Fanconi-like phenotype. The presence of two variants and the resulting phenotype may also be helpful in mismatch repair (MMR) genes. Carriers of two pathogenic MMR variants have an aggressive phenotype with extremely early onset of cancer [Rahman and Scott, 2007; Auclair et al., 2006; Agostini et al., 2005], and carriers of two MLH1 variants may manifest a phenotype of neurofibromatosis [Wang et al., 1999; Ricciardone et al., 1999]. These clinical factors can be considered when evaluating MMR variants. 
Because sufficient genetic/direct evidence for reliable classification on this basis alone may be lacking, particularly for very rare variants, it will often be necessary to rely on additional information in the classification process.

One difficulty with the above approach is that the arguments strictly apply to variants co-occurring in trans. The occurrence of a variant in cis with a known mutation is not informative, at least for the common cancer susceptibility genes in which the mutations are inactivating, since the disease risk would be expected to be the same regardless of whether the UV was pathogenic or neutral. However, in selected cases, such a variant may be used as a linked marker to predict carrier status and cancer risk within a given family (discussed in Plon et al. [2008]), even if the underlying true pathogenic variant is yet to be discovered. This situation would be identified by very strong genetic evidence of pathogenicity, such as cosegregation, for a variant that otherwise appears neutral: for example, one that lies at a residue that is not evolutionarily conserved, falls in a nonfunctional domain of the protein, or shows wild-type function in assays using specific artificial cell constructs.


Indirect evidence may be derived from a variety of features of a variant, including structural features of the gene or protein or the ability of the protein to perform key cellular functions. In vitro assays of a crucial cellular function are often considered to be good surrogates for determining the pathogenicity of a variant. In some cases, such as MMR function, this association appears to be robust. However, for some genes, such as BRCA1, it may be more difficult to associate the function that is assayed with the disease risk [Couch et al., 2008]. Impaired in vitro function has been documented for cancer-associated variants of many genes, and such a finding can serve as qualitative evidence in classifying some variants. Another aspect of indirect evidence may derive from a variety of bioinformatic analyses of the mutant sequence, including evolutionary conservation, the predicted nature of the severity of the amino acid substitution [Tavtigian et al., 2008a], and predicted effects of both exonic and intronic sequence on protein splicing [Spurdle et al., 2008a].

To best define the predictive value of indirect measures, they should be validated by applying them to a set of variants that are known to be pathogenic and a set known to be benign/neutral, by calibrating them against more direct evidence from clinical observations. This allows the evidence from such data to be summarized numerically in terms of positive predictive value (PPV; i.e., the percentage of positive tests that are true-positives) and negative predictive value (NPV; i.e., the percentage of negative tests that are true-negatives). Unfortunately, this has not always been possible. One difficulty in such an analysis is that for certain disease genes, there are very few well-established pathogenic missense variants. If such relationships could be quantified, however, then an odds ratio (OR) or LR for pathogenicity could be calculated by dividing the likelihood that a variant is pathogenic by the likelihood that it is not. For example, if a method results in a PPV of 90%, then the OR that a variant with a positive test is pathogenic would be 90%/10% or 9:1. Although it will be desirable eventually to determine the reasons for false-positives and false-negatives and to improve each method of assessment, it is not necessary to understand the reasons before using the observed predictive values (PVs) to classify variants.
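The conversion from a predictive value to odds used in the example above is simple enough to sketch directly. The function name is illustrative; note that a PPV estimated on a validation set also reflects the proportion of pathogenic variants in that set, so treating the result as an OR follows the convention of the text rather than a general identity.

```python
def predictive_value_to_odds(predictive_value):
    """Convert a predictive value (as a proportion) to odds of the form x:1."""
    if not 0.0 < predictive_value < 1.0:
        raise ValueError("predictive_value must lie strictly between 0 and 1")
    return predictive_value / (1.0 - predictive_value)


# A PPV of 90% corresponds to odds of about 9:1; a value of 80% to about 4:1.
for pv in (0.90, 0.80):
    print(f"PV = {pv:.0%} -> odds of about {predictive_value_to_odds(pv):.1f}:1")
```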

Several studies have estimated the PV of computational studies based on protein structural gene evolution [Tavtigian et al., 2008a]. In studies of CDKN2A, MLH1, MSH2, and two genes responsible for noncancer hereditary syndromes, the overall predictive value (OPV) was around 80% (OR ≈ 4) for any single method; some algorithms had higher sensitivity and NPV, and others had a higher specificity and PPV. For CDKN2A, MLH1, and MSH2 variants, if three or four methods agreed on a classification of “pathogenic,” then the PPV for predicting in vitro functional assays (and in the case of CDKN2A, clinical phenotype) was over 90%, which translated to an OR value between 10–15 [Chan et al., 2007]. The PPV of mutations at invariant amino acid (AA) positions is also well over 90% [Chan et al., 2007; Chao et al., 2008; Balasubramanian et al., 2005]. The NPV of predicting neutrality from a change at a poorly conserved AA is not as high, so computational predictions of lack of pathogenicity carry less confidence (NPV = 58.8–73.5%). Karchin et al. [2007] used a supervised learning approach based on the crystal structure of the BRCT domain of BRCA1 to correlate the performance of computer prediction algorithms with functional assays on a training set of 36 variants and a test set of 54 variants. A majority vote strategy from three different supervised learning algorithms agreed with the functional assay 94% of the time (OR ≈ 15).


Clearly, an important aspect of the classification problem is the determination, to the extent possible, of the relationships among the different forms of evidence and how to integrate them to form one conclusion. Accordingly, one important goal of the Working Group was to develop a mechanism for integrating both qualitative and quantitative measures of likelihood of pathogenicity. Such a mechanism would provide an objective, reproducible classification for each variant.

Three issues are paramount when thinking about combining these differing kinds of evidence into a single model: 1) relevance to disease phenotype; 2) quantification of qualitative measures; and 3) statistical independence (or lack thereof) of different model components and how one might incorporate their dependence in a statistically valid manner. Ideally, the quantitative data should be summarized in terms of an estimated posterior probability that the variant is pathogenic given all the available information. Plon et al. [2008] propose that the qualitative classification of variants should correspond to these probabilities as follows: pathogenic, >0.99; likely pathogenic, 0.95–0.99; uncertain, 0.05–0.95; likely neutral, 0.001–0.05; neutral/little clinical significance, <0.001.
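The mapping from posterior probability to qualitative class can be written as a small lookup. This is a sketch using the thresholds quoted above (>0.99, 0.95–0.99, 0.05–0.95, 0.001–0.05, <0.001); the handling of values that fall exactly on a boundary is an assumption made here, since the text does not specify it.

```python
def classify(posterior):
    """Map a posterior probability of pathogenicity to a qualitative class.

    Boundary handling (>= at the lower end of each interval) is an
    assumption of this sketch.
    """
    if not 0.0 <= posterior <= 1.0:
        raise ValueError("posterior must lie in [0, 1]")
    if posterior > 0.99:
        return "pathogenic"
    if posterior >= 0.95:
        return "likely pathogenic"
    if posterior >= 0.05:
        return "uncertain"
    if posterior >= 0.001:
        return "likely neutral"
    return "neutral/little clinical significance"


for p in (0.996, 0.97, 0.5, 0.01, 0.0005):
    print(p, "->", classify(p))
```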

In other articles in this issue, the Working Group presents tumor pathology, sequence-based, and functional assay approaches to classifying the disease causality of several cancer genes [Hofstra et al., 2008; Tavtigian et al., 2008a; Couch et al., 2008]. For the BRCA1 and BRCA2 genes, the statistical likelihood model has initially been used as a “gold standard” measure of disease causality [Goldgar et al., 2004; Easton et al., 2007]. However, this method is dependent on the availability of genetic and family data and will likely not be useful for classification of the cancer relevance of the majority of rare variants. In addition, this model makes a number of assumptions about penetrance and risk associated with mutations. Because of these limitations it is necessary to develop and apply a method that utilizes multiple forms of data and is robust to the limitations in genetic/family data. Specifically, we can conceive of estimating whether sets of variants are pathogenic or neutral using a two-component multivariate mixture model (e.g., Conlon [2008]).

Within this general framework, we assume all variants can be classified as either mutations (M) or neutral variants (V). The aim is to compute the posterior probability P(M | D) that a variant is pathogenic, given all available data D = {Di}, where each Di is a different item of data evaluated for the variant in question, including potentially family history, evolutionary sequence analysis, functional assays, genetic measures, etc. The posterior odds that the variant is pathogenic are then calculated from the prior probability and the likelihood ratios associated with each data item through Bayes’ theorem:

$$\frac{P(M \mid D)}{1 - P(M \mid D)} = \frac{P(M)}{1 - P(M)} \prod_i \frac{P(D_i \mid M)}{P(D_i \mid V)},$$

assuming mutual independence of the individual Di, where P(M) is the prior probability that the VUS under consideration is pathogenic (i.e., M) and each factor in the product is the likelihood ratio for data item Di. Writing LR for the posterior odds on the left-hand side, the posterior probability of causality is then given by LR/(LR + 1).
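The odds form of Bayes’ theorem described here can be sketched in a few lines. The function name and argument layout are illustrative, and the calculation relies on the same independence assumption as the text: the prior odds are multiplied by each component LR and the result is converted back to a probability.

```python
import math


def posterior_probability(prior, likelihood_ratios):
    """Posterior probability of pathogenicity from a prior probability and
    a sequence of mutually independent likelihood ratios."""
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    prior_odds = prior / (1.0 - prior)
    # math.prod of an empty sequence is 1, so no data leaves the prior unchanged.
    posterior_odds = prior_odds * math.prod(likelihood_ratios)
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical component LRs (e.g., cosegregation, family history, co-occurrence).
print(posterior_probability(0.12, [15.0, 4.0, 0.8]))
```

Because the components enter only through their product, evidence against pathogenicity (an LR below 1) and evidence in favor (an LR above 1) are weighed on the same scale.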

The Two-Component Mixture Model

The formal classification model combines data on evolutionary sequence conservation, functional assays, and certain genetic measures, denoted “Data,” and family history, denoted FH, on a series of missense mutations, known neutral polymorphisms, and known pathogenic variants. The model takes the form of a two-component multivariate mixture model in which one component describes variation in the measures associated with neutral variants V and the other component describes variation associated with pathogenic variants/mutations M. In this formulation, the values of the classification variable V/M for the missense mutations are unknown and are the quantities of interest. Statistical inference for these quantities involves standard Bayesian latent variable methods. The probability that an individual variant is classified as pathogenic will be calculated as the posterior probability that the variant is from the pathogenic component in the mixture model, i.e., Pr(M | Data, FH). The model may be parameterized as conditional on the family history/phenotype data associated with variants—an approach that makes the analysis more robust to ascertainment biases. In this formulation, the contribution to the likelihood associated with known pathogenic variants is of the form Pr(Data | M, FH), the contribution associated with known neutral variants is of the form Pr(Data | V, FH) and the contribution of variants of unknown significance is Pr(Data | FH). Hence:

$$\Pr(\text{Data} \mid \text{FH}) = \Pr(M \mid \text{FH})\,\Pr(\text{Data} \mid M, \text{FH}) + \Pr(V \mid \text{FH})\,\Pr(\text{Data} \mid V, \text{FH}).$$


Strengths of this approach include that it brings together complementary sources of data and that it combines data on variants whose classification is known (known neutral and pathogenic variants) and those whose classification is not (UVs). Each of the variables in “Data” contributes to estimation of the unmeasured, gold standard variable V/M. Hence, the conditional distributions Pr(Data | V, FH) and Pr(Data | M, FH) can be estimated more accurately than if the variables were considered individually. Incorporating data on variants whose classification is known improves our ability to estimate these distributions and improves robustness to ascertainment bias provided that the variants of known disposition were identified via the same process as the variants of unknown disposition [Zhou et al., 2005]. This modeling approach allows one to simultaneously characterize the association between the classification variable (V/M) and the various data sources and to estimate the unknown classification of the missense mutations and, importantly, allows for estimation of the sensitivity, specificity, and PVs of the various functional and sequence conservation assays based on the estimated gold standard variable V/M.
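To make the idea of pooling classified and unclassified variants concrete, the following is a deliberately simplified sketch and not the Working Group's model: it uses a single hypothetical evidence score per variant, normal component distributions, no conditioning on family history, and a maximum-likelihood EM algorithm in place of the Bayesian latent variable methods mentioned above. Known pathogenic and neutral variants keep fixed memberships, and at least one of each is required.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and SD."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_mixture(scores, labels, n_iter=100, min_sd=0.05):
    """Semi-supervised EM for a two-component normal mixture.

    labels: "M" (known pathogenic), "V" (known neutral), or None (UV).
    Returns, for each variant, the probability of the pathogenic component.
    """
    # Known variants are fixed at 1 or 0; unclassified variants start at 0.5.
    resp = [1.0 if lab == "M" else 0.0 if lab == "V" else 0.5 for lab in labels]
    n = len(scores)
    for _ in range(n_iter):
        # M-step: weighted mixing proportion, means, and SDs.
        w_m = sum(resp)
        w_v = n - w_m
        pi_m = w_m / n
        mu_m = sum(r * x for r, x in zip(resp, scores)) / w_m
        mu_v = sum((1.0 - r) * x for r, x in zip(resp, scores)) / w_v
        var_m = sum(r * (x - mu_m) ** 2 for r, x in zip(resp, scores)) / w_m
        var_v = sum((1.0 - r) * (x - mu_v) ** 2 for r, x in zip(resp, scores)) / w_v
        sd_m = max(min_sd, math.sqrt(var_m))
        sd_v = max(min_sd, math.sqrt(var_v))
        # E-step: update membership probabilities of unclassified variants only.
        for i, (x, lab) in enumerate(zip(scores, labels)):
            if lab is None:
                a = pi_m * normal_pdf(x, mu_m, sd_m)
                b = (1.0 - pi_m) * normal_pdf(x, mu_v, sd_v)
                if a + b > 0.0:
                    resp[i] = a / (a + b)
    return resp


# Hypothetical scores: four known pathogenic, four known neutral, two UVs.
scores = [2.8, 3.1, 3.4, 2.9, 0.1, -0.2, 0.3, 0.0, 3.0, 0.2]
labels = ["M"] * 4 + ["V"] * 4 + [None, None]
for score, label, p in zip(scores, labels, fit_mixture(scores, labels)):
    print(f"score={score:5.1f}  label={label}  P(pathogenic)={p:.3f}")
```

In this toy setting the classified variants anchor the two components, while the unclassified variants both receive a membership probability and contribute to the parameter estimates, which is the feature of the mixture formulation emphasized above.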

In addition to the above-described mixture models, there are other bioinformatic tools that could be applied to the problem of combining these various sources of data into a classifier. These include neural networks, various cluster-type analyses, and adaptive learning. To date, it is not clear whether these methods have improved the predictions in variant classification beyond those obtained by rule-based algorithms [Chan et al., 2007; Balasubramanian et al., 2005]; however, they hold promise and warrant further exploration.

Prior Probability

As outlined above, Bayesian approaches to integrating different types of data start with a prior probability of pathogenicity of a variant. Probabilities differ among genes, and prior probability of different variants in the same gene may differ based on the clinical scenario and features of the variant and the gene. One can think about prior probability as it relates to the individual being tested, that is, what is the probability before any testing is done, that the individual to be tested carries a pathogenic variant in the given gene or genes to be tested? For many of the common cancer predisposition genes, there are a variety of algorithms that can provide these probabilities, or they can be taken from estimates based on locus heterogeneity studies. For example, we may know that 80% of classic Li-Fraumeni syndrome families have a germline mutation in the TP53 gene. We also can define the phenotype more broadly. For example, the probability could include additional factors such as tumor characteristics (e.g., estrogen receptor status, grade, and cytokeratin expression in breast cancer and microsatellite instability (MSI) in hereditary colorectal cancer [Hofstra et al., 2008]).

On the other hand, we can think of the prior probability as it relates to the particular variant in a given gene that has just been identified in a patient. In this case, the prior probability could be based on previous knowledge of the particular variant from other studies or, if it is a new variant, the prior probability could incorporate information on the type of variant, the domain of the protein in which it is located, or if it is evolutionarily conserved. For example in a recent study, Easton et al. [2007] estimated that 12% of all missense mutations in the BRCA1 and BRCA2 genes were pathogenic, but when restricted to three specific functional domains, 35% of the variants in these domains were estimated to be pathogenic; these figures could be used as prior probabilities in assessing new variants that have these properties. In contrast to the case of BRCA1 and BRCA2, for the MMR genes MLH1 and MSH2, a majority of classifiable variants appear to be pathogenic (49 pathogenic, 26 neutral [Chao et al., 2008]).

Since the process of classifying variants is likely to be iterative for each gene, the posterior probability from the last classification effort can be used as the prior probability for the next. Thus, all previous data are implicitly or explicitly included in the new calculation of prior probability. It will be important for the relevant research and clinical communities to develop a standard way of expressing such outcomes.

A Note on Independence

The basic model outlined above assumes that the individual components of the multifactorial model are mutually independent, thus allowing one to simply multiply the corresponding LRs from each component. While this is true for much, if not all, of the genetic data, it is easy to imagine that other sorts of evidence might be strongly correlated. In a trivial example, if one were using both microsatellite instability and immunohistochemistry data independently for classification of MMR variants, the evidence would be overstated, given that these two assays are very highly correlated. One way to deal with this would be to consider all possible combinations (e.g., IHC+ MSI−, etc.) with a corresponding LR associated with each combination. This approach was taken in analyzing histological grade and hormone receptor status of tumors from carriers of BRCA1 UVs [Chenevix-Trench et al., 2006]. Another approach is to use statistical methods to identify the optimal combination of correlated functional or in silico assays, although this presupposes the availability of large data sets of known pathogenic and neutral mutations. Alternatively, the dependence among assays can be estimated jointly using methods such as those described above.
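The effect of ignoring correlation can be seen in a small numerical sketch. The tumor counts below are entirely hypothetical and were chosen so that the IHC and MSI results are strongly correlated; the category labels are illustrative. The code compares the LR assigned to a joint result with the product of the two marginal LRs.

```python
# Hypothetical tumor counts by (IHC result, MSI result); illustrative only.
PATHOGENIC_COUNTS = {("loss", "high"): 80, ("loss", "stable"): 5,
                     ("intact", "high"): 5, ("intact", "stable"): 10}
NEUTRAL_COUNTS = {("loss", "high"): 5, ("loss", "stable"): 5,
                  ("intact", "high"): 5, ("intact", "stable"): 85}


def proportion(counts, predicate):
    """Fraction of tumors whose (IHC, MSI) key satisfies the predicate."""
    total = sum(counts.values())
    return sum(n for key, n in counts.items() if predicate(key)) / total


def joint_lr(ihc, msi):
    """LR for the observed combination, treating the pair as one data item."""
    match = lambda key: key == (ihc, msi)
    return proportion(PATHOGENIC_COUNTS, match) / proportion(NEUTRAL_COUNTS, match)


def naive_lr(ihc, msi):
    """Product of the marginal LRs, which wrongly assumes independence."""
    ihc_match = lambda key: key[0] == ihc
    msi_match = lambda key: key[1] == msi
    lr_ihc = proportion(PATHOGENIC_COUNTS, ihc_match) / proportion(NEUTRAL_COUNTS, ihc_match)
    lr_msi = proportion(PATHOGENIC_COUNTS, msi_match) / proportion(NEUTRAL_COUNTS, msi_match)
    return lr_ihc * lr_msi


print("joint LR:", joint_lr("loss", "high"))
print("naive product of marginal LRs:", naive_lr("loss", "high"))
```

With these counts the naive product is several times larger than the joint LR, which is the overstatement of evidence described above.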


When employing comparison groups to build models or LRs for a given type of evidence, investigators may use historical data that has been generated on reference groups of affected carriers of pathogenic variants vs. mutation-negative affected individuals (either “sporadic” nonfamilial or familial cases). For certain types of evidence there are no sufficiently large gold standard reference groups of pathogenic missense variants that can be applied, since most variants that are known to be pathogenic result in complete lack of message. For certain evidence, such as that from tumor pathology, loss of heterozygosity, microsatellite instability, or co-occurrence, we must assume that the same mechanism of action holds for missense variants as for truncating mutations. In some cases this may be a reasonable assumption, but without empirical evidence from a large set of known pathogenic missense mutations in a variety of functional domains it is difficult to know a priori how much confidence to place in such evidence. Protein-specific factors are also important. For example, for large proteins with multiple functional domains, such as BRCA1 and BRCA2, data may need to be validated separately for different domains. Studies on BRCA variants generally focus on those in the BRCA2 DBD and BRCA1 BRCT domains.

Another consideration in combining data to arrive at a single conclusion about a given variant is whether the evidence is expressed as an OR (e.g., the odds that a given variant is pathogenic vs. neutral are 250:1) or as a posterior probability (e.g., a variant has a 99.6% chance of being pathogenic). In the former case, no assumptions need be made about the prior likelihood that a given variant is pathogenic, while in the latter case, this is implicitly or explicitly included in the calculation. For example, if the prior probability of pathogenicity is 50% (prior odds are 1), the above statements are equivalent. If, however, the prior odds are 10:1 in favor of pathogenicity, then an OR of 250:1 corresponds to a posterior probability of 99.96% (posterior odds of 2,500:1). However, this presupposes that there is prior information available that is statistically independent of that included in the multifactorial LR model. It would be useful if a standard way of separating prior and posterior information could be agreed upon by the relevant research and clinical communities.
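The two figures in this example follow from multiplying the prior odds by the OR and converting the resulting odds to a probability:

```latex
\begin{align*}
\text{prior odds } 1{:}1: &\quad \text{posterior odds} = 1 \times 250 = 250,
  &\quad \frac{250}{250 + 1} &\approx 0.996, \\
\text{prior odds } 10{:}1: &\quad \text{posterior odds} = 10 \times 250 = 2500,
  &\quad \frac{2500}{2500 + 1} &\approx 0.9996.
\end{align*}
```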

In the case of a UV, the key requirement is to establish the relationship between each type of evidence (e.g., sequence conservation and severity of the substitution for missense variants, splice prediction for potential splice-site variants, and any functional assays) and the probability that a variant is pathogenic. This can best be done through systematic studies of a large number of variants for which genetic data are available to calibrate these approaches, using, for example, the two-component mixture model described above. Recent studies are addressing this problem for computational predictions based on sequence variation and statistical genetics [Chan et al., 2007; Tavtigian et al., 2008b; Easton et al., 2007; Barnetson et al., 2008; Chao et al., 2008]. Future studies on tumor pathology and in vitro functional assays should also focus on this goal. More discussion about this topic can be found in the relevant companion articles in this issue.

For new variants or others that remain of unknown significance (Class 3 in the new system), the challenge is to assess the predictive values of the various data types and the relationships among them to find a probability of pathogenicity that can be used in the clinic. This can best be done through systematic studies of a large number of variants in which genetic data are available for the calibration of these approaches.
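Once a posterior probability is available, mapping it onto the five classes is mechanical. The sketch below is illustrative only: the cut-points are an assumption, taken to be those proposed in Plon et al. [2008] (they are not restated in this article and should be checked against that source), and the function name is hypothetical.

```python
def variant_class(posterior: float) -> int:
    """Map a posterior probability of pathogenicity to a class from 1 to 5.

    Cut-points are assumed to follow Plon et al. [2008]:
      Class 5: > 0.99          Class 4: 0.95 to 0.99
      Class 3: 0.05 to 0.949   Class 2: 0.001 to 0.049
      Class 1: < 0.001
    """
    if not 0.0 <= posterior <= 1.0:
        raise ValueError("posterior must be a probability in [0, 1]")
    if posterior > 0.99:
        return 5
    if posterior >= 0.95:
        return 4
    if posterior >= 0.05:
        return 3
    if posterior >= 0.001:
        return 2
    return 1


# A variant with posterior odds of 250:1 (probability of about 0.996).
print(variant_class(250.0 / 251.0))  # 5

# A variant for which the evidence is uninformative (probability 0.5).
print(variant_class(0.5))  # 3
```

Under these assumed cut-points, Class 3 spans a wide range of probabilities, which is why further calibration of each data type is needed before such variants can be moved into a more informative class.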


There are several implicit assumptions that should be noted here regarding the methods described in this work. First, in terms of the genetic information, we generally assume that the penetrance, which is used here to mean the age- and site-specific risks of disease for carriers and noncarriers, is accurately known (but this is also true for genetic counseling of known pathogenic mutations). We are also implicitly assuming that all variants either have this penetrance or are not associated with any increased risks. In practice, if a variant were of intermediate risk, we would expect that different sources of evidence might provide conflicting information. In such cases the resulting risk figures would in fact likely be intermediate between the known pathogenic and wild-type situations. The classification system should take this into account, if possible, as it may affect any subsequent genetic counseling. In addition to the penetrance, any other phenotypic feature (e.g., MSI, pathology) used to calculate the prior probability that a family carries a mutation in a given gene is assumed to be the same for all pathogenic mutations. This could be an issue for certain genes in which the relevant data were derived from one class of mutations (e.g., protein truncating) and the variants are largely from a different class of mutations (e.g., missense). It is not unreasonable that for certain features these two types of mutations may have different phenotypic effects, so it will be important to validate these findings in a sample of known pathogenic mutations of each type. For example, there seem to be emerging data in several genes that loss of heterozygosity patterns may differ between missense and null mutations [Spirio et al., 1998; Spurdle et al., 2008b].

Another major assumption inherent in this process is that the probabilities of a variant being deleterious based upon sequence predictions or functional assays are generalizable to variants that were not previously part of those studies, both within the same gene (functional assays) and, potentially very usefully, in other genes (sequence analysis). Thus, can general estimates of pathogenicity from the Align-GVGD program (A-GVGD; GV, Grantham variation; GD, Grantham deviation), derived from the large Myriad dataset [Tavtigian et al., 2008b], be applied to variants in, for instance, p16 or MSH2? It will require a large data set to validate this assumption. If it proves difficult to find a general rule that applies to many disease genes, then it will be necessary to derive individual predictive models for each gene, with each entailing a large amount of laboratory work to generate the required evolutionary sequence information, along with many bioinformatic/statistical analyses to develop and validate the model.

Quantitative vs. Qualitative Conclusions

We have stressed the desirability of deriving objective, quantitative probabilities of pathogenicity. However, even in the absence of ORs/LRs for functional assays and tumor pathology features, these measures can contribute to classification, much as in vitro data are now used to classify carcinogens. For example, loss of MMR protein expression by immunohistochemistry in multiple colon tumors from a kindred in which multiple individuals carry the same MMR variant can be supporting evidence for the pathogenicity of that variant. Such conclusions are also strongly supported by in vitro loss of MMR activity (or in the case of CDKN2A, cell cycle arrest), even though the PPVs of the individual assays are not yet clear [Ou et al., 2007]. If a posterior probability can be estimated from other data, such qualitative evidence might be used to reclassify a variant by one or more classes in the new system. It is likely that for each gene, a panel of experts will need to certify the quality of the supporting data used to determine the variant’s class, whether it is done qualitatively or quantitatively.


In this work, we discuss some of the issues that arise when combining evidence from different sources into a cohesive model to assess pathogenicity of variants found during genetic testing for hereditary cancer syndromes. The key to our ability to provide a more informative classification paradigm will be to systematically utilize all information in a quantitative way, including outputs from available bioinformatic tools, tumor pathology, and in vitro functional assays. This will allow us to assign each variant a posterior probability that it is pathogenic, and to classify variants based on this probability. A short-term goal for all of these fields should be the systematic study of large numbers of variants that can be classified by genetic data, to calibrate all of these methods in terms of sensitivity, specificity, and PV. Until all of these items can be quantified, experts in each field should validate the usefulness of qualitative results in classifying variants. However, we stress that there are assumptions that underlie not only the individual components of a multifactorial approach, but also the way in which these components are combined to arrive at an overall conclusion. Although many of these assumptions are difficult to verify, their robustness and the conclusions that rely on them can and should be evaluated so that they can be integrated reliably into clinical genetics. Last, we believe that the same basic approach used for hereditary cancer syndromes eventually can be used for all genetic counseling, with a variant being weighted according to its relative likelihood of being a neutral or pathogenic variant, based on all available data.


Supported by National Institutes of Health (NIH) grants CA116167 (to F.J.C.) and CA96536 (to M.S.G.) and the Lake Champlain Cancer Research Organization (to M.S.G.).

Grant sponsors: Australian National Health and Medical Research Council; Lake Champlain Cancer Research Organization; National Institutes of Health (NIH) Breast Cancer Specialized Program in Research Excellence; Grant number: P50 CA116201; Grant sponsor: NIH; Grant number: CA116167; CA96536.


Members of the IARC Working Group on Unclassified Genetic Variants

Paolo Boffetta, IARC, France; Fergus Couch, Mayo Clinic, USA; Niels de Wind, Leiden University, the Netherlands; Douglas Easton, Cambridge University, UK; Diana Eccles, University of Southampton, UK; William Foulkes, McGill University, Canada; Maurizio Genuardi, University of Florence, Italy; David Goldgar, University of Utah, USA; Marc Greenblatt, University of Vermont, USA; Robert Hofstra, University Medical Center Groningen, the Netherlands; Frans Hogervorst, Netherlands Cancer Institute, the Netherlands; Nicoline Hoogerbrugge, University Medical Center Nijmegen, the Netherlands; Sharon Plon, Baylor University, USA; Paolo Radice, Istituto Nazionale Tumori, Italy; Lene Rasmussen, Roskilde University, Denmark; Olga Sinilnikova, Hospices Civils de Lyon, France; Amanda Spurdle, Queensland Institute of Medical Research, Australia; and Sean Tavtigian, IARC, France.


  • Agostini M, Tibiletti MG, Lucci-Cordisco E, Chiaravalli A, Morreau H, Furlan D, Boccuto L, Pucciarelli S, Capella C, Boiocchi M, Viel A. Two PMS2 mutations in a Turcot syndrome family with small bowel cancers. Am J Gastroenterol. 2005;100:1886–1891. [PubMed]
  • Antoniou AC, Pharoah PP, Smith P, Easton DF. The BOADICEA model of genetic susceptibility to breast and ovarian cancer. Br J Cancer. 2004;91:1580–1590. [PMC free article] [PubMed]
  • Auclair J, Busine MP, Navarro C, Ruano E, Montmain G, Desseigne F, Saurin JC, Lasset C, Bonadona V, Giraud S, Puisieux A, Wang Q. Systematic mRNA analysis for the effect of MLH1 and MSH2 missense and silent mutations on aberrant splicing. Hum Mutat. 2006;27:145–154. [PubMed]
  • Balasubramanian S, Xia Y, Freinkman E, Gerstein M. Sequence variation in G-protein-coupled receptors: analysis of single nucleotide polymorphisms. Nucleic Acids Res. 2005;33:1710–1721. [PMC free article] [PubMed]
  • Barnetson RA, Cartwright N, van Vliet A, Haq N, Drew K, Farrington S, Williams N, Warner J, Campbell H, Porteous ME, Dunlop MG. Classification of ambiguous mutations in DNA mismatch repair genes identified in a population-based study of colorectal cancer. Hum Mutat. 2008;29:367–374. [PubMed]
  • Bayrak-Toydemir P, McDonald J, Mao R, Phansalkar A, Gedge F, Robles J, Goldgar D, Lyon E. Likelihood ratios to assess genetic evidence for clinical significance of uncertain variants: hereditary hemorrhagic telangiectasia as a model. Exp Mol Pathol. 2008;85:45–49. [PubMed]
  • Berry DA, Iversen ES, Jr, Gudbjartsson DF, Hiller EH, Garber JE, Peshkin BN, Lerman C, Watson P, Lynch HT, Hilsenbeck SG, Rubinstein WS, Hughes KS, Parmigiani G. BRCAPRO validation, sensitivity of genetic testing of BRCA1/BRCA2, and prevalence of other breast cancer susceptibility genes. J Clin Oncol. 2002;20:2701–2712. [PubMed]
  • Chan PA, Duraisamy S, Miller PJ, Newell JA, McBride C, Bond JP, Raevaara T, Ollila S, Nyström M, Grimm AJ, Christodoulou J, Oetting WS, Greenblatt MS. Interpreting missense variants: comparing computational methods in human disease genes CDKN2A, MLH1, MSH2, MECP2, and tyrosinase (TYR) Hum Mutat. 2007;28:683–693. [PubMed]
  • Chao EC, Velasquez JL, Witherspoon MS, Rozek LS, Peel D, Ng P, Gruber SB, Watson P, Rennert G, Anton-Culver H, Lynch H, Lipkin SM. Accurate classification of MLH1/MSH2 missense variants with multivariate analysis of protein polymorphisms-mismatch repair (MAPP-MMR) Hum Mutat. 2008;29:852–860. [PubMed]
  • Chenevix-Trench G, Healey S, Lakhani S, Waring P, Cummings M, Brinkworth R, Deffenbaugh AM, Burbidge LA, Pruss D, Judkins T, Scholl T, Bekessy A, Marsh A, Lovelock P, Wong M, Tesoriero A, Renard H, Southey M, Hopper JL, Yannoukakos K, Brown M, Easton D, Tavtigian SV, Goldgar D, Spurdle AB, kConFab Investigators Genetic and histopathologic evaluation of BRCA1 and BRCA2 DNA sequence variants of unknown clinical significance. Cancer Res. 2006;66:2019–2027. [PubMed]
  • Conlon EM. A Bayesian mixture model for meta analysis of microarray studies. Funct Integr Genomics. 2008;8:43–53. [PubMed]
  • Couch FJ, Rasmussen L, Hofstra R, Monteiro AANM, Greenblatt MS, de Wind N, IARC Unclassified Genetic Variants Working Group Assessment of functional effects of unclassified genetic variants. Hum Mutat. 2008;29:1314–1326. [PMC free article] [PubMed]
  • Easton DF, Deffenbaugh AM, Pruss D, Frye C, Wenstrup RJ, Allen-Brady K, Tavtigian SV, Monteiro AN, Iversen ES, Couch FJ, Goldgar DE. A systematic genetic assessment of 1,433 sequence variants of unknown clinical significance in the BRCA1 and BRCA2 breast cancer-predisposition genes. Am J Hum Genet. 2007;81:873–883. [PubMed]
  • Evans DG, Eccles DM, Rahman N, Young K, Bulman M, Amir E, Shenton A, Howell A, Lalloo F. A new scoring system for the chances of identifying a BRCA1/2 mutation outperforms existing models including BRCAPRO. J Med Genet. 2004;41:474–480. [PMC free article] [PubMed]
  • Evans DG, Lalloo F, Wallace A, Rahman N. Update on the Manchester Scoring System for BRCA1 and BRCA2 testing. J Med Genet. 2005;42:e39. [PMC free article] [PubMed]
  • Goldgar DE, Easton DF, Deffenbaugh AM, Monteiro AN, Tavtigian SV, Couch FJ. Integrated evaluation of DNA sequence variants of unknown clinical significance: application to BRCA1 and BRCA2. Am J Hum Genet. 2004;75:535–544. [PubMed]
  • Hofstra RW, Spurdle AB, Eccles D, Foulkes WD, de Wind N, Hoogerbrugge N, Hogervorst FBL, IARC Unclassified Genetic Variants Working Group Tumor characteristics as an analytic tool for classifying genetic variants of uncertain clinical significance. Hum Mutat. 2008;29:1292–1303. [PMC free article] [PubMed]
  • Howlett NG, Taniguchi T, Olson S, Cox B, Waisfisz Q, De Die-Smulders C, Persky N, Grompe M, Joenje H, Pals G, Ikeda H, Fox EA, D’Andrea AD. Biallelic inactivation of BRCA2 in Fanconi anemia. Science. 2002;297:606–609. [PubMed]
  • IARC . Monographs on the Evaluation of Carcinogenic Risks to Humans. International Agency for Research on Cancer; Lyon, France: [Last accessed: 11 September 2008]. 2006. Preamble; amended January 2006. Available at:
  • Judkins T, Hendrickson BC, Deffenbaugh AM, Eliason K, Leclair B, Norton MJ, Ward BE, Pruss D, Scholl T. Application of embryonic lethal or other obvious phenotypes to characterize the clinical significance of genetic variants found in trans with known deleterious mutations. Cancer Res. 2005;65:10096–10103. [PubMed]
  • Karchin R, Monteiro AN, Tavtigian SV, Carvalho MA, Sali A. Functional impact of missense variants in BRCA1 predicted by supervised learning. PLoS Comput Biol. 2007;3:e26. [PubMed]
  • Liu CY, Flesken-Nikitin A, Li S, Zeng Y, Lee WH. Inactivation of the mouse Brca1 gene leads to failure in the morphogenesis of the egg cylinder in early postimplantation development. Genes Dev. 1996;10:1835–1843. [PubMed]
  • Mitchell AA, Chakravarti A, Cutler DJ. On the probability that a novel variant is a disease-causing mutation. Genome Res. 2005;15:960–966. [PubMed]
  • Ou J, Niessen RC, Lutzen A, Sijmons RH, Kleibeuker JH, de Wind N, Rasmussen LJ, Hofstra RM. Functional analysis helps to clarify the clinical importance of unclassified variants in DNA mismatch repair genes. Hum Mutat. 2007;28:1047–1054. [PubMed]
  • Petersen GM, Parmigiani G, Thomas D. Missense mutations in disease genes: a Bayesian approach to evaluate causality. Am J Hum Genet. 1998;62:1516–1524. [PubMed]
  • Plon SE, Eccles DM, Easton DF, Foulkes W, Genuardi M, Greenblatt MS, Hogervorst FBL, Hoogerbrugge N, Spurdle AB, Tavtigian S, IARC Unclassified Genetic Variants Working Group Sequence variant classification and reporting: recommendations for improving the interpretation of cancer susceptibility genetic test results. Hum Mutat. 2008;29:1282–1291. [PMC free article] [PubMed]
  • Rahman N, Scott RH. Cancer genes associated with phenotypes in monoallelic and biallelic mutation carriers: new lessons from old players. Hum Mol Genet. 2007;16:R60–R66. Spec No. [PubMed]
  • Ricciardone MD, Ozçelik T, Cevher B, Ozdağ H, Tuncer M, Gürgey A, Uzunalimoğlu O, Cetinkaya H, Tanyeli A, Erken E, Oztürk M. Human MLH1 deficiency predisposes to hematological malignancy and neurofibromatosis type 1. Cancer Res. 1999;59:290–293. [PubMed]
  • Spirio LN, Samowitz W, Robertson J, Robertson M, Burt RW, Leppert M, White R. Alleles of APC modulate the frequency and classes of mutations that lead to colon polyps. Nat Genet. 1998;20:385–388. [PubMed]
  • Spurdle AB, Couch FJ, Hogervorst FBL, Radice P, Sinilnikova OM, IARC Unclassified Genetic Variants Working Group Prediction and assessment of splicing alterations. Hum Mutat. 2008a;29:1304–1313. [PMC free article] [PubMed]
  • Spurdle AB, Lakhani SR, Healey S, Parry S, Da Silva LM, Brinkworth R, Hopper JL, Brown MA, Babikyan D, Chenevix-Trench G, Tavtigian SV, Goldgar DE. Clinical classification of BRCA1 and BRCA2 DNA sequence variants: the value of cytokeratin profiles and evolutionary analysis—a report from the kConFab Investigators. J Clin Oncol. 2008b;26:1657–1663. [PubMed]
  • Tavtigian S, Byrnes GB, Goldgar DE, Thomas A. Classification of rare missense substitutions, using risk surfaces, with genetic- and molecular-epidemiology applications. Hum Mutat. 2008a;29:1342–1354. [PubMed]
  • Tavtigian SV, Greenblatt MS, Lesueur F, Byrnes GB, IARC Unclassified Genetic Variants Working Group In silico analysis of missense substitutions using sequence-alignment based methods. Hum Mutat. 2008b;29:1327–1336. [PMC free article] [PubMed]
  • Thompson D, Easton DF, Goldgar DE. A full-likelihood method for the evaluation of causality of sequence variants from family data. Am J Hum Genet. 2003;73:652–655. [PubMed]
  • Thornton T, McPeek MS. Case-control association testing with related individuals: a more powerful quasi-likelihood score test. Am J Hum Genet. 2007;81:321–337. [PubMed]
  • Wang Q, Lasset C, Desseigne F, Frappaz D, Bergeron C, Navarro C, Ruano E, Puisieux A. Neurofibromatosis and early onset of cancers in hMLH1-deficient children. Cancer Res. 1999;59:294–297. [PubMed]
  • Zhou XI, Iversen ES, Jr, Parmigiani G. Classification of missense mutations of disease genes. J Am Stat Assn. 2005;100:51–60. [PMC free article] [PubMed]