Biostatistics. 2010 July; 11(3): 519–532.
PMCID: PMC2883298

Statistical inference on the penetrances of rare genetic mutations based on a case–family design

Hong Zhang
Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD 20892, USA and Department of Statistics and Finance, University of Science and Technology of China, Hefei, Anhui 230026, People's Republic of China
Sylviane Olschwang
Institut National de la Santé et de la Recherche Médicale (INSERM), Unité 891, Centre de Recherches en Cancérologie de Marseille, 13009 Marseille, France and Department of Oncogenetics, Institut Paoli-Calmettes, 13009 Marseille, France


We propose a formal statistical inference framework for the evaluation of the penetrance of a rare genetic mutation using family data generated under a kin–cohort type of design, where phenotype and genotype information from first-degree relatives (sibs and/or offspring) of case probands carrying the targeted mutation are collected. Our approach is built upon a likelihood model with some minor assumptions, and it can be used for age-dependent penetrance estimation that permits adjustment for covariates. Furthermore, the derived likelihood allows unobserved risk factors that are correlated within family members. The validity of the approach is confirmed by simulation studies. We apply the proposed approach to estimating the age-dependent cancer risk among carriers of the MSH2 or MLH1 mutation.

Keywords: Case–family design, Penetrance, Proportional hazards model, Rare mutation, Unobserved risk factors


An increasing number of mutations have been found to be associated with an elevated risk for various genetic disorders. A precise estimation of the age-dependent risk for people carrying the disease-causing mutations is essential for defining prevention strategies and understanding underlying mechanisms of the diseases. When a disease–causal mutation is identified, a precise estimation of its penetrance is possible using the kin–cohort design (Wacholder and others, 1998), which has been studied extensively in the literature, for example, Gail, Pee, Benichou, and Carroll (1999), Chatterjee and Wacholder (2001), Chatterjee and others (2006), Wang and others (2007), among others. Gail, Pee, and Carroll (1999) studied the advantages and disadvantages of the kin–cohort design. They found that the kin–cohort design has several practical advantages, including comparatively rapid execution, modest reductions in required sample sizes compared with cohort or case–control designs, and the ability to study the effects of an autosomal dominant mutation on several disease outcomes; the disadvantages include 2 sources of bias: a proband's decision to participate is influenced by the disease status of his relatives and the proband is unable to recall the disease histories of relatives accurately.

In a standard kin–cohort design, a volunteer (either affected or unaffected) agrees to be genotyped, and the phenotype information on the disease histories of his or her first-degree relatives is obtained through a questionnaire. When the information on both phenotype and genotype for relatives is available, alternative approaches are needed in order to take full advantage of all available data while correcting for bias due to the effects of ascertainment. In this paper, we assume that all probands are affected carriers, though the proposed approach can be extended to include unaffected probands carrying the mutation. Recently, Wang and others (2006) proposed a nonparametric method for estimating the penetrance of a rare mutation. Olschwang and others (2009) proposed an alternative parametric logistic regression model. Both approaches rely on the assumption that the penetrance of noncarriers is zero. This assumption might not be true for many genetic diseases. The penetrance estimate could be severely biased if this assumption was not valid in real applications.

In this paper, we focus on rare mutations and aim at developing a rigorous statistical inference framework for such case–family design. The main difference between this design and the standard kin–cohort design is that the former collects information on both phenotypes and genotypes of the probands’ relatives, while the latter simply collects the phenotypes of the relatives through a questionnaire. The assumption of zero penetrance for the noncarriers is not required for our approach. Furthermore, the proposed approach is based on a likelihood model conditioned on the phenotypes of all individuals; therefore, the derived estimate should not suffer from the biases mentioned for the kin–cohort design.

Covariates such as gender and ethnicity can be incorporated easily in our approach. Multiple rare mutations can also be handled in the context of the proposed conditional likelihood framework. The derivation of the conditional likelihood functions requires only minor assumptions. The maximum likelihood estimates (MLEs) can be obtained through standard optimization algorithms available in mathematical/statistical software. Statistical inference, such as constructing confidence intervals and testing hypotheses for the parameters characterizing the penetrance, can be performed based on standard large-sample theory.

The performance of the proposed approach is examined through simulation studies, which illustrate the desired properties of the approach. Finally, we demonstrate the application of the proposed approach by applying it to a study of Lynch syndrome.


2.1. Notation

Throughout this paper, the mutations responsible for the disease of interest are assumed to be on an autosome. In the case–family design considered, some unrelated affected individuals (cases) collected from a case–control study are genotyped, and those cases carrying the study mutation are termed “case probands”; the first-degree relatives (sibs and/or offspring) of the case probands are interviewed for phenotyping and genotyping. To motivate our approach, we first focus on congenital or early-onset diseases that manifest before the ages at which subjects are ascertained. We want to estimate the age-independent penetrance of a known disease-causing mutation. Suppose some case probands are ascertained, and several first-degree relatives of each case proband are then recruited for genotyping at the disease locus. Throughout this paper, we assume that the disease-causing mutation (allele m is the mutation of wild-type allele M) is rare. Because the mutation is rare, the homozygous genotype mm is seldom seen under Hardy–Weinberg equilibrium, so we consider only 2 genotypes, namely Mm (mutation, denoted by g = 1) and MM (nonmutation, denoted by g = 0). Let the disease penetrance of MM and Mm be f0 and f1, respectively. Let the disease status of an individual be d, which takes value 1 if affected and 0 otherwise.

2.2. Likelihood function

Suppose I unrelated case probands carrying the mutation are ascertained. To derive the likelihood function of the observed data, we need to make the following assumptions:

  • (i) The study mutation is rare.
  • (ii) Hardy–Weinberg equilibrium holds for the corresponding allele, mating is random, and Mendelian inheritance law holds.
  • (iii) The study mutation is independent of the unobserved risk factors.
  • (iv) The disease is rare.
  • (v) There is no interaction effect between the study mutation and the unobserved risk factors. That is, the joint disease penetrance satisfies the following relationship:
    [Equation (2.1)]

where r is a vector of unobserved risk factor values and c1 is a constant independent of r.

Under the assumptions (i)–(v), the likelihood function for the observed genotypes of the relatives can be approximated by

\[
L(f_1, f_0) \approx \prod_{i=1}^{I} p_1^{a_i}(1-p_1)^{n_{1i}-a_i}\, p_0^{b_i}(1-p_0)^{n_{0i}-b_i},
\qquad p_1 = \frac{f_1}{f_1+f_0}, \quad p_0 = \frac{1-f_1}{2-f_1-f_0}, \tag{2.2}
\]

where ai (bi) is the number of affected (unaffected) relatives carrying the mutation and n1i (n0i) is the number of affected (unaffected) relatives of the ith case proband, i = 1,…,I. Refer to Appendix A of the supplementary material (available at Biostatistics online) for the derivation of (2.2). Notice that p1 (p0) is the probability that a relative is a carrier, given that he/she is affected (unaffected) and the case proband is a carrier. It is seen that p0 has exactly the same value as that given in Wang and others (2006) when f0 = 0. Furthermore, when f0 = 0 (i.e. a noncarrier has penetrance 0), all the affected relatives are carriers and they provide no information on f1.
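As an illustration of how the likelihood can be evaluated, the following is a minimal Python sketch rather than the authors' implementation. It assumes that a first-degree relative of a carrier proband carries the rare mutation with prior probability 1/2, so that Bayes' rule gives p1 = f1/(f1 + f0) and p0 = (1 − f1)/(2 − f1 − f0); these forms agree with the numerical values quoted in Section 2.4. The function names and the tuple layout (a_i, n1_i, b_i, n0_i) are hypothetical.

```python
import math


def carrier_probs(f1, f0):
    """Return (p1, p0) for penetrances f1 (carriers) and f0 (noncarriers).

    Assumption (not quoted from the paper): the prior carrier probability of
    a first-degree relative of a carrier proband is 1/2.
    """
    p1 = f1 / (f1 + f0)              # P(carrier | affected relative)
    p0 = (1 - f1) / (2 - f1 - f0)    # P(carrier | unaffected relative)
    return p1, p0


def log_likelihood(f1, f0, families):
    """Log-likelihood for families given as (a_i, n1_i, b_i, n0_i) tuples.

    Requires 0 < f0 < 1 and 0 < f1 < 1 so that every logarithm is finite.
    """
    p1, p0 = carrier_probs(f1, f0)
    total = 0.0
    for a, n1, b, n0 in families:
        total += a * math.log(p1) + (n1 - a) * math.log(1 - p1)
        total += b * math.log(p0) + (n0 - b) * math.log(1 - p0)
    return total
```

For example, `carrier_probs(0.2, 0.1)` returns values equal to 2/3 and 8/17 up to rounding.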

It can be seen from (2.2) that the relatives' genotypes within the same family are conditionally independent given the ascertainment scheme. We want to point out that this is not an assumption but a result derived from the assumptions (i)–(v). An important advantage of this likelihood is that it is independent of the unobserved risk factors, making it suitable for estimating marginal penetrances of carriers and noncarriers.

Assumption (i) is the key assumption and the motivation for this study. Assumptions (ii) and (iii) are common in the literature and are used to derive the conditional mutation distribution of a proband's relatives. Assumption (iv) is a technical one, and our simulation study shows that the performance of the proposed approach is acceptable even when the disease is common, with a prevalence of 0.1. Assumption (v) is equivalent to the multiplicative model for multiple risk factors (see e.g. Gail and others, 2008; Yu and others, 2009). In particular, the following log-linear model satisfies assumption (v):

\[
\log P(d = 1 \mid g, r) = c_2 + a g + b^{\tau} r, \tag{2.3}
\]

where c2 is a constant and a and b are regression parameters. Throughout this paper, “τ” stands for the transpose of a vector. Notice that we do not assume any correlation structure for the unobserved risk factors of family members. Furthermore, the unobserved risk factors can be of any type: discrete or continuous, environmental or genetic.

2.3. Identifiability of f1 and f0

When genotypes are available only for the unaffected relatives of case probands, we see from the likelihood function (2.2) that the penetrances f1 and f0 are not identifiable. However, the 2 penetrances f1 and f0 are identifiable when at least 1 affected relative and 1 unaffected relative are genotyped, provided that f1 > f0 > 0. Actually, there is a one-to-one relationship between the penetrances {f1,f0} and the estimable parameters {p1,p0} when f1 > f0 > 0. This is different from the situation in the standard case–control design, where only the relative risk f1/f0 is identifiable. Notice that in our case–family design, our retrospective likelihood function is conditioned on the mutation status of the proband and disease status. This additional conditioning as well as the assumption of rare mutation make both f1 and f0 identifiable. It is also noticed that f0 and f1 are not identifiable when f1 = f0 but this is not a problem since the major purpose of our case–family design is to estimate the penetrance function of a known risk mutation with f1 > f0.
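The one-to-one relationship can be checked directly. The short derivation below is a sketch that assumes the forms p1 = f1/(f1 + f0) and p0 = (1 − f1)/(2 − f1 − f0), obtained from Bayes' rule with a prior carrier probability of 1/2 for a relative of a carrier proband; this is an assumption stated here, chosen because it agrees with the values quoted in Section 2.4.

```latex
\begin{align*}
p_1 - p_0 &= \frac{f_1}{f_1+f_0} - \frac{1-f_1}{2-f_1-f_0}
           = \frac{f_1 - f_0}{(f_1+f_0)(2-f_1-f_0)},\\[4pt]
f_1 &= \frac{p_1(1-2p_0)}{p_1-p_0}, \qquad
f_0 = \frac{(1-p_1)(1-2p_0)}{p_1-p_0}.
\end{align*}
```

Under these forms, p1 > p0 exactly when f1 > f0, so the inverse map is well defined. When f1 = f0, both p1 and p0 equal 1/2 whatever the common penetrance, which is why that common value cannot be recovered. With unaffected relatives only, p0 alone is estimable, giving one equation in two unknowns.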

2.4. Maximum likelihood estimates

Denote $A = \sum_{i=1}^{I} a_i$, $B = \sum_{i=1}^{I} b_i$, $N_1 = \sum_{i=1}^{I} n_{1i}$, and $N_0 = \sum_{i=1}^{I} n_{0i}$. Then the overall likelihood can be written as

\[
L = p_1^{A}(1-p_1)^{N_1-A}\, p_0^{B}(1-p_0)^{N_0-B}.
\]

Since the above likelihood function is the product of 2 binomial likelihood functions, the MLEs of p1 and p0 are $\hat{p}_1 = A/N_1$ and $\hat{p}_0 = B/N_0$, respectively. Therefore, the MLEs of f1 and f0 are, respectively,

\[
\hat{f}_1 = \frac{\hat{p}_1(1-2\hat{p}_0)}{\hat{p}_1-\hat{p}_0}, \qquad
\hat{f}_0 = \frac{(1-\hat{p}_1)(1-2\hat{p}_0)}{\hat{p}_1-\hat{p}_0},
\]

or equivalently,

\[
\hat{f}_1 = \frac{A(N_0-2B)}{AN_0-BN_1}, \qquad
\hat{f}_0 = \frac{(N_1-A)(N_0-2B)}{AN_0-BN_1}.
\]

When f0 = 0, the MLE of f1 is (1 − 2B/N0)/(1 − B/N0). This estimator is simpler than that of Wang and others (2006) since their method needs to estimate an additional offset for each family. When f0 is not equal to 0, using (1 − 2B/N0)/(1 − B/N0) as an estimator of f1 could produce considerable bias. For example, if f0 = 0.1 and f1 = 0.2, then the estimator (1 − 2B/N0)/(1 − B/N0) converges to (1 − 2p0)/(1 − p0) = 1/9 as the sample size goes to infinity and the relative bias (Rbias) is (1/9 − 0.2)/0.2 = − 4/9.

If all the affected relatives are carriers so that N1 = A, then the MLEs of f0 and f1 are 0 and (1 − 2B/N0)/(1 − B/N0), respectively. This confirms the fact that the affected relatives provide no information on f1 when f0 = 0, as was mentioned in Section 2.2.
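The closed-form estimates and the bias example above can be reproduced with a few lines of Python. This is an illustrative sketch: the function names are hypothetical, and the estimator is written as the inverse of p1 = f1/(f1 + f0) and p0 = (1 − f1)/(2 − f1 − f0), evaluated at the binomial MLEs A/N1 and B/N0.

```python
def mle_penetrances(A, N1, B, N0):
    """Closed-form MLEs (f1_hat, f0_hat) from pooled counts.

    A / N1: carriers among affected relatives; B / N0: carriers among
    unaffected relatives. No constraint to [0, 1] is imposed here.
    """
    p1 = A / N1
    p0 = B / N0
    diff = p1 - p0
    if diff == 0:
        raise ValueError("p1_hat equals p0_hat: penetrances not identifiable")
    f1 = p1 * (1 - 2 * p0) / diff
    f0 = (1 - p1) * (1 - 2 * p0) / diff
    return f1, f0


def naive_f1(B, N0):
    """Estimator of f1 that assumes f0 = 0 and uses unaffected relatives only."""
    p0 = B / N0
    return (1 - 2 * p0) / (1 - p0)
```

With counts chosen so that A/N1 = 2/3 and B/N0 = 8/17 (the limiting values when f1 = 0.2 and f0 = 0.1), `mle_penetrances(200, 300, 800, 1700)` recovers (0.2, 0.1) up to rounding, whereas `naive_f1(800, 1700)` gives 1/9, matching the relative bias of −4/9 noted above.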

With a large sample size, the MLEs $\hat{f}_1$ and $\hat{f}_0$ converge to f1 and f0, respectively, so that they asymptotically lie within the interval [0,1]. When the sample size is not large enough, however, the 2 estimates could be negative or greater than 1. In such situations, we can estimate the penetrances by adding the constraint 0 ≤ f0, f1 ≤ 1.

2.5. Hypothesis testing and confidence interval

It is of interest to test the null hypothesis that the mutation has no effect on the disease (f0 = f1), provided that the genotypes of some affected relatives are available. To test this null hypothesis, we can construct a likelihood ratio test. Since the common penetrance under the null hypothesis is not identifiable, the limiting null distribution of the likelihood ratio test statistic is no longer a standard chi-square distribution. To assess the significance of the likelihood ratio test statistic, we can adopt a permutation test by permuting the disease status of the relatives.
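A permutation test of this kind might be sketched as follows. Several choices here are assumptions made for illustration and are not taken from the paper: relatives are pooled across families before permuting, the alternative is the unconstrained binomial maximum, and the null value p1 = p0 = 1/2 follows from the assumed forms p1 = f1/(f1 + f0) and p0 = (1 − f1)/(2 − f1 − f0) when f1 = f0. The function names are hypothetical.

```python
import math
import random


def _xlogy(x, y):
    """Return x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def lr_statistic(g, d):
    """Likelihood ratio statistic for H0: p1 = p0 = 1/2.

    g: carrier indicators (0/1); d: disease indicators (0/1), same length.
    """
    n_aff = sum(d)
    a = sum(gi for gi, di in zip(g, d) if di == 1)   # affected carriers
    b = sum(gi for gi, di in zip(g, d) if di == 0)   # unaffected carriers
    stat = 0.0
    for k, n in ((a, n_aff), (b, len(d) - n_aff)):
        if n == 0:
            continue
        p = k / n
        ll_hat = _xlogy(k, p) + _xlogy(n - k, 1 - p)
        stat += 2 * (ll_hat - n * math.log(0.5))
    return stat


def permutation_pvalue(g, d, n_perm=2000, seed=0):
    """Permute disease status across relatives and compare LR statistics."""
    rng = random.Random(seed)
    observed = lr_statistic(g, d)
    d_perm = list(d)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(d_perm)
        if lr_statistic(g, d_perm) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)   # add-one correction avoids p = 0
```

Permuting within families instead of across the pooled sample would be a natural alternative when family sizes or structures differ markedly.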

The confidence intervals of the penetrances can be constructed based on the asymptotic normality of the MLEs, with the variance–covariance matrix of the MLEs being estimated by the inverse of the observed information matrix.
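For the age-independent case, such intervals can be sketched with the delta method. The code below is an illustration under stated assumptions: it uses the binomial variances of A/N1 and B/N0 (the inverse observed information for p1 and p0 at the MLE) and propagates them through the assumed closed-form map from (p1, p0) to (f1, f0). The function name and the return layout are hypothetical.

```python
import math


def wald_intervals(A, N1, B, N0, z=1.96):
    """Delta-method Wald intervals for f1 and f0 from pooled counts.

    Returns {"f1": (estimate, se, lower, upper), "f0": (...)}.
    """
    p1, p0 = A / N1, B / N0
    var_p1 = p1 * (1 - p1) / N1
    var_p0 = p0 * (1 - p0) / N0
    diff = p1 - p0
    f1 = p1 * (1 - 2 * p0) / diff
    f0 = (1 - p1) * (1 - 2 * p0) / diff
    # Partial derivatives with respect to (p1, p0).
    grads = {
        "f1": (-p0 * (1 - 2 * p0) / diff**2, p1 * (1 - 2 * p1) / diff**2),
        "f0": (-(1 - p0) * (1 - 2 * p0) / diff**2,
               (1 - p1) * (1 - 2 * p1) / diff**2),
    }
    out = {}
    for name, est in (("f1", f1), ("f0", f0)):
        d1, d0 = grads[name]
        se = math.sqrt(d1**2 * var_p1 + d0**2 * var_p0)
        out[name] = (est, se, est - z * se, est + z * se)
    return out
```

The intervals are not truncated to [0, 1]; with small samples a transformation (for example, on the logit scale) may behave better, which is a design choice left open here.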


3.1. Notation

In most situations, the penetrances depend on age, and we are interested in estimating age-dependent penetrances. Suppose that we observe the ages at diagnosis for all the relatives and the ages at onset for those affected individuals. We will take this information into account in the evaluation of the age-dependent penetrances.

For the ith proband, suppose the information on the phenotypes and genotypes of ni relatives is collected. Let the genotype and affection status of the jth relative (the zeroth relative is the case proband) of the ith case proband be coded by gij and dij, respectively. That is, gij = 1 if the jth relative is a carrier and 0 otherwise, and dij = 1 if the jth relative is affected and 0 otherwise. Let aij and tij (tij is an unobserved value that is greater than aij if the jth relative is unaffected) be the current age and the age at onset of the jth relative, respectively. Let yij = min{tij,aij}.

3.2. Likelihood function

We can formulate a conditional likelihood for the ith family's data as P(gi|di,yi,gi0 = 1,di0 = 1,yi0,ai,ai0), where gi = (gi1,…,gini), di = (di1,…,dini), yi = (yi1,…,yini), and ai = (ai1,…,aini). To derive the likelihood function, we need the following assumption corresponding to the assumption (v) for the age-independent penetrances:

  • (v)' There is no interaction effect between the study mutation and the unobserved risk factors, that is, the density function of the age at onset p(t|g,r) given the study mutation g and unobserved risk factors r satisfies the relationship
    [Equation (3.1)]

where c3 is a constant.

Under Cox's proportional hazards model (Cox, 1972), the hazard function is multiplicative with respect to g and r if there is no interaction effect. Therefore, the Cox model together with the rare disease assumption implies the assumption (v)', since the density function is approximately equal to the hazard function under these assumptions.

Under the assumptions (i)–(iv) and (v)', we can show that the overall likelihood can be approximated by

[Equation (3.2)]

where λ(·|g) and S(·|g) are, respectively, the hazard function and survival function of age at onset of individuals carrying genotype g. The derivation of (3.2) is similar to that of (2.2) and is therefore omitted. We can assume a suitable functional form for λ(t|g). For example, under the given assumptions, the joint proportional hazards model implies a marginal proportional hazards function

\[
\lambda(t \mid g) = \lambda_0(t;\eta)\exp(\beta g), \tag{3.3}
\]

where λ0(t;η) is the baseline hazard function known up to a parameter vector η of finite dimension.
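To make the structure of such a likelihood concrete, the sketch below computes carrier probabilities for a single relative under a Weibull proportional hazards model. It is an illustration built by analogy with the age-independent case, not a transcription of (3.2): it assumes a prior carrier probability of 1/2, replaces the penetrance of an affected relative by the density λ(y|g)S(y|g) and that of an unaffected relative by S(y|g), and uses the Weibull cumulative hazard (t/e^ψ)^ξ. All function names are hypothetical.

```python
import math


def weibull_cumhaz(t, psi, xi):
    """Baseline cumulative hazard (t / e^psi)^xi."""
    return (t / math.exp(psi)) ** xi


def weibull_hazard(t, psi, xi):
    """Baseline hazard, the derivative of weibull_cumhaz with respect to t."""
    return (xi / t) * weibull_cumhaz(t, psi, xi)


def carrier_prob(y, d, psi, xi, beta):
    """P(relative is a carrier | disease status d, observed age y).

    Carriers have hazard lambda0(t) * exp(beta); prior carrier probability
    is assumed to be 1/2, so it cancels from the ratio.
    """
    h0 = weibull_cumhaz(y, psi, xi)
    s0 = math.exp(-h0)                       # S(y | g = 0)
    s1 = math.exp(-h0 * math.exp(beta))      # S(y | g = 1)
    if d == 1:
        lam0 = weibull_hazard(y, psi, xi)
        w1, w0 = lam0 * math.exp(beta) * s1, lam0 * s0
    else:
        w1, w0 = s1, s0
    return w1 / (w1 + w0)


def log_likelihood(relatives, psi, xi, beta):
    """Sum of log carrier-status probabilities over (g, d, y) triples."""
    total = 0.0
    for g, d, y in relatives:
        p = carrier_prob(y, d, psi, xi, beta)
        total += math.log(p if g == 1 else 1 - p)
    return total
```

Maximizing `log_likelihood` over (psi, xi, beta) with any general-purpose optimizer would give estimates in the spirit of Section 3.3, subject to the assumptions listed above.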

If only unaffected relatives are genotyped, then the likelihood function (3.2) reduces to

[Equation (3.4)]

It can be shown that S(·|1) and S(·|0) are not identifiable in (3.4), as in Section 2.3. For rare disease, one can assume that the penetrance of noncarriers is nearly zero so that S(y|0) ≈ 1, and the likelihood function is approximately

[Equation (3.5)]

which is equivalent to model (3) of Olschwang and others (2009). Making the additional assumption of a Weibull survival function form of S(y|1) yields a logistic regression model given by (5) of Olschwang and others (2009).

3.3. MLE, hypothesis testing, and confidence interval

The MLEs of the unknown parameters can be obtained by the Newton–Raphson algorithm or any other optimization algorithm. To examine whether the study mutation has an effect on the disease, we can test the null hypothesis β = 0 using either a likelihood ratio test or a Wald test, where β is given in (3.3). We can also estimate the variances of the MLEs and construct confidence intervals for the unknown parameters based on large-sample theory.


In many real applications, we might be interested in comparing penetrances between 2 groups, for example, male versus female. Also, when there are multiple known disease-causing mutations involved, we are interested in comparing the penetrances among multiple mutations. An example will be given in Section 6. We can extend the previous likelihood functions further to adjust for covariates and multiple disease-causing mutations. In the following example, we illustrate how to incorporate covariates and multiple mutations in the situation where the genotypes from both affected and unaffected relatives are available.

4.1. Likelihood function

Assume that a covariate vector Z is observed for each relative. Then we can incorporate the covariates’ effect in a proportional hazards model:

\[
\Lambda(t \mid g, Z) = \Lambda_0(t;\eta)\exp(\beta g + \gamma^{\tau} Z), \tag{4.1}
\]

where Λ(t|g,Z) is the cumulative hazard function of the age at onset given covariate Z and genotype g and Λ0(t;η) is the baseline cumulative hazard function corresponding to g = 0 and Z = 0, which is known up to a parameter vector η of finite dimension. The likelihood function is therefore approximately

[Displayed equation: approximate likelihood function under model (4.1)]

Suppose K types of disease-causing rare mutations are considered. We assume that each family can have at most one type of mutation segregated. Let δi be a mutation indicator, that is, δi = k if the ith case proband has the kth type of mutation. We assume that the cumulative hazard functions of these risk mutations are proportional:

[Equation (4.2)]

where β1 = 0. The approximated likelihood function can be written as

[Equation (4.3)]

where 1k(δi) is an indicator function taking value 1 if δi = k and 0 otherwise. Refer to Appendix B of the supplementary material (available at Biostatistics online) for the derivation of (4.3). This expression shows that the mutation behaves as a family-shared categorical covariate.

4.2. MLE, confidence interval, and hypothesis testing

The MLEs of η, β, γ, and βk, k = 2,…,K, in (4.1) and (4.2) can be obtained using the Newton–Raphson algorithm or any optimization algorithm. The variance estimates of the MLEs and confidence intervals of the unknown parameters can be obtained as before.

It is of interest to compare the penetrances for various disease-causing mutations, which can be done with the standard likelihood ratio test based on the likelihood (4.3). The proportionality of the hazard functions can also be tested by a likelihood ratio test, with the alternative hypothesis being that the mutations have their own specific penetrance functions. If the null hypothesis that the penetrance functions are proportional is not rejected, we can apply the proportional hazards model (4.2); otherwise we need to estimate mutation-specific penetrance functions.


We conducted simulation studies to assess the performance of the proposed approach.

First, we studied the age-independent penetrances. We assumed 2 independent disease related single-nucleotide polymorphisms (SNPs): one is the study mutation with minor allele frequency (MAF) 0.01 or 0.001 and the other one is unobserved with MAF 0.2. We assumed dominant mode of inheritance for both the SNPs. The disease and risk factors were related by a logistic regression model:

[Equation (5.1)]

where g (r) is 1 if the genotype of the study SNP (unobserved SNP) is of higher risk and 0 otherwise and OR is the odds ratio parameter for the unobserved SNP, which takes value 1 or 2. The marginal penetrance f1 for carriers was fixed at 0.5 and the other penetrance f0 was 0.03 or 0.1. The values of log-OR parameters a and b were determined by the other parameters. The genotypes of parents were generated under Hardy–Weinberg equilibrium and random mating, and the genotypes of offspring were independently generated given parental genotypes. From a large number of generated families with 3 offspring, we randomly selected 1 000 000 families with the first offspring being affected and carrying the study mutation and treated them as the source population from which the study sample was collected. A sample of size 200, 500, or 1000 was drawn from this population and simulation results based on 100 000 replications were produced. Reported in Table 1 are the Rbias of the estimates defined as the mean estimated penetrances divided by the true penetrance minus 1, empirical standard errors (SE) and mean estimated standard errors (SEE) of the estimates, and empirical coverage probability (ECP) of the penetrances.
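The Rbias summary can be illustrated with a much simplified Monte Carlo sketch. It is not the simulation design described above: there is no unobserved SNP, no parental genotype generation, and no proband ascertainment step. Relatives are simply drawn as carriers with probability 1/2 and affected with probability f1 or f0, and the closed-form estimates from Section 2.4 are applied. All function names and default values are hypothetical.

```python
import random


def simulate_counts(n_relatives, f1, f0, rng):
    """Draw relatives and return pooled counts (A, N1, B, N0)."""
    A = N1 = B = N0 = 0
    for _ in range(n_relatives):
        g = 1 if rng.random() < 0.5 else 0               # carrier status
        d = 1 if rng.random() < (f1 if g else f0) else 0  # disease status
        if d:
            N1 += 1
            A += g
        else:
            N0 += 1
            B += g
    return A, N1, B, N0


def estimate(A, N1, B, N0):
    """Closed-form estimates (f1_hat, f0_hat)."""
    p1, p0 = A / N1, B / N0
    diff = p1 - p0
    return p1 * (1 - 2 * p0) / diff, (1 - p1) * (1 - 2 * p0) / diff


def relative_bias(n_relatives, f1, f0, n_rep=200, seed=0):
    """Rbias = mean estimate / true value - 1, for f1 and f0."""
    rng = random.Random(seed)
    sum1 = sum0 = 0.0
    for _ in range(n_rep):
        e1, e0 = estimate(*simulate_counts(n_relatives, f1, f0, rng))
        sum1 += e1
        sum0 += e0
    return sum1 / n_rep / f1 - 1, sum0 / n_rep / f0 - 1
```

Because the carrier probability is fixed at exactly 1/2 here, this sketch cannot reproduce the dependence of the bias on MAF reported in Table 1; it only shows how the Rbias summary is computed.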

Table 1.

Age-independent penetrance estimates

Overall, the estimates have minor bias when the disease is rare (f0 = 0.03) and the study mutation is rare (MAF = 0.001), with absolute relative biases no more than 1.2%. A common disease (f0 = 0.1), an increased MAF (0.01) of the study mutation, and a positive effect of the unobserved mutation (OR = 2) have only a small impact on the estimates, with Rbias ranging from −7.9% to 3.4%. In all situations, the SEE are very close to the empirical SE. The relative bias remains stable and small as the sample size increases. We also estimated the penetrance of carriers using only the genotypes of unaffected relatives by assuming zero penetrance of noncarriers. The resulting Rbias is generally small when f0 = 0.03 but can become considerable when f0 = 0.1 (results not shown).

It is also seen from Table 1 that the Rbias for a mutation with MAF = 0.001 tends to be smaller than that observed for a mutation with MAF = 0.01. Additional simulation results show that the relative biases get larger when the MAF increases. For example, an MAF of 0.03 produces relative biases in the range of −10.3% to −23.7%, and an MAF of 0.1 produces relative biases in the range of −41.9% to −65.1%, with the other parameters the same as those in Table 1. It appears that the proposed approach is suitable for rare mutations with MAF ≤ 0.01.

Second, we studied the proposed approach when the penetrance is age dependent. We generated data from the following Cox model with Weibull baseline hazard function:

[Equation (5.2)]

where g and r are the same as those in (5.1) with the same MAFs. The OR was fixed at 1 or 2. The other parameters ξ, ψ, and β were determined by 3 cumulative risk probabilities: p30,0 = P(T ≤ 30|g = 0), p60,0 = P(T ≤ 60|g = 0), and p60,1 = P(T ≤ 60|g = 1), where T is the age at onset. To mimic common disease, we set p30,0 = 0.03 and p60,0 = 0.09; to mimic rare disease, we set p30,0 = 0.01 and p60,0 = 0.03. In both situations, we set p60,1 = 0.5. The ages of the relatives of a proband were generated from the uniform distribution in the interval (a − 5,a + 5), where a is the current age of the proband that is uniformly distributed in the interval (20, 70). The ages, genotypes, and disease status were generated for a large number of families in the same way as in the age-independent situation. In each family, there were 2 parents and 3 offspring whose data were generated. Altogether, 1 000 000 families with 1 affected proband (the first offspring) carrying the mutation in each family were obtained. From these families, we sampled 400 or 1000 families and estimated ξ, ψ, and β in model (5.2) by ignoring the unobserved mutation. Substituting the estimated parameters gave the estimates of marginal survival functions of carriers and noncarriers. Based on 5000 replications, we calculated the mean estimated survival functions of both carriers and noncarriers and the 90% confidence intervals of the survival functions.
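The way 3 cumulative risks pin down ξ, ψ, and β can be sketched as follows. The code assumes the Weibull parameterization with cumulative hazard (t/e^ψ)^ξ that appears in the Lynch syndrome analysis, a proportional effect e^β for carriers, and OR = 1, so that the marginal and conditional models coincide; with OR = 2 the marginal risks would involve averaging over the unobserved SNP, which is not handled here. The function names are hypothetical.

```python
import math


def weibull_params_from_risks(p30_0, p60_0, p60_1):
    """Solve for (xi, psi, beta) from three cumulative risks.

    p30_0 = P(T <= 30 | g = 0), p60_0 = P(T <= 60 | g = 0),
    p60_1 = P(T <= 60 | g = 1), with S(t | g) = exp(-(t/e^psi)^xi * e^(beta*g)).
    """
    h30 = -math.log(1 - p30_0)      # cumulative hazard at 30, noncarriers
    h60 = -math.log(1 - p60_0)      # cumulative hazard at 60, noncarriers
    h60_1 = -math.log(1 - p60_1)    # cumulative hazard at 60, carriers
    xi = math.log(h60 / h30) / math.log(60 / 30)
    psi = math.log(60) - math.log(h60) / xi
    beta = math.log(h60_1 / h60)
    return xi, psi, beta


def cumulative_risk(t, g, xi, psi, beta):
    """P(T <= t | g) under the same Weibull proportional hazards model."""
    return 1 - math.exp(-((t / math.exp(psi)) ** xi) * math.exp(beta * g))
```

Calling `weibull_params_from_risks(0.01, 0.03, 0.5)` corresponds to the rare-disease setting, and `weibull_params_from_risks(0.03, 0.09, 0.5)` to the common-disease setting.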

Presented in Figures 1 and 2 are the results for carriers and noncarriers, respectively, with sample size 1000 and OR = 1 (the unobserved mutation plays no role in the disease). We can see that the bias of the estimates decreases dramatically when the MAF of the study mutation decreases from 0.01 to 0.001, showing that the approximation of the likelihood function works well for relatively rare mutations. When the disease becomes common, the proposed method using both affected and unaffected relatives does not produce extra bias. However, the method that uses only unaffected relatives has much larger bias for common disease. This extra bias is due to the improper assumption of a zero penetrance function for noncarriers for common disease. Other results for sample size 400 or OR = 2 are presented in Figures s1–s6 of the supplementary material (available at Biostatistics online). In summary, the bias of the penetrance functions gets smaller as the sample size increases. The positive effect of the unobserved mutation (OR = 2) has only limited impact on the penetrance function estimates. In particular, the impact is minimal when the MAF of the study mutation is small and the disease is rare.

Fig. 1.

Estimated survival functions of carriers with sample size 1000 and OR = 1. “Common mutation”: MAF = 0.01; “rare mutation”: MAF = 0.001; “common disease”: P(T ≤ 30|g = 0) = 0.03 and P(T ...

Fig. 2.

Estimated survival functions of noncarriers with sample size 1000 and OR = 1. “Common mutation”: MAF = 0.01; “rare mutation”: MAF = 0.001; “common disease”: P(T ≤ 30|g = 0) = 0.03 and P(T ...

Finally, we examined robustness to the specification of the baseline hazard function. Our simulation studies showed that misspecification of the baseline hazard function could result in bias, with its magnitude depending on the true and misspecified functions. Here, we do not present the simulation results but briefly summarize them. If the true baseline hazard function is gamma, Weibull, or log-normal but is misspecified as either of the other 2 functions, then the resulting penetrance estimate has small bias; if the true baseline hazard function is piecewise constant but is misspecified as Weibull, then the bias can be relatively large.


We applied the proposed approach to a study of Lynch syndrome (Olschwang and others, 2009). In this study, the carriers were identified in 8 genetic units of France and Switzerland. These units offered germline analysis of the MSH2 and MLH1 genes. A retrospective questionnaire was used to collect information on asymptomatic first-degree relatives of carriers. The collected information includes the type of disease-causing germline mutation identified in the proband, birth date, sex, and age at genetic diagnosis. The presence or absence of the disease-causing mutation was then assessed in these relatives. Phenotypes and genotypes from 856 asymptomatic first-degree relatives of MSH2 or MLH1 carriers were collected from those 8 centers. For each relative, the gender and mutation status at genes MSH2 and MLH1 were obtained, as summarized in Table 2. Furthermore, the ages of the relatives were available, so that we could estimate the age-dependent penetrances.

Table 2.

Summary of genotypes

With pooled data, we assumed a Weibull model for carriers with cumulative hazard function (t/e^ψ)^ξ, that is, survival function exp{−(t/e^ψ)^ξ}. With gender or mutation type adjusted, we assumed a proportional hazards model with cumulative hazard function (t/e^ψ)^ξ e^{β1x1} or (t/e^ψ)^ξ e^{β2x2}. Here, x1 = 1 if male and 0 if female and x2 = 1 if MLH1 and 0 if MSH2. We obtained the MLEs of the unknown parameters (ψ, ξ, β1, and β2), estimated SE of the MLEs, and 95% confidence intervals of the parameters. The estimation and hypothesis-testing results are presented in Table 3. The estimated survival function together with its confidence interval curves based on 5000 bootstrap replicates (Efron and Tibshirani, 1993) are plotted in Figure s7 of the supplementary material (available at Biostatistics online).
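A percentile bootstrap of the kind cited above can be sketched generically. The code makes two assumptions that are not stated in the text: whole families are the resampling units, and the interval is a simple percentile interval. For brevity the estimator plugged in is the age-independent closed form for f1 rather than the fitted Weibull survival function used in the study. Function names and the (a_i, n1_i, b_i, n0_i) layout are hypothetical.

```python
import random


def pooled_f1(families):
    """Closed-form estimate of f1 from (a_i, n1_i, b_i, n0_i) tuples."""
    A = sum(f[0] for f in families)
    N1 = sum(f[1] for f in families)
    B = sum(f[2] for f in families)
    N0 = sum(f[3] for f in families)
    p1, p0 = A / N1, B / N0
    return p1 * (1 - 2 * p0) / (p1 - p0)


def bootstrap_ci(families, estimator, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling families with replacement."""
    rng = random.Random(seed)
    n = len(families)
    estimates = sorted(
        estimator([families[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lower = estimates[int(n_boot * alpha / 2)]
    upper = estimates[int(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper
```

Any function of the resampled family list can be passed as `estimator`, for example one that refits the survival model and evaluates it at a fixed age.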

Table 3.

Estimates of the parameters for the Lynch syndrome data

The penetrance difference between males and females was moderately large, the penetrance difference between the 2 genes was minor, and neither difference was statistically significant (P-values 0.168 and 0.515, respectively). These results are consistent with those of Olschwang and others (2009). In this example, we fitted a more general Weibull baseline hazard function with 2 parameters, while Olschwang and others (2009) fitted an exponential baseline hazard function with a threshold value.


A precise estimation of the age-dependent risk for people carrying disease-causing mutations would have a tremendous impact on public health, which is instrumental in the counseling of individuals who are identified by genetic testing as carriers and who are faced with different options for cancer prevention or early detection. We provide a rigorous statistical inference framework for the evaluation of the penetrance of a rare mutation. The approach can handle both covariates and multiple rare mutations.

It is helpful to check the parametric assumption of the baseline hazard function. Because the design is retrospective and the observations are subject to censoring, rigorously checking the parametric assumption is a great challenge. In practice, one can try some commonly used parametric baselines and choose the one with the largest likelihood. This technique, however, could be misleading if the true baseline is very different from the candidate ones. A nonparametric approach that does not assume any parametric baseline would be much more desirable, although it involves some computational and theoretical issues. We will pursue this in future research.

The proposed approach allows for unobserved risk factors that are correlated among family members, provided that there is no interaction between the study mutation and the unobserved risk factors. When such an interaction is present, the proposed approach can produce considerable bias in the penetrance estimates. More advanced methods, such as the frailty models of Hsu and others (2004) and Hsu and Gorfine (2006), could help resolve this problem. The development of such an inference procedure is still under way.

When the disease is not rare, as demonstrated in Section 2.4 and the simulation studies, assuming zero penetrance for noncarriers can produce considerable bias in the penetrance estimate for carriers. In such a situation, genotypes from affected relatives help to improve estimation with the proposed approach. Therefore, it is important to collect genotype information from both affected and unaffected relatives when adopting such a case–family design for penetrance estimation. We hope that the proposed method will make this potentially very useful design more accessible for future studies of rare mutations.


Intramural Program of the National Institutes of Health to H.Z. and K.Y.; Natural Science Foundation of China (10701067) to H.Z.; Institut National du Cancer to S.O.

Supplementary Material

Supplementary material is available at Biostatistics online.


We would like to thank Dr. Gilles Thomas for helpful discussions; Dr. B. J. Stone for editorial help; and Drs. C. Lasset, Q. Wang, P. Hutter, M. P. Buisine, R. Etienne, C. Caron, V. Bourdon, and S. Baert-Desurmont for data collection. Conflict of Interest: None declared.


  • Chatterjee N, Kalaylioglu Z, Shih JH, Gail MH. Case-control and case-only designs with genotype and family history data: estimating relative risk, residual familial aggregation, and cumulative risk. Biometrics. 2006;62:36–48.
  • Chatterjee N, Wacholder S. A marginal likelihood approach for estimating penetrance from kin–cohort designs. Biometrics. 2001;57:245–252.
  • Cox DR. Regression models and life-tables. Journal of the Royal Statistical Society, Series B (Methodological). 1972;34:187–220.
  • Efron B, Tibshirani R. An Introduction to the Bootstrap. New York: Chapman & Hall; 1993.
  • Gail MH, Pee D, Benichou J, Carroll R. Designing studies to estimate the penetrance of an identified autosomal dominant mutation: cohort, case-control, and genotyped-proband designs. Genetic Epidemiology. 1999;16:15–39.
  • Gail MH, Pee D, Carroll R. Kin–cohort designs for gene characterization. Journal of the National Cancer Institute. Monographs. 1999;26:55–60.
  • Gail MH, Pfeiffer RM, Wheeler W, Pee D. Probability of detecting disease-associated single nucleotide polymorphisms in case-control genome-wide association studies. Biostatistics. 2008;9:201–215.
  • Hsu L, Chen L, Gorfine M, Malone K. Semiparametric estimation of marginal hazard function from the case-control family studies. Biometrics. 2004;60:936–944.
  • Hsu L, Gorfine M. Multivariate survival analysis for case-control family data. Biostatistics. 2006;7:387–398.
  • Olschwang S, Yu K, Lasset C, Baert-Desurmont S, Buisine MP, Wang Q, Hutter P, Rouleau E, Caron O, Bourdon V, and others. Age-dependent cancer risk is not different in between MSH2 and MLH1 mutation carriers. Journal of Cancer Epidemiology. 2009; doi:10.1155/2009/791754.
  • Wacholder S, Hartge P, Struewing JP, Pee D, McAdams M, Brody L, Tucker M. The kin-cohort study for estimating penetrance. American Journal of Epidemiology. 1998;148:623–630.
  • Wang Y, Clark LN, Marder K, Rabinowitz D. Nonparametric estimation of age-at-onset distributions from censored kin-cohort data. Biometrika. 2007;94:403–414.
  • Wang Y, Ottman R, Rabinowitz D. A method for estimating penetrance from families sampled for linkage analysis. Biometrics. 2006;62:1081–1088.
  • Yu K, Li Q, Bergen AW, Pfeiffer RM, Rosenberg PS, Caporaso N, Kraft P, Chatterjee N. Pathway analysis by adaptive combination of P-values. Genetic Epidemiology. 2009;33:700–709.

Articles from Biostatistics (Oxford, England) are provided here courtesy of Oxford University Press