We show how to use reports of cancer in family members to discover additional genetic associations or confirm previous findings in genome-wide association (GWA) studies conducted in case-control, cohort, or cross-sectional studies. Our novel family-history-based approach allows economical association studies for multiple cancers, without genotyping of relatives (as required in family studies), follow-up of participants (as required in cohort studies), or oversampling of specific cancer cases (as required in case-control studies). We empirically evaluate the performance of the proposed family-history-based approach in studying associations with prostate and ovarian cancers, using data from GWA studies previously conducted within the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial. The family-history-based method may be particularly useful for investigating genetic susceptibility to rare diseases, for which accruing cases may be very difficult, by using disease information from non-genotyped relatives of participants in multiple case-control and cohort studies designed primarily for other purposes.
In many epidemiologic studies exploring genetic association, participants provide information at enrollment on history of cancers in family members, particularly first-degree relatives. Investigators commonly use this family history information to control for confounding in standard logistic regression. Thornton et al.1 have suggested using family history for optimally weighting cases and controls in association studies as a way to improve power. Our novel approach for using family history in genome-wide association (GWA) studies exploits Mendelian rules of inheritance to detect genetic associations. In essence, we infer the relatives’ genotype distributions from the observed genotypes of participants, using patterns of Mendelian inheritance. The test for association between relatives’ inferred genotypes and their disease status is, in fact, exactly the same as the test for association between participants’ genotypes and relatives’ disease status. Hence, in practice, we compare the genotype distribution of the participants with and without family history to provide evidence of association with the underlying risk of disease.
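As a worked illustration of the Mendelian inference described above, the following sketch gives the distribution of a parent's minor-allele count $G_r$ given the participant's count $G$. It rests on assumptions that are not stated in the text: a biallelic SNP with minor-allele frequency $p$ (and $q = 1 - p$), Hardy-Weinberg proportions, and random mating.

```latex
\[
\Pr(G_r = k \mid G = g):\quad
\begin{array}{c|ccc}
      & k=0 & k=1 & k=2 \\ \hline
g=0   & q   & p   & 0   \\
g=1   & q/2 & 1/2 & p/2 \\
g=2   & 0   & q   & p
\end{array}
\qquad\Longrightarrow\qquad
\mathrm{E}[G_r \mid G = g] \;=\; p + \frac{g}{2}.
\]
```

Under the same assumptions the table is identical for a child given a parent, and the conditional mean $p + g/2$ also holds for full siblings. Because the relative's expected allele count increases with the participant's observed count, an association between participants' genotypes and relatives' disease status reflects an association between relatives' genotypes and their own disease.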
We illustrate the method using as examples studies of prostate and ovarian cancer in the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial. We test for associations with prostate cancer reported in fathers, brothers, or sons of men and women in the cohort, and associations with ovarian cancer reported in mothers, sisters or daughters of the same cohort members. For prostate cancer we compare results with the standard GWA case-control results for participants in the Cancer Screening Trial. Because some participants were genotyped for a nested case-control study of prostate cancer and most were genotyped for unrelated studies, we were able to compare the results from a classic case-control design to the family-history design.
Details of the design and other features of the Cancer Screening Trial have been described previously.2 Briefly, this is an ongoing randomized clinical trial to test the effect of screening on mortality for prostate, lung, colorectal and ovarian cancers. A total of 154,901 men and women aged 55–74 years were enrolled between 1993 and 2001 and randomized to either the screening or the control arm.
In two self-administered questionnaires (one at baseline and one about ten years after enrollment began), participants were asked about history of cancer in first-degree relatives and half-siblings. In the baseline questionnaire (https://www.plcostars.com/Public/Documents/PLCO/BQF.pdf), respondents were asked to list family members (including parents, full and half-siblings, and children) with a history of cancer, and to indicate the cancer type and age at diagnosis. Of the 154,901 trial participants, 149,980 (97%) completed the baseline questionnaire. The supplemental questionnaire (https://www.plcostars.com/Public/Documents/PLCO/SQX.pdf), completed by 104,007 (69%) of 149,980 participants in 2006–2007, collected information on history of breast, prostate, lung, ovarian, lymphoma, colorectal, endometrial, bladder, leukemia, and “other cancers” in first-degree relatives.
We defined the family-history variable for prostate cancer as positive if one or more male first-degree relatives were reported to have prostate cancer. We considered family history to be positive if the respondent indicated family history of prostate cancer in either the baseline or supplemental questionnaires, and negative if both responses were “no” or “missing.” We defined the family-history variable for ovarian cancer correspondingly.
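The pooling rule above can be written as a short function. This is a minimal sketch: the function name and the use of `None` to represent a missing response are illustrative assumptions, not part of the trial's data dictionary.

```python
from typing import Optional


def family_history_positive(baseline: Optional[bool],
                            supplemental: Optional[bool]) -> bool:
    """Pool two questionnaire responses into one yes/no family-history variable.

    True  -> an affected first-degree relative was reported in either questionnaire.
    False -> both responses were "no" or missing (None stands for missing here,
             which is an assumption of this sketch).
    """
    return baseline is True or supplemental is True


# A "yes" in either questionnaire is enough; missing is treated like "no".
print(family_history_positive(False, True))   # True
print(family_history_positive(None, None))    # False
```

Treating a missing response as negative, as the text specifies, means that some true positives are coded as negative, which is one source of the phenotype misclassification considered later in the paper.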
We pooled information over the two questionnaires to estimate the number of male first-degree relatives for the participants and the number of those reported to have prostate cancer. We did not have complete family structure from the baseline questionnaire; the questions did not ask about the number of sons for each participant, nor did they differentiate between full and half-brothers. The supplemental questionnaire asked for the number of full brothers and sons of each participant. From this we were able to calculate the average number of full brothers and sons for each participant, and apply that average to estimate the number of full brothers and sons for all participants. In counting the number of relatives reported to have prostate cancer, we considered a male relative to be affected if he was reported in either of the two questionnaires. We followed the same rules for ovarian cancer.
Blood was collected from the screening arm participants, and buccal cell samples were collected from control arm participants. Using these samples, 6,695 participants (4%) were genotyped as part of several previously published and ongoing GWA studies (the prostate cancer component of the Cancer Genetic Markers of Susceptibility project,3–5 the Pancreatic Cancer Cohort Consortium,6,7 and a multi-stage GWA study of bladder cancer8,9). We used a list of single-nucleotide polymorphisms (SNPs) from http://www.genome.gov/gwastudies/ (accessed July 20, 2010) that have been reported to be associated with prostate cancer. We restricted the analysis to a single SNP per locus when two or more SNPs within a locus were in strong linkage disequilibrium. For ovarian cancer, we examined a list of 44 previously published and subsequently confirmed SNPs.10–12 Genotypes for 11 of the 44 ovarian SNPs were imputed using the program IMPUTE v2.13 We used the 1000 Genomes Project (June 2010 release) and HapMap 3 (release 2), both based on human genome build 36, as the imputation reference set.
We restricted our analyses to white participants (n=6,411; 96% of all participants with GWA data). For the family-history-based method, we fit a logistic regression model for each SNP, with the SNP genotype as an independent variable and the family history of the participant as the dependent variable. The test for association from the family-history-based model corresponds to an indirect test for association between the SNP and the underlying risk of prostate or ovarian cancer, with a participant’s family as the unit of observation. For comparison with the disease-based GWA approach, we performed the conventional direct test for association for the prostate-related SNPs with the disease status of the participant as the response. For this disease-based analysis, we considered the 1,151 participants with prostate cancer (identified as part of the Cancer Genetic Markers of Susceptibility Prostate Cancer GWA study, as described elsewhere3) as cases and the remaining 5,260 genotyped participants as controls. Under the null hypothesis of no association between the SNP genotype and the disease, the family-history-based test maintains the correct Type I error, as does the standard case-control test for association.
For both models, we assumed the additive mode of inheritance. The logistic regression on imputed SNPs used the expected count instead of the exact count of minor alleles.13 We included the top five principal components (based on a selected panel of about 12,000 independent SNPs14) in the regression to control for population stratification.15 For the family-history-based regression model we adjusted for the participants’ sex and age at enrollment, treating age as a continuous covariate. For the disease-based model we adjusted for age at enrollment and family history of prostate cancer. All statistical analyses were performed using SAS, version 9.1 (SAS Institute, Cary, NC).
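To make the family-history-based model concrete, the sketch below fits an additive logistic regression of family history on minor-allele count using only the Python standard library. It is a simplified illustration rather than the SAS analysis reported here: it omits the covariates (sex, age, principal components), works on counts grouped by genotype, and uses hypothetical counts. The function name and its arguments are assumptions of this sketch.

```python
import math


def fit_additive_logistic(n_by_genotype, y_by_genotype, iterations=25):
    """Fit logit P(family history = yes) = a + b * g by Newton-Raphson.

    n_by_genotype[g]: participants carrying g minor alleles (g = 0, 1, 2)
    y_by_genotype[g]: those among them with a positive family history
    Returns (a, b, se_b, p_value), where p_value is the two-sided Wald test of b = 0.
    """
    a, b = 0.0, 0.0
    for _ in range(iterations):
        score_a = score_b = 0.0          # gradient of the log-likelihood
        i_aa = i_ab = i_bb = 0.0         # Fisher information matrix entries
        for g, (n, y) in enumerate(zip(n_by_genotype, y_by_genotype)):
            p = 1.0 / (1.0 + math.exp(-(a + b * g)))
            resid = y - n * p
            w = n * p * (1.0 - p)
            score_a += resid
            score_b += resid * g
            i_aa += w
            i_ab += w * g
            i_bb += w * g * g
        det = i_aa * i_bb - i_ab * i_ab
        # Newton step: beta <- beta + I^{-1} * score (explicit 2x2 inverse)
        a += (i_bb * score_a - i_ab * score_b) / det
        b += (i_aa * score_b - i_ab * score_a) / det
    # Standard error from the information at the (converged) last iterate
    se_b = math.sqrt(i_aa / det)
    z = b / se_b
    return a, b, se_b, math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical counts: family history becomes more common with each minor allele.
a, b, se_b, p_value = fit_additive_logistic([1000, 1000, 1000], [100, 150, 200])
print(f"log OR per allele = {b:.3f}, SE = {se_b:.3f}, p = {p_value:.2g}")
```

The dependent variable is the participant's reported family history, not the participant's own disease status, so the same genotype data can be reused for any disease on which family history was collected.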
Characteristics of the participants are reported in Table 1. Of the 6411 participants available for analyses, 738 (12%) participants reported having one or more first-degree relatives with prostate cancer. The participants included 2243 control subjects and 4168 cases from one or more GWA studies. The first column of p-values in Table 1 next to each cancer type corresponds to a test of whether cases for that cancer type are more likely to have reported a history of prostate cancer in their families than the rest. As expected, the 1151 men selected as cases for the prostate cancer GWA study reported substantially more family history of prostate cancer than the remaining 5260 participants (19% vs. 10%; one-sided p-value = 1.1×10^−16), while the participants chosen as cases for other cancer types were not more likely to have a family history of prostate cancer.
Three hundred and two (5%) participants out of 6411 reported one or more first-degree relatives with ovarian cancer. Study participants chosen as cases for GWA studies of bladder and kidney cancers were more likely to have family history of ovarian cancer than the remaining participants (5.5% in 1682 bladder cancer cases vs. 4.4% in remaining 4729 participants, one-sided p-value = 0.03; 7.5% in 281 kidney cancer cases vs. 4.6% in remaining 6130 participants, one-sided p-value = 0.01).
Three-fourths (77%) of participants were men. A greater percentage of men than women reported family history of prostate cancer (12% vs. 10%; two-sided p-value = 0.03). However, when we excluded the 1151 men included in the case series of the prostate cancer GWA study (23% among 4955 men in our analysis), there was no longer a difference in reported family history by sex (two-sided p-value = 0.97). Participants with and without a history of prostate cancer among first-degree relatives had similar age distributions (mean age = 63.8 vs. 64.1; two-sided p-value = 0.12).
Women were more likely than men to report one or more family members with ovarian cancer (7.4% vs. 3.9% respectively; two-sided p-value = 2.9×10^−8), as was also noted previously by Pinsky et al.16 Participants with and without ovarian-cancer-affected relatives had similar age distributions (mean age = 64.4 vs. 64.1; two-sided p-value = 0.22).
The history of prostate and ovarian cancer among the first-degree relatives of the 6411 participants is generally consistent with what is known about the epidemiology of these cancers. The 4,010 participants who responded to the supplemental questionnaire reported averages of 1.5 brothers, 1.4 sisters, 1.2 sons and 1.2 daughters. Participants reporting family history of prostate cancer had more brothers and sons on average than the participants reporting no family history (2.0 vs. 1.4 brothers; 1.3 vs. 1.2 sons). Similarly, participants with reports of ovarian cancer in family members had more sisters and daughters than participants without (2.1 vs. 1.4 sisters; 1.5 vs. 1.2 daughters). Parents were most likely to be reported as having had prostate (7% of fathers) or ovarian cancer (6% of mothers), with a slightly lower percentage of siblings (4% of brothers; 3% of sisters) and a negligible percentage of sons (0.1%) and daughters (0.3%). We also examined these counts separately for the 1,151 participants selected as cases for the prostate cancer GWA study and the remaining 5260 participants, and observed similar patterns. As expected, the prostate cancer study cases were more likely to report a family history of prostate cancer (6% of first-degree relatives had prostate cancer) than the remaining 5260 participants (3%) or all 6411 participants (3%).
Table 2 presents the p-values for associations with each of the selected 32 prostate cancer SNPs, and the corresponding odds ratio estimates, for both the family-history-based and disease-based methods. The family-history-based p-values are below 0.05 for 6 of these established prostate cancer SNPs and marginally above 0.05 for 3 additional SNPs. Specifically, for the most statistically significant SNP (rs4242382) the family-history-based p-value is 7.4×10^−5; for the second and third most significant SNPs (rs1859962 and rs1512268) the p-values are 0.002 and 0.004 respectively. The disease-based method yielded p-values smaller than 0.05 for many of the previously confirmed SNPs. Comparing the two sets of p-values, those based on family history were generally larger, indicating weaker evidence of association (p-value for one-sided Wilcoxon signed-rank test comparing matched pairs of p-values = 0.001).
For three SNPs (rs12621278, rs1512268 and rs620861), which were confirmed with subsequent replication,4,17,18 the family-history-based method provided evidence of association when the disease-based method was not informative. For these three SNPs, the family-history-based p-values were 0.02, 3.5×10^−3, and 0.04, while the disease-based p-values were 0.47, 0.16 and 0.18.
The odds ratio estimates from the family-history-based method are biased estimates of effect sizes and should be interpreted with caution. A scatter plot of the log odds ratio estimates from the two methods (eFigure 1, http://links.lww.com) shows that the estimates go mostly in the same direction, as one would expect. For SNPs with estimates in the same direction, there is a tendency for the family-history-based estimates to be smaller in magnitude than the disease-based estimates.
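The attenuation of the family-history-based odds ratio can be illustrated numerically. The sketch below is not the authors' calculation. It assumes Hardy-Weinberg proportions, a single parent as the only relative, perfectly accurate reports, and a logistic disease model in the parent. The minor-allele frequency, baseline risk, and odds ratio are hypothetical inputs, and the function names are illustrative.

```python
import math


def parent_given_child(maf):
    """P(parent has k minor alleles | child has g), rows g = 0..2, columns k = 0..2.

    Assumes Hardy-Weinberg proportions and random mating.
    """
    q = 1.0 - maf
    return [
        [q, maf, 0.0],
        [q / 2.0, 0.5, maf / 2.0],
        [0.0, q, maf],
    ]


def logit(x):
    return math.log(x / (1.0 - x))


def family_history_log_odds_ratios(maf, baseline_risk, log_or):
    """Log odds ratios for the parent's disease per extra minor allele in the child.

    baseline_risk is the parent's disease risk with 0 minor alleles, and
    log_or is the true per-allele log odds ratio in the parent.
    Returns [genotype 1 vs 0, genotype 2 vs 1] on the child's genotype scale.
    """
    intercept = logit(baseline_risk)
    penetrance = [1.0 / (1.0 + math.exp(-(intercept + log_or * k))) for k in range(3)]
    transition = parent_given_child(maf)
    # Average the parent's risk over the parent's possible genotypes.
    risk = [sum(transition[g][k] * penetrance[k] for k in range(3)) for g in range(3)]
    return [logit(risk[1]) - logit(risk[0]), logit(risk[2]) - logit(risk[1])]


true_log_or = math.log(1.5)
for step in family_history_log_odds_ratios(0.30, 0.06, true_log_or):
    print(f"family-history log OR / true log OR = {step / true_log_or:.2f}")
```

With these hypothetical inputs both ratios are close to one half, in line with the tendency noted above for family-history-based estimates to be smaller in magnitude than disease-based estimates.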
We also evaluated the family-history-based method for studying association with ovarian cancer. Table 3 presents the p-values for testing associations with the 44 ovarian SNPs via the family-history-based method. None of our participants were genotyped for a nested case-control study of ovarian cancer, and so we could not make a standard disease-based comparison as for prostate cancer. Eleven of the 12 SNPs at 9p22, identified by Song et al.12 as susceptibility markers for ovarian cancer, were associated with the risk of disease at 0.05 level of significance. Two SNPs (rs2301301 at 2q31 and rs17138237 at 7p21) out of 30 in 9 loci examined by Goode et al.,10 yielded p-values less than 0.05.
Our work shows that information routinely collected about family history in GWA studies can provide new and independent evidence of associations between genetic variants and disease at low marginal cost. The most exciting application of this family-history-based method is the feasibility of a GWA study of a rare disease using genotypes from persons previously studied in case-control or cohort studies for other purposes, when history of the rare disease in family members is available. The method can also add to the information available from cohort members or cases and controls; the method allows one to learn about genetic associations with a sex-specific disease, such as ovarian cancer, using male as well as female study participants.
Like the kin-cohort method,19,20 the family-history-based method indirectly assesses genetic susceptibility using error-in-variable methods facilitated by knowledge of the error mechanism from Mendelian principles.21 The kin-cohort method estimates penetrance, or risk of disease, in those with a high-risk genotype, whereas the proposed family-history-based approach tests for genetic association. Although the odds ratio estimate from the family-history-based approach is a biased estimate of the effect of the SNP on disease risk, the method provides an unbiased test for association.
Evidence from the family-history-based method may be useful either for discovery of associations or as supplementary information when prior data already exist. The information provided by the family-history-based approach can add power to conventional studies or to pooled analyses. When used by itself, the usual cautions for GWA studies apply.
Thornton et al.1 used family history information, together with the disease status of the study participants, to gain power in a standard analysis of case-control data. These authors exploited the information on disease status in relatives to construct optimal weights for study participants. Our family-history-based approach uses the disease history of relatives, notably older ones (for adult cancer), for studying association. This approach does not use disease status of the participants, which may be unavailable at the start of a prospective cohort study. We emphasize our ability to study diseases that are not the focus of the main study; a case-control study of one disease can contribute information about risks of many others, and a cohort study that enrolls middle-aged subjects can be useful for investigating associations with a late-onset disease such as prostate cancer or Alzheimer disease, without having to wait for cases to accrue in the cohort.
When comparing the p-values from the family-history-based and disease-based analyses, we should note that the prostate-cancer cases were part of the discovery set for the prostate cancer GWA study. This naturally biases the disease-based analysis favorably for the SNPs that were discovered by that study and gives a spurious power advantage over the family-history-based analysis. However, there is real attenuation in power for the family-history-based method due to measurement error from using participants’ genotypes in studies of relatives’ disease, and this attenuation is far more important. For example, one might genotype men and collect information on their fathers’ disease status. The weak correlation between their genotypes contributes to the loss in power. Other sources of power loss are inaccurate reports of disease status of family members leading to phenotype misclassification, and the smaller number of men with affected fathers than in case-control studies, where cases are oversampled. This attenuation in power may be partially countered by larger sample sizes or by collecting information from several family members.
We used simulations to compare the performances of the family-history-based and disease-based methods. For odds ratios ranging from 1.2 to 1.8, we studied relative efficiencies (defined as the inverse ratio of sample sizes needed for 70% power at 0.05 level of significance) of family-history-based analyses, as we increase the number of family members, compared with the standard case-control analysis. We fixed disease prevalence at 8% and considered a SNP with 30% minor allele frequency for our simulations. The family-history-based test with a single first-degree relative requires approximately 4 times the sample size of a standard case-control test for association; as we gather information on more relatives, this ratio decreases (eFigure 2, http://links.lww.com). To check the sensitivity of our observations across various parameters, we performed these simulations for different values of disease prevalence and minor allele frequency (eTable 1, http://links.lww.com). We found that disease prevalence is the primary determinant of the comparative performance of the family-history-based and case-control analysis.
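The definition of relative efficiency used above can be made concrete with a standard normal-approximation sample-size formula for a two-sided Wald test. This is a sketch of the definition only, not the simulation procedure behind eFigure 2. The function names, the effect size, and the per-subject information are hypothetical.

```python
from statistics import NormalDist


def required_sample_size(effect, info_per_subject, power=0.70, alpha=0.05):
    """Approximate n for a two-sided Wald test of a regression coefficient.

    effect: true coefficient (for example, a log odds ratio per allele)
    info_per_subject: Fisher information about the coefficient from one subject
    Uses n = (z_{1-alpha/2} + z_{power})^2 / (effect^2 * info_per_subject),
    which ignores the negligible rejection probability in the opposite tail.
    """
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0) + NormalDist().inv_cdf(power)
    return z * z / (effect ** 2 * info_per_subject)


def relative_efficiency(n_case_control, n_family_history):
    """Inverse ratio of the sample sizes the two designs need for the same power."""
    return n_case_control / n_family_history


# Hypothetical inputs: halving the effective log odds ratio, with the same
# per-subject information, quadruples the required sample size.
n_full = required_sample_size(effect=0.4, info_per_subject=0.05)
n_half = required_sample_size(effect=0.2, info_per_subject=0.05)
print(round(relative_efficiency(n_full, n_half), 2))  # 0.25
```

A relative efficiency of 0.25 corresponds to the roughly fourfold sample size reported above for a single first-degree relative. In the simulations the ratio also depends on disease prevalence and on how many relatives contribute information, neither of which this formula models.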
The quality of the reports on relatives’ disease history is the central concern in using reported family history. In the absence of electronically linkable medical records or on-line family history tools, accuracy of study participants’ reports of relatives’ disease history will vary by characteristics of the participants, the relatives and the disease. Mai et al.22 found that reports on breast, prostate, colorectal and lung cancer in family members of participants in the population-based 2001 Connecticut Family Health Study had low-to-moderate sensitivity and positive predictive value, but high specificity and negative predictive value. Participants’ knowledge of disease in family members varied with the disease (reported history of breast cancer in family members had the highest sensitivity, while colorectal cancer had the lowest) and with the degree of relatedness to the participant (reports on first-degree relatives were more accurate than reports on second-degree relatives). Accuracy of reports also depends on other factors: disease status, age, ethnicity, sex and family size of the participant, as well as age and sex of the family members. Family history was not directly verified in the Cancer Screening Trial, but Pinsky et al.16 indirectly assessed validity of reported family history by comparing reported rates of various cancers in family members with expected rates derived from the National Cancer Institute’s Surveillance, Epidemiology, and End Results (SEER) database. Overall, the authors observed a ratio of reported rates to expected rates of approximately 0.7; for most cancers, this ratio was in the range of 0.6–1.0 for women and 0.3–0.8 for men. Incomplete or inaccurate family history information collected from participants can reduce power and, when reporting accuracy is differential between study participants with and without family history (as is most likely the case), can induce false-positive reports.
Different family-history variables might be appropriate for different settings. We show that the family-history-based method works with even the simplest yes/no family history variable, which might be the most common in existing studies and is the cheapest to obtain for future questionnaires. We observe that participants reporting family history of cancer had more first-degree relatives on average than participants with no family history; this suggests that formulations of family history taking into account family size (such as proportion of affected family members) may perform better. With additional details, we can construct a family history score23 based on family structure and risk covariates for family members. As a first step, we fitted a polytomous logistic regression (eTable 2, column 3 [http://links.lww.com]) treating the response variable (family history of prostate cancer) as a nominal variable (none, one, and multiple family members with prostate cancer); we saw no substantial improvement.
We assume an additive mode of inheritance in our analyses. To check sensitivity of our results to the modeling assumptions, we also fit the general two-degrees-of-freedom model (eTable 2, columns 4 and 5 [http://links.lww.com]). The observed pattern holds even under the general model. The two-degrees-of-freedom model changes the p-values only marginally in an analysis that does not adjust for population substructure (eTable 2, columns 6 and 7 [http://links.lww.com]).
Information on family history of diseases can add value to association studies without additional genotyping. Family history information can supplement disease-based associations, and can be particularly useful when deciding on a set of variants for follow-up in further studies. It may be possible to increase the power of conventional disease-based association studies by combining information on disease status for participants and their relatives, accounting for correlation. In this context, the quasi-likelihood score test proposed by Thornton et al.1 can be used to construct pedigree-based weights and thus improve power. The power gain from combining family history with disease status may be particularly useful in assessing SNPs with borderline significance based on standard analysis of the study population.
Leveraging family history for GWAS
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
SDC Supplemental digital content is available through direct URL citations in the HTML and PDF versions of this article (www.epidem.com). This content is not peer-reviewed or copy-edited; it is the sole responsibility of the author.