|Home | About | Journals | Submit | Contact Us | Français|
Prediction of genetic risk for disease is needed for preventive and personalized medicine. Genome-wide association studies have found unprecedented numbers of variants associated with complex human traits and diseases. However, these variants explain only a small proportion of genetic risk. Mounting evidence suggests that many traits, relevant to public health, are affected by large numbers of small-effect genes and that prediction of genetic risk to those traits and diseases could be improved by incorporating large numbers of markers into whole-genome prediction (WGP) models. We developed a WGP model incorporating thousands of markers for prediction of skin cancer risk in humans. We also considered other ways of incorporating genetic information into prediction models, such as family history or ancestry (using principal components, PCs, of informative markers). Prediction accuracy was evaluated using the area under the receiver operating characteristic curve (AUC) estimated in a cross-validation. Incorporation of genetic information (i.e., familial relationships, PCs, or WGP) yielded a significant increase in prediction accuracy: from an AUC of 0.53 for a baseline model that accounted for nongenetic covariates to AUCs of 0.58 (pedigree), 0.62 (PCs), and 0.64 (WGP). In summary, prediction of skin cancer risk could be improved by considering genetic information and using a large number of single-nucleotide polymorphisms (SNPs) in a WGP model, which allows for the detection of patterns of genetic risk that are above and beyond those that can be captured using family history. We discuss avenues for improving prediction accuracy and speculate on the possible use of WGP to prospectively identify individuals at high risk.
SKIN cancer is the most common form of cancer, and its incidence has increased in recent decades. In Queensland, Australia, it has been estimated that half of the population is likely to develop skin cancer during their lifetime (World Cancer Report 2008). In its most severe form (i.e., melanoma), skin cancer can be deadly. Although protection against sunburn (Ziegler et al. 1994) is widely believed to reduce the harmful effects of sun exposure on the skin, many individuals continue to seek out intense sun exposure without such protection (Robinson 1990). This may be due in part to belief that their risk of skin cancer is too low to be a serious concern. Such beliefs can be maintained and rationalized by the observation that many individuals exposed to such risk do not experience the adverse event. If individuals could be provided with personalized information about their individual risk, it might promote greater use of preventive measures among those at greatest risk.
Although ultraviolet (UV) exposure and light skin pigmentation are major risk factors for all types of skin cancers [e.g., it is estimated that 80% of melanoma is caused by ultraviolet damage to sensitive skin (IARC 1992)], evidence suggests that genetic factors can also play a role, independent of skin pigmentation. Predictive models are usually based on standard covariables and family history. Additionally, several genetic variants, such as the MC1R, ASIP, TYR, EXOC2, and UBAC2 and the 1p36 and 1q42 loci, have been shown to be associated with basal and squamous cell carcinomas, as well as with melanomas, independent of skin pigmentation (Gudbjartsson et al. 2008; Stacey et al. 2008). These variants typically account for a small proportion of genetic-based disease risk (Han et al. 2006; Pharoah 2008).
As with other phenotypic and disease traits, the inability of loci discovered by genome-wide association studies (GWAS) to explain a substantial proportion of heritability has led to much debate regarding where this so-called “missing heritability” lies (Manolio et al. 2009). It has been suggested that the underlying genetic architecture of many human traits and diseases may involve a substantial number of small-effect genes, thus conforming to the so-called infinitesimal model of quantitative genetics (Fisher 1918; Bulmer 1980; Lander and Schork 1994; Goddard and Hayes 2007). However, in most genetic risk prediction models currently being tested, only a few [i.e., <500 single-nucleotide polymorphisms (SNPs)] statistically significant SNPs are included.
The recognition that complex human traits and diseases could be affected by a large number of genes has motivated many researchers in other fields (Lee et al. 2008; Wray et al. 2008; Purcell et al. 2009; Hill 2010; Yang et al. 2010; de los Campos et al. 2010a) to propose the use of statistical methods tailored for the prediction of complex traits. These methods, largely developed in the field of animal breeding, were first proposed by Meuwissen et al. (2001), who suggested predicting genetic factors by regressing phenotypes on a large number of markers covering the entire genome. The markers are assumed to be in linkage disequilibrium (LD) with one or many loci affecting the phenotypic traits, and the estimates of individual marker effects are expected to be small (Goddard and Hayes 2007). Such models and variations thereof have been used successfully in animal and plant breeding for prediction of production-related traits (de los Campos et al. 2009; Hayes et al. 2009; VanRaden et al. 2009; Crossa et al. 2010; Weigel et al. 2010). More recently, several authors have proposed and used this methodology for the prediction of complex human traits such as height (Yang et al. 2010; Makowsky et al. 2011) and several cancer outcomes (Vazquez 2010).
In this study, we determine whether genetic predisposition to skin cancer could be used to predict disease outcome. Compared with height, skin cancer is less heritable, more complex, and highly relevant. Due to these features, we compared different methods to account for genetic susceptibility to skin cancer: (1) pedigree-based and SNP-based predictions via (2) whole-genome prediction (WGP) for liability to skin cancer, using thousands of evenly distributed markers across the genome and via (3) the principal components of a subset of independent SNPs (previously used to predict geographical origin). To perform this study, we extended the Bayesian LASSO (Park and Casella 2008) regression with a probit link (Dempster and Lerner 1950) to model skin cancer.
The data set consists of 5132 participants from the Framingham Heart Study, which has collected phenotypic information across three generations of families (Dawber et al. 1951, 1963). Subjects in this study have been characterized every other year from adulthood to death on risk factors, outcomes of physical exams, and disease status. Participants included in our study belong to the original cohort (n = 1498) and to the offspring cohort (n = 3634), with a total of 2319 males and 2813 females, all of whom were genotyped. Subjects from the third generation cohort were not included in our study because the follow-up period of this cohort was too short. The skin cancer outcomes were collected by Bernard E. Kreger (Boston University, Boston, study accession no. pht000039) (Kreger et al. 1991). The study declares cancerous all subjects whose pathology reports confirm their cancer. After the 1980s, cases were validated using medical records. The available data represent primary tumors only. The skin cancer outcomes study was updated in 2006, containing life-long follow up, i.e., 1948–2006 for the original cohort and 1971–2006 for the offspring cohort.
All subjects were genotyped for SNPs with the Affymetrix 500K chip. Due to computational limitations (memory requirements), we fitted models using 41,188 evenly spaced SNPs. Evidence from U.S. Holstein cattle has indicated that predictive ability for several traits does not increase markedly when using >10,000 SNPs for a panel of 50,000 maximum (Vazquez et al. 2010). Nevertheless, the degree of linkage disequilibrium differs across species. However, a recent study on human height with the same data set indicated that, in this population and with this sample size (n = 5117), the increase in predictive ability when using >30,000 SNPs was limited (Makowsky et al. 2011).
The outcome () was defined as presence ( = 1) or absence ( = 0) of skin cancer at any site, excluding skin of labia majora, vulva, penis, and scrotum, as a primary tumor site. We modeled probability of skin cancer using the probit link or threshold model (Dempster and Lerner 1950; Harville and Mee 1984). Here, probability of disease equals the standard normal cumulative density function, , evaluated at a subject-specific risk score, with either model
which was represented as the sum of an intercept () plus a regression on the “fixed effects” of sex (as dummy variable), cohort (a factor with three levels), and ethnicity covariates (explained below) , plus either a “random effects” regression on marker genotypes or a random effects being the genetic liability to skin cancer derived based on pedigree connections. Therefore, the joint conditional probability of the data, , given the unknown regression coefficients, and , was
Assigning a prior density to the vector of model unknowns, β, and u, completes the Bayesian model. We assigned a flat prior to the intercept and to the effects of sex, cohort, and ethnicity covariates. This yielded estimates of effects comparable to those obtained with maximum likelihood and used the Bayesian Lasso of Park and Casella (2008) to structure the prior density of marker effects. This prior density yielded shrunken estimates of marker effects and has been successfully used for WGP (de los Campos et al. 2009). The joint prior density was
where is a normal density assigned to the marker effect; with prior mean equal to zero and marker-specific prior variance (); is an exponential prior for variances of marker effects ; is a gamma distribution for with shape and rate parameters and , respectively; is a normal distribution for u with mean and variance × A, which is the additive genetic relationship matrix based on the pedigree; and finally is an inverse chi-square distribution for with parameters S and . In the analysis, these hyperparameter values were set to ; this gives a relatively flat prior over a wide range of values of the regularization parameter (see Pérez et al. 2010 for further details), as S and were 0.19 and 5, respectively. Models were fitted using a modified version of the BLR package (de los Campos and Pérez 2010) of R (R Development Core Team 2010), which can be used for regressions for binary outcomes according to the model described above.
Using the specification described above, we defined a sequence of models and evaluated the performance of each model, using cross-validation. Our baseline model (covariates) included only the effects of sex and cohort. This model was first extended by adding to the regression the effects of the first two principal components of a set of 1000 European ethnicity-informative SNPs previously reported (model denoted as PC-SNP). The panel of ethnicity-informative SNPs used here was those reported by Drineas et al. (2010). Figure 1 shows a scree plot of the first 20 eigenvalues derived from the markers reported by Drineas et al. In a number of studies, the first two principal components (PC1 and PC2) of a large set of genetic variants have been shown to be effective predictors of the ancestral/geographical origin (latitude and longitude) of individuals of European descent (Price et al. 2008; Tian et al. 2008; Novembre et al. 2008; Drineas et al. 2010). For this reason we included two PCs in the model, even though PC2 is only slightly higher than the following PCs. Additionally, the covariates model was extended by adding subsets of SNPs (250, 500, 1000, 50,000, 10,000, and 41,000) distributed over the whole human genome; these models are denoted as 0.25K-SNP, 0.5K-SNP, 1K-SNP, 5K-SNP, 10K-SNP, and 41K-SNP, respectively. The 41,000 SNPs were obtained by choosing ~1 of every 12 SNPs from the original SNP panel. The SNPs in the smaller sets are all included in the larger sets. In this series of increasingly denser models, we aim to discover how many markers are needed to increase the predictive ability. Finally, we extended the covariates model by adding a random effect representing a regression on the pedigree, and this model was denoted as pedigree.
Probabilities and odds ratio for the relative risk of developing skin carcinoma for groups of sex, cohort, and ethnicity were estimated with the PC-SNP model at a fixed level of the dichotomous covariables and fixing the principal components at the mean, first, and third percentile values of the eigenvectors of PC1 and PC2 (see Table 2). Similarly, probabilities and odds ratio of the genetic effects were estimated with the 41K-SNP model for differences in the first and third percentiles of the genetic effects for fixed levels of sex and cohort.
Models were compared based on prediction accuracy, evaluated in a 20-fold cross-validation with subjects assigned to folds at random. The 20-fold cross-validation yielded predictions of risk scores that were derived without using the ith observation or any of the observations assigned to the same fold to which the ith observation was assigned. Using the pairs of points , we estimated false positive rate and area under the receiver operating characteristic curve (AUC) (see Fawcett 2006), using the R package ROCR (Sing et al. 2005).
To assess the uncertainty of the AUC estimate due to sampling variability, we performed 500 random partitions in training and testing sets with sizes equal to those of the cross-validation folds (testing n = 257 subjects and training n = 4875 subjects), maintaining the sizes of the subsets in the cross-validation (i.e., on each replicate, 5% of the individuals were randomly assigned to testing and the remaining 95% to training). Each replicate yielded an estimate of AUC by model, and variability across replicates was reflective of uncertainty due to sampling of training and testing data sets. From this analysis, we reported the number of times one model outperformed another for AUC.
The skin cancer incidence in our data set was 14.1% for the entire period evaluated, starting at 1948 with the original cohort and at 1971 with the offspring cohort, and followed until the 2006 update. Incidence did vary, however, across sex and cohorts. The incidence was higher in males (16%) than in females (13%) and higher in the original cohort (17%), which had a longer follow-up period, than in the offspring cohort (13%). The first two eivenvectors of the PCs decomposition derived from 1000 ethnicity-informative SNPs are displayed in Figure 2A. Figure 2B shows the empirical distribution of PC1 for individuals with and without skin cancer. PC1 has the highest discriminating power (Figure 2A) and, at the marginal level, lower values at the eigenvector were associated with higher incidence of skin cancer (Figure 2B). This PC1 has been reported to track northern vs. southern European ancestry (Drineas et al. 2010). Further evidence of the marginal association of incidence of skin cancer and PC1 is given in Table 1, where average incidence of skin cancer is presented by quartiles of PC1 and PC2.
In the PC-SNP model, the effects of cohort, sex, and the first two PCs are estimated jointly. We found that the estimated coefficient for male sex in the liability scale was 0.18 [0.09, 0.26] (posterior mean and 95% credibility region in brackets). This estimate implies an increased higher risk of developing skin cancer in males, relative to females. The estimated coefficient for the original cohort with respect to the offspring cohort as baseline was 0.32 [0.23, 0.42], also in liability scale. This implies higher risk of developing skin cancer for members of the original cohort and likely reflects the effect of a longer follow-up period for members of this cohort (original cohort started 23 years before the offspring cohort). The estimated coefficients for the first and second PCs were −11.19 [−14.47, −7.87] for PC1 and 12.97 [9.41, 16.62] for PC2, indicating that risk increases as PC1 decreases and as PC2 increases. All 95% confidence regions for the estimated effects did not include zero, showing evidence of nonnull effects of the predictors considered on skin cancer risk. All the estimates presented above are in the scale of the linear predictor (or liability scale). Given the nonlinearity of the model, these results are difficult to interpret. Table 2 shows the estimated probability of skin cancer risk (and estimates of 95% posterior credibility regions) for different combinations of the predictor variables using the PC-SNP model. The 95% credibility regions of the probability estimates for the two cohorts and for gender do not overlap, suggesting significant differences for these predictors. The odds ratio for the cohort variable is 1.76 [1.50, 2.10] (original relative to offspring cohort) in males and 1.81 [1.52, 2.18] in females, while the odds ratio for sex is 1.36 [1.17, 1.58] (male relative to female) in the original cohort and 1.39 [1.19, 1.64] in the offspring cohort. All were estimated at the mean value of the two PCs.
When the covariates model was extended by adding genetic effects connected by the pedigree, the predicted genetic effects in the liability scale ranged between −0.5 and 1.37 (Figure 3). Likewise, the covariates model was also extended by adding the joint effects of SNPs evenly spaced along the genome (from 250,000 to 41,000). In models including genome-wide SNPs, the total contribution of all SNPs, to the cancer risk, is summarized by the linear score , where are marker genotypes and are estimates of marker effects. In our sample, this score () ranged between −1.07 and 2.7 for the model with 41,000 markers (Figure 3). These results suggest the existence of variation due to genetic factors that was captured by either the pedigree or the markers. Figure 3 shows the predicted genetic scores derived from the pedigree and the whole genome regression (WGR) (41K-SNP) (both derived from models fitted to the entire data set). The correlation between the genetic scores was 0.783. Both scores exhibit a bimodal distribution: the group with lower scores corresponds to individuals with no personal or family history of skin cancer, while the second group primarily includes individuals with some personal or family history of skin cancer. The within-group dispersion of the pedigree-based score around the mean of the clusters is much smaller than that of the score derived from the WGR. This occurs because, although WGR captures family history, this approach also allows for the borrowing of information across nominally unrelated individuals.
The estimates presented in the preceding section indicate that all the predictor variables are significantly associated with the risk of developing skin cancer. Additionally, we evaluated the prediction accuracy of each of the models with a 20-fold cross-validation to assess how useful each of these models is in the assessment of risk in individuals with yet-to-be-observed skin cancer outcomes. Figure 4 shows the AUC obtained in the 20-fold cross-validation for (Figure 4A) the covariates, pedigree, PC-SNP, and 41K-SNP, and for (Figure 4B) genomic-enabled models at increasing SNP density from zero to 41,000. The covariates model had an AUC of 0.534 (baseline model). Accounting also for the pedigree relations yielded an AUC of 0.579, improving the AUC by 8.4% [calculated as ]. The PC-SNP had an AUC of 0.622, 16.5% higher than that of the baseline model. Finally, the 41K-SNP model had an AUC of 0.635, 18.9% higher than that of the baseline model. The evaluation of the AUC of the models including sex, cohort, and varying numbers of evenly spaced SNPs showed a monotonic increase in AUC with the number of SNPs, from the covariates model (with zero SNPs) to 41K-SNP (Figure 4B).Family relationships between training and testing sets have been shown to affect prediction accuracy (Habier et al. 2010; Pérez-Cabal et al. 2012). To assess the impact of family relationships on the prediction accuracy derived from models incorporating pedigree or markers, we calculated the AUC of each of the models for subjects that did not have data from relatives in the training data sets (n = 871) and for those that had at least one relative in the training data set (n= 4248) (see Table 3). Results show that in the pedigree and the 41K-SNP model, part of the gains in prediction accuracy can be explained by information provided by relatives. However, as shown in Figure 4A, the 41K-SNP model outperformed the pedigree model, and in Figure 4B the 41K-SNP model had higher predictive accuracy than all the other models for individuals whose risk was predicted without having any relative in the training data set (first row in Table 3). Therefore, we conclude that although part of the prediction accuracy of the 41K-SNP model can be explained by information coming from relatives, this model is capturing patterns of genetic risk beyond those that can be captured by family history.
The above results suggest that whole-genome markers can increase prediction accuracy of skin cancer susceptibility by a nonnegligible margin. To evaluate uncertainty about these point estimates, we replicated a training–testing evaluation 500 times (see Materials and Methods above). The covariates model was improved by the pedigree model in 70% of the replicates, by the PC-SNP model in 90% of the replicates, and by the 41K-SNP model in 94% of the replicates. The 41K-SNP model had higher prediction accuracy than the pedigree model in 90% of the replicates and was higher in accuracy than the PC-SNP model in 66% of the replicates. Figure 5 shows predictive correlation and AUC results by replicate for two models simultaneously (y-axis and x-axis). At the 45° line, both models performed equally, while above the line, the model represented on the y-axis performed better and vice versa.
Skin cancer is the most frequent form of cancer. Further, nonmelanoma skin cancer is the most frequent type of cancer in light-skinned populations (World Cancer Report 2008). In our sample, skin cancer was also the most prevalent type of cancer. The main risk factors for skin carcinogenesis are ultraviolet light exposure, skin type, and geographical location [e.g., no-melanoma skin cancer is 5 times higher in the United States and 20–40 times higher in Australia than in Europe (World Cancer Report 2008)]. In our study, geographical location was not a source of variation, since all data came from the same geographical location (Framingham, MA). However, the harm from sun exposure may have increased over time due to, e.g., atmospheric changes. Indeed, it has been estimated that incidence of nonmelanoma skin cancer increased by 77% from 1992 to 2006 (Stern 2010, p. 13). The estimated effect of cohort in our data might be simultaneously reflecting two factors that may have opposing effects: subjects from the original cohort have had longer exposure (offspring cohort data collection started 23 years later) and year of exposure (since skin cancer incidence has increased over the past 50 years). Gender differences have also been reported in the literature, and they indicate a lower incidence among females (Diepgen and Mahler 2002). Our results are consistent with this (Table 2). The association between sex and risk of developing skin cancer has been largely attributed to different lifestyles (e.g., males are more exposed to sun and less likely to use sun protection) (McCarthy et al. 1999). Additionally, there is evidence of sex-based biological differences at the skin level, relevant to skin cancer liability (Thomas-Ahner et al. 2007).
Previous studies (Han et al. 2006; Gudbjartsson et al. 2008; Stacey et al. 2008; World Cancer Report 2008) have indicated that genetic factors play an important role in predisposition to skin cancer. However, predictive models for skin cancer, although accurate, do not usually account for genetic factors (e.g., Soong et al. 2010). In this study, we show that considering genetic information, under the form of familial relationships, SNP-derived PCs or WGP using markers evenly distributed in the genome can increase the prediction accuracy of risk of developing skin cancer.
Simply considering family history, under the form of pedigree relationships linked to phenotypes, increased prediction accuracy. However, there is a limit to how much prediction accuracy can be gained by considering family history alone. One of the limitations of using pedigree connections in a predictive model is that family size in humans is usually small. Our sample, for instance, has some unrelated individuals. Other limitations of models using pedigrees are (1) they capture important elements of genetic variability, such as variability due to substructure or admixture, and (2) these models cannot describe genetic differences between individuals with identical pedigree (e.g., full sibs) due to sampling of alleles at meiosis. Therefore, describing genetic background using markers can potentially boost prediction accuracy above and beyond what can be achieved using family history. In agreement with previous studies in animals (VanRaden et al. 2009), our study confirms this and suggests that simply considering two-SNP–derived PCs can increase prediction accuracy substantially. Skin color, and therefore ethnicity, is known to be highly correlated with skin cancer (World Cancer Report 2008), and skin color varies even among individuals of European descent. We found that cancer incidence was high at low levels of PC1 and at high levels of PC2. In previous studies, PC1 among Europeans has been found to correspond to ancestry along the northwest to southeast European geographical axis (Campbell et al. 2005; Novembre et al. 2008; Drineas et al. 2010). Therefore, one possible explanation of the increase in prediction accuracy obtained by considering two-SNP–derived PCs is that these PCs are capturing ancestry, which correlates with skin color and with risk of developing skin cancer.
Naturally, there is also a limit in the proportion of genetic variability at causal loci that can be explained by two PCs. Our study confirms this: the WGP model using 41K-SNP outperformed all the other models we considered, including the one with PCs. The prediction accuracy of WGP increases monotonically with marker density, a finding that is consistent with the hypothesis that genetic risk to skin cancer is affected by a large number of variants. A polygenic genetic architecture has also been suggested for other human traits (e.g., Vattikuti et al. 2012).
Empirical evidence for complex traits in animals (Vazquez et al. 2010) and humans (Makowsky et al. 2011) has shown that prediction accuracy increases with marker density, and our results are consistent with this. However, prediction accuracy depends on many other factors, perhaps most importantly on the size of the training set (n) (VanRaden et al. 2009). Preliminary evidence of (unpublished) studies we conducted with human height suggests that the level at which the curve relating prediction accuracy and marker density reaches a plateau is highly dependent on marker density. Prediction accuracy also depends on the selection criteria of the markers incorporated as predictors in the model (Vazquez et al. 2010). In that study, the predictive correlation obtained with a set of 300 markers highly associated with the trait of interest was, on average for six traits in cattle, 0.18 higher than that obtained with a set of 300 evenly spaced markers. Here, markers are evenly distributed, representing the whole genome, and are not particularly associated with skin cancer. We expect predictive accuracy to increase if the markers were selected based on their association with the disease. Therefore, we speculate that further increases in marker density, targeting markers associated with skin cancer, accompanied by increases in sample size, could increase prediction accuracy of WGP even further.
The WGR models implemented in this study account for additive effects of markers. Potentially, these additive models could be extended to account for interactions of alleles within loci (i.e., dominance) and between loci (i.e., epitasis). With p markers, modeling additive and dominance effects involves estimating 2p effects, and this can be done by using the methods similar to those described in this article. However, modeling epistatic interactions is much more difficult because the number of contrasts required and, consequently, the number of parameters to be estimated grow exponentially with the number of markers and the order of the interaction. Alternatively, one can attempt to capture departures from the linear model, using WGP with nonparametric procedures, such as penalized neural networks or reproducing kernel Hilbert spaces (Gianola et al. 2006; de los Campos et al. 2010b). However, even in cases where complex interactions among alleles hold at the causal level, a large proportion of interindividual differences in genetic risk may manifest as additive variance (Hill et al. 2008), and the information provided by data for estimation of nonadditive effects may be small (Hill et al. 2008). Because of this, it is not necessarily the case that use of models that account for nonadditive effects will yield higher prediction accuracy than that of an additive model.
Although accurate in predicting survival rate once the signs of the disease are present, previous predictive models for skin cancer do not account for genetic susceptibility factors (Soong et al. 2010), and therefore they have limited use for preventive measures that can be applied early in life. In our study, prediction substantially improved by using genetic parameters in the predictive models. Further, methods including genome-wide markers information outperformed models with genetic risk estimates derived from the pedigree. WGP is a promising tool for estimating individual genetic predisposition to skin cancer before it is detected or even developed. We speculate that genomic information may be used to prospectively identify individuals with particularly high risk of developing skin cancer.
We gratefully acknowledge scientists conducting the Framingham Heart Study for collecting, analyzing, and providing the genotypic and phenotypic data used in the present study. We also acknowledge the invaluable dedication of all participants in the Framingham Heart Study for their contribution over many years. We acknowledge National Institute of Health grants R01GM077490, R01GM099992, and R01GM101219, and a Kraft grant, as well as the Wisconsin Agriculture Experiment Station, for their financial support.
Communicating editor: F. Zou