
**|**HHS Author Manuscripts**|**PMC3741669


Sci Transl Med. Author manuscript; available in PMC 2013 August 13.

Published in final edited form as:

Published online 2012 April 2. doi: 10.1126/scitranslmed.3003380

PMCID: PMC3741669

NIHMSID: NIHMS505281

Nicholas J. Roberts,^{1,}^{*} Joshua T. Vogelstein,^{2,}^{*} Giovanni Parmigiani,^{3} Kenneth W. Kinzler,^{1} Bert Vogelstein,^{1,}^{†} and Victor E. Velculescu^{1,}^{†}

The publisher's final edited version of this article is available at Sci Transl Med


New DNA sequencing methods will soon make it possible to identify all germline variants in any individual at a reasonable cost. However, the ability of whole-genome sequencing to predict predisposition to common diseases in the general population is unknown. To estimate this predictive capacity, we use the concept of a “genometype”. A specific genometype represents the genomes in the population conferring a specific level of genetic risk for a specified disease. Using this concept, we estimated the capacity of whole-genome sequencing to identify individuals at clinically significant risk for 24 different diseases. Our estimates were derived from the analysis of large numbers of monozygotic twin pairs; twins of a pair share the same genometype and therefore identical genetic risk factors. Our analyses indicate that: (i) for 23 of the 24 diseases, the majority of individuals will receive negative test results, (ii) these negative test results will, in general, not be very informative, as the risk of developing 19 of the 24 diseases in those who test negative will still be, at minimum, 50 - 80% of that in the general population, and (iii) on the positive side, in the best-case scenario more than 90% of tested individuals might be alerted to a clinically significant predisposition to at least one disease. These results have important implications for the valuation of genetic testing by industry, health insurance companies, public policy makers and consumers.

As a result of continuing advances in high-throughput sequencing technologies (1–4), whole-genome sequencing will soon become an affordable approach to identify all sequence variants in an individual human. Recent evidence suggests that each human genome has more than 3 million sequence variants, some common, some infrequent (5). To date, several thousand genomic variants have been associated with human diseases, either as rare variants in Mendelian disorders or as common SNPs in genome-wide association studies (GWAS) (6, 7). Whole-genome or whole-exome sequencing has recently been used to identify new disease predisposing variants in various familial disorders, such as familial pancreatic cancer (8) and Miller syndrome (9). However, the potential utility of genome-wide sequencing for personalized medicine in the general population is unclear. Suppose, for example, that sequencing becomes sufficiently inexpensive that all individuals, at birth, could have their genomes sequenced at negligible cost. What fraction of the population would benefit from such sequencing? “Benefit” in this context is defined as receiving information indicating that the risk of disease is increased or decreased to a degree that would alter an individual's lifestyle or medical management.

On the surface, it might seem impossible to answer this question at present, as there are millions of genetic variants in every individual and the contribution of nearly all of these variants to any disease is unknown. However, there is one group of individuals in which this question can be immediately addressed: monozygotic twin pairs. If one twin of the pair has a disease, then the probability of the other twin developing that disease is dependent on the genome whenever that disease has some genetic component. We show below that when this logic is applied to large numbers of twins, estimates of the potential benefits of genome-wide sequencing in the general (non-twin) population can be made.

The key to our analysis is the concept of a “genometype”. We do not know the genomic sequences of the twin pairs analyzed in the studies described herein, but we do know that each twin pair shares a nearly identical genome (10) and that a genome confers a particular genetic risk to every disease. For each disease, we group genomes that confer identical genetic risks into genometypes. For example, genometypes could be grouped into 20 bins, with genometypes in bin 1 conferring zero genetic risk, genometypes in bin 2 conferring 3% genetic risk, genometypes in bin 3 conferring 10% genetic risk, etc. We can then estimate what distributions of genometypes in the population best reflect the observed monozygotic twin concordancy and discordancy for any given disease.

In twin studies on diseases, heritability (defined in Box 1) is generally based on the difference in the incidence of a disease in monozygotic *versus* dizygotic twins (11, 12). Heritability reflects the average genetic contribution to disease in a twin population. We are interested in the distribution of genetic risks rather than the average. For example, a 30% average risk could reflect a small fraction of twin-pairs with genometypes conferring high genetic risk or a larger fraction of twin-pairs with genometypes conferring a moderate genetic risk. Among all the distributions of genometypes that are compatible with the twin epidemiologic data, we wished to find the distributions that maximized or minimized the potential clinical utility of identifying those genometypes by genomic sequencing.
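The difference between an average risk and a distribution of risks can be made concrete with a small numerical sketch. The two distributions below are hypothetical, chosen only so that both average to 30%; the function names are illustrative, and the calculation assumes, for simplicity, that non-genetic risk is zero and that the two twins of a pair are affected independently given their shared genometype.

```python
def mean_genetic_risk(f, r):
    """Average genetic risk across genometypes: sum_i f_i * r_i."""
    return sum(f_i * r_i for f_i, r_i in zip(f, r))


def both_twins_affected(f, r):
    """Probability that both twins of a pair are affected.

    Illustrative simplification: non-genetic risk is ignored, and the two
    twins are affected independently given their shared genometype.
    """
    return sum(f_i * r_i ** 2 for f_i, r_i in zip(f, r))


# Hypothetical distributions, chosen only so that both average to 30%.
few_high = {"f": [0.7, 0.3], "r": [0.0, 1.0]}       # 30% of pairs at 100% risk
many_moderate = {"f": [0.4, 0.6], "r": [0.0, 0.5]}  # 60% of pairs at 50% risk

for name, dist in (("few_high", few_high), ("many_moderate", many_moderate)):
    print(name, mean_genetic_risk(**dist), both_twins_affected(**dist))
```

Both distributions have the same mean, yet they imply different proportions of concordant pairs, which is why twin concordance data can constrain the distribution and not only the average.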

Whole-genome sequencing-based tests, like any genetic test, can be informative in two ways: negative and positive tests would indicate a substantially lower or higher risk, respectively, than that of the general population. The challenge is to define “substantially” in clinically meaningful and quantitative terms. An example might help put this challenge into perspective. Suppose a woman receives a whole-genome test result indicating that she has a 90% lifetime risk (the total risk over her entire life) of developing breast cancer. She may decide to have a prophylactic double mastectomy to prevent this outcome. Similarly, if the test indicated an 80% or even a 50% lifetime risk of developing breast cancer, she may consider mastectomy. On the other hand, if the test indicated only a 14% risk of developing breast cancer, then mastectomies would be considered by very few women, given that most women today do not opt for prophylactic mastectomies even though the lifetime risk of developing breast cancer in the general population is 12%.

This example illustrates that the risk threshold required for clinical utility represents a balance between the risk reduction afforded by an intervention and its negative consequences. A precedent exists for defining this threshold, in that the decision to implement genetic tests is often based on a positive predictive value (PPV) of at least 10%, implying that more than 1 in 10 patients with a positive test result are expected to develop disease (13). While the choice of this threshold will depend on the specific intervention and should ideally be left to the individual, we use this 10% threshold for our population-level analyses of 20 of the 24 diseases analyzed (table S1). In the other four diseases (chronic fatigue syndrome, gastro-esophageal reflux disorder, coronary heart disease-related death and general dystocia), which occur at relatively high frequency in the population, this 10% threshold is inadequate to distinguish individuals with a significantly increased genetic risk from the rest of the population. For these four diseases (table S1), a more appropriate threshold corresponds to one conferring a genetic risk that is at least as great as that of the non-genetic component. Individuals with genometypes conferring this degree of genetic risk would therefore have a total risk at least twice as large as those without any genetic predisposing factors. This 2x threshold in relative risk is similar to those widely used as clinical benchmarks for common diseases (14–18).

For whole-genome testing in healthy individuals, we thereby defined a threshold at which a positive test result would be clinically meaningful as follows. If the non-genetic risk was <5%, then the threshold was set at 10%. If the non-genetic risk was >5%, then the threshold was set at 2x the non-genetic contribution. Though we have used these particular thresholds in most of the examples described below, we also describe how these results varied when other thresholds were considered.
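The threshold rule just described can be written as a short function. This is a minimal sketch; the function name is hypothetical, and the treatment of a non-genetic risk of exactly 5% is an assumption, since the text specifies only the cases below and above 5% (both rules give 10% at that point).

```python
def clinical_threshold(non_genetic_risk: float) -> float:
    """Total-risk threshold for a clinically meaningful positive test.

    Rule from the text: 10% when the non-genetic risk is below 5%,
    otherwise twice the non-genetic risk.
    """
    if not 0.0 <= non_genetic_risk <= 1.0:
        raise ValueError("non_genetic_risk must be a probability in [0, 1]")
    # Assumption: at exactly 5% the two rules coincide (2 x 5% = 10%).
    if non_genetic_risk < 0.05:
        return 0.10
    return 2.0 * non_genetic_risk
```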

We collated monozygotic twin pair data from the Swedish Twin Registry, Danish Twin Registry, Finnish Twin Cohort, Norwegian National Birth Registry and the National Academy of Science – National Research Council World War II Veteran Twins Registry (19–31) (Table 1). From these registries, we selected data representing 24 diseases of diverse etiologies including autoimmune diseases, cancer, cardiovascular diseases, genitourinary diseases, neurological diseases and obesity-associated diseases. Three of these conditions (coronary heart disease, cancer and stroke) represent the leading causes of mortality in the United States, accounting for 54.2% of total deaths in 2007, and are therefore of major public health importance (32). The thresholds for a clinically meaningful test result, as defined above, were calculated from disease prevalence and non-genetic risks in the populations from which the twins were drawn (19–31) (Materials and Methods, Table 1 and table S2).

We then developed computational methods to evaluate possible frequency (*f*) and genetic risk (*r*) combinations for a population containing 20 genometypes. Genometype frequency is defined as the proportion of twin pairs in the population that have a given genometype (Box 1). Genometype genetic risk is defined, for each disease, as the absolute increment in risk that an individual with that genometype will face compared to someone with no genetic risk at all (Box 1). For any combination of genometypes, each with a certain frequency and genetic risk, we obtain an expected distribution of disease-affected individuals among a monozygotic twin cohort. Many different combinations of genometype frequencies and genetic risks match the observed distributions in monozygotic twins; we are interested in those combinations (distributions) that maximize or minimize clinical utility, thus putting bounds on the expectations from whole-genome sequencing. The mathematical framework for our study, and associated statistical and technical issues, are detailed in the Materials and Methods.

These analyses allowed us to address various measures of potential clinical utility. First, for each disease, what is the maximum and minimum fraction of patients with the disease that would receive a positive test, i.e., a result indicating that they have a substantially increased risk of that disease? The answers to this question are graphically shown in Fig. 1 for each of the 24 diseases (for three diseases, we present different answers for males and females, resulting in a total of 27 disease categories). As can be seen from Fig. 1, the fraction of patients that would receive a positive test varies widely from disease to disease. The majority of patients (>50%) who would ultimately develop 13 of the 27 disease categories would *not* test positive, even in the best-case scenario. On the other hand, there were four disease categories - thyroid autoimmunity, type I diabetes, Alzheimer's disease, and coronary heart disease-related deaths in males - for which genetic tests might identify more than 75% of the patients who ultimately develop the disease. Genometype risk and frequency distributions for all diseases are shown in table S3 and graphically for representative diseases in fig. S1.

We could also determine the maximum and minimum fraction of individuals in the population (rather than the fraction of patients with disease) who would receive positive test results for each disease. As shown in Fig. 2, this fraction is generally small, as expected, because the incidence of most diseases is relatively low. Do these negative tests, which would be received by the great majority of individuals for most diseases, have value? Negative tests could be valuable to individual patients if they indicated a considerably lower total risk than would be assumed in the absence of testing. As can be seen from Fig. 3, though, negative tests are generally not very informative in the case of whole-genome sequencing as they are limited by the non-genetic component of risk. For 22 of the 27 disease categories studied, a negative test would not indicate a risk that is less than half that in the general population, even in the best-case scenario. This level of risk reduction is probably not sufficient to warrant changes of behavior, lifestyle, or preventative medical practices for these individuals (33, 34). On the other hand, there was one disease category (Alzheimer's disease, Fig. 3) in which a negative test result might indicate as little as a ~12% relative risk of disease compared to the entire twin cohort, at least in the best-case scenario. Knowledge of such a reduced risk might be comforting and relieve anxiety, particularly to those with a family history of Alzheimer's disease.

What is the maximum fraction of individuals that could receive at least one positive test result, i.e., a report indicating that s/he is at risk for at least one of the 24 diseases assessed? From the data depicted in Fig. 2, we estimate that >95% of men and >90% of women could receive at least one positive test result if the risk alleles were actually distributed in the way that produced maximal sensitivity in our model. We assumed that the risk alleles for these 24 diseases were independent in these estimates; if they were not independent, then these figures represent overestimates. On the other hand, these frequencies may represent underestimates as there are a number of additional diseases with hereditary components that have not yet been studied in monozygotic twins or included in our analyses. At the very least, if we consider only distinct disease categories whose pathogenesis is unlikely to be shared, our analyses suggest that, in the best-case scenario, the majority of tested individuals might be alerted to a clinically meaningful risk by whole-genome sequencing.
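Under the independence assumption stated above, the probability of at least one positive result is one minus the product of the per-disease probabilities of a negative result. A minimal sketch follows; the function name is hypothetical and the input values in the usage note are illustrative, not taken from Fig. 2.

```python
from math import prod


def prob_at_least_one_positive(positive_fractions):
    """P(at least one positive) = 1 - prod_i (1 - p_i), assuming independence.

    positive_fractions: per-disease probabilities that a tested individual
    receives a positive result.
    """
    fractions = list(positive_fractions)
    for p in fractions:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each fraction must be a probability in [0, 1]")
    return 1.0 - prod(1.0 - p for p in fractions)
```

For example, two independent tests that are each positive in half of the population give `prob_at_least_one_positive([0.5, 0.5]) == 0.75`. If the tests were positively correlated, the true value would be lower, which is the sense in which the independent figure is an overestimate.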

It was of interest to determine how the results described above varied with the threshold chosen for the analysis. For example, it might be argued that a threshold of 10% was too low for true clinical utility. Our analyses show that the maximum fraction of affected cases testing positive, as well as the maximum fraction of the total population that tests positive, changes little when the threshold is raised to 20% (tables S4 and S5). With very high thresholds, however, both these measures of sensitivity decrease significantly (tables S4 and S5). Moreover, the maximum predictive value of a negative test drops precipitously at higher thresholds (table S6).

The general public does not appear to be aware that, despite their very similar height and appearance, monozygotic twins in general do not always develop or die from the same maladies (35, 36). This basic observation, combined with extensive epidemiologic studies of twins and statistical modeling, allows us to estimate upper and lower bounds of the predictive value of whole-genome sequencing.

On the negative side, our results show that the majority of tested individuals would receive negative tests for most diseases (Fig. 2). Moreover, the predictive value of these negative tests would generally be small, as the total risk for acquiring the disease in an individual testing negative would be similar to that of the general population (Fig. 3). On the positive side, our results show that, at least in the best-case scenario, the majority of patients might be alerted to a clinically meaningful risk for at least one disease through whole-genome sequencing.

These conclusions are consistent with what is now known about risk allele loci from genome-wide association studies (GWAS) (37). In general, GWAS have shown that many loci can predispose to disease and that each risk allele confers a relatively small effect (38, 39). For example, a recent analysis of large cohorts of individuals with colorectal cancer showed that only ~1.3% of phenotypic variance could be accounted for by the 10 loci discovered through GWAS (40). However, it could be argued that the relatively low level of utility that might be inferred from such studies is misleading. In particular, it is possible that a more complete knowledge of disease-associated variants and their epistatic relationships would be able to reliably predict who will and who will not develop disease in the general population. Our results allow us to estimate the *maximum possible* reliability of such tests.

Several of our conclusions are based on the genometype frequency and risk distributions that would *maximize* the clinical utility of genetic testing, i.e., are best-case scenarios. The actual frequency and risk distributions of genometypes in the population are not likely to be distributed in this way. Indeed, other distributions are also consistent with the monozygotic twin data on which our maxima are determined and all other distributions yield less clinical utility than those of the maxima, as shown in Figs. 1 to 3. Moreover, in the real world, it is unlikely that the biomedical correlates of every genetic variant and the epistatic relationships among these variants will ever be completely known, or that the analytic validity of genetic testing will be perfect - as we assume in our ideal scenario. Thus, our conclusions purposely overestimate the value of whole-genome sequencing that will be achieved - they represent an absolute upper bound that cannot be improved by improvements in technology or genetic knowledge. As a practical example of this principle, we estimate that a negative whole-genome sequencing-based test *could* indicate a ~two-fold decrease in risk for prostate cancer in men and a similar two-fold decrease for urinary incontinence in women. But this two-fold decrease would only apply in a world in which the risk alleles are distributed in a fashion that maximizes the sensitivity of whole genome testing (Fig. 3). In the real world, the risk alleles are not likely to be distributed in this ideal fashion, and omniscience about every variant is not likely to be realized. Thus, the risk of these diseases in patients who test negative will likely be even more similar to that of the general population. For diseases with a lower heritable component, such as most forms of cancer, whole-genome based genetic tests will be even less informative.
Thus, our results suggest that genetic testing, at its best, will not be the dominant determinant of patient care and will not be a substitute for preventative medicine strategies incorporating routine checkups and risk management based on the history, physical status and life style of the patient.

It is important to point out that our study focused on testing relatively common diseases in the general population and did not address the utility of whole-genome sequencing to identify the genetic basis of rare monogenic diseases. In such unusual cases, it has already been shown that whole-genome sequencing can prove highly informative (8, 9).

As with any model-based study, our conclusions have a number of caveats. Our analyses are based on data from twin studies and the assumptions made therein (11). Specifically, we do not model gene-environment interactions and rely on the prevalence of disease in the twin cohorts; this prevalence, as well as the operative non-genetic contributions, may differ from that in the general population. Though twins are likely to be representative of the general population, the estimates provided by our model could be improved through analyses of larger twin cohorts as these become available, as well as through a more complete phenotypic evaluation of twins of varying ethnicities. Another caveat is that our conclusions about potential utility are based on thresholds that represent a complex balance of personal choices, demographic influences, disease characteristics and the clinical intervention(s) available. We have used a minimum 10% total risk and a minimum relative risk of 2 as the threshold in our analyses. Other thresholds may be more appropriate and meaningful for given situations, though the data in tables S4 to S6 show that our major conclusions are not altered much by the choice of threshold.

In sum, no result, including ours, can or should be used to conclude that whole-genome sequencing will be either useful or useless in an absolute sense. This utility will depend on the results of testing, the individual tested, and the perspectives of individuals and societies. What we hoped to accomplish with this study is to put the debate about the value of such sequencing in a mathematical framework so that the potential merits and limitations of whole-genome sequencing, for any disease, can be quantitatively assessed. Recognition of these merits and limits can be useful to consumers, researchers, and industry, as they can minimize unrealistic expectations and foster fruitful investigations.

We used data from twin studies arising from population-based twin registries to investigate the distribution of disease risk within the population (19–31). The registries in our study included the Swedish Twin Registry, Danish Twin Registry, Finnish Twin Cohort, Norwegian National Birth Registry and the National Academy of Science – National Research Council World War II Veteran Twins Registry. Traits were chosen that represented diverse etiologies or were conditions of significant public health importance. We evaluated diseases in the following categories: autoimmune (T1D, thyroid auto-antibodies), neoplastic (breast, colorectal and prostate cancer), cardiovascular (coronary heart disease-related death and stroke-related death), genitourinary (general dystocia, pelvic organ prolapse, and urinary incontinence), unknown etiology (irritable bowel syndrome, chronic fatigue), neurological (Parkinson disease, Alzheimer's disease and dementia) and obesity-associated (T2D, gallstone disease).

To be included in our analyses, the following data had to be available for each twin study:

- *n*_{t} – total number of monozygotic (MZ) twin pairs where the disease status of each twin was known.
- *n*_{c} – number of disease-concordant MZ twin pairs.
- *n*_{d} – number of disease-discordant MZ twin pairs.
- *n*_{h} – number of healthy-concordant MZ pairs.
- Heritability (*HER*) – calculated as the proportion of the polygenic liability variation associated with genetic factors.

Using the data from population-based twin studies, we define cohort risk (CR) - the fraction of people in the cohort that had the disease - as follows:

$$CR=(2{n}_{c}+{n}_{d})/\left(2{n}_{t}\right)$$

(1)
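Equation (1) counts affected individuals: each concordant pair contributes two cases, each discordant pair contributes one, and the cohort contains two individuals per pair. A minimal sketch, with a hypothetical function name:

```python
def cohort_risk(n_c: int, n_d: int, n_t: int) -> float:
    """Cohort risk from Eq. (1): CR = (2*n_c + n_d) / (2*n_t).

    n_c: disease-concordant MZ pairs (two cases each)
    n_d: disease-discordant MZ pairs (one case each)
    n_t: total MZ pairs with known disease status
    """
    if n_t <= 0:
        raise ValueError("n_t must be positive")
    if n_c < 0 or n_d < 0 or n_c + n_d > n_t:
        raise ValueError("pair counts are inconsistent with n_t")
    return (2 * n_c + n_d) / (2 * n_t)
```

For example, a hypothetical cohort of 100 pairs with 10 concordant and 30 discordant pairs contains 50 cases among 200 individuals, so `cohort_risk(10, 30, 100)` returns `0.25`.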

We define the following generative model that characterizes the joint distribution of an individual having a pre-specified disease and a particular genometype. Each individual is characterized by: (i) a binary (Bernoulli) random variable, *Z*, specifying whether or not s/he has the disease, and (ii) a categorical random variable, *G*, indicating the genometype of the individual. This means that of the *d* assumed extant genometypes, each individual can have only one of them. The joint distribution of both the disease and genometype for an individual is given by *P*(*Z, G*). This joint distribution decomposes into a product of the likelihood of getting the disease given the genometype, *P*(*Z* | *G*), and the prior probability of having the genometype, *P*(*G*)

$$P(Z,G)=P(Z\mid G)P\left(G\right)$$

(2)

Thus, to proceed, we specify both the likelihood function, *P*(*Z* | *G*), and the prior, *P*(*G*). As mentioned above, *G* is a categorical random variable taking values *g*_{1},*g*_{2},...,*g _{d}*, each of which with some probability. Therefore we have:

$$P(G={g}_{i})={f}_{i}$$

(3)

for all *i*=1,2,...,*d*. In words, a person can have one of the *d* assumed extant genometypes, and the probability of having genometype *i* is given by *f _{i}*.

The probability of having the disease given a genometype is *q*_{i} = *e* + *r*_{i}, where *e* is the non-genetic risk and *r*_{i} is the genetic risk conferred by genometype *i*:

$$P(Z=z\mid G={g}_{i})=\begin{cases}e+{r}_{i}, & \text{if } z=1,\\ 1-e-{r}_{i}, & \text{if } z=0.\end{cases}$$

(4)

Thus, the joint distribution of disease and genometype can be written as:

$$P(Z=z,G={g}_{i})={f}_{i}{(e+{r}_{i})}^{z}{(1-e-{r}_{i})}^{1-z},\quad z\in \{0,1\},\quad {g}_{i}\in \{{g}_{1},\dots ,{g}_{d}\}.$$

(5)

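If the available data included the genometype and disease status of each individual, then inferring estimates of the parameters, *r* = (*r*_{1}, ..., *r*_{d}) and *f* = (*f*_{1}, ..., *f*_{d}), would be straightforward. Because the genometypes are not observed, we instead rely on the fact that the two twins of a monozygotic pair share the same genometype.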

The probability of disease concordant monozygotic twins is given by:

$$P({Z}_{j}={Z}_{k}=1\mid {G}_{j}={G}_{k})=\sum _{i}P({Z}_{j}={Z}_{k}=1\mid {G}_{j}={G}_{k}={g}_{i})P({G}_{j}={G}_{k}={g}_{i}),$$

(6a)

$$=\sum _{i}P({Z}_{j}=1\mid {G}_{j}={g}_{i})P({Z}_{k}=1\mid {G}_{k}={g}_{i})P({G}_{j}={G}_{k}={g}_{i})$$

(6b)

$$=\sum _{i}{(e+{r}_{i})}^{2}{f}_{i}.$$

(6c)

Similarly, the probability of healthy concordant monozygotic twin pairs is given by:

$$P({Z}_{j}={Z}_{k}=0\mid {G}_{j}={G}_{k})=\sum _{i}P({Z}_{j}={Z}_{k}=0\mid {G}_{j}={G}_{k}={g}_{i})P({G}_{j}={G}_{k}={g}_{i}),$$

(7a)

$$=\sum _{i}{(1-e-{r}_{i})}^{2}{f}_{i}.$$

(7b)

And the probability of monozygotic twin pairs discordant for disease is given by:

$$P({Z}_{j}\ne {Z}_{k}\mid {G}_{j}={G}_{k})=2\sum _{i}(e+{r}_{i})(1-e-{r}_{i}){f}_{i}.$$

(8)
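As a consistency check on Eqs. (6c), (7b) and (8), the three pair probabilities sum to one. Writing *q*_{i} = *e* + *r*_{i} for the total risk under genometype *i*:

$$\sum _{i}{f}_{i}\left[{q}_{i}^{2}+{(1-{q}_{i})}^{2}+2{q}_{i}(1-{q}_{i})\right]=\sum _{i}{f}_{i}{\left[{q}_{i}+(1-{q}_{i})\right]}^{2}=\sum _{i}{f}_{i}=1.$$

Every twin pair is therefore disease-concordant, healthy-concordant or discordant, with no probability left unaccounted for.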

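For each disease, let *n*_{c}, *n*_{h} and *n*_{d} denote the numbers of disease-concordant, healthy-concordant and disease-discordant monozygotic twin pairs, respectively, among the *n*_{t} pairs studied. Their expected values under the model are: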

$$\mathrm{E}\left[{n}_{c}\right]={n}_{t}\sum _{i\in \left[d\right]}{(e+{r}_{i})}^{2}{f}_{i},$$

(9)

$$\mathrm{E}\left[{n}_{h}\right]={n}_{t}\sum _{i\in \left[d\right]}{(1-e-{r}_{i})}^{2}{f}_{i},$$

(10)

$$\mathrm{E}\left[{n}_{d}\right]={n}_{t}\sum _{i\in \left[d\right]}2(e+{r}_{i})(1-e-{r}_{i}){f}_{i}.$$

(11)
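Equations (9) to (11) translate directly into code. The sketch below is illustrative: the function name and the input checks are assumptions, and the parameter values in the usage note are hypothetical.

```python
def expected_pair_counts(f, r, e, n_t):
    """Expected numbers of MZ pairs from Eqs. (9)-(11).

    f: genometype frequencies (must sum to 1)
    r: genetic risks, one per genometype
    e: non-genetic risk, shared by all genometypes
    n_t: total number of twin pairs
    Returns (E[n_c], E[n_h], E[n_d]).
    """
    if len(f) != len(r):
        raise ValueError("f and r must have the same length")
    if abs(sum(f) - 1.0) > 1e-9:
        raise ValueError("frequencies must sum to 1")
    q = [e + r_i for r_i in r]  # total risk per genometype
    if any(not 0.0 <= q_i <= 1.0 for q_i in q):
        raise ValueError("e + r_i must be a probability")
    p_c = sum(f_i * q_i ** 2 for f_i, q_i in zip(f, q))
    p_h = sum(f_i * (1.0 - q_i) ** 2 for f_i, q_i in zip(f, q))
    p_d = sum(2.0 * f_i * q_i * (1.0 - q_i) for f_i, q_i in zip(f, q))
    return n_t * p_c, n_t * p_h, n_t * p_d
```

With two hypothetical genometypes, `expected_pair_counts([0.8, 0.2], [0.0, 0.5], 0.05, 1000)`, the three expected counts are approximately 62.5, 762.5 and 175, which sum to the 1000 pairs in the cohort.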

Because we are interested in the limits of utility of genetic testing, we search for a parameter set that maximizes or minimizes the fraction of patients that will receive a positive test result, given certain constraints. Formally, we define the positive fraction (*PF*) as the proportion, among twin pairs with at least one disease case, that possess a genometype sufficient to change clinical action. In our notation:

$$PF(t,e;f,p)=\frac{{\sum}_{i\in \left[d\right]\mid {r}_{i}>t}{f}_{i}[{(e+{r}_{i})}^{2}+(e+{r}_{i})(1-e-{r}_{i})]}{{\sum}_{i\in \left[d\right]}{f}_{i}[{(e+{r}_{i})}^{2}+(e+{r}_{i})(1-e-{r}_{i})]}$$

(12)

where *t* is the genetic risk required for a person to be at the threshold required for clinical utility and *d* is the maximum number of genometypes under consideration. The thresholds for each disease are provided in table S2, and for each disease, *t* is defined as this threshold minus *e*.
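Equation (12) can be sketched as follows. The function name is hypothetical. The bracketed weight is kept in the form printed in Eq. (12), although it simplifies algebraically to *e* + *r*_{i}.

```python
def positive_fraction(f, r, e, t):
    """Positive fraction PF from Eq. (12).

    Share of the disease burden carried by genometypes whose genetic
    risk r_i exceeds the genetic-risk threshold t.
    """
    if len(f) != len(r):
        raise ValueError("f and r must have the same length")

    def weight(f_i, r_i):
        q_i = e + r_i
        # Same bracket as Eq. (12); algebraically equal to f_i * q_i.
        return f_i * (q_i ** 2 + q_i * (1.0 - q_i))

    denominator = sum(weight(f_i, r_i) for f_i, r_i in zip(f, r))
    if denominator <= 0.0:
        raise ValueError("no disease risk in any genometype")
    numerator = sum(weight(f_i, r_i) for f_i, r_i in zip(f, r) if r_i > t)
    return numerator / denominator
```

As a hypothetical example, with `f = [0.8, 0.2]`, `r = [0.0, 0.5]`, `e = 0.05` and a total-risk threshold of 10% (so `t = 0.10 - 0.05`), the second genometype carries 0.11 of the 0.15 total risk, giving a positive fraction of about 0.73.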

We therefore seek to solve the following optimization problem, for each disease:

$$\underset{f,\,r}{\text{maximize}}\quad PF(t,e;f,p),$$

(13)

$$\text{subject to}\quad {f}_{i}\ge 0,\quad \sum _{i}{f}_{i}=1,\quad {r}_{i}\in (0,1),\quad \sum _{x\in \{c,h,d\}}{\left({\hat{n}}_{x}-\mathrm{E}\left[{n}_{x}\right]\right)}^{2}\le 0.25,$$

(14)

where the last constraint in Eq. (14) ensures that no residual error exceeds 0.5 in absolute value, because the squared residuals sum to at most 0.25. The quantity ${\hat{n}}_{x}$ is the estimated number of twin pairs of each type obtained by plugging the estimated parameters into Eqs. (9) – (11). This is therefore a quadratically constrained nonlinear optimization problem. We utilize the following algorithm to obtain a local optimum.

Starting with *d*’ = 2 genometypes, we implement a grid search over the parameter space and select the parameters that maximize the likelihood over a constrained search space. Let *θ* = (*f,r*) and Θ be the set of all *θ*'s under consideration, as defined by the feasible region specified in Eq. (14). We then discretize this space into nine bins for each element of *f* and 100 bins for each element of *r* and denote *P*(*Z*|*G*) by *P*_{θ}(*Z*|*G*) to make its dependence on *θ* explicit. The grid search yields:

$${\hat{\theta}}^{\left(2\right)}=\underset{\theta \in \Theta}{\operatorname{argmax}}\prod _{(j,k)}{P}_{\theta}({Z}_{j},{Z}_{k}\mid {G}_{j}={G}_{k})$$

(15)

where ${\hat{\theta}}^{\left({d}^{\prime}\right)}=({\hat{f}}^{\left({d}^{\prime}\right)},{\hat{r}}^{\left({d}^{\prime}\right)})$ is the parameter estimate assuming only *d*’ genometypes. For each *d*’ = 3,...,20, we seek to solve the above optimization problem. To initialize, we pad the previous solution with zeros, yielding ${\hat{f}}_{\left(0\right)}^{({d}^{\prime}+1)}=({\hat{f}}^{\left({d}^{\prime}\right)},0)$ and similarly for ${\hat{r}}_{\left(0\right)}^{({d}^{\prime}+1)}$. Then we use MATLAB's *fmincon* to find a local maximum of *PF* given the constraints. If no improvement in *PF* is obtained for *d*’ +1 genometypes using the default “padded” initialization, we try randomly initializing. We stop trying random initializations if any of the following criteria are met: (i) if we find an improvement in *PF* with the constraints satisfied, (ii) if we reach 100% *PF*, or (iii) if we reach 15 random initializations. If criterion (i) is met, we denote the parameters achieving the improvement ${\hat{\theta}}^{({d}^{\prime}+1)}$ and then increment *d*’ and continue. If criterion (ii) is met, we stop incrementing *d*’, as we have achieved the maximum possible *PF*, so adding additional genometypes cannot possibly maximize it further. If criterion (iii) is met, we let ${\hat{\theta}}^{({d}^{\prime}+1)}={\hat{\theta}}_{\left(0\right)}^{({d}^{\prime}+1)}$; that is, we let our final estimate for *d*’ +1 simply be our estimate for *d*’ padded with a zero. We then increment *d*’.

We repeat the above approach for each disease. The parameters that we determined using this approach to maximize *PF* were then used to estimate the percentage of the population testing positive for a given disease, as well as the relative risk of disease for those individuals testing negative, as defined below. We apply this approach separately for each disease, thus assuming independence. To find the minimum PFs compatible with the twin data, we used a similar procedure.
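The initial two-genometype grid search can be sketched in plain Python. This is not the MATLAB implementation used in the study: the function names, the exact grid spacing, and the use of a multinomial likelihood over the three pair types are assumptions made for illustration, and the subsequent *fmincon* refinement and random restarts are not reproduced.

```python
from math import inf, log


def pair_probabilities(f1, r1, r2, e):
    """Probabilities of concordant-affected, concordant-healthy and
    discordant pairs for two genometypes with frequencies (f1, 1 - f1)."""
    q1, q2 = e + r1, e + r2
    f2 = 1.0 - f1
    p_c = f1 * q1 ** 2 + f2 * q2 ** 2
    p_h = f1 * (1.0 - q1) ** 2 + f2 * (1.0 - q2) ** 2
    p_d = 2.0 * (f1 * q1 * (1.0 - q1) + f2 * q2 * (1.0 - q2))
    return p_c, p_h, p_d


def log_likelihood(counts, probs):
    """Multinomial log-likelihood, up to an additive constant."""
    total = 0.0
    for n, p in zip(counts, probs):
        if n > 0:
            if p <= 0.0:
                return -inf
            total += n * log(p)
    return total


def grid_search_two_genometypes(n_c, n_h, n_d, e, f_bins=9, r_bins=100):
    """Return (f1, r1, r2, log-likelihood) for the best grid point."""
    counts = (n_c, n_h, n_d)
    best = (None, None, None, -inf)
    for a in range(1, f_bins + 1):
        f1 = a / (f_bins + 1)               # 0.1, 0.2, ..., 0.9 by default
        for b in range(r_bins + 1):
            r1 = b / r_bins
            if e + r1 > 1.0:
                break
            for c in range(b, r_bins + 1):  # r2 >= r1 avoids label swaps
                r2 = c / r_bins
                if e + r2 > 1.0:
                    break
                ll = log_likelihood(counts, pair_probabilities(f1, r1, r2, e))
                if ll > best[3]:
                    best = (f1, r1, r2, ll)
    return best
```

The search visits a few tens of thousands of grid points, so it runs quickly without any numerical library. In a fuller implementation its output would serve only as the starting point for the constrained maximization of *PF*.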

We determined the relative risk of disease of individuals whose whole-genome sequencing tests were negative after maximizing or minimizing the sensitivity (*PF*) of the test. Disease risk in the population testing negative (*DR _{neg}*) is the ratio of the number of disease cases testing negative to the number of individuals in the population testing negative:

$$D{R}_{neg}=\frac{(2{n}_{c}+{n}_{d})(1-PF)}{2{n}_{t}{\sum}_{i\in \left[d\right]\mid {r}_{i}<t}{f}_{i}}$$

(16)

To determine the relative risk of disease if testing negative (*RR _{neg}*), we calculated the ratio of disease risk of individuals testing negative to the disease risk in the twin cohort (*CR*):

$$R{R}_{neg}=\frac{D{R}_{neg}}{CR}$$

(17)
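Equations (16) and (17) can be combined in one short function. The function name is hypothetical and the values in the usage note are illustrative. As in Eqs. (12) and (16), a genometype with *r*_{i} exactly equal to *t* is counted as neither positive nor negative.

```python
def risk_if_negative(n_c, n_d, n_t, pf, f, r, t):
    """Return (DR_neg, RR_neg) from Eqs. (16) and (17).

    pf: positive fraction from Eq. (12)
    f, r: genometype frequencies and genetic risks
    t: genetic-risk threshold (total-risk threshold minus e)
    """
    if n_t <= 0:
        raise ValueError("n_t must be positive")
    cases = 2 * n_c + n_d
    if cases <= 0:
        raise ValueError("the cohort contains no disease cases")
    cohort_risk = cases / (2 * n_t)  # Eq. (1)
    # Fraction of the population whose genometype tests negative.
    negative_frequency = sum(f_i for f_i, r_i in zip(f, r) if r_i < t)
    if negative_frequency <= 0.0:
        raise ValueError("no genometype tests negative")
    dr_neg = cases * (1.0 - pf) / (2 * n_t * negative_frequency)
    return dr_neg, dr_neg / cohort_risk
```

With hypothetical inputs `n_c = 125`, `n_d = 350`, `n_t = 2000`, `pf = 0.11 / 0.15`, `f = [0.8, 0.2]`, `r = [0.0, 0.5]` and `t = 0.05`, the disease risk among those testing negative is about 0.05, one third of the cohort risk of 0.15. In this example the residual risk equals the non-genetic risk, which illustrates why a negative test cannot lower risk below the non-genetic component.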

We defined relative risk (*RR*) in table S2 as the minimum total risk of individuals with genometypes carrying a given genetic risk compared to the total risk of individuals with genometypes carrying a genetic risk of 0% (i.e., determined solely by non-genetic factors). The minimum total risk was determined using the standard 10% risk threshold described in the text as well as others (tables S4 to S6). In all cases,

$$RR=\frac{PPV+\left(CR(1-HER)\right)}{CR(1-HER)}$$

(18)
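
As a worked illustration of Eq. 18, the short Python sketch below computes *RR* from *PPV*, *CR* and *HER*, using the non-genetic risk $e = CR(1-HER)$ as the baseline. The function name and the numerical inputs are hypothetical and are not taken from the tables.

```python
def relative_risk(ppv: float, cr: float, her: float) -> float:
    """Eq. 18: (PPV + e) / e, with non-genetic risk e = CR * (1 - HER)."""
    e = cr * (1.0 - her)
    if e <= 0:
        raise ValueError("non-genetic risk must be positive")
    return (ppv + e) / e


# Invented inputs: PPV = 10%, cohort risk = 2%, heritability = 50%.
# Here e = 0.01, so RR is approximately (0.10 + 0.01) / 0.01 = 11.
print(round(relative_risk(ppv=0.10, cr=0.02, her=0.5), 2))
```

When the *PPV* term is zero the ratio reduces to 1, which corresponds to a genometype whose total risk is determined solely by non-genetic factors.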

Equation (14) enforces that none of the residual errors can be larger than 0.5, such that upon rounding we obtain a perfect fit. Changing this parameter from 0.5 to 0.01 did not alter the *PF* values depicted in Fig. 1 for any disease.

Instead of maximizing *PF* values, we also determined the distributions of genometype risks ($r_i$) and frequencies (

As noted above, we estimated the non-genetic risk as *e* = *CR*(1– *HER*). This risk is somewhat higher than that derived from the standard liability threshold (LT) model. However, it has recently been shown that the LT model underestimates the non-genetic contribution to disease because it does not take into account synergistic interactions among genes (41). The model described herein does not make any assumptions about the nature of the interactions between genes, such as additivity. However, the LT model can also be used to approximate the maximum capacity of whole genome sequencing to detect individuals at pre-defined risks under certain simplifying assumptions about the distribution of risk alleles in the population. The *PF* predictions from the LT model employing 10% thresholds are provided in table S4 and can be compared to the results of the current model with 10% thresholds (table S4).

Finally, our model can be used to calculate the potential clinical utility of whole-genome sequencing under any assumption about the proportion of non-genetic contributions to disease risk, or estimates thereof. Representative values for each disease, with non-genetic contributions ranging from 10% to 90%, are provided in table S7.

### Box 1

| Term | Definition |
| --- | --- |
| Genometype | A set of genomes that confer a specific genetic risk for a given disease |
| Genometype genetic risk (r) | The genetic risk conferred by a given genometype |
| Genometype frequency (f) | The frequency of a given genometype in the general population |
| Threshold | Minimum risk for a given disease considered to be clinically meaningful |
| Heritability (HER) | Proportion of phenotypic variance associated with genetic factors |
| Cohort risk (CR) | Risk of disease in the relevant twin cohort |
| Non-genetic risk (e) | Proportion of cohort risk due to non-genetic factors |
| Total risk | Sum of genetic risk conferred by a given genometype plus non-genetic risk |
| Relative risk | Ratio of total risk associated with a given genometype to cohort risk |

We thank Naomi Wray and Donald Geman for critical comments regarding the manuscript, and Katie Kinzler for technical assistance. **Funding:** The project was supported by The Lustgarten Foundation for Pancreatic Cancer Research, The Virginia and D. K. Ludwig Fund for Cancer Research, AACR Stand Up To Cancer-Dream Team Translational Cancer Research Grant, The Dr. Miriam and Sheldon G. Adelson Medical Research Foundation, The European Community's Seventh Framework Programme, NIH grants CA43460, CA57345, CA62924, CA121113, and NCI contract N01-CN-43302.

**Author contributions:** N.J.R., J.T.V., G.P., K.W.K., B.V., and V.E.V. designed the study; N.J.R., J.T.V., and V.E.V. generated and analyzed data; N.J.R., J.T.V., and B.V. wrote the manuscript.

**Competing Interests:** B.V., K.W.K., and V.E.V. are co-founders of Inostics and Personal Genome Diagnostics and are members of their Scientific Advisory Boards. K.W.K., B.V., and V.E.V. own Inostics and Personal Genome Diagnostics stock, which is subject to certain restrictions under University policy. The terms of these arrangements are managed by the Johns Hopkins University in accordance with its conflict-of-interest policies. G.P. is on the scientific advisory board of Counsyl.

**Citation:** N. J. Roberts, J. T. Vogelstein, G. Parmigiani, K. W. Kinzler, B. Vogelstein, V. E. Velculescu, The Predictive Capacity of Personal Genome Sequencing. *Sci. Transl. Med*. 10.1126/scitranslmed.3003380 (2012).

1. Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, Boutell JM, Bryant J, Carter RJ, Keira Cheetham R, Cox AJ, Ellis DJ, Flatbush MR, Gormley NA, Humphray SJ, Irving LJ, Karbelashvili MS, Kirk SM, Li H, Liu X, Maisinger KS, Murray LJ, Obradovic B, Ost T, Parkinson ML, Pratt MR, Rasolonjatovo IM, Reed MT, Rigatti R, Rodighiero C, Ross MT, Sabot A, Sankar SV, Scally A, Schroth GP, Smith ME, Smith VP, Spiridou A, Torrance PE, Tzonev SS, Vermaas EH, Walter K, Wu X, Zhang L, Alam MD, Anastasi C, Aniebo IC, Bailey DM, Bancarz IR, Banerjee S, Barbour SG, Baybayan PA, Benoit VA, Benson KF, Bevis C, Black PJ, Boodhun A, Brennan JS, Bridgham JA, Brown RC, Brown AA, Buermann DH, Bundu AA, Burrows JC, Carter NP, Castillo N, Chiara ECM, Chang S, Neil Cooley R, Crake NR, Dada OO, Diakoumakos KD, Dominguez-Fernandez B, Earnshaw DJ, Egbujor UC, Elmore DW, Etchin SS, Ewan MR, Fedurco M, Fraser LJ, Fuentes Fajardo KV, Scott Furey W, George D, Gietzen KJ, Goddard CP, Golda GS, Granieri PA, Green DE, Gustafson DL, Hansen NF, Harnish K, Haudenschild CD, Heyer NI, Hims MM, Ho JT, Horgan AM, Hoschler K, Hurwitz S, Ivanov DV, Johnson MQ, James T, Huw Jones TA, Kang GD, Kerelska TH, Kersey AD, Khrebtukova I, Kindwall AP, Kingsbury Z, Kokko-Gonzales PI, Kumar A, Laurent MA, Lawley CT, Lee SE, Lee X, Liao AK, Loch JA, Lok M, Luo S, Mammen RM, Martin JW, McCauley PG, McNitt P, Mehta P, Moon KW, Mullens JW, Newington T, Ning Z, Ling Ng B, Novo SM, O'Neill MJ, Osborne MA, Osnowski A, Ostadan O, Paraschos LL, Pickering L, Pike AC, Chris Pinkard D, Pliskin DP, Podhasky J, Quijano VJ, Raczy C, Rae VH, Rawlings SR, Chiva Rodriguez A, Roe PM, Rogers J, Rogert Bacigalupo MC, Romanov N, Romieu A, Roth RK, Rourke NJ, Ruediger ST, Rusman E, Sanches-Kuiper RM, Schenker MR, Seoane JM, Shaw RJ, Shiver MK, Short SW, Sizto NL, Sluis JP, Smith MA, Ernest Sohna Sohna J, Spence EJ, Stevens K, Sutton N, Szajkowski L, Tregidgo CL, Turcatti G, 
Vandevondele S, Verhovsky Y, Virk SM, Wakelin S, Walcott GC, Wang J, Worsley GJ, Yan J, Yau L, Zuerlein M, Mullikin JC, Hurles ME, McCooke NJ, West JS, Oaks FL, Lundberg PL, Klenerman D, Durbin R, Smith AJ. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008;456:53–59. [PMC free article] [PubMed]

2. Drmanac R, Sparks AB, Callow MJ, Halpern AL, Burns NL, Kermani BG, Carnevali P, Nazarenko I, Nilsen GB, Yeung G, Dahl F, Fernandez A, Staker B, Pant KP, Baccash J, Borcherding AP, Brownley A, Cedeno R, Chen L, Chernikoff D, Cheung A, Chirita R, Curson B, Ebert JC, Hacker CR, Hartlage R, Hauser B, Huang S, Jiang Y, Karpinchyk V, Koenig M, Kong C, Landers T, Le C, Liu J, McBride CE, Morenzoni M, Morey RE, Mutch K, Perazich H, Perry K, Peters BA, Peterson J, Pethiyagoda CL, Pothuraju K, Richter C, Rosenbaum AM, Roy S, Shafto J, Sharanhovich U, Shannon KW, Sheppy CG, Sun M, Thakuria JV, Tran A, Vu D, Zaranek AW, Wu X, Drmanac S, Oliphant AR, Banyai WC, Martin B, Ballinger DG, Church GM, Reid CA. Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science. 2010;327:78–81. [PubMed]

3. Eid J, Fehr A, Gray J, Luong K, Lyle J, Otto G, Peluso P, Rank D, Baybayan P, Bettman B, Bibillo A, Bjornson K, Chaudhuri B, Christians F, Cicero R, Clark S, Dalal R, Dewinter A, Dixon J, Foquet M, Gaertner A, Hardenbol P, Heiner C, Hester K, Holden D, Kearns G, Kong X, Kuse R, Lacroix Y, Lin S, Lundquist P, Ma C, Marks P, Maxham M, Murphy D, Park I, Pham T, Phillips M, Roy J, Sebra R, Shen G, Sorenson J, Tomaney A, Travers K, Trulson M, Vieceli J, Wegener J, Wu D, Yang A, Zaccarin D, Zhao P, Zhong F, Korlach J, Turner S. Real-time DNA sequencing from single polymerase molecules. Science. 2009;323:133–138. [PubMed]

4. Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z, Dewell SB, Du L, Fierro JM, Gomes XV, Godwin BC, He W, Helgesen S, Ho CH, Irzyk GP, Jando SC, Alenquer ML, Jarvie TP, Jirage KB, Kim JB, Knight JR, Lanza JR, Leamon JH, Lefkowitz SM, Lei M, Li J, Lohman KL, Lu H, Makhijani VB, McDade KE, McKenna MP, Myers EW, Nickerson E, Nobile JR, Plant R, Puc BP, Ronan MT, Roth GT, Sarkis GJ, Simons JF, Simpson JW, Srinivasan M, Tartaro KR, Tomasz A, Vogt KA, Volkmer GA, Wang SH, Wang Y, Weiner MP, Yu P, Begley RF, Rothberg JM. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437:376–380. [PMC free article] [PubMed]

5. Frazer KA, Murray SS, Schork NJ, Topol EJ. Human genetic variation and its contribution to complex traits. Nat Rev Genet. 2009;10:241–251. [PubMed]

6. Cirulli ET, Goldstein DB. Uncovering the roles of rare variants in common disease through whole-genome sequencing. Nat Rev Genet. 2010;11:415–425. [PubMed]

7. Manolio TA. Genomewide association studies and assessment of the risk of disease. N Engl J Med. 2010;363:166–176. [PubMed]

8. Jones S, Hruban RH, Kamiyama M, Borges M, Zhang X, Parsons DW, Lin JC, Palmisano E, Brune K, Jaffee EM, Iacobuzio-Donahue CA, Maitra A, Parmigiani G, Kern SE, Velculescu VE, Kinzler KW, Vogelstein B, Eshleman JR, Goggins M, Klein AP. Exomic sequencing identifies PALB2 as a pancreatic cancer susceptibility gene. Science. 2009;324:217. [PMC free article] [PubMed]

9. Ng SB, Buckingham KJ, Lee C, Bigham AW, Tabor HK, Dent KM, Huff CD, Shannon PT, Jabs EW, Nickerson DA, Shendure J, Bamshad MJ. Exome sequencing identifies the cause of a mendelian disorder. Nat Genet. 2010;42:30–35. [PMC free article] [PubMed]

10. Bruder CE, Piotrowski A, Gijsbers AA, Andersson R, Erickson S, Diaz de Stahl T, Menzel U, Sandgren J, von Tell D, Poplawski A, Crowley M, Crasto C, Partridge EC, Tiwari H, Allison DB, Komorowski J, van Ommen GJ, Boomsma DI, Pedersen NL, den Dunnen JT, Wirdefeldt K, Dumanski JP. Phenotypically concordant and discordant monozygotic twins display different DNA copy-number-variation profiles. Am J Hum Genet. 2008;82:763–771. [PubMed]

11. Rijsdijk FV, Sham PC. Analytic approaches to twin data using structural equation models. Brief Bioinform. 2002;3:119–133. [PubMed]

12. Visscher PM, Hill WG, Wray NR. Heritability in the genomics era--concepts and misconceptions. Nat Rev Genet. 2008;9:255–266. [PubMed]

13. Clarke-Pearson DL. Clinical practice. Screening for ovarian cancer. N Engl J Med. 2009;361:170–177. [PubMed]

14. Kopelman PG. Obesity as a medical problem. Nature. 2000;404:635–643. [PubMed]

15. Willett WC, Dietz WH, Colditz GA. Guidelines for healthy weight. N Engl J Med. 1999;341:427–434. [PubMed]

16. Alberg AJ, Ford JG, Samet JM. Epidemiology of lung cancer: ACCP evidence-based clinical practice guidelines (2nd edition). Chest. 2007;132:29S–55S. [PubMed]

17. Ott A, Slooter AJ, Hofman A, van Harskamp F, Witteman JC, Van Broeckhoven C, van Duijn CM, Breteler MM. Smoking and risk of dementia and Alzheimer's disease in a population-based cohort study: the Rotterdam Study. Lancet. 1998;351:1840–1843. [PubMed]

18. He J, Ogden LG, Bazzano LA, Vupputuri S, Loria C, Whelton PK. Risk factors for congestive heart failure in US men and women: NHANES I epidemiologic follow-up study. Arch Intern Med. 2001;161:996–1002. [PubMed]

19. Lichtenstein P, Holm NV, Verkasalo PK, Iliadou A, Kaprio J, Koskenvuo M, Pukkala E, Skytthe A, Hemminki K. Environmental and heritable factors in the causation of cancer--analyses of cohorts of twins from Sweden, Denmark, and Finland. N Engl J Med. 2000;343:78–85. [PubMed]

20. Hansen PS, Brix TH, Iachine I, Kyvik KO, Hegedus L. The relative importance of genetic and environmental effects for the early stages of thyroid autoimmunity: a study of healthy Danish twins. Eur J Endocrinol. 2006;154:29–38. [PubMed]

21. Kaprio J, Tuomilehto J, Koskenvuo M, Romanov K, Reunanen A, Eriksson J, Stengard J, Kesaniemi YA. Concordance for type 1 (insulin-dependent) and type 2 (non-insulin-dependent) diabetes mellitus in a population-based cohort of twins in Finland. Diabetologia. 1992;35:1060–1067. [PubMed]

22. Katsika D, Grjibovski A, Einarsson C, Lammert F, Lichtenstein P, Marschall HU. Genetic and environmental influences on symptomatic gallstone disease: a Swedish study of 43,141 twin pairs. Hepatology. 2005;41:1138–1143. [PubMed]

23. Gatz M, Pedersen NL, Berg S, Johansson B, Johansson K, Mortimer JA, Posner SF, Viitanen M, Winblad B, Ahlbom A. Heritability for Alzheimer's disease: the study of dementia in Swedish twins. J Gerontol A Biol Sci Med Sci. 1997;52:M117–M125. [PubMed]

24. Tanner CM, Ottman R, Goldman SM, Ellenberg J, Chan P, Mayeux R, Langston JW. Parkinson disease in twins: an etiologic study. JAMA. 1999;281:341–346. [PubMed]

25. Sullivan PF, Evengard B, Jacks A, Pedersen NL. Twin analyses of chronic fatigue in a Swedish national sample. Psychol Med. 2005;35:1327–1336. [PubMed]

26. Cameron AJ, Lagergren J, Henriksson C, Nyren O, Locke GR, 3rd, Pedersen NL. Gastroesophageal reflux disease in monozygotic and dizygotic twins. Gastroenterology. 2002;122:55–59. [PubMed]

27. Bengtson MB, Ronning T, Vatn MH, Harris JR. Irritable bowel syndrome in twins: genes and environment. Gut. 2006;55:1754–1759. [PMC free article] [PubMed]

28. Zdravkovic S, Wienke A, Pedersen NL, Marenberg ME, Yashin AI, De Faire U. Heritability of death from coronary heart disease: a 36-year follow-up of 20 966 Swedish twins. J Intern Med. 2002;252:247–254. [PubMed]

29. Bak S, Gaist D, Sindrup SH, Skytthe A, Christensen K. Genetic liability in stroke: a long-term follow-up study of Danish twins. Stroke. 2002;33:769–774. [PubMed]

30. Algovik M, Nilsson E, Cnattingius S, Lichtenstein P, Nordenskjold A, Westgren M. Genetic influence on dystocia. Acta Obstet Gynecol Scand. 2004;83:832–837. [PubMed]

31. Altman D, Forsman M, Falconer C, Lichtenstein P. Genetic influence on stress urinary incontinence and pelvic organ prolapse. Eur Urol. 2008;54:918–922. [PubMed]

32. Xu J, Kochanek KD, Murphy SL, Tejada-Vera B. Deaths: final data for 2007. Natl Vital Stat Rep. 2010;58:1–135. [PubMed]

33. Audrain J, Boyd NR, Roth J, Main D, Caporaso NF, Lerman C. Genetic susceptibility testing in smoking-cessation treatment: one-year outcomes of a randomized trial. Addict Behav. 1997;22:741–751. [PubMed]

34. Sabaté E. Adherence to long-term therapies: evidence for action. World Health Organization; Geneva: 2003. 288 pages.

35. Wong AH, Gottesman II, Petronis A. Phenotypic differences in genetically identical organisms: the epigenetic perspective. Hum Mol Genet. 2005;14:R11–R18. Spec No 1. [PubMed]

36. Identical Twins Not As Identical As Believed [September 10th 2011];ScienceDaily. Available from: http://www.sciencedaily.com/releases/2008/02/080215121214.htm.

37. Wray NR, Goddard ME, Visscher PM. Prediction of individual genetic risk to disease from genome-wide association studies. Genome Res. 2007;17:1520–1528. [PubMed]

38. Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, Collins FS, Manolio TA. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci U S A. 2009;106:9362–9367. [PubMed]

39. Wray NR, Yang J, Goddard ME, Visscher PM. The genetic interpretation of area under the ROC curve in genomic profiling. PLoS Genet. 2010;6:e1000864. [PMC free article] [PubMed]

40. Tenesa A, Dunlop MG. New insights into the aetiology of colorectal cancer from genome-wide association studies. Nat Rev Genet. 2009;10:353–358. [PubMed]

41. Zuk O, Hechter E, Sunyaev SR, Lander ES. The mystery of missing heritability: Genetic interactions create phantom heritability. Proc Natl Acad Sci U S A. 2012;109:1193–1198. [PubMed]
