Classical epidemiologic studies have made seminal contributions to identifying the etiology of most common cancers. Molecular epidemiology was conceived of as an extension of traditional epidemiology to incorporate biomarkers with questionnaire data to further our understanding of the mechanisms of carcinogenesis. Early molecular epidemiologic studies employed functional assays. These studies were hampered by the need for sequential and/or prediagnostic samples, viable lymphocytes and the uncertainty of how well these functional data (derived from surrogate lymphocytic tissue) reflected events in the target tissue. The completion of the Human Genome Project and HapMap Project, together with unparalleled advances in high-throughput genotyping, revolutionized the practice of molecular epidemiology. Early studies had been constrained by existing technology to use the hypothesis-driven candidate gene approach, with disappointing results. Pathway analysis addressed some of the concerns, although the study of interacting and overlapping gene networks remained a challenge. Whole-genome scanning approaches were designed as agnostic studies using a dense set of markers to capture much of the common genome variation to study germ-line genetic variation as risk factors for common complex diseases. It should be possible to exploit the wealth of these data for pharmacogenetic studies to realize the promise of personalized therapy. Going forward, the temptation for epidemiologists to be lured by high-tech ‘omics’ will be immense. Systems Epidemiology, the observational prototype of systems biology, is an extension of classical epidemiology to include powerful new platforms such as the transcriptome, proteome and metabolome. However, there will always be the need for impeccably designed and well-powered epidemiologic studies with rigorous quality control of data, specimen acquisition and statistical analysis.
The term ‘molecular epidemiology’ made its appearance in the literature in the early 1980s. It was originally conceived of as an extension of traditional (classical) epidemiologic research to incorporate biomarkers (biochemical and molecular) with questionnaire data in order to further our understanding of mechanisms of carcinogenesis and of events throughout the continuum between exposure and cancer development. These biomarkers were initially designated as those connoting susceptibility, exposure or effect (1). Rose (2) pointed out in 1990 the need for traditional epidemiologists ‘to look inside the black box’ between exposure and disease and for the ‘molecular biologists to look outside it’. However, as Ellsworth et al. (3) made clear, a distinction between these two disciplines (traditional and molecular epidemiology) is somewhat artificial and does not permit emphasis on the need for incorporating the rigor of classical epidemiologic population selection and study design into all types of epidemiologic approaches.
Classical epidemiologic studies have made seminal contributions to identifying the etiology of most common cancers and have had substantive public health impact. The smoking and lung cancer association is perhaps best known and has driven the most wide-reaching and, arguably, successful cancer prevention initiatives and policy changes. Around the middle of the last century, studies of lung cancer were published in Western Europe (4) and North America (5) leading to the conclusion in 1950 that smoking was an important cause of lung cancer. Likewise, the role of chemical exposures was explored in the British chemical industry in the early 1950s (6). During the same time period, Cornfield (7) and Mantel et al. (8) made substantive contributions to the methodologic and analytic rigor of the case–control study design. Since then many cohort and case–control studies have provided convincing evidence of the etiologic roles of specific lifestyle, occupational, viral and dietary risk factors in a range of cancers. The International Agency for Research on Cancer (IARC) has evaluated the cancer-causing potential of >900 likely candidates, placing each into one of five groups ranging from carcinogenic to probably not carcinogenic to humans (9).
Although there was growing recognition in these traditional studies of the need to consider complex interactions between these exposures together with the contribution of familial and genetic factors in order to fully understand cancer causation, the molecular tools to explore these associations were yet to be developed.
There is now growing recognition that such environmental challenges not only interact with genes but may also modulate genetic effects and influence phenotypes. It is also increasingly recognized that environmental exposures may not only damage DNA directly but may also alter gene expression through epigenetic mechanisms that could be reversible.
In the new era of employing sophisticated molecular platforms, epidemiologists need to remain focused on accurate assessment of environmental and exposure covariates. Schulte (10) has outlined some of the capabilities of molecular epidemiology, to which we have provided selected illustrative examples. These include the following.
An example might be the human papillomavirus–cervical cancer association. Such an approach provides many opportunities to intervene effectively in the disease continuum from behavioral lifestyle modification through currently ongoing clinical trials of human papillomavirus vaccination programs. Another successful public health example is the hepatitis B virus infection and hepatocellular cancer association.
Environmental studies of cancer etiology have been substantially strengthened by incorporation of biomarkers to refine exposure assessment. Past exposures might be assessed through measurement of protein or DNA adducts, or dosimeters can measure individual or ambient exposures. Tobacco smoke metabolites have been extensively studied as markers of recent exposure (e.g. cotinine). Such markers can also be used as susceptibility markers by conducting genotype/phenotype correlations in which the phenotypes are tobacco metabolite measurements from urine samples that reflect internal dose and/or metabolism (activation or detoxification) of tobacco constituents, and the genotypes relate to these pathways. These specific phenotypes could include metabolites of nicotine, polycyclic aromatic hydrocarbons, benzene, acrolein, crotonaldehyde and ethylene oxide.
The ability to identify premalignant lesions in subjects offers greater opportunities for intervention and also to expand the pool of ‘cases’ for epidemiologic studies. Cancer risk biomarkers have substantive potential for risk stratification. Barrett's esophagus would fit into this category. Reid (11) has stressed the importance of refining risk assessment and of incorporating clinical and epidemiologic variables that are easy to ascertain. Without such robust risk models, screening in patients with Barrett's esophagus becomes inefficient.
There are categories of biomarkers (reviewed in ref. 12) that are appropriate to assess specific exposures. These include biomarkers of internal dose (e.g. urinary tobacco-specific nitrosamines or circulating antibodies), biomarkers of biologically effective dose (e.g. DNA or protein adducts) or biomarkers of early effect (e.g. gene expression profiles). Wild (12) has further pointed out that there are other exposure assessment refinements beyond these standard biomarkers, including geographic information systems, personal and environmental monitoring approaches and more sophisticated questionnaire elements. Wild suggests that biomarkers also have considerable potential to help refine assessment of diet, obesity, energy balance and chemical exposures. Since the strength of the associations between exposure and disease and between genetic variants and disease is modest, we need at all costs to attempt to avoid exposure misclassification (12).
There is increasing recognition of the need to move beyond observational analysis and reporting of exposure and outcome in order to begin to understand the underlying mechanism of these associations. One approach is to exploit mouse models to assess the effects of the environment on the phenotype in order to generate better and more accurate models of human disease (13). This has led to the concept of the envirotype—factors that are exogenous to the organism and that mimic human lifestyles such as diet, stress and immunity. Beckers et al. (13) believe that introducing experimental envirotypes into mouse phenotyping protocols may yield new knowledge about the effects of lifestyle changes. The utility of these models may be further improved by assessing the impact of experimental manipulations, such as exploring the behavioral determinants of energy balance.
Reliable risk prediction tools for estimating the probability of cancer over a defined time period have substantial public health implications and could be of value in early detection and clinical decision making. Further, risk prediction tools could be incorporated into the design of smaller, more powerful and ‘smarter’ prevention trials. Cardiovascular risk profiles using epidemiologic data have been effectively used for nearly two decades (14). Likewise there are well-validated and easily measured risk factors that are powerful predictors of type 2 diabetes (15). In the cancer arena, risk prediction models for breast cancer have the longest history (16), although prognostic models have also been generated for prostate, lung, melanoma, ovary, colorectal and bladder cancers. Such tools hold promise, but their interpretation is complex. For lung cancer prediction, there are an estimated 45 million current smokers in the USA and 49 million former smokers. The challenge is to identify that subset of ever smokers at higher risk for developing lung cancer. Such high-risk individuals could undergo a program of screening surveillance that might not be appropriate for a lower risk population and receive the most intensive smoking cessation interventions. The challenge of incorporating genetic markers into these models is discussed further below.
Some of the earliest molecular epidemiologic studies evaluating susceptibility to carcinogenesis employed functional assays. Peripheral blood lymphocytes were the tissue of first choice for studies that used functional assays to evaluate susceptibility to mutagenic exposures. The types of assays included those using a chemical or physical mutagen challenge (such as the mutagen sensitivity, Comet, micronucleus and induced adduct assays), unscheduled DNA synthesis and measuring cellular ability to remove adducts from plasmids transfected into in vitro lymphocyte cultures by expression of damaged reporter genes (the host-cell reactivation assay). These are discussed in further detail and referenced below. Longitudinal evaluation of these functional assays must be considered the gold standard, rather than cross-sectional analysis. However, the availability of suitable samples was a rate-limiting factor. Very few repositories have sequential samples that have been collected longitudinally from the same individuals or have prediagnostic samples collected in a way that ensures viability of the lymphocytes. Another overriding concern was the need to know how well these functional data (derived from surrogate lymphocytic tissue) reflected events at the level of the target tissue. The issue of ‘reverse causality’ in case–control studies was a constant criticism. A few illustrative examples are summarized below.
DNA repair is a ubiquitous defense mechanism that is critical to maintaining the integrity of the genome and repairing the damage from exposure to exogenous environmental xenobiotics as well as to endogenous damage (e.g. from oxidative metabolism) or spontaneous disintegration of chemical bonds in DNA. There is substantial interindividual variation in DNA repair capacity (DRC) within a population. At the extreme end of this spectrum are patients with xeroderma pigmentosum, who have a defect in nucleotide excision repair and who exhibit 1000-fold increased risks of skin cancer. There is a larger subgroup with reduced DRC who are likely to be at increased cancer risk but are phenotypically normal. In an extensive review of the published literature on DNA repair and susceptibility to cancer in humans, Berwick et al. (17) concluded that ‘the vast majority of studies show a difference (in repair capacity) between cancer case subjects and control subjects’. They also sounded a cautionary note regarding the issues of confounding and the need to develop molecular assays that define both the genetic defect and the repair pathways involved.
Measuring the expression level of damaged reporter genes (host-cell reactivation) is the DNA repair assay of choice. This assay uses undamaged cells, is relatively fast and is an objective way of measuring intrinsic cellular repair. In the assay, lymphocytes are transfected with damaged non-replicating recombinant plasmid harboring a chloramphenicol acetyltransferase reporter gene (pCMVcat) (18). This assay is a direct measure of repair kinetics, unlike the cytogenetic assays that only indirectly infer repair capacity from cellular damage remaining after mutagenic exposure and recovery (18), and as such probably reflect general and non-specific impairment of the DNA repair machinery.
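In this assay, DRC is conventionally expressed as percent reactivation: reporter activity recovered from the damaged plasmid relative to an undamaged control plasmid. A minimal sketch of that calculation (the activity readings below are hypothetical arbitrary units, not data from any study):

```python
def repair_capacity(damaged_activity, undamaged_activity):
    """DNA repair capacity as percent reactivation: reporter expression
    from the damaged plasmid relative to an undamaged control plasmid."""
    return 100.0 * damaged_activity / undamaged_activity

# Hypothetical CAT/luciferase readings (arbitrary units); a lower
# percent reactivation suggests reduced host repair capacity.
control_subject = repair_capacity(8.5, 100.0)
case_subject = repair_capacity(5.1, 100.0)
print(control_subject, case_subject)  # → 8.5 5.1
```

The normalization against the undamaged plasmid is what makes the readout a measure of repair rather than of transfection efficiency or overall transcription.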
The mutagen challenge employed depended on the risk profile of the cancer being studied. For example, the mutagen was ultraviolet radiation for skin cancers (19) and activated benzo[a]pyrene diol epoxide, a major constituent of tobacco smoke, for tobacco-related cancers (20). The low DRC phenotype has been shown to be an independent risk factor for epithelial cancers that are related to these exposures (19–22). Interestingly, in almost all these studies, cases who were younger at diagnosis, females, lighter smokers and those who reported a family history of cancer exhibited the lowest repair capacity, suggesting that these subgroups may be especially susceptible to cancer. Moreover, potential interactions between repair capacity and genes or environmental factors have been suggested. To simplify the host-cell reactivation assay and to accommodate population studies, Dr Wei's group has replaced the chloramphenicol acetyltransferase reporter with luciferase, using the plasmid expression vector pCMVluc (23). This plasmid is the same construct, containing a human cytomegalovirus immediate-early promoter and enhancer, except for the reporter gene.
Hsu et al. (24) developed the mutagen sensitivity assay, which quantifies the frequency of chromatid breaks induced by challenge mutagens in cultured lymphocytes in vitro as an integrated biomarker of mutagen sensitivity and an indirect measure of DRC. Hsu hypothesized a spectrum of mutagen sensitivity within the general population. Patients with chromosome breakage syndromes (such as ataxia telangiectasia and xeroderma pigmentosum) are located at the extreme end of the spectrum. These patients exhibit high rates of spontaneous chromosome breaks, increased susceptibility to induction of breaks by mutagens and increased cancer risk. Thus, the mutagen sensitivity assay reflects both host DRC and genomic stability. Using in vitro mutagen challenges to induce specific types of DNA damage provides detailed information regarding host susceptibility for a given DNA repair pathway. The mutagen-sensitive phenotype, assessed by site-specific mutagen challenge, has been associated with increased risk of a variety of epithelial cancers, including upper aerodigestive tract, head and neck, brain and lung cancers, as well as oral premalignant lesions (25).
The comet assay, or single-cell gel electrophoresis, is a rapid visual method for measuring DNA breakage in single cells. With advances in automated imaging technology, this assay has proved to be a fairly promising marker to gauge host susceptibility for cancer in large molecular epidemiology studies. The comet assay appears to have many advantages, including allowing relatively high-throughput screening, requiring a small number of cells and facilitating the detection of primary DNA damage in individual cells (26). Although it requires viable cells, it does not require cell growth and is applicable to any cell line or tissue from which a single-cell suspension can be obtained and can even be applied to terminally differentiated cells (27). The data are obtained within a few hours of sampling, and the assay is cost effective. Moreover, the comet assay has the potential for clinical application. Schmezer et al. (28) have shown that cryopreservation for up to 12 months does not affect the sensitivity of the lymphocytes and the reproducibility of the assay was good. Therefore, the comet assay is an applicable assay to assess in vitro genetic instability in large-scale epidemiologic studies.
The cytokinesis-block micronucleus (CBMN) assay in human lymphocytes is one of the most commonly used methods for measuring DNA damage because it is relatively easier to score micronuclei (MN) than chromosome aberrations (29). MN originate from chromosome fragments or whole chromosomes that fail to engage with the mitotic spindle and therefore lag behind when the cell divides. Compared with other cytogenetic assays, quantification of MN confers several advantages, including speed and ease of analysis, no requirement for metaphase cells and reliable identification of cells that have completed only one nuclear division. This prevents confounding effects caused by differences in cell division kinetics because expression of the MN is dependent on completion of nuclear division (30). Because cells are blocked in the binucleated stage, it is also possible to measure nucleoplasmic bridges (NPBs) originating from asymmetrical chromosome rearrangements and/or telomere end fusions (31). NPBs occur when the centromeres of dicentric chromosomes or chromatids are pulled to the opposite poles of the cell at anaphase. In this assay, binucleated cells with NPBs are easily observed because cytokinesis is inhibited, preventing breakage of the anaphase bridges from which NPBs are derived, and thus the nuclear membrane forms around the NPBs. Both MN and NPBs occur in cells exposed to DNA-breaking agents. The CBMN assay can also detect nuclear buds, which represent a mechanism by which cells remove amplified DNA, and are therefore considered a marker of possible gene amplification [reviewed by Fenech (32)]. The CBMN test is gradually replacing the analysis of chromosome aberrations in lymphocytes because MN, NPBs and nuclear buds are easy to recognize and score and the results can be obtained in a shorter time (33). El-Zein et al. (34) have shown that levels of spontaneous and induced chromosomal damage were significantly higher in lung cancer cases than controls. 
The simplicity, rapidity and sensitivity of the CBMN test make it a valuable tool for risk assessment.
The two major methods currently in use to measure telomere length in epidemiological/clinical research are Southern blot analysis and real-time quantitative polymerase chain reaction. According to Aviv (35), these methods used in epidemiological studies to measure telomere length may not be sufficiently reliable to detect small changes in telomere lengths over time. Aviv (35) recommends an impartial comparison of the quantitative polymerase chain reaction method versus the Southern blot analysis in measuring telomere parameters across a wide age range.
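For reference, the quantitative polymerase chain reaction approach expresses relative telomere length as a telomere-to-single-copy-gene (T/S) ratio derived from cycle-threshold (Ct) values. A minimal sketch of that calculation, assuming ideal and equal amplification efficiency for both primer pairs (the Ct values below are hypothetical):

```python
def relative_ts_ratio(ct_telomere, ct_single_copy,
                      ct_telomere_ref, ct_single_copy_ref,
                      efficiency=2.0):
    """Relative telomere length (T/S) of a sample versus a reference DNA.
    Assumes the same amplification efficiency for the telomere and
    single-copy-gene reactions (2.0 = perfect doubling per cycle)."""
    # T/S for the sample and for the reference, then their ratio
    ts_sample = efficiency ** -(ct_telomere - ct_single_copy)
    ts_ref = efficiency ** -(ct_telomere_ref - ct_single_copy_ref)
    return ts_sample / ts_ref

# Hypothetical Ct values: a lower telomere Ct relative to the single-copy
# gene implies more telomeric template, i.e. longer telomeres.
print(relative_ts_ratio(14.0, 20.0, 15.0, 20.0))  # → 2.0
```

The equal-efficiency assumption is exactly the kind of technical detail Aviv's critique targets: small efficiency differences between plates or primer pairs propagate exponentially into the T/S estimate.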
Telomere length abnormalities are nearly universal in preinvasive stages of human epithelial carcinogenesis and several, but not all, studies have reported an association between short telomeres and increased risk of cancer at several sites, including prostate, esophagus, lymphoma, basal cell skin cancer and cancers of the breast, lung, head and neck, bladder and kidney (36–42).
The completion of the Human Genome Project in 2001 and the HapMap Project, which dissected the genome into highly correlated single-nucleotide polymorphism (SNP) linkage disequilibrium (LD) blocks, together with unparalleled technologic advances in high-throughput genotyping, have revolutionized the practice of molecular epidemiology. In fact, Shpilberg et al. (43) have stated that the Human Genome Project was the ‘greatest opportunity for epidemiology since John Snow discovered the Broad Street pump…’.
These resources have stimulated innovative, interactive and crosscutting new disciplines that link epidemiologic and basic science research. Early molecular epidemiology studies were constrained by existing technology to use the traditional hypothesis-driven candidate gene approach to identify genetic variants that conferred susceptibility to cancer. These studies used current knowledge of pathophysiology and cancer biology and prior experimental or in vitro data to explore genes of interest.
In general, the approach has been disappointing. Selection of SNPs to evaluate in an association study is challenging: in complex diseases, many genes are probably involved, and some may be unknown or their functions poorly characterized. Many early candidate gene studies were underpowered, could not evaluate complex gene–gene and gene–environment interactions, and initial positive findings were rarely replicated in subsequent larger scale studies. Publication bias is also a real concern. Schmidt (44) has estimated that only 10% of published research papers investigating SNPs in relation to cancer risk have been validated. He also points out the lack of ‘big genes with big effects’, such as the highly penetrant BRCA gene variations that raise breast cancer risk by up to 85%. In reality, the candidate gene approach is a simplistic one and cannot address the role of multiple genetic loci in complex diseases with complex environmental backgrounds.
On the other hand, pathway analysis addresses some of these concerns, although the study of interacting and overlapping networks of genes remains a challenge. Kraft et al. (45) have outlined the challenges inherent in multilocus pathway approaches. In most cases, the effects are moderate and often depend upon interactions among the risk alleles of several genes in a pathway or with other environmental risk factors. Another complication comes from the findings of the ENCODE study that suggested that non-coding regions of the genome are also functional and are involved in transcriptional processes (46).
This strategy of looking for genetic variation in candidate genes began to shift in 2005 to more unbiased gene discovery approaches. Whole-genome scanning approaches were designed as agnostic studies that used a dense set of markers that capture much of the common variation across the genome to study germ-line genetic variation in study subjects as risk factors for common complex diseases. This approach moves beyond known genes in known pathways to identify unanticipated genes that contribute to risk. These platforms are built to study relatively common variants, such as those with a minor allele frequency >0.1. It is estimated that ~20% of common SNPs are only partially tagged or not tagged at all and rare variants are not tagged (47). Low-penetrant genes studied to date confer only modest risks (in the order of 1.2–1.3), but still may confer high population burden and population-attributable fraction due to their prevalence. Among the limitations of genome-wide association (GWA) studies (that will not be reviewed here), one of the most challenging is the ability to move from identification of statistical associations to elucidating the functional basis of the link between a genomic region and the trait of interest (47). GWA studies identify loci, but not specific genes, and identification of actual causative loci will require deep resequencing methods and fine mapping approaches. In a thoughtful commentary, Kraft et al. (48) point out that the initial findings from GWA studies explain only a very small proportion of the underlying genetic contribution to disease and that many variants are likely to be responsible for the majority of inherited risk.
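The contrast between modest relative risks and a substantial population burden can be illustrated with Levin's formula for the population-attributable fraction; the carrier prevalences and relative risks below are hypothetical round numbers chosen only to make the point:

```python
def attributable_fraction(prevalence, relative_risk):
    """Levin's population-attributable fraction for a single risk factor:
    the proportion of cases in the population attributable to it."""
    excess = prevalence * (relative_risk - 1.0)
    return excess / (1.0 + excess)

# A rare, high-penetrance allele versus a common, low-penetrance one
rare_high = attributable_fraction(0.001, 10.0)   # hypothetical BRCA-like variant
common_low = attributable_fraction(0.40, 1.25)   # hypothetical GWA-type variant
print(f"{rare_high:.3f} {common_low:.3f}")  # → 0.009 0.091
```

Despite a relative risk of only 1.25, the common variant accounts for roughly ten times the fraction of cases in this toy comparison, purely because of its prevalence.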
Because of the small effect sizes noted in GWA studies, pooling of results from several GWA studies is needed to attain statistical power for confirmation of cancer susceptibility genes and for replication in multiple populations. Such approaches are often hindered by a lack of consistency among studies in quantifying cancer risk, including the choice of measures and methodologies and the characterization of the phenotypic traits involved. The PhenX cooperative agreement (consensus measures for Phenotypes and eXposures) was initiated by the National Human Genome Research Institute to enhance cross-study analyses of genome-wide association studies and other large-scale genomic research efforts through the use of well-established measures of disease phenotypes, risk factors and environmental exposures (www.phenx.org). Twenty research domains have been selected by a Steering Committee, and working groups of researchers are identifying measures and standardized protocols that will be recommended for future genome-wide association and genomic studies.
The DataSHaPER project (Data Schema and Harmonization Platform for Epidemiological Research) is a joint initiative set up by P3G (the Public Population Project in Genomics), PHOEBE (Promoting Harmonization of Epidemiological Biobanks in Europe) and the Canadian Partnership for Tomorrow project. This project is constructing a suite of harmonization schemas for biobanks and major epidemiological studies. They are also exploring the coverage of the Generic DataSHaPER relative to the questionnaires used by a number of the world's leading population biobanks and epidemiological studies. To date, 28 such questionnaires have been paired to the Generic DataSHaPER. Each variable is classified as being a ‘complete match’ (the DataSHaPER variable can be generated with no information loss), a ‘partial match’ (partial information loss unavoidable) or an ‘impossible match’ (no useful information can be generated). The DataSHaPER project has so far focused on questionnaires written in English (or where a reliable English translation is available).
Likewise, the Office of Population Genomics of the National Human Genome Research Institute was established to promote the application of GWA genotyping to population studies including case–control and cohort studies, randomized clinical trials and biorepositories with phenotypes defined by electronic medical records (49). The Genetic Association Information Network was initiated in late 2005 as a public/private partnership to investigate the genetic basis of common diseases through a series of collaborative GWA studies (50). Six studies involving a total of 18000 DNA samples were selected on the basis of scientific merit, potential for genome-wide genotyping to provide valuable new insights and public health significance of the traits proposed for study. Chief among these has been the Gene–Environment Association Studies component of the genes, environment and health initiative, which began with eight GWA studies and recently was expanded to a total of 14 studies (51).
An unexpected finding from published genome-wide studies is that a genomic interval identified in one phenotype has subsequently also been implicated in other complex traits. Two examples come to mind. Chromosome 8q24 has emerged as a potentially important region in prostate cancer and multiple risk variants in this region have been identified (52,53). This region is also implicated in colorectal, breast and bladder cancers (54–56). These findings underscore that genetic variation in 8q24 is involved in multiple cancer types.
Another example is the region on chromosome 5p15.33 in an area of high LD and represented by two correlated SNPs, rs401681 and rs31489 (D′=1 and r2=0.87 in data from HapMap CEU). Two lung cancer meta-analyses identified another putative causative region at 5p15.33 (57,58). This region contains two biologically relevant genes for lung cancer, the telomerase reverse transcriptase gene, TERT, and cleft lip and palate transmembrane 1 like gene, CLPTM1L. The two groups reported two different variants (respectively, rs402710 and rs401681) that are in high LD to be associated with lung cancer risk [P for the IARC study (57)=2×10−7 and P for Institute of Cancer Research/MD Anderson group (58)=7.9×10−9]. The IARC group identified a second SNP, rs2736100, that was associated with lung cancer and suggested that this variant had an independent effect (P=4×10−6). A third subsequent report from the DeCODE group (59) provided evidence that this 5p15.33 region may be a susceptibility locus for multiple cancer types, including cervix, bladder and skin cancers as well as lung cancer, and found similar evidence for two potential susceptibility alleles in this region. A recent GWA in brain cancer likewise implicated this region in risk (60). Such findings have generated another new term, the ‘diseasome’ (61).
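The D′ and r2 values quoted for such SNP pairs are standard haplotype-based LD measures, and the quoted pattern (D′=1 with r2<1) arises whenever one allele always travels with the other but their frequencies differ. As a rough sketch (the haplotype and allele frequencies below are invented for illustration, not taken from HapMap):

```python
def ld_stats(p_ab, p_a, p_b):
    """D' and r^2 between two biallelic loci, from the frequency of the
    AB haplotype (p_ab) and the allele frequencies p_a and p_b."""
    d = p_ab - p_a * p_b                      # raw disequilibrium
    if d >= 0:                                # normalize by the maximum |D|
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d_prime, r2

# Hypothetical frequencies: allele A (30%) always occurs on a B haplotype,
# so p_ab equals p_a — complete LD (D'=1) but imperfect correlation.
d_prime, r2 = ld_stats(p_ab=0.30, p_a=0.30, p_b=0.35)
print(f"D'={d_prime:.2f} r2={r2:.2f}")  # → D'=1.00 r2=0.80
```

This is why a tagging SNP with D′=1 can still only partially capture the signal at a causal variant: association power scales with r2, not D′.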
Those GWA studies that identify low-penetrance common susceptibility alleles, as discussed above, raise the possibility of incorporating panels of gene variants into existing clinical/epidemiological risk prediction models and of assessing any improvement in model performance. In theory, these genetic data are stable, accurate and amenable to high-throughput analysis. However, to date, the updated models have shown only modest improvements in discrimination, and few have focused on ethnic groups other than Caucasians.
Gail (62) has shown that adding seven SNPs identified from GWA analyses to the original Breast Cancer Risk Assessment Tool yielded only a modest improvement in the area under the curve statistic (AUC), from 0.607 to 0.632. Gail (63) subsequently reported that inclusion of an expanded panel of 11 SNPs produced an even smaller incremental gain over the seven-SNP model (AUC of 0.637). However, the receiver operating characteristic curve may not be sensitive to differences in probabilities between models and may therefore be insufficient to assess the impact of adding a new predictor. A substantial gain in performance may not yield a substantial increase in AUC, and a very large independent association of the new marker with risk is required to demonstrate a meaningfully larger AUC. As an example, the widely applied Framingham Risk Score has an AUC of ~0.80; adding another very sensitive marker of coronary artery disease, coronary artery calcification, yields an AUC of only ~0.84.
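The AUC figures discussed here have a simple rank-based interpretation: the probability that a randomly chosen case receives a higher risk score than a randomly chosen control. A minimal sketch of that Mann–Whitney estimate, using hypothetical model scores:

```python
def auc(case_scores, control_scores):
    """Mann-Whitney estimate of the area under the ROC curve: the
    probability that a random case outranks a random control (ties count 0.5)."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))

# Hypothetical risk scores from a baseline model; 0.5 would be chance
cases = [0.8, 0.6, 0.55, 0.4]
controls = [0.7, 0.5, 0.3, 0.2]
print(auc(cases, controls))  # → 0.75
```

Because the statistic depends only on the ranking of scores, a new predictor that shifts many individual probabilities without reordering cases above controls can leave the AUC almost unchanged, which is precisely the insensitivity noted above.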
One suggested statistic for comparing nested models is the net reclassification index that is useful when risk categories are defined, and there is a consensus as to clinically meaningful cutpoints or thresholds (64). This statistic quantifies overall improvement in sensitivity and specificity of the model. The expanded model is considered to reflect a net improvement in risk classification when there is evidence of upward reclassification of cases and downward reclassification for the controls.
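Given agreed risk-category cutpoints, the net reclassification index is computed by tallying category movements separately in cases and controls; upward moves count for cases and downward moves count for controls. A minimal sketch with hypothetical category assignments:

```python
def net_reclassification_index(old_cat, new_cat, is_case):
    """Net reclassification improvement for nested risk models.
    old_cat/new_cat: ordinal risk-category index per subject (e.g. 0=low,
    1=intermediate, 2=high) under the baseline and expanded models."""
    up_case = down_case = up_ctrl = down_ctrl = 0
    n_case = sum(is_case)
    n_ctrl = len(is_case) - n_case
    for old, new, case in zip(old_cat, new_cat, is_case):
        if case:
            up_case += new > old
            down_case += new < old
        else:
            up_ctrl += new > old
            down_ctrl += new < old
    # Cases should move up and controls down for a positive NRI
    return (up_case - down_case) / n_case + (down_ctrl - up_ctrl) / n_ctrl

# Hypothetical data: 2 of 4 cases reclassified upward, 1 of 4 controls downward
old = [0, 1, 1, 2, 1, 1, 2, 0]
new = [1, 2, 1, 2, 1, 0, 2, 0]
case = [1, 1, 1, 1, 0, 0, 0, 0]
print(net_reclassification_index(old, new, case))  # → 0.75
```

Note that the statistic is only as meaningful as the cutpoints themselves, which is why the consensus on clinically relevant thresholds stressed above is a precondition for using it.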
These metrics were recently evaluated in our own internally validated risk prediction model for lung cancer that incorporated easily attainable epidemiologic and clinical variables (65). In a GWA analysis of 315450 tagging SNPs (66) and in a subsequent meta-analysis (58), the strongest associations were for SNPs mapping to 15q25.1. There was also consistent evidence for a new disease locus at 5p15.33 (rs401681; P=4.40×10−6), as discussed above. We therefore added one SNP from the 15q25.1 locus (the SNPs at this locus were in strong LD) and two SNPs from the 5p15.33 region to the baseline model and assessed the improvement in discrimination (67). The AUC for the baseline epidemiologic/clinical model, including 1016 cases and 1111 controls, was 0.661. With addition of the three SNPs, the AUC showed a modest, yet significant, improvement to 0.673 (P<0.001). Based on the net reclassification index, the SNPs modestly improved both sensitivity (9%) and specificity (6%). It could be argued that models providing a continuous score are more appropriate in the clinical setting and that a variety of summary measures of model performance are needed to assess these multigenic models.
There have been unanticipated challenges in linking cancer risk to SNPs identified in these GWA studies. Once genome-wide significance of an association is established, the next steps must include replication in similar and different populations, meta-analyses, fine mapping and in vitro studies to establish the functional significance of the variants. As stated above, most susceptibility variants identified from GWA analyses have been associated with very small relative risks. Furthermore, the SNPs explain only a very small proportion of the genetic contribution (48).
Hunter et al. (68) have likened the susceptibility markers discovered so far to ‘canaries in the coal mine, signaling the relationship to a disease of a biologically important gene or gene regulatory mechanism in humans whose ultimate importance cannot be estimated until the full set of mutations is found, the biologic pathways understood, and clinical utility demonstrated’. Altshuler et al. (69) have argued that the primary value of genetic mapping is not risk prediction, but providing novel insights about mechanisms of disease. Kraft et al. (48) conclude that it is currently premature to test for susceptibility for most diseases, but that the situation may be different in 2–3 years as more susceptibility loci are discovered and replicated.
A unifying premise in the concept of ‘Integrative Epidemiology’ is that some genes or pathways implicated in cancer risk might also be involved in risk of exposure and in prediction of outcome, recognizing also that many additional pathways and targets contribute to treatment efficacy (70). Integrative Epidemiology (Figure 1) is designed to utilize the same populations, biospecimens and technology platforms as in case–control studies of risk to also evaluate outcome and response to therapy, as well as cancer risk-taking behaviors (e.g. nicotine dependence). It builds upon the theory that gene discovery moves between studies of molecular epidemiology and those of tumor molecular genetics, thereby enriching and informing both disciplines (70). Caporaso (71) has argued that such an approach is efficient: although the cost of such larger studies is greater, the marginal cost per unit of information is actually lower and the scientific payoff greater.
Recent findings from published lung cancer GWA studies exemplify the concept that the same genetic variants may contribute both to nicotine dependence and directly to lung carcinogenesis. A common variant in the nicotinic acetylcholine receptor gene cluster on chromosome 15q24–25.1 was associated with lung cancer risk in three recently published independent GWA studies, with no consensus as to the relative impact of the variants on the propensity to smoke versus a direct carcinogenic effect (66,72,73). Further analysis of our GWA data (74) argues against a sole effect of nicotine dependence and for a parallel and direct role of SNPs in the CHRNA3/5 region in genetic susceptibility to tobacco carcinogenesis. Bronchial epithelial cells selectively express both the alpha 3 and alpha 5 subunits, and nicotine and its metabolites may play a more direct role in lung cancer induction through activation of autocrine-proliferative signaling networks; however, present or past tobacco exposure is also necessary. Bierut et al. (75) demonstrated that the non-synonymous coding SNP of the CHRNA5 gene, rs16969968 (which is in complete LD with rs1051730, identified in the lung GWA studies), is strongly predictive of habitual smoking, and further demonstrated that the variant form of the alpha 5 subunit altered receptor function.
At the level of risk assessment, the focus might be on germline polymorphisms in candidate genes, as illustrated above. For early detection, epigenetic events in these same or other genes may be relevant; tumor tissue expression levels, loss of heterozygosity, genomic amplification, rearrangements or somatic mutations in the same classes of genes may determine outcome. Over the course of the tumorigenic process, a single gene could initially show a modest change in expression or function due to a polymorphism, resulting in increased carcinogenic exposure and/or a heightened propensity to develop cancer; undergo further inactivation through epigenetic changes or loss of heterozygosity during tumor initiation; and be completely inactivated by mutation during tumor progression.
Patient response rates to therapy are unpredictable, even controlling for known prognostic factors. Many patients needlessly receive ineffective toxic therapies, with delays in implementing optimally effective therapies. More precise prediction of the dose needed and of therapeutic response could change clinical practice and suggest novel druggable targets. Blockbuster drugs are typically efficacious in only 40–60% of the patient population, and adverse reactions to medication are reported to be the fourth leading cause of death in the USA, ahead of pulmonary diseases, diabetes, acquired immunodeficiency syndrome and pneumonia. Many clinical studies have failed because patients were not risk stratified.
Adverse side effects and therapeutic failure both have strong genetic components. ‘Pharmacogenetics’ is the study of how variation in single genes in drug metabolism and response pathways relates to patient drug response phenotypes. ‘Pharmacogenomics’ examines the influence of genome-wide genetic variation on drug response, correlating gene expression or SNPs with a drug's efficacy or toxicity. In practice, however, the terms are used interchangeably.
With advances in large genome-scale sequencing and in bioinformatic tools in processing large amounts of data, this science has transitioned to involving studies of the entire spectrum of genes in the human genome (76). Molecular-based criteria are already changing the approach to therapy (e.g. gene expression profiles in breast cancer and epidermal growth factor receptor mutation status in lung cancer) and advancing the hope of generating ‘personalized molecular profiles’ to inform personalized care and ensure optimal therapeutic response. While these examples use somatic genetic information, there are examples where germline genetic information is already being used to choose treatment agents (e.g. CYP2D6 in breast cancer treatment with tamoxifen) and to guide selection of drug dose (e.g. CYP2C9 and VKORC1 for warfarin dosing, UGT1A1 for irinotecan and TPMT for 6-mercaptopurine and azathioprine).
Pharmacogenetic studies evaluating the role of germ-line genetic determinants are being conducted within the context of existing case–control and cohort studies or by exploiting therapeutic trials. These data can complement ongoing research in tumor intrinsic molecular abnormalities that may influence response, such as tissue and lysate arrays to assess protein levels and function, functional genomics and chemical genomics screens. This approach integrates epidemiology, molecular biology and genomics and also could incorporate in silico approaches and systems biology to evaluate genotype/phenotype correlations, underlying mechanisms and interactions among gene, drug and host.
Huang et al. (76) propose a five-stage approach to applying pharmacogenetic and pharmacogenomic studies to cancer therapy. This includes (i) determining the role of genetics in drug response; (ii) screening for and identifying genetic markers; (iii) validating genetic markers; (iv) assessing clinical utility and (v) assessing pharmacoeconomic impact.
The application of genome-wide techniques to this discipline is in its infancy. Specifically, this agnostic approach uses germ-line genome-wide data to move beyond known genes in known pathways to identify unanticipated genes that contribute to risk of adverse drug reaction or lack of efficacy and to construct comprehensive prognostic profiles of epidemiologic, clinical and germline genetic data to improve patient outcomes. Genome-wide approaches to identify chemical silencing of gene expression through methylation of gene regulatory regions can also improve the accuracy of the profiles. These approaches mandate the need for comprehensive epidemiologic and clinical databases linked with specimen repositories that encompass germline DNA, serum and tumor tissue.
The temptation for epidemiologists to be lured by high-tech ‘omics’ is immense. Systems biology is defined as the integration of interactions in biological systems from diverse experimental sources using interdisciplinary tools and multiple data sources. ‘Systems Epidemiology’ (77) is conceived of as the observational prototype of systems biology, an extension of the classical epidemiologic approach beyond genetic predisposition to include powerful new platforms such as the transcriptome, proteome and metabolome, among others. Many of these technologies have not yet been satisfactorily validated and will not be reviewed in this paper. Vineis et al. (78) outline the steps needed to validate these biomarkers. They also stress the challenge of understanding whether these biomarkers belong to the causal pathway between exposure and disease, are confounded by other exposures or are a consequence of disease. Systems Epidemiology has the potential to contribute mechanistic plausibility to observational epidemiology.
Another favored technology is that of microarrays, whether DNA, RNA or tissue based. Webb et al. (79) point out that the appeal of microarrays to epidemiologists lies in their ability to study changes in the target tissue, to correlate these changes with germline genetic changes and exposures and to help sort out associations with risk factors more coherently by defining homogeneous subsets of cancers. They also caution that such studies should be prospective and sufficiently well powered to reduce bias and address confounding. Finally, they stress the need for quality control of the assay platform and for validation studies if microarrays are to help in understanding exposure–cancer associations. Gene expression levels vary among individuals and are genetically regulated. The genetics of gene expression (GOGE) is the study of the genetic basis of variation in gene expression (80). Microarrays have changed the scale at which gene expression can be measured, and genome-wide studies are now possible. Like GWA studies, they do not require prior knowledge about regulatory mechanisms, but they do mandate consideration of exposures that influence expression levels.
Epigenetic mechanisms (such as DNA methylation and histone protein changes) may also mediate environmental influences on gene expression and are therefore becoming a focus of epidemiologic investigation. The term ‘Epigenetic Epidemiology’ has been proposed as a framework for studies that seek to understand the joint influences of epigenetic changes and environmental exposures on cancer risk. This approach is still in its relative infancy. Methylation can be studied in candidate genes or on a genome-wide scale, although unlike the genotype, which is fixed at conception, the epigenome is dynamic, so longitudinal specimens and data are highly preferred. Unlike simple genotyping, the levels and patterns of epigenetic changes differ across tissues and cell types (81) and may not reflect events in target tissues. Therefore, unlike studies of the germline genome, epigenetic studies remain subject to the problem of reverse causality.
‘MicroRNAs’ (miRNAs) are small single-stranded units of non-coding RNA that are evolutionarily well conserved and regulate gene expression. There are >5000 miRNA sequences recorded in an online repository. Polymorphisms in the miRNA pathway can result in increased or decreased miRNA function and can therefore modulate the expression of the genes they regulate, which govern processes such as cell death, cell proliferation and fat metabolism (82). Polymorphisms in miRNAs (as well as epigenetic silencing) may therefore play a defining role in cancer risk (e.g. papillary thyroid cancer), disease progression and prognosis. Mishra et al. (82) project that ‘understanding the role and functions of miR-polymorphisms has a promising future in pharmacogenomics, molecular epidemiology and individualized medicine’.
In 1993, Rothman (83) recognized that developments in the laboratory were outpacing the traditional time frame for epidemiologic research and predicted that such research would be ‘in a state of flux for the foreseeable future’. He cited the need for laboratory investigators to work closely with epidemiologists to address rapidly evolving technology more effectively. In a widely publicized special news report in Science in 1995 (84), Taubes raised the concern that epidemiology had exhausted its potential and, worse, was generating conflicting results that confused the public. Epidemiology was indicted for raising unsubstantiated fears. However, in an editorial the following year, Trichopoulos (85) sounded a more encouraging theme. He believed that epidemiology was likely to expand and flourish, but that consumers of epidemiologic results should keep in mind the limits of epidemiologic investigations. Comparing epidemiology to democracy as a system of government, he stated that both epidemiology and democracy ‘have many problems and weaknesses, but they still represent the best available approach for the achievement of their respective objectives’.
In a thoughtful paper on the future of epidemiology, Susser et al. (86) advocated a paradigm shift for epidemiology to encompass systems at different levels, from the societal to the molecular, a type of multilevel thinking. The black box represented a single level of design, analysis and interpretation. Chinese boxes are a nest of boxes, each containing a succession of smaller ones. The outer box might be the overarching physical environment, which in turn contains societies and populations, individuals, tissues, cells and finally molecules. They stressed the need to consider relations within and between these hierarchical levels, using new information systems and new biomedical techniques. This was an early adoption of the systems biology paradigm and of integrating data in a true network manner.
There is now growing recognition of the need for team science and for large consortial studies to achieve the sample sizes and statistical power required for meta-analyses and replication studies. In this new century, we can add the need to learn to communicate with, and to form ‘virtual teams’ with, our new collaborators: molecular and computational biologists, mathematicians, computer scientists, physicists and bioinformaticians. The approaches of molecular epidemiology, with the ability to obtain and carefully annotate patient data and samples, can provide a major integrative force for these disciplines. The culture of the research enterprise must support and nurture such expensive, high-risk but high-impact, cutting-edge research, promote data sharing and data and tissue linkages and facilitate easy access for researchers to such data and tissue resources, with proper consideration of patient protection issues. Of note, there will always remain the need to apply the basic tenets of sound, high-quality epidemiologic research in impeccably designed and well-powered studies with rigorous quality control of data and specimen acquisition and statistical analysis.
Finally, and despite these exciting technological advances, the basic questions that we as epidemiologists continue to ask and to try to answer remain the same: why did this patient get this disease at this time? Could it have been prevented? Could it have been predicted?
National Cancer Institute (CA55769 and CA127219 to M.R.S., P30 ES007784 to DiGiovanni).
Conflict of Interest Statement: None declared.