Disease biomarkers are used widely in medicine. But very few biomarkers are useful for cancer diagnosis and monitoring. Over the past 15 years, major investments have been made to discover and validate cancer biomarkers. Despite such investments, no new major cancer biomarkers have been approved for clinical use for at least 25 years. In the last decade, many reports have described new cancer biomarkers that promised to revolutionize the diagnosis of cancer and the management of cancer patients. However, many initially promising biomarkers have not been validated for clinical use. In this commentary, a plethora of parameters before sample analysis, during sample analysis, and after sample analysis that can complicate biomarker discovery and validation and lead to “false discovery” are discussed. Several examples of biomarker discoveries that were published in high-profile journals are also presented, as well as why they were not validated and the lessons learned from these false discoveries, so that similar mistakes can be avoided in the future.
The official National Institutes of Health definition of a biomarker is “a characteristic that is objectively measured and evaluated as an indicator of normal biologic processes, pathogenic processes or pharmacologic responses to a therapeutic intervention” (http://www.everythingbio.com/glos/definition.php?word=biomarker). This definition captures the clinical applications of biomarkers, including population screening, diagnosis, prognosis, monitoring, and prediction of therapeutic response or toxicity. Disease biomarkers range widely in type and include visual inspection (eg, blood in stool); biochemical, enzymatic, spectrometric, or immunological measurements; and molecular changes, to mention a few. The premise behind the use of biomarkers in medicine is that an observation or measurement can be used as a proxy of a biological process and as an indicator that a specific disease is present. Most recent biomarker publications, especially those for cancer biomarkers, have largely reported the inability to validate the biomarker for clinical use, rather than successful validation (for specific examples, see below). In fact, no new major cancer biomarker has been approved for clinical use for at least 25 years, despite the availability of highly sophisticated and powerful technologies and major advances in other areas of biomedical science. (The last biomarker approved by the Food and Drug Administration was HE4 protein for ovarian cancer in 2009; it was approved for monitoring recurrence but not for early detection.) However, before I discuss several biomarkers that have recently failed validation, I must emphasize that currently, biomarkers for many diseases are being successfully used in clinical practice. For example, the diagnosis of diabetes is based on the level of glucose in serum after 12 hours of fasting. The most sensitive indicator of a cardiovascular event (including myocardial infarction) is an elevated level of cardiac troponin in serum. 
The level of serum creatinine is the single most important indicator of renal function. The level of human choriogonadotropin provides confirmation of an early pregnancy. A high level of thyrotropin is a hallmark of primary hypothyroidism. Consequently, the current status of disease biomarkers is good. The recent revolutionary advancements in molecular diagnostics and high-throughput DNA sequencing will likely provide many new biomarkers (including gene mutations, copy number variations, and/or single nucleotide polymorphisms) for predisposition to various diseases and prediction of therapeutic response to a treatment or its toxicity (1).
What about cancer biomarkers? A handful of cancer biomarkers are currently recommended for clinical use but mainly for monitoring response to treatment among patients with advanced disease. With some notable exceptions (eg, human choriogonadotropin for germ cell tumors and gestational trophoblastic disease and α-fetoprotein for hepatocellular and testicular carcinoma), most cancer biomarkers in clinical use are not suitable for population screening or for early diagnosis, and the use of prostate-specific antigen for prostate cancer screening is still controversial (2). But why are so few new and effective cancer biomarkers that are suitable for screening and early diagnosis being discovered and validated? The answer cannot be attributed to a lack of pathophysiological knowledge, powerful techniques, or investment of funds and so may reside in difficulties that are associated with biomarker discovery, which have apparently been underestimated.
Many requirements must be fulfilled before a cancer biomarker can be approved for clinical use. If a molecule is to be effective in early diagnosis, it must be released into circulation in appreciable (and easily detectable) amounts by a small asymptomatic tumor (or its microenvironment), a requirement that could be considered an oxymoron. This requirement may explain why many cancer biomarkers detect disease relatively well among patients with late-stage disease but detect disease poorly among patients with early-stage disease or not at all among patients with asymptomatic disease. Another requirement is that the biomarker should be highly specific for the tissue of origin because if other tissues also produce this biomarker, then its background level in normal healthy individuals will likely be high. Thus, the tumor must produce levels of the marker that are substantially higher than background, a requirement that will probably require larger tumors. Another caveat for non–tissue specific biomarkers is that, if the level of a biomarker is affected by a noncancer disease, then its utility for cancer detection may also be compromised. Prostate-specific antigen is an example of such a biomarker; it is well established that prostate-specific antigen is elevated in benign prostatic hyperplasia (resulting from an enlarged prostate) and prostatitis (resulting from inflammation). To date, with the possible exception of posttranslational modifications (eg, pancreatic ribonuclease in pancreatic adenocarcinoma and kallikrein 6 in ovarian cancer), very few, if any, molecules have been identified that are expressed only by a cancer tissue but not by the corresponding normal tissue.
Problems can develop at many stages of biomarker discovery and validation that contribute to the short life span of many “newly discovered” biomarkers. These problems can occur during preanalytical, analytical, and postanalytical phases of cancer biomarker discovery and validation. The preanalytical phase includes aspects that may play a role before sample analysis (such as sample collection). The analytical phase includes aspects of the assay (such as its specificity and sensitivity). The postanalytical phase includes aspects that may play a role after sample analysis (such as data interpretation). In the preanalytical phase, it is important to examine whether various individual characteristics (eg, patient age, diet, sex, ethnicity, lifestyle, drugs, or exercise) and/or storage of tissue samples could independently affect biomarker levels. In addition, a molecule in circulation may be quickly cleared by the kidneys or the liver, captured by other serum molecules, or degraded by serum proteases. After sample collection, the marker could be released by blood cells (eg, red cells or eosinophils) during clotting or centrifugation, altering the originally present concentration. In the analytical phase, a quantitative and validated analytical method must be available that is highly specific, sensitive, and precise to avoid introducing measurement biases (ie, over- or underestimation of the true concentration) or artifacts. A sufficiently large number of high-quality tissue samples should be available for validation so that any statistically significant results can be unambiguously identified. Finally, in the postanalytical phase, sound data interpretation is essential so that the findings can be generalized to other series of specimens or the general population.
Press releases and news conferences that immediately accompany publication of a high-profile biomarker generate high expectations about the new biomarker. However, the media tend to ignore later reports that the biomarker failed to be validated for clinical use. Consequently, the general public receives skewed information about the biomarker.
Careful validation in independent datasets by independent investigators and publication of the findings are probably the best way to identify a good biomarker. In 2001, Pepe et al. (3) described five phases of biomarker development that were based on published evidence. My group has used these five phases to classify cancer biomarkers as we prepare clinical guidelines for use of tumor markers (4–7), under the sponsorship of The National Academy of Clinical Biochemistry of the USA. Guidelines for the clinical use of tumor markers have also been issued by other organizations, including the American Society of Clinical Oncology. Such guidelines are very useful because they use published evidence or expert opinion to specify the clinical utility of a biomarker. Biomarkers that have recently been discovered (and for which little published evidence exists) can be most effectively assessed in several blinded and independent validation studies. The Early Detection Research Network of the National Cancer Institute (http://edrn.nci.nih.gov/) is an organization that supports collaborative efforts for the discovery and validation of cancer biomarkers that are primarily suited for early detection but may also be useful in diagnostic, prognostic, predictive, and monitoring applications. An indication of the difficulty in discovering and validating clinically useful cancer biomarkers is the fact that the Early Detection Research Network has invested hundreds of millions of dollars since its inception approximately 10 years ago searching for new biomarkers, but to my knowledge, none of the new biomarkers discovered by investigators in the Early Detection Research Network has been approved for clinical use. This observation underlines the facts that adequate funding is available for the discovery and validation of biomarkers and that a highly competent group of international investigators participate in the Early Detection Research Network.
Many other organizations heavily fund cancer biomarker and translational cancer research investigations, so the cause of recent failures in cancer biomarker discovery and validation is neither a shortage of funds nor a lack of effort.
In 2009, the Early Detection Research Network began a hallmark study to validate a large number of candidate ovarian cancer biomarkers, individually and in panels, by using a phase III blinded design and high-quality clinical samples from the Prostate, Lung, Colorectal, and Ovarian Cancer study (8). Below, I have used publicly released results from this study (8) and from a few high-profile articles on cancer biomarkers that subsequently failed validation to illustrate why I believe that these “successes” soon became “failures.”
In addition to independent validation, promotion of scientific discussions and debates on newly discovered cancer biomarkers at conferences and in journals may accelerate the determination of whether a new biomarker has the potential for clinical utility. This procedure could ensure that valuable resources (time and money) are invested appropriately and promptly or directed instead to other projects. A highly useful online forum, BioMed Critical Commentary (www.bm-cc.org), posts opinions on published papers in biomedical sciences, including cancer biomarkers. This process may help researchers to quickly identify opposing opinions on published biomarkers. As Ransohoff (9) has pointed out, deficiencies in study design are a major reason that biomarkers encounter difficulties during their discovery and validation.
Below, I discuss a few examples of highly publicized “breakthrough” cancer biomarkers that subsequently could not be validated. Close examination of these and other examples, not included, may elucidate the problems with other biomarkers, so that the same mistakes can be avoided in the future. A related issue that is relevant to the discussion below is whether deficiencies in such papers can be identified during the review process so that flawed studies are not published. Obviously, many deficient papers are indeed rejected by peer review, but it would be unrealistic to expect that all would be identified. There are many reasons for this problem, including 1) reviewers and editors do not have access to the primary data or their interpretation, and they do not know whether authors engaged in frequently used practices of selective (favorable) reporting (also known as “cherry-picking”); 2) careful patient selection is crucial and, frequently, this issue is not addressed in detail in the published manuscript or its online supplementary information; 3) the bioinformatic analysis may suffer from overfitting, which means selectively using the statistical procedures and sample groups that will produce the most favorable results; 4) analytical issues such as use of a nonspecific assay may produce seemingly spectacular results, which do not represent the actual concentration of the biomarker in the sample; and/or 5) reviewers may provide less negative critiques of papers from a successful and highly reputable group or investigators. We are thus bound to continue seeing seemingly spectacular advances that cannot be subsequently validated by independent investigators or organizations using different sample sets.
In 1986, Fossel et al. (10) published a method for detecting malignant tumors by proton nuclear magnetic resonance of plasma samples. The method was simple and highly effective with sensitivity of approximately 100%. However, when other groups attempted to validate the method, they found a lower sensitivity (46%) and specificity (48%) and determined that the method could not predict whether a person had cancer (11–13). In addition, it was determined that the data used to discriminate between cancer and noncancer groups were associated with plasma lipid composition, particularly triglycerides, and that plasma lipid composition was associated with age, sex, and diet but not with cancer. Because the authors were not aware of these associations, patients in the cancer and noncancer groups were not matched for lipid composition, age, sex, and diet, and so the results were biased. During the validation analyses, case patients and control subjects were matched for age and sex, and it was found that cancer and noncancer groups could not be distinguished by nuclear magnetic resonance analysis alone (11–13).
This “biomarker” is an example of patient selection bias between the comparison groups. The bias originated from differences in lipid composition, age, sex, and diet.
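The effect of unmatched comparison groups can be illustrated with a small simulation. The sketch below is not the analysis of Fossel et al. or of the validation groups. It assumes a hypothetical signal that depends only on age (standing in for lipid composition) plus random noise, and all ages, effect sizes, and function names are invented for illustration.

```python
import random

def auc(cases, controls):
    """Probability that a random case scores higher than a random control."""
    wins = 0.0
    for case_value in cases:
        for control_value in controls:
            if case_value > control_value:
                wins += 1.0
            elif case_value == control_value:
                wins += 0.5
    return wins / (len(cases) * len(controls))

def simulate_signal(mean_age, n, rng):
    """Hypothetical signal: tracks age plus noise, unrelated to cancer status."""
    ages = [rng.gauss(mean_age, 8.0) for _ in range(n)]
    return [0.1 * age + rng.gauss(0.0, 1.0) for age in ages]

def compare_matching(seed=0, n=400):
    """Return (AUC with younger controls, AUC with age-matched controls)."""
    rng = random.Random(seed)
    case_signal = simulate_signal(60, n, rng)        # older case patients
    unmatched_controls = simulate_signal(40, n, rng) # younger control subjects
    matched_controls = simulate_signal(60, n, rng)   # age-matched control subjects
    return auc(case_signal, unmatched_controls), auc(case_signal, matched_controls)

if __name__ == "__main__":
    unmatched, matched = compare_matching()
    print(f"AUC with unmatched controls: {unmatched:.2f}")
    print(f"AUC with matched controls:   {matched:.2f}")
```

Under these assumptions, the signal appears to separate case patients from control subjects when the control subjects are younger, and the apparent separation shrinks to an area under the curve near 0.5 once the groups are matched for age, even though the signal carries no information about cancer.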
In 1998, Xu et al. (14) reported that lysophosphatidic acid is an effective biomarker for ovarian carcinoma. Elevated plasma lysophosphatidic acid levels were detected in 90% of patients with stage I ovarian cancer, in 100% of patients with stage II, III, or IV disease, and in 100% of patients with recurrent ovarian cancer. The test was also highly effective with other gynecologic cancers (including endometrial and cervical cancer) with a sensitivity of 92%. Lysophosphatidic acid was found to be far superior to CA125, which had a sensitivity of approximately 60% for all patients in their study and of only 22% for those with stage I disease.
Results from that study (14) led to the formation of a company, Atairgin, which invested tens of millions of dollars to validate the test in a multicenter clinical trial in the mid-1990s. Lysophosphatidic acid could not be validated as a biomarker for ovarian cancer, and Atairgin ceased to exist in the mid-2000s. In 2002, Baker et al. (15) published an independent validation of the lysophosphatidic acid test for ovarian cancer and concluded that the level of lysophosphatidic acid could distinguish neither case patients with ovarian cancer nor patients with other gynecologic malignancies from control subjects, thus invalidating lysophosphatidic acid as a cancer diagnostic marker. It was realized that lysophosphatidic acid is a component of many blood cells (eg, eosinophils) and activated platelets that is released into the serum or plasma in an uncontrolled and unpredictable manner (15).
Lysophosphatidic acid is a nonspecific marker that is highly affected by small changes in sample collection practices, such as time and speed of centrifugation. The uncontrolled leakage of lysophosphatidic acid from blood cells introduces biases in the results (preanalytical shortcomings) (14).
In 2005, Mor et al. (16) reported that a four-analyte panel had a sensitivity of 95% and a specificity of 95% for detection of ovarian cancer. The panel consisted of four markers: leptin (whose level was decreased in cancer patients), prolactin (whose level was increased in cancer patients), osteopontin (whose level was increased in cancer patients), and insulinlike growth factor 2 (whose level was decreased in cancer patients). The high sensitivity of 95% applied also to patients with stage I or II disease. In a subsequent refinement of the test to a six-analyte panel (including the addition of macrophage inhibitory factor and CA125, whose levels are increased in cancer patients), the authors claimed a sensitivity of 95% and a specificity of 99.4% for detection of ovarian cancer (17). There were two problems with these results. First, results from the four-analyte panel were problematic because two of the analytes appeared to decrease in serum from cancer patients (compared with that from control subjects) and, as explained elsewhere, decreases in serum biomarker levels in cancer patients (as opposed to increases) are highly unlikely (18). Second, the two markers that increased in cancer (ie, prolactin and osteopontin) showed highly unexpected sensitivities that exceeded that of CA125, the classical ovarian cancer biomarker. However, in the recent blinded validation analysis of the four-analyte panel by the Early Detection Research Network (8), it was found that, by use of serum samples that were collected less than 6 months from ovarian cancer diagnosis, CA125 alone had a sensitivity of 80%, at 95% specificity. This result is important because it validated the expected performance of the classical ovarian cancer biomarker. 
In the same analysis, the Early Detection Research Network found the following sensitivities at a 95% specificity for the panel members: 20% sensitivity for macrophage inhibitory factor, 8% for leptin, 8% for prolactin, 16% for osteopontin, and 4% for insulinlike growth factor 2. The six-marker panel had a sensitivity of 52%, which was inferior to that of CA125 alone (a panel member!).
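Because the figures above are all quoted as sensitivity at a fixed 95% specificity, it may help to make that calculation explicit. The following is a minimal sketch, assuming that higher marker values indicate disease and that the cutoff is taken directly from the distribution of control values. The function name and the numbers are hypothetical and are not data from the Early Detection Research Network study.

```python
import math

def sensitivity_at_specificity(controls, cases, specificity=0.95):
    """Return (sensitivity, cutoff) at a fixed specificity.

    Assumes that higher marker values indicate disease. The cutoff is the
    smallest control value with at least `specificity` of controls at or
    below it; a case is called positive only if it exceeds the cutoff.
    """
    ordered = sorted(controls)
    # Rank of the cutoff among the controls (the small tolerance guards
    # against floating-point error in the product).
    rank = math.ceil(specificity * len(ordered) - 1e-9)
    cutoff = ordered[max(rank, 1) - 1]
    positives = sum(1 for value in cases if value > cutoff)
    return positives / len(cases), cutoff

# Invented values for illustration only (arbitrary units).
control_values = list(range(1, 101))   # 100 control subjects
case_values = [90, 96, 97, 150, 200]   # 5 case patients
sensitivity, cutoff = sensitivity_at_specificity(control_values, case_values)
print(f"cutoff={cutoff}, sensitivity={sensitivity:.2f}")
```

Fixing the specificity first and then reading off the sensitivity puts every marker on the same footing, which is what allows the panel members and CA125 to be compared directly.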
In 2008, McIntosh et al. (19) questioned the efficacy of the six-member panel by carefully examining the validation design, as described previously (17), and concluded that serious methodological issues may have highly exaggerated the results. Specifically, they noted that in calculating panel performance, the authors violated fundamental principles of statistical analysis when they used combined data (training set and test set) to calculate accuracy. Ideally, only data from the training set should be used to select a classifier and only the test set be used to evaluate the classifier. In addition, case patients and control subjects came from two different sources, which may introduce independent biases that are not related to ovarian cancer but may be related to other factors, including stress, time of blood draw, or time since last meal (19).
It is rather premature to draw definitive conclusions on the reasons for the failure of the two panels for ovarian cancer. McIntosh et al. (19) suggested inappropriate statistical analysis and possible biases in selection of case patients and control subjects.
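The statistical objection, that accuracy should not be calculated on data that include the training set, can be illustrated with a short simulation. This is only a sketch under stated assumptions and not a reconstruction of the analysis in (17) or (19): the "markers" are pure random noise, the classifier is a hypothetical single-marker sign rule, and the sample sizes are invented.

```python
import random

def make_noise_dataset(n_samples, n_features, rng):
    """Balanced labels and pure-noise markers: no marker carries real signal."""
    labels = [i % 2 for i in range(n_samples)]
    features = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
                for _ in range(n_samples)]
    return features, labels

def accuracy(features, labels, rule):
    """Accuracy of the rule 'call it a case if sign * marker > 0'."""
    index, sign = rule
    correct = sum(1 for row, label in zip(features, labels)
                  if (1 if sign * row[index] > 0 else 0) == label)
    return correct / len(labels)

def select_rule(features, labels):
    """Pick the single-marker rule with the best accuracy on the given data."""
    candidates = [(j, s) for j in range(len(features[0])) for s in (1, -1)]
    return max(candidates, key=lambda rule: accuracy(features, labels, rule))

def compare_evaluations(seed=0, trials=50, n_train=40, n_test=40, n_features=100):
    """Return mean accuracy on (training + test combined, test set only)."""
    rng = random.Random(seed)
    combined_total = 0.0
    test_total = 0.0
    for _ in range(trials):
        train_x, train_y = make_noise_dataset(n_train, n_features, rng)
        test_x, test_y = make_noise_dataset(n_test, n_features, rng)
        rule = select_rule(train_x, train_y)  # selection uses training data only
        combined_total += accuracy(train_x + test_x, train_y + test_y, rule)
        test_total += accuracy(test_x, test_y, rule)
    return combined_total / trials, test_total / trials

if __name__ == "__main__":
    combined, test_only = compare_evaluations()
    print(f"accuracy on combined data: {combined:.2f}")
    print(f"accuracy on test set only: {test_only:.2f}")
```

Because the markers here contain no information, the test-set accuracy stays near chance, whereas the accuracy calculated on the combined data is inflated by the samples that were used to choose the rule.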
In 2002, Kim et al. (20) proposed that osteopontin is a serum biomarker for ovarian carcinoma. At a specificity of 80%, osteopontin had a sensitivity of 80% for early-stage disease and of 85% for late-stage disease. The Early Detection Research Network validation study (8) used the same set of samples for all experiments and concluded, at a 95% specificity, that CA125 had a sensitivity of 80% and that osteopontin had a sensitivity of 16%.
The Early Detection Research Network validation study suggested that osteopontin is not an effective biomarker for ovarian carcinoma. More studies will be necessary to uncover the reasons for the large differences between the originally reported and validated sensitivities.
Early prostate cancer antigen-2 (EPCA-2) is a new putative prostate cancer biomarker that was reported in the original publication (21) and in a subsequent publication (22) to perform better than prostate-specific antigen for diagnosis, prognosis, and monitoring of prostate cancer. However, close examination of the methods used to measure EPCA-2 found major deficiencies, as outlined elsewhere (23,24). These deficiencies included reporting values that were beyond the detection limit of the assay and use of inappropriate agents to “capture” EPCA-2 (ie, use of undiluted serum instead of, for example, a specific antibody). Other putative biomarkers for colon cancer, which likely suffer from the same methodological shortcomings, have been proposed by the same group (25). Consequently, I concluded (23,24) that the reported results are likely due to artifacts of the analytical methods used. These methodological shortcomings were not addressed by the authors, despite invitation (23). In September 2009, Onconome, the company that sponsored this research for many years, filed a lawsuit against the investigators and their institutions. In response to this lawsuit, the first author of these publications, who had assayed EPCA-2 in the samples in the first article (21), stated that he was the only person to ever get the EPCA-2 assay to work and that it only worked for him once (never before, never since) (26).
Methodological artifacts can produce misleadingly “highly promising” data. Before biomarker validation studies are initiated, the analytical method must be carefully evaluated for precision and accuracy.
In 2002, Petricoin et al. (27) reported a method for early ovarian cancer diagnosis that used mass spectrometry to detect proteomic patterns from serum samples. The reported sensitivities and specificities were approximately 100%, even for serum from patients with early-stage ovarian cancer. This article is the most highly cited research article by these investigators, receiving approximately 256 citations per year and more than 2000 citations since its publication! The outstanding investigators and the high-profile journal that published the data, which were accompanied by a powerful editorial, generated widespread media coverage and optimism that a new era in cancer diagnostics had started. However, soon after this article appeared, I reported methodological shortcomings of this method and questioned its validity (28–30). Subsequently, others identified bioinformatic artifacts and concluded that the results obtained by Petricoin et al. were further compromised by variations in sample collection and storage, as reviewed (29). Despite the hundreds of seemingly positive publications that used similar approaches for other cancer types, as reviewed (28,29), an independent validation study of this method for prostate cancer diagnosis, sponsored by the Early Detection Research Network, has shown zero utility (31).
Despite its promise, this method has not yet found its way into clinical practice, 8 years after the original report. The failure can be attributed to shortcomings of the analytical method, bias in sample collection and storage, and the bioinformatic analysis.
In 2006, Villanueva et al. (32) reported that peptidomic patterns in serum, which had been generated by exoprotease activities and identified by mass spectrometry, could be used for near-perfect diagnosis of various forms of cancer. In my analysis of this article, I identified flaws in the method of patient selection (33). Specifically, in the study of prostate cancer, the average age of control subjects was 40 years compared with 60 years for case patients, and a large fraction of the control subjects were women, whereas all case patients were men. It is well known that women have no prostate and so do not suffer from prostate cancer. Furthermore, prostate cancer is relatively rare in men aged 40 years or younger. Consequently, case patients and control subjects were not matched well or wisely. Recently, a validation study (34) of this method has shown that peptides generated ex vivo from serum proteins by exoproteases are not useful biomarkers in ovarian cancer.
This study is an example of poor study design. There was large bias in matching of case patient and control groups by age and sex.
The examples mentioned above confirm the old saying: Just because it is in print does not mean it is true. I used selected examples from highly reputable and high-impact journals, which have very rigorous internal and external peer review and statistical review. Unfortunately, these examples likely represent the tip of the iceberg. The examples show that preanalytical, analytical, and postanalytical problems, together with deficiencies in study design, statistics, and/or bioinformatics (9), could lead to serious misinterpretations and to the generation of highly misleading data. How could experienced authors, reviewers, and editors overlook these important deficiencies? Part of the problem is that many scientists, including even Nobel Prize winners in basic sciences, have embarked on biomarker discovery and validation with experiences from qualitative fields of science. Many of these individuals lack the necessary analytical and clinical backgrounds for such undertakings. I hope that highlighting these issues with cancer biomarkers will lead to the design of improved biomarker discovery studies in the future. An additional aid would be validation studies that follow recommendations, such as those in the reporting guidelines for Standards for Reporting of Diagnostic Accuracy, which clearly outline how validation studies should be designed and executed (35,36). It should be noted that similar concerns regarding overoptimistic reports for cancer prognostic factors, including gene expression signatures generated by microarray profiling, have been expressed by others (37–39). Last, but not least, it may be appropriate, in the future, not to call a molecule a biomarker until it passes at least one independent validation study.
Early Detection Research Network of the National Cancer Institute, USA, and the Natural Sciences and Engineering Research Council of Canada.
The sponsors had no role in the study design, in the analysis or interpretation of the data, in the decision to submit this article for publication, or in the writing of the article. The author has full responsibility for all activities mentioned above.