Less than 1% of published cancer biomarkers actually enter clinical practice. Although best practices for biomarker development are published, optimistic investigators may not appreciate the statistical near-certainty and diverse modes by which the other 99% (likely including your favorite new marker) do indeed fail. Here, patterns of failure were abstracted for classification from publications and an online database detailing marker failures. Failure patterns formed a hierarchical logical structure, or outline, of an emerging, deeply complex, and arguably fascinating science of biomarker failure. A new cancer biomarker under development is likely to have already encountered one or more of the following fatal features seen in prior markers: lack of clinical significance, hidden structure in the source data, a technically inadequate assay, inappropriate statistical methods, unmanageable domination of the data by normal variation, implausibility, deficiencies in the studied population or in the investigator system, and its disproof or abandonment for cause by others. A greater recognition of the science of biomarker failure and its near-complete ubiquity is constructive and celebrates a seemingly perpetual richness of biological, technical, and philosophical complexity, the full appreciation of which could improve the management of scarce research resources.
Fundamentally, biomarker development is observational, or empiric, research. As such, it shares the shortcomings characteristic of observational methods. It lacks the strengths of a designed experiment. For example, interventional clinical trials permit the use of randomization and other powerful tools and controls not available to observational clinical studies. The latter may disprove, but cannot alone prove, causal relationships.
At the inception of most “new” biomarkers, the research is generally ambiguous as to the key decisions needed to perform the study (1). Investigators may not yet have fixed the scope or the variety of data types to be collected, the quality control rules to use for data inclusion/exclusion and data processing, when to interrupt data collection to interrogate the data, and whether to resume data collection after some data have been analyzed. Because this interrogation can be part of the quality-control mechanisms, it is not feasible to prohibit all forms of premature data analysis. Initially, also unsettled will be the precision and the number of questions to be directed towards the data. The lock-down rules, under which a final data set is defined, may be delayed; such decisions may not be possible prior to starting data analysis. These lock-down rules should be published in all emerging biomarker reports, but generally are not. Experimental research, in contrast, need not violate these ideals as often.
“Outcomes data” are an essential component of both experimental and observational research. In biomarker development, the markers are intended to find diseases yet unknown or predict events yet to unfold. These outcomes data, however, are often unreliable. Under-reporting and erroneous reporting are common. Sources of outcomes data, such as chart review and adverse-event reporting, have frustrating limitations. For example, it is possible that the outcome of interest to the biomarker study was never systematically gathered. A simple outcome inquiry, such as “Were you ever diagnosed to have a cancer?”, may never have been posed to the patients. A measure of significance, such as “Was the progression of the cancer the cause of death?”, may not have been recorded in the medical record examined. Different data types may be flawed to differing degrees, with serious biases and confounded variables selectively affecting the more unreliable data types.
Pervasive problems are not limited to biomarker development. Referring to biomedical research as a whole, Ioannidis offered a strong argument that most published scientific results are wrong (2). Referring to research published in psychology, Simmons explained that “It is unacceptably easy to publish ‘statistically significant’ evidence consistent with any hypothesis” (1).
The considerable resources devoted to biomarker research belie a similar, sobering truth, that few biomarkers enter clinical practice. If one were to define as “success” the creation of an insurance-reimbursable biomarker test employed to alter clinical decision-making, less than 1% of new cancer biomarkers have been successful. A sizable group of highly successful markers is identified by the reimbursements paid to pathologists to interpret new immunohistochemical markers for tumor classification. Another group, comprising unsuccessful markers, is defined by the paucity of new blood-based screening or prognostic markers among the clinical tests offered by medical clinics. Diamandis observed that “very few, if any, new circulating cancer biomarkers have entered the clinic in the last 30 years” (3). The problem of identifying novel cancer biomarkers must have been persistently underestimated (4, 5).
Underlying the difficulty is a general failure to validate clinically that a marker has utility. An ability to change a clinical decision arises from the marker’s predictive value or its ability to discriminate among disease classifications. Such clinical validation differs from a technical validation. The latter is stringently provided when the results of one assay are in agreement with those of another, orthogonal, assay. For example, the results from hybridization-based mRNA profiling may indicate that the more aggressive tumors express gene X at a high level. A technical validation is provided when a PCR-based or a protein-based assay confirms the elevated levels of gene X in the same samples. Functional studies may further confirm technically that the activity of gene X is also elevated in the tumors. The investigators may claim, with some justification, that the markers had been “rigorously validated”. Nonetheless, a subsequent attempt at a clinical validation, using a different panel of samples, may still find that gene X is not generally associated with tumor aggressiveness. This is possible because technical validations are inherently a circular process of discovery. For example, a finding in sample set Q will be technically validated as true when re-tested with varying methods applied in the same sample set, Q. When the newly discovered clinical association is tested in a different sample set, sample set R, the analytic strategy is no longer circular; clinical weaknesses may now be revealed.
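The circularity described above can be illustrated with a small simulation. The sketch below is not drawn from any cited study. It assumes that every feature is pure noise (no gene is truly associated with aggressiveness), and the sample sizes, the assay noise level, and the names `group_gap` and `simulate` are hypothetical choices made only for illustration.

```python
import random

def group_gap(values, labels):
    """Mean of the 'aggressive' group (label 1) minus mean of the other group."""
    hi = [v for v, y in zip(values, labels) if y == 1]
    lo = [v for v, y in zip(values, labels) if y == 0]
    return sum(hi) / len(hi) - sum(lo) / len(lo)

def simulate(n=20, p=2000, assay_noise=0.2, seed=0):
    """Return the apparent effect of the 'best' noise feature in three settings.

    All parameters are illustrative assumptions, not values from the article.
    """
    rng = random.Random(seed)
    labels = [1] * (n // 2) + [0] * (n - n // 2)

    # Sample set Q: p features of pure noise, unrelated to the labels.
    set_q = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]

    # Discovery: keep the feature with the largest apparent group difference.
    best = max(range(p), key=lambda j: abs(group_gap(set_q[j], labels)))
    gap_discovery = group_gap(set_q[best], labels)

    # Technical validation: an orthogonal assay re-measures the SAME samples,
    # modeled here as the same underlying values plus small measurement noise.
    remeasured = [v + rng.gauss(0, assay_noise) for v in set_q[best]]
    gap_technical = group_gap(remeasured, labels)

    # Clinical validation: the same feature in NEW samples (sample set R),
    # which for a noise feature means fresh, independent draws.
    set_r = [rng.gauss(0, 1) for _ in range(n)]
    gap_clinical = group_gap(set_r, labels)

    return gap_discovery, gap_technical, gap_clinical

discovery, technical, clinical = simulate()
print(f"discovery in Q:            {discovery:+.2f}")
print(f"technical validation in Q: {technical:+.2f}")
print(f"clinical validation in R:  {clinical:+.2f}")
```

Under these assumptions, the re-measured gap in set Q tends to track the discovery gap closely, because both are computed on the same samples, whereas the gap in set R is an independent draw centered on zero. Any single run can deviate; averaging over several seeds shows the pattern more reliably.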
It is typical to use high-dimensional data to identify new biomarkers for cancer. In this situation, the number of features determined for each sample (p, the quantity of parameters or dimensions in the data) greatly exceeds the sample size (n, the number of samples analyzed), producing the “large p, small n” or “p >> n” paradigm. Such data arise from highly parallel assays such as microarrays, mass spectrometry, or DNA sequencing. Most of these marker publications have not been validated clinically by outside investigators or, when they were, failed the external validation attempt (6). For example, among 35 detailed studies reporting a new molecular tumor-classifier (selected for studies having both a statistical cross-validation procedure used on an initial or “discovery” population and a validation attempt performed on a separate “external” or “validation” population), which in turn were cited by a total of 758 subsequent publications, only one of the citations constituted an additional independent validation of the reported classifier (7). As another example, the NIH’s Early Detection Research Network (EDRN) performed a large validation study of 35 reported markers for early diagnosis of ovarian cancer to attempt to improve on the available but inadequate marker, CA125. They found none that surpassed the inadequate marker (8). Even among the top five markers in this study, unless the specimens were obtained within a few months prior to the time of the initial clinical cancer diagnosis, the sensitivity ranged from 0.73 to 0.40 at marker cutoff levels producing a specificity of 0.95. Even this level of specificity was generous; i.e., it presented the new markers in a favorable light: the 5% rate (1 − 0.95 = 0.05) of false positives would likely be prohibitively high if used to screen asymptomatic populations.
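The reason a 5% false-positive rate becomes prohibitive in screening can be made concrete with Bayes' rule. The sketch below uses the sensitivities (0.73 and 0.40) and specificity (0.95) quoted above; the prevalence of 1 case per 2,500 people screened is a hypothetical figure assumed purely for illustration, and the function name is likewise illustrative.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Bayes' rule: the fraction of positive tests that are true positives."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# ASSUMPTION: hypothetical prevalence in an asymptomatic screened population.
ASSUMED_PREVALENCE = 1 / 2500

for sens in (0.73, 0.40):
    ppv = positive_predictive_value(sens, 0.95, ASSUMED_PREVALENCE)
    print(f"sensitivity={sens:.2f}  specificity=0.95  PPV={ppv:.4f}")
```

At this assumed prevalence, fewer than 1 in 100 positive results would correspond to a true cancer at either sensitivity, because false positives from the large unaffected population swamp the true positives. A different prevalence would change the numbers but not the direction of the argument.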
Failures in marker development equate to lost resources: they consume money, calendar years, labor, talent, and credibility for the field. The volume of misleading publications raises false hopes, poses ethical dilemmas, and triggers improper policy changes and purposeless debate.
An improved recognition of the patterns of failure should improve future performance of the field. As Heisenberg recalled Einstein saying, “It is the theory which decides what can be observed.” Optimal scientific procedures thus require that we consider a great diversity of theories to interpret any observed pattern, not excluding those theories that notify us of our failures, so that we can move towards a situation wherein “theory guides, but experiment decides.” This diversity represents, in a sense, our marker-development vocabulary. Repeatedly, however, the literature presents in a biased manner the arguments in favor of marker successes; this bias limits our vocabulary for communicating failures.
Biomarker development is typically a team process. Seldom, however, does a single team member command the topic completely. Some aspects of biomarker evaluation are obscure to most biomedical researchers. Omissions, or conceptual blind spots, in the team strategy can thus go undetected. It can be difficult to adjudicate among team members when differences of opinion arise as to the strategy, the final quality desired for the studies, and the scope and limitations of the conclusions justifiably drawn from the studies. A substantial attempt could be made to formulate a more complete common vocabulary. And because failures are more common and diverse than the successes in biomarker development, this effort could be voluminous. Such an effort to collate, to organize, and to teach this vocabulary is presented in the supplemental document, “Flaws to anticipate and recognize in biomarker development”, and builds upon focused prior efforts.
A number of general roadmaps have been offered to guide successful marker development and offer a beginning vocabulary of marker failure. Pepe et al described a sequential series of five stages of biomarker discovery, validation, and implementation (9). Comprehensive reviews of the best practices were offered for proteomic biomarker development (10) and for the validation steps for molecular biomarker signatures (11). Professional associations facilitate the incorporation of new, evidence-based methods into standards of care; examples include the National Comprehensive Cancer Network (NCCN) cancer care guidelines and the American Cancer Society (ACS) cancer screening guidelines.
Funding agencies such as the NIH launched supportive platforms for biomarker development. They created disease-focused opportunities, assembled qualified sample sets, and provided resources for reference laboratories. Efforts were manifest in the creation of the EDRN, the fostering of team science by special funding mechanisms (SPORE and PO1 grants by the NIH), and efforts to lower regulatory and intellectual property hurdles.
Guidelines are available as prepublication checklists to be used in manuscript creation and peer review. These reflect broad to narrow interests: REMARK guidelines for reporting of tumor marker studies (12), STARD standards for reporting of diagnostic accuracy (13, 14), MIQE guidelines for minimum information for publication of quantitative real-time PCR experiments (15), a checklist for statistical analysis and reporting of DNA microarray studies for cancer outcome (16), recommendations for improved standardization of immunohistochemistry and in-situ hybridization (17, 18), guidelines for tissue microarray construction representing multicenter prospective clinical trial tissues (19), and recommendations to record and lock down the statistical methods used in biomarker discovery and validation (11).
Instructive anecdotes relating specifically to cancer biomarkers have been collected and lessons derived. Productive authors in this area have included Ransohoff, Diamandis, McShane, Pepe, Berry, Baggerly, and Coombes. The published commentary at the BioMed Critical Commentary website addresses flaws in hundreds of cancer biomarker publications.
The accompanying review (see the supplemental document) incorporates an outline of lessons derived from the above sources.
Briefly, investigators involved in developing a cancer biomarker are likely to encounter one or more of the following flaws. 1) The marker may provide a valid classification of disease or risk, and yet still lack clinical utility. Alternatively, a marker may not provide a valid classification despite promising indications. For example, the marker may have appeared promising 2) due to hidden structure in the data (a form of bias producing an invalid classification), 3) due to the assay being technically inadequate, 4) due to use of inappropriate statistical methods, 5) due to normal variation in the results dominating over any possible useful information from the marker, 6) due to a low prior probability of the marker being useful (i.e., an implausible marker), 7) due to deficiencies inherent to the studied population, 8) due to deficiencies in the investigator system (including the research team, the research institution, the funding agencies, and the journal and its review process), or 9) owing to the marker, perhaps quietly, having already been disproved or abandoned by other knowledgeable investigators.
Many of these categories reflect the influence of a bias. According to Ransohoff, “Bias will increasingly be recognized as the most important ‘threat to validity’ that must be addressed… a study should be presumed ‘guilty’ – or biased – until proven innocent… Because there are many potential biases and some do not have consensus names or definitions, the process of identifying and addressing bias is not amenable to a simple checklist” (20). Many others reflect the error from statistical “overfitting” of high-dimensional data (11, 21) or from, as Berry warned, the ubiquitous and often insidious threat of multiplicities when the same patients have been analyzed repeatedly (22, 23).
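The threat of multiplicities can be given a rough scale with a simple calculation. The sketch below computes the chance of at least one false-positive finding when several null hypotheses are each tested at the same significance threshold. It assumes the tests are independent, which is a simplification: repeated analyses of the same patients are correlated, so the figures are indicative rather than exact. The function name and the choice of test counts are illustrative.

```python
def chance_of_any_false_positive(num_tests, alpha=0.05):
    """Probability of at least one false positive among `num_tests` tests.

    ASSUMPTION: all null hypotheses are true and the tests are independent.
    """
    return 1.0 - (1.0 - alpha) ** num_tests

for m in (1, 5, 20, 100):
    risk = chance_of_any_false_positive(m)
    print(f"{m:>3} tests at alpha=0.05 -> P(any false positive) = {risk:.3f}")
```

With twenty such tests, the chance of at least one spurious “significant” result already exceeds one half under these assumptions, which is why unreported re-analyses of a single data set can so easily yield a publishable but false marker.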
Exploring the above categories of failure, one notes a remarkable complexity of failure modes.
Simple errors are commonly encountered during biomarker development. They may doggedly persist owing to an under-appreciation of the immense diversity of such flaws and of the quiet erosive power they carry. An improved recognition of these patterns of failure could improve future biomarker development as well as biomarker retirement.
This work has been supported by NIH grant CA62924 and the Everett and Marjorie Kovler Professorship in Pancreas Cancer Research.
Financial interest and conflicts of interest: None to declare