Clinical management decisions for patients with cancer are increasingly being guided by prognostic and predictive markers. Use of these markers should be based on a sufficiently comprehensive body of unbiased evidence to establish that benefits to patients outweigh harms and to justify expenditure of health care dollars. Careful assessments of the clinical utility of markers by using comparative effectiveness research methods are urgently needed to more rigorously summarize and evaluate the evidence, but multiple factors have made such assessments difficult. The literature on tumor markers is plagued by nonpublication bias, selective reporting, and incomplete reporting. Several measures to address these problems are discussed, including development of a tumor marker study registry, greater attention to assay analytic performance and specimen quality, use of more rigorous study designs and analysis plans to establish clinical utility, and adherence to higher standards for reporting tumor marker studies. More complete and transparent reporting by adhering to criteria such as BRISQ [Biospecimen Reporting for Improved Study Quality] criteria for reporting details about specimens and REMARK [Reporting Recommendations for Tumor Marker Prognostic Studies] criteria for reporting a multitude of aspects relating to study design, analysis, and results, is essential for reliable assessment of study quality, detection of potential biases, and proper interpretation of study findings. Adopting these measures will improve the quality of the body of evidence available for comparative effectiveness research and enhance the ability to establish the clinical utility of prognostic and predictive tumor markers.
Many papers have been published in biomedical journals reporting on the development of prognostic and therapy-guiding biomarkers or predictors developed from high-dimensional data generated by omics technologies. Few of these tests have advanced to routine clinical use.
We discuss statistical issues in the development and evaluation of prognostic and therapy-guiding biomarkers and omics-based tests.
Concepts relevant to the development and evaluation of prognostic and therapy-guiding clinical tests are illustrated through discussion and examples. Some differences between statistical approaches for test evaluation and therapy evaluation are explained. The additional complexities introduced in the evaluation of omics-based tests are highlighted.
Distinctions are made between the clinical validity and the clinical utility of a test. It is explained why, to establish the clinical utility of prognostic tests, absolute risk measures should be evaluated in addition to relative risk measures. The critical role of an appropriate control group is emphasized for the evaluation of therapy-guiding tests. Common pitfalls in the development and evaluation of tests generated from high-dimensional omics data, such as model overfitting and inappropriate methods for test performance evaluation, are explained, and proper approaches are suggested.
The cited references do not comprise an exhaustive list of useful references on this topic, and a systematic review of the literature was not performed. Instead, a few key points were highlighted and illustrated with examples drawn from the oncology literature.
Approaches for the development and statistical evaluation of clinical tests useful for predicting prognosis and selecting therapy differ from standard approaches for therapy evaluation. Proper evaluation requires an understanding of the clinical setting and what information is likely to influence clinical decisions. Specialized expertise relevant to building mathematical predictor models from high-dimensional data is helpful to avoid common pitfalls in the development and evaluation of omics-based tests.
biomarker; omics; clinical test; diagnostic test; therapy selection
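One pitfall noted above, model overfitting when predictors are built from high-dimensional omics data, can be made concrete with a small simulation. The sketch below is illustrative only: the data are pure noise, and the nearest-centroid classifier and feature-selection rule are hypothetical choices, not methods taken from the text. It shows how resubstitution accuracy can look impressive on noise, while cross-validation that repeats feature selection inside each fold gives an honest, near-chance estimate.

```python
import numpy as np

rng = np.random.default_rng(3)

# Pure-noise data: 40 samples, 1000 features, random binary labels.
# Any apparent predictive accuracy here is overfitting.
X = rng.normal(size=(40, 1000))
y = rng.integers(0, 2, size=40)

def select_and_classify(X_tr, y_tr, X_te, n_feat=10):
    """Pick the n_feat features with the largest class-mean difference
    on the training data, then classify test samples by the nearer
    class centroid."""
    diffs = np.abs(X_tr[y_tr == 0].mean(0) - X_tr[y_tr == 1].mean(0))
    feats = np.argsort(diffs)[-n_feat:]
    c0 = X_tr[y_tr == 0][:, feats].mean(0)
    c1 = X_tr[y_tr == 1][:, feats].mean(0)
    d0 = ((X_te[:, feats] - c0) ** 2).sum(1)
    d1 = ((X_te[:, feats] - c1) ** 2).sum(1)
    return (d1 < d0).astype(int)

# Resubstitution: select features and test on the same data (biased).
resub_acc = (select_and_classify(X, y, X) == y).mean()

# Leave-one-out cross-validation with feature selection repeated
# inside each fold (honest: near 50% on noise).
hits = 0
for i in range(len(y)):
    tr = np.arange(len(y)) != i
    hits += int(select_and_classify(X[tr], y[tr], X[i:i + 1])[0] == y[i])
loocv_acc = hits / len(y)
print(resub_acc, loocv_acc)
```

The key design point is that every data-driven step, including feature selection, must be repeated within each cross-validation fold; selecting features once on the full data set before cross-validating leaks information and reproduces the resubstitution bias.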
Next-generation sequencing (NGS) technologies are used to detect somatic mutations in tumors and to study germline variation. Most NGS studies use DNA isolated from whole blood or fresh-frozen tissue. However, formalin-fixed, paraffin-embedded (FFPE) tissue is one of the most widely available types of clinical specimen. Its potential utility as a source of DNA for NGS would greatly enhance population-based cancer studies. Although preliminary studies suggest FFPE tissue may be used for NGS, the feasibility of using archived FFPE specimens in population-based studies and the effect of storage time on these specimens need to be determined. We conducted a study to determine whether DNA in archived FFPE high-grade ovarian serous adenocarcinomas from the Surveillance, Epidemiology, and End Results (SEER) registries' Residual Tissue Repositories (RTR) was present in sufficient quantity and quality for NGS assays. Fifty-nine FFPE tissues, stored for 3 to 32 years, were obtained from three SEER RTR sites. DNA was extracted, quantified, quality assessed, and subjected to whole-exome sequencing (WES). Following DNA extraction, 58 of 59 specimens (98%) yielded DNA and moved on to library generation followed by WES. Specimens stored for longer periods of time had significantly lower coverage of the target region (6% lower per 10 years; 95% CI: 3-10%) and lower average read depth (40x lower per 10 years; 95% CI: 18-60x), although WES data of sufficient quality and quantity were obtained for data mining. Overall, 90% (53/59) of specimens provided usable NGS data regardless of storage time. This feasibility study demonstrates that FFPE specimens acquired from SEER registries after varying lengths of storage time and under varying storage conditions are a promising source of DNA for NGS.
While there is ample literature reporting on the identification of molecular biomarkers for head and neck squamous cell carcinoma, none is currently recommended for routine clinical use. A major reason for this lack of progress is the difficulty in designing studies in head and neck cancer to clearly establish the clinical utility of biomarkers. Consequently, biomarker studies frequently stall at the initial discovery phase. In this paper, we focus on biomarkers for use in clinical management, including selection of therapy. Using several contemporary examples, we identify some of the common deficiencies in study design that hinder success in biomarker development for this disease area, and we suggest some potential solutions. The goal of this article is to provide guidance that can assist investigators to more efficiently move promising biomarkers in head and neck cancer from discovery to clinical practice.
prognostic biomarkers; molecular markers; clinical utility; head and neck cancer; HNSCC
The incorporation of biomarkers into the drug development process will improve understanding of how new therapeutics work and allow for more accurate identification of patients who will benefit from those therapies. Strategically planned biomarker evaluations in phase II studies may allow for the design of more efficient phase III trials and better screening of therapeutics for entry into phase III development, hopefully leading to increased chances of positive phase III trial results. Some examples of roles that a biomarker can play in a phase II trial include predictor of response or resistance to specific therapies, patient enrichment, correlative endpoint, or surrogate endpoint. Considerations for using biomarkers most effectively in these roles are discussed in the context of several examples. The substantial technical, logistic, and ethical challenges that can be faced when trying to incorporate biomarkers into phase II trials are also addressed. A rational and coordinated approach to the inclusion of biomarker studies throughout the drug development process will be the key to attaining the goal of personalized medicine.
In breast cancer, immunohistochemical assessment of proliferation using the marker Ki67 has potential use in both research and clinical management. However, lack of consistency across laboratories has limited Ki67’s value. A working group was assembled to devise a strategy to harmonize Ki67 analysis and increase scoring concordance. Toward that goal, we conducted a Ki67 reproducibility study.
Eight laboratories received 100 breast cancer cases arranged into 1-mm core tissue microarrays—one set stained by the participating laboratory and one set stained by the central laboratory, both using antibody MIB-1. Each laboratory scored Ki67 as percentage of positively stained invasive tumor cells using its own method. Six laboratories repeated scoring of 50 locally stained cases on 3 different days. Sources of variation were analyzed using random effects models with log2-transformed measurements. Reproducibility was quantified by intraclass correlation coefficient (ICC), and the approximate two-sided 95% confidence intervals (CIs) for the true intraclass correlation coefficients in these experiments were provided.
Intralaboratory reproducibility was high (ICC = 0.94; 95% CI = 0.93 to 0.97). Interlaboratory reproducibility was only moderate (central staining: ICC = 0.71, 95% CI = 0.47 to 0.78; local staining: ICC = 0.59, 95% CI = 0.37 to 0.68). Geometric mean of Ki67 values for each laboratory across the 100 cases ranged from 7.1% to 23.9% with central staining and from 6.1% to 30.1% with local staining. Factors contributing to interlaboratory discordance included tumor region selection, counting method, and subjective assessment of staining positivity. Formal counting methods gave more consistent results than visual estimation.
Substantial variability in Ki67 scoring was observed among some of the world’s most experienced laboratories. Ki67 values and cutoffs for clinical decision-making cannot be transferred between laboratories without standardizing scoring methodology because analytical validity is limited.
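The variance-components logic behind reproducibility analyses like the one above can be sketched as follows. This is a simplified one-way random-effects analogue (cases crossed with a single rater factor), not the full model used in the study, and the data are simulated, not the study's.

```python
import numpy as np

def icc_oneway(scores):
    """One-way random-effects ICC from an (n cases x k raters) matrix of
    log2-transformed scores, via ANOVA mean squares:
    ICC = (MSB - MSW) / (MSB + (k - 1) * MSW)."""
    n, k = scores.shape
    grand = scores.mean()
    case_means = scores.mean(axis=1)
    msb = k * ((case_means - grand) ** 2).sum() / (n - 1)              # between-case
    msw = ((scores - case_means[:, None]) ** 2).sum() / (n * (k - 1))  # within-case
    return (msb - msw) / (msb + (k - 1) * msw)

# Hypothetical example: 100 cases scored by 8 laboratories on the log2 scale.
rng = np.random.default_rng(0)
case_effect = rng.normal(4.0, 1.5, size=(100, 1))  # case-to-case variation
noise = rng.normal(0.0, 0.4, size=(100, 8))        # rater/measurement noise
icc = icc_oneway(case_effect + noise)
print(round(icc, 2))  # high ICC: case variance dominates noise variance
```

A high ICC means exactly what the abstract describes: noise variability is small relative to the case-to-case variability that the assay is meant to capture.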
The intraclass correlation coefficient (ICC) is widely used in biomedical research to assess the reproducibility of measurements between raters, labs, technicians, or devices. For example, in an inter-rater reliability study, a high ICC value means that noise variability (between-raters and within-raters) is small relative to variability from patient to patient. A confidence interval or Bayesian credible interval for the ICC is a commonly reported summary. Such intervals can be constructed employing either frequentist or Bayesian methodologies.
This study examines the performance of three different methods for constructing an interval in a two-way, crossed, random effects model without interaction: the Generalized Confidence Interval method (GCI), the Modified Large Sample method (MLS), and a Bayesian method based on a noninformative prior distribution (NIB). Guidance is provided on interval construction method selection based on study design, sample size, and normality of the data. We compare the coverage probabilities and widths of the different interval methods.
We show that, for the two-way, crossed, random effects model without interaction, care is needed in interval method selection because the interval estimates do not always have properties that the user expects. While the different methods generally perform well when there are a large number of levels of each factor, large differences between the methods emerge when the number of levels of one or more factors is limited. In addition, all methods are shown to lack robustness to certain hard-to-detect violations of normality when the sample size is limited.
Decision rules and software programs for interval construction are provided for practical implementation in the two-way, crossed, random effects model without interaction. All interval methods perform similarly when the data are normal and there are sufficient numbers of levels of each factor. The MLS and GCI methods outperform the NIB method when one of the factors has a limited number of levels and the data are normally or nearly normally distributed. None of the methods work well if the number of levels of a factor is limited and the data are markedly non-normal. The software programs are implemented in the popular R language.
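For intuition about ICC interval construction, the sketch below computes the exact F-based confidence interval that exists for the simpler one-way random-effects model. It is not the GCI, MLS, or Bayesian method studied above for the two-way crossed model, where no exact interval is available; the data are simulated, and SciPy is assumed for F quantiles.

```python
import numpy as np
from scipy.stats import f

def icc1_ci(scores, alpha=0.05):
    """Exact F-based confidence interval for the one-way random-effects
    ICC (illustrative analogue of the intervals discussed above)."""
    n, k = scores.shape
    case_means = scores.mean(axis=1)
    msb = k * np.var(case_means, ddof=1)
    msw = np.sum((scores - case_means[:, None]) ** 2) / (n * (k - 1))
    F0 = msb / msw
    df1, df2 = n - 1, n * (k - 1)
    FL = F0 / f.ppf(1 - alpha / 2, df1, df2)   # lower bound on variance ratio
    FU = F0 * f.ppf(1 - alpha / 2, df2, df1)   # upper bound on variance ratio
    return (FL - 1) / (FL + k - 1), (FU - 1) / (FU + k - 1)

# Simulated data: 30 subjects x 4 raters, true ICC = 1 / (1 + 0.25) = 0.8.
rng = np.random.default_rng(1)
data = rng.normal(0, 1, size=(30, 1)) + rng.normal(0, 0.5, size=(30, 4))
lo, hi = icc1_ci(data)
print(f"95% CI: ({lo:.2f}, {hi:.2f})")
```

The width of this interval shrinks with the number of subjects and raters, which mirrors the paper's finding that all methods behave well only when each factor has enough levels.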
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2288-14-121) contains supplementary material, which is available to authorized users.
Confidence interval; Credible interval; Generalized confidence interval; Intraclass correlation coefficient; Modified large sample
Predictive biomarkers to guide therapy for cancer patients are a cornerstone of precision medicine. Discussed herein are considerations regarding the design and interpretation of such predictive biomarker studies. These considerations are important for both planning and interpreting prospective studies and for using specimens collected from completed randomized clinical trials. Specific issues addressed are differentiation between qualitative and quantitative predictive effects, challenges due to sample size requirements for predictive biomarker assessment, and consideration of additional factors relevant to clinical utility assessment, such as toxicity and cost of new therapies as well as costs and potential morbidities associated with routine use of biomarker-based tests.
The US National Cancer Institute (NCI), in collaboration with scientists representing multiple areas of expertise relevant to ‘omics’-based test development, has developed a checklist of criteria that can be used to determine the readiness of omics-based tests for guiding patient care in clinical trials. The checklist criteria cover issues relating to specimens, assays, mathematical modelling, clinical trial design, and ethical, legal and regulatory aspects. Funding bodies and journals are encouraged to consider the checklist, which they may find useful for assessing study quality and evidence strength. The checklist will be used to evaluate proposals for NCI-sponsored clinical trials in which omics tests will be used to guide therapy.
To update the American Society of Clinical Oncology (ASCO)/College of American Pathologists (CAP) guideline recommendations for human epidermal growth factor receptor 2 (HER2) testing in breast cancer to improve the accuracy of HER2 testing and its utility as a predictive marker in invasive breast cancer.
ASCO/CAP convened an Update Committee that included coauthors of the 2007 guideline to conduct a systematic literature review and update recommendations for optimal HER2 testing.
The Update Committee identified criteria and areas requiring clarification to improve the accuracy of HER2 testing by immunohistochemistry (IHC) or in situ hybridization (ISH). The guideline was reviewed and approved by both organizations.
The Update Committee recommends that HER2 status (HER2 negative or positive) be determined in all patients with invasive (early stage or recurrence) breast cancer on the basis of one or more HER2 test results (negative, equivocal, or positive). Testing criteria define HER2-positive status when, within an area comprising >10% of contiguous and homogeneous tumor cells, there is evidence of protein overexpression (IHC) or gene amplification (HER2 copy number or HER2/CEP17 ratio by ISH based on counting at least 20 cells within the area). If results are equivocal (revised criteria), reflex testing should be performed using an alternative assay (IHC or ISH). Repeat testing should be considered if results seem discordant with other histopathologic findings. Laboratories should demonstrate high concordance with a validated HER2 test on a sufficiently large and representative set of specimens. Testing must be performed in a laboratory accredited by CAP or another accrediting entity. The Update Committee urges providers and health systems to cooperate to ensure the highest quality testing.
High-throughput ‘omics’ technologies that generate molecular profiles for biospecimens have been extensively used in preclinical studies to reveal molecular subtypes and elucidate the biological mechanisms of disease, and in retrospective studies on clinical specimens to develop mathematical models to predict clinical endpoints. Nevertheless, the translation of these technologies into clinical tests that are useful for guiding management decisions for patients has been relatively slow. It can be difficult to determine when the body of evidence for an omics-based test is sufficiently comprehensive and reliable to support claims that it is ready for clinical use, or even that it is ready for definitive evaluation in a clinical trial in which it may be used to direct patient therapy. Reasons for this difficulty include the exploratory and retrospective nature of many of these studies, the complexity of these assays and their application to clinical specimens, and the many potential pitfalls inherent in the development of mathematical predictor models from the very high-dimensional data generated by these omics technologies. Here we present a checklist of criteria to consider when evaluating the body of evidence supporting the clinical use of a predictor to guide patient therapy. Included are issues pertaining to specimen and assay requirements, the soundness of the process for developing predictor models, expectations regarding clinical study design and conduct, and attention to regulatory, ethical, and legal issues. The proposed checklist should serve as a useful guide to investigators preparing proposals for studies involving the use of omics-based tests. The US National Cancer Institute plans to refer to these guidelines for review of proposals for studies involving omics tests, and it is hoped that other sponsors will adopt the checklist as well.
Analytical validation; Biomarker; Diagnostic test; Genomic classifier; Model validation; Molecular profile; Omics; Personalized medicine; Precision Medicine; Treatment selection
Efficient development of targeted therapies that may only benefit a fraction of patients requires clinical trial designs that use biomarkers to identify sensitive subpopulations. Various randomized phase III trial designs have been proposed for definitive evaluation of new targeted treatments and their associated biomarkers (eg, enrichment designs and biomarker-stratified designs). Before proceeding to phase III, randomized phase II trials are often used to decide whether the new therapy warrants phase III testing. In the presence of a putative biomarker, the phase II trial should also provide information as to what type of biomarker phase III trial is appropriate. A randomized phase II biomarker trial design is proposed, which, after completion, recommends the type of phase III trial to be used for the definitive testing of the therapy and the biomarker. The recommendations include the possibility of proceeding to a randomized phase III of the new therapy with or without using the biomarker and also the possibility of not testing the new therapy further. Evaluations of the proposed trial design using simulations and published data demonstrate that it works well in providing recommendations for phase III trial design.
Human biospecimens are subject to a number of different collection, processing, and storage factors that can significantly alter their molecular composition and consistency. These biospecimen preanalytical factors, in turn, influence experimental outcomes and the ability to reproduce scientific results. Currently, the extent and type of information specific to the biospecimen preanalytical conditions reported in scientific publications and regulatory submissions varies widely. To improve the quality of research utilizing human tissues it is critical that information regarding the handling of biospecimens be reported in a thorough, accurate, and standardized manner. The Biospecimen Reporting for Improved Study Quality (BRISQ) recommendations outlined herein are intended to apply to any study in which human biospecimens are used. The purpose of reporting these details is to supply others, from researchers to regulators, with more consistent and standardized information to better evaluate, interpret, compare, and reproduce the experimental results. The BRISQ guidelines are proposed as an important and timely resource tool to strengthen communication and publications around biospecimen-related research and help reassure patient contributors and the advocacy community that the contributions are valued and respected.
The Reporting Recommendations for Tumor Marker Prognostic Studies (REMARK) checklist consists of 20 items to report for published tumor marker prognostic studies. It was developed to address widespread deficiencies in the reporting of such studies. In this paper we expand on the REMARK checklist to enhance its use and effectiveness through better understanding of the intent of each item and why the information is important to report.
REMARK recommends including a transparent and full description of research goals and hypotheses, subject selection, specimen and assay considerations, marker measurement methods, statistical design and analysis, and study results. Each checklist item is explained and accompanied by published examples of good reporting, and relevant empirical evidence of the quality of reporting. We give prominence to discussion of the 'REMARK profile', a suggested tabular format for summarizing key study details.
The paper provides a comprehensive overview to educate on good reporting and provide a valuable reference for the many issues to consider when designing, conducting, and analyzing tumor marker studies and prognostic studies in medicine in general.
To encourage dissemination of the Reporting Recommendations for Tumor Marker Prognostic Studies (REMARK): Explanation and Elaboration, this article has also been published in PLoS Medicine.
The REMARK “elaboration and explanation” guideline, by Doug Altman and colleagues, provides a detailed reference for authors on important issues to consider when designing, conducting, and analyzing tumor marker prognostic studies.
Exciting new technologies for assessing markers in human specimens are now available to evaluate unprecedented types and numbers of variations in DNA, RNA, proteins, or biological structures such as chromosomes. These markers, whether viewed individually or collectively as a 'signature', have the potential to be useful for disease risk assessment, screening, early detection, prognosis, therapy selection, and monitoring for therapy effectiveness or disease recurrence. Successful translation from basic research findings to a clinically useful test requires basic, translational, and regulatory sciences and a collaborative effort among individuals with varied types of expertise, including laboratory scientists, technology developers, clinicians, statisticians, and bioinformaticians. The focus of this commentary is the many statistical challenges in translational marker research, specifically in the development and validation of marker-based tests that have clinical utility for therapeutic decision-making.
Marker; biomarker; biostatistics; prognostic; predictive; treatment effect modifier; clinical test; translational research
Human biospecimens are subject to a number of different collection, processing, and storage factors that can significantly alter their molecular composition and consistency. These biospecimen preanalytical factors, in turn, influence experimental outcomes and the ability to reproduce scientific results. Currently, the extent and type of information specific to the biospecimen preanalytical conditions reported in scientific publications and regulatory submissions varies widely. To improve the quality of research utilizing human tissues, it is critical that information regarding the handling of biospecimens be reported in a thorough, accurate, and standardized manner. The Biospecimen Reporting for Improved Study Quality recommendations outlined herein are intended to apply to any study in which human biospecimens are used. The purpose of reporting these details is to supply others, from researchers to regulators, with more consistent and standardized information to better evaluate, interpret, compare, and reproduce the experimental results. The Biospecimen Reporting for Improved Study Quality guidelines are proposed as an important and timely resource tool to strengthen communication and publications around biospecimen-related research and help reassure patient contributors and the advocacy community that the contributions are valued and respected.
The molecular drivers that determine histology in lung cancer are largely unknown. We investigated whether microRNA (miR) expression profiles can differentiate histological subtypes and predict survival for non-small cell lung cancer.
We analyzed miR expression in 165 adenocarcinoma (AD) and 125 squamous cell carcinoma (SQ) tissue samples from the Environmental And Genetics in Lung cancer Etiology (EAGLE) study using a custom oligo array with 440 human mature antisense miRs. We compared miR expression profiles using t-tests and F-tests and accounted for multiple testing using global permutation tests. We assessed the association of miR expression with tobacco smoking using Spearman correlation coefficients and linear regression models, and with clinical outcome using log-rank tests, Cox proportional hazards and survival risk prediction models, accounting for demographic and tumor characteristics.
MiR expression profiles strongly differed between AD and SQ (global p<0.0001), particularly in the early stages, and included miRs located on chromosome loci most often altered in lung cancer (e.g., 3p21-22). Most miRs, including all members of the let-7 family, were down-regulated in SQ. Major findings were confirmed by qRT-PCR in EAGLE samples and in an independent set of lung cancer cases. In SQ, low expression of miRs down-regulated in the histology comparison was associated with a 1.2- to 3.6-fold increased mortality risk. A 5-miR signature significantly predicted survival for SQ.
We identified a miR expression profile that strongly differentiated AD from SQ and had prognostic implications. These findings may lead to histology-based therapeutic approaches.
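The global permutation test mentioned in the methods above can be sketched as follows, using the maximum absolute two-sample t statistic across all features as the global statistic. The data, feature counts, and statistic choice are illustrative assumptions, not details taken from the EAGLE analysis.

```python
import numpy as np

def global_permutation_p(X, labels, n_perm=1000, seed=0):
    """Global permutation test of whether any feature differs between
    two groups more than expected by chance. Shuffling labels preserves
    the correlation structure among features, which is why permutation
    is preferred over per-feature parametric tests for a global null."""
    rng = np.random.default_rng(seed)

    def max_abs_t(lab):
        g0, g1 = X[lab == 0], X[lab == 1]
        se = np.sqrt(g0.var(axis=0, ddof=1) / len(g0) +
                     g1.var(axis=0, ddof=1) / len(g1))
        return np.max(np.abs((g0.mean(axis=0) - g1.mean(axis=0)) / se))

    observed = max_abs_t(labels)
    perm_stats = [max_abs_t(rng.permutation(labels)) for _ in range(n_perm)]
    # Permutation p-value: fraction of shuffles at least as extreme.
    return (1 + sum(s >= observed for s in perm_stats)) / (1 + n_perm)

# Hypothetical data: 40 samples x 200 miRs, first 10 miRs shifted in group 1.
rng = np.random.default_rng(2)
labels = np.array([0] * 20 + [1] * 20)
X = rng.normal(size=(40, 200))
X[labels == 1, :10] += 2.0
p = global_permutation_p(X, labels)
print(p)
```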
Clinical biomarker tests that aid in making treatment decisions will play an important role in achieving personalized medicine for cancer patients. Definitive evaluation of the clinical utility of these biomarkers requires conducting large randomized clinical trials (RCTs). Efficient RCT design is therefore crucial for timely introduction of these medical advances into clinical practice, and a variety of designs have been proposed for this purpose. To guide design and interpretation of RCTs evaluating biomarkers, we present an in-depth comparison of advantages and disadvantages of the commonly used designs. Key aspects of the discussion include efficiency comparisons and special interim monitoring issues that arise because of the complexity of these RCTs. Important ongoing and completed trials are used as examples. We conclude that, in most settings, randomized biomarker-stratified designs (ie, designs that use the biomarker to guide analysis but not treatment assignment) should be used to obtain a rigorous assessment of biomarker clinical utility.
Carbohydrate antigen arrays (glycan arrays) have recently been developed for the high-throughput analysis of carbohydrate-macromolecule interactions. When profiling serum, information about experimental variability, inter-individual biological variability, and intra-individual temporal variability is critical. In this report, we describe the characterization of a carbohydrate antigen array and assay for profiling human serum. Through optimization of assay conditions and development of a normalization strategy, we obtain highly reproducible results with a within-experiment coefficient of variation (CV) of 10.8% and an overall CV (across multiple batches of slides and days) of 28.5%. We also report antibody profiles for 48 human subjects and evaluate for the first time the effects of age, race, sex, geographic location, and blood type on antibody profiles for a large set of carbohydrate antigens. We found that antibody levels for a variety of carbohydrates depended significantly on age and blood type. Finally, we conducted a longitudinal study with a separate group of 7 serum donors to evaluate the variation in anti-carbohydrate antibody levels within an individual over a period ranging from 3 to 13 weeks and found that, for nearly all antigens on our array, antibody levels are generally stable over this period. The results presented here provide the most comprehensive evaluation of experimental and biological variation reported to date for a glycan array and have significant implications for studies involving human serum profiling.
Glycan array; carbohydrate antigens; serum antibodies; microarray; variability; normalization
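The within-experiment and overall coefficients of variation reported above are, in essence, replicate standard deviations expressed as a percentage of the mean. A minimal sketch with made-up intensity values (not data from the study):

```python
import numpy as np

def coefficient_of_variation(values):
    """CV (%) of replicate measurements: sample SD as a percentage of the mean."""
    values = np.asarray(values, dtype=float)
    return 100.0 * values.std(ddof=1) / values.mean()

# Hypothetical replicate fluorescence intensities for one antigen:
# tight within a single batch, more variable across batches/days.
within_batch = [1050, 980, 1020, 1110, 995]
across_batches = [1050, 1400, 870, 1220, 760, 1330]
print(f"within-experiment CV: {coefficient_of_variation(within_batch):.1f}%")
print(f"overall CV: {coefficient_of_variation(across_batches):.1f}%")
```

As in the study, the overall CV exceeds the within-experiment CV because it absorbs batch-to-batch and day-to-day sources of variation in addition to replicate noise.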
MiR arrays are distinguished from gene expression arrays by their more limited number of probes and by the shorter, less flexible sequences available for probe design. Robust data processing and analysis methods tailored to the unique characteristics of miR arrays are greatly needed. Assumptions underlying commonly used normalization methods for gene expression microarrays containing tens of thousands or more probes may not hold for miR microarrays. Findings from previous studies have sometimes been inconclusive or contradictory. Further studies to determine optimal normalization methods for miR microarrays are needed.
We evaluated many different normalization methods for data generated with a custom-made two channel miR microarray using two data sets that have technical replicates from several different cell lines. The impact of each normalization method was examined on both within miR error variance (between replicate arrays) and between miR variance to determine which normalization methods minimized differences between replicate samples while preserving differences between biologically distinct miRs.
Lowess normalization generally did not perform as well as the other methods, and quantile normalization based on an invariant set showed the best performance in many cases unless restricted to a very small invariant set. Global median and global mean methods performed reasonably well in both data sets and have the advantage of computational simplicity.
Researchers need to consider carefully which assumptions underlying the different normalization methods appear most reasonable for their experimental setting and possibly consider more than one normalization approach to determine the sensitivity of their results to normalization method used.
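Of the methods compared above, global median normalization is the simplest to state: subtract each array's median log-ratio so that all arrays share a common center. A minimal sketch on made-up two-channel log-ratios; note that the underlying assumption, that most miRs are unchanged between channels, is exactly the kind of assumption the conclusion asks researchers to examine for small, targeted arrays.

```python
import numpy as np

def global_median_normalize(log_ratios):
    """Global median normalization: subtract each array's (row's) median
    log-ratio, centering every array at zero. Assumes most probes are
    unchanged, which may fail for small miR arrays."""
    log_ratios = np.asarray(log_ratios, dtype=float)
    return log_ratios - np.median(log_ratios, axis=1, keepdims=True)

# Hypothetical data: 3 replicate arrays x 8 miR probes, each array
# carrying a different additive dye/array offset.
raw = np.array([
    [0.5, 0.7, 0.4, 0.6, 0.5, 0.8, 0.3, 0.6],
    [-0.2, 0.0, -0.3, -0.1, -0.2, 0.1, -0.4, -0.1],
    [1.1, 1.3, 1.0, 1.2, 1.1, 1.4, 0.9, 1.2],
])
norm = global_median_normalize(raw)
print(np.median(norm, axis=1))  # each array is now centered at 0
```

Global mean normalization replaces the median with the mean; quantile-based and lowess methods make progressively stronger assumptions about the shared intensity distribution across arrays.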
A workshop sponsored by the National Cancer Institute and the US Food and Drug Administration addressed past lessons learned and ongoing challenges faced in biomarker development and drug and biomarker codevelopment. Participants agreed that critical decision points in the product life cycle depend on the level of understanding of the biology of the target and its interaction with the drug, the preanalytical and analytical factors affecting biomarker assay performance, and the clinical disease process. The more known about the biology and the greater the strength of association between an analytical signal and clinical result, the more efficient and less risky the development process will be. Rapid entry into clinical practice will only be achieved by using a rigorous scientific approach, including careful specimen collection and standardized and quality-controlled data collection. Early interaction with appropriate regulatory bodies will ensure studies are appropriately designed and biomarker test performance is well characterized.
TOC Summary: Atypical BSE is probably not sporadic and not related to sporadic Creutzfeldt-Jakob disease.
Strategies to investigate the possible existence of sporadic bovine spongiform encephalopathy (BSE) require systematic testing programs to identify cases in countries considered to have little or no risk of orally acquired disease or to detect a stable occurrence of atypical cases in countries in which orally acquired disease is disappearing. To achieve 95% statistical confidence that the prevalence for sporadic BSE is no greater than 1 per million (i.e., the annual incidence of sporadic Creutzfeldt-Jakob disease [CJD] in humans) would require negative tests in 3 million randomly selected older cattle. A link between BSE and sporadic CJD has been suggested on the basis of laboratory studies but is unsupported by epidemiologic observation. Such a link might yet be established by the discovery of a specific molecular marker or of particular combinations of trends over time of typical and atypical BSE and various subtypes of sporadic CJD, as their numbers are influenced by a continuation of current public health measures that exclude high-risk bovine tissues from the animal and human food chains.
Bovine spongiform encephalopathy; bovine amyloid spongiform encephalopathy (BASE); Creutzfeldt-Jakob disease; diagnostic screening tests; perspective
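The 3-million-cattle figure above follows from a standard zero-events sample-size calculation: n negative tests leave a (1 - p)^n chance of having seen no cases when the true prevalence is p, so solving (1 - p)^n = 0.05 at p = 1 per million gives roughly 3 million. A sketch:

```python
import math

def n_negative_tests(prevalence_bound, confidence=0.95):
    """Number of negative tests needed to conclude, with the given
    confidence, that true prevalence is below prevalence_bound,
    assuming random sampling: solve (1 - p)^n <= 1 - confidence for n."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - prevalence_bound))

# Upper bound of 1 case per million cattle with 95% confidence:
print(n_negative_tests(1e-6))  # roughly 3 million negative tests
```

For small p this reduces to the familiar "rule of three": n is approximately 3/p, since -ln(0.05) is about 3.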