

J Thorac Oncol. Author manuscript; available in PMC 2010 December 1.
Published in final edited form as:
PMCID: PMC2796575

Biostatistics: A toolkit for exploration, validation and interpretation of clinical data


Biostatistics plays a key role in all phases of clinical research, from design through monitoring, data collection, data analysis, and interpretation of the results. A clear understanding of the statistical framework as it relates to the study hypothesis, the reported results, and their interpretation is vital for the scientific integrity of the study and its acceptance in the general medical community. In this brief report, we put in perspective the general analytical framework for exploring and validating prognostic factors using data from large databases.

Keywords: Biostatistics, multivariable analysis, univariable analysis


Biostatistics refers to the application of statistical techniques to biologic data collected prospectively and/or retrospectively. Broadly speaking, statistics plays a key role in all phases of a research project, starting at the design stage and continuing through monitoring, data collection, data analysis, and the interpretation of the results in clinical terms. A clear understanding of the statistical approach as it relates to the study hypothesis, reported results, and interpretation is vital for the scientific integrity of the study and the acceptance of its findings in the general medical community. The important role that statistics plays in medical research, the common statistical reporting errors, and ways to avoid those errors are well recognized and widely published [1-3].

In this brief report, we put in perspective the statistical approach used to identify, explore, and validate prognostic factors in data from large databases, using the Chansky et al. [4] article as a reference. The authors applied a systematic approach to explore (and validate) several prognostic factors for overall survival in a surgically managed group of non-small cell lung cancer (NSCLC) patients using data from the large international staging database of the International Association for the Study of Lung Cancer (IASLC) [4]. A similar data-analytic approach was used by Foster et al. [5] and Albain et al. [6] to identify prognostic factors and prognostic subgroups in small cell lung cancer (SCLC) and advanced NSCLC, respectively. Figure 1 illustrates a general outline of the descriptive and analytical steps involved in the data analysis process. We discuss the salient features of each step in this report.

Figure 1
General Data Analysis Framework

Data Gathering and Descriptive Summaries

The first step in a database analysis is defining the selection criteria used to determine the cases (i.e., data elements) to be included in the analysis, to ensure that: 1) there is minimal to no selection bias, and 2) the results arising from the analysis are reproducible, i.e., the selection criteria are not ad hoc. While the selection of cases is largely driven by the scientific question at hand, it also depends on the subject population, the variables to be explored, the adequacy of available follow-up information for the endpoints, and the attributes of the missing data values (i.e., missing by design, missing at random, etc.). Different approaches to handling missing data can be found in the statistical literature [7]. The Chansky et al. [4] article succinctly outlines the strategy used to arrive at the analysis sample, both overall and for the different subgroups (dependent on the missing information for certain covariates). This is a critical step in evaluating the scientific merit of the research study.

Once the analysis data set is identified and set up, the next step is to explore and describe more thoroughly the endpoints and the explanatory (or independent, or prognostic) variables. The outcome (i.e., endpoint) variables typically fall into one of three classes:

  1. categorical, which is further classified into: a) binary or two categories (for example, limited vs. extensive stage disease; male vs. female etc.), b) nominal or multiple categories with no specific order (for example, blood group type: A, B, O, AB etc.), and c) ordinal or multiple ordered categories (for example, performance status: 0 vs. 1 vs. 2; Likert-type scales: strongly agree, agree, neutral, disagree), or
  2. continuous (for example, age, white blood cell counts etc.), or
  3. time-to-event (for example, overall survival, time to any recurrence etc.).

The definition of the endpoint, the percentage of data completeness, and the adequacy of follow-up information are all critical elements that help assess the accuracy and interpretability of the results as they relate to the endpoint. For example, in the case of time-to-event endpoints, the following issues need careful attention: 1) the starting point (date of diagnosis, date of randomization, etc.) and the ending point (date of death, date of recurrence, date of last follow-up) used to define the specific endpoint; 2) information regarding the uniformity (and length) of follow-up (for example, all patients are followed for a minimum of 2 years); 3) the loss-to-follow-up/drop-out rates (censoring) and the event rate, for assessing data completeness and determining the number of covariates that can be explored; and 4) the presence of competing risks (for example, both death and distant recurrence are competing events when the event of interest is local recurrence) and/or recurrent events (for example, recurrent heart attacks in a coronary heart disease study).
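To make concrete how censoring enters a time-to-event summary, the sketch below implements the product-limit (Kaplan-Meier) survival estimate in plain Python. The function name and the six-patient sample are hypothetical, purely for illustration, and not drawn from the IASLC database.

```python
# Minimal Kaplan-Meier (product-limit) sketch on hypothetical data.
# Each subject is a (time, event) pair: event=1 is an observed death,
# event=0 is a censored observation (lost to follow-up / still alive).
def kaplan_meier(subjects):
    """Return [(time, estimated survival probability)] at each event time."""
    subjects = sorted(subjects)
    n_at_risk = len(subjects)
    surv = 1.0
    curve = []
    i = 0
    while i < len(subjects):
        t = subjects[i][0]
        deaths = leaving = 0
        while i < len(subjects) and subjects[i][0] == t:  # handle tied times
            leaving += 1
            deaths += subjects[i][1]
            i += 1
        if deaths:  # the curve only steps down at observed event times
            surv *= 1 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= leaving  # censored subjects silently leave the risk set
    return curve

# Six hypothetical patients, follow-up in months.
sample = [(2, 1), (3, 0), (4, 1), (5, 1), (5, 0), (7, 1)]
for t, s in kaplan_meier(sample):
    print(f"month {t}: S(t) = {s:.3f}")
```

Note how the censored subject at month 3 reduces the number at risk without dropping the curve; this is why the censoring and drop-out rates listed above directly affect how far the estimated curve can be trusted.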

For the explanatory variables, running simple descriptive and graphical summaries to identify obvious deviations, such as outliers, sparse data within certain groups, and/or questionable data points, is usually recommended. At this stage, decisions regarding collapsing categories and/or categorizing continuous covariates are explored; these are typically based on 1) the distribution of the data, 2) the underlying biologic or clinical rationale, and 3) ease of interpretation. However, the rationale for categorizing a continuous covariate and its impact on the model estimates are often not adequately explored, explained, or reported in a manuscript. Several data-driven techniques (such as splitting at the mean or median) and outcome-oriented techniques (such as the minimum p-value approach and two-fold cross-validation) to identify optimal cutpoint(s) for categorizing a continuous covariate are published in the literature [8-10]. The pitfalls associated with categorizing a continuous covariate, especially the potential loss of information and incorrect assumptions about the distribution of the data after categorization, are well documented [11, 12]. The most commonly used approach is a data-driven one backed by a sound clinical rationale, as done in the Chansky et al. [4] article for the categorization of age. The authors also discuss the impact (which was minimal) of this categorization on the model assumptions and estimates.
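The minimum p-value idea can be sketched directly: scan each candidate cutpoint, split the cohort in two, and keep the cut that maximizes the separation between the two survival experiences (a larger log-rank statistic corresponds to a smaller p-value). The ages and follow-up times below are hypothetical, and the repeated testing this loop performs is exactly the source of the optimism that references [11, 12] warn about.

```python
# Hypothetical sketch of the minimum p-value cutpoint search for age.
# Outcomes are (time, event) pairs; event=1 is an observed death.

def logrank_stat(group_a, group_b):
    """Two-sample log-rank chi-square statistic (larger = smaller p-value)."""
    event_times = sorted({t for t, e in group_a + group_b if e == 1})
    obs_a = exp_a = var = 0.0
    for t in event_times:
        n_a = sum(1 for ti, _ in group_a if ti >= t)   # at risk in group A
        n_b = sum(1 for ti, _ in group_b if ti >= t)
        d_a = sum(1 for ti, e in group_a if ti == t and e == 1)
        d_b = sum(1 for ti, e in group_b if ti == t and e == 1)
        n, d = n_a + n_b, d_a + d_b
        if n < 2:
            continue
        obs_a += d_a
        exp_a += d * n_a / n                           # expected deaths in A
        var += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    return (obs_a - exp_a) ** 2 / var if var > 0 else 0.0

def best_cutpoint(ages, outcomes):
    """Scan candidate cutpoints; return (cut, statistic) with best separation."""
    best_cut, best_stat = None, 0.0
    for cut in sorted(set(ages))[1:]:
        young = [o for a, o in zip(ages, outcomes) if a < cut]
        old = [o for a, o in zip(ages, outcomes) if a >= cut]
        if min(len(young), len(old)) < 2:              # skip tiny groups
            continue
        stat = logrank_stat(young, old)
        if stat > best_stat:
            best_cut, best_stat = cut, stat
    return best_cut, best_stat

ages = [50, 52, 55, 58, 62, 66, 70, 75]
outcomes = [(30, 1), (28, 0), (35, 1), (32, 1), (10, 1), (8, 1), (12, 0), (9, 1)]
print(best_cutpoint(ages, outcomes))  # younger patients survive longer here
```

Because the cut that maximizes the statistic was chosen after many looks at the data, the corresponding p-value is overstated unless corrected, which is why a clinically motivated cut (as in [4]) is usually preferred.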

Univariable versus Multivariable Analysis

The next and the most critical step is the analytical aspect that helps to draw conclusions from the data. The general analytical approach typically includes the following four elements:

  1. Definition of the training and validation data sets, if applicable;
  2. Detailed description of all statistical procedures utilized to address the research questions, including the testing and model building framework;
  3. Clearly describing the pathway from univariable to multivariable analyses;
  4. Setting thresholds for declaring statistical/clinical significance a priori for the main effects, interactions, and subgroup analyses, keeping in mind the multiple comparisons issue [13].

The use of a training (i.e., developmental) data set and a validation (i.e., test) data set is critical to the discovery and validation process, in order to gauge the predictive accuracy and performance of the original analysis in practice [14]. Two common approaches include: 1) cross-validation methods, which refer to repeated partitioning of one large data set into training and test sets, and 2) the use of two independent data sets (with similar data attributes), one utilized only for development and the other exclusively for validation [14]. Chansky et al. [4] used the latter approach: they developed the prognostic subgroups using the IASLC database and validated them on the SEER database. Using the same data both to develop and to validate the findings from an analysis is fundamentally flawed, as it provides no protection against false-positive findings that arise from testing hypotheses originally suggested by the data itself.
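The partitioning step behind cross-validation is simple to sketch. The helper below (a hypothetical pure-Python utility, not from any of the cited analyses) shuffles the record indices and deals them into k disjoint folds; each fold serves once as the test set while the remaining folds form the training set.

```python
import random

def k_fold_indices(n_records, k, seed=0):
    """Deal indices 0..n_records-1 into k disjoint, near-equal folds."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)   # fixed seed: reproducible split
    return [indices[fold::k] for fold in range(k)]

folds = k_fold_indices(10, 5)
for test_fold in folds:
    train = [i for i in range(10) if i not in test_fold]
    # fit the model on `train`, then evaluate it on `test_fold`
    print(sorted(test_fold), "held out;", len(train), "records for training")
```

Every record is held out exactly once, so no record is ever used to both fit and evaluate the same model, which is precisely the protection missing when one data set serves for both development and validation.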

The statistical techniques used to analyze the data are closely tied to the nature of the data and the research hypothesis. A univariable analysis assesses the strength of association between an explanatory variable and the outcome of interest by itself, without adjustment for other variables. This helps gauge the ability of the covariate to influence the outcome on its own. In the Chansky et al. [4] article, the authors used the Kaplan-Meier method and Cox proportional hazards (PH) regression models for the univariable analyses. It is well understood in the statistical and clinical literature that the impact (effect size and significance) of an explanatory variable on the outcome when explored by itself can change when it is explored in the presence of other explanatory variables. For example, Foster et al. [5] report the effect of performance status (PS) on overall survival in univariable and multivariable analyses in limited stage SCLC. While PS was not a significant predictor of overall survival in the univariable analysis, it was borderline significant in a multivariable setting adjusting for other covariates. Thus, it is important to assess the possible independent effect of a covariate on the outcome in a multivariable model.

The trajectory from a univariable to a multivariable model can follow numerous paths, including: exploring only those covariates that are statistically and/or clinically significant in the univariable analysis [14]; including all previously known important covariates as the base model and then building a full model by adding new covariates in a stepwise fashion [5]; exploring all covariates from the univariable analysis in a multivariable analysis regardless of their univariable significance [4]; or building a multivariable model from the pool of all covariates through a selection technique (forward selection, backward elimination, or stepwise selection) [6]. The threshold for statistical significance in these analyses is usually determined by the number of models and factors explored, the sample size / number of events, and the clinical relevance.
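Forward selection, for instance, greedily admits one covariate at a time. In the sketch below, the `fit` table of scores is a made-up stand-in for a model-fit criterion (e.g., a likelihood-based score) that a real analysis would obtain by refitting a Cox model for each candidate set; the covariate names and the entry threshold are likewise illustrative.

```python
# Forward selection sketch with hypothetical fit scores (higher = better),
# standing in for model fits that a real analysis would compute.
fit = {
    frozenset(): 0.00,
    frozenset({"stage"}): 0.40, frozenset({"ps"}): 0.30,
    frozenset({"age"}): 0.10, frozenset({"gender"}): 0.05,
    frozenset({"stage", "ps"}): 0.55, frozenset({"stage", "age"}): 0.43,
    frozenset({"stage", "gender"}): 0.41,
    frozenset({"stage", "ps", "age"}): 0.57,
    frozenset({"stage", "ps", "gender"}): 0.56,
}

def forward_select(candidates, score, min_gain=0.05):
    """Greedily add the covariate with the largest score gain, stopping when
    no remaining candidate improves the fit by at least `min_gain`."""
    selected = []
    while True:
        remaining = [c for c in candidates if c not in selected]
        if not remaining:
            return selected
        gain, best = max(
            (score(selected + [c]) - score(selected), c) for c in remaining
        )
        if gain < min_gain:
            return selected
        selected.append(best)

model = forward_select(["stage", "ps", "age", "gender"],
                       lambda names: fit[frozenset(names)])
print(model)  # stage enters first, then ps; age and gender add too little
```

Backward elimination and stepwise selection differ only in the direction of the greedy search; all three share the multiple-testing caveat noted above.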

A multivariable model also includes the exploration of two-way interaction effects, to assess whether the effect of a variable on the outcome differs depending on the level of another variable. Specifically, two variables are said to interact if a particular combination of the variables leads to results that would not be anticipated on the basis of their main effects. For example, Foster et al. [5] describe an interaction effect between PS and gender for patients with extensive stage SCLC, where the effect of PS on survival outcomes differed by gender: while PS had no impact on outcomes among women, among men patients with PS 0 had the most favorable prognosis, followed by PS 1 and then PS 2 patients. Similarly, a small but statistically significant interaction effect between gender and histology was reported by Chansky et al. [4].
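A small numeric sketch makes the notion of non-additivity concrete. The median survival values below are invented solely to mimic the pattern described in the text (PS separates outcomes in men but not in women); they are not the Foster et al. estimates.

```python
# Hypothetical median overall survival (months) by gender and performance
# status (PS). These numbers are made up for illustration only.
median_os = {
    ("male", 0): 14, ("male", 1): 10, ("male", 2): 6,
    ("female", 0): 12, ("female", 1): 12, ("female", 2): 11,
}

def ps_spread(gender):
    """Range of median survival across PS 0/1/2 within one gender."""
    medians = [median_os[(gender, ps)] for ps in (0, 1, 2)]
    return max(medians) - min(medians)

# Under a purely additive (no-interaction) model the PS effect would show
# the same spread in both genders; the imbalance below is the interaction.
print("PS spread, men:", ps_spread("male"), "months")
print("PS spread, women:", ps_spread("female"), "months")
```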

In addition to the multivariable Cox PH regression method for assessing independent covariate effects and interaction effects, recursive partitioning and amalgamation (RPA) analyses can be used to generate tree-based models that identify prognostic subgroups with similar survival outcomes [4-6, 15]. An RPA analysis uses log-rank tests to determine the best split of the data at each point, forming the terminal groupings of the tree, and a bootstrap resampling approach (to account for over-fitting and the adaptive nature of the algorithm) to find the optimal subtree [15]. Unlike multivariable Cox models, which use only cases with complete data on all variables explored, RPA uses all available data on each individual covariate when determining the splits. Moreover, identifying prognostic subgroups and exploring interactions between covariates is relatively straightforward with RPA (as that is its primary intent), unlike Cox models, where the final model estimates must be plugged into a regression equation (which can be quite complex) to derive prognostic subgroups. While an RPA analysis can be used to confirm the results of a Cox model (in terms of which variables are significant), it is more often used to supplement Cox models to readily identify prognostic subgroups.
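A single partitioning step can be sketched as follows: scan every covariate/cutpoint pair and split the node where the two resulting groups differ most by the log-rank statistic. The records, covariate names, and helper functions are hypothetical; the full RPA algorithm additionally recurses on each side and prunes the tree via bootstrap resampling [15], which is omitted here.

```python
# Sketch of one recursive-partitioning (RPA) split on hypothetical records.

def logrank_stat(group_a, group_b):
    """Two-sample log-rank chi-square statistic for (time, event) pairs."""
    event_times = sorted({t for t, e in group_a + group_b if e == 1})
    obs_a = exp_a = var = 0.0
    for t in event_times:
        n_a = sum(1 for ti, _ in group_a if ti >= t)
        n_b = sum(1 for ti, _ in group_b if ti >= t)
        d_a = sum(1 for ti, e in group_a if ti == t and e == 1)
        d_b = sum(1 for ti, e in group_b if ti == t and e == 1)
        n, d = n_a + n_b, d_a + d_b
        if n < 2:
            continue
        obs_a += d_a
        exp_a += d * n_a / n
        var += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    return (obs_a - exp_a) ** 2 / var if var > 0 else 0.0

def best_split(records, covariates):
    """records: (covariate_dict, time, event) triples.
    Return (covariate, cutpoint, statistic) for the strongest split."""
    best = (None, None, 0.0)
    for cov in covariates:
        for cut in sorted({r[0][cov] for r in records})[1:]:
            left = [(t, e) for c, t, e in records if c[cov] < cut]
            right = [(t, e) for c, t, e in records if c[cov] >= cut]
            if min(len(left), len(right)) < 2:
                continue
            stat = logrank_stat(left, right)
            if stat > best[2]:
                best = (cov, cut, stat)
    return best

# Eight hypothetical patients: stage drives survival, age is mostly noise.
patients = [
    ({"stage": 1, "age": 55}, 40, 1), ({"stage": 1, "age": 70}, 38, 0),
    ({"stage": 1, "age": 61}, 45, 1), ({"stage": 2, "age": 60}, 30, 1),
    ({"stage": 2, "age": 52}, 28, 1), ({"stage": 3, "age": 65}, 10, 1),
    ({"stage": 3, "age": 58}, 8, 1), ({"stage": 3, "age": 72}, 12, 1),
]
print(best_split(patients, ["stage", "age"]))  # splits on stage
```

Repeating this step within each resulting node, then amalgamating terminal nodes with similar survival, yields the prognostic subgroups reported in [4-6].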


Statistical analysis refers to a collection of methods used to process large amounts of data and report overall trends. It includes the collection, examination, summarization, manipulation, and interpretation of quantitative data to discover underlying causes, patterns, relationships, and trends. A clear understanding of the aims of the research project and of the research data is key to using the right statistical tools, which in turn is essential for the accurate interpretation of the results. In this report, we provide a general outline of both the descriptive and the analytical framework of statistics, using real examples to illustrate how, when, and what these analyses mean in clinical terms. This is essential for the scientific integrity of the research study and its acceptance in the general medical community.


Supported in part by the National Cancer Institute Grants: Mayo Clinic Cancer Center (CA-15083), and the North Central Cancer Treatment Group (CA-25224).


1. Sprent P. Statistics in medical research. Swiss Med Wkly. 2003;133(3940):522–9. [PubMed]
2. Lang T. Twenty statistical errors even you can find in biomedical research articles. Croat Med J. 2004;45(4):361–70. [PubMed]
3. Zinsmeister AR, Connor JT. Ten common statistical errors and how to avoid them. Am J Gastroenterol. 2008;103(2):262–6. [PubMed]
4. Chansky K, Sculier JP, Crowley JJ, et al. The International Association for the Study of Lung Cancer Staging Project: prognostic factors and pathologic TNM stage in surgically managed non-small cell lung cancer. J Thorac Oncol. 2009;4(7):792–801. [PubMed]
5. Foster NR, Mandrekar SJ, Schild SE, et al. Prognostic factors differ by tumor stage for small cell lung cancer: a pooled analysis of North Central Cancer Treatment Group trials. Cancer. 2009;115(12):2721–31. [PMC free article] [PubMed]
6. Albain KS, Crowley JJ, LeBlanc M, et al. Survival determinants in extensive-stage non-small-cell lung cancer: the Southwest Oncology Group experience. J Clin Oncol. 1991;9(9):1618–26. [PubMed]
7. Little RJA, Rubin DB. Statistical Analysis with Missing Data. Wiley Series in Probability and Statistics. John Wiley; New York: 2002.
8. Contal C, O'Quigley J. An application of changepoint methods in studying the effect of age on survival in breast cancer. Computational Statistics and Data Analysis. 1999;30:253–270.
9. Mandrekar JN, Mandrekar SJ, Cha SS. Cutpoint determination methods in survival analysis using SAS®. Proceedings of the 28th SAS Users Group International Conference (SUGI 28); 2003. Paper 261-28.
10. Mazumdar M, Glassman JR. Categorizing a prognostic variable: review of methods, code for easy implementation and applications to decision-making about cancer treatments. Statistics in Medicine. 2000;19:113–132. [PubMed]
11. Altman DG, Lausen B, Sauerbrei W, et al. Dangers of using “optimal” cutpoints in the evaluation of prognostic factors. Journal of the National Cancer Institute. 1994;86:829–835. [PubMed]
12. Abdolell M, LeBlanc M, Stephens D, et al. Binary partitioning for continuous longitudinal data: categorizing a prognostic variable. Statistics in Medicine. 2002;21:3395–3409. [PubMed]
13. Hsu JC. Multiple comparisons - theory and methods. Chapman & Hall; London: 1996.
14. Harrell FE Jr. Regression Modeling Strategies: with applications to linear models, logistic regression, and survival analysis. Springer-Verlag; New York: 2001.
15. LeBlanc M, Crowley J. Survival trees by goodness of split. Journal of the American Statistical Association. 1993;88:457–467.