In this study, data from 637 female patients (mean age 59.5 years) were analyzed. The patients were the same as those in the study by Sundquist et al. [8]. The aim of that study was to assess the applicability of histopathological grading as a prognostic index in a defined breast cancer population. Only patients without any sign of distant metastasis at the time of surgery were included in the analysis. Tumors with an invasive component of 2 millimetres or less in diameter were excluded because their small size did not permit proper grading in accordance with the protocol.
To obtain a more comprehensive dataset for this study, patient data were retrieved from three different registers: the regional breast cancer register, the tumor marker register and the cause of death register.
Predictors and outcomes
Two sets of variables, predictors and outcomes (see Table ), were selected in consultation with oncologists and from the literature in the domain, in order to answer questions such as: "Which variables might be important predictors of breast cancer recurrence? Is the time interval after diagnosis important? Is there a way to determine the importance of each predictor when there is more than one type of recurrence?"
List of variables in both sets
The age of the patient and variables describing the tumor, based on pathology reports, physical examination and tumor markers, were selected as predictors. The two variables in the outcome set, distant metastasis and loco-regional recurrence, were observed at different time intervals after diagnosis.
After information on the selected predictors and outcomes (Table ) had been retrieved from the registers for the 637 patients, the raw data were transformed and recoded as illustrated in Table .
Transformation rules and the study population characteristics
For some variables such as LN involvement, periglandular growth and multiple tumors, dichotomization was done based on their presence or absence in the patients. Other variables such as tumor location, side, Nottingham Histologic Grade and DNA ploidy were already categorical. The remaining variables, i.e. age, tumor size, estrogen and progesterone receptors, S-phase fraction and DNA index, were transformed from continuous to dichotomous variables (Table ).
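The recoding of a continuous variable into a dichotomous one can be sketched as follows. This is a minimal illustration only: the helper `dichotomize`, the example ages and the cutoff of 50 years are hypothetical, since the actual transformation rules are those given in the table.

```python
import numpy as np

def dichotomize(values, cutoff):
    """Code values >= cutoff as 1 and values < cutoff as 0; missing values stay missing."""
    values = np.asarray(values, dtype=float)
    with np.errstate(invalid="ignore"):       # comparisons with NaN are handled below
        coded = (values >= cutoff).astype(float)
    coded[np.isnan(values)] = np.nan          # keep missing entries for later imputation
    return coded

# Hypothetical ages and a hypothetical cutoff, for illustration only.
age = np.array([45.0, 62.0, np.nan, 50.0])
age_coded = dichotomize(age, cutoff=50.0)
print(age_coded)
```

Missing entries are deliberately left as missing here so that they can be handled in a separate imputation step.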
Missing values were substituted using the expectation-maximization (EM) algorithm [9]. EM is an iterative optimization algorithm for parameter estimation within the general framework of maximum likelihood estimation.
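To make the iterative nature of EM concrete, the sketch below imputes missing values under the assumption that the data follow a multivariate normal distribution. It is a generic textbook version and not the SPSS routine used in this study; the function name `em_impute`, the simulated data and the 10% missingness rate are assumptions made for illustration.

```python
import numpy as np

def em_impute(X, n_iter=50, tol=1e-6):
    """EM estimate of mean and covariance with NaN entries; returns the completed data too."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    missing = np.isnan(X)
    mu = np.nanmean(X, axis=0)                        # start from observed column means
    filled = np.where(missing, mu, X)
    sigma = np.atleast_2d(np.cov(filled, rowvar=False, bias=True))
    for _ in range(n_iter):
        correction = np.zeros((p, p))                 # summed conditional covariances
        for i in range(n):
            m = missing[i]
            if not m.any():
                continue
            o = ~m
            if not o.any():                           # row with nothing observed
                filled[i] = mu
                correction += sigma
                continue
            # E-step: conditional mean of the missing part given the observed part.
            coef = sigma[np.ix_(m, o)] @ np.linalg.pinv(sigma[np.ix_(o, o)])
            filled[i, m] = mu[m] + coef @ (X[i, o] - mu[o])
            correction[np.ix_(m, m)] += sigma[np.ix_(m, m)] - coef @ sigma[np.ix_(o, m)]
        # M-step: update the parameters from the completed sufficient statistics.
        new_mu = filled.mean(axis=0)
        centered = filled - new_mu
        new_sigma = (centered.T @ centered + correction) / n
        converged = (np.max(np.abs(new_mu - mu)) < tol
                     and np.max(np.abs(new_sigma - sigma)) < tol)
        mu, sigma = new_mu, new_sigma
        if converged:
            break
    return filled, mu, sigma

# Simulated example data (not the study data).
rng = np.random.default_rng(0)
true_cov = [[1.0, 0.8, 0.5], [0.8, 1.0, 0.3], [0.5, 0.3, 1.0]]
data = rng.multivariate_normal([0.0, 0.0, 0.0], true_cov, size=200)
data_missing = data.copy()
data_missing[rng.random(data.shape) < 0.1] = np.nan
completed, mu_hat, sigma_hat = em_impute(data_missing)
```

Observed values are never altered; only the missing entries are replaced by their conditional expectations under the current parameter estimates.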
Canonical Correlation Analysis
Because the outcome set consists of several variables, canonical correlation analysis (CCA), a technique for analyzing the relationship between two sets of variables, was performed.
The fundamental principle behind CCA is the creation of a number of canonical solutions [5], each consisting of a linear combination of one set of variables, which has the form:
$$U_i = a_1(\mathrm{predictor}_1) + a_2(\mathrm{predictor}_2) + \dots + a_m(\mathrm{predictor}_m)$$
and a linear combination of the other set of variables, which has the form:
$$V_i = b_1(\mathrm{outcome}_1) + b_2(\mathrm{outcome}_2) + \dots + b_n(\mathrm{outcome}_n)$$
The goal is to determine the coefficients (the $a$'s and $b$'s) that maximize the correlation between the canonical variates $U_i$ and $V_i$. The number of solutions is equal to the number of variables in the smaller set. The first canonical correlation is the highest possible correlation between any linear combination of the variables in the predictor set and any linear combination of the variables in the outcome set.
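The computation described above can be sketched with a standard QR and singular value decomposition approach. This is an illustrative NumPy version under the assumption that both data matrices have full column rank; it is not the CANCORR macro used in the study, and the function name `cca` and the simulated data are hypothetical.

```python
import numpy as np

def cca(X, Y):
    """Canonical correlations and coefficient matrices for two sets of variables."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, Rx = np.linalg.qr(Xc)                 # orthonormal basis for each centered set
    Qy, Ry = np.linalg.qr(Yc)
    U_, s, Vt = np.linalg.svd(Qx.T @ Qy, full_matrices=False)
    a = np.linalg.solve(Rx, U_)               # predictor coefficients, one column per solution
    b = np.linalg.solve(Ry, Vt.T)             # outcome coefficients, one column per solution
    return np.clip(s, 0.0, 1.0), a, b

# Simulated data: five predictors, two outcomes, only the first outcome depends on a predictor.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
Y = np.column_stack([X[:, 0] + rng.normal(size=300), rng.normal(size=300)])
rho, a, b = cca(X, Y)

U1 = X @ a[:, 0]                              # first canonical variate of the predictor set
V1 = Y @ b[:, 0]                              # first canonical variate of the outcome set
print(rho, np.corrcoef(U1, V1)[0, 1])
```

The number of returned correlations equals the size of the smaller set (two here), and they come out sorted from largest to smallest, so the first pair of variates carries the first canonical correlation.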
One way of interpreting the canonical solutions is to look at the correlations between the canonical variates and the variables in each set. These correlations are called structure coefficients or loadings. The logic is that variables that are highly correlated with a canonical variate have more in common with it and should be given more weight when deriving a meaningful interpretation of that variate. This way of interpreting canonical variates is identical to the interpretation of factors in factor analysis [10]. Structure coefficients (loadings) are therefore the criterion for choosing the important variables in each canonical variate. As a rule of thumb for meaningful loadings, an absolute value equal to or greater than 0.3 is often used [11].
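Structure coefficients can be computed as plain correlations between each original variable and each canonical variate, as in the sketch below. The helper names `canonical_variates` and `loadings` and the simulated data are assumptions for illustration; only the 0.3 rule of thumb comes from the text.

```python
import numpy as np

def canonical_variates(X, Y):
    """Canonical variates of both sets (columns) and the canonical correlations."""
    Qx, _ = np.linalg.qr(X - X.mean(axis=0))
    Qy, _ = np.linalg.qr(Y - Y.mean(axis=0))
    U_, s, Vt = np.linalg.svd(Qx.T @ Qy, full_matrices=False)
    return Qx @ U_, Qy @ Vt.T, s

def loadings(variables, variates):
    """Correlation of every variable (rows) with every canonical variate (columns)."""
    Z = (variables - variables.mean(axis=0)) / variables.std(axis=0)
    W = (variates - variates.mean(axis=0)) / variates.std(axis=0)
    return Z.T @ W / len(variables)

# Simulated data: the first predictor drives the first outcome.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
Y = np.column_stack([X[:, 0] + 0.5 * rng.normal(size=300), rng.normal(size=300)])

U, V, rho = canonical_variates(X, Y)
pred_loadings = loadings(X, U)                # predictors against predictor-side variates
out_loadings = loadings(Y, V)                 # outcomes against outcome-side variates
important = np.abs(pred_loadings) >= 0.3      # rule-of-thumb threshold from the text
print(np.round(pred_loadings[:, 0], 2), important[:, 0])
```

The sign of a canonical variate is arbitrary, so the threshold is applied to the absolute value of each loading.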
The significance of the canonical correlations was tested with randomization tests, and the robustness of the loading estimates was assessed with bootstrapping [13].
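Both resampling procedures can be sketched for the first canonical solution as follows. The exact schemes used in the study are not described here, so the details below are assumptions: rows of the outcome set are permuted for the randomization test, cases are resampled with replacement for the bootstrap, bootstrap loadings are sign-aligned with the original estimate, and a percentile interval is reported. All function names and the simulated data are hypothetical.

```python
import numpy as np

def first_solution(X, Y):
    """First canonical correlation and predictor loadings on the first canonical variate."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    U_, s, _ = np.linalg.svd(Qx.T @ Qy, full_matrices=False)
    u1 = Qx @ U_[:, 0]                                   # zero mean, unit length
    load = (Xc / np.linalg.norm(Xc, axis=0)).T @ u1      # correlations with u1
    return s[0], load

def randomization_p_value(X, Y, n_perm=500, seed=0):
    """Share of row permutations of Y whose first correlation reaches the observed one."""
    rng = np.random.default_rng(seed)
    observed, _ = first_solution(X, Y)
    exceed = 0
    for _ in range(n_perm):
        r, _ = first_solution(X, Y[rng.permutation(len(Y))])
        exceed += int(r >= observed)
    return (exceed + 1) / (n_perm + 1)

def bootstrap_loading_interval(X, Y, n_boot=500, seed=0):
    """95% percentile interval for each predictor loading on the first variate."""
    rng = np.random.default_rng(seed)
    _, reference = first_solution(X, Y)
    draws = np.empty((n_boot, X.shape[1]))
    for k in range(n_boot):
        idx = rng.integers(0, len(X), size=len(X))       # resample cases with replacement
        _, load = first_solution(X[idx], Y[idx])
        draws[k] = load if load @ reference >= 0 else -load   # undo arbitrary sign flips
    return np.percentile(draws, [2.5, 97.5], axis=0)

# Simulated data: the first predictor drives the first outcome.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
Y = np.column_stack([X[:, 0] + rng.normal(size=200), rng.normal(size=200)])

p_value = randomization_p_value(X, Y)
interval = bootstrap_loading_interval(X, Y)
print(p_value, np.round(interval, 2))
```

A loading whose bootstrap interval stays clear of the 0.3 threshold in absolute value can be regarded as a stable estimate, whereas an interval that straddles the threshold suggests the variable's importance is sensitive to the sample.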
SPSS version 11 [14] was used for data transformation and for replacing missing values. CCA was run with the CANCORR macro, a part of the Advanced Statistics module of SPSS. Tests of significance for the canonical correlations and bootstrapping were done using MATLAB version 6.5 [15].