Subject population
A case control study was conducted using biospecimens and data from the Ontario Familial Breast Cancer Registry (OFBCR), a participating site in the NIH-funded Breast Cancer Family Registry [
8]. Written informed consent was obtained from all subjects, and the study protocol was approved by the Mount Sinai Hospital Research Ethics Board. Cases of invasive breast cancer, pathologically confirmed and diagnosed between 1996 and 1998 in the province of Ontario, were identified from the population-based Ontario Cancer Registry. All female cases under the age of 55 years were identified as well as all males under the age of 80 years. A random sample (35%) of female cases aged 55 to 69 was also selected. Physician permission to contact patients was granted for 91% of the cases (7,668 of 8,453). Patients were then mailed a cancer family history questionnaire and 65% (4,957) of them completed it. All respondents who met a defined set of genetic risk criteria (that is, Ashkenazi Jewish; diagnosed before age 36 years; previous ovarian or breast diagnosis; one or more first- or two or more second-degree relatives with breast or ovarian cancer; one or more second- or third-degree relatives with either breast cancer diagnosed before age 36 years, ovarian cancer diagnosed before age 61 years, multiple breast or breast and ovarian primaries, or male breast cancer; three or more first-degree relatives with any combination of breast, ovarian, colon, prostate, or pancreatic cancer or sarcoma, with at least one diagnosis before age 51 years) were included in the study [
9] and a random sample of 25% of those not meeting these criteria were selected to continue to participate in the OFBCR (n = 2,580). This participation included providing a blood sample, which was provided by 62% of all eligible participants (n = 1,601). For the current study, we restricted the sample to women who identified themselves as Caucasian and were less than 55 years old. The study was restricted to the younger women because they are more likely to have genetic causes. Women in this age group might also be more homogeneous regarding their breast cancer risk factors. As we had randomly sampled 25% of those who did not meet genetic risk criteria, we also randomly sampled 25% of those who did meet genetic risk criteria in order to create a more representative sample of cases. Therefore, the selected cases should better represent all cases without enrichment for genetic risk criteria such as family history. A total of 21.6% of cases in the present study had a first-degree family history of breast cancer, which is consistent with the 17–22% frequency reported in cases in a number of large case-control studies [
10-
12]. Of the 459 breast cancer cases with an available blood sample, 398 were successfully genotyped and included in this study after excluding cases with insufficient DNA and those who were not Caucasians.
Controls were identified by calling randomly selected residential telephone numbers from across the province of Ontario and were frequency-matched to all female OFBCR cases by five-year age group. The number of telephone numbers was 14,653, but 1,101 (8%) were invalid and no contact could be made for 841 (6%). Of the 12,711 households contacted, 7,829 (62%) did not have an eligible individual. No information on eligibility was provided for 2,194 (17%) households. Of the 2,688 eligible individuals identified on the telephone, 1,726 (64%) completed the mailed risk factor questionnaire and 75% of these individuals agreed to be contacted about providing a blood sample. Information regarding past history of cancer and family history of cancer was collected on the controls. The 676 women under age 55 who had agreed to be approached about blood sampling were asked to provide a blood sample and 419 (62%) did so. Individuals who were not Caucasian were excluded from the analysis, as were those with insufficient DNA or those subsequently found to be ineligible because of age. The remaining 372 population controls were successfully genotyped for this study.
SNP selection strategy
To estimate the breast cancer risk conferred by individual SNPs, as well SNP-SNP interactions, we studied 19 SNPs from 18 key cancer genes involved in several biological pathways. Candidate SNPs were initially selected using the best available evidence from published studies at the beginning of the project in 2000 and were subsequently classified into three categories (high, medium and low rank). Thus, they include SNPs with a wide range of functional evidence. High-ranking SNPs were supported by studies that demonstrated the effect of the SNP on the regulation of expression or protein function. The medium-rank category is likely to include functionally relevant SNPs, as the substitutions are predicted to significantly affect function, although this was not confirmed experimentally at the time of their selection. This category also includes SNPs that are associated with breast cancer risk factors. The low ranking category, on the other hand, contained SNPs with no functional information. Among the SNPs studied,
XPD-[Lys751Gln],
MTHFR-[Ala222Val],
COMT-[Met108/158Val],
GSTP1-[Ile105Val] and
CCND1-[Pro241Pro] have been shown to alter the function or post-translational modification of their encoded protein [
13-
25].
MMP1-[1G(-1607)2G] and
IL10-[G(-1082)A] have been shown to alter the transcription and expression of these genes [
26-
30].
IL13-[Arg130Gln] has been suggested to have functional consequences, while GSTM3-[4595 (3 bp ins/del)] was predicted to create a YY1 transcription factor binding site [
31,
32].
TNFA-[G(-308)A] forms a haplotype with some nearby SNPs and some studies observed increased haplotype-dependent transcriptional activity change while some others did not [
33-
36].
CYP17-[C518T] and
IL13-[Arg130Gln] were found to be associated with other cancer-related variables, such as serum estrogen and IgE levels, respectively [
37-
39].
BARD1-[Pro24Ser] changes a structurally important non-polar proline residue to a positively charged serine. There were no functional speculations for
ESR1-[Ser10Ser],
ESR1-[Pro325Pro],
PTEN-[(IVS4+109)ins/delACTAA],
IL1A-[Ala114Ser],
G-CSF-[Leu185Leu] and
GADD45-[C(IVS3+168)T. Thus, the 19 SNPs studied represent SNPs with a wide range of functional knowledge and evidence. The estimate of the minor allele frequency for the 19 SNPs studied varies between 15% and 48% in the general population. The SNPs studied were also selected to represent more commonly occurring variants, in order to gain statistical power to detect SNP-SNP interactions.
Molecular genotyping
All SNPs were analyzed by TaqMan 5' nuclease assay [
40] using the 7900 HT Sequence Detection System (ABI, Foster City, CA, U.S.A.). Oligonucleotide primers and the dual labeled allele specific probes were designed using Primer Express version 2.0 (PE Biosystems, Foster City, CA, U.S.A.). Positions of primers and probes and their appropriate accession numbers are given in Additional data file
1. A panel of DNA samples was sequenced for each SNP region initially, in order to identify control genotypes to be used in each experiment. PCRs were performed in 96-well plates (AXYGEN, Union City, CA, U.S.A.), with each plate containing four control samples for each possible genotype. Genomic DNA (10 ng) was amplified in a total volume of 10 μl in the presence of 100 μM of each of the dNTPs, 3 pmoles of each of the appropriate primers, 2 pmoles of each of the corresponding dual labeled probes, and 0.025 units of Platinum Taq DNA Polymerase (Invitrogen, Carlsbad, CA, U.S.A.). PCR cycling conditions consisted of 40 cycles of 94°C for 15 s, 55–60°C for 15 s and 72°C for 15 s. The samples were analyzed by ABI PRISM 7900 HT Sequence Detection System (version 2.0). The optimal MgCl
2 concentrations and annealing temperatures for each SNP are given in Additional data file
2. The results were validated by re-genotyping a randomly selected 10% portion of the total study population [
41].
Statistical analysis
To detect complex interactions, we applied and compared three statistical methods, LRM, CART and MDR. Our main analyses are presented without adjustment for other confounding variables. However, we also conducted several analyses adjusted for age, family history and body mass index (BMI) and our results were quite similar to those presented here. Our dataset included five individuals with missing genotypes, so their missing values were randomly imputed from the empirical distribution of the genotypes in the sample. We also carried out several analyses without these individuals and our results remained unchanged. In all of our analyses, the SNP variables were considered as nominal categorical variables with three categories.
Logistic regression model
LRM is a parametric approach that relates one or more explanatory variables, Xs (for example, SNPs and their interactions), to a dependent or outcome variable, Y (for example, breast cancer status). LRM is a particular case in the family of generalized linear models [
42]. The simple LRM for an individual
i assumes that
Yi takes values 0 or 1 and has a Bernoulli distribution with parameter
Pi, that is, the probability of being a case rather than a control:
Pr(Y
i = 1) =
Pi, and
Pr(Y
i = 0) = 1 -
Pi. It is expressed as:
where
i =
1,..,n, and
p is the number of explanatory variables. The
βs are the regression coefficients that are to be estimated:
βj for the SNP main effects (
xj) and
βjk for the SNP-SNP interaction (
xjxk). Because the genetic risk model is unknown for most of the SNPs we studied, we adopted a codominant model, that is, both rare homozygous and heterozygous variant effects are estimated using two dummy variables for the SNP main effects and four product terms for the two-way interaction effects. Interaction effects were tested using a likelihood ratio test (LRT) statistic with four degrees of freedom for the
χ2 values. We used a stepwise logistic-regression procedure based on forward selection to select the most significant SNP-SNP interactions [
43]. Forward stepwise selection procedures are efficient in assessing interaction effects compared to backward elimination when testing multiple interactions. First, it is more computationally efficient and second, compared to backward elimination where a relatively large number of predictor variables may increase the risk of complete separation of the two outcome groups, it has fewer numerical problems when estimating the model parameters [
44]. We tested only two-way interactions with LRM since the investigation of higher-order interactions could have led to numerical difficulties for estimating the model parameters. Instead of using a stepwise selection procedure, one could perform an exhaustive search and select the most significant interactions based on a permutation test or using the false discovery rate (FDR) principle [
41]. This was not done in this study because this strategy is not directly applicable to CART analyses.
Classification and regression trees
Decision trees date back to the early 1960s with the work of Morgan and Sonquist [
45]. Breiman and colleagues [
46] published the first comprehensive description of recursive partitioning methodology. As a powerful data analysis method, trees are used in many fields, such as epidemiology and medical diagnosis, and provide an alternative to more standard model-based regression techniques for multivariate analyses. The tree introduced here is the
S implementation [
47]. Because our outcome (case or control status) is binary, we used classification trees. Through binary recursive partitioning, a tree successively splits the data along the coordinate axes of the predictors such that, at each division, the resulting two subsets of data are as homogeneous as possible with respect to the response of interest [
46]. At each step in the construction algorithm, an optimal split is identified. This local optimality does not guarantee that the overall optimal tree will be found. Deviance is a natural splitting criterion based on likelihood values. In our analyses, we used the
S defaults: a node must include at least 10 observations and the minimum node deviance before the tree growing stops should be 1% of the root node. The subsets that are not further split are the terminal nodes. The SNP variables were considered as nominal categorical variables with three categories. Figure depicts the first few nodes of a tree built on our data and illustrates how a tree is used for prediction.
Multifactor dimensionality reduction
MDR [
48-
51] is a non-parametric data mining approach that uses constructive induction or attribute construction to reduce two or more SNPs, for example, to a new single variable that is then evaluated using a classifier such as naive Bayes or logistic regression. The rationale behind this method is to identify the multi-locus genotypes that best predict the outcome of interest. It applies data reduction techniques to address problems associated with testing for interactions in high-dimensional space and with generally modest sample sizes. The algorithm works as follows.
First, select all subsets of k explanatory variables (that is, SNPs) among m available and for each subset k, enumerate all possible genotype combinations. These genotype combinations represent the interacting SNPs. For each combination, compute the case-control ratio and partition the multi-locus genotypes into two subgroups labeled as high or low risk (for example, ratio ≥1 or ratio <1). This step reduces a k-dimensional model to one dimension only.
Second, a ten-fold cross-validation (CV) procedure is used to assess the ability of the multi-locus genotype combinations to predict the disease outcome. The sample is divided into a training set (for example, 9/10 of cases and controls) and an independent test set (for example, 1/10 of cases and controls). In each subset of the data, the training set classification error is computed for each genotype combination. This step is repeated in the ten random subsets. The interaction with the lowest classification error (averaged over the ten random subsets) is selected as the 'best interaction'. The whole CV procedure is itself repeated ten times to protect against chance division and all resulting statistics are averages over these ten runs.
Third, all interactions selected through steps 1 and 2 are also evaluated for their CV consistency (number of times the same genotype combination is identified across the ten repetitions of the CV procedure) and testing accuracy (that is, 1 - prediction error).
Fourth, steps 1 through 3 are repeated for different values of k (2, 3, 4, 5 and 6 in our study). The interaction that maximizes the CV consistency and testing accuracy is selected as the final best candidate interaction across all k-multilocus models. In our study, instead of selecting just one best interaction for each k, as proposed originally for this method, we selected the five best interactions based on the same criteria.
Modeling and interpreting SNP-SNP interactions
With LRM, interactions are taken into account by estimating specific parameters for certain genotype combinations. Assuming a codominant effect for each SNP, the LRM is a saturated model with nine parameters coding for the cell-specific odds-ratios ((exp(βj), equation 1). LRM estimates the cell-specific odds-ratio corresponding to each genotype combination. A saturated model tends to fit the observed data but could lead to overfitting and a lack of power. CART suggests interactions by the successive splits occurring in the tree (Figure ). The terminal nodes correspond to either low- or high-risk subgroups and their association with the outcome can be tested. Investigating the tree terminal nodes provides a natural way to identify interaction and characterize the high- or low-risk subgroups. Moreover, CART automatically selects the genetic model that best fits the data. A fundamental difference between CART and LRM is that CART can reveal the structure in the data by creating a split in the proportion of the data only where it is pertinent (for example, among individuals who have the XPD {AA} genotype in Figure ). LRM does not have this feature and can test only an overall effect. MDR, like CART, partitions the data into high- and low-risk subgroups based on the SNP profiles. However, while CART uses recursive partitioning to reveal the interactions, MDR exhaustively searches across all multi-locus genotypes and identifies the one that best predicts the disease outcome. A particular aspect of MDR is that each selected interaction could correspond to different genotype combinations when repeating the CV procedure. For example, Figure displays the four partitions corresponding to the XPD-[Lys751Gln] – IL10-[G(-1082)A] two-locus genotypes when repeating the 10-fold CV procedure ten times.
Strategy to identify critical SNP-SNP interactions associated with breast cancer
All three methods were applied to our data to detect two-way interactions; CART and MDR were also used to investigate higher-order interactions. Because MDR uses a specific strategy to select the important interactions, we tried to develop a similar approach for LRM and CART, which can be summarized by two major steps: first, selection of the most promising SNP-SNP interactions using a CV approach; and second, evaluation of the selected interactions in the whole sample using a distribution-free permutation test.
Step 1
With LRM and CART, we used a two-fold CV approach to select the SNP-SNP interactions, instead of the ten-fold CV of MDR. Our rationale is that a test statistic is used to select the best interactions with LRM and CART, and splitting the original sample into two ensures enough power to perform this test. The idea is to randomly divide the original dataset into training and test sets, each with the same proportion of cases and controls. A variable selection of the important SNP-SNP interactions is carried out in the training set and their significance is evaluated independently in the test set. With LRM, we used forward stepwise selection with the training set and then tested each selected interaction using a LRT statistic in the test set. With CART, we built a tree on the training set, and pruned it to a smaller tree using the deviance criteria in order to keep a maximum of ten terminal nodes. We then applied the pruned tree on the test set and calculated a chi-square statistic for each terminal node. To protect against chance divisions of the data, the CV is repeated ten times and the test statistic for each selected interaction in the test set is averaged over the ten random test sets. The five best two-way interactions (terminal nodes with the larger test statistic) are then selected. With CART and MDR, we also selected the top five k-way interactions with k = 3,4,5,6.
Step 2
We then evaluated each target (that is, selected interaction) in the whole sample using a permutation test statistic. With LRM, we compared a model with and without the selected interaction using the original dataset (main effects are also in the model) and obtained the chi-square statistic from the LRT (each test had four degrees of freedom). With CART, we collected the chi-square statistic for each terminal node and computed the odds-ratio associated with the corresponding subgroup on the original dataset (see Figure for an example). With MDR, we collected the testing accuracy for each target, averaged across the 100 random subsets (that is 10-fold CV repeated 10 times). To find the null distribution of the test statistics, we randomly permuted the response variable 1,000 times. The P-value associated with each target is then computed by comparing the observed test statistic to its empirical null distribution. The null hypothesis was rejected when the upper-tail Monte-Carlo P-value derived from the permutation test is less than 5%.
Software
We used the Splus package 'tree' (Insightful Corporation, Seattle, WA) and some related Splus functions for the CART analyses and the R package 'glm' [
52] for the LRM analyses. MDR analyses were carryout with the program MDR v.1.4.1 for Unix [
48-
51].