A genetic association study was conducted to identify genomic variations with predictive power for NHL prognosis (Zhang et al.). Here, ‘prognosis’ refers only to disease-free survival. Overall survival and other types of survival may have different patterns and different genomic bases, and should be studied separately. Details on the prognostic cohort, DNA extraction and genotyping are provided in the Supplementary Material.
In this study, a candidate gene approach was adopted. A total of 1462 tag SNPs from 210 genes involved in immune response pathways were selected. In addition, 302 SNPs in 143 candidate genes previously genotyped by TaqMan assay (Lan et al.) were included. In total, there are 1764 SNPs representing 333 genes. The number of SNPs per gene ranges from 1 to 45, with a median of 3 and a mean of 5.3.
The prognostic cohort is composed of 546 NHL patients, of whom 469 have SNP genotypes available for downstream analysis. Among those 469 patients, 150 have DLBCL, 112 have FL, 51 have CLL/SLL, 32 have MZBL, 35 have T-cell lymphoma and 89 have other subtypes. Different subtypes of lymphoma have significantly different prognosis patterns and different genomic bases. We focus on DLBCL and FL because only these two subtypes have sufficiently large samples.
We remove SNPs with >20% missing measurements and patients with >20% of their SNPs missing. The missingness is caused by technical reasons, so removing these SNPs and patients should not bias the results. For DLBCL, 133 patients pass the screening, with genotypes available for 1730 SNPs representing 331 genes. For FL, 97 patients pass the screening, with genotypes available for 1729 SNPs representing 331 genes. For the remaining SNPs with missing genotypes, we impute the missing values using a 10-nearest-neighbors approach.
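The screening and imputation steps can be sketched as follows. This is a minimal illustration and not the code used in the study. It assumes genotypes are coded 0/1/2 with `None` for a missing call, that SNPs are filtered before patients, that neighbors are patients (rather than SNPs), and that distances are root mean squared differences over jointly observed SNPs. The text does not specify any of these choices, and all function names are hypothetical.

```python
import math

MISSING = None  # marker for an ungenotyped call (assumption)


def missing_rate(values):
    """Fraction of entries in a sequence that are missing."""
    return sum(v is MISSING for v in values) / len(values)


def filter_missing(geno, max_rate=0.20):
    """Drop SNP columns, then patient rows, with more than max_rate missing.

    geno is a list of patient rows; each row is a list of genotype codes.
    Filtering SNPs first and patients second is an assumption.
    """
    n_snps = len(geno[0])
    keep_snps = [j for j in range(n_snps)
                 if missing_rate([row[j] for row in geno]) <= max_rate]
    reduced = [[row[j] for j in keep_snps] for row in geno]
    keep_patients = [i for i, row in enumerate(reduced)
                     if missing_rate(row) <= max_rate]
    return [reduced[i] for i in keep_patients], keep_snps, keep_patients


def distance(a, b):
    """Root mean squared difference over SNPs observed in both patients."""
    shared = [(x, y) for x, y in zip(a, b)
              if x is not MISSING and y is not MISSING]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_impute(geno, k=10):
    """Fill each missing call with the rounded mean of its k nearest donors."""
    filled = [list(row) for row in geno]
    for i, row in enumerate(geno):
        for j, value in enumerate(row):
            if value is not MISSING:
                continue
            # Donors are the other patients that have this SNP observed.
            donors = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(geno)
                if m != i and other[j] is not MISSING
            )[:k]
            if donors:
                filled[i][j] = round(sum(g for _, g in donors) / len(donors))
    return filled
```

Distances are computed on the original, unfilled matrix, so the result does not depend on the order in which missing entries are visited.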
The following clinical risk factors are also measured: age (continuous), education (categorical: level 1 = high school or less; level 2 = some college; level 3 = college graduate or more), tumor stage (categorical: levels 1–5), B-symptom presence (categorical: No; Yes; Unknown) and initial treatment (categorical: none; surgery; radiation; chemotherapy; other). Following the construction of the International Prognostic Index, we dichotomize age and create the binary indicator I(age > 60). For each categorical risk factor, we use binary dummy variables to represent its levels. Thus, there are a total of 13 covariates representing the five clinical risk factors.
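The count of 13 follows from dummy coding: one indicator for age, plus one indicator per non-reference level of each categorical factor (2 + 4 + 2 + 4). The sketch below illustrates this under the assumption that the first listed level of each factor is the reference category; the variable and function names are hypothetical.

```python
# Levels as listed in the text. Taking the first level of each factor as
# the reference category is an assumption made for illustration.
CATEGORICAL_LEVELS = {
    "education": ["level 1", "level 2", "level 3"],
    "tumor_stage": ["1", "2", "3", "4", "5"],
    "b_symptom": ["No", "Yes", "Unknown"],
    "initial_treatment": ["none", "surgery", "radiation",
                          "chemotherapy", "other"],
}


def covariate_names():
    """Names of the clinical covariates after dichotomizing and dummy coding."""
    names = ["I(age > 60)"]
    for factor, levels in CATEGORICAL_LEVELS.items():
        names += [f"{factor}={level}" for level in levels[1:]]
    return names


def encode(patient):
    """Encode one patient (a dict of raw values) as a 0/1 covariate vector."""
    row = [int(patient["age"] > 60)]
    for factor, levels in CATEGORICAL_LEVELS.items():
        row += [int(patient[factor] == level) for level in levels[1:]]
    return row


print(len(covariate_names()))  # 1 + 2 + 4 + 2 + 4 covariates
```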
When investigating the prognosis of DLBCL or FL, we model the joint effects of clinical and genomic risk factors. With five clinical risk factors and 331 genes, there are a total of 336 ‘clusters’ of covariates. Each cluster contains one or more covariates. For clinical risk factors, one cluster corresponds to one risk factor, and the covariates within it represent that factor's levels. For example, the ‘education’ cluster has two covariates, which are the binary indicators representing levels 2 and 3 of education. For genomic risk factors, one cluster corresponds to one gene, and the covariates within the cluster are the SNPs in that gene.
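One way to represent this cluster structure is a mapping from each cluster to the column indices of its covariates in the design matrix. The sketch below is illustrative only: the function name, the input format and the toy SNP and gene identifiers are hypothetical, and clinical covariates are assumed to precede SNPs in the column order.

```python
def build_clusters(clinical_covariates, snp_to_gene):
    """Map each cluster to the column indices of its covariates.

    clinical_covariates: list of (risk_factor, covariate_name) pairs.
    snp_to_gene: list of (snp_id, gene) pairs.
    Columns are numbered in input order, clinical covariates first.
    """
    clusters = {}
    column = 0
    for factor, _ in clinical_covariates:
        clusters.setdefault(("clinical", factor), []).append(column)
        column += 1
    for _, gene in snp_to_gene:
        clusters.setdefault(("gene", gene), []).append(column)
        column += 1
    return clusters


# Toy input with made-up identifiers, not data from the study.
clinical = [("age", "I(age > 60)"),
            ("education", "level 2"),
            ("education", "level 3")]
snps = [("snp_a", "GENE1"), ("snp_b", "GENE1"), ("snp_c", "GENE2")]
clusters = build_clusters(clinical, snps)
```

Applied to the full data, such a mapping would contain 5 clinical clusters and 331 gene clusters, matching the 336 clusters described above.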