We describe the seven principal functions that are included in the package and discuss how to interpret the output. All tests are applied to pairs of tumors from the same patient. If multiple tumors per patient are provided, all pairwise comparisons are performed.
2.1 Testing clonality using LOH data
The function ‘LOHclonality()’ combines both tests we have developed for LOH data at the candidate loci (usually 10–30 markers). The main input is a matrix of LOH calls, where each marker has to be represented by one of three user-specified symbols denoting no LOH, LOH in allele 1 or LOH in allele 2. Markers that are not informative (e.g. homozygous) in a particular tumor should be given an NA instead of a call. The methodology assumes that the markers are independent (ideally from different chromosome arms) and that uninformative markers are missing at random. The user can choose to perform the CM test, the LR test or both. The test statistic for the CM test is the number of concordant losses (i.e. LOH on the same allele). This is printed in the output along with counts of markers with an LOH in one tumor only, with no LOH in both tumors and so on. The P
-value is produced using a theoretical reference distribution that relies on assumptions that alleles 1 and 2 are equally likely to be lost, and that each marker has the same probability of an LOH. This test is valid when the deviations from these assumptions are not substantial (Begg et al., 2007
). The LR test does not make these assumptions, but it requires knowledge of the LOH frequencies for each marker. If they are not provided, they can be estimated from the original dataset or another group of reference patients with the same disease. The reference distribution for the LR test is obtained by simulating tumors with given LOH frequencies. Unlike the CM test, concordant LOH at an infrequently occurring marker can provide stronger evidence for clonality than at a commonly occurring marker. The LR test, while free of the assumptions of the CM test, is preferred only when the true LOH frequencies in the specific patient population of interest are known or can be estimated with high confidence.
2.2 Testing clonality using CGH data
For the analysis of CNAs, the software requires the package DNAcopy (Olshen et al., 2004
; Venkatraman et al.
, 2007). The input into the main function, ‘clonality.analysis()’, has to be a CNA object used in DNAcopy; it combines chromosome, genomic locations and each sample's log-ratios in columns. The chromosomes should be split into arms, since it increases the number of independent genomic units for the analysis. This can be done using function ‘splitChromosomes()’. The clonality analysis consists of several steps:
- The data are segmented with the one step Circular Binary Segmentation (CBS) algorithm that selects at most one prominent copy number change per chromosome arm (another segmentation algorithm can potentially be used with a user-defined segmentation function instead of the built-in ‘oneseg’, as long as it detects at most one copy number change per chromosome arm. Details are given in the vignette).
- The chromosome arms are classified as gain/loss/normal based on the central or most outstanding segment. Number of Median Absolute Deviations (MADs) of the residuals, selected by the user, defines the gain/loss threshold. Users should investigate the plots of the segmentation to make sure that the large visible gains and losses pass the MAD criteria in most samples and adjust number of MADs (‘nmad’) parameter accordingly. Frequencies of gain or loss for each chromosome arm are needed for the calculation of the LRs. The frequencies are evaluated based on the original dataset, but there is also an option to estimate them based on a dataset of other patients with the same disease.
- A likelihood ratio (LR1) is calculated based on the concordance of the gain/loss/normal profiles. The final statistic, LR2, is the product of LR1 and LRs from each chromosome arm in which concordant gains or losses are observed in the two tumors. These reflect the odds that the chromosome arm specific mutation is clonal. LR2 quantifies the odds that the two tumors are clonal.
- The reference distribution (option in ‘clonality.analysis’) is calculated by comparing pairs of tumors from different patients, which are necessarily independent. This distribution is used to calculate P-values for the original tumor pairs. It is a preferred option to the hierarchial clustering method often used in practice, since it effectively uses the known paired structure of the data and for other technical reasons (Ostrovnaya et al., 2010b). Note that calculating the reference distribution might take a substantial amount of time (hours) depending on the number of patients and the resolution of the assay. However, we recommend choosing this option unless the clonality signal is very strong and the noise level is small.
It is important to make sure that small gains and losses that could potentially be germline copy number variants (CNVs) are excluded from the arrays prior to clonality analysis
. These appear as pronounced changes that are exactly the same in both tumors, and are thus mistaken to be strong evidence toward clonality by our algorithm. In order to exclude CNVs, the arrays should be either compared to their matching normal arrays, or to the database of genomic variants (http://projects.tcag.ca/variation/
), or some other screening method can be applied (e.g. Ostrovnaya et al., 2010c
Since only one genomic change per chromosome arm is allowed, we suggest limiting the resolution of the array to roughly a total of 5000–15 000 markers. If the array has more markers, then mutually exclusive blocks of several adjacent markers can be averaged using the function ‘ave.adj.probes()’. The number of markers averaged should be calculated based on the original resolution of the array. Averaging also helps reduce noise in the arrays, it removes waves in log-ratios that can result in false genomic changes, and it simplifies the copy number patterns. If the user is not confident that all CNVs are removed, then a lower resolution (e.g. 2000 markers) is recommended.
The help file for ‘clonality.analysis()’ contains a real data example—analysis of pairs of breast cancer lesions published by (Hwang et al., 2004) and available from the authors' web site.
2.3 Visualization of CGH data
Two functions, ‘chromosomePlots()’ and ‘genomewidePlots()’, provide per chromosome and genomewide plots of the data and one step segmentation for pairs of tumors. These plots show where the clonality signal is coming from; they illustrate the concordance between general mutational patterns and the match between specific gains and losses. The overlaying histograms of the LRs and the reference distribution under the hypothesis of independence are produced by ‘histogramPlot()’. The extent of overlap between the two histograms can help define the cut-offs for diagnosing patients: the upper right tail of the reference distribution might be interpreted as an equivocal area, while the LRs above or below the equivocal area receive unambiguous diagnoses.