Correlating specific genomic copy number aberrations (CNAs) with disease is an important and challenging first step in biomarker discovery [1]. Detecting CNAs that define genomic regions of interest using array comparative genomic hybridisation (aCGH) requires precise integration of probe signal amplitude, the size (i.e., width) of the copy number imbalanced region, and the frequency of imbalance across a sample set, all referenced to relevant clinico-pathologic features.
There are two broad methods of aCGH data interpretation for biomarker discovery. The first, exemplified by the R Bioconductor package cghMCR [2], identifies regions showing the most frequent CNAs within a sample set, ranked by average signal amplitude. This approach to prioritization may under-call low-prevalence, high-level CNAs, such as homozygous deletions or gene amplifications that occur in small subsets of the samples analysed. The second method, targeted gene identification, exemplified by the genome topography scanning (GTS) algorithm [3] and the Genomic Identification of Significant Targets in Cancer (GISTIC) module [4], is designed to localize regions of copy number imbalance most likely to be of functional significance. The GTS method models CNAs using parameters of signal intensity, region width and recurrence across a sample set, moderated by gene content. While this approach is able to identify significant regions of imbalance in heterogeneous samples, it relies on prior knowledge. GISTIC calculates the background rate of random chromosomal aberrations and identifies regions that are aberrant more often than would be expected by chance, with greater weight given to high-amplitude events. Although GISTIC is gaining favour, a recent report notes that it has trouble identifying relevant minimal regions of interest within larger tracts of CNA [5].
There are currently few open source methods for consolidating aCGH data across a set of samples. In addition, there are particular difficulties with handling large data sets derived from very high-density oligonucleotide-based aCGH platforms, where there may be a need to review many distinct significant regions of interest. To address these issues, we developed sliding windows adaptive thresholds CGH (swatCGH), a new computational framework for simplifying aCGH data analysis. swatCGH is a heuristic method that builds on the strengths of the major existing approaches. It provides a robust, systematic approach that automates the aCGH analysis process in order to identify CNA regions of interest and improve the reliability of candidate gene identification.
The framework is based on the analysis of average signal amplitude, region width and frequency of CNA occurrence, and enables these parameters to be identified as independent or associated events, including sample subset analysis by agglomerative hierarchical clustering. For each chromosome, swatCGH preferentially identifies regions that display the largest average signal intensity in the greatest proportion of the sample cohort.
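As a rough illustration of how frequency and amplitude can be combined along a chromosome, the following Python sketch scores sliding windows of probes. It is not the swatCGH implementation: the function names, the fixed gain threshold of 0.3 (swatCGH uses adaptive thresholds, which are not modelled here), the window size, and the rule of ranking by frequency first and amplitude second are all assumptions made for this example.

```python
from statistics import mean


def window_scores(log2_ratios, threshold=0.3, window=3):
    """Score sliding windows of probes along one chromosome.

    log2_ratios: list of probes in genomic order; each probe is a list of
    log2 ratios, one per sample. `threshold` is a hypothetical fixed cut-off
    for calling a gain (an assumption, not an adaptive threshold).
    Returns a list of (start_index, mean_frequency, mean_amplitude).
    """
    freq = []  # fraction of samples showing a gain at each probe
    amp = []   # mean log2 ratio among the gained samples at each probe
    for probe in log2_ratios:
        gained = [value for value in probe if value >= threshold]
        freq.append(len(gained) / len(probe))
        amp.append(mean(gained) if gained else 0.0)

    scores = []
    for start in range(len(log2_ratios) - window + 1):
        stop = start + window
        scores.append((start, mean(freq[start:stop]), mean(amp[start:stop])))
    return scores


def best_window(scores):
    """Pick the window gained in the largest share of samples,
    breaking ties by the larger mean amplitude (an assumed ranking rule)."""
    return max(scores, key=lambda s: (s[1], s[2]))


# Toy data: 5 probes x 4 samples (values are invented for illustration).
toy = [
    [0.0, 0.1, 0.0, 0.0],
    [0.5, 0.6, 0.0, 0.4],
    [0.5, 0.7, 0.6, 0.6],
    [0.4, 0.4, 0.0, 0.4],
    [0.0, 0.0, 0.0, 0.0],
]
print(best_window(window_scores(toy, window=2)))
```

Ranking by frequency before amplitude mirrors the stated preference for regions altered in the greatest proportion of the cohort; reversing the sort key would instead favour rare, high-level events.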
The stages of swatCGH were designed to accommodate technical factors that may confound aCGH data analysis, particularly methods of signal intensity preprocessing, such as background correction, normalization, and classification of probe copy number states following segmentation [6]. The method, which is built on R Bioconductor [8], enables application of multiple preprocessing configurations, probe segmentation algorithms, and classification strategies, in order to provide the most robust definition of significant CNA regions of interest. Uniquely, the approach also allows comparison and consolidation of analyses resulting from the various preprocessing methods used.
Here, we provide a detailed description of swatCGH. We exemplify the approach using a previously published aCGH dataset based on an analysis of 38 glioblastoma multiforme (GBM) samples using Agilent 44K oligonucleotide arrays (GSE7602) [3]. The dataset had previously been analysed by GTS, leading to identification of functional redundancy between the CDKN2A and CDKN2C tumour suppressor genes in GBM. We analysed the dataset by swatCGH, using data preprocessed with each of the four most frequently cited segmentation algorithms: circular binary segmentation from the package DNAcopy [9], an adaptive weights smoothing method from the package GLAD [11], a homogeneous hidden Markov model (HomHMM) provided by the package aCGH [12], and a biologically tuned HMM (BioHMM) from the package snapCGH [13]. By consolidating data from the four analyses, we identified the most robust CNA regions of interest in the dataset. Based on our comparison of methods for prioritizing detected CNAs, we present results as a summarized list ranked by mean signal intensity, with web-style summary pages to facilitate data verification and efficient selection of candidate genes. In addition, the detailed report of all parameters analysed allows for thorough assessment of other potential regions of interest that are not recorded on the ranked list. By comparing our findings with the previous GTS study [3], we conclude that our heuristic framework offers a simplified high-throughput approach to defining novel genomic loci of potential clinical relevance.
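One simple way to think about consolidating calls from several segmentation algorithms is to keep only the stretches of a chromosome that are called by a minimum number of methods. The Python sketch below illustrates that idea with a sweep over interval boundaries. It is a hypothetical example and not the consolidation procedure used by swatCGH: the function names, the `min_methods` parameter, the half-open interval convention, and all coordinates are assumptions introduced here, and the method names serve only as dictionary labels.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals from one method."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def consensus_regions(calls_by_method, min_methods):
    """Return regions on one chromosome called by at least `min_methods` methods.

    calls_by_method: dict mapping a method label to a list of half-open
    (start, end) intervals. Intervals are merged per method first so that
    no method is counted twice at the same position.
    """
    events = []
    for intervals in calls_by_method.values():
        for start, end in merge_intervals(intervals):
            events.append((start, 1))   # a method starts calling here
            events.append((end, -1))    # a method stops calling here
    # At equal positions, process starts before ends so that abutting
    # support does not split a region; zero-length regions are dropped below.
    events.sort(key=lambda event: (event[0], -event[1]))

    regions = []
    depth = 0
    region_start = None
    for position, delta in events:
        previous = depth
        depth += delta
        if previous < min_methods <= depth:
            region_start = position
        elif depth < min_methods <= previous:
            if position > region_start:
                regions.append((region_start, position))
    return regions


# Invented coordinates for illustration only.
calls = {
    "DNAcopy": [(100, 200)],
    "GLAD": [(120, 220)],
    "HomHMM": [(150, 180)],
    "BioHMM": [(300, 400)],
}
print(consensus_regions(calls, min_methods=3))
```

Raising `min_methods` trades sensitivity for robustness: requiring all four methods keeps only regions on which every algorithm agrees, whereas a value of one returns the union of all calls.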