|Home | About | Journals | Submit | Contact Us | Français|
Array comparative genomic hybridization (array CGH) is a technique for assaying the copy number status of cancer genomes. The widespread use of this technology has lead to a rapid accumulation of high throughput data, which in turn has prompted the development of computational strategies for the analysis of array CGH data. Here we explain the principles behind array image processing, data visualization and genomic profile analysis, review currently available software packages, and raise considerations for future software development.
Segmental deletion and duplication of chromosomal regions have been associated with both constitutional diseases and somatic alterations in cancer (Inazawa et al. 2004; Lockwood et al. 2006; Oostlander et al. 2004; Vissers et al. 2005). Recent studies have demonstrated that large scale copy number variations exist in the human population (Conrad et al. 2006; de Vries et al. 2005; Hinds et al. 2006; Iafrate et al. 2004; McCarroll et al. 2006; Sebat et al. 2004; Tuzun et al. 2005). Array comparative genomic hybridization (array CGH) is a method designed for identifying genomic regions with copy number aberration (Pinkel et al. 1998; Solinas Toldo et al. 1997). In this method, DNA from both reference and test genomes are differentially labeled with fluorescent dyes and competitively hybridized to DNA targets arrayed on a glass slide (Fig. 1). The hybridized slide is then scanned and the resulting signal intensity ratio at each DNA target reflects the copy number status of the DNA segment. By referring the segment to its corresponding position on the human genome map, the genes affected by copy number alteration can be identified (Fiegler et al. 2003; Ishkanian et al. 2004; Snijders et al. 2001). Numerous advances in array CGH technology have been made since its development in the mid 1990s with increased genome coverage and target density, improving resolution and sensitivity of detection. The majority of array CGH platforms use either oligonucleotide (oligo) or large insert clone (LIC) DNA targets (Davies et al. 2005). Oligos are short DNA fragments of approximately 21–60 nucleotides in length whereas LICs are typically bacterial artificial chromosome (BAC) clones which are ~100 kb in size. Historically, arrays were designed to cover specific chromosomes (Buckley et al. 2002; Buckley et al. 2005), chromosome arms (Coe et al. 2005; Garnis et al. 2003; Henderson et al. 2005) or selected regions of the genome implicated in disease (Albertson et al. 2000; Schwaenen et al. 2004). In contrast, genome wide arrays that sample copy number status of loci at megabase intervals have facilitated rapid survey for regions of loss and gain (Fiegler et al. 2003; Greshock et al. 2004; Snijders et al. 2001). Alternatively, cDNA microarrays, initially designed for gene expression profiling, have been used to assess copy number status of coding regions (Pollack et al. 1999; Squire et al. 2003). The development of high density arrays consisting of tens of thousands of DNA targets spanning the entire human genome has enabled precision mapping of the boundaries of genetic alterations throughout the genome in a single experiment (Barrett et al. 2004; Bignell et al. 2004; Ishkanian et al. 2004; Selzer et al. 2005; Zhao et al. 2004).
The production and use of these high density arrays relies not only on technical precision in array synthesis but also computational platforms tailored to the imaging, mapping, and analysis of replica sets of tens of thousands of DNA targets with spot signals in a narrow dynamic range. This article describes the principles behind visualization and analysis of whole genome array CGH data and reviews the software currently available.
Array CGH software applications can be classified according to three general functions: data preprocessing, visualization, and analysis (Fig. 2). Some software programs are specific to a particular function while others perform multiple tasks. The following section explains the principles and describes the methods for performing these functions.
Upon completion of an array CGH experiment, the microarray slide is scanned in two channels to generate high resolution fluorescence images corresponding to the two cyanine dyes. The localization of spots on the array is a semi-automatic process supported by “spot finding” functions, available in most microarray scanner software packages and custom packages (Jain et al. 2002). The signal intensity for each spot is quantified for each channel. However, image normalization is critical to improving detection sensitivity of copy number alterations, as a single copy loss would only reduce the signal by 50% resulting in a 1:2 signal ratio, and a single copy gain would result in a 3:2 ratio. (These shifts in ratios are subtle compared to gene expression changes.) In tumor samples, these ratios are further dampened by tissue heterogeneity with a mixed population of normal and cancer cells (Garnis et al. 2005). Therefore, before signal ratio can be deduced, the intensities of the two images need to be balanced and systematic biases influencing measurements need to be removed (Fig. 3). Intensity bias (different intensities for the dye channels), spatial bias (the location of DNA target on the array), plate bias (source plate of the target DNA spotted) and background bias (the contribution of background fluorescence to spot signal intensity) are factors that have been shown to affect signal intensity ratio in high density array CGH experiments (Khojasteh et al. 2005).
As replica spots are necessary to ensure experimental precision, arrays often contain multiple measurements of a DNA target. Therefore, basic operations are applied to determine the mean or median ratios of the replica, and the standard deviation for quality assessment and filtering.
To display spot data in the context of genomic position, log2 signal ratio for each spot is plotted against its corresponding location in the human genome. Graphical representations and interactive display are the two main approaches used in visualization. Graphical representations are XY scatter plots, with the X axis representing the array elements in ordered chromosomal position—typically, the chromosomes are arranged in series—and the Y axis representing the corresponding log2 signal ratio. However, with arrays containing tens of thousands of DNA elements (high density arrays), the number of data points are too numerous to display on this scale (Fig. 4a). Interactive displays are designed for high density arrays allowing the sequential magnification of selected chromosomes and chromosome segments to visualize individual data points. Typically, ratio data is displayed in parallel to a chromosome ideogram. Advanced visualization software provide practical features, for example, displaying the actual segment length represented by the spotted element (as opposed to non-overlapping single points), displaying aligned gene annotation (gene track), providing immediate linkage to public databases such as Online Medelian Inheritance of Man (OMIM), NCBI Entrez and UCSC Genome Browser (Fig. 4b).
A variety of methods are used in the identification of segmental copy number alterations. Here we describe the principles behind the commonly used analytical approaches (Fig. 2).
One of the simplest approaches for data analysis is by directly thresholding at a particular ratio. This methodology was very commonly used in early array CGH publications (Albertson et al. 2000; Garnis et al. 2004; Veltman et al. 2003). This threshold value can be defined in different ways. Ratio thresholds signify gains and losses based on a theoretical ratio of a single copy gain (3:2, log2 ratio of 0.585) and single copy loss, (1:2, log2 ratio of -1), albeit the actual ratio observed is typically significantly lower than the theoretical. Another approach relies on a sex mismatched experiment and using the signal ratio of the X chromosome to define the ratio for a single copy change (Fig. 5a). The drawback to this approach is that the ratio shift dampened by tissue heterogeneity is not reflected in the sex mismatch as both cancerous and non-cancerous cells in a sample have the same number of X chromosomes. Spectral karyotyping (SKY) or fluorescence in situ hybridization (FISH) can be used to calibrate the relationship between the copy number and the amplitude of the signal shift.
In this method, thresholding is applied to multiple consecutive data points, rather than individual ones. This involves calculating the average across a sliding window of data points (e.g. 30 kb windows sliding at 10 kb intervals) (Fig. 5b). As such, larger-sized windows which incorporate more adjacent points would produce a smoother curve, but at a lower detection sensitivity. Conversely, smaller windows will detect the smaller alterations, but may introduce a higher number of false positives.
K-means clustering involves the a priori determination of a set of clusters, k, such that a given quantity is minimized relative to the centroids of the clusters (MacQueen, 1967). Moreover, the variability in the types of K-means clustering is achieved by changing the method of measuring distance and the quantity to be minimized. For example, one quantity to minimize is the maximum distance of an object to its centroid using a distance measure such as the Euclidean distance (Autio et al. 2003). In terms of array CGH, three centroids are normally used, to represent “gain”, “loss” and “retention” respectively. However, the number of centroids may be increased to reflect multiple levels of gains and losses.
Briefly, a Hidden Markov model (HMM) is a statistical approach designed for describing a system with unknown parameters using those that are observed—where the known aspects of the model are the states assigned and the unknown parts are the transition probabilities between states. Moreover, HMMs can be described by three main components: a set of probabilities associated with transitions between all states (Λ), a set of probability distributions associated with each state (B), and a distribution of initial states (π). Commonly, any HMM with a discrete, finite number of states can be defined as λ = (Λ, B, π) (Rabiner, 1989).
In the context of the application of HMM to array CGH analysis, a simple version of this approach was utilized where the hidden states in fact represented each of the states of copy number change; gain, loss and retention (de Vries et al. 2005). Moreover, this method has been used to extrapolate levels of copy number when accounting for such factors as tissue heterogeneity as the expected ratio change for a single copy gain and loss would be dampened (Fridlyand et al. 2004). In addition to the application to BAC based microarrays, this approach has been employed in the context of the oligonucleotide platforms (Iafrate et al. 2004; Nannya et al. 2005; Zhao et al. 2004).
Circular binary segmentation (CBS) is a change-point based method which searches for particular change points where neighboring regions of DNA exhibit a statistical difference in copy number. By modifying the standard binary segmentation approach to a circular approach, this algorithm can be used to detect breakpoints in DNA as the altered region would be flanked by two regions of different copy number level, requiring two breakpoints. This algorithm, implemented in the DNACopy package, has been applied to test BAC array and representative oligonucleotide microarray (ROMA) datasets (Olshen et al. 2004). The application of CBS to describe genetic alterations myeloid sarcoma has been reported recently (Deeb et al. 2005).
Another approach for array CGH analysis revolves around the use of wavelets. Briefly, this is a spatially-adaptive and non-parametric approach used to denoise (smooth) and segment data. Furthermore, this method can handle small discrete alterations which appear as an abrupt aberration and deal with the inherent property of variable sized alterations with different magnitudes seen in array CGH data (Hsu et al. 2005). This approach has been implemented in a few different algorithms used to smooth and segment array CGH data (Hsu et al. 2005; Khojasteh et al. 2006).
The genetic local search approach is an algorithm which tries to partition the data by placing a user-defined number of breakpoints across a particular chromosome. Breakpoints are placed in a random fashion and the algorithm iteratively tries to improve the location of the breakpoints such that the negative log-likelihood of the data and the penalty associated with too many breakpoints within a partition are minimized (Jong et al. 2004). Furthermore, the data becomes segmented and the values are “smoothed” such that they are the average of all the data points in that segment (Fig. 5c). This method, implemented in the aCGH-Smooth software package, has been used in the analysis of non-small cell lung cancer (NSCLC) cell lines (Garnis et al. 2006), small cell lung cancer (SCLC) cell lines (Coe et al. 2006), and oral squamous cell carcinoma (Baldwin et al. 2005).
It should be noted that there is a false discovery rate (positive and negative) associated with any algorithm used for the detection of segmental alterations. The algorithm may not be able to consistently identify and correct for intrinsic noise in the data due to technical and biological variance encountered in array CGH experiments (Ylstra et al. 2006). Complementary methods such as fluorescence in situ hybridization and quantitative PCR will provide independent confirmation of the CGH findings. Alternatively, detection of changes in expression of genes within regions of alteration will also provide support of biological significance.
Table 1 summarizes currently available array CGH software programs and compares the algorithms used in the detection of segmental copy number changes and the types of visualization available.
Typically, software programs are developed to support the analysis and/or visualization of specific array platforms, especially for the commercially available platforms. For example, Affymetrix (Affymetrix Copy Number Analysis Tool) and Nimblegen (Nimblegen SignalMap) have been developed by the respective companies for their manufactured arrays. In contrast, software applications developed by academic laboratories were generally designed to handle a primary array utilized by the research group and upon subsequent improvements, could handle data from other commonly used array platforms. The application SeeGH, as an example, was initially developed to visualize and analyze BAC array CGH data but in new versions of the application, data from oligonucleotide or cDNA platforms can be accommodated. Furthermore, other programs such as ArrayCyGHt, CGH-Explorer, M-CGH and Normalise Suite v2.5 also demonstrate versatility by handling the data generated by all three types of array platforms (Table 1). The visualization capabilities of these applications are compared based on the ability to view single or multiple experiments, and simple static graphical representations versus interactive displays (Table 1). Here, we highlight three software examples to illustrate interactive display: CGHPro, CGHAnalyzer v2.2 and SeeGH v3.0.
CGHPro is a Java-based software operable on multiple operating systems. It requires the installation of the Java Runtime Environment Version 1.4.2 or higher, the statistical package R (Ihaka and Gentleman, 1996) Version 1.9.1 and the MySQL database server to store array CGH experiments (Chen et al. 2005). The major functionalities in this software include data quality assessment through graphical means, normalization of data using commonly used techniques for microarray imaging, integration of previously designed algorithms for alteration detection, and multiple methods for visualization. In addition, CGHPro can input formatted data from a variety of array platforms.
Data quality assessment is achieved using graphical methods such as scatter plots of the log2 spot intensities, box plots, histograms, M-A plots and QQ plots. Data filtering is achieved using user-defined parameters. Normalization routines include: Global Median, Subgrid Median, LOWESS (locally weighted scatter plot smooth), Subgrid LOWESS, and dye-swap normalization. Alteration detection algorithms include direct thresholding and thresholding after use of segmentation algorithms, incorporating the aCGH bioconductor (HMM) and DNACopy (CBS) packages (Fridlyand et al. 2004; Olshen et al. 2004). Visualization is interactive allowing sequential magnification and viewing of multiple experiments.
CGHAnalyzer is also a Java-based software with the requirement of Java Runtime Environment version 1.4 or later (Margolin et al. 2005). This program allows querying of pre-loaded or custom gene sets for copy number status and integrates the clustering options of TIGR Multi-Experiment Viewer (Saeed et al. 2003). CGHAnalyzer does not have normalization functions requiring pre-normalized data. However, mapping information for UPenn BAC array and Affymetrix P501 SNP array are pre-loaded.
Two visualization layouts are provided to give the option of viewing the chromosomes in concentric circles or as traditional chromosome ideograms. Multiple experiments can be viewed using heatmap alignment of individual chromosomes. Alteration detection depends on direct thresholding or by variation from a pre-defined distribution.
SeeGH was developed in C++, runs on Windows platform, requiring MySQL as the database structure. It accepts pre-normalized data and allows filtering of replica data points based on standard deviation and signal-to-noise ratio cut-offs. SeeGH accommodates data from a variety of sources, for example copy-number, gene expression, and global methylation profiles. Interactive display functions include sequential magnification, linking of clones to genes and, in turn, to biological databases (e.g. UCSC Genome Browser). Localization to specific regions of interest can be achieved through querying of identifiers such as gene name, clone name, and base pair position. Experimental parameters and user comments are stored within SeeGH allowing convenient information retrieval.
In addition, users can add customized or preloaded tracks to display gene location, CpG island position, microRNA location, etc. Multiple chromosome alignment, frequency summary plot, and heatmap display are included options for viewing multiple experiments (Fig. 6). Direct thresholding and moving average based thresholding are built in for alteration detection. Alternatively, segmentation using external software (e.g. aCGH-Smooth) can be imported for visualization.
With the rapid accumulation of large scale high throughput data describing cancer genomes, epigenomes, and transcriptomes, cross-platform meta-analysis will become prevalent. However, researchers with limited genomics and computational expertise will not be able to readily take advantage of such information. The development of facile, web-based software for the integration of large scale multidisciplinary databases will facilitate the widespread mining of genomic data and their correlation with clinical features (Kingsley et al. 2006). These issues are more pronounced with the increasing emphasis on translational research as array CGH technology moves towards clinical application. Added consideration of the ease of use, information security, automation and incorporation of prior knowledge of disease to assist in interpretation is necessary to deliver these emerging technologies to a clinical setting.
We thank Jonathan Davies, Timon Buys, and Bradley Coe for useful discussion, and Bryan Chi and Mehrnoush Khojasteh for providing software prior to publication. This work was supported by funds from Genome Canada/Genome British Columbia, Canadian Institute of Health Research, and NIDCR grant RO1 DE15965-01. WWL is supported by a scholarship from Natural Sciences and Engineering Research Council.