This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
The prevalence of high resolution profiling of genomes has created a need for the integrative analysis of information generated from multiple methodologies and platforms. Although the majority of data in the public domain are gene expression profiles, and expression analysis software is available, the increase in array CGH studies has enabled integration of high throughput genomic and gene expression datasets. However, tools for direct mining and analysis of array CGH data are limited. Hence, there is a great need for analytical and display software tailored to cross platform integrative analysis of cancer genomes.
We have created a user-friendly Java application to facilitate sophisticated visualization and analysis, such as cross-tumor and cross-platform comparisons. To demonstrate the utility of this software, we assembled array CGH data representing Affymetrix SNP chip, Stanford cDNA array and whole genome tiling path array platforms for cross comparison. This cancer genome database contains 267 profiles from commonly used cancer cell lines representing 14 different tissue types.
In this study we have developed an application for the visualization and analysis of data from high resolution array CGH platforms that can be adapted for analysis of multiple types of high throughput genomic datasets. Furthermore, we invite researchers using array CGH technology to deposit both their raw and processed data, as this will be a continually expanding database of cancer genomes. This publicly available resource, the System for Integrative Genomic Microarray Analysis (SIGMA) of cancer genomes, can be accessed at http://sigma.bccrc.ca.
Array comparative genomic hybridization (CGH) is a method used to detect segmental DNA copy number alterations and is widely used to discover chromosomal aberrations in cancer and other genetic diseases [1,2]. In this method, differentially labeled genomic DNA samples are competitively hybridized to chromosomal targets, and the copy number balance between the two samples is reflected by their signal intensity ratio. Numerous array CGH platforms exist; these vary in the type of elements present on the array and their corresponding coverage of the human genome. With the development of high resolution, genome wide arrays, tens of thousands of loci can be evaluated for copy number status, facilitating the high throughput search for genes potentially involved in pathogenesis. This has allowed the identification of discrete regions of alteration that may have been missed by traditional cytogenetic methods and has proven to be a useful platform for exploring the underlying genetic basis of cancer [1,3].
With the increasing utilization of array CGH, it has become important not only to establish standards for data deposition, but also to develop tools that facilitate public access and ease mining of available data. Currently, the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) repository and European Bioinformatics Institute (EBI) ArrayExpress provide storage for array CGH data, but these databases have been largely designed for gene expression microarrays. Although these sites support visualization of previously analyzed gene expression profiles, there are limited tools available for direct mining and analysis of array CGH data. Hence, there is a need for forums specific to array CGH data. Recently, attempts have been made to build a database primarily of lower resolution array CGH data. However, with the accumulation of high density array data generated with diverse technologies, the viewing of array data has become a bioinformatics challenge, especially when the integration of multiple datasets from different platforms is required. Therefore, a central database with analytical software tailored specifically for analyzing and visualizing different types of high resolution array CGH data would greatly facilitate data mining.
In this study we have created a database consisting of high-resolution, whole-genome array CGH profiles for nearly 200 commonly used cancer cell lines, which have been instrumental in biochemical and pharmacogenetic studies, profiled on four different array platforms. Moreover, we have developed a user-friendly, web-based Java application called the System for Integrative Genomic Microarray Analysis (SIGMA) for comparative analysis of multiple genomes.
We have assembled a collection of 267 array CGH profiles, representing 184 distinct cell lines profiled on at least one of the four array CGH platforms (Table 1, Table 2). Moreover, 14 different cancer tissue origins and 30 distinct cancer types are represented in this database, resulting in the assembly of a wide spectrum of genomes in this repository (Table 2) [see Additional file 1]. Significantly, 56 of the 267 CGH profiles are unpublished raw data that are now made public.
To increase the utility of this collection, SIGMA includes a web-based application that allows user-friendly mining of this dataset. Four major types of functionality are offered by SIGMA: (1) interrogation of a single sample, (2) visualization and analysis of a single group of samples, (3) comparative analysis of two groups of samples, and (4) integration of data from multiple platforms (Figure 1B).
The first function we discuss is the ability to view a single high resolution array CGH profile at multiple magnification levels. The major utility of this function is to display the underlying genomic architecture of a cell line, so that genetic features can be considered in experimental interpretation. For example, Figure 2A shows a whole genome karyogram of lung adenocarcinoma cell line H2087 profiled on the SMRT array platform. From this image, we see many changes, such as the loss of the 3p arm as well as segmental changes in chromosomes 8, 19 and 20. Specifically, we can select chromosome 8 (Figure 2B) and view it separately, then zoom into the region of interest and visualize it in finer detail (Figure 2C). Users can then highlight or place boundary lines in this region and query which genes are located within the region of interest.
Subsequently, using the interval search option, users can retrieve the genes located in a desired region and have the option to query commonly used biological databases such as NCBI MIM, NCBI Gene, NCBI PubMed and the UCSC Genome Browser. For example, if we look at band 8q24.21 (Figure 2D), we can highlight the region and search the interval for the genes it contains. When we invoke the region search and retrieve only genes curated by RefSeq, we see there are 8 genes in the amplicon. If the user selects a particular gene, options to link out to the above mentioned biological databases become available. The utility of this function is to facilitate a direct connection from experimental findings to known, relevant information. Moreover, specific genes and regions can be interrogated in any of the types of analysis outlined (Figure 1B).
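The interval search described above amounts to an overlap query between a selected region and a set of gene annotations. The following is a minimal sketch of that idea, not SIGMA's actual implementation: the `Gene` class, the `genesInRegion` method, the closed 1-based coordinate convention and the example coordinates are all assumptions introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal sketch of an interval search: which genes overlap a selected region? */
final class IntervalSearch {

    /** A gene annotation with inclusive coordinates (hypothetical structure). */
    static final class Gene {
        final String symbol;
        final String chromosome;
        final long start;
        final long end;

        Gene(String symbol, String chromosome, long start, long end) {
            this.symbol = symbol;
            this.chromosome = chromosome;
            this.start = start;
            this.end = end;
        }
    }

    /** Returns the symbols of all genes on the chromosome that overlap [regionStart, regionEnd]. */
    static List<String> genesInRegion(List<Gene> genes, String chromosome,
                                      long regionStart, long regionEnd) {
        List<String> hits = new ArrayList<>();
        for (Gene g : genes) {
            // Two closed intervals overlap when each one starts before the other ends.
            boolean overlaps = g.start <= regionEnd && g.end >= regionStart;
            if (g.chromosome.equals(chromosome) && overlaps) {
                hits.add(g.symbol);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up gene names and coordinates, not real annotations.
        List<Gene> genes = new ArrayList<>();
        genes.add(new Gene("GENE_A", "chr8", 1_000, 2_000));
        genes.add(new Gene("GENE_B", "chr8", 3_000, 4_000));
        genes.add(new Gene("GENE_C", "chr8", 5_000, 6_000));
        genes.add(new Gene("GENE_D", "chr17", 1_500, 2_500));

        System.out.println(genesInRegion(genes, "chr8", 1_500, 3_500));
    }
}
```

A linear scan is sufficient for a sketch; a production tool holding a full gene catalogue would more plausibly sort annotations by position or use an interval index.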
A common research task is to look across a series of samples with a common phenotype to identify recurrent genetic changes, for example by comparing lung adenocarcinomas [see Additional file 2]. With the spectrum of samples warehoused in SIGMA, such a query can be performed across multiple cancer types. For example, the alignment of a set of samples representing 8 different cancer types revealed common amplification of the MYC oncogene locus (Figure 3A), while the epidermal growth factor receptor (EGFR) locus is amplified only in a subset of samples (Figure 3B).
Recurrent alterations detected in one group of samples can be compared against those in another group, for example, the comparison of lung squamous cell carcinoma (SqCC) with cervical SqCC [see Additional file 3]. The strategy for comparison, for example the overlay of frequency plots derived from two groups of profiles, has been described elsewhere [7,8].
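The frequency plots used for single group and two group analysis can be illustrated with a short sketch. It assumes each profile has been reduced to per-element calls of -1 (loss), 0 (no change) or +1 (gain), and that all profiles in a comparison share the same array elements. The class and method names are hypothetical, and the per-element difference is only one simple way to summarize an overlay of two frequency plots; it is not necessarily the comparison strategy of the cited studies.

```java
import java.util.Arrays;

/** Sketch: per-element alteration frequency for a group of profiles, and a two-group comparison. */
final class AlterationFrequency {

    /**
     * calls[sample][element] holds -1 (loss), 0 (no change) or +1 (gain).
     * Assumes a non-empty, rectangular matrix.
     * Returns, for each array element, the fraction of samples whose call equals the given state.
     */
    static double[] frequency(int[][] calls, int state) {
        int elements = calls[0].length;
        double[] freq = new double[elements];
        for (int[] sample : calls) {
            for (int e = 0; e < elements; e++) {
                if (sample[e] == state) {
                    freq[e] += 1.0;
                }
            }
        }
        for (int e = 0; e < elements; e++) {
            freq[e] /= calls.length;
        }
        return freq;
    }

    /** Per-element difference in frequency between group A and group B. */
    static double[] difference(double[] freqA, double[] freqB) {
        double[] diff = new double[freqA.length];
        for (int e = 0; e < freqA.length; e++) {
            diff[e] = freqA[e] - freqB[e];
        }
        return diff;
    }

    public static void main(String[] args) {
        // Toy data: two samples per group, three array elements each.
        int[][] groupA = {{1, 0, -1}, {1, 1, 0}};
        int[][] groupB = {{0, 0, 1}, {1, 0, 1}};

        double[] gainA = frequency(groupA, 1);
        double[] gainB = frequency(groupB, 1);
        System.out.println("Gain frequency A: " + Arrays.toString(gainA));
        System.out.println("Gain frequency B: " + Arrays.toString(gainB));
        System.out.println("Difference A - B: " + Arrays.toString(difference(gainA, gainB)));
    }
}
```

Elements with a large positive or negative difference are candidates for alterations that distinguish the two groups; assessing their significance would require a statistical test that this sketch does not attempt.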
Cross platform comparison is essential to the multi-dimensional description of a genome. Here, we have included a feature in SIGMA to allow users to view data from multiple platforms simultaneously. We use the breast cancer cell line MCF7 to illustrate this functionality. Data from four different array CGH platforms were publicly available: the SMRT array, the Stanford cDNA microarray, the Affymetrix 10K SNP array and the Affymetrix 100K SNP array. Figure 4A illustrates the cross platform display of chromosome 17, while Figure 4B shows the variable density of coverage by these four commonly used platforms.
One of the novel features we have provided is the integrative visualization of copy number alterations with DNA methylation status. The major premise in studying copy number alterations at the DNA level is that these are the primary changes involved in driving changes in gene expression. Though gene dosage variation may be responsible for expression changes, alteration in DNA methylation pattern also contributes significantly to regulating gene expression. Recently, methods for global methylation analysis to measure aberrant DNA methylation status across tumor genomes have been developed [9-12]. Wilson et al. (2006) compared methylation patterns with copy number profiles in lung cancer cells. Utilizing genomic and epigenetic data from this study for the H1395 lung cancer cell line, we illustrate a parallel display in SIGMA. Figure 5 shows a large segmental copy number gain spanning 1q21.2 to 1q23.1 with corresponding hypomethylation, localized precisely to 1q21.3. Significantly, both copy number gain and decreased methylation can elevate gene expression. The S100 calcium binding protein A10 (S100A10) gene within this region has been previously shown to be over-expressed in lung cancer. The value of integrative studies examining aberrant DNA methylation and genomic copy number is readily apparent. With the increasing prevalence of whole genome methylation studies, this feature will become more important.
We have developed an application for the integrative cross platform analysis of array CGH data. The SIGMA application facilitates consolidation and structuring of diverse sources of array CGH data into a repository that is accessible through a new, easy-to-use, built-in web-based analytical application. The launch version contains data for 267 array CGH profiles, representing cancer cell lines from 14 different types of tissue. The ability of SIGMA to incorporate multiple array CGH platforms facilitates the archiving of array CGH data from future publications, regardless of the current or future array platform used. Although SIGMA's architecture currently facilitates the direct mining of genomic and epigenomic data, it can easily be adapted to other data types, including, but not limited to, high resolution genetic and gene expression surveys.
The raw data for the 267 CGH profiles in the database were obtained from a variety of sources and include both published and unpublished data [7,8,15-22]. Publicly available data were downloaded from NCBI GEO, the Stanford Microarray Database or websites affiliated with the authors' laboratories. Data that were not publicly available were obtained by consent from the authors of the respective studies. The four array CGH platforms used for this study were the whole genome tiling path BAC (SMRT) array, the early access Affymetrix SNP 100 K array [16,20], the Stanford cDNA microarray and the Affymetrix SNP 10 K array. In addition, 2 of the cell lines were profiled for whole genome DNA methylation status using MeDIP array CGH [10,13]. For this launch version, we concentrated on available cell lines profiled on high density array platforms and did not include profiles from clinical specimens. A summary of the sources for the raw data is given in Table 1, while the detailed description of each of the cell lines in the collection is given in Additional file 1.
This application comprises three main components: a Java WebStart application interface allowing users to formulate queries and perform visualization, an Apache Web Service which connects the user application to the database, and a relational database implemented using MySQL (Figure 1A). Utilization of the Java WebStart technology ensures that users will have the latest version of the application, without the need for manual updating. In addition, the efficiency and speed of the application will largely be determined by the user's computer specifications. Hence, we have provided different versions of our application based on system resource utilization, allowing users with greater system resources to perform more analytical tasks per session.
SIGMA is a continually expanding database of array CGH experiments. The launch version contains 267 genomic profiles generated from cancer cell lines and is implemented using the MySQL relational database application. Each array CGH experiment is contained in a separate database table, allowing for easy and seamless expansion of this database. Upon addition of a new profile, a database table which contains the information on each array CGH experiment is updated. This table stores a record of each experiment, with the name of the cell line, the American Type Culture Collection (ATCC) identification (if applicable), the array platform used and the description of the cancer type as part of the schema. For two channel array-based profiles, the dye which was used to label the sample is also recorded. Lastly, mapping information pertaining to a clone or probe and its position in the genome is kept in a file with a fixed format, such that subsequent improvements and updates to the genomic positioning of the array elements can be easily incorporated. Moreover, since individual microarray software platforms use their own map version, map information for all platforms was compiled based on data from the UCSC Genome Browser. Currently, two genomic builds are supported: the April 2003 (hg15) and May 2004 (hg17) assemblies.
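The per-experiment record described above can be pictured as a small data structure. This is a sketch only: the class name, field names and example values are hypothetical and do not reproduce SIGMA's actual MySQL schema. Optional fields model the two attributes the text describes as conditional, namely the ATCC identification and the sample dye.

```java
import java.util.Optional;

/** Sketch of one row in the experiment table (hypothetical field names). */
final class ExperimentRecord {
    final String cellLine;
    final Optional<String> atccId;     // empty when no ATCC identification applies
    final String platform;
    final String cancerType;
    final Optional<String> sampleDye;  // recorded only for two channel arrays

    ExperimentRecord(String cellLine, String atccId, String platform,
                     String cancerType, String sampleDye) {
        this.cellLine = cellLine;
        this.atccId = Optional.ofNullable(atccId);
        this.platform = platform;
        this.cancerType = cancerType;
        this.sampleDye = Optional.ofNullable(sampleDye);
    }

    /** A record carries a dye only when the profile comes from a two channel platform. */
    boolean isTwoChannel() {
        return sampleDye.isPresent();
    }

    public static void main(String[] args) {
        // Illustrative values only; they are not taken from the SIGMA database.
        ExperimentRecord twoChannel = new ExperimentRecord(
                "EXAMPLE-LINE-1", "EXAMPLE-ID", "two channel platform", "example cancer type", "Cy5");
        ExperimentRecord oneChannel = new ExperimentRecord(
                "EXAMPLE-LINE-2", null, "single channel platform", "example cancer type", null);

        System.out.println(twoChannel.cellLine + " two channel: " + twoChannel.isTwoChannel());
        System.out.println(oneChannel.cellLine + " ATCC: " + oneChannel.atccId.orElse("not applicable"));
    }
}
```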
Data for each platform were processed as similarly as possible. SMRT array data were normalized using the stepwise framework for normalization with default parameters. Similarly, Affymetrix 10 K and Affymetrix 100 K data were normalized and processed using dChip with default settings. Specifically, the samples from the Affymetrix 10 K dataset of lung cancer cell lines were referenced against the group of matching blood lymphoblast lines and, similarly, the breast cancer cell lines were referenced against their matching blood lymphoblast lines. Affymetrix 100 K data from Zhao et al. (2004) were referenced against 12 normal individuals, and the NCI-60 profiles were referenced against 6 normal diploid controls. The gender of the samples was not specified; hence, data from the X chromosome may not accurately reflect copy number status. Segmentation of all data was performed using aCGH-Smooth, with data for the sex chromosomes removed prior to segmentation (as profiles were generated with sex matched or mismatched reference DNA in two channel hybridization experiments), and run with the settings of Lambda = 6.75 and "breakpoints per chromosome" = 100. Each element of the array is given a call with respect to normal: -1 if the element shows copy loss, 0 if the element shows no change in copy number and +1 if the element shows increased genomic content.
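The three-state call can be illustrated with a short sketch. The text does not state how a call is derived from the segmented data, so the sketch assumes a symmetric cutoff on a segmented log2 ratio. The cutoff value, class name and method names are hypothetical and are not taken from SIGMA or aCGH-Smooth.

```java
import java.util.Arrays;

/** Sketch: converting segmented log2 ratios into -1 / 0 / +1 calls with an assumed cutoff. */
final class CopyNumberCall {

    /** Hypothetical symmetric cutoff on the segmented log2 ratio; not a value from the paper. */
    static final double THRESHOLD = 0.2;

    /** Returns +1 for gain, -1 for loss and 0 for no change relative to the reference. */
    static int call(double segmentedLog2Ratio) {
        if (segmentedLog2Ratio > THRESHOLD) {
            return 1;
        }
        if (segmentedLog2Ratio < -THRESHOLD) {
            return -1;
        }
        return 0;
    }

    /** Applies the call to every array element of one profile. */
    static int[] callAll(double[] segmentedLog2Ratios) {
        int[] calls = new int[segmentedLog2Ratios.length];
        for (int i = 0; i < segmentedLog2Ratios.length; i++) {
            calls[i] = call(segmentedLog2Ratios[i]);
        }
        return calls;
    }

    public static void main(String[] args) {
        // Toy segmented values for five array elements.
        double[] profile = {0.05, 0.6, 0.6, -0.45, 0.0};
        System.out.println(Arrays.toString(callAll(profile)));
    }
}
```

Reducing each element to one of three states is what allows profiles from platforms with different signal scales to be compared on common terms.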
SIGMA: System for Integrative Genomic Microarray Analysis
CGH: Comparative Genomic Hybridization
SNP: Single Nucleotide Polymorphism
Project name: SIGMA (System for Integrative Genomic Microarray Analysis)
Project home page: http://sigma.bccrc.ca
Operating system(s): Platform independent
Programming language: Java
Other requirements: Java version 1.6+
License: Free for academic and research use, commercial users please request special permission
RC designed and developed SIGMA and wrote the manuscript. WWL and BPC were involved in the design and development. AC, DM and AT were the programmers. JJD was involved with initial design and CM provided manuscript editing and program feedback. WLL is the principal investigator of the lab.
Additional file 1: List of cell lines in database. Detailed description of all the cell lines.
Additional file 2: Single group visualization. Whole genome frequency plot of lung adenocarcinoma.
Additional file 3: Two group whole genome comparison. Whole genome frequency plot comparison of squamous cancer of the lung and cervix.
We thank Ian Wilson for useful discussion, Byron Cline for extensive testing of the application and Andrew Lam for tutorial development. This work was supported by funds from Genome Canada/Genome British Columbia, the Canadian Institutes of Health Research, and NIDCR grant RO1 DE15965-01.