Cancer is known to have abundant copy number alterations (CNAs) that greatly contribute to its pathogenesis and progression. Investigation of CNA regions could help identify oncogenes and tumor suppressor genes and infer cancer mechanisms. Although single-nucleotide polymorphism (SNP) arrays have strengthened our ability to identify CNAs with unprecedented resolution, a comprehensive collection of CNA information from SNP array data is still lacking. We developed CaSNP (http://cistrome.dfci.harvard.edu/CaSNP/), a web-based database for storing and interrogating quantitative CNA data, which curates ~11500 SNP arrays on 34 different cancer types from 104 studies. Given a user-specified region or gene of interest, CaSNP returns CNA information summarizing the frequencies of gain/loss and the average copy number for each study, and provides links to download the data or visualize it in the UCSC Genome Browser. For a more comprehensive visualization, CaSNP also displays a heatmap showing the copy numbers estimated at each SNP marker around the query region across all studies. Finally, we used CaSNP to study the CNAs of protein-coding genes as well as LincRNA genes across all cancer SNP arrays, and found putative regions harboring novel oncogenes and tumor suppressors. In summary, CaSNP is a useful tool for cancer CNA association studies, with the potential to facilitate both basic science and translational research on cancer.
Cancer is a complex genetic disease whose initiation and progression are often accompanied by genome alterations. Numerous copy number alterations (CNAs) are known to occur across the entire human genome in malignant neoplasms. Of the different types of genome variation, CNA has been the most implicated in oncogenesis and cancer progression, and many CNAs are known to be characteristic of specific types of cancers (1–3). There is a growing demand to understand the nature of CNA in cancer, as CNAs not only serve as biomarkers to predict cancer malignancy and prognosis, but also often harbor tumor suppressors and oncogenes (4,5), the study of which could shed light on the sequence and mechanism of oncogenesis. In addition, there is increasing evidence that some CNAs target noncoding RNA (ncRNA) genes such as miRNAs (6), suggesting that ncRNAs might be extensively involved in oncogenesis.
Array comparative genomic hybridization (aCGH) has long been the standard platform to investigate the relative gains and losses of genomic DNA by measuring the relative signal ratios of the differentially labeled array hybridization between tumor and normal samples. Several repositories focusing on CNAs detected from aCGH are already publicly available (7–9). However, most of these aCGH studies use BAC or cDNA probes, which have a coarse resolution for CNA detection.
In the last few years, single nucleotide polymorphism (SNP) arrays have gradually become the major platform for SNP genotyping and CNA detection (10,11). SNP arrays carry probes for known SNPs that are densely distributed across the human genome, and allow accurate SNP genotyping at these loci for individual biological samples. In addition, by comparing the signal intensities at the SNP loci between cancer and normal reference samples, one can also gain high-resolution CNA knowledge about the cancer of interest. It has been reported that SNP arrays outperform traditional aCGH in CNA detection resolution (12), enabling high-resolution SNP and CNA detection at the level of individual genes (13–15). Currently, some studies display their own results on dedicated websites (e.g. http://www.broadinstitute.org/tumorscape). However, a comprehensive resource of CNA data from all cancer SNP array experiments is still unavailable.
We present CaSNP as a comprehensive collection of CNA information inferred from cancer SNP array data. We analyzed ~11500 Affymetrix SNP arrays on 34 different cancer types in 104 studies to profile the genome-wide CNAs. This includes all the publicly available cancer SNP profiles using Affymetrix SNP arrays, mostly from Gene Expression Omnibus (GEO) (16). We also developed a data extraction and annotation schema to interrogate copy number on user-specified genomic region by cancer type and across different array platforms (from SNP 10K to 6.0) and studies. CaSNP is available at http://cistrome.dfci.harvard.edu/CaSNP/.
Among the 104 studies collected, 100 are from GEO, one is from GlaxoSmithKline (https://cabig.nci.nih.gov/tools/caArray_GSKdata) and three are from the supplementary websites of individual publications (17,18). The raw data (.cel files) of the array experiments and the accompanying genotype files (if available) were collected for the samples. dCHIP-SNP (19), a widely used and referenced SNP array analysis algorithm (cited 238 times according to Google Scholar), was applied to each data set. Raw array data within each study were normalized in dCHIP-SNP with invariant set normalization, and signal values for individual SNP loci were then computed with the model-based expression index method (20). The relative copy number value for each SNP was calculated as the signal ratio of tumor samples versus the average of normal reference samples within the same data set, and was exported and stored in CaSNP. For data sets with no normal reference samples, the average ‘normal reference’ was calculated for each SNP from the tumor samples bearing the middle 50% of signals (i.e. the 25% most extreme signals on each side were excluded). We did not use normal samples from other experiments of the same array type as a reference, in order to avoid potential microarray batch effects.
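The reference-based ratio described above can be sketched as follows. This is a minimal illustration and not the dCHIP-SNP implementation: the function names, the matrix layout (SNPs in rows, samples in columns) and the scaling of the ratio by a diploid baseline of 2 are assumptions made for the example.

```python
import numpy as np

def trimmed_reference(tumor_signal):
    """Per-SNP pseudo-normal reference from the middle 50% of tumor signals.

    tumor_signal: array of shape (n_snps, n_tumor_samples).
    The lowest 25% and highest 25% of signals at each SNP are excluded.
    """
    ordered = np.sort(tumor_signal, axis=1)
    n = ordered.shape[1]
    lo = int(np.floor(n * 0.25))
    hi = n - lo
    return ordered[:, lo:hi].mean(axis=1)

def relative_copy_number(tumor_signal, normal_signal=None, baseline=2.0):
    """Ratio of tumor signal to the average reference signal at each SNP.

    normal_signal: array of shape (n_snps, n_normal_samples), or None when
    the data set has no normal samples. Multiplying by `baseline` (assumed
    diploid, 2 copies) is an assumption of this sketch.
    """
    tumor_signal = np.asarray(tumor_signal, dtype=float)
    if normal_signal is None:
        reference = trimmed_reference(tumor_signal)
    else:
        reference = np.asarray(normal_signal, dtype=float).mean(axis=1)
    return baseline * tumor_signal / reference[:, None]
```

With matched normals, a tumor signal of 150 against a normal average of 100 would yield a copy number of 3 under the assumed diploid baseline.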
To treat and query copy number data from different array platforms in a unified manner, we updated the genome coordinate system to the latest human genome assembly (UCSC hg19). In addition, all SNP IDs were converted to dbSNP129. We also manually extracted and curated information on the clinical background of the samples and organized it at two levels: the top level describes the tissue of origin (e.g. lung cancer), while the second describes the cancer subtype (e.g. small cell lung carcinoma).
The only required field in a user query is the genome region, for which the user inputs a genomic coordinate range (limited to 2 Mb in size), a gene name, a RefSeq ID or an miRNA name; all of these are internally converted to a genomic coordinate range. The user can optionally specify the cancer type and subtype to limit the query. Alternatively, one can go to the ‘Browse Data’ page to select a subset of the data sets/series for analysis. The ‘Browse Data’ option allows the user to focus on specific studies or to conduct a joint analysis of two or more studies and/or across multiple cancer types. When the user specifies a cancer type, subtype or data sets/series, CaSNP consults the sample information table to extract the matching samples for analysis. In addition, the user can specify the upper and lower CNA thresholds (default 2.2 and 1.8, respectively) for CaSNP to calculate the percentage of samples beyond the thresholds within each study. A flowchart depicting the internal table schema of CaSNP is shown in Figure 1.
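As an illustration of the coordinate-range input, the following sketch parses a region string and enforces the 2 Mb limit. It is not CaSNP's actual code: the `chrN:start-end` syntax, the reading of 2 Mb as 2,000,000 bp, and the names `Region` and `parse_region` are assumptions; conversion of gene, RefSeq or miRNA names to coordinates is not shown.

```python
import re
from typing import NamedTuple

# Assumption: the 2 Mb query limit is interpreted as 2,000,000 bp.
MAX_REGION_BP = 2_000_000

class Region(NamedTuple):
    chrom: str
    start: int
    end: int

# Assumed input syntax: chr<1-22|X|Y>:<start>-<end>, commas allowed in numbers.
_REGION_RE = re.compile(r"^(chr(?:[0-9]{1,2}|X|Y)):([0-9,]+)-([0-9,]+)$")

def parse_region(text):
    """Parse a coordinate range and reject regions larger than the limit."""
    match = _REGION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized region: {text!r}")
    chrom = match.group(1)
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if start > end:
        raise ValueError("region start is greater than region end")
    if end - start > MAX_REGION_BP:
        raise ValueError("region exceeds the 2 Mb query limit")
    return Region(chrom, start, end)
```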
A screenshot of CaSNP’s result output page is shown in Figure 2. The most important results from a CaSNP query are the average copy number of the queried region for each of the series involved and the percentage of samples exceeding the copy number thresholds. The average copy number is calculated as the mean over all biological samples in each series. If a sample has data from multiple array platforms, all data for that sample are combined before the calculation. If the user specifies the upper or lower CNA threshold at input, the frequency of threshold-passing samples is also displayed for each series. This helps the user to determine whether an observed CNA is prevalent in many samples or caused only by outliers. The percentages of threshold-passing samples at the SNP loci in the region are also encoded in the bedGraph file format, which is the standard for displaying continuous-valued data as a track in the UCSC Genome Browser. The bedGraph files generated can be viewed directly in the UCSC Genome Browser (21) via a link, or downloaded. Also displayed on the result page are the sample and SNP counts for each series, links to the corresponding GEO entries at NCBI and other relevant information.
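The per-series summary and the bedGraph export can be sketched as below. This is a hedged illustration rather than CaSNP's implementation: the function names are hypothetical, and averaging the SNPs of a region to obtain one value per sample is an assumption. The bedGraph lines follow the UCSC convention of tab-separated chrom, zero-based start, end and value.

```python
import numpy as np

def summarize_series(copy_numbers, upper=2.2, lower=1.8):
    """Summarize one series for a queried region.

    copy_numbers: array of shape (n_snps_in_region, n_samples).
    Assumption: a sample's regional copy number is the mean over its SNPs.
    """
    per_sample = np.asarray(copy_numbers, dtype=float).mean(axis=0)
    return {
        "mean_copy_number": float(per_sample.mean()),
        "pct_gain": float((per_sample > upper).mean() * 100.0),
        "pct_loss": float((per_sample < lower).mean() * 100.0),
    }

def bedgraph_gain_track(chrom, positions, copy_numbers, upper=2.2):
    """bedGraph lines giving, per SNP, the percentage of samples above `upper`.

    positions: 1-based SNP coordinates; bedGraph uses 0-based, half-open
    intervals, so a SNP at position p becomes the interval [p - 1, p).
    """
    pct = (np.asarray(copy_numbers, dtype=float) > upper).mean(axis=1) * 100.0
    return [f"{chrom}\t{p - 1}\t{p}\t{v:.1f}" for p, v in zip(positions, pct)]
```

Reporting the gain and loss percentages alongside the mean is what lets a reader distinguish a CNA shared by many samples from one driven by a few outliers.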
A graphic display of the results is also provided through the ‘HeatMap’ query page (Figure 3). The series returned are grouped by array platform, with CNAs (loss to gain) expressed as a color gradient (blue to red) and normal diploid (copy number 2) shown in white. This gives users a comprehensive view of the copy number data in the queried region and cancer types. The heatmaps are dynamically generated from the data in the database.
CaSNP runs on an Apache web server and the data reside in a MySQL server. The scripts for query processing and data analysis are written in Python, and the user interface is based on the Django framework.
As an example of how CaSNP can be used for cancer biomarker or oncogene/tumor suppressor detection, we systematically extracted the copy number of all 20221 RefSeq genes from CaSNP. For each gene, we then calculated a G-score, a component of the GISTIC methodology (22), to summarize both the frequency and the amplitude of its copy number alteration across all ~11500 cancer samples. When comparing with annotated databases of known oncogenes (http://www.sanger.ac.uk/genetics/CGP/Census/) and tumor suppressor genes (http://cbio.mskcc.org/CancerGenes/), we found that regions with the highest or lowest G-scores often harbor known oncogenes and tumor suppressor genes, respectively (Figure 4). This partially validates the quality of the data and the accuracy of our copy number estimation. Interestingly, we observed that many chromosome ends show strong deletions in cancer, and harbor some well-known tumor suppressors such as STK11, TSPAN32, MAPK9 and PTGES.
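A G-score of this kind can be sketched as the product of alteration frequency and mean alteration amplitude. The sketch below is an assumption-laden illustration and not the exact procedure used here or in GISTIC: it works on log2 ratios against an assumed diploid baseline of 2, reuses the default CaSNP thresholds (2.2 and 1.8) to call gains and losses, and scores gains and losses separately; the function name `g_scores` is hypothetical.

```python
import numpy as np

def g_scores(copy_numbers, upper=2.2, lower=1.8, baseline=2.0):
    """Per-gene gain and loss scores: frequency times mean amplitude.

    copy_numbers: array of shape (n_genes, n_samples) with positive values.
    Returns (g_gain, g_loss), each of shape (n_genes,).
    """
    log_ratio = np.log2(np.asarray(copy_numbers, dtype=float) / baseline)
    gain_cut = np.log2(upper / baseline)
    loss_cut = np.log2(lower / baseline)

    # Keep the amplitude only where the sample passes the threshold.
    gain_amp = np.where(log_ratio > gain_cut, log_ratio, 0.0)
    loss_amp = np.where(log_ratio < loss_cut, -log_ratio, 0.0)

    # Averaging over all samples equals (fraction altered) x (mean amplitude
    # among altered samples), i.e. frequency times amplitude.
    return gain_amp.mean(axis=1), loss_amp.mean(axis=1)
```

For instance, a gene with copy number 4 in half of the samples and 2 in the rest has a gain frequency of 0.5 and a mean log2 amplitude of 1, giving a gain score of 0.5 under these assumptions.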
A very striking exception is a strongly amplified region on the left tip of chromosome 5, with no previously annotated tumor suppressors or oncogenes. The region has been implicated in breast cancer risk (23), and a recent cancer CNA study (24) identified TERT as the putative target gene of the amplification, but did not experimentally validate its function in breast cancer. Checking Oncomine (25), we found that TERT is not highly expressed in breast cancers. Instead, a nearby gene, IRX2, not only shows gene amplification and enhanced expression in breast cancers, but also has some literature support for playing a role in mammary gland neoplasia (26). Alternatively, the oncogene in the Chr5 left tip might be an ncRNA, so we investigated the CNAs of all 4013 newly identified LincRNAs (27) in mammalian genomes (Supplementary Figure S1). Although gene expression data for LincRNAs in breast cancers are still lacking, our analysis did generate interesting leads for potential follow-up validation of LincRNAs as tumor suppressors and oncogenes, and demonstrates the value of CaSNP.
Here, we have presented the CaSNP database for identifying and visualizing CNAs in cancers at any specific region within the human genome. CaSNP stores pre-computed raw copy numbers, and dynamically generates viewable and downloadable summaries of CNA status in response to user queries. A schema for uniformly processing, storing, annotating and presenting data across different data sets and platforms was successfully implemented, making CaSNP a useful tool for cancer genomic meta-studies. The query results contain numerical values of cancer copy numbers and the frequencies of CNA events, which are well suited for more detailed analysis by other software or methods. Besides the tabular display, the heatmap view displays SNP copy numbers in colors, enabling users to visualize the results intuitively and comprehensively, and helping them find novel CNA regions in subsets of samples. In addition, we provided a scenario of using CaSNP to explore cancer biomarkers or genes through a meta-analysis, and demonstrated CaSNP’s ability to suggest novel oncogenes/tumor suppressors, whether a protein-coding gene or an ncRNA.
Benefiting from the abundance of SNP array data sets in recent years, CaSNP is the largest repository of SNP array-oriented CNA data among databases of its type. The amount of publicly accessible SNP array data on cancer is still expanding, and so will the data collection in CaSNP. Such a large-scale analysis will be extremely valuable for correlating CNA data with genomic locations of specific diagnostic, prognostic or therapeutic value found in other studies, or for reducing noise from individual studies via meta-analysis. Now that high-throughput methods such as ChIP–chip or ChIP-seq can generate hundreds of thousands of regions of interest in a single run, CaSNP will be powerful for independent validation, such as screening for regions that might be related to oncogenesis and might go unnoticed in ChIP experiments alone. Besides collecting more data, we are committed to making better use of them. The loss-of-heterozygosity (LOH) information deduced from genotype data will be added, and the CNA status will be compared across different cancer types for specified regions and across the genome.
Supplementary Data are available at NAR Online.
The National Institutes of Health R01 (HG004069 to X.S.L., GM077122 to C.L.); Chinese Scholarship Council (2008632067 to Q.C.); State S & T Projects (11th Five Year) of China (2008ZX10002-007 to Z.C.); National Basic Research Program of China (973 Program No. 2010CB944904 to M.Z., X.W. and Y.Z.). Funding for open access charge: National Basic Research Program of China (973 Program No. 2010CB944904).
Conflict of interest statement. None declared.
We greatly appreciate Yi Wang, Scott Taing, Len Taing, Tao Liu and Luhua Zhang's help on the design and deployment of CaSNP.