Related Articles
Background
Comparative Genomic Hybridization (CGH) is a molecular approach for detecting DNA Copy Number Alterations (CNAs) in tumors, which are among the key causes of tumorigenesis. However, in the post-genomic era, most studies in cancer biology have focused on Gene Expression Profiling (GEP) rather than CGH, and as a result, an enormous amount of GEP data has accumulated in public databases for a wide variety of tumor types. We exploited this resource of GEP data to define possible recurrent CNAs in tumors. In addition, the CNAs identified by GEP would be more functionally relevant to disease pathogenesis, since the functional effects of CNAs can be reflected by altered gene expression.
Methods
We proposed a novel computational approach, coined virtual CGH (vCGH), which employs hidden Markov models (HMMs) to predict DNA CNAs from their corresponding GEP data. vCGH was first trained on the paired GEP and CGH data generated from a sufficient number of tumor samples, and then applied to the GEP data of a new tumor sample to predict its CNAs.
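The decoding step of such an HMM can be illustrated with a minimal sketch. It assumes three hidden copy-number states (loss, normal, gain) and expression values discretized into "down", "flat" and "up" calls for genes ordered along a chromosome. The state set, the discretization and every probability below are hypothetical placeholders, not the trained vCGH parameters, which the abstract says are learned from paired GEP and CGH data.

```python
import math

STATES = ("loss", "normal", "gain")

# Hypothetical parameters for illustration only; a real model would
# estimate them from paired expression and CGH training samples.
START = {"loss": 0.1, "normal": 0.8, "gain": 0.1}
TRANS = {
    "loss":   {"loss": 0.90, "normal": 0.09, "gain": 0.01},
    "normal": {"loss": 0.05, "normal": 0.90, "gain": 0.05},
    "gain":   {"loss": 0.01, "normal": 0.09, "gain": 0.90},
}
# Probability of each discretized expression call given the hidden state.
EMIT = {
    "loss":   {"down": 0.6, "flat": 0.3, "up": 0.1},
    "normal": {"down": 0.2, "flat": 0.6, "up": 0.2},
    "gain":   {"down": 0.1, "flat": 0.3, "up": 0.6},
}


def viterbi(observations):
    """Return the most likely hidden state path for genes in genomic order."""
    if not observations:
        return []
    # Work in log space to avoid underflow on long chromosomes.
    score = {
        s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
        for s in STATES
    }
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            best_prev = max(
                STATES, key=lambda p: score[p] + math.log(TRANS[p][s])
            )
            new_score[s] = (
                score[best_prev]
                + math.log(TRANS[best_prev][s])
                + math.log(EMIT[s][obs])
            )
            pointers[s] = best_prev
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    calls = ["flat", "flat", "up", "up", "up", "up", "flat", "flat"]
    print(viterbi(calls))
```

The sticky transition matrix is what lets the model call a contiguous region rather than reacting to every single gene whose expression fluctuates.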
Results
Using cross-validation on 190 Diffuse Large B-Cell Lymphomas (DLBCL), vCGH achieved 80% sensitivity, 90% specificity and 90% accuracy for CNA prediction. The majority of the recurrent regions defined by vCGH are concordant with the experimental CGH, including gains of 1q, 2p16-p14, 3q27-q29, 6p25-p21, 7, 11q, 12 and 18q21, and losses of 6q, 8p23-p21, 9p24-p21 and 17p13 in DLBCL. In addition, vCGH predicted some recurrent functional abnormalities which were not observed in CGH, including gains of 1p, 2q and 6q and losses of 1q, 6p and 8q. Among those novel loci, 1q, 6q and 8q were significantly associated with the clinical outcomes in the DLBCL patients (p < 0.05).
Conclusions
We developed a novel computational approach, vCGH, to predict genome-wide genetic abnormalities from GEP data in lymphomas. vCGH can be generally applied to other types of tumors and may significantly enhance the detection of functionally important genetic abnormalities in cancer research.
doi:10.1186/1755-8794-4-32
PMCID: PMC3086850
PMID: 21486456
Array-Comparative Genomic Hybridization (aCGH) is a powerful high-throughput technology for detecting chromosomal copy number aberrations (CNAs) in cancer, aiming at identifying related critical genes from the affected genomic regions. However, advancing from a dataset with thousands of tabular lines to a few candidate genes can be an onerous and time-consuming process. To expedite the aCGH data analysis process, we have developed a user-friendly aCGH data viewer (aCGHViewer) as a conduit between the aCGH data tables and a genome browser. The data from a given aCGH analysis are displayed in a genomic view comprised of individual chromosome panels which can be rapidly scanned for interesting features. A chromosome panel containing a feature of interest can be selected to launch a detail window for that single chromosome. Selecting a data point of interest in the detail window launches a query to the UCSC or NCBI genome browser to allow the user to explore the gene content in the chromosomal region. Additionally, aCGHViewer can display aCGH and expression array data concurrently to visually correlate the two. aCGHViewer is a stand-alone Java visualization application that should be used in conjunction with separate statistical programs. It operates on all major computer platforms and is freely available at http://falcon.roswellpark.org/aCGHview/.
PMCID: PMC1847423
PMID: 17404607
array-CGH; CNA; gene expression; visualization
Background
DNA copy number aberration (CNA) is one of the key characteristics of cancer cells. Recent studies demonstrated the feasibility of utilizing high density single nucleotide polymorphism (SNP) genotyping arrays to detect CNA. Compared with the two-color array-based comparative genomic hybridization (array-CGH), the SNP arrays offer much higher probe density and lower signal-to-noise ratio at the single SNP level. To accurately identify small segments of CNA from SNP array data, segmentation methods that are sensitive to CNA while resistant to noise are required.
Results
We have developed a highly sensitive algorithm for the edge detection of copy number data which is especially suitable for the SNP array-based copy number data. The method consists of an over-sensitive edge-detection step and a test-based forward-backward edge selection step.
Conclusion
Using simulations constructed from real experimental data, the method shows high sensitivity and specificity in detecting small copy number changes in focused regions. The method is implemented in an R package FASeg, which includes data processing and visualization utilities, as well as libraries for processing Affymetrix SNP array data.
doi:10.1186/1471-2105-8-145
PMCID: PMC1868765
PMID: 17477871
Background
Copy number alterations (CNAs) in genomic DNA have been associated with complex human diseases, including cancer. One of the most common techniques to detect CNAs is array-based comparative genomic hybridization (aCGH). The availability of aCGH platforms and the need for identification of CNAs has resulted in a wealth of methodological studies.
Methodology/Principal Findings
ADaCGH is an R package and a web-based application for the analysis of aCGH data. It implements eight methods for detection of CNAs, gains and losses of genomic DNA, including all of the best performing ones from two recent reviews (CBS, GLAD, CGHseg, HMM). For improved speed, we use parallel computing (via MPI). Additional information (GO terms, PubMed citations, KEGG and Reactome pathways) is available for individual genes, and for sets of genes with altered copy numbers.
Conclusions/Significance
ADaCGH represents a qualitative increase in the standards of these types of applications: a) all of the best performing algorithms are included, not just one or two; b) we do not limit ourselves to providing a thin layer of CGI on top of existing BioConductor packages, but instead carefully use parallelization, examining different schemes, and are able to achieve significant decreases in user waiting time (factors up to 45×); c) we have added functionality not currently available in some methods, to adapt to recent recommendations (e.g., merging of segmentation results in wavelet-based and CGHseg algorithms); d) we incorporate redundancy, fault-tolerance and checkpointing, which are unique among web-based, parallelized applications; e) all of the code is available under open source licenses, allowing others to build upon, copy, and adapt our code for other software projects.
doi:10.1371/journal.pone.0000737
PMCID: PMC1940324
PMID: 17710137
De Bortoli, Massimiliano | Castellino, Robert C | Lu, Xin-Yan | Deyo, Jeffrey | Sturla, Lisa Marie | Adesina, Adekunle M | Perlaky, Laszlo | Pomeroy, Scott L | Lau, Ching C | Man, Tsz-Kwong | Rao, Pulivarthi H | Kim, John YH
Background
Medulloblastoma is the most common malignant brain tumor of childhood. Improvements in clinical outcome require a better understanding of the genetic alterations to identify clinically significant biological factors and to stratify patients accordingly. In the present study, we applied cytogenetic characterization to guide the identification of biologically significant genes from gene expression microarray profiles of medulloblastoma.
Methods
We analyzed 71 primary medulloblastomas for chromosomal copy number aberrations (CNAs) using comparative genomic hybridization (CGH). Among 64 tumors that we previously analyzed by gene expression microarrays, 27 were included in our CGH series. We analyzed clinical outcome with respect to CNAs and microarray results. We filtered microarray data using specific CNAs to detect differentially expressed candidate genes associated with survival.
Results
The most frequent lesions detected in our series involved chromosome 17; loss of 16q, 10q, or 8p; and gain of 7q or 2p. Recurrent amplifications at 2p23-p24, 2q14, 7q34, and 12p13 were also observed. Gain of 8q is associated with worse overall survival (p = 0.0141), which is not entirely attributable to MYC amplification or overexpression. By applying CGH results to gene expression analysis of medulloblastoma, we identified three 8q-mapped genes that are associated with overall survival in the larger group of 64 patients (p < 0.05): eukaryotic translation elongation factor 1D (EEF1D), ribosomal protein L30 (RPL30), and ribosomal protein S20 (RPS20).
Conclusion
The complementary use of CGH and expression profiles can facilitate the identification of clinically significant candidate genes involved in medulloblastoma growth. We demonstrate that gain of 8q and expression levels of three 8q-mapped candidate genes (EEF1D, RPL30, RPS20) are associated with adverse outcome in medulloblastoma.
doi:10.1186/1471-2407-6-223
PMCID: PMC1578584
PMID: 16968546
Genomic DNA copy-number alterations (CNAs) are associated with complex diseases, including cancer: CNAs are indeed related to tumoral grade, metastasis, and patient survival. CNAs discovered from array-based comparative genomic hybridization (aCGH) data have been instrumental in identifying disease-related genes and potential therapeutic targets. To be immediately useful in both clinical and basic research scenarios, aCGH data analysis requires accurate methods that do not impose unrealistic biological assumptions and that provide direct answers to the key question, “What is the probability that this gene/region has CNAs?” Current approaches fail, however, to meet these requirements. Here, we introduce reversible jump aCGH (RJaCGH), a new method for identifying CNAs from aCGH; we use a nonhomogeneous hidden Markov model fitted via reversible jump Markov chain Monte Carlo; and we incorporate model uncertainty through Bayesian model averaging. RJaCGH provides an estimate of the probability that a gene/region has CNAs while incorporating interprobe distance and the capability to analyze data on a chromosome or genome-wide basis. RJaCGH outperforms alternative methods, and the performance difference is even larger with noisy data and highly variable interprobe distance, both commonly found features in aCGH data. Furthermore, our probabilistic method allows us to identify minimal common regions of CNAs among samples and can be extended to incorporate expression data. In summary, we provide a rigorous statistical framework for locating genes and chromosomal regions with CNAs with potential applications to cancer and other complex human diseases.
Author Summary
As a consequence of problems during cell division, the number of copies of a gene in a chromosome can either increase or decrease. These copy-number alterations (CNAs) can play a crucial role in the emergence of complex multigenic diseases. For example, in cancer, amplification of oncogenes can drive tumor activation, and CNAs are associated with metastasis development and patient survival. Studies on the relationship between CNAs and disease have been recently fueled by the widespread use of array-based comparative genomic hybridization (aCGH), a technique with much finer resolution than previous experimental approaches. Detection of CNAs from these data depends on methods of analysis that do not impose biologically unrealistic assumptions and that provide direct answers to fundamental research questions. We have developed a statistical method, using a Bayesian approach, that returns estimates of the probabilities of CNAs from aCGH data, the most direct and valuable answer to the key biological question: “What is the probability that this gene/region has an altered copy number?” The output of the method can therefore be immediately used in different settings from clinical to basic research scenarios, and is applicable over a wide variety of aCGH technologies.
doi:10.1371/journal.pcbi.0030122
PMCID: PMC1894821
PMID: 17590078
Cancer progression is often driven by an accumulation of genetic changes but also accompanied by increasing genomic instability. These processes lead to a complicated landscape of copy number alterations (CNAs) within individual tumors and great diversity across tumor samples. High resolution array-based comparative genomic hybridization (aCGH) is being used to profile CNAs of ever larger tumor collections, and better computational methods for processing these data sets and identifying potential driver CNAs are needed. Typical studies of aCGH data sets take a pipeline approach, starting with segmentation of profiles, calls of gains and losses, and finally determination of frequent CNAs across samples. A drawback of pipelines is that choices at each step may produce different results, and biases are propagated forward. We present a mathematically robust new method that exploits probe-level correlations in aCGH data to discover subsets of samples that display common CNAs. Our algorithm is related to recent work on maximum-margin clustering. It does not require pre-segmentation of the data and also provides grouping of recurrent CNAs into clusters. We tested our approach on a large cohort of glioblastoma aCGH samples from The Cancer Genome Atlas and recovered almost all CNAs reported in the initial study. We also found additional significant CNAs missed by the original analysis but supported by earlier studies, and we identified significant correlations between CNAs.
doi:10.1371/journal.pone.0012028
PMCID: PMC2920822
PMID: 20711339
Tumor formation is in part driven by DNA copy number alterations (CNAs), which can be measured using microarray-based Comparative Genomic Hybridization (aCGH). Multiexperiment analysis of aCGH data from tumors allows discovery of recurrent CNAs that are potentially causal to cancer development. Until now, multiexperiment aCGH data analysis has been dependent on discretization of measurement data to a gain, loss or no-change state. Valuable biological information is lost when a heterogeneous system such as a solid tumor is reduced to these states. We have developed a new approach which inputs nondiscretized aCGH data to identify regions that are significantly aberrant across an entire tumor set. Our method is based on kernel regression and accounts for the strength of a probe's signal, its local genomic environment and the signal distribution across multiple tumors. In an analysis of 89 human breast tumors, our method showed enrichment for known cancer genes in the detected regions and identified aberrations that are strongly associated with breast cancer subtypes and clinical parameters. Furthermore, we identified 18 recurrent aberrant regions in a new dataset of 19 p53-deficient mouse mammary tumors. These regions, combined with gene expression microarray data, point to known cancer genes and novel candidate cancer genes.
doi:10.1093/nar/gkm1143
PMCID: PMC2241875
PMID: 18187509
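The idea of working on non-discretized aCGH signals with kernel regression, as in the abstract above, can be sketched as follows. This is a minimal illustration that assumes a Gaussian kernel and a simple probe-wise mean across tumors. The function names and the bandwidth value are hypothetical, and the sketch does not reproduce the significance testing or signal weighting of the published method.

```python
import math


def mean_profile(samples):
    """Average the raw log-ratios across tumors, probe by probe.

    `samples` is a list of per-tumor lists, all with the same probe order.
    """
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]


def kernel_smooth(positions, log_ratios, bandwidth):
    """Nadaraya-Watson estimate with a Gaussian kernel at each probe position.

    Each smoothed value is a weighted mean of all log-ratios, with weights
    that decay with genomic distance from the probe being estimated.
    """
    smoothed = []
    for x in positions:
        weights = [
            math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in positions
        ]
        total = sum(weights)
        smoothed.append(
            sum(w * y for w, y in zip(weights, log_ratios)) / total
        )
    return smoothed


if __name__ == "__main__":
    # Toy data: three tumors, five probes, positions in megabases.
    positions = [1.0, 2.0, 3.0, 4.0, 5.0]
    tumors = [
        [0.0, 0.1, 0.9, 0.1, 0.0],
        [0.1, 0.0, 0.7, 0.0, -0.1],
        [-0.1, 0.2, 0.8, 0.2, 0.1],
    ]
    profile = mean_profile(tumors)
    print(kernel_smooth(positions, profile, bandwidth=1.0))
```

Because no gain/loss call is made per tumor, a weak signal that is present in many samples still contributes to the smoothed profile instead of being thresholded away.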
We propose a statistical framework, named genoCN, to simultaneously dissect copy number states and genotypes using high-density SNP (single nucleotide polymorphism) arrays. There are at least two types of genomic DNA copy number differences: copy number variations (CNVs) and copy number aberrations (CNAs). While CNVs are naturally occurring and inheritable, CNAs are acquired somatic alterations most often observed in tumor tissues only. CNVs tend to be short and more sparsely located in the genome compared with CNAs. GenoCN consists of two components, genoCNV and genoCNA, designed for CNV and CNA studies, respectively. In contrast to most existing methods, genoCN is more flexible in that the model parameters are estimated from the data instead of being decided a priori. GenoCNA also incorporates two important strategies for CNA studies. First, the effects of tissue contamination are explicitly modeled. Second, if SNP arrays are performed for both tumor and normal tissues of one individual, the genotype calls from normal tissue are used to study CNAs in tumor tissue. We evaluated genoCN by applications to 162 HapMap individuals and a brain tumor (glioblastoma) dataset and showed that our method can successfully identify both types of copy number differences and produce high-quality genotype calls.
doi:10.1093/nar/gkp493
PMCID: PMC2935461
PMID: 19581427
Background. Array-based comparative genomic hybridization (array-CGH) is an emerging high-resolution and high-throughput molecular genetic technique that allows genome-wide screening for chromosome alterations. DNA copy number alterations (CNAs) are a hallmark of somatic mutations in tumor genomes and congenital abnormalities that lead to diseases such as mental retardation. However, accurate identification of amplified or deleted regions requires a sequence of different computational analysis steps of the microarray data. Results. We have developed a user-friendly and versatile tool for the normalization, visualization, breakpoint detection, and comparative analysis of array-CGH data which allows the accurate and sensitive detection of CNAs. Conclusion. The implemented option for the determination of minimal altered regions (MARs) from a series of tumor samples is a step forward in the identification of new tumor suppressor genes or oncogenes.
doi:10.1155/2009/201325
PMCID: PMC2728899
PMID: 19696946
The Overlay Tool© has been developed to combine high throughput data derived from various microarray platforms. This tool analyzes high-resolution correlations between gene expression changes and either copy number abnormalities (CNAs) or loss of heterozygosity events detected using array comparative genomic hybridization (aCGH). Using an overlay analysis which is designed to be performed using data from multiple microarray platforms on a single biological sample, the Overlay Tool© identifies potentially important genes whose expression profiles are changed as a result of losses, gains and amplifications in the cancer genome. In addition, the Overlay Tool© will incorporate loss of heterozygosity (LOH) probability data into this overlay procedure. To facilitate this analysis, we developed an application which computationally combines two or more high throughput datasets (e.g. aCGH/expression) into a single categorized dataset for visualization and interrogation using a gene-centric approach. As such, data from virtually any microarray platform can be incorporated without the need to remap entire datasets individually. The resultant categorized (overlay) data set can be conveniently viewed using our in-house visualization tool, aCGHViewer© (Shankar et al. 2006), which serves as a conduit to public databases such as UCSC and NCBI, to rapidly investigate genes of interest.
PMCID: PMC2675835
PMID: 19455250
Overlay Analysis; Microarray; ACGH; expression profiling; CNAs; aCGHViewer
Motivation
DNA copy number aberrations (CNAs) and gene expression (GE) changes provide valuable information for studying chromosomal instability and its consequences in cancer. While it is clear that the structural aberrations and the transcript levels are intertwined, their relationship is more complex and subtle than initially suspected. Most studies so far have focused on how a CNA affects the expression levels of those genes contained within that CNA.
Results
To better understand the impact of CNAs on expression, we investigated the correlation of each CNA to all other genes in the genome. The correlations are computed over multiple patients that have both expression and copy number measurements in brain, bladder, and breast cancer data sets. We find that a CNA has a direct impact on the gene amplified or deleted, but it also has a broad, indirect impact elsewhere. To identify a set of CNAs that is coordinately associated with the expression changes of a set of genes, we used a biclustering algorithm on the correlation matrix. For each of the three cancer types examined, the aberrations in several loci are associated with cancer-type specific biological pathways that have been described in the literature: CNAs of chromosome (chr) 7p13 were significantly correlated with epidermal growth factor receptor signaling pathway in glioblastoma multiforme, chr 13q with NF-kappaB cascades in bladder cancer, and chr 11p with Reck pathway in breast cancer. In all three data sets, gene sets related to cell cycle/division such as M phase, DNA replication, and cell division were also associated with CNAs. Our results suggest that CNAs are both directly and indirectly correlated with changes in expression and that it is beneficial to examine the indirect effects of CNAs.
doi:10.1093/bioinformatics/btn034
PMCID: PMC2600603
PMID: 18263644
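The first step described in the abstract above, correlating each CNA with every gene across patients, can be sketched as a locus-by-gene correlation table. The sketch assumes Pearson correlation and toy dictionaries keyed by hypothetical locus and gene names. The biclustering step that the authors apply to the resulting matrix is not shown.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def cna_expression_correlations(copy_number, expression):
    """Correlate every locus with every gene across the same patients.

    copy_number: dict mapping locus name -> list of per-patient values
    expression:  dict mapping gene name  -> list of per-patient values
    Returns a nested dict: result[locus][gene] = correlation.
    """
    return {
        locus: {gene: pearson(cn, ex) for gene, ex in expression.items()}
        for locus, cn in copy_number.items()
    }


if __name__ == "__main__":
    # Hypothetical toy data for four patients.
    copy_number = {"locus_1": [2.0, 3.0, 4.0, 5.0]}
    expression = {
        "gene_in_locus": [1.0, 2.0, 3.0, 4.0],
        "gene_elsewhere": [4.0, 3.0, 2.0, 1.0],
    }
    print(cna_expression_correlations(copy_number, expression))
```

Including genes outside the altered locus in the table is what exposes the indirect, genome-wide associations that the abstract emphasizes.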
Aim
To investigate overall chromosomal alterations using array‐based comparative genomic hybridisation (CGH) of myxoid liposarcomas (MLSs) and myxofibrosarcomas (MFSs).
Materials and methods
Genomic DNA extracted from fresh‐frozen tumour tissues was labelled with fluorochromes and then hybridised on to an array consisting of 1440 bacterial artificial chromosome clones representing regions throughout the entire human genome important in cytogenetics and oncology.
Results
DNA copy number aberrations (CNAs) were found in all the 8 MFSs, but no alterations were found in 7 (70%) of 10 MLSs. In MFSs, the most frequent CNAs were gains at 7p21.1–p22.1 and 12q15–q21.1 and a loss at 13q14.3–q34. The second most frequent CNAs were gains at 7q33–q35, 9q22.31–q22.33, 12p13.32–pter, 17q22–q23, Xp11.2 and Xq12 and losses at 10p13–p14, 10q25, 11p11–p14, 11q23.3–q25, 20p11–p12 and 21q22.13–q22.2, which were detected in 38% of the MFSs examined. In MLSs, only a few CNAs were found in two sarcomas with gains at 8p21.2–p23.3, 8q11.22–q12.2 and 8q23.1–q24.3, and in one with gains at 5p13.2–p14.3 and 5q11.2–5q35.2 and a loss at 21q22.2–qter.
Conclusions
MFS has more frequent and diverse CNAs than MLS, which reinforces the hypothesis that MFS is genetically different from MLS. Our array CGH analysis may also provide several entry points for the identification of candidate genes associated with oncogenesis and progression in MFS.
doi:10.1136/jcp.2005.034942
PMCID: PMC1860469
PMID: 16751306
Varma, G | Varma, R | Huang, H | Pryshchepava, A | Groth, J | Fleming, D | Nowak, N J | McQuaid, D | Conroy, J | Mahoney, M | Moysich, K | Falkner, K L | Geradts, J
High-resolution array comparative genomic hybridisation (aCGH) analysis of DNA copy number aberrations (CNAs) was performed on breast carcinomas in premenopausal women from Western New York (WNY) and from Gomel, Belarus, an area exposed to fallout from the 1986 Chernobyl nuclear accident. Genomic DNA was isolated from 47 frozen tumour specimens from 42 patients and hybridised to arrays spotted with more than 3000 BAC clones. In all, 20 samples were from WNY and 27 were from Belarus. In total, 34 samples were primary tumours and 13 were lymph node metastases, including five matched pairs from Gomel. The average number of total CNAs per sample was 76 (range 35–134). We identified 152 CNAs (92 gains and 60 losses) occurring in more than 10% of the samples. The most common amplifications included gains at 8q13.2 (49%), at 1p21.1 (36%), and at 8q24.21 (36%). The most common deletions were at 1p36.22 (26%), at 17p13.2 (26%), and at 8p23.3 (23%). Belarussian tumours had more amplifications and fewer deletions than WNY breast cancers. HER2/neu negativity and younger age were also associated with a higher number of gains and fewer losses. In the five paired samples, we observed more discordant than concordant DNA changes. Unsupervised hierarchical cluster analysis revealed two distinct groups of tumours: one comprised predominantly of Belarussian carcinomas and the other largely consisting of WNY cases. In total, 50 CNAs occurred significantly more commonly in one cohort vs the other, and these included some candidate signature amplifications in the breast cancers in women exposed to significant radiation. In conclusion, our high-density aCGH study has revealed a large number of genetic aberrations in individual premenopausal breast cancer specimens, some of which had not been reported before. We identified a distinct CNA profile for carcinomas from a nuclear fallout area, suggesting a possible molecular fingerprint of radiation-associated breast cancer.
doi:10.1038/sj.bjc.6602784
PMCID: PMC2361621
PMID: 16222315
amplification; array CGH; breast cancer; deletion; radiation
Recently, microarray-based comparative genomic hybridization (array-CGH) has emerged as a very efficient technology with higher resolution for the genome-wide identification of copy number alterations (CNAs). Although CNAs are thought to affect gene expression, there is no platform currently available for integrated CNA-expression analysis. To achieve high-resolution copy number analysis integrated with expression profiles, we established a human 30k oligoarray-based genome-wide copy number analysis system and explored the applicability of this system for integrated genome and transcriptome analysis using the MDA-MB-231 cell line. We compared the CNAs detected by the oligoarray with those detected by the 3k BAC array for validation. The oligoarray identified single-copy differences more accurately and sensitively than the BAC array. Seventeen CNAs detected by both platforms in MDA-MB-231, such as gains of 5p15.33-13.1, 8q11.22-8q21.13, and 17p11.2, and losses of 1p32.3, 8p23.3-8p11.21, and 9p21, were consistently identified in previous studies on breast cancer. There were 122 other small CNAs (mean size 1.79 Mb) that were detected by the oligoarray only, not by the BAC array. We performed genomic qPCR targeting 7 CNA regions detected by the oligoarray only, and one non-CNA region, to validate the oligoarray CNA detection. All qPCR results were consistent with the oligoarray-CGH results. When we explored the possibility of combined interpretation of both DNA copy number and RNA expression profiles, mean DNA copy number and RNA expression levels showed a significant correlation. In conclusion, this 30k oligoarray-CGH system can be a reasonable choice for analyzing whole genome CNAs and RNA expression profiles at a lower cost.
doi:10.3858/emm.2009.41.7.051
PMCID: PMC2721143
PMID: 19322034
cell line, tumor; gene dosage; gene expression profiling; oligonucleotide array sequence analysis
Yuan, Xiguo | Yu, Guoqiang | Hou, Xuchu | Shih, Ie-Ming | Clarke, Robert | Zhang, Junying | Hoffman, Eric P | Wang, Roger R | Zhang, Zhen | Wang, Yue
Background
Somatic Copy Number Alterations (CNAs) in human genomes are present in almost all human cancers. Systematic efforts to characterize such structural variants must effectively distinguish significant consensus events from random background aberrations. Here we introduce Significant Aberration in Cancer (SAIC), a new method for characterizing and assessing the statistical significance of recurrent CNA units. Three main features of SAIC include: (1) exploiting the intrinsic correlation among consecutive probes to assign a score to each CNA unit instead of single probes; (2) performing permutations on CNA units that preserve correlations inherent in the copy number data; and (3) iteratively detecting Significant Copy Number Aberrations (SCAs) and estimating an unbiased null distribution by applying an SCA-exclusive permutation scheme.
Results
We test and compare the performance of SAIC against four peer methods (GISTIC, STAC, KC-SMART, CMDS) on a large number of simulation datasets. Experimental results show that SAIC outperforms peer methods in terms of larger area under the Receiver Operating Characteristics curve and increased detection power. We then apply SAIC to analyze structural genomic aberrations acquired in four real cancer genome-wide copy number data sets (ovarian cancer, metastatic prostate cancer, lung adenocarcinoma, glioblastoma). When compared with previously reported results, SAIC successfully identifies most SCAs known to be of biological significance and associated with oncogenes (e.g., KRAS, CCNE1, and MYC) or tumor suppressor genes (e.g., CDKN2A/B). Furthermore, SAIC identifies a number of novel SCAs in these copy number data that encompass tumor related genes and may warrant further studies.
Conclusions
Supported by a well-grounded theoretical framework, SAIC has been developed and used to identify SCAs in various cancer copy number data sets, providing useful information to study the landscape of cancer genomes. Open-source and platform-independent SAIC software is implemented using C++, together with R scripts for data formatting and Perl scripts for user interfacing, and it is easy to install and efficient to use. The source code and documentation are freely available at http://www.cbil.ece.vt.edu/software.htm.
doi:10.1186/1471-2164-13-342
PMCID: PMC3428679
PMID: 22839576
Background
Copy number aberrations (CNAs) are an important molecular signature in cancer initiation, development, and progression. However, these aberrations span a wide range of chromosomes, making it hard to distinguish cancer related genes from other genes that are not closely related to cancer but are located in broadly aberrant regions. With the current availability of high-resolution data sets such as single nucleotide polymorphism (SNP) microarrays, it has become an important issue to develop a computational method to detect driving genes related to cancer development located in the focal regions of CNAs.
Results
In this study, we introduce a novel method referred to as the wavelet-based identification of focal genomic aberrations (WIFA). Because wavelet analysis is a multi-resolution approach, it makes it possible to effectively identify focal genomic aberrations within broadly aberrant regions. The proposed method integrates multiple cancer samples so that it can detect aberrations that are consistent across samples. We then apply this method to glioblastoma multiforme and lung cancer data sets from the SNP microarray platform. Through this process, we confirm the ability to detect previously known cancer related genes from both cancer types with high accuracy. Also, the application of this approach to a lung cancer data set identifies focal amplification regions that contain known oncogenes, though these regions are not reported by a recent CNA detection algorithm, GISTIC: SMAD7 (chr18q21.1) and FGF10 (chr5p12).
Conclusions
Our results suggest that WIFA can be used to reveal cancer related genes in various cancer data sets.
doi:10.1186/1471-2105-12-146
PMCID: PMC3114745
PMID: 21569311
Background
Both somatic copy number alterations (CNAs) and germline copy number variants (CNVs) that are prevalent in healthy individuals can appear as recurrent changes in comparative genomic hybridization (CGH) analyses of tumors. In order to identify important cancer genes, CNAs and CNVs must be distinguished. Although the Database of Genomic Variants (DGV) contains a list of all known CNVs, there is no standard methodology to use the database effectively.
Results
We develop a prediction model that distinguishes CNVs from CNAs based on the information contained in the DGV and several other variables, including a segment's length, height, closeness to a telomere or centromere, and occurrence in other patients. The models are fitted on data from glioblastoma samples and their corresponding normal samples, which were collected as part of The Cancer Genome Atlas project and hybridized to Agilent 244 K arrays.
Conclusions
Using the DGV alone, CNVs in the test set can be correctly identified with about 85% accuracy if the outliers are removed before segmentation and with 72% accuracy if the outliers are included; additional variables improve the prediction by about 2-3% and 12%, respectively. Final models applied to data from ovarian tumors have about 90% accuracy with all the variables and 86% accuracy with the DGV alone.
doi:10.1186/1471-2105-11-297
PMCID: PMC2897829
PMID: 20525196
Zhang, Qunyuan | Ding, Li | Larson, David E. | Koboldt, Daniel C. | McLellan, Michael D. | Chen, Ken | Shi, Xiaoqi | Kraja, Aldi | Mardis, Elaine R. | Wilson, Richard K. | Borecki, Ingrid B. | Province, Michael A.
Motivation: DNA copy number aberration (CNA) is a hallmark of genomic abnormality in tumor cells. Recurrent CNA (RCNA) occurs in multiple cancer samples across the same chromosomal region and has greater implications for tumorigenesis. Commonly used methods for RCNA identification require CNA calling for individual samples before cross-sample analysis. This two-step strategy may result in a heavy computational burden, as well as a loss of overall statistical power due to segmentation and discretization of each sample's data. We propose a population-based approach for RCNA detection that needs no single-sample analysis and is statistically powerful, computationally efficient and particularly suitable for high-resolution and large-population studies.
Results: Our approach, correlation matrix diagonal segmentation (CMDS), identifies RCNAs based on a between-chromosomal-site correlation analysis. By directly using the raw intensity ratio data from all samples and adopting a diagonal transformation strategy, CMDS substantially reduces the computational burden and can obtain results very quickly from large datasets. Our simulation indicates that the statistical power of CMDS is higher than that of two-step approaches based on single-sample CNA calling. We applied CMDS to two real datasets of lung cancer and brain cancer from Affymetrix and Illumina array platforms, respectively, and successfully identified known regions of CNA associated with EGFR, KRAS and other important oncogenes. CMDS provides a fast, powerful and easily implemented tool for the RCNA analysis of large-scale data from cancer genomes.
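The idea of scoring chromosomal sites by their correlation with nearby sites across samples can be illustrated with a minimal sketch. This is not the authors' R/C implementation: the function name `diagonal_correlation_score`, the `half_width` parameter and the simulated data are all assumptions made for illustration. The sketch only averages, for each site, the Pearson correlations that lie in a narrow band around the diagonal of the site-by-site correlation matrix; it performs no segmentation or significance testing.

```python
import numpy as np

def diagonal_correlation_score(ratios, half_width=5):
    """Per-site mean correlation with neighbouring sites (hypothetical helper).

    ratios: array of shape (n_samples, n_sites) holding raw intensity ratios.
    Only correlations within `half_width` sites of the diagonal are used,
    so the full site-by-site correlation matrix is never built.
    """
    n_samples, n_sites = ratios.shape
    centered = ratios - ratios.mean(axis=0)
    std = centered.std(axis=0)
    std[std == 0] = 1.0              # guard against constant sites
    z = centered / std               # z-scores of each site across samples
    scores = np.zeros(n_sites)
    for i in range(n_sites):
        lo = max(0, i - half_width)
        hi = min(n_sites, i + half_width + 1)
        neighbours = [j for j in range(lo, hi) if j != i]
        # Pearson correlation equals the mean product of z-scores.
        corr = z[:, neighbours].T @ z[:, i] / n_samples
        scores[i] = corr.mean()
    return scores

# Simulated data (an assumption, not real array data): 60 samples, 200 sites,
# with a gain at sites 80-119 carried by the first 30 samples.
rng = np.random.default_rng(0)
ratios = rng.normal(0.0, 0.2, size=(60, 200))
ratios[:30, 80:120] += 0.6
scores = diagonal_correlation_score(ratios)
print(scores[85:115].mean(), scores[:70].mean())
```

Sites inside a region that is altered in a subset of samples vary together across samples, so their neighbour correlations are high, while sites in unaltered regions carry only independent noise and score near zero.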
Availability: The R and C programs implementing our method are available at https://dsgweb.wustl.edu/qunyuan/software/cmds.
Contact: qunyuan@wustl.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btp708
PMCID: PMC2852218
PMID: 20031968
Defining regions of genomic imbalance can identify genes involved in tumour development. Conventional cytogenetics has identified several nonrandom copy number alterations (CNA) in uveal melanomas (UVM), which include monosomy 3, chromosome 6 abnormalities and gain of 8q. To gain further insight into the CNAs and define the regions involved more precisely we analysed 18 primary UVMs using 1 Mb BAC microarray comparative genomic hybridisation (CGH). Our analysis showed that the most common genomic imbalances were 8q gain (78%), 6p gain (67%) and monosomy 3 (56%). Two distinct CGH profiles could be delineated on the basis of the chromosome 3 status. The most common genetic changes in monosomy 3 tumours, in our study, were gain of 8q11.21–q24.3, 6p25.1–p21.2, 21q21.2–q21.3 and 21q22.13–q22.3 and loss of 1p36.33–p34.3, 1p31.1–p21.2, 6q16.2–q25.3 and 8p23.3–p11.23. In contrast, disomy 3 tumours showed recurrent gains of only 6p25.3–p22.3 and 8q23.2–q24.3. Our approach allowed definition of the smallest overlapping regions of imbalance, which may be important in the development of UVM.
doi:10.1038/sj.bjc.6602834
PMCID: PMC2361503
PMID: 16251874
uveal melanoma; array CGH; regions of imbalance
Recurrent copy number alterations (CNAs) play an important role in cancer genesis. While a number of computational methods have been proposed for identifying such CNAs, their relative merits remain largely unknown in practice, since very few efforts have focused on comparative analysis of the methods. To facilitate studies of recurrent CNA identification in cancer genomes, it is imperative to conduct a comprehensive comparison of the performance and limitations of existing methods. In this paper, six representative methods proposed in the last six years are compared. These include one-stage and two-stage approaches, working with raw intensity ratio data and discretized data, respectively. They are based on various techniques such as kernel regression, correlation matrix diagonal segmentation, semi-parametric permutation and cyclic permutation schemes. We evaluate the performance of the methods under multiple simulation scenarios using several criteria: type I error rate, detection power, the Receiver Operating Characteristic (ROC) curve and the area under the curve (AUC), and computational complexity. We also characterize their performance on two real datasets obtained from lung adenocarcinoma and glioblastoma. This comparison study reveals general characteristics of the existing methods for identifying recurrent CNAs and provides new insights into their strengths and weaknesses. We believe it will help accelerate the development of novel and improved methods.
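As a small illustration of one of the criteria above, the AUC can be computed directly from per-marker scores and ground-truth labels. The sketch below is an assumption-laden example rather than the evaluation code used in the study: the function name `auc` and the toy scores are hypothetical. It uses the rank interpretation of the AUC, namely the probability that a randomly chosen true CNA marker scores higher than a randomly chosen non-CNA marker, with ties counted as one half.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (hypothetical helper).

    scores: detection scores, higher meaning "more likely a recurrent CNA".
    labels: 1 for markers inside a simulated recurrent CNA, 0 otherwise.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # Count positive/negative pairs ranked correctly; ties count one half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: one of the four positive/negative pairs is ranked wrongly.
example_auc = auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
print(example_auc)  # 0.75
```

The pairwise form is quadratic in the number of markers, which is acceptable for a small illustration; a rank-based computation would be preferable for genome-scale data.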
doi:10.1371/journal.pone.0052516
PMCID: PMC3527554
PMID: 23285074
Motivation: Somatic amplification of particular genomic regions and selection of cellular lineages with such amplifications drive tumor development. However, pinpointing genes under such selection has been difficult due to the large span of these regions. Our recently developed method, the amplification distortion test (ADT), identifies specific nucleotide alleles and haplotypes that confer better survival for tumor cells when somatically amplified. In this work, we focus on evaluating ADT's power to detect such causal variants across a variety of tumor dataset scenarios.
Results: Towards this end, we generated multiple parameter-based, synthetic datasets—derived from real data—that contain somatic copy number aberrations (CNAs) of various lengths and frequencies over germline single nucleotide polymorphisms (SNPs) genome-wide. Gold-standard causal sub-regions were assigned within these CNAs, followed by an assessment of ADT's ability to detect these sub-regions. Results indicate that ADT possesses high sensitivity and specificity in large sample sizes across most parameter cases, including those that more closely reflect existing SNP and CNA cancer data.
Availability: ADT is implemented in the Java software HADiT and can be downloaded through the SVN repository (via Develop → Code → SVN Browse) at: http://sourceforge.net/projects/hadit/.
Contact: ninad.dewal@dbmi.columbia.edu
Supplementary Information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btp694
PMCID: PMC2852215
PMID: 20031965
Kuroda, Akiko | Tsukamoto, Yoshiyuki | Nguyen, Lam Tung | Noguchi, Tsuyoshi | Takeuchi, Ichiro | Uchida, Masahiro | Uchida, Tomohisa | Hijiya, Naoki | Nakada, Chisato | Okimoto, Tadayoshi | Kodama, Masaaki | Murakami, Kazunari | Matsuura, Keiko | Seto, Masao | Ito, Hisao | Fujioka, Toshio | Moriyama, Masatsugu | Novelli, Giuseppe
Genomic copy number aberrations (CNAs) in gastric cancer have already been extensively characterized by array comparative genomic hybridization (array CGH) analysis. However, the involvement of genomic CNAs in the process of submucosal invasion and lymph node metastasis in early gastric cancer is still poorly understood. To address this issue, we collected a total of 59 tumor samples from 27 patients with submucosal-invasive gastric cancers (SMGC), analyzed their genomic profiles by array CGH, and compared them between paired samples of mucosal (MU) and submucosal (SM) invasion (23 pairs), and of SM invasion and lymph node (LN) metastasis (9 pairs). Initially, we hypothesized that acquisition of specific CNA(s) is important for these processes. However, we observed no significant difference in the number of genomic CNAs between paired MU and SM, or between paired SM and LN. Furthermore, we were unable to find any CNAs specifically associated with SM invasion or LN metastasis. Among the 23 cases analyzed, 15 showed similar genomic profiles between SM and MU. Interestingly, 13 of the 15 cases also showed some differences in genomic profiles. These results suggest that the majority of SMGCs are composed of heterogeneous subpopulations derived from the same clonal origin. Comparison of genomic CNAs between SMGCs with and without LN metastasis revealed that gain of 11q13, 11q14, 11q22, 14q32 and amplification of 17q21 were more frequent in metastatic SMGCs, suggesting that these CNAs are related to LN metastasis of early gastric cancer. In conclusion, our data suggest that generation of genetically distinct subclones, rather than acquisition of a specific CNA at MU, is integral to the process of submucosal invasion, and that subclones that acquire gain of 11q13, 11q14, 11q22, 14q32 or amplification of 17q21 are likely to become metastatic.
doi:10.1371/journal.pone.0022313
PMCID: PMC3141024
PMID: 21811585
Background
DNA copy number aberration (CNA) is very important in the pathogenesis of tumors and other diseases. For example, CNAs may result in suppression of anti-oncogenes and activation of oncogenes, which can cause certain types of cancer. High-density single nucleotide polymorphism (SNP) array data are widely used for CNA detection. However, detecting CNAs automatically is nontrivial because the signals obtained from high-density SNP arrays often have a low signal-to-noise ratio (SNR), which might be caused by whole genome amplification, mixtures of normal and tumor cells, experimental noise or other technical limitations. As the SNR decreases, many false CNA regions are detected and true CNA regions are missed. Thus, more sophisticated statistical models are needed to make CNA detection from low-SNR signals more robust and reliable.
Results
This paper presents a conditional random pattern (CRP) model for CNA detection in which many contextual cues are exploited to suppress noise and improve CNA detection accuracy. Both simulated and real data are used to evaluate the proposed model, and the validation results show that, compared to a number of widely used software packages, the CRP model is more robust and reliable in the presence of noise for CNA detection using high-density SNP array data.
Conclusions
The proposed conditional random pattern (CRP) model could effectively detect the CNA regions in the presence of noise.
doi:10.1186/1471-2105-11-200
PMCID: PMC2876128
PMID: 20412592