Changes in the number of copies of genomic DNA are an important step in the progression of cancer. Comparative genomic hybridization (CGH) was developed to identify these changes at a resolution of 10–20 Mb (Kallioniemi et al., 1992). Platforms for copy number (CN) analysis that employ microarray technology and that achieve high resolution include array CGH (Pinkel et al., 1998), ROMA (Lucito et al., 2003) and SNP arrays (Hardenbol et al., 2005; Peiffer et al., 2006; Zhao et al., 2004). Current technology has improved the resolution to as fine as 1 kb. With custom arrays also available, the resolution in particular neighborhoods can be even higher.
Heretofore, CN analysis has consisted primarily of examining total copy number (TCN). TCN is the sum of the CNs from the two parental chromosomes. For normal human cells, total CN is two, one from each parental chromosome. SNP arrays allow separate estimates of CN from the parental chromosomes. This is parent-specific copy number (PSCN).
PSCN may be interesting for two major reasons. First, there may be alleles that differentially undergo CN change (Nagase et al., 2003). Estimating PSCN would help elucidate this situation. Second, when the total CN is C, the PSCNs may be more complicated than (1, C−1). For instance, diploid (C = 2) CN is maintained when one parental copy is lost and the other is doubled. This type of alteration is called copy-neutral loss-of-heterozygosity (CN-LOH), and it occurs often in many cancers including glioblastoma (Kuga et al., 2008) and hematologic malignancies (O'Keefe et al., 2010). Such a region would be assumed normal if only total CN were analyzed.
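The relationship between TCN and PSCN described above can be made concrete with a small sketch. The function names and the representation of PSCN as an unordered pair of integers are illustrative assumptions, not part of the method presented in this article; the point is only that several PSCN states share the same TCN, so TCN alone cannot distinguish them.

```python
def total_copy_number(pscn):
    """TCN is the sum of the two parent-specific copy numbers."""
    c1, c2 = pscn
    return c1 + c2


def is_copy_neutral_loh(pscn, normal_tcn=2):
    """True when one parental copy is lost and TCN still equals the normal value.

    The pair is sorted first because the data are unphased: we do not know
    which entry belongs to which parent.
    """
    c_min, c_max = sorted(pscn)
    return c_min == 0 and c_min + c_max == normal_tcn


def possible_pscn_states(tcn):
    """All unordered integer pairs (c_min, c_max) whose sum is tcn."""
    return [(c, tcn - c) for c in range(tcn // 2 + 1)]


# Both states below have TCN = 2, but only the first is CN-LOH.
for state in possible_pscn_states(2):
    print(state, total_copy_number(state), is_copy_neutral_loh(state))
```

For a total CN of 2 the sketch lists (0, 2) and (1, 1): an analysis restricted to TCN would treat both as normal, whereas the PSCN view flags the first as CN-LOH.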
Direct estimates of PSCN can be made only for SNPs at which a subject is heterozygous. Homozygous SNPs are not directly informative because all the CN signal is in one allele. For example, if both parents contributed G, then there would only be a G CN signal, and this would result in no information additional to that contained in total CN. For heterozygous SNPs, however, there are two components to the CN information. If the subject was GT at a SNP, then there would be a CN estimate corresponding to G and one corresponding to T. One of the CNs would be expected to have come from one parent and the other to have come from the other parent. Additionally, the data are unphased; it is not directly known which measurement is associated with which parental chromosome.
CN alterations apply to contiguous regions, and the data on CN derived from microarrays can be noisy. Therefore, methods have been developed to analyze CN data that rely on the underlying spatial correlation. The idea is to split the genome into regions of equal total CN. Methods for this have included direct segmentation (Olshen et al., 2004; Picard et al., 2005; Venkatraman and Olshen, 2007), hidden Markov models (HMMs) (Fridlyand et al., 2004; Guha et al., 2008; Lai et al., 2008) and smoothing (Hsu et al., 2005; Tibshirani and Wang, 2008). When the earlier methods were compared (Lai et al., 2005; Willenbrock and Fridlyand, 2005), direct segmentation methods performed best.
The purpose of the present article is to extend segmentation to allele-specific data. We cannot simply perform two separate segmentations, one for each parental chromosome, because the data are unphased. Therefore, other techniques are needed. As part of our algorithm, we use the circular binary segmentation (CBS) method (Olshen et al., 2004; Venkatraman and Olshen, 2007), although any good segmentation method could replace CBS in our overall procedure. We call our method Paired Parent-Specific CBS (‘Paired PSCBS’ or just ‘PSCBS’), while acknowledging that due to the lack of phase information, we cannot assign segmentation-based estimates to the paternal or maternal chromosomes.
Other approaches to PSCN segmentation bear some resemblance to PSCBS. Here, we focus on the BAF segmentation method of Staaf et al. (2008), especially since their study provides a comparison to existing methods. BAF segmentation is similar to PSCBS in that it relies on CBS and it adapts to datasets consisting of paired tumor and normal samples. It essentially segments the mirrored B-allele frequency (mirrored BAF), which is the ratio of the higher parental copy number to the total copy number, after removing all homozygotes identified in the normal samples. It differs from PSCBS, as discussed in Section 2, in that it segments only heterozygous SNPs, whereas Paired PSCBS has an advantage in that it utilizes all SNPs as well as any non-polymorphic loci. Another advantage of Paired PSCBS over BAF segmentation is that it uses the normal sample to more accurately quantify the tumor data (Bengtsson et al., 2010).
Another paired method of which we are aware is a hidden Markov method that segments jointly on TCN and mirrored BAF (Lamy et al., 2007). But since it is specific to Affymetrix arrays, and we are interested only in general methods, we did not evaluate it. During the review of this article, Van Loo et al. (2010) published a paired joint segmentation method that was not studied here.
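As an illustration of the mirrored BAF used by the paired methods above, the following sketch computes it from a pair of allele-specific signals, taking the BAF to be the B-allele signal divided by the total signal. The function names, the (A, B) signal tuples and the genotype labels "AA", "AB" and "BB" are assumptions made for this example; it is not the implementation of Staaf et al. or of any other cited method.

```python
def baf(a, b):
    """B-allele frequency: B-allele signal as a fraction of the total signal."""
    total = a + b
    if total <= 0:
        raise ValueError("total signal must be positive")
    return b / total


def mirrored_baf(a, b):
    """Reflect the BAF around 0.5 so that it lies in [0.5, 1].

    This equals max(a, b) / (a + b), the share of the higher copy number,
    and gives the same value whichever allele is labeled B.
    """
    return 0.5 + abs(baf(a, b) - 0.5)


def mirrored_baf_at_het_snps(tumor_signals, normal_genotypes):
    """Keep only SNPs called heterozygous ("AB", an assumed label) in the normal."""
    return [
        mirrored_baf(a, b)
        for (a, b), genotype in zip(tumor_signals, normal_genotypes)
        if genotype == "AB"
    ]


signals = [(1.0, 1.0), (2.0, 0.0), (3.0, 1.0)]
genotypes = ["AB", "AA", "AB"]
print(mirrored_baf_at_het_snps(signals, genotypes))
```

Because the mirroring discards the allele label, a value of 0.5 corresponds to balanced parental copies and values near 1 to a strong imbalance, regardless of phase.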
Other methodologies exist that are not based on paired samples. LaFramboise et al. (2005) used CBS to segment total CN data, and then estimated parental CN within segments. Because this approach does not segment the allele-specific data, it may miss certain events. Li et al. (2008) developed a similar procedure using an HMM. They referred to the mirrored BAF as the major copy proportion, so their method is called MCP. SOMATICs (Assié et al., 2008) uses the BAF, which is the ratio of the B-allele to the total CN, to identify CN abnormalities that are then confirmed by the total CN. QuantiSNP (Colella et al., 2007) and PennCNV (Wang et al., 2007) are two HMM methods that rely on the same six-state model. The method of Sun et al. (2009) is a ‘2d’ HMM in the same vein as PennCNV and QuantiSNP, but adapted to cancer studies. The more recent GAP method (Popova et al., 2009) segments total CN and allelic ratio independently and then considers the segments defined by the union of the two sets of change-points. Chen et al. (2011) extended their HMM methodology (Lai et al., 2008) to allele-specific data. An advantage of Chen's HMM method (PSCN) is that there is no limit on the number of states. In addition, Greenman et al. (2010) developed the PICNIC method, which is also based on an HMM and assigns integer CN states.
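The idea of defining segments by the union of two sets of change-points, mentioned above in connection with GAP, can be sketched as follows. This is a generic illustration and not GAP's implementation: the function name is hypothetical, and change-points are assumed to be integer locus indices marking the first locus of a new segment.

```python
def segments_from_union(tcn_changepoints, ratio_changepoints, n_loci):
    """Half-open segments [start, end) induced by the union of two change-point sets.

    Change-points outside the open interval (0, n_loci) are ignored, and a
    change-point found in both inputs is used only once.
    """
    union = sorted(set(tcn_changepoints) | set(ratio_changepoints))
    interior = [cp for cp in union if 0 < cp < n_loci]
    bounds = [0] + interior + [n_loci]
    return list(zip(bounds[:-1], bounds[1:]))


# Two TCN change-points and two allelic-ratio change-points, one shared.
print(segments_from_union([100, 250], [100, 180], 300))
```

With these inputs the shared change-point at 100 is counted once, and the result is four segments that are homogeneous with respect to both segmentations.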
In the present article, Section 2 covers our methods. Section 3 contains simulations that show the effectiveness of our procedure, as well as an example drawn from glioblastoma data. Finally, Section 4