The Neighborhood Breakpoint Conservation (NBC) algorithm takes, as input, aCGH data from many individuals and identifies recurrent breakpoints and pairs of recurrent breakpoints in a subset of the individuals (Figure ). The first step in NBC, as in most aCGH analysis, is to segment each copy number profile into intervals of equal copy number.

While many existing methods produce a single segmentation for aCGH data [

19-

21], NBC uses a dynamic programming approach [

22] to compute the probability

of a copy number profile

**X **given a segmentation

. NBC then employs a stochastic backtrace to compute the posterior probability

. Using this approach, one can derive the segmentation

with maximum probability, but more importantly, one can compute the posterior probability of events of interest over

*all possible segmentations *of the data. In particular, we compute the probability of a breakpoint between each pair of adjacent probes, as well as the probability of a breakpoint within a fixed interval or probes (e.g. from a gene region).

The second step of NBC is to combine breakpoint probabilities in each individual to determine breakpoints that appear in multiple individuals. Similar to [

11], we use a binomial order statistic [

23] to compute a

*p*-value for the event that

*k *or more individuals share a breakpoint between two adjacent probes. We then extend this breakpoint score to consider pairs of breakpoints that are shared by multiple individuals. Finally, we also define a score for a breakpoint that may occur anywhere within an interval of adjacent probes (e.g. a gene) that is shared by multiple individuals. We detail each of these two steps in the following sections.

A Probability Model for Segmentation and Breakpoint Analysis

A probabilistic formulation of the segmentation problem assigns a probability to each possible segmentation of

**X**. The probability of other events, such as a breakpoint occurring at a particular locus, are readily computed from this model. Probabilistic segmentation approaches have been previously applied to CNV detection [

21,

24-

26], but we found that these methods either: require a finite number of copy number levels (as in the Bayesian Hidden Markov method of [

26]); focus on probabilistic model selection rather than an explicit probabilistic model for the segmentation itself [

21]; or do not perform well on high-resolution oligonucleotide arrays (see Additional File

1, Figure S3 for a comparison to [

25]).

Our algorithm is based on the change-point model described in [

22]. Consider a copy number profile

**X **= (

*X*_{1},...,

*X*_{n}), where

*X*_{i }is the log

_{2 }ratio of test.reference DNA at the

*i*th probe. We assume that the test genome consists of an unknown number of segments

*K *with corresponding copy numbers Θ = {

*θ*_{1},...,

*θ*_{K}}. Following the usual assumptions for aCGH data [

20,

21,

24-

26], we assume that each

*X*_{i }is normally distributed with mean

*μ*_{i }and variance

*σ*^{2}. The variance

*σ*^{2 }is a hyperparameter whose value must be set. Below we describe how we estimate this value from the data. The mean

*μ*_{i }equals

*θ*_{s }if probe

*i *lies within segment

*s*. Further, we assume that

*X*_{i }from different segments are independent. Let

*l*_{j }denote the number of probes in segment

*j*, and let

*k*_{max }denote the maximum number of segments in the test genome.

We define the

*breakpoint sequence ***A **= (

*A*_{1}, ...,

*A*_{K+1}), where

*A*_{v }is the index of the probe at the start of the

*v *+ 1

*st *segment and

*A*_{K+1 }=

*n *is a "dummy" breakpoint signifying the end of the sequence (i.e. there are

*K*+1 breakpoints representing

*K *segments in

**A**). Thus,

The unknowns in our model are the breakpoint sequence

**A**, the number of segments

*K*, and the segment copy numbers Θ. We assume a priori that Θ is independent of

**A **and

*K*. We further assume that the segment copy numbers

*θ*_{s } Θ are independent and normally distributed with mean

*μ*_{0 }and variance

. (The assumption that

gives a conjugate prior for

allowing us to compute some probabilities analytically. See Additional File

1, Section SA.) We assign a prior on breakpoints sequences

**A **such that all

**A **with

*K *segments are equally likely,

. Additionally, we assign a prior on the number of segments

*K *such that there is a probability of

of a single segment (

*K *= 1) and the remaining values of

*K*, 1 <

*K *≤

*k*_{max}, are equally likely. Note that these priors do not make any strong assumptions about the data. essentially, the a priori assumption is that with probability

the data is produced from a single segment.

From the priors

*P *(

**A**|

*K*) and

*P *(

*K *=

*k*) and the values of the hyperparameter

*σ*;

*μ*_{0},

*σ*_{0}, the joint distribution

*P *(

**X**,

**A**, Θ,

*K*) can be derived (Additional File

1, Section SA).

Hyperparameter Estimation

The segmentation and breakpoint analysis algorithm relies on setting values for the hyperparameters

*μ*_{0 }(the baseline mean),

(the variance in segment copy numbers), and

*σ*^{2 }(the variance in probe measurements). We describe how to estimate these from the copy number profile

**X **= (

*X*_{1},...,

*X*_{n}). First, we set

*μ*_{0 }to be the median of the

*X*_{i}. To estimate the variances

and

*σ*^{2}, we form sliding windows of 10 probes. Let

*V *be the median of the sample variances of the windows, and let

*M *be the maximum absolute difference between the sample means of the windows and

*μ*_{0}. We set the measurement variance

*σ*^{2 }= 2

*V *and the segment variance

.

To test the sensitivity of our results to our particular estimates of the hyperparameters - in particular our estimates of

*σ*^{2 }and

- we performed two simulations that are inspired by the simulations of [

25].

**Simulation #1 **We generated an artificial chromosome with 100 probes containing a 40 probe single-copy gain (log

_{2 }ratio of 1) placed in the center. We then introduced various amounts of gaussian noise

in the probe measurements, setting

. For each value of

, we generated 100 such chromosomes.

**Simulation #2 **We generated an artificial chromosome with 100 probes with gaussian noise *N*(0, 0.5) in the probe measurements. We then introduced a 40 probe aberration at various log_{2 }ratios. 0.5, 1, 2, 3, 4, 5, and 6. For each log_{2 }ratio, we generated 100 such chromosomes.

A representative sample of the datasets for Simulation #1 and Simulation #2 are shown in Additional File

1, Figure S1 and S2.

We ran NBC on datasets from the two simulations with different estimates for the variances

and

*σ*^{2}, detailed below. To assess the quality of the resulting breakpoint predictions, we consider probe locations with Pr(breakpoint) ≥ 0.5 to be a predicted breakpoint. We assume that a predicted breakpoint detects a true breakpoint if the predicted breakpoint location is ≤ 2 probes away from the true breakpoint location. We count the number of true positive predictions (0, 1, or 2). Additionally, we count the number of false positive predictions for each dataset. We average the true positives and false positives over the 100 artificial chromosomes.

Simulation #1 has a fixed aberration log

_{2 }ratio, so we set the segment variance

and we test three different values of

*σ*^{2}:

*V*, 2

*V*, and 3

*V *(Figure top row). Compared to our estimated value of

*σ*^{2 }= 2

*V*, the number of true positives is similar when

*σ*^{2 }=

*V *or

*σ*^{2 }= 3

*V *and the measurement error

is low. As

increases, setting

*σ*^{2 }=

*V *results in more false positives compared to our estimate

*σ*^{2 }= 2

*V*, while setting

*σ*^{2 }= 3

*V *results in fewer total predictions, including true positives. Thus, at lower measurement error the results are not particularly sensitive to the value of

*σ*^{2}, with our estimate

*σ*^{2 }= 2

*V *maintaining reasonable sensitivity and specificity and higher measurement error.

Since Simulation #2 has fixed measurement error, we set the measurement variance

*σ*^{2 }= 2

*V *and test three different values of

, and

*M*^{3 }(Figure bottom row). The number of true and false positives is very similar for all three estimates of

. The only exception is that when

, there is a large variation in the number of false positives over the different simulated chromosomes. These simulations show that our hyperparameter estimates are reasonable, although other estimation approaches are possible.

The simulations underscore that the ability to detect the breakpoints of a segment is related to both the copy number of the segment (governed by the segment variance

) and the measurement error (governed by the variance

*σ*^{2}). For example, in Simulation #1 (where

is fixed), as the probe variance

*σ*^{2 }increases the average number of false positive breakpoints increases while the average number of true positives remains below one. To avoid such situations, we do not segment the data and immediately report 0 breakpoints when our estimates of

*σ *and

*σ*_{0 }satisfy

*σ *≥ 3

*σ*_{0}.

Computing Breakpoint Probabilities

We compute the probability of a breakpoint between pairs of adjacent probes by sampling breakpoint sequences

**A **from the distribution

*P *(

**A**|

**X**) and counting the proportion of samples that have a breakpoint between adjacent probes. Note that the probability of a breakpoint between adjacent probes can be analytically computed (see [

22]). We describe a sampling strategy, since this will generalize to the computation of the probability of breakpoints that lie within an interval or pairs or breakpoints. For notational convenience, let

*X*_{[i:j] }= (

*X*_{i},...,

*X*_{j}),

*X*_{(i:j] }(

*X*_{i+1},...,

*X*_{j}), and

*X*_{[i:j) }= (

*X*_{i},...,

*X*_{j-1}) The probability of

**X **is

Here,

is the length (number of breakpoints) of

*P *(

**X**|

*K *=

*k*) is the probability of the data

**X **given that the test genome is divided into

*k *segments, and

is the probability that

consists of a single segment. The product in Equation (2) results from the segment independence assumption. The choice of a conjugate prior for

*P *(

*θ*) allows the integral to be analytically computed (Additional File

1, Section SA.2). However, calculating

*P *(

**X**|

*K *=

*k*) in this way requires summing over all possible breakpoint sequences

**A **and is computationally infeasible. A dynamic program allows the efficient computation of this term.

Dynamic program

Let

*P *(

*X*_{[i:j]}*|k*) be the probability of observing

*X*_{[i:j] }given that it is generated from

*k *different segments. We compute this

*P *(

*X*_{[1:j]}|

*k*) for 1 ≤

*k *≤

*k*_{max }and 1 ≤

*j *≤

*n *as follows.

The final row of the dynamic programming table contains *P *(**X**|*K *= *k*) for 1 ≤ *k *≤ *k*_{max}, which is used in Equation (1) to compute *P *(**X**).

Recursive sampling

We use

*P *(

**X**|

*K *=

*k*) as well as the base case

*P *(

*X*_{[i:j]}|1) and intermediate terms

*P *(

*X*_{[1:j]}|

*k*) in the dynamic program to sample exact and independent breakpoint sequences

**A **using a backward sampling technique [

22].

1. Draw *K *= *k *from *P *(*K *= *k*|**X**), determined by inverting *P *(**X**|*K *= *k*) using Bayes Rule.

2. Set *A*_{k+1 }= *n*.

3. Draw

*A*_{k},

*A*_{k-1}, ...,

*A*_{1 }recursively using the conditional distributions computed by the recurrences in Equation (4). Given

*A*_{q}, the location of the beginning of the

*q*th segment, the distribution of

*A*_{q-1 }is obtained as follows.

From a set of breakpoint sequences sampled in proportion to *P *(**A**|**X**), we determine the probability of a breakpoint occurring between two adjacent probes by counting the proportion of samples that contain a breakpoint at that locus. Other probabilities derived from these sampled breakpoint sequences are described in subsequent sections.

Runtime analysis

The base cases *P *(*X*_{[i:j]}|1) require *O*(*n*^{2}) computations and the dynamic program requires *O*(*nk*_{max}) computations; thus computing *P *(**X**|*K *= *k*) is achieved in *O*(*n*(*n *+ *k*_{max})) time. All computations necessary to sample a breakpoint sequence **A **are already computed in the dynamic program, so sampling is linear in the number of breakpoints *K *drawn from *P *(**X**|*K *= *k*).

Identifying Recurrent Breakpoints

After sampling breakpoint sequences for a set of individuals, we identify recurrent breakpoints that appear in many individuals at the same genomic locus. Let

be a set of copy number profiles from m individuals, where

*S*_{j }= (

*X*_{1}, ...,

*X*_{n}) is the copy number profile for individual

*j*. We assume that the same array probes are used for each individual, i.e. the

*i*th probe in individual

*S*_{j }is at the same location as the

*i*th probe in individual

*S*_{j'}. We analyze recurrent breakpoints at two levels of resolution.

• *Recurrent probe breakpoints *occur between the same two array probes in a subset of individuals.

• *Recurrent interval breakpoints *occur within the same interval of the genome in a subset of individuals.

In addition to analyzing these types of recurrent breakpoints, we also consider pairs of recurrent breakpoints to identify recurrent CNVs. Note that these pairs may indicate *intrachromosomal *CNVs, as in the case of classic copy number aberrations like duplications and deletions, or *interchromosomal *CNVs, as in the case of (unbalanced) translocations.

Recurrent probe breakpoints

For each probe, we define a score that measures the presence of a breakpoint in a subset of individuals. We design this score to account for the observation that the number of breakpoints in copy-number profiles, particularly in a set of cancer samples, is highly variable. That is, in a set of cancer samples, even from the same cancer type, there will typically be highly rearranged cancer genomes with many breakpoints, and less rearranged genomes with relatively few breakpoints. This variability in the number of breakpoints is maintained following our Bayesian segmentation approach - despite the fact that we use the same flat prior for each individual - because there is strong evidence to support a larger number of breakpoints in some samples. Since there is a greater chance of recurrent breakpoints occurring randomly in a collection of highly rearranged genomes than a collection of less rearranged genomes, it is advantageous to consider the number of breakpoints in each profile when scoring recurrent breakpoints. Because the variability of number of breakpoints across different individuals is typically not well matched by a standard distribution, one approach is to use a permutation test that preserves the number and probability of breakpoints in each profile while permuting their location. We instead derive a score for recurrent probe breakpoints based on a binomial order statistic [

11,

23]. This score first normalizes the breakpoint probability at each probe in each individual according to the breakpoint probabilities across all probes in individual. These normalized values are then combined across multiple individuals to produce a recurrent breakpoint score.

Let

*b*_{i }be the event that a breakpoint lies between probes

*i *and

*i *+ 1;

*P *(

*b*_{i}|

*S*_{j}) is the

*breakpoint probability *at probe

*i *in individual

*S*_{j}, and is computed by counting the proportion of sampled breakpoint sequences

**A **that have a breakpoint between

*i *and

*i *+ 1. Let

*ρ*_{j}(

*i*) be the fraction of probes with a higher breakpoint probability than probe

*i *in individual

*S*_{j }(the normalized rank of probe

*i*).

Let

*π *be a permutation of the individuals

such that

. For 1 ≤

*h *≤

*m*, we wish to determine the probability that

*h *or more individuals have a breakpoint at location

*i*. Because of our normalization of the breakpoint probabilities in each sample, under the null hypothesis the individual scores

*ρ*_{j}(

*i*) are independent and uniformly distributed in [0,1]. Thus, the probability that h or more individuals have a breakpoint at location

*i *is given by the tail of the binomial distribution with success probability

. The

*p*-value for the probe location

*i *is

where we are only interested in scoring those breakpoints that are present in at least *h*_{min }patients. Note that because the binomial order statistic is computed from the empirical distribution *ρ*_{j }of breakpoint probabilities in each sample, the relative magnitude of the breakpoint probability is not used in the computation. Despite this loss of information, we found that the binomial order statistic produced reasonable results on real data (See Results below) and was more efficient than a permutation test.

Finally, we assume that a recurrent breakpoint is also conserved in the direction of the copy number change: all samples with a recurrent breakpoint are either breakpoints that go from relatively low copy number to high copy number of vice versa. A breakpoint sequence

**A **defined a segmentation, and we use the mean values of each segment to determine the direction of copy number change. The copy number change is positive if the mean of the segment to the right of the breakpoint is higher than the mean of the segment to the left. We test both cases for each recurrent breakpoint, doubling the number of hypotheses we test. We control the False Discovery Rate (FDR) using the method of Benjamini and Hochberg [

27].

Recurrent interval/gene breakpoints

We extend our approach to find recurrent breakpoints that lie within a genomic interval

*W*; e.g. a gene. Unlike the recurrent probe breakpoint calculation above, where each probe was a priori equally likely to contain a breakpoint, intervals that contain more probes are a priori more likely to contain a breakpoint than intervals that contain fewer probes. To account for this, we use a log-odds score that is defined as follows. Let

*b * *W *be the event that one or more breakpoints lie between any pair of adjacent probes within

*W*. Similarly, let

*b * *W *be the event that no breakpoint lies between any adjacent probes within

*W*. The log-odds score

_{j}(

*W*) that patient

*S*_{j }contains a breakpoint within

*W *is

The conditional probabilities

*P*(

*b * *W *|

*S*_{j}) and

*P*(

*b * =

*W *|

*S*_{j}) describe probabilities over all possible segmentations of the copy number profile

*S*_{j}.

*P*(

*b * *W *|

*S*_{j}) is determined by sampling breakpoint sequences

**A **and counting the number of samples that contain one or more breakpoints in the interval

*W*.

*P*(

*b * =

*W *|

*S*_{j}) is then simply 1-

*P*(

*b * *W *|

*S*_{j}). The scaling factor

is computed by counting the number of ways to place breakpoints such that none of them lie in

*W*:

Here, the last term in Equation (9) counts the number of ways to choose

*k *breakpoints that do not lie in

*W*. As in the recurrent breakpoint computation above, we use the binomial order statistic to combine log-odds scores across patients. First, in an analogous computation to Equation (6) we normalize the log-odds scores using the empirical cumulative distribution, which produces the normalized rank of

_{j}(W) for all

*j*:

Finally, using the *ρ*_{j}(*W *) scores for each patient *S*_{j }we compute the *p*-value *ρ*(*W *) using the binomial order statistic as in Equation (7).

For the experiments below, we define the the copy number change for an interval *W *to be positive if at least 90% of the breakpoints within the interval are positive and negative if at least 90% of the breakpoints within the interval are negative. Otherwise, we do not call a breakpoint in *W*.

Pairs of recurrent interval/gene breakpoints

We identify pairs of non-overlapping recurrent interval breakpoints using a log-odds score similar to Equation (8) that scores two breakpoints occurring in intervals

*W*_{1 }and

*W*_{2}. An important case we will consider is when

*W*_{1 }and

*W*_{2 }are genes. Let

*b * *W*_{1 }be the event that a breakpoint lies between any pair of adjacent probes within

*W*_{1}, and Let

*b' * *W*_{2 }be the event that a breakpoint lies between any pair of adjacent probes within

*W*_{2}. We define the score for intervals

*W*_{1 }and

*W*_{2 }for a particular patient

*S*_{j}.

Each term is computed similarly to Equation (8). If

*W*_{1 }and

*W*_{2 }are on different chromosomes, the events

*P *(

*b * *W*_{1}) and

*P *(

*b' * *W*_{2}) are independent and Equations (9) and (10) are used to compute the scaling factor

. If the intervals are on the same chromosome then the events are dependent, and the numerator in the scaling factor is

The denominator in the scaling factor is then

The *p*-value *ρ*(*W*_{1}, *W*_{2}) is computed by normalizing as in Equation (11) according to the empirical distribution of log-odds scores over all pairs of non-overlapping intervals and then using the binomial order statistic to determine the final *p*-value. Here, we test four hypotheses for each pair *W*_{1 }and *W*_{2 }by considering the four combinations of direction of copy number change: {(+, +), (-, -), (-, +), (-, -)}. Note that restricting *W*_{1 }and *W*_{2 }to each contain a single probe identifies pairs of recurrent probe breakpoints.

Predicting Structural Variants, Gene Truncations, and Fusion Genes

Our statistics for single recurrent breakpoints (*ρ*(*i*) and *ρ*(*W*)) and pairs of recurrent breakpoints (*ρ*(*i*, *j*) and *ρ*(*W*_{1}, *W*_{2})) provide a flexible framework to predict particular rearrangement configurations. In this paper, we classify predictions into structural variants, gene truncations, and fusion genes.

Structural variants

Pairs of recurrent probe breakpoints may indicate germline or somatic rearrangements that have recurrent breakpoints at the highest resolution allowed by the spacing of probes. To identify these rearrangements, we compute the pairs of recurrent probe breakpoint statistic for every pair of probes within each chromosomal arm. Note that this limits the structural variant predictions to intrachromosomal rearrangements only.

Gene truncations

Recurrent breakpoints found within a single gene may indicate a gene truncation, resulting in the loss of functionality for a particular gene. To predict gene truncations, we compute the recurrent interval breakpoint detection statistic, using the set of gene regions from RefSeq as our intervals of interest.

Fusion genes

Pairs of recurrent interval breakpoints found within genes suggest potential fusion genes. We compute pairs of recurrent interval breakpoints using all pairs of gene regions from RefSeq as our intervals of interest. Note that not all pairs of recurrent genes suggest functional fusion genes. For example, a rearrangement that joins the 3' end of one gene to the 3' end of another gene is typically not a functional fusion gene. Thus, we restrict our attention to pairs of interval breakpoints with particular configurations (Figure ).

Specifically, consider a pair of recurrent intervals

*G*_{1 }and

*G*_{2 }that represent gene regions. Each gene has an orientation,

*orient*(

*G*_{1})

{+, -} and

*orient*(

*G*_{2})

{+, -}. Additionally, the breakpoint that lies within each recurrent interval has an associated direction of copy number change,

*dir*(

*G*_{1})

{+, -} and

*dir*(

*G*_{2})

{+, -}. We assume that a fusion gene contains the 5' end of one gene joined to the 3' end of the other gene and thus satisfies the following rule.

Filtering and Ranking Predictions

We apply a number of additional steps to remove and prioritize predictions. In the case of fusion genes, if there are many predictions remaining we rank these predictions by the preservation of copy number across the fusion point.

Removing single probe aberrations

Single probe aberrations are segments consisting of a single probe. Since these are difficult to distinguish from experimental artifacts, we remove them from further consideration. Single probe aberrations are characterized by two large changes in copy number in adjacent probes, where the segments adjacent to this aberration have a similar copy number. We identify these probes and remove them from the analysis.

Removing known CNVs

We remove predictions that are new known CNVs. We say that a single probe is "near" a known CNV in the Database of Genomic Variants (DGV) [

28] if it is within 10 kb of a recorded copy number variant endpoint, and a gene region is "near" a known copy number variant if it is within 10 kb of a recorded copy number variant endpoint. Additionally, a pair of intrachromosomal recurrent breakpoints are near a variant if at least one of the breakpoints is within 10 kb of a recorded copy number variant endpoint and the mutual overlap between the prediction interval (defined by the pair of breakpoints) and the variant interval is greater than 50%.

Ranking predictions

Since fusion genes (and other recurrent pairs of breakpoints) are physically joined in the test genome, we expect the copy number of either side of the breakpoint to be the same. Thus, we rank these predictions by calculating the root mean squared difference (RMS) between the copy number levels of probes surrounding the breakpoint. Consider fusion gene predictions. we know the configuration of the gene partners, but we do not know exactly where the breakpoint lies. Thus, we determine the copy number on each side of the fusion as the average of the three flanking probes of the left gene partner and the three flanking probes of the right gene partner. If

*h *patients have the breakpoint, determined by the argmax of Equation (7),

is the left-flanking copy number of the fusion and

is the right-flanking copy number of the fusion, then the RMS difference of the pair of conserved breakpoints is