
Cancer Inform. 2010; 9: 229–249.

Published online 2010 October 12. doi: 10.4137/CIN.S5614

PMCID: PMC2978932

Corresponding author email: yanxia@uow.edu.au

Copyright © 2010 the author(s), publisher and licensee Libertas Academica Ltd.

This is an open access article. Unrestricted non-commercial use is permitted provided the original work is properly cited.

Existing methods for estimating copy number variations in array comparative genomic hybridization (aCGH) data are limited to estimations of the gain/loss of chromosome regions for single sample analysis. We propose the linear-median method for estimating shared copy numbers in DNA sequences across multiple samples, demonstrate its operating characteristics through simulations and applications to real cancer data, and compare it to two existing methods.

Our proposed linear-median method has the power to estimate common changes that appear at isolated single probe positions or very short regions. Such changes are hard to detect by current methods. This new method shows a higher rate of true positives and a lower rate of false positives. The linear-median method is non-parametric and hence is more robust in estimating copy number. Additionally, the linear-median method is easily computable for practical aCGH data sets compared to other copy number estimation methods.

During cell division, a cell replicates its genome by synthesizing a new copy of each chromosome, using the original DNA as a template. The expected copy number is 2, but it may be less than or greater than 2 when alterations occur during the replication process. Research has suggested that such abnormalities in the number of DNA copies in a cell are associated with the development and progression of disease, including cancer.^{1} Laboratory research to estimate the altered copy numbers in a DNA sequence often uses aCGH. The technology used to produce aCGH data, however, may result in data that contain uncontrollable noise.^{2} The use of appropriate statistical methods to normalize the data and produce meaningful estimates of copy number variation in a DNA sequence is integral to this research. Developing improved statistical methods for this application is the focus of this paper.

Different statistical methods have been suggested for use with aCGH data to estimate copy numbers in DNA sequences. Methods to analyze copy numbers in terms of identifying the locations of gains or losses of chromosome regions have been developed. Assuming that there is a connection between copy number changes in a cancer cell and the development/progression of the cancer, there must exist some common change regions in DNA sequences collected from different patients with the same cancer diagnosis. Techniques for analyzing shared copy number regions have been developed.^{3}^{,}^{4} For detecting copy number regions in a single sample, Olshen et al^{5} and Venkatraman et al^{6} developed a widely used method, the faster circular binary segmentation (CBS) method. In this paper, we propose a new method, the linear-median method, for estimating shared copy number alterations in DNA sequences collected from the same type of cancer cells. The linear-median method is able to optimally use the information available across independent DNA sequences.

This paper is organized as follows. In Section 2.1, we discuss existing statistical models used to assess aCGH data and describe a new model for analyzing multiple independent aCGH data sets. We introduce the linear-median method in Section 2.2. In Section 3.1, we present three simulation studies. We study how much extra information on copy number aberration can be obtained by using the linear-median method compared to the comparative genomic hybridization minimal common region (cghMCR) method and the CBS algorithm. We present an application of the linear-median method to real data in Section 3.2. Supporting figures and tables are available online as Supplementary Material.

aCGH employs the comparative hybridization of genomic DNA that is differentially labeled according to its source in a cancer cell versus a normal cell. The ratio of the hybridization intensities along the chromosomes provides a measure of the relative copy number of sequences in the genomes that hybridize to each location on the chromosomes. Estimating copy numbers and identifying the locations of gains and losses in a DNA sequence are two main challenges in the analysis of aCGH data. We label the normal genomic sequences as the “reference” sample and the genomic sequences from cancer cells as the “test” sample. Let *T _{p}* denote the “test” copy number at probe position *p*, and let *R _{p}* denote the corresponding “reference” copy number.

We briefly describe two current methods for modeling aCGH data. Let us denote by *Y _{p}* the aCGH data (the logarithm intensity ratio) observed at probe position *p*.

$$\text{Model}1:{Y}_{p}={\mathit{\text{log}}}_{2}({T}_{p}/{R}_{p})+{\epsilon}_{p},$$

(1)

where *ε _{p}* are i.i.d. with normal distribution $N(0,{\sigma}_{\epsilon}^{2})$. This Gaussian model forms the basis of many models for aCGH data.

$$\text{Model}2:{Y}_{p}={\mathit{\text{log}}}_{2}\left(\frac{{T}_{p}+{\epsilon}_{p}}{{R}_{p}+{\eta}_{p}}\right),$$

(2)

where *ε _{p}* and *η _{p}* are measurement errors that are assumed to be normally distributed.

In practice, *R _{p}* is assumed to be 2. Given the logarithm intensity ratio observations {*y _{p}*}, the goal is to estimate the “test” copy numbers *T _{p}*.

Models 1 and 2 assume very different probability structures to describe the system. The variance of the log intensity ratios given by Model 1 is a constant, whereas the variance of the log intensity ratios given by Model 2 is a function of *T _{p}*.

We consider which of the two models is a more appropriate model for the analysis of aCGH data. Although Model 1 looks simpler, it is not an appropriate model for aCGH data. The main reason for this is that aCGH data provide the ratio of the copy number variations, not the ratio of the copy numbers. Furthermore, empirical studies show that the standard error of the logarithm of the intensity ratios increases as the copy number increases. Additionally, the distribution of the logarithm of intensity ratios is skewed.^{9} Thus, the distribution of *ε _{p}* should not be assumed to be normal if Model 1 is adopted.

Compared to Model 1, Model 2 is a more appropriate model for aCGH data, as it takes into account the ratio of the copy number variations. However, this model can be improved further. The normality assumptions on the distributions of *ε _{p}* and *η _{p}* may not hold in practice.

In Model 2, the errors *ε _{p}* and *η _{p}*

Therefore, we consider a third model:

$$\text{Model}3:{X}_{p}=\frac{{T}_{p}+{\epsilon}_{p}}{{R}_{p}+{\eta}_{p}},$$

(3)

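where *ε _{p}* and *η _{p}* are independent random variables, each uniformly distributed on [−*a*, *a*] with *a* ∈ (0, 2).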

To allow the model to be more flexible, we can assume that the uniform distributions for *ε _{p}* and *η _{p}* have different parameters.

Model 3 is used to model one aCGH profile from one sample/patient. However, if there is a group of independent samples of aCGH data (eg, multiple patients) and their data share copy number change regions, we can extend Model 3 to such data.

Consider the following scenario. A group of *n* patients suffer from a common cancer. For each patient a sample of aCGH data is collected from a cancer cell. Let *X _{i,p}* be the observed intensity ratio for the *i*th sample at probe position *p*.

For multiple independent aCGH data, the extended model can be considered as

$$\begin{array}{l}\text{Model}4:{X}_{i,p}=\frac{{T}_{i,p}+{\epsilon}_{i,p}}{{R}_{i,p}+{\eta}_{i,p}},\quad 1\le p\le M,\ i=1,2,\dots ,n,\end{array}$$

(4)

where *M* is the total number of probe positions; *n* is the number of independent samples in the group; *ε _{i,p}* and *η _{i,p}* are the measurement errors for the “test” and “reference” samples, respectively.

Model 4 provides a flexible way to model multiple independent aCGH data in terms of the following arguments:

- The probability distributions of *ε _{i,p}* and *η _{i,p}* are allowed to be different. This means that the probability distributions of the measurement errors for the “test” and “reference” are allowed to be different.
- The true **shared** copy number at position *p* is no longer a constant. *T _{p}* is a random variable. This means that the copy number (if it were observable) at position *p* could be different from patient to patient.

Hereafter, we consider multiple independent aCGH data and assume Model 4 as the basis for developing a method to estimate the **shared** copy number *t _{p}*, *p* = 1, 2, …, *M*.

Currently, all raw data used for copy number analysis are presented in the format of a *log*_{2} intensity of the ratios of the test to the reference. From the current literature, we know that a linear format refers to using the intensity of the ratios of the test to the reference, and a nonlinear format refers to using a *log*_{2} intensity of the ratios of the test to the reference, as the *log*_{2}(*ratio*) is not linearly related to the copy number. The variance of a linear format tends to be larger than the variance of a nonlinear format when the relative copy number is far away from 1.^{11} This may explain why the nonlinear format is widely used.

It is expected that the *log*_{2} of the true relative copy number, ie, *log*_{2} (*t _{p}/R_{p}*), can be well estimated using the observations of the *log*_{2} intensity ratios. In general, however,

$$\begin{array}{l}E\left[{\mathit{\text{log}}}_{2}\left(\frac{{T}_{p}+{\epsilon}_{p}}{{R}_{p}+{\eta}_{p}}\right)\right]\ne {\mathit{\text{log}}}_{2}\left(\frac{E[{T}_{p}+{\epsilon}_{p}]}{E[{R}_{p}+{\eta}_{p}]}\right)\\ ={\mathit{\text{log}}}_{2}\left(\frac{E[{T}_{p}]}{{R}_{p}}\right).\end{array}$$

Further, the probability distribution of *log*_{2} [(*T _{p}* + *ε _{p}*)/(*R _{p}* + *η _{p}*)] is skewed.

For the estimating procedure we propose, we will use linear format data rather than nonlinear format data to estimate the shared copy number at probe position *p,* 1 ≤ *p* ≤ *M.*
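The gap between the mean of the log ratios and the log of the true relative copy number can be checked numerically. The following is a minimal sketch, assuming uniform errors on [−*a*, *a*] for both the test and the reference and a fixed test copy number; the function name `simulate_ratios` and the values of `t_p` and `a` are illustrative choices, not values from the paper.

```python
import math
import random
import statistics

def simulate_ratios(t_p, a, n_draws, rng):
    """Draw linear-format ratios (t_p + eps) / (2 + eta), eps and eta ~ U[-a, a]."""
    return [
        (t_p + rng.uniform(-a, a)) / (2.0 + rng.uniform(-a, a))
        for _ in range(n_draws)
    ]

rng = random.Random(0)          # fixed seed so the sketch is repeatable
t_p, a = 3, 1.0                 # illustrative values, not taken from the paper
ratios = simulate_ratios(t_p, a, 100_000, rng)

# Average of the nonlinear-format data versus the log of the true relative copy number.
mean_of_logs = statistics.fmean([math.log2(x) for x in ratios])
log_of_truth = math.log2(t_p / 2.0)
jensen_gap = mean_of_logs - log_of_truth

print(f"mean of log2 ratios : {mean_of_logs:.4f}")
print(f"log2(t_p / 2)       : {log_of_truth:.4f}")
print(f"gap                 : {jensen_gap:.4f}")
```

Under these assumptions the two quantities do not coincide even with many draws, which illustrates why averaging nonlinear-format data does not recover the log of the true relative copy number.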

As defined in Model 4, *X _{i,p}* is a random variable of the intensity of the ratios of the test to the reference given by the *i*th sample at probe position *p*:

$${X}_{i,p}=\frac{{T}_{i,p}+{\epsilon}_{i,p}}{{R}_{i,p}+{\eta}_{i,p}},\quad p=1,2,\dots ,M,\ i=1,2,\dots ,n,$$

where *i* denotes the *i*th sample/patient; *ε _{i,p}* and *η _{i,p}* are the measurement errors for the “test” and “reference” samples, respectively.

As stated in Section 2.2, we always assign *R _{i,p}* = 2, which is the information given by the “reference” genome. The true shared copy number

Let *x _{i,p}* be the observed values of *X _{i,p}*.

The estimation of *t _{p}*, *p* = 1, …, *M*, proceeds as follows:

- Step 1 Calculate the median of {*x _{i,p}*}_{i = 1, 2, …, n} for each *p,* denoted by *M _{p}*.
- Step 2 Calculate 2(*M _{p}* − 1 + π)/π for each *p.*
- Step 3 Determine the estimate of *t _{p}*, *p* = 1, …, *M*,
$$\begin{array}{l}{\widehat{t}}_{p}=[2({M}_{p}-1+\pi )/\pi ],\ \text{if}\ 2({M}_{p}-1+\pi )/\pi \le [2({M}_{p}-1+\pi )/\pi ]+0.5;\\ {\widehat{t}}_{p}=[2({M}_{p}-1+\pi )/\pi ]+1,\ \text{if}\ 2({M}_{p}-1+\pi )/\pi >[2({M}_{p}-1+\pi )/\pi ]+0.5,\end{array}$$
where [*c*] denotes the integer part of the real number *c*.

We call this 3-step method the “linear-median method”. “Linear” indicates that the data (the intensity of the ratios of the test to the reference) are in a linear format. “Median” indicates that the median of the data is employed by this method.
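The three steps can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function name `linear_median_estimate` and the toy input are hypothetical, and the integer part [*c*] is read as the floor of *c*, which is an assumption for negative values.

```python
import math
import statistics

def linear_median_estimate(samples, pi):
    """Estimate the shared copy number at each probe position.

    samples: list of n sequences, each holding M linear-format intensity
             ratios (test / reference, not log2-transformed).
    pi:      assumed proportion of samples carrying the shared copy number.
    """
    n_probes = len(samples[0])
    estimates = []
    for p in range(n_probes):
        # Step 1: median across samples at probe position p.
        m_p = statistics.median(sample[p] for sample in samples)
        # Step 2: rescale the median to the copy-number scale.
        c_p = 2.0 * (m_p - 1.0 + pi) / pi
        # Step 3: round to the nearest integer; a tie at .5 rounds down.
        # Assumption: the integer part [c] is taken as floor(c).
        base = math.floor(c_p)
        estimates.append(base if c_p <= base + 0.5 else base + 1)
    return estimates

# Toy input: 3 samples, 2 probe positions (made-up ratios).
toy = [[1.00, 1.52], [0.97, 1.45], [1.04, 1.60]]
print(linear_median_estimate(toy, pi=1.0))  # [2, 3]
```

With π = 1 the second step reduces to doubling the median, so medians of 1.00 and 1.52 map to copy numbers 2 and 3.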

Next, we explain theoretically why copy numbers can be accurately estimated by this 3-step method.

Let *X _{p}* be the intensity of the ratios of the test to the reference at probe position *p*, so that

$${X}_{p}=\frac{{T}_{p}+{\epsilon}_{p}}{2+{\eta}_{p}},$$

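where *ε _{p}* and *η _{p}* are independent *U*[−*a*, *a*] random variables with *a* ∈ (0, 2), and *T _{p}* equals *t _{p}* with probability π and 2 with probability 1 − π.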

Following the definition of *X _{p}* and assuming the independence of *T _{p}* + *ε _{p}* and *η _{p}*, we have

$$\begin{array}{l}E({X}_{p})=E\left(\frac{{T}_{p}+{\epsilon}_{p}}{2+{\eta}_{p}}\right)\\ =E({T}_{p}+{\epsilon}_{p})E\left(\frac{1}{2+{\eta}_{p}}\right)\\ =({t}_{p}\pi +2(1-\pi ))E\left(\frac{1}{2+{\eta}_{p}}\right)\\ =\frac{{t}_{p}\pi +2(1-\pi )}{2a}\text{log}\left(\frac{2+a}{2-a}\right).\end{array}$$

Thus

$${t}_{p}=\left(\frac{2a}{\text{log}\left(\frac{2+a}{2-a}\right)}E({X}_{p})-2(1-\pi )\right)/\pi .$$

(5)
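The two expectations used in deriving Equation (5) can be written out explicitly. This worked step assumes, as the closed form implies, that *η _{p}* is uniform on [−*a*, *a*] with 0 < *a* < 2, that *ε _{p}* has mean zero, and that *T _{p}* equals *t _{p}* with probability π and 2 otherwise:

$$E({T}_{p}+{\epsilon}_{p})=\pi {t}_{p}+2(1-\pi )+0,\qquad E\left(\frac{1}{2+{\eta}_{p}}\right)={\int}_{-a}^{a}\frac{1}{2+u}\cdot \frac{1}{2a}\,du=\frac{1}{2a}\text{log}\left(\frac{2+a}{2-a}\right).$$

For example, with *a* = 1 the second expectation is (1/2) log 3 ≈ 0.549, slightly above the value 1/2 that would be obtained if *η _{p}* were fixed at 0.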

Equation (5) gives the exact relationship between *t _{p}* and *E*(*X _{p}*).

However, *E*(*X _{p}*) is unknown in practice and the probability distribution of

To overcome this difficulty, we suggest the following way to evaluate *t _{p}*:

$$\begin{array}{c}{t}_{p}=\left(\frac{2a}{\text{log}\left(\frac{2+a}{2-a}\right)}E({X}_{p})-2(1-\pi )\right)/\pi \\ =\left(\frac{2a}{\text{log}\left(\frac{2+a}{2-a}\right)}\frac{E({X}_{p})}{{m}_{{X}_{p}}}{m}_{{X}_{p}}-2(1-\pi )\right)/\pi ,\end{array}$$

where *m*_{Xp} is the median of *X _{p}*. It is technically possible to directly evaluate the ratio

$$\frac{aE({X}_{p})}{\text{log}\left(\frac{2+a}{2-a}\right){m}_{{X}_{p}}}$$

(6)

and prove that the ratio is close to 1, for any *a* ∈ (0, 2) and any π ∈ (0, 1).

We use the Monte Carlo method to indirectly show that the value of (6) is close to 1 for *a* = 0.1, 0.2, …, 1.9 and π = 0.1, 0.2, …, 1 (see Appendix A and Supplementary Tables 1 and 2 in the online materials for details). Therefore,

$${t}_{p}\approx \frac{2({m}_{{X}_{p}}-(1-\pi ))}{\pi}.$$
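A small Monte Carlo check of ratio (6) can be sketched as follows. It assumes *ε* and *η* are independent *U*[−*a*, *a*] variables and that *T _{p}* equals *t _{p}* with probability π and 2 otherwise; the function name `ratio_6`, the seed, and the parameter grid are hypothetical choices, and 5000 draws per setting mirrors the number of samples described in Appendix A.

```python
import math
import random
import statistics

def ratio_6(a, pi, t_p, n_draws=5000, seed=0):
    """Monte Carlo estimate of a*E(X_p) / (log((2+a)/(2-a)) * median(X_p))."""
    rng = random.Random(seed)
    xs = []
    for _ in range(n_draws):
        # T_p equals t_p with probability pi and the neutral value 2 otherwise.
        T = t_p if rng.random() < pi else 2
        xs.append((T + rng.uniform(-a, a)) / (2.0 + rng.uniform(-a, a)))
    mean_x = statistics.fmean(xs)
    median_x = statistics.median(xs)
    return a * mean_x / (math.log((2.0 + a) / (2.0 - a)) * median_x)

# Print the ratio over a small illustrative grid of (a, pi) with t_p = 3.
for a in (0.5, 1.0, 1.5):
    for pi in (0.6, 1.0):
        print(f"a={a:.1f}  pi={pi:.1f}  ratio={ratio_6(a, pi, t_p=3):.3f}")
```

When π = 1 the median of *X _{p}* is exactly *t _{p}*/2 by the symmetry of the two errors, so the ratio equals 1 up to sampling error; for π < 1 it is only approximately 1.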

The linear-median method is designed for estimating **shared** copy number aberrations and mainly focuses on the information across the sample for each probe position. Therefore, this method ignores the dependency within each individual sample. Our focus is two-fold: i) to determine the extent of information of **shared** copy number aberrations that can be detected, regardless of the impact of dependency, and ii) to assess the differences in detection outcomes obtained from the linear-median method versus other methods.

In a recent review of methods for detecting “recurrent” copy number alterations, Rueda and Diaz-Uriarte evaluated the CGHregions method, Master HMMs, cghMCR, GISTIC, MSA, RAE, and others.^{12} In this subsection, we compare the linear-median method to the cghMCR method and the CBS method.

We present three simulation studies to highlight the performance of our proposed linear-median method.

**Example 1:** A sequence of integers

2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |

2 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |

1 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 3 | 3 | 3 | 3 | 3 | 4 | 4 | 4 | 4 | 4 |

5 | 5 | 5 | 5 | 5 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 3 | 3 | 3 | 3 | 3 |

1 | 1 | 1 | 1 | 1 | 3 | 3 | 3 | 3 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |

serves as a sequence of the true **shared** copy number *t _{p}*, *p* = 1, 2, …, 100.

We simulated a group of independent realizations {*X _{i,p}*} from model (4).

Subsequently, we generated 1000 replicates. For the *k*th replicate, *k* = 1, 2, …, 1000, let *d*(*k*) be the percentage of probe positions at which *t _{p}* − ${\widehat{t}}_{p}$ ≠ 0.

The sample mean and sample standard error of the estimated error rate {*d*(*k*)} given by different combinations of *a* and *n,* where *a* is the parameter of the uniform distribution *U*[−*a, a*] and *n* is the number of the independent sequences in the realizations. **...**

Table 1 shows that the error rate increases with *a.* This is expected because a larger value of *a* is equivalent to a larger measurement error in the data. However, the error rate will be reduced when the number of independent samples in the group increases. In general, the mean error rate calculated for the linear-median method is reasonably low: the mean error rate was less than 10%, as expected, for all three cases of varying *a.*

Although the underlying model involves the parameter *a*, Example 1 shows, in general, that the impact of the value of *a* on the estimation of the copy number is not significant in terms of the mean of *d*(*k*), except for a very large value of *a* (>1). (Further demonstrations are presented in the Supplementary Material.) In summary, the value of *a* ∈ (0, 2) has minimal effect on the estimation of the **shared** copy number when the sample size is reasonably large. As a result, the linear-median method can be employed without knowing the value of *a,* as long as *a* ∈ (0, 2).

**Example 2:** In Table 1 of their review of 15 estimation methods, Rueda and Diaz-Uriarte indicate that only the cghMCR method both uses an input of the log2 ratio and produces estimations of the differences in the states of two successive probes.^{12} The cghMCR method is designed to identify the minimal common copy number alteration regions among a group of independent samples; thus it is analogous to the linear-median method and is an appropriate method to compare to the linear-median method. Using segmented data (ie, smoothed data), the cghMCR algorithm first identifies altered segments within each subject (those above the 97th or below the 3rd percentile of the data) and then joins adjacent segments separated by a user-defined parameter. The R package for the cghMCR method is available at the following URL: http://www.bioconductor.org/packages/2.6/bioc/html/cghMCR.html. See the work of Aguirre et al for explicit details and a complete review of the cghMCR method.^{3}

We use simulated data to compare the performance of the linear-median method to that of the cghMCR method. The data were simulated by assuming independence between the intensity ratios across probe positions, which is a very simple situation.

Consider a sequence of true **shared** copy number {*t _{p}*} plotted in Figure 2.

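The sequence *t _{p}* consists of four abnormal regions: probe positions 11–50 (copy number 1), 99–102 (copy number 4), the single position 110 (copy number 5), and 151–200 (copy number 3).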

We simulated data from the following model

$${X}_{i,p}=\left\{\begin{array}{ll}\frac{2+{\epsilon}_{i,p}}{2+{\eta}_{i,p}},\hfill & 1\le p\le 10,\hfill \\ \frac{B{(1,\pi )}_{i,p}+2(1-B{(1,\pi )}_{i,p})+{\epsilon}_{i,p}}{2+{\eta}_{i,p}},\hfill & 11\le p\le 50,\hfill \\ \frac{2+{\epsilon}_{i,p}}{2+{\eta}_{i,p}},\hfill & 51\le p\le 98,\hfill \\ \frac{4B{(1,\pi )}_{i,p}+2(1-B{(1,\pi )}_{i,p})+{\epsilon}_{i,p}}{2+{\eta}_{i,p}},\hfill & 99\le p\le 102,\hfill \\ \frac{2+{\epsilon}_{i,p}}{2+{\eta}_{i,p}},\hfill & 103\le p\le 109,\hfill \\ \frac{5B{(1,\pi )}_{i,p}+2(1-B{(1,\pi )}_{i,p})+{\epsilon}_{i,p}}{2+{\eta}_{i,p}},\hfill & p=110,\hfill \\ \frac{2+{\epsilon}_{i,p}}{2+{\eta}_{i,p}},\hfill & 111\le p\le 150,\hfill \\ \frac{3B{(1,\pi )}_{i,p}+2(1-B{(1,\pi )}_{i,p})+{\epsilon}_{i,p}}{2+{\eta}_{i,p}},\hfill & 151\le p\le 200,\hfill \\ \frac{2+{\epsilon}_{i,p}}{2+{\eta}_{i,p}},\hfill & 201\le p\le 250,\hfill \end{array}\right.$$

(7)

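*i* = 1, 2, …, *n*, where *ε _{i,p}* and *η _{i,p}* are independent *U*[−*a*, *a*] random variables and *B*(1, π)_{i,p} is a Bernoulli random variable with success probability π.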

We applied the linear-median method and the cghMCR method to each group of independent samples with size *n* for different pairs of parameters (*a*, *π*), respectively. Then, for each triplet (*a*, π, *n*), we calculated the true positive (TP) rates and the false positive (FP) rates produced by each method. TP rate = *P*(the method shows “copy number changed” | copy number is changed). FP rate = *P*(the method shows “copy number changed” | copy number is not changed). The linear-median method is able to provide an estimate of the **shared** copy number at each probe position. Therefore, when we say that a correct detection of the **shared** copy number was produced by the linear-median method at position *p,* it means that ${\widehat{t}}_{p}$ = *t _{p}*.
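The TP and FP rates defined above can be computed from a true and an estimated copy number sequence as in the sketch below. The function name `tp_fp_rates`, the `exact_match` flag, and the toy sequences are hypothetical; the flag covers the stricter reading in which a detection counts only when the estimate equals the true copy number.

```python
def tp_fp_rates(true_cn, est_cn, neutral=2, exact_match=False):
    """Return (TP rate, FP rate) over probe positions.

    TP rate: share of truly changed positions (true_cn != neutral) that the
             method also reports as changed (or, with exact_match=True,
             estimates exactly).
    FP rate: share of truly unchanged positions reported as changed.
    """
    pairs = list(zip(true_cn, est_cn))
    changed = [(t, e) for t, e in pairs if t != neutral]
    unchanged = [(t, e) for t, e in pairs if t == neutral]
    if exact_match:
        hits = sum(1 for t, e in changed if e == t)
    else:
        hits = sum(1 for t, e in changed if e != neutral)
    false_calls = sum(1 for t, e in unchanged if e != neutral)
    tp_rate = hits / len(changed) if changed else float("nan")
    fp_rate = false_calls / len(unchanged) if unchanged else float("nan")
    return tp_rate, fp_rate

# Toy sequences (made up): three changed and three unchanged positions.
true_cn = [2, 2, 3, 3, 1, 2]
est_cn = [2, 3, 4, 2, 1, 2]
print(tp_fp_rates(true_cn, est_cn))                    # TP = 2/3, FP = 1/3
print(tp_fp_rates(true_cn, est_cn, exact_match=True))  # TP = 1/3, FP = 1/3
```

The looser definition suits a method that reports only gain/loss states, while the exact-match variant is only meaningful for a method that outputs integer copy numbers.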

Finally, we carried out 250 replicates for the case where *n* = 20; 100 replicates for the case where *n* = 50, and 50 replicates for the case where *n* = 100. The resulting TP and FP rates, means, and standard errors obtained from both methods are shown in Supplementary Tables 3–5.

In terms of the TP rates, the linear-median method worked reasonably well in each case and performed vastly better than the cghMCR method, which showed poor performance, especially when *a* was larger and π was smaller. In this particular example of a true **shared** copy number sequence, the cghMCR method tended to give a lower FP value, ie, it did not call as many gains/losses, and hence was very conservative. Compared to the cghMCR method, the linear-median method gave a lower FP value when *a* was not close to 2 or π was greater than 0.5. In summary, two advantages of using the linear-median method include:

- The ability to estimate the actual **shared** copy number at each position *p.* The estimation accuracy of the linear-median method is very high, as reflected by the values of the TP and FP rates.
- Better power in identifying shorter altered regions. For example, considering the data simulated from (7) with *a* = 1.5, π = 1 and *n* = 20, we can compare the means of the estimated copy numbers given by both methods. Since *a* = 1.5, the variance for *U*(−*a, a*) is relatively large and the simulated data involve a lot of random noise. By choosing π = 1, there is no variation in the true copy numbers shared across the independent samples. Technically, one expects that the linear-median method and the cghMCR method will perform at the same level. However, it turns out that the linear-median method dominates the cghMCR method. At almost every probe position, the sample mean and median of the estimated **shared** copy number given by the linear-median method were the same as the true **shared** copy number. In contrast, the cghMCR method did not accurately identify the gain/loss regions (see Supplementary Figures 1–3).

This simulation example (Example 2) illustrates that the cghMCR method performs very poorly in high-noise scenarios, for example, *a* = 1.5, and the cghMCR method is not robust for large values of *a.* We believe this is due to the fact that the cghMCR method performs segmentation and calling functions independently of one another, whereas the linear-median method borrows strength from all the samples.

**Example 3:** In this example we consider data *X _{i,p}*, simulated from the following model:

$${X}_{i,p}=\frac{{t}_{p}+{\epsilon}_{i,p}}{2+{\eta}_{i,p}}=\{\begin{array}{lll}\frac{2+{\epsilon}_{i,p}}{2+{\eta}_{i,p}}\hfill & 1\le p\le 100,\hfill & l=100\hfill \\ \frac{3+{\epsilon}_{i,p}}{2+{\eta}_{i,p}}\hfill & 101\le p\le 150,\hfill & l=50\hfill \\ \frac{4+{\epsilon}_{i,p}}{2+{\eta}_{i,p}}\hfill & 151\le p\le 152,\hfill & l=2\hfill \\ \frac{3+{\epsilon}_{i,p}}{2+{\eta}_{i,p}}\hfill & 153\le p\le 200,\hfill & l=48\hfill \\ \frac{1+{\epsilon}_{i,p}}{2+{\eta}_{i,p}}\hfill & 201\le p\le 202,\hfill & l=2\hfill \\ \frac{2+{\epsilon}_{i,p}}{2+{\eta}_{i,p}}\hfill & 203\le p\le 204,\hfill & l=2\hfill \\ \frac{1+{\epsilon}_{i,p}}{2+{\eta}_{i,p}}\hfill & 205\le p\le 300,\hfill & l=96\hfill \end{array}$$

where *ε _{i,p}* and *η _{i,p}* are independent *U*[−*a*, *a*] random variables and *l* is the length of each segment.

In this example, we compare the linear-median method to the circular binary segmentation (CBS) method, which was developed by Olshen et al.^{6} An R package description for the CBS method is available at the following URL: http://bioconductor.org/packages/2.6/bioc/manuals/DNAcopy/man/DNAcopy.pdf. The CBS method is employed to find segments along the chromosome that share constant DNA copy numbers. Technically, it is inappropriate to directly compare the analytical results obtained by these two methods because the CBS method is designed for application to a single sample of data, whereas the linear-median method is applicable to a group of independent samples.

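To apply the CBS algorithm to observations {*x _{i,p}*}, we computed the median of {log_{2}(*x _{i,p}*)} across samples at each probe position and applied the CBS method to the resulting sequence.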

Figure 3 shows the plot of the medians of {log_{2}(*x _{i,p}*)} and the estimate of log_{2}(*t*_{p}/2) given by the CBS method.

Application of the CBS method to the sequence of the median of the logarithm of the ratios (top panel). The red bars show the values of the estimation of log_{2}(*t*_{p}/2). Application of the linear-median method to the data in Example 3 (bottom panel), showing **...**

Comparing the plots in Figure 3, both approaches, the linear-median method and the CBS method, were able to detect all the longer regions of alterations. However, all the shorter regions of alterations, [151, 152], [201, 202] and [203, 204], were missed by the CBS method. This indicates that the linear-median method has more power than the CBS method to detect shorter segments of alterations or narrow gaps between segments.
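The linear-median half of this comparison can be reproduced in outline with the sketch below, which simulates data with the segment layout of Example 3 and checks the three short regions. The noise level `a`, the number of samples `n`, the seed, and the function name `linear_median` are assumptions chosen for illustration; the CBS half is not reproduced here.

```python
import math
import random
import statistics

def linear_median(samples, pi):
    """Three-step linear-median estimate for each probe position."""
    estimates = []
    for column in zip(*samples):                 # one column per probe position
        c = 2.0 * (statistics.median(column) - 1.0 + pi) / pi
        base = math.floor(c)                     # integer part, read as floor
        estimates.append(base if c <= base + 0.5 else base + 1)
    return estimates

# True shared copy numbers of Example 3: (segment value, segment length l).
segments = [(2, 100), (3, 50), (4, 2), (3, 48), (1, 2), (2, 2), (1, 96)]
true_cn = [value for value, length in segments for _ in range(length)]

rng = random.Random(1)
a, n = 0.5, 100      # assumed noise level and sample count (hypothetical)
samples = [
    [(t + rng.uniform(-a, a)) / (2.0 + rng.uniform(-a, a)) for t in true_cn]
    for _ in range(n)
]

est = linear_median(samples, pi=1.0)
mismatches = sum(e != t for e, t in zip(est, true_cn))
print("estimates at probes 151-152:", est[150:152])
print("estimates at probes 201-204:", est[200:204])
print("positions estimated incorrectly:", mismatches, "of", len(true_cn))
```

Because each probe position is estimated from its own column of samples, a region of length 2 is treated no differently from a region of length 100, which is why short segments are not smoothed away.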

We applied the linear-median method to a subset of aCGH data from 39 well-studied lung cancer cell lines. The data, originally published by Coe et al^{13} and Garnis et al^{14} are available for downloading from http://sigma.bccrc.ca/. For this study, we used data from only the subgroup with the largest sample size, that of non-small cell adenocarcinoma (NA), which included 18 samples.

As both the linear-median method and the cghMCR method are designed for application to multiple aCGH data, the sample size is a critical issue. Data with more independent samples are able to provide more information on the commonalities across all samples.

Accurately identifying the locations of copy number aberrations has many important medical applications. As far as we know, the cghMCR method is one of the methods used to estimate the **shared** copy number for multiple aCGH data. Many other methods give an estimation of only the probability of gain/loss at each probe position.^{4}^{,}^{13}

Information on the exact **shared** copy number(s) at each probe position is not available for the data we have analyzed (the NA data). Therefore, based on only the analytic outputs of the linear-median method and the cghMCR method, it is difficult for us to claim which method is better in terms of the accuracy of estimating the true copy numbers. As a result, we compared the similarities between the analytic outputs of the two methods and determined which method provides more information on the changes in the copy numbers in the NA data. As a reference for this comparison, we used the probability of gain/loss at each probe position that was reported by Shah et al.^{4}

The total number of probe positions in the NA data (chromosome 9) is 1249. Recalling Model 4 in Section 2.1, in order to estimate the **shared** copy numbers in a “test” DNA sequence, we need to know the parameter π. This type of information is also required for the cghMCR method. The value of π might be estimated based on the researcher’s empirical knowledge. For the NA data, empirical knowledge on the value of π is not available. Therefore, we applied the cghMCR method and the linear-median method to the data for different values of π, 0.2, 0.4, 0.6, 0.8 and 1. Then we compared the results from both methods and also compared those results to findings reported by Shah et al.^{4} We expected to find little difference in the results obtained from the three methods. Shah et al found a loss of the **shared** copy number in a significant portion of the NA data (see Figure 7 in their paper).^{4} However, for π = 0.4, 0.6, 0.8 or 1, both the cghMCR method and the linear-median method provided high proportions of neutral states, ie, where the **shared** copy number equals 2. Therefore, it is reasonable to use π = 0.2 when analyzing the NA data. We limit our report of the analytic results to the case where π = 0.2.

Combining all the results given by the linear-median method and the cghMCR method for π = 0.2, 0.4, 0.6, 0.8 and 1, we were able to identify a common trend in the outputs of the two methods for all probe positions as the value π moves from 1 to 0.2 (data not shown). For the NA data, both the linear-median method and the cghMCR method give neutral states to all probe positions when π is assigned as 1, with the exception of a few probe positions identified as gain/loss by the linear-median method. In our empirical study of the NA data, if a probe position *a* is more likely to lose copy number(s), then the **shared** copy number estimation given by both methods will decrease as π moves from 1 to 0.2; if a probe position *a* is more likely to gain copy number(s), then the **shared** copy number estimation given by both methods will increase as π moves from 1 to 0.2. One important phenomenon we observed from the outputs of the two methods is that once a probe position has been identified as having a **shared** copy number change when π = π_{0}, the observation remains the same for any π > π_{0}. Comparing the results of the two methods, we found that the estimation of the **shared** copy number at each probe position given by the cghMCR method is reluctant to change as the value of π decreases. In contrast, the linear-median method can show changes in the estimated **shared** copy number as π decreases. This may reflect the later detection of an aberration by the cghMCR method compared to the linear-median method when the true **shared** copy number at a probe position is gained/lost, and as the value of π decreases. Based on our analysis of the NA data, the linear-median method was able to report the estimated **shared** copy number at each probe position, whereas the cghMCR method reported only the state of the **shared** copy number, ie, whether there was a gain, loss or no change (neutral state), in the **shared** copy number.
To simplify the comparison between the results given by the two methods, we report only the gain, loss, or neutral states of the **shared** copy number for the linear-median method. A plot of the states for both methods is given in Figure 4. In the plot, we use “1”, “0” and “−1” to indicate a **shared** copy number gain, neutrality, or loss, respectively. We summarize the results as follows.
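The reduction from estimated copy numbers to the three plotted states can be sketched as follows; the function name `copy_number_state` and the example values are hypothetical.

```python
def copy_number_state(est_cn, neutral=2):
    """Map an estimated shared copy number to 1 (gain), 0 (neutral) or -1 (loss)."""
    if est_cn > neutral:
        return 1
    if est_cn < neutral:
        return -1
    return 0

# Made-up estimates for six probe positions.
states = [copy_number_state(c) for c in [2, 3, 1, 5, 2, 0]]
print(states)  # [0, 1, -1, 1, 0, -1]
```

This mapping discards the magnitude of the change, so an estimate of 5 and an estimate of 3 are both reported as a gain.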

The output of the linear-median adjusted method is shown in red and that of the cghMCR method is in green.

From probe positions 1 to 500 and 1235 to 1249, both the cghMCR method and the linear-median method provide similar results, except for some isolated probe positions. This is what we expect to find because our simulation studies demonstrated that the linear-median method can identify those isolated regions.

From probe positions 501 to 1234, the results obtained from the linear-median method and the cghMCR method are quite different. The cghMCR method claims that all the probe positions are neutral, in contrast to the findings of the linear-median method, which identifies gains/losses at these probe positions. One possible explanation for the large difference between the two sets of results in this probe region is that the π used in the estimation for this region may be too high. A lower value of π should be used to accurately estimate copy numbers in this interval. These results suggest that the parameter π might vary over sequences of NA data. If this is true, then detecting the change in π will be an interesting challenge for future studies.

Information on the true **shared** copy numbers for the NA data is not available; hence, we cannot be certain which method would best estimate the **shared** copy number variations in these data. However, through our comparison of the two methods and taking into account the results given by Shah et al^{4} we can claim that the linear-median method has some capability to reasonably estimate **shared** copy numbers in DNA sequences. As shown in our simulation studies, the linear-median method can easily identify isolated probe positions with **shared** copy number changes or short **shared** alternating segments. These changes are often missed by the cghMCR approach.

The 1249 probe sets we studied target the **shared** copy number status of 1262 genes present on chromosome 9.

In order to classify these genes as one of three general categories, we performed a search of the OMIM database (http://www.ncbi.nlm.nih.gov/omim). The three categories we used were “not related to/unknown cancer phenotype (NR/U)”, “cancer-related phenotype, except for lung cancer (CR)”, and “lung cancer-related phenotype (LCR)”. The results are presented in Tables 2 and 3. Identifying altered regions where important cancer-related genes are located aids the biological interpretation of our findings and works as an empirical form of validation. Detailed locations of the genes categorized as NR/U, CR and LCR are presented in Supplementary Appendix B. From Tables 2 and 3 we can see that the linear-median method is able to report more CR and LCR with copy number losses/gains than the cghMCR method.

Number of genes identified by the linear-median method (LM) and the cghMCR method in the regions of shared copy number aberrations with the status of copy number loss, neutrality or gain. NR/U is not cancer-related or unknown function phenotype, CR is **...**

List of lung cancer-related genes for each phenotypic group identified by the linear-median method (LM) and the cghMCR method.

The output of the linear-median method also provided additional information of interest. Focusing on the probe positions at which the estimated shared copy number given by the linear-median method was <1 or >3 when π = 0.2, we identified 145 such probe positions out of 1249 (see Figure 5). Among those 145 probe positions, 22 showed an estimated copy number ≥4 or ≤−1. These results provide a stronger warning of copy number aberrations, one that was not obtained from the cghMCR method.
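The screening described here amounts to two threshold filters applied to the vector of estimated shared copy numbers. A minimal R sketch follows, using a short made-up vector `est` in place of the 1249 real estimates; the values and the variable names `est`, `flagged` and `extreme` are illustrative assumptions, not output from the NA data.

```r
# Hypothetical estimated shared copy numbers at 10 probe positions (illustrative only)
est <- c(2, 2, 0, 4, 3, -1, 2, 5, 1, 2)

# Probe positions whose estimate is below 1 or above 3
flagged <- which(est < 1 | est > 3)

# Among those, positions with the more extreme estimates (>= 4 or <= -1)
extreme <- flagged[est[flagged] >= 4 | est[flagged] <= -1]

flagged  # 3 4 6 8
extreme  # 4 6 8
```

On real output, `length(flagged)` and `length(extreme)` would give the counts corresponding to the 145 and 22 probe positions reported above.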

We developed a new model for aCGH data analysis, the linear-median method, which estimates shared copy numbers in DNA sequences. Using simulated data, we found the linear-median method to be more powerful than the cghMCR method in terms of achieving a higher rate of true positives and a lower rate of false positives. In addition to estimating the common gain/loss of chromosome regions, the linear-median method estimates the number of DNA copies. In other words, analytic results produced by the linear-median method allow us to extract additional information on the tested DNA sequences. In particular, the linear-median method has the power to estimate common changes that appear at isolated single probe positions or very short regions. The only drawback of the linear-median method is that it ignores the dependency information in samples. However, based on our application of the proposed method to real data, we find that most information on shared copy number aberrations can be captured by the linear-median method using only the information across independent samples.

Use Monte Carlo method to indirectly show that the value of *aE*(*X _{p}*)

The simulation is conducted as follows. For each triplet (*a*, π, *t _{p}*), 5000 independent samples are simulated from model

$${X}_{p}\equiv {X}_{p}(a,\pi )=\frac{{T}_{p}+\epsilon}{2+\eta},$$

where random variables *T _{p}*, and

$$\frac{a{\overline{X}}_{p}(a,\pi )}{\text{log}\left(\frac{2+a}{2-a}\right)\mathit{\text{median}}({X}_{p})(a,\pi )}.$$

For each π and *t _{p}* fixed, the sample mean m(π,

$$\frac{a{\overline{X}}_{p}(a,\pi )}{\text{log}\left(\frac{2+a}{2-a}\right)\mathit{\text{median}}({X}_{p})(a,\pi )},\quad a=0.1,0.2,\dots ,1.9,$$

are calculated by the following formulae:

$$\begin{array}{c}m(\pi ,{t}_{p})=\sum _{a=0.1}^{1.9}\left(\frac{a{\overline{X}}_{p}(a,\pi )}{\text{log}\left(\frac{2+a}{2-a}\right)\mathit{\text{median}}({X}_{p})(a,\pi )}\right)/19,\\ {s}^{2}(\pi ,{t}_{p})={\sum _{a=0.1}^{1.9}\left(\frac{a{\overline{X}}_{p}(a,\pi )}{\text{log}\left(\frac{2+a}{2-a}\right)\mathit{\text{median}}({X}_{p})(a,\pi )}-m(\pi ,{t}_{p})\right)}^{2}/19,\end{array}$$

and reported in Tables 1 and 2, which follow, where *s*^{2}(π, *t*_{p}) is given within the parentheses.

The Monte Carlo simulation results clearly show that all the sample means *m*(π, *t _{p}*) are close to 1 and the sample variance

π | *t*_{p} = 1 | *t*_{p} = 2 | *t*_{p} = 3 | *t*_{p} = 4 | *t*_{p} = 5 | *t*_{p} = 6 | *t*_{p} = 7 | *t*_{p} = 8 | *t*_{p} = 9 |
---|---|---|---|---|---|---|---|---|---|
1 | 0.9988320 (1.204417e-05) | 0.9999922 (6.722244e-06) | 1.0006107 (5.334313e-06) | 0.9996400 (1.261637e-05) | 1.0010851 (1.167334e-05) | 0.9996429 (1.231912e-06) | 1.0007002 (5.472384e-06) | 0.9996422 (5.472957e-06) | 0.9995458 (4.414939e-06) |
0.9 | 1.0234726 (7.944477e-04) | 0.9999251 (3.122031e-06) | 0.9945141 (7.239544e-05) | 0.9874093 (2.153089e-04) | 0.9815765 (3.306295e-04) | 0.9784618 (4.723169e-04) | 0.9754699 (5.685863e-04) | 0.9728850 (5.824456e-04) | 0.9690245 (5.801032e-04) |
0.8 | 1.0387630 (2.886933e-03) | 0.9996188 (9.836780e-06) | 0.9892445 (2.590746e-04) | 0.9765723 (8.382101e-04) | 0.9673282 (1.415877e-03) | 0.9586696 (1.799211e-03) | 0.9522338 (2.116493e-03) | 0.9460126 (2.245196e-03) | 0.9428445 (2.510819e-03) |
0.7 | 1.0490667 (5.548408e-03) | 1.0001018 (1.432224e-05) | 0.9855943 (5.197429e-04) | 0.9663545 (1.709809e-03) | 0.9524253 (2.958586e-03) | 0.9407424 (3.912353e-03) | 0.930110 (4.538055e-03) | 0.9227174 (5.050342e-03) | 0.9165458 (5.479345e-03) |
0.6 | 1.0488169 (7.753325e-03) | 1.0010726 (6.911221e-06) | 0.9854699 (7.413494e-04) | 0.9623178 (2.893487e-03) | 0.9414670 (4.946303e-03) | 0.9257995 (6.682264e-03) | 0.9128190 (8.140030e-03) | 0.9026656 (9.115055e-03) | 0.8949812 (1.010583e-02) |
0.5 | 0.9976367 (3.009497e-03) | 1.0008558 (3.751868e-06) | 1.0051801 (1.132037e-03) | 1.0075940 (8.828197e-03) | 1.0563732 (3.143084e-02) | 1.0996647 (5.681301e-02) | 0.9949510 (3.069008e-02) | 1.0348440 (5.681301e-02) | 1.2778189 (3.069008e-02) |
0.4 | 0.9689243 (2.197164e-03) | 1.0004657 (8.269312e-06) | 1.0224327 (1.544194e-03) | 1.0774329 (1.112485e-02) | 1.1533009 (3.060863e-02) | 1.2460811 (5.815739e-02) | 1.3484609 (9.379170e-02) | 1.4610660 (1.286891e-01) | 1.5757287 (1.732840e-01) |
0.3 | 0.9690245 (1.446965e-03) | 1.0008585 (3.876559e-06) | 1.0242324 (1.226318e-03) | 1.0860159 (6.909656e-03) | 1.1647982 (1.727091e-02) | 1.2629289 (2.912726e-02) | 1.3679820 (4.057614e-02) | 1.4846238 (5.020234e-02) | 1.5987995 (5.951541e-02) |
0.2 | 0.9737273 (6.463392e-04) | 1.0001785 (3.456922e-06) | 1.0231539 (5.653691e-04) | 1.0743057 (3.026114e-03) | 1.1446959 (6.303903e-03) | 1.2239605 (9.262918e-03) | 1.3104238 (1.097795e-02) | 1.3991247 (1.232603e-02) | 1.4862704 (1.325612e-02) |
0.1 | 0.9836251 (1.448035e-04) | 0.9996371 (1.181245e-05) | 1.0143460 (1.537200e-04) | 1.0458579 (6.429722e-04) | 1.0869769 (1.166530e-03) | 1.1335024 (1.374409e-03) | 1.1808455 (1.518496e-03) | 1.2294815 (1.587512e-03) | 1.2743215 (1.669519e-03) |
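The simulation procedure described above can be sketched in R as follows. The distributions of *T*_{p}, ε and η are not fully recoverable from the text here, so this sketch assumes that *T*_{p} equals *t*_{p} with probability π and 2 otherwise, that ε is Gaussian with a small standard deviation, and that η is uniform on (−*a*, *a*). The function names `simulate_x`, `ratio_stat` and `summarise_ratio` are hypothetical. Because of these assumptions, the sketch is not expected to reproduce the tabulated values exactly.

```r
set.seed(1)  # fixes the random stream for this sketch only

# ASSUMED model components (not taken from the paper): T_p is t_p with
# probability pi_p and 2 otherwise; epsilon ~ N(0, 0.05^2); eta ~ Uniform(-a, a).
simulate_x <- function(a, pi_p, t_p, n_sim = 5000) {
  T_p <- ifelse(runif(n_sim) < pi_p, t_p, 2)
  eps <- rnorm(n_sim, mean = 0, sd = 0.05)
  eta <- runif(n_sim, min = -a, max = a)
  (T_p + eps) / (2 + eta)
}

# Ratio a * mean(X) / (log((2 + a) / (2 - a)) * median(X)) for one value of a
ratio_stat <- function(x, a) {
  a * mean(x) / (log((2 + a) / (2 - a)) * median(x))
}

# m(pi, t_p) and s^2(pi, t_p): mean and variance of the ratio over
# a = 0.1, 0.2, ..., 1.9, both with divisor 19 as in the formulae above
summarise_ratio <- function(pi_p, t_p) {
  a_grid <- seq(0.1, 1.9, by = 0.1)
  r <- sapply(a_grid, function(a) ratio_stat(simulate_x(a, pi_p, t_p), a))
  m <- sum(r) / length(a_grid)
  s2 <- sum((r - m)^2) / length(a_grid)
  c(m = m, s2 = s2)
}

res <- summarise_ratio(pi_p = 1, t_p = 3)
res
```

Calling `summarise_ratio` over a grid of π and *t*_{p} values would fill a table with the same layout as the one above.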

The plot of the mean of gains/losses obtained at each probe position using the cghMCR method.

The locations of the genes of NR/U, CR and LCR in non-small cell adenocarcinoma (NA) and related references.

- Probe positions from 1 to 295: A total of 200 genes are found in this region; 28 of them (14%) are related to a cancer phenotype, while 3 (1.5%) are related to a lung cancer phenotype. All LCR genes are located in chromosomal regions identified as losses by both methods (LM and cghMCR). The LCR genes located in this region are PSIP1, CDKN2A, and TUSC1. PSIP1 and CDKN2A, a well-known lung cancer suppressor,^{1} are both located in a region frequently found deleted in lung cancer patients.^{2} In addition, TUSC1 is found mutated and silent in non-small cell lung carcinoma cell lines.^{3}
- Probe positions from 296 to 331: A total of 12 NR/U genes are found in this region.
- Probe positions from 332 to 341: Only 3 genes are located in this region, one of which (ACO1) is classified as CR. Both methods identify the region where this gene is located as a loss.
- Probe positions from 342 to 375: A total of 113 genes are located in this region, 14 of which are classified as CR.
- Probe positions from 376 to 500: A total of 171 genes are located in this region. Four of them are CR and only one (IGFBPL1, classified as a loss by both methods) is classified as LCR. IGFBPL1 has already been shown to be downregulated in lung tumor samples.^{4}
- Probe positions from 501 to 1234: A total of 744 genes are located in this region, 90 of which are classified as CR and 9 as LCR. The cghMCR method does not identify any region containing LCR genes as altered. The LM method, on the other hand, identifies five of the LCR genes in chromosomal regions of loss (TLE1, FRMD3, DAPK1, MIRLET7A1, PTPN3); these genes are consequently expected to have lower expression in lung tumor samples. In fact, TLE1 is frequently found altered in squamous cell carcinomas and adenocarcinomas,^{5} while FRMD3 expression is usually silenced in primary non-small cell lung carcinomas.^{6} Likewise, mouse lung carcinoma clones characterized by highly aggressive metastatic behavior did not express Dapk1.^{7} Also, MIRLET7A1 and PTPN3 expression is downregulated in lung cancer.^{8,9} The LM method identifies one gene located in a gain region (GAS1), which is therefore expected to be overexpressed in lung cancer samples. Surprisingly, Gas1 expression is known for its capacity to suppress metastasis in lung;^{10} therefore, we hypothesize that this gene might be regulated epigenetically or that it is a false positive identified by the LM method. Again, the cghMCR method identifies this region as neutral. In addition, 3 genes are found by both methods in neutral regions (PHF19, DAB2IP, RPL12) and, therefore, we believe that they are regulated by epigenetic factors. In fact, PHF19 mRNA is known to be overexpressed in lung cancers,^{9} and methylation of the DAB2IP promoter is associated with the lung cancer phenotype.^{11} Likewise, RPL12 splice variants are frequently found in human lung carcinoma cells.^{12}
- Probe positions from 1235 to 1249: A total of 17 genes are located in this region, with only one of them (ABL1) being classified as CR and identified as a gain by both methods.

The plot of the mean of copy numbers obtained at each probe position using the linear-median method.

The plot of the median of copy numbers obtained at each probe position using the linear-median method.

- *x* is an *n* × *T* matrix; the elements of *x* are aCGH observations in linear format
- *n* denotes the number of independent samples
- *T* denotes the size of each individual sample
- At any probe position *p*, if the true shared copy number is not 2, the probability of having the copy number changed is “prob”
- Function “Linear_Median” gives the estimate of the shared copy number at each probe position.

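Linear_Median = function(x, n, T, prob) {

# x is an n x T matrix: rows are samples, columns are probe positions

# Median over the n samples at each of the T probe positions

medianx = apply(x, 2, median)

# Linear transform of the median into a copy number on the continuous scale

justx = 2 * (medianx - 1 + prob) / prob

# Round to the nearest integer, with halves rounded up

xx = floor(justx + 0.5)

xx

}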
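The formula inside the function is the algebraic inverse of the linear relation median ≈ 1 − prob + prob · *t*/2, where *t* is the shared copy number, so that *t* ≈ 2(median − 1 + prob)/prob. The R sketch below applies a vectorised variant to a small toy matrix. The name `linear_median_sketch`, the toy data and the choice prob = 1 are illustrative assumptions. Rows are taken to be samples and columns probe positions, following the *n* × *T* description above.

```r
# Hypothetical vectorised variant of the estimator (not the authors' listing)
linear_median_sketch <- function(x, prob) {
  med <- apply(x, 2, median)           # median over samples at each probe position
  est <- 2 * (med - 1 + prob) / prob   # solve med = 1 - prob + prob * t / 2 for t
  floor(est + 0.5)                     # nearest integer, halves rounded up
}

# Toy data: 5 samples (rows) x 4 probes (columns) of linear-scale ratios,
# scattered around (copy number) / 2 for copy numbers 2, 3, 1, 2
x <- rbind(
  c(1.00, 1.50, 0.50, 1.00),
  c(1.05, 1.45, 0.55, 0.95),
  c(0.95, 1.55, 0.45, 1.05),
  c(1.10, 1.40, 0.60, 0.90),
  c(0.90, 1.60, 0.40, 1.10)
)

shared_cn <- linear_median_sketch(x, prob = 1)
shared_cn  # 2 3 1 2
```

With prob = 1 the transform reduces to twice the median, which makes the toy result easy to verify by hand. Smaller values of prob stretch the deviation of the median from 1 before rounding.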

π (n = 20) | Rate | L-M (α = 0.5) | cghMCR (α = 0.5) | L-M (α = 1) | cghMCR (α = 1) | L-M (α = 1.5) | cghMCR (α = 1.5) |
---|---|---|---|---|---|---|---|
0.2 | TP | 0.6382 (0.0496) | 0.0714 (0.1101) | 0.7568 (0.0406) | 0.0024 (0.0188) | 0.8096 (0.0414) | 0 (0) |
0.2 | FP | 0.3785 (0.0384) | 0.0040 (0.0154) | 0.6549 (0.0384) | 2.83e-04 (0.0041) | 0.7657 (0.0357) | 0 (0) |
0.4 | TP | 0.7849 (0.0429) | 0.6760 (0.1830) | 0.7696 (0.0413) | 0.0308 (0.0779) | 0.7616 (0.0453) | 0 (0) |
0.4 | FP | 0.0861 (0.0248) | 0.0415 (0.0302) | 0.3827 (0.0402) | 0.0011 (0.0081) | 0.5611 (0.0408) | 0 (0) |
0.6 | TP | 0.9503 (0.0227) | 0.9075 (0.0224) | 0.8708 (0.0359) | 0.2759 (0.1129) | 0.8013 (0.0410) | 0 (0) |
0.6 | FP | 0.0122 (0.0090) | 2.58e-05 (0.0004) | 0.2000 (0.0310) | 0.0023 (0.0114) | 0.3905 (0.0419) | 0 (0) |
0.8 | TP | 0.9966 (0.0060) | 0.9030 (0.0206) | 0.9451 (0.0204) | 0.3877 (0.1308) | 0.8677 (0.0331) | 0 (0) |
0.8 | FP | 0.0013 (0.0028) | 0 (0) | 0.0238 (0.0917) | 0 (0) | 0.2617 (0.0358) | 0 (0) |
1 | TP | 1 (0) | 0.9490 (0.0147) | 0.9817 (0.0147) | 0.6542 (0.1561) | 0.9237 (0.0287) | 0 (0) |
1 | FP | 7.74e-05 (0.0007) | 0 (0) | 0.04026 (0.0154) | 0 (0) | 0.1667 (0.0314) | 0 (0) |

π (n = 50) | Rate | L-M (α = 0.5) | cghMCR (α = 0.5) | L-M (α = 1) | cghMCR (α = 1) | L-M (α = 1.5) | cghMCR (α = 1.5) |
---|---|---|---|---|---|---|---|
0.2 | TP | 0.6309 (0.0547) | 0.02442 (0.0626) | 0.7147 (0.0499) | 0 (0) | 0.7521 (0.0455) | 0 (0) |
0.2 | FP | 0.1712 (0.0346) | 6.45e-04 (0.0065) | 0.4866 (0.0488) | 0 (0) | 0.6425 (0.0437) | 0 (0) |
0.4 | TP | 0.8895 (0.0347) | 0.6643 (0.1542) | 0.8574 (0.0357) | 0.0019 (0.0109) | 0.7975 (0.0420) | 0 (0) |
0.4 | FP | 0.0089 (0.0070) | 0.0439 (0.0297) | 0.1737 (0.0365) | 0 (0) | 0.3603 (0.0416) | 0 (0) |
0.6 | TP | 0.9949 (0.0072) | 0.9046 (0.0149) | 0.9581 (0.0212) | 0.2762 (0.0842) | 0.8926 (0.0358) | 0 (0) |
0.6 | FP | 6.45e-05 (0.0006) | 0 (0) | 0.0482 (0.0189) | 0 (0) | 0.1814 (0.0364) | 0 (0) |
0.8 | TP | 1 (0) | 0.8962 (0.0118) | 0.9912 (0.0100) | 0.3384 (0.0416) | 0.9545 (0.0209) | 0 (0) |
0.8 | FP | 0 (0) | 0 (0) | 0.0100 (0.0082) | 0 (0) | 0.0826 (0.0238) | 0 (0) |
1 | TP | 1 (0) | 0.9207 (0.0154) | 0.9992 (0.0029) | 0.4155 (0.1679) | 0.9848 (0.0107) | 0 (0) |
1 | FP | 0 (0) | 0 (0) | 0.0023 (0.0038) | 0 (0) | 0.0348 (0.0153) | 0 (0) |

π (n = 100) | Rate | L-M (α = 0.5) | cghMCR (α = 0.5) | L-M (α = 1) | cghMCR (α = 1) | L-M (α = 1.5) | cghMCR (α = 1.5) |
---|---|---|---|---|---|---|---|
0.2 | TP | 0.6771 (0.0539) | 0.0048 (0.0203) | 0.7438 (0.0505) | 0 (0) | 0.7335 (0.0461) | 0 (0) |
0.2 | FP | 0.0561 (0.0187) | 0 (0) | 0.3266 (0.0412) | 0 (0) | 0.5146 (0.0381) | 0 (0) |
0.4 | TP | 0.9566 (0.0233) | 0.6650 (0.1317) | 0.9299 (0.02653) | 0.0004 (0.0030) | 0.8718 (0.0341) | 0 (0) |
0.4 | FP | 0.0003 (0.0013) | 0.0455 (0.0270) | 0.0578 (0.0196) | 0 (0) | 0.2012 (0.0340) | 0 (0) |
0.6 | TP | 0.9998 (0.0015) | 0.9033 (0.0108) | 0.9920 (0.01010) | 0.2804 (0.0706) | 0.9556 (0.0239) | 0 (0) |
0.6 | FP | 0 (0) | 0 (0) | 0.0065 (0.0075) | 0 (0) | 0.0621 (0.0224) | 0 (0) |
0.8 | TP | 1 (0) | 0.8956 (0.0087) | 0.9998 (0.0015) | 0.3345 (0.0193) | 0.9922 (0.0099) | 0 (0) |
0.8 | FP | 0 (0) | 0 (0) | 0.0003 (0.0013) | 0 (0) | 0.0167 (0.0125) | 0 (0) |
1 | TP | 1 (0) | 0.8971 (0.0177) | 0.9998 (0.0015) | 0.2263 (0.1156) | 0.9983 (0.0044) | 0 (0) |
1 | FP | 0 (0) | 0 (0) | 0.0003 (0.0013) | 0 (0) | 0.0043 (0.0056) | 0 (0) |

1. Kamb A, Gruis NA, Weaver-Feldhaus J, et al. A cell cycle regulator potentially involved in genesis of many tumor types. Science. 1994;264:436–40.

2. Singh DP, Kimura A, Chylack LT Jr, Shinohara T. Lens epithelium-derived growth factor (LEDGF/p75) and p52 are derived from a single gene by alternative splicing. Gene. 2000;242:265–73.

3. Shan Z, Parker T, Wiest JS. Identifying novel homozygous deletions by microsatellite analysis and characterization of tumor suppressor candidate 1 gene, TUSC1, on chromosome 9p in human lung cancer. Oncogene. 2004;23:6612–20.

4. Cai Z, Chen HT, Boyle B, Rupp F, Funk WD, Dedera DA. Identification of a novel insulin-like growth factor binding protein gene homologue with tumor suppressor like properties. Biochem Biophys Res Commun. 2005;331:261–6.

5. Allen T, van Tuyl M, Iyengar P, et al. Grg1 acts as a lung-specific oncogene in a transgenic mouse model. Cancer Res. 2006;66:1294–301.

6. Haase D, Meister M, Muley T, et al. FRMD3, a novel putative tumour suppressor in NSCLC. Oncogene. 2007;26:4464–8.

7. Inbal B, Cohen O, Polak-Charcon S, et al. DAP kinase links the control of apoptosis to metastasis. Nature. 1997;390:180–4.

8. Johnson SM, Grosshans H, Shingara J, et al. RAS is regulated by the let-7 microRNA family. Cell. 2005;120:635–647.

9. Gobeil S, Zhu X, Doillon CJ, Green MR. A genome-wide shRNA screen identifies GAS1 as a novel melanoma metastasis suppressor gene. Genes Dev. 2008;22:2932–40.

10. Wang Z, Shen D, Parsons DW, et al. Mutational analysis of the tyrosine phosphatome in colorectal cancers. Science. 2004;304:1164–6.

11. Yano M, Toyooka S, Tsukuda K, et al. Aberrant promoter methylation of human DAB2 interactive protein (hDAB2IP) gene in lung cancers. Int J Cancer. 2005;113:59–66.

12. Cuccurese M, Russo G, Russo A, Pietropaolo C. Alternative splicing and nonsense-mediated mRNA decay regulate mammalian ribosomal gene expression. Nucleic Acids Res. 2005;33:5965–77.

V. Baladandayuthapani was partially supported by US National Science Foundation grant IIS 0914861. K.-A. Do was partially supported by the University of Texas SPORE grants in Prostate Cancer P50 CA140388, Breast Cancer P50 CA116199, Brain Cancer P50 CA127001, and the Cancer Center Support Grant P30 CA016672. We would also like to acknowledge LeeAnn Chastain (UTMDACC) for her editorial contributions to the manuscript.

**Disclosure**

This manuscript has been read and approved by all authors. This paper is unique and is not under consideration by any other publication and has not been published elsewhere. The authors and peer reviewers of this paper report no conflicts of interest. The authors confirm that they have permission to reproduce any copyrighted material.

1. Cappuzzo F, Hirsch FR, Rossi E, et al. Epidermal growth factor receptor gene and protein and gefitinib sensitivity in non-small-cell lung cancer. J Nat Cancer Inst. 2005;97:643–55.

3. Aguirre AJ, Brennan C, Bailey G, et al. High-resolution characterization of the pancreatic adenocarcinoma genome. Proc Nat Acad Sci U S A. 2004;101:9067–72.

4. Shah SP, Xuan X, deLeeuw RJ, et al. Integrating copy number polymorphisms into array CGH analysis using a robust HMM. Bioinformatics. 2006;22:e431–9.

5. Venkatraman ES, Olshen AB. A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics. 2007;23:657–63.

6. Olshen AB, Venkatraman ES, Lucito R, Wigler M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics. 2004;5:557–72.

7. Molinaro AM, van der Laan MJ, Moore DH. Comparative Genomic Hybridization Array Analysis. U.C. Berkeley Division of Biostatistics Working Paper Series. 2002. Working Paper 106. http://www.bepress.com/ucbbiostat/paper106.

8. Guha S, Li Y, Neuberg D. Bayesian hidden Markov modeling of array CGH data. J Am Stat Assoc. 2008;103:485–97.

9. Pinkel D, Albertson DG. Comparative genomic hybridization. Ann Rev Genom Hum Genet. 2005;6:331–54.

10. Pinkel D, Albertson DG. Array comparative genomic hybridization and its application in cancer. Nat Genet. 2005;37(Suppl):S11–7.

11. Pinkel D, Davis R, Albertson D. Detection of gene dosage abnormalities using comparative genomic hybridization. 2005. http://cancer.ucsf.edu/array/nccls_pinkel.pdf.

12. Rueda OM, Diaz-Uriarte R. Finding recurrent copy number alteration regions: a review of methods. Current Bioinformatics. 2010;5:1–17.

13. Coe BP, Lockwood WW, Girard L, et al. Differential disruption of cell cycle pathways in small cell and non-small cell lung cancer. Br J Cancer. 2006;94:1927–35.

14. Garnis C, Lockwood WW, Vucic E, et al. High resolution analysis of non-small cell lung cancer cell lines by whole genome tiling path array CGH. Int J Cancer. 2006;118:1556–64.

Articles from Cancer Informatics are provided here courtesy of **Libertas Academica**
