Home | About | Journals | Submit | Contact Us | Français |

**|**Bioinformatics**|**PMC2639073

Formats

Article sections

- Abstract
- 1 INTRODUCTION
- 2 ALIGNMENT PROCEDURES
- 3 MISALIGNMENT
- 4 AN EXAMPLE
- 5 DISCUSSION
- Supplementary Material
- REFERENCES

Authors

Related links

Bioinformatics. 2009 February 1; 25(3): 315–321.

Published online 2008 December 3. doi: 10.1093/bioinformatics/btn624

PMCID: PMC2639073

Stan Pounds,^{1,}^{*} Cheng Cheng,^{1} Charles Mullighan,^{2} Susana C. Raimondi,^{2} Sheila Shurtleff,^{2} and James R. Downing^{2}

*To whom correspondence should be addressed.

Associate Editor: Trey Ideker

Received 2008 September 16; Revised 2008 November 14; Accepted 2008 November 29.

Copyright © The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

This article has been cited by other articles in PMC.

A new procedure to align single nucleotide polymorphism (SNP) microarray signals for copy number analysis is proposed. For each individual array, this *reference alignment procedure* (RAP) uses a set of selected markers as internal references to direct the signal alignment. RAP aligns the signals so that each array has a similar signal distribution among its reference markers. An accompanying reference selection algorithm (RSA) uses genotype calls and initial signal intensities to choose two-copy markers as the internal references for each array. After RSA and RAP are applied, each array has a similar distribution of signals of two-copy markers so that across-array signal comparisons are biologically meaningful. An upper bound for a statistical metric of signal misalignment is derived and provides a theoretical basis to choose RSA-RAP over other alignment procedures for copy number analysis of cancers. In our study of acute lymphoblastic leukemia, RSA-RAP gives copy number analysis results that show substantially better concordance with cytogenetics than do two other alignment procedures.

**Availability:** Documented R code is freely available from www.stjuderesearch.org/depts/biostats/refnorm.

**Contact:** stanley.pounds/at/stjude.org

**Supplementary information:** Supplementary data are available at *Bioinformatics* online.

Normalization is a critical component of the analysis of microarray data. Numerous methods have been proposed for normalization. Steinhoff and Vingron (2006) reviewed several of these methods. Broadly speaking, normalization consists of two major steps: (i) signal summarization and (ii) signal alignment. Signal summarization reduces the microarray image data into a set of signals representing the image intensity for each microarray probe set or marker. Signal alignment transforms the summary signals of each array so that comparisons of signals across arrays are biologically meaningful.

Signal alignment procedures transform the summary signals so that all arrays have a similar value for some characteristic of the signal distribution. For example, some methods transform the summary signals so that each array has the same mean, trimmed mean or median signal across the entire genome. Quantile normalization ensures that the quantiles of the signal distribution are matched across arrays. Such *global-alignment* methods transform the data so that each array has a similar distribution of signals among all probe sets or markers. Global alignment methods have been widely used with much success for gene expression microarray studies.

However, we observed that global-alignment methods were unsuitable for our study of copy number alterations in acute lymphoblastic leukemia (ALL; Mullighan *et al*., 2007). Cytogenetics data indicate the proportion of the genome that is amplified, deleted or unaltered varies markedly from tumor to tumor (Raimondi *et al*., 1998). Some tumors have multiple full-chromosome amplifications or deletions, while others have abnormalities involving only one or two cytobands. For the control arrays (i.e. from non-cancerous tissue specimens), most markers are presumably in the typical two-copy state. It did not seem sensible to align the signal distribution among all markers across arrays because the copy number distribution among all markers varies substantially between tumors and controls and from tumor to tumor.

Therefore, we developed a reference alignment procedure (RAP) to align the signals so that each array has a similar signal distribution among the markers in the two-copy state. In the course of our study, we observed that RAP greatly improved the accuracy of our copy number analysis (Supplementary Materials, Fig. S1). Here, we describe RAP in detail and the statistical principles underlying its success. In Section 2, we describe two widely used alignment procedures, RAP, and a reference selection algorithm (RSA) that accompanies RAP. In Section 3, we propose statistical metrics of misalignment, derive an upper bound for one of those metrics, and describe how the bound provides a theoretical basis for the success of RAP in our application. Section 4 compares the performance of RAP to that of other alignment procedures in the context of our application. Finally, the discussion is given in Section 5.

For *i*=1,…,*n* arrays and *j*=1,…,*m* markers, let *y*_{ij} represent the unaligned signal value for array *i* and marker *j*. Also, for each *i*, let (*j*) index the markers so that the signal values for array *i* are arranged in increasing order of the unaligned signal values, i.e. *y*_{i(1)}≤*y*_{i(2)}≤···≤*y*_{i(m)}. The QA signals are given by

(1)

The QA signals satisfy the property for all *i*, *i*′ and (*j*). The QA method has been used in gene expression studies with much success (Bolstad *et al*., 2003). Subsequently, QA has been implemented in several software packages to prepare single nucleotide polymorphism (SNP) microarray data for copy number analysis.

Li and Wong (2001) proposed ‘invariant-set normalization’ as a procedure to align the probe set signals. In this article, the procedure is called ‘invariant s*et al*ignment’ (ISA) to clarify that this is a signal alignment procedure and not a signal summarization procedure. For each array *i*, define

(2)

where I(·) is the indicator function that equals 1 if the enclosed statement is true and equals 0 otherwise. By definition, *q*_{i(j)} is the (adjusted) proportion of all signals *y*_{i(1)}, *y*_{i(2)},…,*y*_{i(m)} less than or equal to *y*_{i(j)} for each *i*. Let *q*_{ij} correspond to the values of (2) in the original ordering. Let *i*_{0} be the index of a selected *baseline array*. For each other array *i*, let

(3)

For array *i*, the rank-invariant set is given by the set **J**_{i} of *j* such that *d*_{ij} is less than or equal to a chosen threshold τ. The ISA signals are given by

(4)

where is obtained by fitting a local regression curve to the pairs (*y*_{ij}, *y*_{i0j}) for all *j* in **J**_{i}.

For purposes of copy number analysis, it is clearly advantageous to align the signal distributions of the two-copy markers across arrays. If the two-copy markers' signal distributions are quite dissimilar, then the across-array comparisons will not be biologically meaningful. RAP is specifically designed to achieve the objective of aligning the signal distribution of two-copy markers across arrays.

RAP requires that a set of markers be specified to serve as internal references for each array. Each array may have a distinct set of reference markers or all arrays may have the same set of reference markers. Given the specified sets of reference markers, RAP aligns the signals so that each array has a similar signal distribution among its selected reference markers. Section 2.4 describes strategies to select reference markers that represent the two-copy state. RAP ensures that each array with a well-chosen set of reference markers has similar distribution of aligned signals among two-copy markers. Section 3 discusses this desirable property of RAP in greater detail.

For *i*=1,…,*n* and *j*=1,…,*m*, let *r*_{ij}=1 if marker *j* is selected as an internal reference marker for array *i* and let *r*_{ij}=0 otherwise. Also, for each *i*, let **R**_{i} be the set of *j* such that *r*_{ij}=1. Additionally, for each *i*, define *r*_{i·}=∑_{j=1}^{m}*r*_{ij}, which is the number of reference markers for array *i*. Now, define

(5)

For each array *i*, *u*_{i(j)} is the (adjusted) proportion of *reference* markers with signal value less than or equal to *y*_{i(j)}. For each (*j*) in **R**_{i}, there is a pair (*q*_{i(j)}, *u*_{i(j)}) where *q*_{i(j)} is the proportion of *all* markers with signal value less than or equal to *y*_{i(j)} as defined in (2). Also, for each *i*, let be defined by the linear interpolation of the series of points (0,0), (*q*_{i(j)},*u*_{i(j)}) for (*j*) in **R**_{i}, and (1,1). Note that maps the adjusted proportion of *all* markers with unaligned signal less than or equal to *x* to a value approximately equal to the adjusted proportion of *reference* markers with unaligned signal less than or equal to *x*. The adjustment of *u*_{i(j)} by 1/2*r*_{i·} in (5) and the adjustment of *q*_{i(j)} by 1/2*m* in (2) improve the performance of transformation in the tails of the distribution. The adjustments ensure that 0<*u*_{i(j)}<1, 0<*q*_{i(j)}<1, and for all *i* and *j*. Also, for each *i*, satisfies the properties of a cumulative distribution function (CDF) for *x* in the interval (0,1). Thus, is approximately equal to the empirical distribution function (EDF) of the unaligned signals among markers in **R**_{i} evaluated at *y*_{i(j)}.

The RAP signals are given by

(6)

where *G*^{−1}(·) is the quantile function of a selected *target distribution*. The target distribution may be specified as a well-known distribution (such as the standard normal distribution) or may be empirically defined as in Equation (1). For all *i* and *i*′, the distribution of *y*_{i(j)} among markers in **R**_{i} is similar to that of *y*_{i′(j)} among markers in **R**_{i′} by the definition of and Equations (2), (5) and (5). Thus, for each pair *i* and *i*′ with references chosen so that **R**_{i} and **R**_{i′} represent the signal distribution of the two-copy markers of their respective arrays, the RAP signals will have similar distribution among two-copy markers for each array.

Clearly, the performance of RAP depends on the selection of the reference markers. We have found two reference selection strategies that lead to reliable results. In our study (Mullighan *et al*., 2007), we used cytogenetics data for our tumor specimens to select reference markers for each array. For each array *i*, we set *r*_{ij}=0 for each marker *j* on chromosome X, Y or a chromosome with a cytogenetically detected abnormality in specimen *i* and set *r*_{ij}=1 for all other markers. This strategy eliminated any chromosome with large abnormalities from the set of reference markers and thereby ensured that the reference markers were predominantly representative of the two-copy state.

Additionally, we have developed a RSA that may be used to computationally select reference markers when auxiliary information on copy number abnormalities is not available. RSA uses the unaligned signal values and genotype calls of the SNP microarray data. In our study, the results obtained with RSA-selected references are very similar to those obtained with cytogenetically selected references (see Section 4).

RSA performs the following operations separately for each array *i*:

- Eliminate the sex chromosomes from consideration to avoid selecting them as a reference chromosome. The copy number of the sex chromosome varies across gender and thus should not be included in the reference set.
- For each remaining chromosome
*c*, compute the proportion*h*_{ic}of markers with a heterozygous genotype call. - Eliminate chromosomes with
*h*_{ic}less than a specified threshold η. This eliminates any chromosome with extensive deletion as indicated by an absence of heterozygosity. Data from the HapMap project (www.hapmap.org) indicate that η=0.15 is a very good choice for this threshold (Supplementary Material, Fig. S2). - For each remaining chromosome
*c*, compute the mean and the variance*s*^{2}_{ic}of the unaligned signals*y*_{ij}. - Compute the ranks of and the ranks of
*s*^{2}_{ic}among the remaining chromosomes. (In each set of ranks, assign rank 1 to the smallest value.) - For each remaining chromosome
*c*, compute the sum*S*_{ic}of the rank of the mean and the rank of the variance*s*^{2}_{ic}. - Select the chromosome(s) with the smallest
*S*_{ic}to be the reference chromosome(s). Among chromosomes with an acceptable heterozygosity rate, selecting for a small mean signal avoids including amplified chromosomes as a reference chromosome. Also, small signal variance indicates that the chromosome is homogeneous in terms of copy number. - For each
*j*, set*r*_{ij}=1 if marker*j*is located on a reference chromosome and set*r*_{ij}=0, otherwise.

Empirically, RSA has a very good probability of a successful selection. In our application, RSA selected a chromosome with no cytogenetically detectable abnormality for 225 of 242 (93.0%) tumor samples (95% CI 89.4%, 95.2%). For 12 of the remaining 17 tumors, the cytogenetic abnormality was either a balanced translocation or a deletion/amplification involving only one cytoband. Thus, the selected chromosome was still predominantly representative of the two-copy state. If the selection is considered successful for these 12 tumors, then the selection is successful for 237 of 242 (97.9%) tumors (95% CI 95.2%, 99.3%). Subsequently, copy number analysis results based on RSA and RAP were generally more concordant with cytogenetics than were results based on QA or ISA (see Section 4).

Here, we formally introduce two metrics of the misalignment of signal distributions for markers in the two-copy state. Let *k*_{ij} be the true copy number state of marker *j* for array *i*. Also, let *m*_{i2}=∑_{j=1}^{m} I(*k*_{ij}=2) be the number of markers in the two-copy state for array *i*. For each array *i*, the *internal misalignment* (IMA) of RAP signals is defined as

(7)

where and

(8)

is the proportion of two-copy markers with RAP signal value less than or equal to *x*.

For each array *i*, the IMA measures how much the distribution of RAP signals among reference markers differs from the distribution of RAP signals among two-copy markers. Clearly, by definition. By the definition of RAP, *G*(·) is approximately equal to the EDF of the reference markers' RAP signals. By definition (8), is the EDF of the two-copy markers' signals. By the probability integral transform, if and only if for all *x*. In the limiting case that the two distributions have no overlap, i.e. there exists some *x* such that and *G*(*x*)=0 (or vice-versa), .

For each pair of arrays *i*_{1} and *i*_{2}, define the *pairwise misalignment* of RAP signals as

(9)

The pairwise misalignment measures how much the signal distribution of the two-copy markers of array *i*_{1} differs from that of the two-copy markers of array *i*_{2}. Note that if and only if for all *x* because the supports of and are each a subset of the support of *G*(·). The condition regarding the supports holds because all RAP signals are mapped by *G*^{−1}(·) in the Equation (6). Additionally, only in the limiting case that the distributions *F*_{i1} and *F*_{i2} have no overlap. Thus, a large value of the pairwise misalignment indicates that comparisons of the signals across the two arrays are not biologically meaningful because the two-copy markers' signals are not on the same scale.

Importantly, for any pair of arrays *i*_{1} and *i*_{2}, the pairwise misalignment is bounded above by the average of the two arrays' IMAs. By the triangle inequality, (7) and (9) imply that

(10)

Therefore, in practice, selecting references for array *i*_{1} and *i*_{2} in such a way that each array's internal alignment is very small ensures that the pairwise misalignment is also small. This observation motivates the use of array-specific reference selection methods (such as RSA) in practice. Improving the reference selection for each array (in the sense that the IMA is reduced) also reduces the upper bound for each pairwise misalignment.

The upper bound of (10) is applicable to QA and ISA signals as well because IMA and pairwise misalignment can be analogously defined. The IMA measures the difference between the distributions of the signals of the two-copy markers and the markers used to determine the alignment transformation. The pairwise misalignment measures the difference between two arrays' distributions of two-copy markers' signals. Thus, the implicit reference selection strategy of an alignment procedure is quite relevant in practice.

Mullighan *et al*., (2007) used Affymetrix SNP microarrays to study copy number abnormalities in the tumors of 242 ALL subjects. For comparative controls, SNP array data were collected from bone marrow samples obtained during remission from 61 subjects with acute myeloid leukemia. This example uses the Xba genotyping array data collected in this study. Unaligned summary signals were computed using dChip SNP software (www.dchip.org). Genotype calls were generated using GTYPE software, version 4.0 (Affymetrix, Santa Clara, CA, USA). The example dataset is freely available from www.stjuderesearch.org/depts/biostats/refnorm.

Four different alignment procedures were applied to the data. A set of QA signals was obtained by computing Equation (1) with the log-transformed unaligned signal values. To obtain ISA signals, dChip software was used with default settings. A set of ‘cyto-RAP’ signals was obtained by using cytogenetics data to select reference markers as described in Section 2.4 and then applying RAP. A set of RSA-RAP signals was obtained by using RSA (with η=0.15) to select reference markers and then applying RAP. The normal (0,1) distribution was selected as the target distribution for cyto-RAP and RSA-RAP signals. This resulted in four sets of aligned signals.

For each set of signals, a profile of standardized differences were computed for each tumor by comparing the signals of the tumor to the signals of the control arrays (Supplementary Materials, Section A). After ordering markers by chromosome and position, Thomas, (2003, 2005) algorithm was used to segment the standardized difference profile of each chromosome for each tumor. Each resulting segment was inferred as a region of deletion, no copy number change, or amplification on the basis of the proportion of markers with a positive standardized difference (PPSD). Notably, PPSD equals 0.5 if and only if the median standardized difference of a segment equals zero. Therefore, given a threshold 0≤γ≤0.5, a segment was declared a region of deletion if PPSD<0.5−γ, a region of no copy number change if 0.5 −γ≤PPSD≤0.5+γ, or a region of amplification if PPSD>0.5+γ. The value of γ defines the interval [0.5−γ,0.5+γ] for which a segment's PPSD is considered to indicate the two-copy state. Increasing γ widens this interval and thus increases the number of segments that are inferred to be in the two-copy state. Setting γ=0 declares all segments as gain or loss because no segment has a PPSD exactly equal to 0.5. Setting γ=0.5 declares all segments as being in the two-copy state because 0≤PPSD≤1 by definition. Except where explicitly stated otherwise, γ=0.1 is used.

Figure 1 shows the results of QA for one hyperdiploid tumor array (Hyperdip > 50-SNP#12). The QA signals of the control arrays are stochastically greater than the QA signals of markers on chromosomes with no cytogenetically detected abnormality in the tumor (mostly two-copy state) and stochastically less than QA signals of markers on amplified chromosomes (Fig. 1A). Consequently, the segmentation results obtained using QA signals are very difficult to interpret without the benefit of external cytogenetics data (Fig. 1C). The alternating pattern of the signals indicates that the underlying copy number also alternates. However, none of the segments have signal differences that are centered near zero. Thus, without the benefit of cytogenetics data, it is difficult to tell which (if any) set of segments is in the two-copy state. ISA gives qualitatively similar results (data not shown).

Alignment and segmentation results for a hyperdiploid tumor. (**A**) The quantiles of QA signals for markers on the amplified chromosomes (black dashed line) and chromosomes with no cytogenetic abnormality (solid black line) of a hyperdiploid tumor's array **...**

For this tumor, copy number inferences based on QA signals are concordant with cytogenetics for only 51% of the markers. Virtually the entire genome is inferred to be a region of either amplification or deletion. However, cytogenetics indicates that each chromosome is amplified or in the two-copy state. The chromosomes inferred as deletions have two copies according to cytogenetics. The problem cannot be resolved by choosing a different threshold γ. Choosing γ<0.1 gives the same inference described above; choosing γ to be somewhat greater than 0.10 erroneously infers amplified chromosomes to be in the two-copy state; and an even larger γ erroneously infers the entire genome to be in the two-copy state. The QA and ISA signals give qualitatively similar pattern for most hyperdiploid tumors in our study (data not shown). These results are directly attributable to the poor alignment of the signals of two-copy markers in the tumor with the signals of the two-copy markers in the control samples (Fig. 1A). The poor alignment arises from the poor reference selection strategies of QA and ISA (Supplementary Materials, Sections B and C).

Figure 1 shows the results of RSA-RAP for this tumor. The markers on chromosomes with no cytogenetically detected abnormality (Fig. 1B) have a similar distribution of RSA-RAP signals as the control arrays (Fig. 1B). Consequently, the standardized differences for the two-copy segments are centered very near zero and thus the results are easily interpreted (Fig. 1D). Cyto-RAP gives qualitatively similar results (data not shown). RSA-RAP and cyto-RAP give copy number inferences that are concordant with cytogenetics for >99.9% of the markers. Moreover, the highly concordant inference is robust in terms of the selection of γ. The copy number inference is exactly the same for 0.086≤γ≤0.344. (This range is greater than one-half of the possible range for γ.) These accurate and robust results are obtained because RSA-RAP closely aligns the signal distributions of two-copy markers for this tumor to the two-copy signal distribution of the control arrays (Fig. 1B).

Also, for each tumor, (7) was used to compute approximate IMA values by setting *k*_{ij}=2 for the markers on chromosomes with no cytogenetically detectable abnormalities. The IMA values were computed for QA and RSA-RAP signals. For QA signals, all markers are reference markers, as described previously. For RSA-RAP signals, the reference markers were identified with RSA. The IMA values were not computed for ISA signals because dChip software did not provide the list of markers included in the rank-invariant set. The IMA values for cyto-RAP values were not computed because the reference markers and the two-copy markers were defined in the same way using cytogenetics data. Therefore, the IMA values for cyto-RAP signals equal zero by definition and are thus not meaningful to consider. The proportion of markers with a copy number inference (using γ=0.1) concordant with cytogenetics was also computed for each tumor.

For the hyperdiploid tumor of Figure 1, the IMA of RSA-RAP signals is 0.01 and the IMA of QA signals is 0.17 (Fig. S3, Supplementary Materials). Therefore, inequality (10) implies that the pairwise misalignment of this tumor's RSA-RAP signals with the RSA-RAP signals of any control array must be small. Hence, the accurate copy number results obtained with RSA-RAP signals (Fig. 1D) are to be expected according to inequality (10). Also, inequality (10) does not provide any assurance that the pairwise misalignment of the tumor array's QA signals with any control array is small. Thus, the poor results obtained with QA signals (Fig. 1C) are not surprising either. Across the dataset, an IMA less than 0.05 strongly correlates with a concordance greater than 0.8 for QA and RSA-RAP signals (Fig. S4, Supplementary Materials). Additionally, the number of tumors with poor concordance and poor IMA with QA signals is substantially greater than that with RSA-RAP signals (Fig. S4). For QA signals, a poor IMA has a strong negative correlation with the proportion of markers located on a chromosome with no cytogenetic abnormality (Fig. S5, left panel). The IMA of RSA-RAP signals show much less correlation with the proportion of markers on chromosomes with no cytogenetic abnormality (Fig. S5, right panel), indicating that RSA-RAP is more robust against extensive copy number abnormalities than QA.

Finally, RSA-RAP and cyto-RAP gave copy number results with markedly better concordance with cytogenetics across the study as a whole than did QA or ISA (Fig. 2). The proportion of markers across all tumors with copy number inferences that are concordant with cytogenetics using RSA-RAP or cyto-RAP signals is greater than or equal to that using ISA or QA signals for any fixed value of γ (Fig. 2D). Furthermore, RSA-RAP and cyto-RAP give higher concordance among markers on chromosomes with no cytogenetic abnormality (Fig. 2A) and among markers on amplified chromosomes (Fig. 2B) than QA or ISA for any fixed value of γ. For very small γ, ISA gives a slightly higher concordance among markers on deleted chromosomes than the other methods (Fig. 2C) but apparently at the expense of very poor performance among markers on chromosomes with no cytogenetically apparent abnormality (Fig. 2B). This result is readily explained by the misalignment bias of QA and ISA signals (Supplementary Materials, Sections B and C). Additionally, the overall results of cyto-RAP and RSA-RAP are robust over a wide range of values for γ, as observed for the hyperdiploid tumor discussed above. Most interestingly, the results of RSA-RAP are very similar to those of cyto-RAP. This indicates that RSA selects reference markers well enough that little improvement in performance is achieved by collecting auxiliary copy number data, such as cytogenetics.

RAP is proposed as a procedure to prepare SNP microarray signals for copy number analysis of tumors. Unlike other alignment procedures, RAP is specifically designed to align the signal distribution of two-copy markers across arrays. Across-array signal comparisons can be biologically meaningful only if the arrays have a similar distribution for signals of markers in the two-copy state. Also, poorly aligned signals are quite prone to giving inaccurate copy number results (Figs 1 and S4). Other alignment procedures are not designed with the explicit objective of aligning the signals of two-copy markers. Subsequently, RAP gives more accurate copy number analysis results than ISA or QA in our application (Fig. 2).

RSA is proposed as a strategy to select internal reference markers for RAP. The biological rationale of RSA is that two is the smallest copy number that can result in heterozygosity. Intuitively, this is a better rationale for selecting reference markers to align signals for copy number analysis of tumors than the rationale for using all markers as references for all arrays (as in QA) or the rank-invariant set (as in ISA). RSA performed very well in our application, selecting a chromosome with no cytogenetically detectable abnormality for more than 90% of our tumor samples. Consequently, copy number analyses based on RSA-RAP showed better accuracy than copy number analyses based on ISA or QA (Fig. 2).

The internal and pairwise misalignment are defined as metrics of signal misalignment. A large pairwise misalignment indicates that across-array comparisons will not be biologically meaningful. For each pair of arrays, the pairwise misalignment is bounded above by the average of the individual arrays' IMAs as shown in (10). Thus, a reference selection strategy that allows for a different set of reference markers to be selected for each array, such as RSA, is warranted from the standpoint of attempting to minimize the IMA for each array and thus minimize the upper bound for the pairwise misalignment of each pair of arrays.

It may be possible to modify RSA or develop another RSA that can select reference markers to further reduce the IMA. In our application, the most common source of contamination of the reference set was an amplified or deleted cytoband on the reference chromosome selected by RSA. RAP does not require that full chromosomes be used to define the set of reference markers. Therefore, other reference selection procedures may be able to avoid the problem of single band amplification or deletion by allowing selection of subsets of chromosomes as the reference markers. Additionally, algorithms that allow inclusion of two-copy markers from multiple chromosomes may give smaller IMAs than RSA in some cases.

Unlike QA and ISA, RAP provides a straightforward way to use validation data on copy number abnormalities to rectify poor results. If validation experiments reveal serious errors in the reference selection or copy number results for a particular tumor, the validation data can be used to reselect reference markers and realign the signal values for that array without impacting the alignment and copy number results for other tumors. RAP performs the alignment for each array separately without using the signal data of other arrays (unless the target distribution is empirically defined using other arrays' data as in QA). As such, RAP can be used in conjunction with the results of laboratory validation to iteratively refine the reference marker selection and copy number results. In their current forms, neither QA nor ISA provides such a straightforward solution to this problem because the reference selection is ‘hard-coded’ into the algorithm. With QA and ISA, the only ‘remedy’ for this problem is to simply report the problem or exclude the problematic tumor from the study. If the later ‘solution’ were pursued in our study, then two entire disease subtypes, hyperdiploid and hypodiploid ALL, would be excluded simply because the QA and ISA are not well-suited alignment strategies for these tumors.

RAP, RSA and the concept of internal and pairwise misalignment may be generalized so that they are applicable in other settings as well. For instance, the problems with QA and ISA observed in our study using SNP arrays to examine copy number are likely to arise in applications that use array comparative genomic hybridization to study copy number abnormalities of tumors. However, the problem of reference selection is more difficult in the latter setting because genotype data are not available and thus reference selection remains a challenge. Also, it may be possible to improve reference selection strategies for the signal alignment of gene expression microarray signals. For example, it may be desirable to exclude Y chromosome genes from the reference set in studies that include males and females. Also, it may be possible to develop strategies that use the present–absent *P*-values (Pounds and Cheng, 2005) to select ‘absent’ genes as the reference genes to align the signal values of Affymetrix gene expression microarrays. In this context, RAP would align the signals of ‘absent’ probe sets across arrays. (The rationale of aligning ‘absent’ probe sets is to match the distribution of cross-hybridization signals across arrays.) The practical merit of these and other extensions of RAP should be explored in future research.

The three alignment procedures (QA, ISA and RAP) have several interesting similarities and differences. All three alignment procedures can be characterized in terms of a reference selection strategy, selection of a target distribution and technique to define a transformation of the unaligned signals to the target distribution. For reference selection, QA sets *r*_{ij}=1 for all *i* and *j*, ISA sets *r*_{ij}=1 for markers in the rank-invariant set and *r*_{ij}=0 for other markers, and RAP sets the values of *r*_{ij} as determined by the user. For the target distribution, QA uses (1) to empirically define the target distribution , ISA uses the selected baseline array to define the target distribution, and RAP allows the user to select the target distribution. To define the transformation, QA uses (1), ISA uses local regression as described in Section 2.2, and RAP uses linear interpolation as described in Section 2.3 and the quantile function *G*^{−1} of the user-selected target distribution. Thus, to select the best alignment procedure for a specific application in practice, one should consider which method(s) have the best reference selection strategy, target distribution and technique of transformation for the specific setting of the application.

From this perspective, it is clear that QA is a special form of RAP. QA sets *r*_{ij}=1 for all *i* and *j*. Thus, by (5), *u*_{ij}=*q*_{ij} for all *i* and *j* in this special case. Using (1) to define the target distribution yields , thus deriving QA within the framework of RAP. This observation brings up the question of whether using all markers as internal references for all arrays is the optimal reference selection strategy for a specific application. In our application, it is not intuitively reasonable to s*et al*l *r*_{ij}=1 because we already know that many tumors have many markers that are not in the two-copy state. By using all markers as references, QA signals are prone to have large IMAs for tumors with fewer than 80% of markers on a two-copy chromosome (Fig. S5, left panel). RSA is an intuitively more reasonable strategy for reference selection and thus RSA-RAP is more robust in terms of IMA (Fig. S5, right panel).

The choice of target distribution for RAP remains an open research question. We chose the normal distribution in our study due to its desirable statistical properties. Bolstad *et al*., (2003) argue for the use of empirically defined target distributions. However, in our application, it is difficult to meaningfully interpret the empirically defined distribution of Equation (1) because of the radical variation of the true copy number status across arrays. For example, the 75th percentile for a hyperdiploid tumor is likely to represent an amplified copy state while the 75th percentile for a control should represent the two-copy state. In terms of Equation (1), the average across *i* is computed across a cohort that is not homogeneous in terms of copy number for many values of *j*. Nevertheless, it may prove worthwhile to develop criteria to guide selection of the target distribution or develop techniques to empirically define the target distribution when each array may have a different set of reference markers.

RAP and QA operate on the marginal distribution of unaligned signal values of each array. These methods do not explicitly consider the correlation among markers. It may be possible to further improve signal alignment by developing methods that explicitly consider correlation among marker signal values.

We thank Dr Fridjtof Thomas for providing R code to implement his change-point algorithm and Dr Xueyuan Cao for summarizing the HapMap data and providing helpful editorial suggestions. We also appreciate the editorial advice of Dr Wei Liu and Mr David Galloway.

*Funding*: American Syrian Lebanese Associated Charities (ALSAC); the National Institutes of Health Cancer Center Support Grant (CA21765); the NIH/NIGMS Pharmacogenetics Research Network and Database (U01GM61393, U01GM61374, http://pharmgkb.org/); and National Institutes of Health (R01CA115422-01A1, R01CA129541-01, and R01CA132946-01).

*Conflict of Interest*: none declared.

- Bolstad BM, et al. A comparison of normalization methods for high density oligonucleotide array based on variance and bias. Bioinformatics. 2003;19:185–193. [PubMed]
- Mullighan CG, et al. Genes regulating B cell development are mutated in acute lymphoid leukaemia. Nature. 2007;446:758–764. [PubMed]
- Li C, Wong WH. Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biol. 2001;2:8. [PMC free article] [PubMed]
- Pounds S, Cheng C. Statistical development and evaluation of gene expression data filters. J. Comput. Biol. 2005;12:482–495. [PubMed]
- Raimondi SC, et al. Cytogenetics as a diagnostic aid for childhood hematologic disorders: conventional cytogenetic techniques, fluorescence in situ hybridization, and comparative genomic hybridization. In: Hanausek M, Walaszek Z, editors. Tumor Marker Protocols. Methods Molecular Medicine. Totowa, NJ: Humana Press; 1998. pp. 209–227. 1998.
- Steinhoff C, Vingron M. Normalization and quantification of differential expression in gene expression microarrays. Brief. Bioinform. 2006;7:166–177. [PubMed]
- Thomas F. Statistical approach to road segmentation. J. Transp. Eng. 2003;129:300–308.
- Thomas F. Automated road segmentation using a bayesian algorithm. J. Transp. Eng. 2005;131:591–598.

Articles from Bioinformatics are provided here courtesy of **Oxford University Press**

PubMed Central Canada is a service of the Canadian Institutes of Health Research (CIHR) working in partnership with the National Research Council's Canada Institute for Scientific and Technical Information in cooperation with the National Center for Biotechnology Information at the U.S. National Library of Medicine(NCBI/NLM). It includes content provided to the PubMed Central International archive by participating publishers. |