High-density oligonucleotide tiling microarrays currently provide the most powerful method of investigating genome-wide protein-DNA interactions and chromatin structure in vivo. As illustrated in Figure 1, the technology allows regions of interest on DNA to be tiled with probes separated by short chromosomal distances. A typical NimbleGen array has about 400,000 probes that are 40-60 nucleotides long and separated by 10-100 base-pairs (bp) in the genome. Both NimbleGen and Agilent provide two-color microarrays with flexible designs in which one can choose partially overlapping probes for high-resolution studies of chromatin structure. The experimental protocol requires labeling the treatment and control samples with fluorescent dyes, usually green and red, and then hybridizing them on a microarray. When the microarray is scanned, each probe's fluorescence intensity gives an approximate measure of the abundance of DNA that hybridized to the probe. Because each probe has an associated genomic coordinate, one can plot the intensities as a function of chromosomal location and then reconstruct the enrichment of particular DNA or RNA fragments relative to the genomic background. As in Figure 1, the enriched regions appear as peaks, which can represent protein-bound DNA fragments.
Figure 1 ChIP-chip. Regions of interest on DNA are densely tiled, with probes separated by short distances. In this figure, each bar corresponds to the log-ratio hybridization signals of two channels measured by a probe. Small sub-regions that are over-represented ...
The technology continues to develop rapidly, but not without difficulties imposed by the inherent complexity of biological systems; for the foreseeable future, these difficulties must be addressed by computational means. The main computational challenge lies in properly normalizing the data and distinguishing true peaks from the noisy background. Many problems that confound this type of microarray data arise from probe-specific biases, such as differing sequence copy numbers in the genome or melting temperatures that vary with GC content. For Affymetrix tiling arrays, several good model-based methods already exist to account for probe biases and, thus, to adjust for probe-specific baseline signals. The recently introduced MAT [1], for instance, estimates probe affinity from probe sequence and copy number and provides a powerful tool for finding enriched regions in chromatin immunoprecipitation (ChIP) and other applications of Affymetrix tiling arrays. Similar problems are also found in Affymetrix expression arrays, for which various groups have invested extensive effort in developing robust methods for background correction and probe-level normalization (for example, [2]). However, it is relatively hard and expensive for Affymetrix to provide custom-designed microarrays.
Commercial custom tiling arrays are relatively new in the field of microarray biotechnology and, just as expression arrays allow global assays of gene expression, provide an invaluable tool for investigating the locations and roles of DNA-binding proteins in the whole genome at high resolution. All currently available custom tiling arrays use the two-color technology. Considering the utility and power of high-resolution tiling arrays, it is thus imperative that reliable computational methods be developed now to facilitate the extraction of precise and accurate conclusions from such experiments.
It turns out that two-color arrays also exhibit a sequence bias, one that depends particularly on the GC content of probes. More precisely, probes with high GC counts tend to have high intensity; furthermore, as Figure 2 indicates, the two channels show a higher correlation in the high-GC probes than in the low-GC probes. However, no satisfactory normalization and peak-detection methods are yet available for two-color tiling arrays. For example, even though NimbleGen provides flexible custom designs, with long probes to minimize cross-hybridization and variable probe spacing to allow dense tiling, a robust method of analysis has not hitherto been developed for the platform. Indeed, NimbleGen currently uses a simple method of globally scaling all probe ratios by the median, which attempts to remove dye bias across arrays but neglects other probe-specific biases. As illustrated in Figure 3, the median-scaled ratios retain the bimodal distribution attributable to GC probe effects; this approach is therefore inadequate for removing all dye and sample biases from the data.
Figure 2 Scatter plots of the Cy5 versus Cy3 channels for 50-mer probes from  with (a) 28256_Input versus 28256_ChIP for G+C = 11 bases and (b) 28256_Input versus 28256_ChIP for G+C = 39 bases. The correlation is 0.364 in (a) and 0.860 in (b). (c) Plot of ...
Figure 3 Histograms of intensities. (a) Histogram of single-channel log-intensity values for a single array from 28256_Input. The red bars represent the log-intensities for the probes with G+C less than 20, indicating that the bimodal behavior is caused by ...
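The limitation of global median scaling discussed above can be illustrated with a minimal sketch. The function names, probe sequences, and intensities below are hypothetical and chosen only for illustration; they are not taken from any NimbleGen software or from the arrays shown in the figures. The toy data are constructed so that GC-rich probes carry a larger ratio than AT-rich probes.

```python
import math
from statistics import median

def gc_count(probe_sequence):
    """Number of G or C bases in a probe sequence."""
    return sum(1 for base in probe_sequence.upper() if base in "GC")

def median_scaled_log_ratios(treatment, control):
    """log2(treatment / control) per probe, shifted so the array-wide median is zero."""
    log_ratios = [math.log2(t / c) for t, c in zip(treatment, control)]
    centre = median(log_ratios)
    return [value - centre for value in log_ratios]

# Hypothetical toy array: three AT-rich probes followed by two GC-rich probes.
probes = ["ATATATATAT", "AATTAATTAA", "TTTTAAAATA", "GCGCGCGCGC", "GGCCGGCCGA"]
treatment = [200.0, 200.0, 200.0, 800.0, 800.0]
control = [100.0, 100.0, 100.0, 200.0, 200.0]

scaled = median_scaled_log_ratios(treatment, control)

# Split the scaled ratios by GC count (the threshold of 5 is arbitrary).
low_gc = [m for p, m in zip(probes, scaled) if gc_count(p) < 5]
high_gc = [m for p, m in zip(probes, scaled) if gc_count(p) >= 5]
print("median scaled log-ratio, low GC: ", median(low_gc))
print("median scaled log-ratio, high GC:", median(high_gc))
```

Because a single constant is subtracted from every probe, any offset between the low-GC and high-GC groups present before scaling is still present afterwards; only the overall location of the distribution changes.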
For dual-channel cDNA arrays, several normalization methods have been proposed (for example, [2]), but these procedures typically neglect probe sequence information and are also computationally expensive and, thus, unsuitable for currently available high-density tiling arrays. One common way of locally normalizing two-color arrays is the so-called M-A loess normalization. The fundamental assumption behind this procedure is that most probes should have similar values in the two channels, an assumption violated in studies of chromatin structure such as the nucleosome mapping described in [7]. This method also does not account for sequence-specific effects, which may be significant in high-density tiling arrays, and does not normalize the variance of M.
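For readers unfamiliar with the procedure, the following is a minimal sketch of intensity-dependent normalization in the M-A style, assuming the conventional definitions M = log2(R) - log2(G) and A = (log2(R) + log2(G))/2 for red and green channel intensities R and G. The smoother is a plain tricube-weighted local linear fit without the robustness iterations of full loess, so it is a simplified stand-in and not the implementation found in any particular package. All function names and the toy intensities are hypothetical.

```python
import math

def ma_values(red, green):
    """Per-probe M = log2(R) - log2(G) and A = (log2(R) + log2(G)) / 2."""
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(red, green)]
    return m, a

def local_linear_fit(x, y, x0, span=0.75):
    """Tricube-weighted straight-line fit of y on x, evaluated at x0.

    x0 is assumed to be one of the observed x values.
    """
    n = len(x)
    k = min(n, max(2, math.ceil(span * n)))             # neighbours in the window
    bandwidth = sorted(abs(xi - x0) for xi in x)[k - 1]  # k-th smallest distance
    sw = swu = swuu = swy = swuy = 0.0
    for xi, yi in zip(x, y):
        u = xi - x0                                      # centre on x0 for stability
        if bandwidth > 0:
            w = max(0.0, 1.0 - (abs(u) / bandwidth) ** 3) ** 3
        else:
            w = 1.0 if u == 0 else 0.0
        sw += w
        swu += w * u
        swuu += w * u * u
        swy += w * yi
        swuy += w * u * yi
    denom = sw * swuu - swu * swu
    if abs(denom) < 1e-12:                               # degenerate: weighted mean
        return swy / sw
    return (swy * swuu - swu * swuy) / denom             # intercept = fit at x0

def loess_normalize(m, a, span=0.75):
    """Subtract the smoothed trend of M as a function of A from each M value."""
    return [mi - local_linear_fit(a, m, ai, span) for mi, ai in zip(m, a)]

# Hypothetical two-channel intensities for eight probes.
red = [120.0, 300.0, 650.0, 1500.0, 4000.0, 9000.0, 15000.0, 30000.0]
green = [100.0, 260.0, 700.0, 1300.0, 5000.0, 8000.0, 20000.0, 26000.0]
m, a = ma_values(red, green)
print([round(value, 3) for value in loess_normalize(m, a)])
```

Two points from the text are visible in this sketch. The fitted trend depends only on A, so probe sequence plays no role, and only the location of M is adjusted, not its variance. The naive loop also costs on the order of n squared operations for n probes, which hints at why smoothing-based procedures become expensive on arrays with hundreds of thousands of probes.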
Single-channel normalization methods, such as those proposed by [3], can also be applied to two-color arrays, but they ignore the fact that the two channels are paired, and such approaches are thus likely to retain residual effects or correlation. Recently, Dabney and Storey [10] introduced a normalization method that adjusts for intensity-dependent dye bias and array-to-array variation. However, their method, which was developed for expression arrays, does not model sequence-specific probe effects and is based on smoothing procedures that can be computationally demanding for tiling arrays; the approach also requires a dye swap and, thus, cannot be applied to single-array experiments, which are often performed as test runs. In fact, as far as we are aware, there are, to date, only two published tools, MPeak [11] and ChIPOTle [13], for analyzing two-color high-density tiling arrays, but neither considers probe-specific normalization or is able to combine replicate experiments directly. This problem is rather serious, since biological replicate experiments are considered indispensable in any sound research utilizing microarrays.
In this paper, we address many of the issues discussed above and present robust algorithms for normalizing the raw data at the probe level and detecting peaks, implemented as a Java program called MA2C (model-based analysis of two-color arrays). Because our normalization method standardizes the probe intensities, our peak-detection algorithm naturally generalizes to combine replicate arrays.