We reproduced the GC content effect of probe sequence in both Daphnia (Nimblegen) and Drosophila (Affymetrix) tile expression data. Signals normalized for sequence content do not differ grossly from the raw signals. However, one result of sequence content normalization appears to be less precise detection of gene structure (exon/intron boundaries). Sequence normalization (quantilenorm) reduced sensitivity and specificity of exon detection by 1% for the Daphnia data and by 2% for the Drosophila data.
3.1 Normalization reduces GC content correlation
The quantile normalization and RLS methods generally reproduce the GC content effect of probe sequence reported by Royce and colleagues, in the experiments for both species. The plots in Fig. 1 look compelling: raw signal is higher for GC-rich probes, and after normalization by sequence that effect goes away. Average signal and GC content are higher in exons than in introns, and this remains true after normalization, although the correlation between GC content and signal is reduced.
Fig. 1. (A) Raw and two normalized signals, by base per probe sequence position, and (B) a dot plot of signal strength versus GC content. Plots in (A) follow Royce et al., showing average signal per base over probe sequence position. Bases G + C in Raw are ...
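The idea behind normalizing signal by sequence content can be sketched in a few lines. The example below is a minimal illustration, not the quantilenorm or RLS procedure used here: it assumes probes are binned by their G + C count and that each bin's signal distribution is mapped onto the pooled distribution of all probes. The function names `gc_count` and `gc_quantile_normalize` are hypothetical.

```python
from collections import defaultdict


def gc_count(seq):
    """Return the number of G or C bases in a probe sequence."""
    s = seq.upper()
    return s.count("G") + s.count("C")


def gc_quantile_normalize(probes, signals):
    """Map each GC bin's signal distribution onto the pooled distribution.

    Assumption for illustration: probes are binned by G + C count only.
    Within a bin, each probe's rank is converted to a mid-rank quantile,
    and the probe is assigned the pooled signal found at that quantile.
    """
    if len(probes) != len(signals):
        raise ValueError("probes and signals must have the same length")
    pooled = sorted(signals)
    n = len(pooled)

    # Group probe indices by GC count.
    bins = defaultdict(list)
    for i, seq in enumerate(probes):
        bins[gc_count(seq)].append(i)

    normalized = [0.0] * n
    for idxs in bins.values():
        ranked = sorted(idxs, key=lambda i: signals[i])
        m = len(ranked)
        for rank, i in enumerate(ranked):
            q = (rank + 0.5) / m  # mid-rank quantile within the bin
            normalized[i] = pooled[min(n - 1, int(q * n))]
    return normalized


# Toy data: the two GC-rich probes carry the higher raw signal.
probes = ["AAAA", "AAAT", "GGGG", "GGCC"]
raw = [1.0, 2.0, 5.0, 6.0]
norm = gc_quantile_normalize(probes, raw)
```

In this toy case both GC bins end up with the same normalized distribution, so the difference in mean signal between GC-poor and GC-rich probes disappears while rank order within each bin is kept.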
3.2 Normalization reduces gene structure signals
For detecting gene structures, the overlap of high-scoring tiles with known exons provides a measure of the accuracy of normalization results. Data from both species showed a drop in sensitivity and specificity with normalized signal.
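As a sketch of how such an accuracy measure can be computed, the example below scores tiles against known exons at a fixed threshold. It assumes the standard definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP), which may differ from the definitions used for the results reported here. The function name, the threshold and the toy values are hypothetical.

```python
def exon_detection_accuracy(scores, in_exon, threshold):
    """Return (sensitivity, specificity) for calling tiles as expressed exons.

    scores    -- tile signal values
    in_exon   -- booleans, True where the tile overlaps a known exon
    threshold -- tiles with score >= threshold are called "exon"
    """
    tp = fp = tn = fn = 0
    for score, exon in zip(scores, in_exon):
        called = score >= threshold
        if called and exon:
            tp += 1
        elif called:
            fp += 1
        elif exon:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy data: three exon tiles followed by three non-exon tiles.
scores = [5.0, 4.0, 1.0, 3.0, 0.5, 2.0]
in_exon = [True, True, True, False, False, False]
sens, spec = exon_detection_accuracy(scores, in_exon, threshold=2.5)
```

Running the same function on raw and on normalized scores, with the same exon annotation, gives the kind of paired comparison described above.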
Raw signals improve the detection of gene structures, as seen in signal changes at exon/intron boundaries. One effect of normalization is to obscure gene structure boundaries, which often coincide with changes in sequence composition. Fig. 2 plots the statistical power of raw and quantile-normalized signals to distinguish exon and gene boundaries. The raw signal discriminates boundaries with greater statistical power. These effects are correlated with GC content, which is also displayed. In a per-base comparison of GC and score, the major effect is a higher score-GC correlation in intron regions. Quantile normalization reduces this correlation, so that GC-poor introns receive a relatively higher tile score.
Fig. 2. Tile score statistical power at finding gene/exon boundaries. Student's t-statistic and log10 (probability) for raw (triangle) and quantile norm (cross) scores measure ability to distinguish boundary at base positions away from position 0 (gene or exon ...
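The boundary statistic can be illustrated with a two-sample Student's t-test. The sketch below assumes the pooled-variance form and assumes that tile scores have already been collected into two groups, one on each side of a boundary. How scores were sampled at each base offset is not specified here, so the grouping and the toy values are hypothetical.

```python
import math
from statistics import mean, variance


def student_t(a, b):
    """Two-sample Student's t-statistic with pooled variance.

    a, b -- tile scores from the two sides of a boundary
            (for example, exon side and intron side).
    """
    na, nb = len(a), len(b)
    if na < 2 or nb < 2:
        raise ValueError("each group needs at least two scores")
    # Pooled sample variance across both groups.
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))


# Toy scores on the exon side and intron side of a boundary.
exon_side = [2.0, 4.0, 6.0]
intron_side = [1.0, 2.0, 3.0]
t_stat = student_t(exon_side, intron_side)
```

A larger absolute t-statistic at a given offset indicates that scores on the two sides are more easily distinguished, which is the sense in which raw and normalized signals are compared above.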
In the experiments for both species, differences in GC content between partially overlapped tiles correlated less strongly with signal level. The overall correlation of GC content and signal strength is 20% in both species. For overlapped tiles this correlation drops to 3% (Drosophila) or 15% (Daphnia). When signal and GC content are measured at exon–intron boundaries, overlapped tiles show a high correlation of 60% in Daphnia and 9% in Drosophila, both about three times higher than outside of boundaries. These species differ in total GC content, and in DNA methylation processing genes associated with variations in GC, so large species differences are not unexpected.
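One way to compute such a correlation for overlapped tiles is sketched below. It assumes, for illustration only, that each pair of partially overlapping tiles contributes one GC difference and one signal difference, and that the Pearson coefficient is taken over those differences. The exact pairing used for the values above is not described, and the function names and data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def overlap_pair_correlation(pairs):
    """Correlate GC differences with signal differences between tile pairs.

    pairs -- tuples (gc_a, signal_a, gc_b, signal_b), one per pair of
             partially overlapping tiles a and b.
    """
    d_gc = [gc_a - gc_b for gc_a, _, gc_b, _ in pairs]
    d_signal = [sig_a - sig_b for _, sig_a, _, sig_b in pairs]
    return pearson(d_gc, d_signal)


# Toy pairs in which the signal difference tracks the GC difference exactly.
pairs = [
    (0.5, 2.0, 0.4, 1.0),
    (0.3, 1.0, 0.5, 3.0),
    (0.6, 4.0, 0.6, 4.0),
]
r = overlap_pair_correlation(pairs)
```

Because overlapping tiles share most of their target sequence, differences between them isolate the contribution of the non-shared bases, which is why this comparison is informative about probe sequence effects.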
Normalization problems in detecting gene structures were first evident on gene maps. RLS and quantile normalization down-weighted exons and up-weighted introns, so that the normalized signal was strongest in the introns of several genes. The genome maps show examples of this for two genes. The boxed areas show cases where detection of intron–exon boundaries by tile signal is diminished after normalization compared with the raw tile signals. These areas coincide with changes in GC content. The normalizations have increased the score, and thus the noise, of non-expressed introns and intergenic regions.
Exon–intron signal loss examples. Genome maps show gene models, the raw and normalized tile signals and GC content, for Daphnia and Drosophila genes. Boxed (highlighted) areas mark where normalization has obscured the biological signal.