We have used two entirely independent spike-in data sets to derive a nucleotide substitution matrix that allows BLASTN to more accurately identify probes that are susceptible to cross- or bulk-hybridization with a given target sequence. In addition to its increased accuracy, this substitution matrix has several desirable features: first, the matrix is relatively similar to the original, default matrix; only two parameters are different. Second, the substitution matrix is symmetric; if this had not been the case, it would be necessary to transpose the matrix when reversing the direction of the BLASTN search. Third, these parameters make intuitive sense; A–T base pairs are generally less energetically favorable than G–C base pairs.
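As a minimal illustration of the symmetry property, and of how a nucleotide substitution matrix turns an ungapped alignment into a score, consider the following Python sketch. The numeric values and the function names are hypothetical placeholders, not the optimized parameters derived in this work; they only mimic the qualitative idea that A/T pairs contribute less to the score than G/C pairs.

```python
BASES = "ACGT"

# Hypothetical scores for illustration only (NOT the values derived in this work):
# G/C matches score higher than A/T matches; every mismatch gets the same penalty.
MATCH = {"A": 1, "C": 2, "G": 2, "T": 1}
MISMATCH = -3


def build_matrix():
    """Return a 4x4 substitution matrix as a dict keyed by (base, base)."""
    return {(x, y): (MATCH[x] if x == y else MISMATCH)
            for x in BASES for y in BASES}


def is_symmetric(matrix):
    """True if score(x, y) == score(y, x) for every pair of bases."""
    return all(matrix[(x, y)] == matrix[(y, x)] for x in BASES for y in BASES)


def ungapped_score(matrix, query, subject):
    """Sum the per-position scores of two equal-length, ungapped sequences."""
    if len(query) != len(subject):
        raise ValueError("sequences must have equal length")
    return sum(matrix[(q, s)] for q, s in zip(query.upper(), subject.upper()))
```

Because such a matrix is symmetric, swapping the query and subject leaves the score unchanged, which is why no transposition is needed when the direction of the search is reversed.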
It is somewhat tempting to attempt further optimization of the substitution matrix, and to use separate substitution matrices for prediction of cross- and bulk-hybridization. In fact, it is likely that the true sequence-dependent specificity of cross-hybridization is at least slightly different from that of bulk-hybridization, because bulk-hybridization occurs between two RNA strands, whereas cross-hybridization occurs between a DNA and an RNA strand. However, we believe that the relatively small gain in performance from additional optimization steps is not sufficient to justify the increased complexity of two separate matrices.
Although the optimized substitution matrix is more accurate than the default matrix, an alignment score provides only a rough estimate of relative hybridization affinity. For a more accurate estimation of binding energy between oligonucleotides, it is necessary to consider base stacking energy, positional effects, sequence complexity and other features that cannot be captured in the simple nucleotide substitution matrix used in BLASTN (19). Furthermore, although we have considered only the top-scoring hit, multiple high-scoring hits may each contribute to the observed intensity. We and others have speculated that a crude nearest neighbor model could be implemented with the BLASTN algorithm by using dinucleotides as the atomic elements. Additionally, others have developed efficient algorithms to search for hybridization partners that are not based on BLASTN (20). The data from our experiments, which we have made publicly available, may aid in the development of such algorithms.
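The dinucleotide idea can be sketched as a simple re-encoding step: each sequence is rewritten as its overlapping dinucleotides, so that a 16-letter alphabet replaces the 4-letter one and a 16 × 16 substitution matrix could, in principle, carry nearest-neighbor-like terms. The Python sketch below shows only this encoding; the symbol assignment and the function name are arbitrary assumptions, and it is not an existing BLASTN feature.

```python
from itertools import product

# All 16 dinucleotides in a fixed order: AA, AC, AG, AT, CA, ...
DINUCLEOTIDES = ["".join(pair) for pair in product("ACGT", repeat=2)]

# Map each dinucleotide to one symbol of a 16-letter alphabet.
# The letters a..p are an arbitrary choice made for this illustration.
SYMBOL = {d: chr(ord("a") + i) for i, d in enumerate(DINUCLEOTIDES)}


def encode_dinucleotides(seq):
    """Rewrite a nucleotide sequence as its overlapping dinucleotide symbols.

    A sequence of length n yields n - 1 symbols.
    """
    seq = seq.upper()
    return "".join(SYMBOL[seq[i:i + 2]] for i in range(len(seq) - 1))
```

An alignment of two encoded sequences then compares adjacent base pairs rather than single bases, which is the sense in which dinucleotides would serve as the atomic elements.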
As a preliminary step in our study, we evaluated the effect of a high-concentration spike on gene expression measurements. We observed many individual probes with substantial changes in intensity, both upwards and downwards, in response to the spike. With this result alone, it would be tempting to speculate that the observed intensity changes could be due to random noise or to experimental error. However, the association between intensity change and BLASTN alignment score suggests a sequence-dependent relationship that is consistent with cross- and bulk-hybridization. Furthermore, our observation of very similar results in two independent data sets suggests that this is not a chance occurrence. Notably, cross- and bulk-hybridization affect a comparable fraction of probes, suggesting that it is important to consider both effects when designing microarray probes.
As we reported previously in the hemoglobin study, the number of changes in probeset-level expression values was relatively small, in spite of the large degree of cross- and bulk-hybridization affecting individual probes. It is reassuring that the addition of a spike at 10% of the total target, which we suspect is above the level that is likely to be encountered in most microarray experiments, produces only a few false changes in expression level. However, experiments using blood may encounter substantial changes in transcript abundance, and other researchers have found that removal of highly abundant hemoglobin transcripts can improve data quality (22). Additionally, even if non-specific hybridization has only small effects on individual expression values, its coordinated effect on two or more susceptible genes can substantially increase their apparent correlation (2
Our experiments were limited to Affymetrix 3′ expression arrays because of our familiarity with the platform, and because the large number of probes per gene provides ample data for our primary goal of optimizing BLASTN parameters. In general, our observations of the extent of cross- and bulk-hybridization may not apply to other microarray platforms or methods. For example, we previously found that the target generation method has a substantial effect on specificity (4). In general, cross-hybridization has not been a major consideration in comparisons between microarray platforms, even when spike-in experiments were available (23). Our approach using high-concentration spikes could be applied to compare the relative specificity of various microarray platforms or methods.
In our analysis we treat bulk-hybridization as the mirror image of cross-hybridization: it is predicted by similarity to the opposite strand, and it causes decreases rather than increases in intensity. However, the reality is likely to be much more complicated, because the effect size of bulk-hybridization depends on the concentration of two RNA species in a non-trivial manner. Furthermore, bulk-hybridization may involve more than two RNA species. In our simple model presented in , the relevant target sequence is limited to the section directly complementary to the probe. However, any flanking sequence on the target RNA may also affect its binding to the spike sequence. It is probably possible to reduce the relative level of bulk-hybridization by decreasing the target concentration during array hybridization, but whether this benefit would outweigh the resulting loss of signal is unknown.
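The mirror-image treatment of the two effects can be made concrete with a small Python sketch. It assumes that all sequences are written 5′ to 3′ and that the intended target is the exact reverse complement of the probe; a simple count of identical positions in the best ungapped window stands in for the BLASTN alignment score. The function names are hypothetical, and this is an illustration of the strand logic rather than the analysis pipeline used in this study.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def best_window_matches(query, subject):
    """Largest number of identical positions between `query` and any
    ungapped window of `subject` with the same length (0 if none fits)."""
    query, subject = query.upper(), subject.upper()
    n = len(query)
    best = 0
    for i in range(len(subject) - n + 1):
        window = subject[i:i + n]
        best = max(best, sum(a == b for a, b in zip(query, window)))
    return best


def strand_similarity(probe, spike):
    """Compare the spike with both strands of the probe/target duplex."""
    target = reverse_complement(probe)  # assumed intended target
    return {
        # Spike resembles the target: it may bind the probe (cross-hybridization).
        "cross": best_window_matches(target, spike),
        # Spike resembles the probe: it may bind the target in solution
        # (bulk-hybridization).
        "bulk": best_window_matches(probe, spike),
    }
```

Under these assumptions, a high "cross" value would predict an intensity increase and a high "bulk" value an intensity decrease for the probe in question.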
We should note that we have used S. cerevisiae strain W303 as our baseline sample, whereas the microarray probes were designed to query S. cerevisiae strain S288C. The nucleotide divergence between S288C and W303 has been estimated at 0.08% (24); thus we expect ~1 in 50 probes to contain a single-base mismatch against its intended target. This seems unlikely to have a noticeable effect on the cross-hybridization analysis, which is independent of the baseline target sequence. However, our analysis of bulk-hybridization was performed with the implicit assumption that the target sequence is the exact reverse complement of the probe sequence. We expect sequence discrepancies to introduce small random errors in the alignment scores. Given the large number of probes analyzed, these errors are very unlikely to affect our results.
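The figure of roughly 1 in 50 follows from simple arithmetic, as the short Python sketch below shows. It assumes 25-nucleotide probes (the usual probe length on Affymetrix expression arrays, which is not stated in this section) and independent, uniformly distributed substitutions; both are simplifying assumptions, and the function name is hypothetical.

```python
def mismatch_estimates(divergence, probe_length):
    """Return (expected mismatches per probe,
    probability that a probe carries at least one mismatch)."""
    expected = divergence * probe_length
    p_at_least_one = 1.0 - (1.0 - divergence) ** probe_length
    return expected, p_at_least_one


# 0.08% per-nucleotide divergence between S288C and W303 (from the text);
# a probe length of 25 nt is an assumption made for this illustration.
expected, p_any = mismatch_estimates(divergence=0.0008, probe_length=25)
print(f"expected mismatches per probe: {expected:.4f}")
print(f"fraction of probes with >= 1 mismatch: {p_any:.4f}")
```

Both quantities come out close to 0.02, that is, about one probe in fifty.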
Efficient prediction of nucleotide hybridization has several potential applications beyond probe selection for microarrays. Sequence-specific hybridization is a cornerstone of several molecular biology techniques, such as PCR, Southern and northern blots, and fluorescent in situ hybridization. Thus, the optimized BLASTN substitution matrix presented in this work may also be useful in the design of probes for these applications.