In this work, we demonstrate that probe cross-hybridization signals can be mapped to specific off-target transcripts. Incorporating exon array probe mapping information, we exclude probes showing strong correlations with corresponding off-target transcripts to remove cross-hybridization biases from resulting gene-level expression estimates. We evaluated our strategy for gene-level expression using independent estimates of transcript abundances from Solexa ultra high-throughput sequencing. We find that expression estimates for a number of genes can be improved by removing cross-hybridization artifacts.
Our work furthers the understanding of factors affecting microarray probe cross-hybridization. The set of exon array full probes, designed to target computationally predicted exonic regions, tends to have probe intensities near background levels and can be used to study how probes respond to transcripts with which they share sequence similarity. We found that the correlation between probes and the expression patterns of matching off-target transcripts decreases as the match edit distance between the probe and transcript increases. With an edit distance of 3 bp allowed between probes and off-target transcripts, probes may still show strong signals of cross-hybridization compared with signals expected by chance (see also Supplementary Fig. S5). Matches including insertions/deletions between probe and transcript sequences can also give rise to strong cross-hybridization signals. It will therefore be important to apply sequence mapping programs that can detect these types of alignments. Previous work has reported that much shorter alignments of 10–16 nt may be sufficient for cross-hybridization (Wu et al.). However, because the number of matches between probes and transcripts increases rapidly with larger edit distance, it will be important to develop more sophisticated models to predict cross-hybridization of individual probes. Future models may incorporate factors such as the type of probe-transcript alignment, probe sequence (Wu et al.) or transcript secondary structure. For example, we found that probe GC-content affects the extent of cross-hybridization, and that its effect depends on the type of alignment between probes and transcripts. For perfect matches between probes and transcripts, probes with intermediate GC-content tend to have the highest correlation with the transcript expression level. For larger match edit distances, however, probes with higher GC-content show higher correlation with the transcript expression levels (Supplementary Fig. S6). With more detailed knowledge of how probe sequence affects cross-hybridization, we will be able to design probes that are more specific to their target transcripts.
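To make the notion of a match within a given edit distance concrete, the following minimal Python sketch computes the smallest edit distance (mismatches, insertions and deletions) between a probe and any substring of a transcript, using a standard approximate string matching recurrence. It is an illustration only and is not the mapping program used in this work; the function names, the toy sequences and the assumption that probe and transcript are given in the same orientation are ours.

```python
def min_edit_distance(probe: str, transcript: str) -> int:
    """Smallest edit distance between the probe and any substring of the transcript."""
    # Row 0 is all zeros: an alignment may start at any transcript position for free.
    prev = [0] * (len(transcript) + 1)
    for i, p in enumerate(probe, start=1):
        # Column 0: the first i probe bases aligned against nothing cost i edits.
        curr = [i] + [0] * len(transcript)
        for j, t in enumerate(transcript, start=1):
            curr[j] = min(
                prev[j - 1] + (p != t),  # match (cost 0) or mismatch (cost 1)
                prev[j] + 1,             # probe base with no transcript partner
                curr[j - 1] + 1,         # transcript base with no probe partner
            )
        prev = curr
    # The alignment may also end at any transcript position.
    return min(prev)


def is_candidate_off_target(probe: str, transcript: str, max_edits: int = 3) -> bool:
    """Hypothetical helper: does the probe match the transcript within max_edits?"""
    return min_edit_distance(probe, transcript) <= max_edits


if __name__ == "__main__":
    probe = "ACGTACGT"  # toy sequence, much shorter than a real array probe
    print(min_edit_distance(probe, "TTTACGTACGTTTT"))  # perfect match
    print(min_edit_distance(probe, "TTTACGAACGTTTT"))  # one mismatch
    print(min_edit_distance(probe, "TTTACGACGTTTT"))   # one deleted base
```

Because the recurrence scores gaps as well as substitutions, it reports alignments with insertions/deletions that a mismatch-only search would miss. The quadratic cost of this sketch makes it unsuitable for genome-scale mapping, where an indexed mapping program would be needed.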
In the absence of sequence-based predictive models of cross-hybridization, we found that the empirical data can be used to detect cross-hybridization. For a matching probe and off-target transcript, we use the transcript's expression pattern to determine whether the probe intensity reflects cross-hybridization. This approach takes advantage of the extensive annotation of the transcriptome and can be applied to other arrays with genome-wide coverage. The number of samples for which the cross-hybridization correction can improve gene expression estimates will depend on the expression pattern of the off-target transcripts. For example, from , removing the cross-hybridizing probes will dramatically change expression estimates in many different tissue types. In a few other tissue types, the estimates will not be affected because the off-target transcript is not highly expressed in those samples. As a result, the set of genes affected by the cross-hybridization correction will tend to overlap among the different samples. We also found that our method, being based on the empirical data, is limited by the array design. Genes with only a small number of probes uniquely matching the target transcript can yield less reliable estimates of gene expression. For example, many of the genes for which GeneBASE-xhyb estimates are less concordant with sequencing than the GeneBASE estimates have fewer than five uniquely matching probes.
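The empirical detection step can be sketched as follows, assuming that a probe is flagged when its intensity profile across samples correlates strongly with the expression profile of a matching off-target transcript. This is a simplified illustration rather than the GeneBASE-xhyb implementation: the function names, the use of Pearson correlation, the cutoff of 0.8 and the toy data are all assumptions made for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def flag_cross_hybridizing(probe_intensity, off_target_expression, matches, threshold=0.8):
    """Return the probes whose intensities track a matching off-target transcript.

    probe_intensity:       probe id -> intensities across samples
    off_target_expression: transcript id -> expression estimates across the same samples
    matches:               (probe id, transcript id) pairs from sequence mapping
    threshold:             hypothetical correlation cutoff, not taken from the paper
    """
    flagged = set()
    for probe_id, transcript_id in matches:
        r = pearson(probe_intensity[probe_id], off_target_expression[transcript_id])
        if r >= threshold:
            flagged.add(probe_id)
    return flagged


if __name__ == "__main__":
    # Toy data for five samples (for example, five tissues).
    probe_intensity = {"p1": [1, 2, 3, 4, 5], "p2": [5, 1, 4, 2, 3]}
    off_target_expression = {"t1": [2, 4, 6, 8, 10]}
    matches = [("p1", "t1"), ("p2", "t1")]
    print(flag_cross_hybridizing(probe_intensity, off_target_expression, matches))
```

Flagged probes would then be excluded before gene-level summarization. As noted above, a gene left with very few uniquely matching probes after this filter may yield a less reliable estimate, so in practice the number of remaining probes should be checked.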
In many microarray applications, careful selection of probes that uniquely match target transcripts can eliminate cross-hybridization biases. In future microarray designs, SeqMap (Jiang and Wong, 2008) or similar algorithms (Li et al.; Smith et al.) can be used to select probes that do not share sequence similarity with off-target transcripts, allowing up to a certain number of mismatches or insertions/deletions. However, there are many microarray applications in which probes unavoidably share some sequence similarity with off-target transcripts. For querying exon-level expression, or for certain paralogous gene families, there may be only a small number of probes that uniquely match the target transcript. For probes designed to target individual sequences that differ at a particular locus by a single nucleotide polymorphism (SNP), each probe will differ from the competing sequence by only a single nucleotide. Additionally, probes that target exon-exon junctions may be subject to cross-talk from hybridization to mRNA transcript isoforms that include only one of the exons (Boutz et al.; Srinivasan et al.). In these situations, detailed knowledge of cross-hybridization will be useful for designing probes with high specificity to their target transcripts.