Empirical determination of cross-hybridization thresholds on a pathogen detection microarray
To systematically investigate the dynamics of array-based pathogen detection, we created an oligonucleotide array using Nimblegen array synthesis technology [13
]. The array was designed to detect up to 35 RNA viruses using 40-mer probes tiled at an average 8-base resolution across the full length of each genome (53,555 probes; Figure S1 and Table S1 in Additional data file 1). Together with 7 replicates for each viral probe, and control sequences for array synthesis and hybridization (see Materials and methods), the array contained a total of 390,482 probes. Initially, we studied virus samples purified from cell lines, reverse-transcribed and PCR-amplified with virus-specific primers (instead of random primers). This allowed us to study array hybridization dynamics in a controlled fashion, without the complexity of cross-hybridization from human RNA and random annealing dynamics, which occur with random primers. We then applied our findings to clinical samples amplified using random primers.
SARS coronavirus and Dengue serotype 1 genomic cDNA were amplified in entirety (as confirmed by sequencing), labeled with Cy3 and hybridized separately on microarrays. The SARS sample hybridized well to the SARS tiling probes, with all 3,805 SARS-specific probes displaying fluorescent (Cy3) signal well above the detection threshold (determined by probe signal intensities >2 standard deviations (SD) above the mean array signal intensity; Figure ). Cross-hybridization with other pathogen probe sets was minimal, observed only for other members of Coronaviridae and a few species of Picornaviridae and Paramyxoviridae, consistent with the observation that SARS shares little sequence homology with other known viruses [14
]. The hybridization pattern of Dengue 1, on the other hand, was more complex (Figure ). First, we observed that hybridization to the Dengue 1 probe set was partially incomplete (that is, there were regions absent of signal) due to sequence polymorphisms. The Dengue 1 sample hybridized on the array was cultured from a 1944 Hawaiian isolate, whereas the array probe set was based on the sequence of a Singaporean strain S275/90, isolated in 1990 [15
]. Sequencing the entire genomes of these 2 isolates revealed that the array probes that failed to hybridize each contained at least 3 mismatches (within a 15-base stretch) to the sample sequence. Second, we observed that cross-hybridization occurred to some degree with almost all viral probe sets present on the array, particularly with probes of other Flaviviridae members, consistent with the fact that the 4 Dengue serotypes share 60-70% homology. To understand the relationship between hybridization signal output and annealing specificity, we first compared all probe sequences to each viral genome using two measures of similarity: probe hamming distance (HD) and maximum contiguous match (MCM). HD measures the overall similarity distance of two sequences, with low scores for similar sequences [16
]. MCM measures the number of consecutive bases that are exact matches, with high scores for similar sequences [17
Figure 1 Heatmap of microarray probe signal intensities. Cells corresponding to probes are aligned in genomic order and colored according to the signal intensity-color scales shown. Hybridization signatures corresponding to (a) SARS Sin850 or (b) Dengue 1 Hawaiian (more ...)
We calculated the HD and MCM scores for every probe relative to the Hawaiian Dengue 1 isolate and observed that these scores correlated negatively (HD) and positively (MCM) with probe signal intensity (Figure ). All probes on the array with high similarity to the Hawaiian Dengue I genome, that is, HD ≤ 2 (n = 942) or MCM ≥ 27 (n = 627), hybridized with median signal intensity 3 SD above detection threshold. Although 98% of probes were detectable at the low HD range from 0-4, or high MCM range from 18-40, median probe signal intensity decreased at every increment of sequence distance (Figure ). Median signal intensity dropped off sharply to background levels at HD = 7 and MCM = 15, with 43% and 46% detectable probes, respectively. The majority of probes (>96%, n > 51,000) had HD scores between 8 and 21 and/or MCM scores between 0 and 15, of which only 1.23% and 1.57%, respectively, were detectable.
Figure 2 Relationship between probe HD, probe MCM and probe signal intensity. Average probe signal intensity and percentage of detectable probes (signal intensity > mean + 2 SD) decreases as HD increases and MCM decreases. The optimal cross-hybridization (more ...)
At the optimal similarity thresholds HD ≤ 4 and MCM ≥ 18, >98% of probes could be detected with median signal intensity 2 SD above detection threshold, whereas adjusting the similarity threshold down 1 step to HD ≤ 5 and MCM ≥ 17 would result in only approximately 85% probe detection and median signal intensity approximately 1.2 SD above detection threshold (Figure ). Using these optimal HD and MCM thresholds to guard against cross-hybridization, we binned all probes into specific 'recognition signature probe sets' (that is, r-signatures) most likely to specifically detect a given pathogen, and we defined r-signatures for each of the 35 pathogen genomes represented on the array (Table ). Each pathogen's r-signature comprised tiling probes derived from its genome sequence (HD = 0, MCM = 40), as well as cross-hybridizing probes derived from other pathogens (HD ≤ 4, MCM ≥ 18). According to these criteria, a given probe could belong to multiple different r-signatures, thereby maximizing probe-level evidence for pathogen detection.
Binning of probes into specific pathogen signature probe sets
We next considered other non-specific hybridization phenomena that could affect performance of our r-signature probes. For example, we observed a linear relationship between probe signal and %GC content (data not shown). Consistent with previous observations, we found that probes <40% GC hybridized with diminished signal intensities, while probes with >60% GC content showed higher signal intensities [19
]. Thus, we censored probes with GC <40% or >60% from the r-signatures, despite optimal HD or MCM values. Furthermore, as cross-hybridization with human sequences could also confound results, we compared all probes to the human genome assembly (build 17) by BLAST using a word size of 15 [21
]. Probes with an expectation value of 100 were also censored (Table ).
While the ideal pathogen r-signature would be one where all probes would hybridize to the target sequence at detectable levels, polymorphic variation between the probes (derived from a consensus sequence) and the actual target would be expected to impede the performance of the r-signature probes at some level. To test this hypothesis, we compared the ratios of detectable to undetectable probes across all r-signatures in the context of the hybridization involving the Hawaiian Dengue 1 isolate. Although the Dengue 1 sequence used to derive the Dengue 1 r-signature was approximately 5% different from the Hawaiian isolate, the detectable probe ratio of the Dengue 1 specific probes was 151/152 (99%), 12 times higher then that for the nearest Dengue serotype signature, suggesting that moderate polymorphic variation is quite tolerable, allowing, in this case, for discernment of the correct pathogen.
Predicting genome-wide amplification bias
Random priming amplification, rather than primer-specific amplification, is preferred for identifying unknown pathogens in clinical specimens. However, in initial experiments using random priming amplification to identify known pathogens, we frequently observed incomplete hybridization of the pathogen genome marked by interspersed genomic regions not detected by the probes. An example involving the amplification of respiratory syncytial virus (RSV) B from a human nasopharyngeal aspirate is shown in Figure . In preliminary analyses, sequence polymorphisms, probe GC content and genome secondary structure failed to explain this phenomenon, suggesting that it might result from a PCR-based amplification bias stemming from differential abilities of the random primers to bind to the viral genome at the reverse transcription (RT) step. The random primer used in our experiments was a 26-mer composed of a random nonamer (3') tagged with a fixed 17-mer sequence (5'-GTTTCCCAGTCACGATA) [4
]. Intra-primer secondary structure formation, such as dimer and hairpin formation between the 17-mer tag and nonamer, and probe melting temperature are known to influence binding efficiency [23
]. To explore our hypothesis, we designed an algorithm to model the RT-PCR process using experimental data (see Additional data file 1 for details). Briefly, it calculates the probability that a 500-1,000 base-pair product (average size range of PCR product) can be generated from each possible starting position in the genome assuming that a nonamer in the random primer mix will complement the viral sequence perfectly. This probability is reduced when intra-primer hairpin formation is predicted, and increased according to degree of complementarity between tag sequence and viral sequence. In this manner, the probability that each nucleotide will be successfully PCR-amplified is reflected in its AES (see supplemental methods in Additional data file 1 and [25
]). To validate the algorithm, we ranked the hybridization signal intensities for all 1,948 probes tiled across the RSV B genome and compared them to their AES values (Figure ). We observed that high AES significantly correlates to probe hybridization signal intensity above the detection threshold (P
= 2.2 × 10-16
; Fisher's exact test). In another experiment involving a patient sample positive for metapneumovirus (hMPV), the probes tiled across the hMPV genome showed a similar result, P
= 1.3 × 10-9
. Repeatedly, we observed that higher AES correlated with greater probe detection, with, on average, >70% detection for probes in the top 20% AES (see supplemental methods in Additional data file 1).
Figure 3 Measurement and application of AES. An RSV patient sample was amplified using original primer A1 (black line), or AES-optimized primer (blue line). The probes that have detectable signal above threshold are shown in purple in the corresponding heatmaps. (more ...)
While HD, MCM, %GC and sequence uniqueness were valuable parameters for probe selection, they did not take into account PCR bias, and were insufficient predictors of probe performance when considered in the absence of AES (Figure ). We found that using only the probes within the top 20% AES (Table ) substantially improved the efficacy of our prediction algorithm (discussed in the following section). In total, after applying all probe selection criteria, the r-signatures utilized 9,768 of the >50,000 unique probes initially included on the array.
Figure 4 Effects of probe filtering criteria on r-signature probe detection. The 1,948 probes tiled across the RSV B genome were binned according to different filtering criteria and plotted against the percentage of probes with detectable signal. Measurements (more ...)
We next hypothesized that amplification efficiency scoring could be used to select an optimal tag sequence (that is, for the RT-PCR primers) for achieving uniformly high AES across viral genomes, thus globally maximizing PCR efficiency (see supplemental methods in Additional data file 1 and [25
]). Briefly, we generated 10,000 primer sequences, eliminated those that formed self-dimers, and calculated AES for every genome based on each candidate primer tag. Primer A2, which had the highest average AES for all 35 viruses present on the array, was selected as the 'AES-optimized' primer. In a comparative study of eight patient samples (five RSV, three hMPV), we observed that primer A2 showed a marked improvement in overall PCR efficiency in amplifying both RSV and hMPV over the original primer, A1 (Figures S2 and S3 in Additional data file 1). The increased PCR efficiency contributed to increased hybridization of DNA to the probes, and is reflected in the uniformly higher signal intensities observed using primer A2. Consequently, >70% of viral probes had signal intensities above detection threshold when using primer A2, compared to approximately 20% using primer A1 (Anova test, P
= 0.00026; Figure S3 in Additional data file 1).
PDA: an algorithm for detecting pathogens
We observed that while the signal intensities for all pathogen r-signatures approximate a normal distribution, a large proportion of probes comprising the signature of a detectable pathogen have relatively strong signal intensities resulting in a right-skewed distribution (Figure ). We reasoned that analysis of the tails of the signal intensity distributions for each r-signature might better enable not only the identification of an infecting pathogen, but also the presence of co-infecting pathogens in the same sample. Thus, we devised a robust statistics-based PDA that analyzes the distribution of probe signal intensities relative to the in silico
r-signatures (see supplemental methods in Additional data file 1 and [25
]). The PDA software comprises two parts: evaluation of signal intensity of probes in each pathogen r-signature using a modified Kullback-Leibler Divergence (KL); and statistical analysis of modified KL scores using the Anderson-Darling test.
Figure 5 Distribution of probe signal intensities and WKL scores. RNA isolated from a RSV-infected patient was hybridized onto the array. (a) Distribution of probe signal intensities of all 53,555 probes (red) and r-signature probes for an absent pathogen, for (more ...)
Since the original KL cannot reliably determine differences in the tails of a probability distribution, and is highly dependent on the number of probes per genome and the size of each signal intensity bin, we incorporated the Anderson-Darling statistic to give more weight to the tails of each distribution. By using a cumulative distribution function instead of the original probability distribution, the p
value generated is independent of the binning criteria, eliminating errors that occur if a particular signal intensity bin is empty [26
]. We call our modified KL divergence the 'weighted Kullback-Leibler divergence' (WKL):
) is the cumulative distribution function of the signal intensities of the probes in Pa
found in bin bj
is the cumulative distribution function of the signal intensities of the probes in
found in bin bj
. R-signatures representing absent pathogens should have normal signal intensity distributions and thus relatively low WKL scores, whereas those representing present pathogens should have high, statistically significant outlying WKL scores (Figure ). In the second part of PDA, the distribution of WKL scores is subjected to an Anderson-Darling test for normality. If P
< 0.05, the WKL distribution is considered not normal, implying that the pathogen with an outlying WKL score is present. Upon identification of a pathogen, that pathogen's WKL score is left out, and a separate Anderson-Darling test is performed to test for the presence of co-infecting pathogens. In this manner, the procedure is iteratively applied until only normal distributions remain (that is, P
> 0.05). The PDA algorithm is extremely fast, capable of making a diagnosis from a hybridized microarray in less than 10 seconds.
Microarray performance on clinical specimens
To assess the clinical utility of the pathogen prediction platform, we analyzed 36 nasal wash specimens according to the workflow illustrated in Figure . These specimens were obtained from children under 4 years of age with lower respiratory tract infections (LRTI), of which 14 were hospitalized for severe disease and 22 with ambulatory LRTI. The clinical diagnosis of these patients was bronchiolitis or pneumonia. All 36 specimens had been previously analyzed for the presence of hMPV, and RSV A and B using real-time PCR. Twenty-one specimens tested positive for one or more viruses, while fifteen were PCR-negative for all three. All specimens were analyzed by microarray in a blinded fashion (Table ).
Schema of pathogen detection process. AD, Anderson-Darling.
Comparison of microarray and real-time PCR performance in detection of pathogen genera (HRV, pneumovirus)
As the RSV A full-genome sequence has not been published, our array was not designed to specifically detect this virus. Thus, we first assessed array performance using only results from the 16 patients diagnosed with either hMPV or RSV B by PCR (Table ). Of this cohort, the microarray correctly detected the presence of hMPV or RSV B in 13/16 samples. This corresponds to an assay specificity of 100%, sensitivity of 76%, and diagnostic accuracy of 94%. All 4 false negative samples (patients 374, 841, 892, and 924) had Ct values >33.5, which is near the detection limit of real-time PCR, and thus perhaps beyond the range of detection by microarray.
Comparison of microarray and real-time PCR performance in detecting RSV B or hMPV
We next assessed array performance in the group of patients PCR-positive for RSV A (n = 7) and PCR-negative for all tested viruses (n = 15). The microarray made only two positive calls in this group, both for RSV B. Interestingly, both RSV B calls corresponded to high-titre RSV A specimens by PCR (patients 414 and 913), suggesting that certain probe sets can detect the presence of related, but unspecified, viruses. Analysis of the published RSV A partial genome sequence (923 bp, Genbank ID: AF516119) revealed that 7 probes on our microarray had 100% identity to RSV A. We created an 'RSV A r-signature' comprising these 7 probes, enabling the specific detection of RSV A by microarray in 4/7 patient samples PCR-positive for RSV A (patients 414, 832, 913, and 924). Although the performance of this small r-signature was not as robust as the other virus r-signatures (median size: 249 probes), it suggested that it was feasible to pursue a 'viral discovery' approach using r-signatures created to detect viruses at the family or genus level that were related to those species already represented on the microarray. Specifically, we binned probes into family- or genus-level r-signatures by relaxing our similarity criteria (to HD ≤ 5 or MCM ≥ 25) and selecting probes common to genome sequences within families and genera for the picornaviridae family, paramyxoviridae family, rhinovirus genus (HRV) and pneumovirus genus (inclusive of RSV and hMPV).
Upon re-analysis of all 36 samples, we identified the presence of pneumovirus in 17 specimens as expected (1 false positive, patient 283), and additionally detected the presence of HRV in 9 specimens (Table ). As HRV was a novel discovery, we re-screened all 36 samples by PCR and found HRV in 11 specimens. All nine HRV calls by microarray were confirmed by PCR except for one. This finding was intriguing given that the genomic diversity of the over 100 known rhinovirus serotypes makes detection by PCR notoriously difficult [28
]. As the real-time PCR primers were capable of identifying only approximately 70% of rhinovirus strains, it is possible that the microarray correctly detected a rhinovirus strain that PCR failed to detect. Similarly, the pneumovirus genus detected in patient 283 could not be verified by RT-PCR, possibly owing to subtle genetic variations that prevented primer annealing. Thus, the greater genomic coverage afforded by the microarray might, in some cases, provide a more sensitive and accurate detection capability than pathogen-specific PCR.
Though the microarray identified the majority of HRV and RSV A samples using the genus-level r-signatures, it failed to detect three samples positive for HRV and three positive for RSV A by real-time PCR. These false negatives had an average Ct value >32, again suggesting a detection threshold close to that of real-time PCR. However, that the microarray also made a number of accurate discoveries in the 30-35 Ct range suggests a considerable degree of detection variability in the titre range above an approximately 30 Ct equivalency. Notably, the microarray correctly detected the presence of co-infecting pathogens in two samples (337 and 832), demonstrating the unique potential of this microarray platform to reveal complex disease etiologies.
Alternative methods of array design and pathogen detection
Though pathogen detection by microarray is a young field, a number of different platforms and approaches have been described, each with important attributes. For example, the array described by Wang et al
] is based on probes designed to recognize the most conserved viral domains, facilitating the detection of a taxonomic fingerprint that provides powerful clues to viral identity with minimal probe usage. Lin et al
], on the other hand, described a probe-dense resequencing array capable of detecting a smaller set of predefined pathogens, but with higher detection specificity, including the ability to discern highly related subtypes. The microarray described herein represents a blend of these two concepts, integrating a probe tiling approach for substantial genomic coverage (though with lower probe density than a resequencing array), with a taxonomy-based strategy for binning probes into pathogen recognition signatures. Thus, our analytical output includes both family- and genus-level predictions (for r-signatures restricted to conserved probes) as well as species-specific predictions (for r-signatures composed of conserved and unique probes). Indeed, this capability allowed us to detect and accurately identify viruses in clinical samples (Table ).
Central to pathogen prediction are the algorithms that weigh the microarray data against pre-defined recognition signatures. Unfortunately, few such algorithms exist, and only one algorithm, E-Predict, has been reported and validated [5
]. E-Predict matches hybridization signatures with predicted pathogen signatures derived from the theoretical free energy of hybridization for each microarray probe. To examine the performance of E-predict on our microarray platform, we analyzed a number of samples with both E-predict and our PDA algorithm. When applied to our microarray data, E-Predict performed well, with its first prediction tending to be the correct one (Table S2 in Additional data file 1). However, for each specimen, a number of false positive calls were also made, which seemed to reflect species with considerable sequence similarity to the true infecting pathogen (Table S2 in Additional data file 1). For example, in patient sample 412, E-Predict detected RSV (the correct pathogen), but also multiple species of coronavirus (which share some sequence similarity with RSV), yet real-time PCR using pancoronavirus primers as well as primers specific for strains OC43 and 229E indicated the absence of coronavirus from this sample (Figure S4 in Additional data file 1). These false positive calls can be explained by the fact that the function of E-Predict is less geared towards identifying and distinguishing specific pathogen strains, and aimed more at elucidating the best possible candidates as supported by the available probes. Thus, E-Predict is particularly advantageous in situations where a pathogen's sequence is not fully known [5
]. In contrast, our PDA algorithm is designed to make calls with greater species-level resolution. A major strength of PDA is its ability to specifically identify sequence-characterized and co-infecting pathogens with low false positivity. This is aptly demonstrated by the ability of PDA to detect specifically the presence of Dengue 1 in the clinical sample, where 7/35 viruses on the array are from the Flaviviridae family, including 4 dengue serotypes that share 70% sequence homology. The benefits of using both algorithms simultaneously for detecting both known and novel pathogens should be further evaluated.
An important discovery in this study was that the composition of the random primer tag has a significant impact on the efficiency of viral genome amplification, as assessed by an amplification efficiency score. The measurement of amplification efficiency allowed us to predict which probes would provide the most informative recognition signatures, markedly improving our pathogen prediction capability. Moreover, this finding allowed us to design AES-optimized primers that increased the amplification efficiency of our samples, resulting in greater sensitivity of pathogen detection. Whether multiplex RT-PCR using a variety of AES-designed primer tags can further increase amplification efficiency warrants further investigation. Additionally, it is feasible that other tag-based PCR applications, such as the generation of DNA libraries and enrichment of RNA for resequencing, may benefit from primer optimization using the AES algorithm.
DNA microarrays have the potential to revolutionize clinical diagnostics through their ability to simultaneously investigate thousands of potential pathogens in order to make a diagnosis. However, questions remain regarding their sensitivity and reliability. In this work, we investigated the myriad factors that influence microarray performance in the context of virus detection in clinical specimens, and describe an optimized platform capable of identifying individual and co-infecting viruses with high accuracy and sensitivity that brings microarray technology closer to the clinic. Future improvements will include significant reductions in microarray manufacturing and usage costs. Multiplex microarray formats and 're-usable' arrays are developing technologies that promise to drive down these costs. Furthermore, alternative technologies, such as beads [31
], microfluidics [32
] and nanotube microarrays [34
], might provide advantages in both assay cost and speed relative to traditional microarray platforms. Technology considerations aside, the advantages of a highly parallel, nucleic acid-based screening approach for detecting disease pathogens are clear. Validations in larger patient cohorts and in diverse clinical settings will be an important next step towards establishing the clinical role of pathogen detection microarrays.