|Home | About | Journals | Submit | Contact Us | Français|
DNA oligonucleotide microarrays (oligoarrays) are being developed continuously; however, several issues regarding the applicability of these arrays for whole-genome DNA-DNA strain comparisons (genomotyping) have not been investigated. For example, the extent of false negatives (i.e., no hybridization signal is observed when the amino acid sequence is conserved but the nucleotide sequence has diverged to a level that does not allow hybridization) remains speculative. To provide quantitative answers to such questions, we performed competitive DNA-DNA oligoarray (60-mer) hybridizations with several fully sequenced (tester) strains and a reference strain (whose genome was used to design the oligoarray probes) of the genus Burkholderia and compared the experimental results obtained to the results predicted based on bioinformatic modeling of the probe-target pair using the available sequences. Our comparisons revealed that the fraction of the total probes that provided experimental results consistent with the predicted results decreased substantially with increasing divergence of the tester strain from the reference strain. The fractions were 90.8%, 84.3%, and 77.4% for tester strains showing 96% 89%, and 80% genome-aggregate average nucleotide identity (ANI) to the reference strain, respectively. New approaches to determine gene presence or absence based on the hybridization signal, which outperformed previous approaches (e.g., 92.9% accuracy versus 86.0% accuracy) and to normalize across different array experiments are also described. Collectively, our results suggest that the performance of oligoarrays is acceptable for tester strains showing >90% ANI to the reference strain and provide useful guidelines for using oligoarray applications in environmental gene detection and gene expression studies with strains other than the reference strain.
DNA microarrays are common tools in the modern molecular laboratory (34). Oligoarrays, which are typically comprised of one or more short, 30- to 60-nucleotide probes for each gene in the genome, have gained popularity compared with microarrays made with PCR products of genes for expression studies due to their greater specificity during hybridization, flexibility of design, and potential for further technological development (10, 23). More recently, oligoarrays have been used for genetic (or DNA-DNA) comparisons of strains. For this application, the array is typically built using the available genomic sequence of one strain (the reference strain) and is used for competitive hybridization with genomic DNA of closely related strains (the tester strains) (3, 8, 24). The objective is to reveal the differences in genes between the reference strain and tester strains (genomotyping) that could explain the unique characteristics of the strains. Such DNA-DNA oligoarray experiments are also a promising and cost-effective means to uncover the gene content of natural microbial assemblages and thereby track their dynamics over time and during environmental fluctuations. Several studies of this type have been performed recently (4, 29, 32).
One cornerstone principle of DNA-DNA experiments is that the intensity of the hybridization signal of a probe depends directly on the degree of relatedness between the probe and target sequences. The relationship between signal intensity and sequence identity has been studied extensively previously, based on both simple sequence comparison models and more sophisticated models that incorporate the thermodynamic interactions of the probe-target sequence pair. Accordingly, we have a relatively good understanding of the maximum divergence of the target sequence from the probe sequence that provides a hybridization signal greater than the background signal, which indicates the presence of the corresponding target sequence in the hybridization mixture based on the array results (6, 13, 19, 21, 30). However, most, if not all, of the previous studies have been performed using a small set of selected probes and target sequences (13, 21). When thousands of target sequences (e.g., a whole genome) are hybridized, additional complications that cannot be accounted for by the simple experiments described above can occur. Examples of these complications include stereochemical interactions among the many targets and probes present (e.g., nonspecific hybridization), competition between (partially) identical target sequences (cross-hybridization), and nonuniform hybridization behavior of targets due to differences in melting hybridization temperature (Tm) and G+C content. Therefore, the applicability of the results of the previous studies that were based on a few probes and target sequences to the whole-genome level remains somewhat speculative. More importantly, it has been established that DNA-DNA experiments provide meaningful results with tester strains belonging to the same species as the reference strains (i.e., strains whose genomes show more than 95 to 96% nucleotide identity ) but not with tester strains related to the reference strain at levels below the sequence threshold for obtaining a significant signal greater than the background signal (usually around 80% nucleotide identity for oligoarrays ). However, we do not have a robust quantitative understanding of how strains with intermediate genetic relatedness (e.g., 80 to 95%) behave in such experiments. Dealing with these issues is important for DNA-DNA microarray applications, especially applications for environmental samples, which are characterized by high levels of species and genomic heterogeneity. For instance, natural populations sometimes show intrapopulation genomic diversity (expressed as levels of nucleotide identity) ranging from 90 to 100%, as revealed by recent metagenomic surveys (15, 31). The issues described above are also relevant to the use of oligoarrays for gene expression studies with strains other than the reference strain.
Here, we quantitatively assessed the performance of oligoarrays by conducting competitive hybridizations between a reference strain and selected tester strains that were fully sequenced and showed increasing genetic divergence from the reference strain. We then compared the experimental oligoarray results to bioinformatically predicted results based on whole-genome sequence comparisons. Our analyses allowed precise estimation of the numbers of false negatives (FN) (no hybridization signal observed when the amino acid sequence is conserved but the nucleotide sequence has diverged enough that there is no hybridization) and false positives (FP) for strains belonging to the same species as the reference strain or closely and moderately related species. We also describe a new simple method to normalize arrays from different experiments and determine the presence of a gene based on the hybridization signal that outperforms previously described approaches.
The following six strains were used in this study: Burkholderia cenocepacia J2315 (reference strain), B. cenocepacia AU1054, B. cenocepacia HI2424, Burkholderia ambifaria AMMD, Burkholderia vietnamiensis G4, and Burkholderia xenovorans LB400. The whole-genome sequences of these six strains were obtained from the National Center for Biotechnology Information website (ftp://ftp.ncbi.nih.gov/). The genetic relatedness between the reference strain and a tester strain was measured based on the genome-aggregate average nucleotide identity (ANI) and 16S rRNA gene sequence identity. ANI values were calculated based on all reciprocal best-match conserved genes for the two strains (pairwise), as previously described (17). To calculate 16S rRNA gene sequence identities, all 16S rRNA gene copies in the genome of the reference strain were queried against the genome of the tester strain using the BLASTN algorithm (2). The average nucleotide identities for all copies, as determined by BLAST, are shown in Table Table11.
The arrays used in this study were B. cenocepacia 60-mer Agilent oligoarrays with one probe per gene, which have been described previously (18). Here, we focused on the 6,308 probes of this array that were designed based on the gene sequences of the B. cenocepacia strain J2315 genome (reference).
Genomic DNA from all strains was purified from 5-ml cultures grown in LB medium overnight at 37°C by using a Wizard genomic DNA purification kit (Promega, Madison, WI) according to the manufacturer's instructions. Two micrograms of genomic DNA from each strain was sonicated using a Heat Systems Ultrasonics W-225 20-kHz, 200-W cup sonicator (Misonix, Farmingdale, NY) to generate sheared genomic DNA from 0.5 to 5 kb long. For each sample, 300 ng of sheared DNA was labeled with either Cy3 or Cy5 dye using the methods described by Wick et al. (33). Equal amounts of tester and reference (J2315) samples with similar specific activities were mixed in 4 μl 10 mM EDTA (pH 8.0) and heated to 95°C for 5 min. Samples were mixed with 45 μl of SlideHyb buffer 1 (Ambion, Austin, TX) by pipetting. Preparation of hybridization mixtures, loading procedures, and slide washing were performed by using Agilent's recommended protocols (1) with 16 h of hybridization at 65°C and with agitation and the optional stabilization and drying solution final wash. Two or more hybridizations were performed for each tester strain with at least one dye swap. Microarrays were scanned using an Axon GenePix 4000B scanner (Molecular Devices, Sunnyvale, CA), and probe intensities were extracted using GenePix Pro 5.0 software. The log hybridization signal ratio for each probe was calculated using the ratio of the background-subtracted mean probe signal of the tester to the background-subtracted mean probe signal of the reference strain with the following equation: log hybridization ratio = log [(MS − BS)tester/(MS − BS)reference], where MS and BS are the mean probe and median background signals, respectively.
To bioinformatically determine the degree of relatedness between a probe sequence and the target sequence in a tester genome, the BLAST score approach was used, essentially as described previously (26). In brief, the 60-mer oligonucleotide probe sequences were searched against the whole genome of a tester strain using the BLASTN algorithm (2) with the following settings: X = 150 (drop-off value for gapped alignment), q = −1 (penalty for nucleotide mismatch), F = F (filter for repeated sequences), W = 7 (word size), and the default settings for the rest of the parameters. These settings are more robust for identifying moderately related short matches (i.e., sequences showing 60 to 80% nucleotide identity) than the default settings, which target sequences with high levels of identity (>90% nucleotide identity) (17). The BLAST score of a probe against the tester genome was calculated by subtracting the number of internal mismatches from the total length of the alignment for only the best BLASTN match (for multiple matching targets, see below). Thus, the BLAST score for probes that had a perfect match was 60. The BLAST score for a probe against a tester genome was divided by 60 (the score for a perfect match of the probe against the reference genome) to obtain the BLAST score ratio (BSR) of the probe against the tester genome, as follows (similar to calculation of the experimental hybridization signal ratio): BSR (%) = 100 × (BLAST score)tester/60.
To bioinformatically estimate the expected hybridization signal of a probe that had multiple matching targets in a tester genome, the following procedure was used. First, the multiple matches of a probe that had a BSR less than 83% were removed from the analysis because oligoarray hybridization experiments indicated that probes with BSR less than 83% did not typically provide hybridization signals greater than the background signal (Fig. (Fig.1).1). For the remaining matches of a probe, the bioinformatically expected hybridization signal of an individual match was assumed to equal the experimentally derived average hybridization signal of all single-match probes with the same BSR for the match in question (based on the equation shown in Fig. Fig.1,1, inset). Finally, the expected signals for each match of a probe were added to obtain the bioinformatically predicted cumulative signal for the probe due to the multiple matches. This procedure was used for both the tester and reference strains, as follows: predicted hybridization signal for individual match = 0.00009 × exp(0.1601 × BLAST score) and predicted logarithmic signal ratio for probe = log(sum of predicted hybridization signals for tester/sum of predicted hybridization signals for reference).
To bioinformatically determine the genes that were shared by the reference genome and a tester genome, the sequences of all reference genes were searched against the whole genome of the tester strain for reciprocal best-match conserved genes. Conserved genes were defined using the following two cutoffs: (i) for nucleotide-level conservation, BLASTN was used for the search with the settings described above for the probe sequences and a minimum cutoff for a match of at least 70% identity and 70% of the length of the reference gene covered in the corresponding alignment; and (ii) for amino-acid-level conservation, BLASTX was used for the search with default settings and a cutoff for a match of at least 50% identity and 70% of the length of the reference gene in the corresponding alignment (a less stringent cutoff).
Two of the most commonly used methods for determining gene presence based on experimentally derived hybridization signals were compared to our method (see the Results) for the same purposes. In one of these methods (the SNR1 method) the ratio of the background-subtracted signal divided by the standard deviation of the background signal was used (7, 11, 34): SNR1 = (MS − BS) × SBS−1, where SNR is the signal-to-noise ratio, MS and BS are the mean probe and background signals, respectively, and SBS is the standard deviation of background signal. In the other method (the SNR2 method) the ratio of the mean signal to the mean background signal was used, taking the signals of negative controls into account, as described by Loy et al. (22): SNR2 = [MS − (MSN − BSN)] × BS−1, where MSN and BSN are the mean probe signal and background signal of the negative controls, respectively. The most frequently employed SNR cutoffs for these methods are 2.0 and 3.0 (7, 11, 34); an SNR value of 2.0 was used in this study for both methods.
GACK analysis was performed as described by Kim et al. (14). In brief, the GACK algorithm examines the shape of the experimentally derived hybridization signal ratio distribution of an individual microarray experiment to determine a signal threshold for gene presence or absence based on a normal probability density function (expected distribution) and a user-defined estimated probability of presence (EPP). EPP indicates the number of genes that are expected to be absent (i.e., the number of genes that are expected to not give a hybridization signal greater than the background signal) due to sequence divergence of the tester strain from the reference strain. We employed two values for EPP, 0 and 50, in our microarray experiments, as previously suggested (14). We report here only the results obtained using an EPP value of 50 since this standard was more relevant for our experiments that were performed with strains divergent from the reference strain.
The raw microarray intensity data have been deposited in the GenBank Gene Expression Omnibus (GEO) database under accession number GSE20096.
Our DNA-DNA oligoarray experiments were performed with strains of the genus Burkholderia due to the availability of a previously tested 60-mer Agilent oligoarray for an important member of this genus, B. cenocepacia J2315 (18), an epidemic isolate from a cystic fibrosis patient (12), and the availability of several sequenced strains that show different levels of divergence from J2315 (Table (Table1).1). The 16S rRNA gene sequences of the six Burkholderia strains used in this study showed 96 to 99.7% sequence identity, which is consistent with their high level of genetic relatedness (they belong to the same or closely related species). When the genetic relatedness was measured by using genome-aggregate average nucleotide identity (ANI), a more sensitive parameter for measuring evolutionary relatedness of closely related genomes (17), a gradient of genetic relatedness was observed. In particular, strain J2315 showed ~95% ANI to B. cenocepacia strains AU1054 and HI2424. These values match the 95% ANI that corresponds to the 70% DNA-DNA hybridization standard frequently used for species demarcation (9). Hence, these pairs of genomes sample the subspecies level. J2315 showed ~91% and 89% ANI to B. ambifaria AMMD and B. vietnamiensis G4, respectively, which represent different levels of genetic relatedness within the genus. Finally, J2315 showed ~80% ANI to B. xenovorans LB400, which represents the most divergent species sampled within the genus. This genetic gradient provided the opportunity to precisely estimate the number of false positives and false negatives in oligoarray DNA-DNA experiments when J2315 (reference strain) was compared with the five tester strains.
To bioinformatically predict the probes that should provide signals greater than the background signal based on their sequence similarity to the target sequences in the genomes of the tester strains, we used the BLAST score-based approach because of its simplicity and satisfactory accuracy (26). We determined the BLAST score for each probe compared with the top matching gene sequence in the whole genome of a tester strain. Subsequently, we calculated the BLAST score ratio (BSR) of the probe (i.e., the BLAST score with the tester divided by the BLAST score with the reference, which was always 60 due to the perfectly matching probe sequence) and plotted it against the experimentally derived oligoarray hybridization signal ratio for the same probe. The analysis was restricted to probes that had only one match in the genome to avoid the complications resulting from multiple matching targets (see below). Our results revealed that the relationship between the signal ratio and the BSR was clearly biphasic. In particular, a linear dynamic decrease in the oligoarray hybridization signal was observed for BSR of 83 to 100%; at values below the 83% inflection point, the slope of the regression line decreased significantly, and the line became almost flat (Fig. (Fig.1).1). Hence, significant hybridization signals greater than the background signal were obtained only for probes whose BSR with the tester strain (relative to the reference strain) was at least 83%.
The strong linear regression between the signal ratio and the BSR in the 83 to 100% BSR range was attributable to the almost perfect (R2 = 0.99) correlation between the average hybridization signal intensity (based on all available probes) and the BLAST score in the BLAST score range from 50 to 60 (Fig. (Fig.1,1, inset). The latter results also revealed that the hybridization signal can, in general, be predicted robustly from the BLAST score, which is consistent with previous findings (26).
The slope of the regression line for BSR in the 30 to 83% range (Fig. (Fig.1)1) was due to an increase in the hybridization signal from the targets in the reference strain. This is most likely attributable to less competition from the targets of the tester strain for the available probe molecules on the chip in this BSR range due to an increased number of mismatches or no sequence match of the targets with the probe sequence (i.e., little or no hybridization). Hence, more probe molecules were available for the reference targets to bind. These results were reproducible independent of the genetic relatedness of the strain tested to the reference strain.
To evaluate oligoarray performance, we employed a BSR of 83% as the cutoff for predicting bioinformatically which probes should have been expected to provide signals greater than the background signal and the average experimental hybridization signal ratio of all probes that showed a BSR of exactly 83% in each hybridization experiment (e.g., ~−1.3 in the example shown in Fig. Fig.1)1) as the signal threshold for determining which probes provided hybridization signals greater than the background signal. The latter probes represented the genes that would have been considered present in the tester strain based on the experimental hybridization results and the previous signal threshold. Our evaluation revealed that most (>80%) probes provided experimental hybridization results consistent with the bioinformatic prediction using the standards described above (good probes [GP]). However, the fraction of GP decreased with greater divergence of the tester strain from the reference strain (Fig. (Fig.2),2), and there was a corresponding increase in false positives (FP) (i.e., probes that bioinformatically were not supposed to provide a signal greater than the background signal because they matched the tester genome at BSR less than 83%) (Fig. (Fig.1).1). For instance, GP constituted 90.8%, 88.5%, 84.3%, and 77.4% of all probes for tester strains with 95%, 91%, 89%, and 80% ANI to J2315, while the percentages of FP were 4.7%, 5.4%, 9.1%, and 16.7% for the same strains, respectively. The probes that contributed disproportionally to the pool of FP probes were the probes whose BSR were close to the BSR threshold (83%); in fact, there were very few FP probes among the probes that had a BSR less than about 70% (Fig. (Fig.3).3). The number of false-negative (FN) probes (i.e., probes with BSR greater than 83% that provided hybridization signals that were less than the background signal) was relatively constant with the ANI of the tester strain, and these probes constituted about 5 to 6% of all of the probes (Fig. (Fig.22).
To perform an assessment of oligoarray performance for genomotyping that is more relevant for practical use, we also compared the probes that provided experimental hybridization signals greater than the background signal to the actual genes that were shared by the tester and reference genomes based on whole-genome nucleotide sequence comparisons. We found that the number of GP, defined in this case as the probes that gave hybridization signals greater than the background signal regardless of their BSR and corresponded to reference genes present in the tester genome (each probe is specific for one reference gene), also decreased with decreasing ANI of the tester strain to J2315, similar to the probe-level assessment described above (Table (Table22 shows the definitions for GP, FP, and FN for each assessment). However, the decrease was more dramatic for the assessment of gene versus probe level, as the percentages of GP in the former assessment were 90.1%, 86.3%, 82.0%, and 70.0% for tester strains according to their ANI rank (Fig. (Fig.2).2). These findings were due to a higher number of FN probes, defined in this case as the probes that corresponded to shared genes based on the bioinformatic sequence analysis and provided experimental hybridization signals less than the background signal regardless of their BSR, in the assessment of gene versus probe level. The increased number of FN probes in the gene-level assessment was attributable mostly to two factors. One of these factors was tester genes whose levels of nucleotide sequence identity to their orthologs in the reference genome were in the range from 70 to 83%; hence, these genes were considered genes that were shared by the tester and reference genomes (since the level of identity was more than the cutoff for considering a gene present), but the corresponding probe typically provided a hybridization signal less than the background signal (since it had <83% nucleotide identity to the target in the tester genome). This type of FN probes was more abundant for comparisons with distantly related genomes (i.e., genomes with more orthologs with sequence identities in the 70 to 83% range). The second factor was the fact that the FN probes for the gene-level assessment included most, if not all, of the FN probes for the probe-level assessment (i.e., probes the showed BSR of >83% and provided signals less than the background signal). These findings also suggest that DNA-DNA microarray studies can underestimate substantially the actual levels of gene content similarity between the genomes compared. In contrast, the number of FP probes was slightly lower for the assessment of gene versus probe. This was due to the fact that several FP probes for the probe-level assessment (i.e., probes that showed BSR less than 83% and provided hybridization signals greater than the background signal) corresponded to J2315 genes that were actually shared by the tester genome. Such probes were included in the GP category for the gene-level assessment instead of the FP category for the probe-level assessment; thus, the number of FP probes in the gene-level assessment was typically less than the number of FP probes in the probe-level assessment.
When shared genes were assessed at the amino acid level (which provided a less stringent but probably more realistic estimate of shared protein-encoding genes), an even smaller number of GP with decreasing ANI of the tester strain to J2315 was observed. For instance, the percentages of GP were 89.9%, 84.4%, 80.2%, and 64.8% with decreasing ANI at the amino acid level, compared with 90.8%, 88.5%, 84.3%, and 77.4% at the nucleotide gene level, respectively (Fig. (Fig.2).2). The latter results were attributable to the fact that the amino acid level was more conserved than the nucleotide level; thus, more shared genes were found for J2315 and the tester strains in the amino acid comparisons, and more genes in this gene set had related nucleotide sequences with BSR values below the 83% cutoff (contributing FN probes). Finally, the FN and FP probes in the gene- or probe-level assessments were typically tester strain specific, although a few probes consistently showed FP or FN signals independent of the tester genome used (Fig. (Fig.4).4). However, the latter probes were frequently attributed to specific characteristics of the corresponding genes. For instance, the probes that systematically yielded FN signals were frequently associated with genes that evolved faster than the average gene in the genome, such as genes encoding membrane-associated proteins, and thus showed lower levels of nucleotide sequence identity to their J2315 orthologs than to the rest of the genes.
Two of the most popular approaches for defining the presence of genes based on the difference between the probe hybridization signal and the background signal (22) revealed trends similar to those obtained with our method based on the average signal for probes showing a BSR of exactly 83% (data not shown). However, the two former methods performed substantially worse than our BSR method based on two independent experiments with strain HI2424 (~95% ANI to J2315). In particular, the number of FP and FN probes did not exceed 3.7% of all of total probes based on our method, while the FN and FP probes accounted for 8.7% and 10.3% of all of the probes based on the SNR2 and SNR1 methods, respectively (Fig. (Fig.5).5). Also, the percentages of GP were 86.0%, 88.7%, and 92.9% as determined by the SNR1 and SNR2 methods and our method, respectively. Thus, our approach clearly outperformed the approaches described previously. Using a threshold different than 83% is unlikely to further improve the processing of microarray data in practice. For instance, increasing the BSR threshold (i.e., being more stringent) resulted in a disproportionally increased number of FN probes and a proportionally decreased number of FP probes, while decreasing the threshold had the reverse effect, as suggested previously (11).
We also compared the results of our method to those of a tool commonly used in bacterial genomotyping, GACK (14). GACK does not employ an a priori determined signal threshold for gene presence (as SNR methods do) but, similar to our method, determines the threshold based on the shape of the signal ratio distribution for each microarray experiment individually (14). Our evaluation showed that GACK provided slightly worse results than our method for tester strains showing 95% ANI to the reference strain; e.g., the number of GP was smaller, while the performance declined dramatically for strains showing ~90% ANI to the reference strain. In the latter case, the percentage of GP for GACK was only ~75%, compared with ~88% for our method (Fig. (Fig.2).2). The decreased performance of GACK was also reflected by the greater variation (Fig. (Fig.2)2) in the number of GP or FN probes for different oligoarray experiments with the same tester strain (dye swaps). We were not able to perform GACK with more divergent tester strains because GACK requires the shape of the signal ratio distribution to approximate normal in order to determine the signal threshold and such a shape was not observed for these strains (because the majority of the hybridization signals were too similar to the background signal). These findings were also consistent with findings reported previously for divergent tester strains (14). Thus, GACK is not appropriate for tester strains that show less than ~95% ANI to the reference strain (i.e., strains in different species).
To assess the contribution of multiple matching targets to the hybridization signal of a probe (cross-hybridization) and to determine whether a higher hybridization signal can be used to identify duplicated genes in the tester genome compared to the reference genome (or vice versa), the following analysis was performed. We identified the J2315 probes that matched two or more genes in the B. vietnamiensis G4 genome with BSR of ≥83%. Fifty-five probes in the total pool of about 6,308 probes met these criteria. Subsequently, we bioinformatically predicted the expected hybridization signals for these probes in the tester and reference genomes, as described in Materials and Methods, and compared the resulting ratios to the actual, experimentally derived hybridization signal ratios for the same probes. The comparisons indicated that the experimental hybridization signal ratio correlated significantly better with the predicted ratio based on all matches of a probe than with the predicted signal based on only the best match (Pearson correlation coefficients, 0.65 and 0.48, respectively) (Fig. (Fig.6).6). Hence, a higher hybridization signal may indicate multiple matching targets (e.g., duplicated genes) in the tester strain compared to the reference strain, and the equations described here could be used to identify such duplicated genes.
Oligoarrays represent a mature technology; however, all issues associated with the broad applications of oligoarrays have not been fully elucidated, as a plethora of recent studies have indicated (25, 27, 29). In this study, we used complete genomic sequences of several tester strains to validate microarray results and to obtain quantitative assessments of the performance of oligoarrays for whole-genome DNA-DNA competitive hybridization. Our findings reveal that the numbers of microarray FP and FN probes are not negligible even for tester strains of the same species; e.g., they may constitute >5% of all probes. Our findings also provide robust estimates of the number of FP and FN that should be expected based on the degree of genetic relatedness of the tester strain to the reference strain (Fig. (Fig.2).2). Our analysis showed that the performance of oligoarrays decreased gradually (as opposed to abruptly) with decreasing genetic relatedness of the tester strains in the 80 to 100% ANI range. Accordingly, some of the hybridization signal data, particularly the absence of a signal, which indicated that the corresponding gene was not shared, were reliable even with highly divergent tester strains (e.g., strains showing ~80% ANI to the reference strain). As a rule of thumb, however, we recommend that tester strains should not exhibit less than about 90% ANI to the reference strain, if it is expected that at least 90% of the probes will provide reliable signals (GP) (Fig. (Fig.2).2). We also did not observe any apparent biases in the number of FP or FN probes among the probes for core and noncore genes in the genome (“core” denotes genes shared by all genomes in a group), except for a few predictable genes. For instance, informational genes (e.g., genes encoding ribosomal proteins, DNA and RNA polymerases, etc.), which tend to show higher levels of nucleotide sequence conservation than the genome average level (15), had more GP and fewer FN probes than genes that tend to evolve faster than the average gene, such as genes encoding membrane and sensory proteins, transposases, and hypothetical proteins.
Notably, FP results were observed even with probes that had very low BSR (e.g., <70%) with the target sequence, and these probes accounted for about one-half of all FP probes (despite the fact that FP probes with BSR closer to 83% were more frequent [Fig. [Fig.3]).3]). The former FP probes were likely attributable to nonspecific or nonorthologous target sequences that matched the probe sequences in short but identical segments, as hypothesized previously (11). However, we were unable to come up with simple and reliable rules for sequence similarity or the position of internal mismatches in the probe-target sequence pair that underlie the behavior of such FP probes since the target sequences that hybridized to the probes were not known. Further, our evaluations indicated that predicting a priori which probes are likely to provide FN (or FP) signals is not generally feasible, since these probes were typically tester strain specific (Fig. (Fig.4).4). Thus, it should be taken for granted that in DNA-DNA experiments and in expression experiments with strains other than the reference strain, a portion of the hybridization signal, which depends primarily on the degree of genetic relatedness of the tester strain to the reference strain (Fig. (Fig.2),2), cannot be trusted to be a reliable proxy for gene presence and activity, respectively.
Traditional approaches for determining gene presence based on DNA-DNA microarray experiments have typically employed thresholds for the probe hybridization signal based on the background hybridization signal (20, 22) or based on the shape of the signal distribution for all probes (14). Although these approaches are intuitive and easy to use, they are based on arbitrary thresholds and/or are vulnerable to variable biases resulting from slide heterogeneities, including unequal DNA templates, different amounts of incorporated dye, uneven labeling and hybridization conditions etc., and from the degree of genetic divergence of the tester strains (5, 28). In contrast to these traditional methods, the BLAST score-based method developed in this study employs slide-specific thresholds based on the dynamic hybridization range of probes (i.e., the average signal corresponding to the inflection point [Fig. [Fig.1])1]) and outperforms the previously described approaches (Fig. (Fig.22 and and5).5). For the Agilent oligoarrays used in this study, the dynamic hybridization range corresponded consistently to BSR of 83 to 100%, regardless of the tester strain used (data not shown), while the average hybridization signal of probes with a BSR of 83% (i.e., the signal threshold for determining gene presence) varied slightly in different hybridization experiments, depending primarily on the dye incorporation efficiency and the amount of DNA labeled for each dye. The BSR of ~83% is similar to the 85% nucleotide sequence identity cutoff proposed previously for determining the 50-mer probes that were expected to cross-hybridize with the target sequences (11).
To apply our approaches to nonsequenced tester strains, we recommend obtaining the sequences of at least a few genes in the genome of each strain (e.g., genes that are sequenced as part of multilocus sequence typing [MLST]). These sequences could be used to determine the relationship between hybridization signal and BSR, as shown in Fig. Fig.1.1. The curve of this relationship can subsequently be employed to make educated predictions about the signal threshold that corresponds to the inflection point. It can also help normalize the hybridization signal across different oligoarray experiments based on the shape of the curve for each experiment (e.g., by applying a normalization factor that would bring the inflection point to the same signal and BSR values for each slide). Our previous work also provided a means to estimate ANI values for any two strains for which MLST data are available (16).
Gene duplication and independent acquisition of highly similar genes are important evolutionary processes in bacteria and lower eukaryotes, and they frequently reflect ecological adaptation of a lineage, such as increased virulence. However, using competitive DNA-DNA oligoarray experiments to identify such “duplicated” genes in the tester strain compared to the reference strain or vice versa is technically challenging. We found that duplicated genes typically resulted in increased hybridization signals (Fig. (Fig.6);6); thus, an increased signal may in fact indicate duplicated genes. While the trend line shown in Fig. Fig.66 would be useful for identifying such candidate genes, further improvements in modeling the expected hybridization signal based on sequence similarity beyond what is encompassed by the BSR (25) and/or employing several probes per gene are necessary for more accurate predictions.
Although we tested only one type of oligoarray platform, our preliminary work with other platforms, such as the oligo-spotted MWG arrays for Escherichia coli (MWG Biotech) and a custom-made 50-mer in situ-synthesized oligoarray from Biodiscoveries LLC (Ann Arbor, MI), provided results similar to those obtained with the Agilent oligoarrays (our unpublished observations). Nonetheless, small variations in performance between the different platforms were observed, and in situ oligoarrays typically performed better than spotted oligoarrays and had a clearer dynamic hybridization range. Therefore, our results provide useful practical guidelines for designing microarray experiments for a wide range of microarray platforms. Further, the genus Burkholderia targeted by our Agilent oligoarrays is one of the most metabolically versatile bacterial genera known, and its members have some of the largest genomes (~8 Mbp); it also includes both clinical (e.g., J2315) and nonclinical representatives involved in cycling organic matter in soils and sediments and controlling fungal diseases. Therefore, our findings obtained with Burkholderia strains have important implications for studying additional members of this important genus with oligoarrays and should be applicable to other bacterial groups as well. While bacterial genome sequencing is becoming less costly, genomotyping with microarrays will likely remain cost-effective for larger-scale strain comparisons and for environmental surveys of genes of complex microbial communities for some time, since adequately covering all members of a community with sequencing remains out of reach for current sequencing technologies.
We thank Erick Cardenas and two anonymous reviewers for helpful discussions regarding the manuscript. We thank the Sanger Institute and the Joint Genome Institute for giving us early access to the genome sequences used in this study.
This work was supported by the NSF (award DEB-0516252 to J.T. and K.T.K.), the U.S. Department of Energy (award DE-FG02-07ER64389 to J.T. and K.T.K.), and the Cystic Fibrosis Foundation Therapeutics.
Published ahead of print on 12 March 2010.