The rapid and accurate identification of pathogens is critical in the control of infectious disease. To this end, we analyzed the capacity for viral detection and identification of a newly described high-density resequencing microarray (RMA), termed PathogenID, which was designed for multiple pathogen detection using database similarity searching. We focused on one of the largest and most diverse viral families described to date, the family Rhabdoviridae. We demonstrate that this approach has the potential to identify both known and related viruses for which precise sequence information is unavailable. In particular, we demonstrate that a strategy based on consensus sequence determination for analysis of RMA output data enabled successful detection of viruses exhibiting up to 26% nucleotide divergence from the closest sequence tiled on the array. Using clinical specimens obtained from rabid patients and animals, this method also showed a high species-level concordance with standard reference assays, indicating that it is amenable to the development of diagnostic assays. Finally, 12 animal rhabdoviruses which are currently unclassified, unassigned, or assigned as tentative species within the family Rhabdoviridae were successfully detected. These new data allowed an unprecedented phylogenetic analysis of 106 rhabdoviruses and further suggest that the principles and methodology developed here may be used for broad-spectrum surveillance and the broader-scale investigation of biodiversity in the viral world.
The ability to simultaneously screen for a large panel of pathogens in clinical samples, especially viruses, will represent a major development in the diagnosis of infectious diseases and in surveillance programs for emerging pathogens. Currently, most diagnostic methods are based on species-specific viral nucleic acid amplification. Although rapid and extremely sensitive, these methods are suboptimal when testing for a large number of known pathogens, when viral sequence divergence is high, when new but related viruses are anticipated, or when no clear viral etiologic agent is suspected. To overcome these technical difficulties, newer technologies have been employed, especially microarrays dedicated to pathogen detection. Indeed, DNA microarrays have been shown to be a powerful platform for the highly multiplexed differential diagnosis of infectious diseases. For example, pathogen microarrays can be simultaneously used to screen various viral or bacterial families and have been successfully used in the detection of microbial agents from different clinical samples (10-12, 19, 32, 35, 41, 42, 48).
The “classical” DNA microarrays developed so far are based on the use of long-oligonucleotide pathogen-specific probes (≥50 nucleotides [nt]). Although powerful in terms of sensitivity, these diagnostic tools have the disadvantage of decreased specificity, making it necessary to target multiple markers, and rely on hybridization patterns for pathogen identification, leading to unquantifiable errors (4). Moreover, these methods lack comprehensive information about the pathogen at the single-nucleotide level, which could represent a major problem when the sequences in question show a high degree of similarity (21). The microarray-based pathogen resequencing assay represents a promising alternative tool with which to overcome these limitations. This method identifies each specific pathogen and is capable of resequencing, or “fingerprinting,” multiple pathogens in a single test. Indeed, this technology uses tiled sets of 10^5 to 10^6 25-mer probes, which contain one perfectly matched and three mismatched probes per base for both strands of the target genes (16). This technology also offers the potential for a single test that detects and discriminates between a target pathogen and its closest phylogenetic neighbors, which expands the repertoire of identifiable organisms far beyond those that are initially included in the array. Successful results have been obtained using this technology, especially for the detection of broad-spectrum respiratory tract pathogens using respiratory pathogen microarrays (2, 25, 26) or the detection of a broad range of biothreat agents (1, 23, 36, 45). The amplification step, which is often the limiting step for this technology, has also benefited from recent developments. Phi29 polymerase-based amplification methods provide amplified DNA with minimal changes in sequence and relative abundance for many biomedical applications (3, 31, 40).
The amplification factor varied from 10^6 to 10^9, and it was also demonstrated that coamplification occurred when viral RNA was mixed with bacterial DNA (3). This whole-transcriptome amplification (WTA) approach can also be successfully applied to viral genomic RNA of all sizes. Amplifying viral RNA by WTA provides considerably better sensitivity and accuracy of detection than random reverse transcription (RT)-PCR in the context of resequencing microarrays (RMAs) (3).
The rhabdoviruses are viruses with single-stranded, negative-sense RNA genomes and are classified into six genera. Three of these genera (Vesiculovirus, Lyssavirus, and Ephemerovirus) include arthropod-borne agents that infect birds, reptiles, and mammals, as well as a variety of non-vector-borne mammalian or fish viruses (International Committee on Taxonomy of Viruses database [ICTVdb]) (reviewed in reference 7). These rhabdoviruses are the etiological agents of human diseases, such as rabies, that cause serious public health problems. Some rhabdoviruses also cause important economic losses in livestock. The three other genera are Nucleorhabdovirus and Cytorhabdovirus, which are arthropod-borne viruses infecting plants, and Novirhabdovirus, which comprises fish viruses. Other than the well-characterized rhabdoviruses that are known to be important for agriculture and public health, there is also a constantly growing list of rhabdoviruses, isolated from a variety of vertebrate and invertebrate hosts, that are partially characterized and are still awaiting definitive genus or species assignment. Considering the large spectrum of potential animal reservoirs of these viruses compared to the few identified virus species, it is highly likely that the number of uncharacterized rhabdoviruses is immense.
Unclassified or unassigned viruses have been tentatively identified as members of the family Rhabdoviridae by electron microscopy, based on their bullet-shaped morphology—a characteristic trait of members of this family—or using their antigenic relationships based on serological tests (9, 38). Gene sequencing and phylogenetic relationships have then been progressively applied to complete this initial virus taxonomy (6, 22, 27). Importantly, a strongly conserved domain in the rhabdovirus genome, within the polymerase gene, is a useful target for the exploration of the distant evolutionary relationships among these diverse viruses (6). This region corresponds to block III of the viral polymerase, a region predicted to be essential for RNA polymerase function, as it is highly conserved among most of the RNA-dependent RNA polymerases (14, 33, 46). A direct application using this sequence region was recently described for lyssavirus RNA detection in human rabies diagnosis (13). Taking advantage of these characteristics, this polymerase region was also used to design probes for high-density RMAs, also called PathogenID arrays (Affymetrix), which are optimized for the detection and sequence determination of several RNA viruses, particularly rhabdoviruses (1).
In the present study, PathogenID microarrays containing probes for the detection of up to 126 viruses were tested using a consensus sequence determination strategy for the analysis of output RMA data. We demonstrate that this approach has the potential to identify, in experimentally infected and clinical specimens, known but also phylogenetically related rhabdoviruses for which precise sequence information was not available.
Two generations of PathogenID arrays were used in this study: PathogenID v1.0, containing probes for the detection of 42 viruses (including 3 prototype rhabdoviruses), 50 bacteria, and 619 toxin or antibiotic resistance genes (previously described in reference 1), and PathogenID v2.0, which is able to detect 126 viruses (including 30 different rhabdoviruses), 124 bacteria, 673 toxin or antibiotic resistance genes, and two human genes as controls. These arrays include prototype sequences of all of the species (or genotypes) of the genus Lyssavirus, of the other major genera defined in the family Rhabdoviridae, such as Ephemerovirus and Vesiculovirus, and of 13 rhabdoviruses awaiting classification or tentatively classified among minor groups such as the Le Dantec and Hart Park groups (6). For all of the selected probes tiled on the two versions of the PathogenID array, the same conserved region of the viral polymerase gene was used (block III). However, the target region tiled on the array was longer in the second version (up to 937 nt in length for some sequences, compared to roughly 500 nt in the first version) (Tables 1 and 2).
Detailed descriptions of all of the prototype and field virus strains used in this study and their sources are listed in Tables 1 and 2. Briefly, 16 and 31 different viruses were tested using PathogenID v1.0 (15 lyssaviruses and 1 vesiculovirus) and PathogenID v2.0 (14 lyssaviruses, 1 vesiculovirus, and 12 unassigned and 4 tentative species of animal rhabdoviruses according to ICTVdb), respectively. Samples tested included in vitro-infected cells, a synthetic nucleotide target (when the corresponding virus strain was not available), brain biopsy specimens obtained from experimentally infected mice, and biological specimens from various animals (bat, cat, dog, and fox brains) and humans (brain, saliva, and skin biopsy specimens).
RNA was extracted from biological samples with TRI Reagent (Molecular Research Center) according to the manufacturer's recommendations. After extraction, viral RNAs were reverse transcribed and then amplified using the whole-transcriptome amplification (WTA) protocol (QuantiTect Whole Transcriptome kit; Qiagen) as described previously (3).
All of the amplification products obtained from viral RNA were quantified with Quant-iT BR (Invitrogen) according to the manufacturer's instructions or with the NanoDrop ND-1000 spectrophotometer (Thermo Scientific). A recommended amount of target DNA was fragmented and labeled according to the GeneChip Resequencing Assay manual (Affymetrix). The microarray hybridization process was carried out according to the protocol recommended by the manufacturer (Affymetrix). All of the details and parameter settings for the data analysis (essentially conversion of raw image files obtained from scanning of the microarrays into FASTA files containing the sequences of base calls made for each tiled region of the microarray) have been described previously (1). The base call rate refers to the percentage of base calls generated from the full-length tiled sequence.
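The base call rate defined above can be illustrated with a minimal Python sketch. The function name, the use of "N" for undetermined calls, and the optional tiled-length argument are assumptions made for illustration; they are not taken from the analysis software used in this study.

```python
from typing import Optional


def base_call_rate(resequenced: str, tiled_length: Optional[int] = None) -> float:
    """Percentage of determined calls (A, C, G, T) over the full tiled length.

    Assumption: undetermined positions are written as 'N' and the read
    spans the whole tiled region unless tiled_length is given explicitly.
    """
    seq = resequenced.upper()
    length = len(seq) if tiled_length is None else tiled_length
    if length <= 0:
        raise ValueError("tiled length must be positive")
    called = sum(1 for base in seq if base in "ACGT")
    return 100.0 * called / length


if __name__ == "__main__":
    # 6 determined calls out of a 10-nt tiled region
    print(base_call_rate("ACGTNNNNAC"))
```

Passing the tiled length explicitly is useful when the read has already been trimmed, so that the denominator remains the full-length tiled sequence.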
In the first approach, resequencing data obtained by the PathogenID v1.0 microarray were manually submitted to the NCBI nr/nt database for BLASTN query. The default BLAST options were modified. The word size was set to 7 nt. The expected threshold was increased from its default value of 10 to 100,000 to reduce the filtering of short sequences and sequences rich in undetermined calls, which can assist correct taxonomic identification. To avoid false-negative results induced by high numbers of undetermined nucleotides in the sequences, the “low complexity level filter” (−F) was also turned off. BLAST sorts the resulting hits according to their bit scores so that the sequence that is the most similar to the entry sequence appears first. Identification of the virus strains tested was considered successful only when the best hit was unique and corresponded to the expected species or isolate (according to the nucleotide sequences of these viruses already available in the NCBI nr/nt database).
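As a hedged illustration of these settings, the sketch below builds an equivalent command line for the BLAST+ `blastn` program and applies the "unique best hit" rule. The queries described above were submitted manually to NCBI, and the −F flag belongs to the legacy BLAST interface, so the BLAST+ option names used here (`-word_size`, `-evalue`, `-dust`) are a translation assumed for illustration. The function names and file names are hypothetical.

```python
from typing import List, Optional, Tuple


def build_blastn_command(query_fasta: str, database: str) -> List[str]:
    """Return a BLAST+ blastn command mirroring the modified options."""
    return [
        "blastn",
        "-task", "blastn",       # classic blastn task, which accepts short word sizes
        "-query", query_fasta,
        "-db", database,
        "-word_size", "7",       # word size of 7 nt
        "-evalue", "100000",     # expect threshold raised from 10 to 100,000
        "-dust", "no",           # low-complexity filter turned off
        "-outfmt", "6",          # tabular output, convenient for parsing
    ]


def unique_best_hit(hits: List[Tuple[str, float]]) -> Optional[str]:
    """Return the subject with the highest bit score, or None if it is tied."""
    if not hits:
        return None
    top = max(score for _, score in hits)
    best = [name for name, score in hits if score == top]
    return best[0] if len(best) == 1 else None


if __name__ == "__main__":
    print(" ".join(build_blastn_command("sample.fasta", "nt")))
    print(unique_best_hit([("Rabies virus", 52.8), ("Mokola virus", 40.1)]))
```

The command list can be passed to `subprocess.run` on a machine where BLAST+ and a nucleotide database are installed; the sketch itself does not run BLAST.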
In the second approach, an automatic bioinformatics-based analysis of RMA data provided by PathogenID v2.0 was developed, including a consensus sequence determination strategy completed with a systematic BLAST strategy. The general workflow of this strategy is represented in Fig. 1. A Perl script reads the input data, which consist of one FASTA file per sample that contains all of the sequences read by the GSEQ software from the hybridization. A modified version of the filtering process described by Malanoski et al. (29) is applied to the sequences. The retained sequences contain stretches of nucleotides that are ascertained according to the following algorithm. Briefly, sequences that do not contain subsequences fulfilling specific user-defined parameters (minimum nucleotide length [m] and maximum undetermined nucleotide content [N]) are discarded. These parameters differ from those described in the original filtering process, where m was fixed at 20 and N was a value depending on m, leading to the filtering out of all short subsequences, even those with a high base call rate. For subsequence determination, the program starts from the first base call of the sequence considered and searches for the first window of m bases that meets the elongation threshold defined by the user, which represents another difference from the filtering process described by Malanoski et al., where this elongation threshold was fixed at 60% (29). The subsequence is extended by one base (m + 1) if the percentage of N remains below the elongation threshold. When this threshold is exceeded, the elongation is stopped and the subsequence is retained. This process is repeated until the end of the sequence is reached to generate as many informative sequences as possible. All of our analyses were performed with the following filtering parameters: m = 12, N = 10, and elongation threshold = 10%.
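The window-and-elongation filter described above can be sketched in Python as follows. This is an illustrative reading of the text, not the Perl script used in the study. It assumes that N is expressed as a percentage, that a window must begin on a determined base, that a subsequence is kept as long as its N percentage does not exceed either threshold, and that the search resumes immediately after each retained subsequence. The function and parameter names are hypothetical.

```python
from typing import List


def informative_subsequences(
    seq: str,
    m: int = 12,                       # minimum subsequence length
    max_n_percent: float = 10.0,       # maximum undetermined content (N)
    elongation_percent: float = 10.0,  # elongation threshold
) -> List[str]:
    """Extract subsequences of at least m bases with a low fraction of N."""
    seq = seq.upper()
    kept: List[str] = []
    i = 0
    while i + m <= len(seq):
        window = seq[i:i + m]
        n_count = window.count("N")
        # Assumption: a window starts on a determined base call.
        if seq[i] == "N" or 100.0 * n_count / m > elongation_percent:
            i += 1
            continue
        end = i + m
        # Extend one base at a time while the N percentage stays acceptable.
        while end < len(seq):
            new_n = n_count + (1 if seq[end] == "N" else 0)
            if 100.0 * new_n / (end + 1 - i) > elongation_percent:
                break
            n_count = new_n
            end += 1
        sub = seq[i:end]
        # Separate user-defined cap on undetermined content.
        if 100.0 * n_count / len(sub) <= max_n_percent:
            kept.append(sub)
        i = end  # resume the search after the retained stretch
    return kept


if __name__ == "__main__":
    read = "ACGTACGTACGT" + "NNNNNN" + "ACGTACGTACGT"
    print(informative_subsequences(read))
```

With m = 12 and both thresholds at 10%, a 12-nt window may contain at most one undetermined base, which matches the "at least 12 nt long with no more than one undetermined base" criterion used for the consensus analysis.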
A systematic BLAST strategy to search for sequence homologues was then performed with the filtered sequences containing subsequences. These sequences individually undergo a BLAST analysis based on a local viral and bacterial database (sequences obtained after filtering from the NCBI nr/nt database, updated and used for BLAST queries in December 2009), and the taxonomies of the best BLAST hits are retrieved (Fig. 1A). The default BLAST options were modified as previously described. When several hits obtain the highest bit score, the script automatically retrieves the taxonomies of the first 10 BLAST hits. The final taxonomic identification of each virus strain tested was done by the user as follows: (i) identification at the species or isolate level when a unique best hit corresponds to the expected species or isolate, (ii) identification at the genus level (if available) when multiple best viral hits exist and correspond to different species within the same genus of the family Rhabdoviridae, (iii) identification at the family level when multiple best viral hits exist and correspond to different rhabdovirus genera, or (iv) negative or inaccurate identification when a BLAST query is not possible or when multiple best hits correspond to other viral families, respectively.
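Rules (i) to (iv) were applied by the user, but they can be expressed as a small decision function. The Python sketch below is one possible encoding under stated assumptions: the `Hit` record and its fields are hypothetical, an empty genus string stands for "no genus assigned", and any best hit outside the family Rhabdoviridae is treated as an inaccurate identification. A unique best hit that does not match the expected species is not covered by the rules above; the sketch falls back to the genus or family level in that case.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hit:
    bit_score: float
    family: str
    genus: str    # empty string when no genus is assigned
    species: str


def identification_level(hits: List[Hit], expected_species: str) -> str:
    """Classify a BLAST result as species, genus, family, negative, or inaccurate."""
    if not hits:
        return "negative"                  # (iv) no BLAST result
    top = max(h.bit_score for h in hits)
    best = [h for h in hits if h.bit_score == top]
    if {h.family for h in best} != {"Rhabdoviridae"}:
        return "inaccurate"                # (iv) best hits in other families
    if {h.species for h in best} == {expected_species}:
        return "species"                   # (i) unique best species
    genera = {h.genus for h in best}
    if len(genera) == 1 and "" not in genera:
        return "genus"                     # (ii) several species, one genus
    return "family"                        # (iii) several rhabdovirus genera


if __name__ == "__main__":
    hits = [
        Hit(60.2, "Rhabdoviridae", "Lyssavirus", "Rabies virus"),
        Hit(60.2, "Rhabdoviridae", "Lyssavirus", "Duvenhage virus"),
    ]
    print(identification_level(hits, "Rabies virus"))
```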
For the consensus sequence determination strategy, resequencing data obtained from rhabdoviral tiled sequences are filtered as previously described and then submitted to a multiple alignment with CLUSTAL W (39), from which a consensus sequence is determined (Fig. 1B). For each sequence in the alignment, if a called base has undetermined calls on both sides, it is replaced by an undetermined call. If different calls appear in the sequences for a given position, the majority base call is added to the consensus. The positions that contain an undetermined call or a gap are not considered in the majority base call computation. If multiple base calls tie for the majority, an undetermined call appears at this position in the consensus sequence. This procedure generally increases the length and accuracy of the query sequence for subsequent analysis. Homology searching of the consensus sequences is performed with BLAST using the parameters previously described, and the taxonomy of the best hit is retrieved as for the systematic homology searching approach. We tested whether the resulting consensus sequences had higher identification accuracy than any individual sequence and whether they could be used to design PCR primers for the characterization of a potential novel isolate.
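The consensus rules described above (masking of isolated calls, majority vote that ignores undetermined calls and gaps, and an undetermined call on ties) can be sketched in Python as follows. The input is assumed to be an existing multiple alignment of equal-length strings, with "N" for undetermined calls and "-" for gaps. The function names are hypothetical, and emitting "N" for a column with no determined call is an assumption.

```python
from collections import Counter
from typing import List

UNDETERMINED = "N"
GAP = "-"


def mask_isolated_calls(seq: str) -> str:
    """Replace a called base flanked by N on both sides with N."""
    out = list(seq)
    for i in range(1, len(seq) - 1):
        isolated = seq[i - 1] == UNDETERMINED and seq[i + 1] == UNDETERMINED
        if seq[i] not in (UNDETERMINED, GAP) and isolated:
            out[i] = UNDETERMINED
    return "".join(out)


def consensus(aligned: List[str]) -> str:
    """Majority-vote consensus over aligned sequences of equal length."""
    if not aligned:
        return ""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must have the same length")
    masked = [mask_isolated_calls(s.upper()) for s in aligned]
    result = []
    for column in zip(*masked):
        # Undetermined calls and gaps do not take part in the vote.
        counts = Counter(b for b in column if b not in (UNDETERMINED, GAP))
        ranked = counts.most_common(2)
        if not ranked:
            result.append(UNDETERMINED)    # assumption: no call in this column
        elif len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            result.append(UNDETERMINED)    # tie for the majority
        else:
            result.append(ranked[0][0])
    return "".join(result)


if __name__ == "__main__":
    print(consensus(["ACGTN", "ACNTA", "ATGTA"]))
```

Because each column is decided from determined calls only, a position that is undetermined in one read can be recovered from another read, which is how the consensus can be longer than any single resequenced stretch.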
Conventional sequencing was undertaken after the PCR amplification of viral targets directly from biological samples (after RNA extraction and RT) or from 10- to 100-fold water-diluted WTA products. Primer design was first based on consensus sequences obtained using the consensus sequence determination strategy previously described and/or on rhabdovirus nucleotide sequences available in GenBank. Depending on the results obtained and the virus strain tested, the primer design, the set of primers used, and the PCR conditions used for partial polymerase gene amplification were then adjusted (the list of primers and the PCR conditions are available on request from the corresponding author). All PCR products were obtained using the proofreading DNA polymerase Ex Taq (Takara). Sequence assembly and consensus sequences were obtained using Sequencher 4.7 (Gene Codes).
The data set of 15 newly sequenced rhabdoviruses from this study (including the Sandjimba and Kolongo viruses previously only identified on the basis of partial nucleoprotein gene sequences, as well as Piry virus, for which the nucleotide sequences of different genes were available) was compared with the corresponding block III polymerase amino acid sequences of 91 other rhabdoviruses collected from GenBank (see Table 6). DNA translation was performed with BioEdit software (17), and sequence alignment was performed using the CLUSTAL W program (39) and then checked for accuracy by eye. This resulted in a final alignment of 106 sequences, each 160 amino acid residues in length. Phylogenetic analysis of these sequences was then undertaken using the Bayesian method available in the MRBAYES package (18). This analysis utilized the WAG model of amino acid replacement with a gamma distribution of among-site rate variation. Chains were run for 10 million generations (with a 10% burn-in), at which point all of the parameter estimates had converged. The level of support for each node is provided by Bayesian posterior probability (BPP) values.
To test whether PathogenID microarrays, and specifically the prototype tiled regions, could be used for the identification of a broad number of viral variants without relying on predetermined hybridization patterns, representative animal viruses from the family Rhabdoviridae (including unassigned or tentatively classified rhabdoviruses according to ICTVdb) were studied. The capability of these RMAs to identify and discriminate between near phylogenetic neighbors was first tested using one sequence of the genus Lyssavirus (strain PV, genotype or species 1) tiled on the first generation of the PathogenID microarray (Table 1). It was possible to use BLAST to successfully identify virus strains with approximately 18% nucleotide divergence compared to the prototype (Fig. 2). The hybridization of 15 virus strains representative of the genetic diversity found in this species indicated that a single tiled sequence was able to detect all of the variant strains belonging to the same species.
In addition, we evaluated the spectrum of detection of the second generation of the PathogenID microarray, which included one prototype sequence representative of each of the seven described species in the genus Lyssavirus (Table 2). All of the isolates tested led to the correct species identification using a systematic BLAST strategy when the hybridized target belonged to the same species as a sequence tiled on the array (Table 3). Moreover, all of the tested isolates of a known genotype were also recognized by heterospecific tiled sequences (Table 3). We also investigated the capacity of this RMA to detect more distantly related viruses not yet classified into a species. Isolates 0406SEN and WCBV, which have been proposed to represent new species of the genus Lyssavirus (5, 15), were surprisingly recognized by almost all of the seven species sequences tiled on the PathogenID v2.0 microarray (Table 3). This recognition indicates that each sequence tiled on the array has the ability to identify strains that are more than 18% divergent, and up to 25.9% in some cases (Table 3). This analysis also reveals that information on a strain hybridized on PathogenID v2.0 can be obtained from distinct species or isolates tiled on the array. Evaluation of the spectrum of detection of this RMA was further extended to two other genera of the family Rhabdoviridae, Ephemerovirus and Vesiculovirus (Table 4). Here again, successful identification was achieved using homospecific sequences tiled on the array, confirming the reliability of the identification.
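For readers who wish to reproduce divergence figures of this kind, the Python sketch below computes an uncorrected percentage of nucleotide differences between two aligned sequences. The text does not state how divergence was calculated, so the choice of an uncorrected pairwise distance, the exclusion of positions with gaps or undetermined calls, and the function name are all assumptions.

```python
def percent_divergence(seq_a: str, seq_b: str) -> float:
    """Uncorrected percentage of differing positions between aligned sequences.

    Positions where either sequence has a gap or an undetermined call
    are excluded from the comparison.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * mismatches / compared


if __name__ == "__main__":
    # Two differences over ten comparable positions
    print(percent_divergence("ACGTACGTAC", "ACGTACGTTT"))
```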
In both experiments (Tables 3 and 4), low base call rate values were obtained for several combinations of hybridized and tiled sequences. These values were sufficient for viral identification by BLAST, despite the presence of sequence reads as short as 14 nt. This indicates that most of these short sequences corresponded to highly conserved sequence domains. The accuracy of these short sequences was checked by comparison with those obtained by classical sequencing (data not shown).
A bioinformatic workflow was developed to gather stretches of sequence reads obtained with more or less distantly related sequences tiled on PathogenID v2.0. The aim of this strategy was to increase the length of the sequence determined in order to improve the sensitivity of the BLAST analysis compared to previously described methodologies (29). All of the sequence reads obtained from prototype sequences of the genus Lyssavirus (at least 12 nt long with no more than one undetermined base, whether or not they initially led to a positive BLAST identification) were used to generate a contiguous sequence. When overlapping fragments were identified, a consensus sequence was generated to remove ambiguous or undetermined base calls. The methodology used to obtain consensus sequences confirmed the species identification after BLAST analysis in the case of the seven lyssavirus nucleotide sequences used for hybridization (Table 5). Moreover, these consensus sequences were found to be more powerful in identifying unclassified or new species of lyssaviruses not tiled on the RMA than resequencing data collected individually from each tiled sequence, as shown for strains 0406SEN and WCBV. In both cases, an increase in the base call rate was observed using this consensus sequence strategy, from 63.5% (best base call rate obtained from individual prototype sequences) to 75.9% for strain 0406SEN and from 32.7% to 60.9% for WCBV (Tables 3 and 5). Once again, this increase in nucleotide base determination was associated with a relatively high accuracy (91.8% and 97.3% concordance between the consensus sequences and the reference sequences of isolates 0406SEN and WCBV, respectively) (Table 5). To further demonstrate the ability of this strategy to detect and identify novel virus species, consensus sequences were generated based only on six of the seven prototype tiled sequences (excluding the homospecific sequence of the same species tiled on the array).
All of the strains of the seven species tested were accurately and specifically identified using this restricted approach (Table 5). These results indicate that the consensus sequences obtained could improve the detection of a novel domain(s) not identified using only the closest prototype sequence tiled on the RMA.
A total of 17 brain biopsy samples originating from experimentally infected mice and various clinical samples (n = 8) obtained from the National Reference Centre for Rabies at the Institut Pasteur were tested for lyssavirus detection and identification using the two versions of the PathogenID microarray (Tables 1 and 2). These specimens were previously collected from humans and animals with clinically documented encephalitis and suspected of having rabies. They were used to compare RMA results with conventional methods of diagnosis, including the RT-heminested PCR (RT-hnPCR) technique for the intra vitam diagnosis of rabies in humans (13), the fluorescent-antibody test, the rabies tissue culture inoculation test, and the enzyme-linked immunosorbent assay for the postmortem diagnosis of humans and animals (8, 47). Among the eight clinical samples, most were brain biopsy specimens collected from different rabid mammals, including a bat, a cat, a dog, and two foxes, and from a human. The two other samples comprised a saliva specimen and a skin biopsy sample collected from two different rabid human patients (Tables 1 and 2). Except for the skin biopsy case, which was not recognized, this comparison demonstrated a complete concordance between our method and conventional methods for all of the samples tested. In addition, the accuracy of the sequences provided by the PathogenID microarray was close to that obtained using classical sequencing (data not shown). The failure to detect lyssavirus in the skin biopsy sample was probably due to insufficient sensitivity of the current RMA method, as viral RNA was only weakly detected after RT-hnPCR.
In sum, these results demonstrated that the newly developed amplification process by WTA coupled to hybridization to the PathogenID microarray allowed the detection of a large range of viral variants from various complex biological samples, including clinical samples (Tables 1 and 2).
Broad-spectrum detection was demonstrated using the consensus sequence-based analysis strategy among viruses of the family Rhabdoviridae, and the more distantly related viruses examined included many viruses that are not yet classified as species. Accordingly, 17 different rhabdoviruses were tested by using brain samples from experimentally infected mice (n = 16) or an infected cell suspension. These viruses included four strains belonging to the genus Vesiculovirus, with Vesicular stomatitis Indiana virus (VSIV) and Boteke (BTKV), Jurona (JURV), and Porton's (PORV) viruses, the latter three of which are currently classified as tentative species; two strains belonging to the genus Ephemerovirus, the Kimberley (KIMV) and kotonkan (KOTV) viruses, corresponding to a tentative and an unassigned species, respectively; and 11 presently unassigned rhabdoviruses, namely, the Kamese (KAMV), Mossuril (MOSV), Sandjimba (SAJV), Keuraliba (KEUV), Nkolbisson (NKOV), Garba (GARV), Nasoule (NASV), Ouango (OUAV), Bimbo (BBOV), Bangoran (BGNV), and Gossas (GOSV) viruses (virus taxonomy according to ICTVdb) (Table 2).
In the first step, successful detection and identification of these viruses using the PathogenID v2.0 microarray was obtained for 12 (70.6%) out of 17 viruses; accurate taxonomic positioning (that is, within the family Rhabdoviridae) was also achieved, and for some, the corresponding genus (when available) was also matched accurately (data not shown). In the second step, specific and consensus primers were designed based on the stretches of sequences identified by the microarray using the consensus sequence determination strategy and then used for PCR and classical sequencing of the amplified target nucleotide sequences. For four (GARV, NASV, OUAV, and BBOV) of the five rhabdoviruses not detected by the microarray, a region of 1,000 nt of the polymerase gene encompassing that tiled on the array was successfully amplified by PCR and sequenced using the primers described above. The only exception was the GOSV isolate, which remained undetected by either the microarray or PCR. Further, two other rhabdoviruses not previously tested with the PathogenID v2.0 microarray, Kolongo virus (KOLV, an unclassified species) and Piry virus (PIRYV, a vesiculovirus), were also amplified and sequenced using these primers.
All of the newly sequenced nucleotide regions of the polymerase gene were further translated into protein sequences and aligned with 88 sequences of animal or plant rhabdoviruses obtained from GenBank, producing a total data set of 106 sequences 160 amino acid residues in length. A Bayesian phylogenetic analysis of these sequences tentatively distinguished 15 groups of viruses based on their strongly supported monophyly (Table 6 and Fig. 3). The members of the six genera (Ephemerovirus, Lyssavirus, Vesiculovirus, Cytorhabdovirus, Nucleorhabdovirus, and Novirhabdovirus) fall into well-supported monophyletic groups (BPP value, ≥0.97) (Fig. 3). Interestingly, this analysis suggested the existence of at least nine more groups of currently unclassified rhabdoviruses, which reflect important biological characteristics of the viruses in question. Five of these groups have been proposed previously and were further supported by our analysis (data available at the CRORA database website [http://www.pasteur.fr/recherche/banques/CRORA/]) (6, 27; reviewed in reference 7). The first group, tentatively named the Hart Park group, contains the previously described Parry Creek (PCRV), Wongabel (WONV), Flanders (FLANV), and Ngaingan (NGAV) viruses together with the newly identified viruses BGNV, KAMV, MOSV, and PORV. This group has a wide distribution that encompasses Africa, Australia, Malaysia, and the United States. These viruses have a wide host range, as they have been found to infect dipterans, birds, and mammals. The second group is the Almpiwar group, containing four members: two strains of Charleville virus (CHVV), i.e., CHVV_Ch9824 and CHVV_Ch9847, and the Almpiwar (ALMV) and Humpty Doo (HDOOV) viruses. Viruses of this group were isolated in Australia and are associated with infections of dipterans and lizards but also birds and mammals, including humans.
Another group, herein referred to as the Le Dantec group, was also seen to form a distinct cluster with Le Dantec virus (LDV), Fukuoka virus (FUKV), and the two newly molecularly identified viruses KEUV and NKOV. Members were isolated in Japan and Africa, where they were shown to infect dipterans and mammals, including humans. The fourth group has been tentatively named the Tibrogargan group and includes the Tupaia (TUPV) and Tibrogargan (TIBV) viruses. These viruses were isolated in Southeast Asia, Australia, and New Guinea from dipterans and mammals. Finally, we observed the Sigma group as previously described (27). It includes Drosophila affinis (DAffSV), Drosophila obscura (DObsSV), and two strains of Drosophila melanogaster (SIGMAV_AP30 and SIGMAV_HAP23) sigma viruses, infecting Drosophila flies which were found in the United States and Europe.
In addition, four other tentative groups of viruses are newly described in this study. The Sandjimba group includes BBOV, BTKV, NASV, GARV, and OUAV, which are molecularly classified here for the first time, and the previously described Oak-Vale virus (OVRV), SJAV, and KOLV (identification of the latter two based only on a limited region of the nucleoprotein gene). These viruses were isolated from birds and dipterans from the Central African Republic and Australia (data available at http://www.pasteur.fr/recherche/banques/CRORA/) (6, 9). Interestingly, all of the African members of this group clustered closely, whereas the sole Australian virus was more divergent, suggesting a potential geographical segregation. Second, the Sinistar group includes the Siniperca chuatsi rhabdovirus (SCRV) isolated from mandarin fish in China (37) and the starry flounder rhabdovirus (SFRV) from starry flounder in the United States (30). These two viruses appear to be more closely related to the Le Dantec group than to viruses in the genus Vesiculovirus, in which several other fish rhabdoviruses are classified. The third one is the Moussa group, including two isolates of Moussa virus (MOUV_D24 and MOUV_C23) collected from mosquitoes in Ivory Coast (34). Finally, the phylogenetic analysis suggests the presence of another group within the plant rhabdoviruses: the Taastrup group, which comprises the single isolate Taastrup virus (TV) isolated from leafhoppers (Psammotettix alienus) originally collected in France (28). All of these groups were strongly supported by the Bayesian analysis (BPP value, ≥0.98), with the exception of the Sigma group, which exhibits a BPP value of 0.88.
In addition, classification of some uncharacterized rhabdoviruses from our phylogenetic analysis diverged from that previously suggested by serology (according to ICTVdb) and will probably need further investigation to determine their precise taxonomic positions within the family Rhabdoviridae (Table 6) (9, 38). In particular, PORV and BTKV, previously identified as vesiculoviruses, were included within the Hart Park and Sandjimba groups, respectively, and NKOV was classified into the Le Dantec group instead of the Kern Canyon group. Moreover, in contrast to a previous phylogenetic study (22), TUPV was found to be more closely related to TIBV than to any other isolates in the Sandjimba group. Finally, our study confirmed the previous serology-based classification of JURV and the recently identified Scophthalmus maximus rhabdovirus (SMRV) within the Vesiculovirus genus (38, 49).
We have analyzed the capacity for viral detection and identification of two versions of a newly described RMA, termed PathogenID, which was designed specifically for multiple pathogen detection using database similarity searching (1). To evaluate this microarray, we focused on one of the largest and most diverse viral families described to date, the Rhabdoviridae (ICTVdb, reviewed in reference 7). All of the virus strains tested (except WCBV) were extracted from biological samples and amplified using a nonspecific and unbiased WTA step as previously described (3). Rhabdovirus-targeted sequences were selected among blocks of conservation within the polymerase gene (6). This region was chosen so as to encompass a sufficient number of homologous but also polymorphic sites. The key advantage of this RMA strategy is that it does not require a specific match between the samples tested and tiled sequences; indeed, mismatches add value as they allow precise typing of the unknown genetic resequenced element. In our case, the conserved nature of the target region of the polymerase gene (block III) and the detection capability of the RMA allow a precise taxonomic identification (i.e., family, genus, species) and also provide key information on phylogenetic relationships for some unclassified, unassigned, or tentative species of rhabdoviruses. For example, results obtained by the PathogenID v1.0 microarray evaluation demonstrated that most of the intraspecies nucleotide diversity found in the genus Lyssavirus can be covered by a single prototype sequence tiled on the microarray.
Using the second version of PathogenID which included one prototype sequence of each of the seven species recognized thus far within the genus Lyssavirus, we extended the spectrum of detection of the RMA to potentially all of the known or unknown lyssaviruses (i.e., positive detection of virus isolates presenting up to 25.9% nucleotide divergence with the tiled sequence considered), which is greater than that previously reported (24-26, 43, 44).
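To make a figure such as 25.9% nucleotide divergence concrete, the following is a minimal sketch of a pairwise divergence calculation. It assumes that the two sequences are already aligned to the same length and that only positions with an unambiguous base in both sequences are compared; the function name and these conventions are illustrative assumptions, not the procedure used in this study.

```python
def percent_divergence(query: str, tiled: str) -> float:
    """Percentage of mismatched positions between two aligned sequences.

    Hypothetical helper: positions where either sequence carries anything
    other than A, C, G, or T (for example an N no-call or a gap) are skipped.
    """
    if len(query) != len(tiled):
        raise ValueError("sequences must be aligned to the same length")

    valid = set("ACGT")
    compared = 0
    mismatches = 0
    for q, t in zip(query.upper(), tiled.upper()):
        if q in valid and t in valid:
            compared += 1
            if q != t:
                mismatches += 1

    if compared == 0:
        raise ValueError("no unambiguous positions to compare")
    return 100.0 * mismatches / compared


# One mismatch over eight compared positions gives 12.5% divergence.
print(percent_divergence("ACGTACGT", "ACGTACGA"))
```

Skipping no-call positions keeps the estimate from being inflated by array positions where no base could be determined, at the cost of basing the percentage on fewer sites.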
This study also indicates that accurate viral identification may still be possible even when only shorter sequences are obtained from individual tiled prototype sequences. Indeed, taken individually, these short stretches of nucleotide sequence could not give positive results during the initial BLAST query. However, when used in the consensus sequence determination strategy employed here, they improved the identification of virus strains distantly related to those tiled on the RMA. For example, we were able to test and detect rhabdoviruses based on sequence data obtained with tiled sequences that originated from other viral genera.
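The idea of pooling short base-call stretches from several tiled prototype sequences can be sketched as follows. This is only an illustration under stated assumptions: the reads are taken to be already aligned over the same homologous region, N marks a no-call, and each position is resolved by a simple majority vote among called bases. The function name and the voting rule are hypothetical and are not the authors' published procedure.

```python
from collections import Counter


def build_consensus(calls: list[str]) -> str:
    """Merge aligned base-call strings into one consensus sequence.

    Assumed conventions (not from the study): every string has the same
    length, 'N' marks a no-call, and ties or empty columns yield 'N'.
    """
    if not calls:
        raise ValueError("at least one base-call string is required")
    length = len(calls[0])
    if any(len(c) != length for c in calls):
        raise ValueError("all base-call strings must have the same length")

    consensus = []
    for column in zip(*(c.upper() for c in calls)):
        # Count only unambiguous base calls in this alignment column.
        counts = Counter(base for base in column if base in "ACGT")
        ranked = counts.most_common(2)
        if not ranked:
            consensus.append("N")  # no prototype produced a call here
        elif len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            consensus.append("N")  # conflicting calls with equal support
        else:
            consensus.append(ranked[0][0])
    return "".join(consensus)


# Three short, partly overlapping stretches from different tiled prototypes.
reads = ["ACNNNNGT", "NNGTNNNN", "NNNTACNN"]
print(build_consensus(reads))  # ACGTACGT
```

None of the three toy reads covers the whole region on its own, but the merged sequence does, which is the property that lets a longer consensus succeed in a similarity search where the individual stretches were too short.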
The strategy developed here also allowed the potential detection of genetically diverse rhabdoviruses, whether previously identified or unknown, using a limited number of sequences tiled on the microarray. Using the PathogenID v2.0 microarray, we were able to identify 30 rhabdoviruses in total. This included 12 viruses currently unclassified, unassigned, or assigned as tentative species within the family Rhabdoviridae (according to ICTVdb). Moreover, the consensus sequence-based analysis of RMA results was shown to be accurate compared to sequences obtained through classical sequencing (Table 5 and data not shown). Sequence data provided by the PathogenID v2.0 microarray were also extremely helpful in the design of specific primers to further sequence the targeted region of the viral polymerase gene of some other rhabdoviruses. Finally, this approach allowed us to undertake the largest phylogenetic analysis of the family Rhabdoviridae (Table 6 and Fig. 3), even though it is important to note that the list of viruses and potential taxa described here is still incomplete and more viruses will clearly be characterized in the near future. Despite these phylogenetic divisions, all of the viruses included in these proposed groups are closely related to vesiculoviruses and ephemeroviruses and were found to infect a large spectrum of animals, including dipterans and mammals (previously referred to as the dimarhabdovirus supergroup) (6), but also lizards (Almpiwar group), birds (especially the Sandjimba group but also the Hart Park group), and fish (Sinistar group) (Table 6).
Although the approach is promising, inadequate sequence selection for the design of the RMA, and the consequent lack of coverage of the viral sequence space, represents an important limitation. Blocks of conserved sequence across taxonomic subdivisions in the viral world could be similarly defined and targeted by the RMA assay, which would improve the detection power of this tool and thereby greatly aid in the identification of members of the family Rhabdoviridae or even other viral families. The results presented here validate the usefulness of the design methodology. They emphasize the gain in identification obtained with a consensus sequence determination strategy compared to a systematic BLAST strategy (29). Indeed, this strategy allows us to use and accurately analyze the RMA output data, even if only short subsequences with a high base call rate are obtained. It provides an informative alternative to current molecular methods, such as classical or multiplex PCR, for the rapid identification of viral pathogens. It is currently being applied to assist in the development of a new generation of RMA aimed at the detection and identification of genetically diverse and unknown viral pathogens and more broadly of any virus present in a clinical specimen. In contrast to conventional microarrays, it is not limited by the requirement of prior knowledge of the identities of viruses present in biological samples and it is not restricted to the detection of a limited number of candidate viruses. As such, this strategy has great potential for implementation as a high-throughput platform to identify more divergent viral organisms. This technology could be especially useful in clinical diagnosis or in surveillance programs for detecting uncharacterized viral pathogens or highly variable virus strains in the same taxonomic genus or family, which is frequently the case for RNA viruses (2).
The potential applications of such a methodology therefore appear to be numerous: differential diagnostics for illnesses with multiple potential causes (for example, central nervous system diseases such as encephalitis and meningitis), tracking of emergent pathogens, the distinction of biological threats from harmless phylogenetic neighbors, and the broader-scale investigation of biodiversity in the viral world.
This work was supported by grant UC1 AI062613 (G. C. Kennedy) from the U.S. National Institute of Allergy and Infectious Diseases, National Institutes of Health; the Programme Transversal de Recherche (PTR DEVA 246) from the Institut Pasteur, Paris, France; the European Commission, through the VIZIER Integrated Project (LSHG-CT-2004-511966); and the Institut Pasteur International Network Actions Concertées InterPasteuriennes (2003/687). We are also grateful for the financial support provided through the Total-Institut Pasteur sponsorship.
We are grateful to D. Blondel, H. Zeller, and the CRORA database for having provided some of the rhabdovirus isolates tested in this study. We are also grateful to the technical staff of the Genotyping of Pathogens and Public Health Technological Platform for their patience and their excellent work in the sequencing of the different rhabdoviruses.
Published ahead of print on 7 July 2010.