|Home | About | Journals | Submit | Contact Us | Français|
Sequencing data analysis remains limiting and problematic, especially for low complexity repeat sequences and transposon elements due to inherent sequencing errors and short sequence read lengths. We have developed a program, ReviSeq, which uses a hybrid method comprised of iterative remapping and local assembly upon a bacterial sequence backbone. Application of this method to six Brucella suis field isolates compared to the newly revised Brucella suis 1330 reference genome identified on average 13, 15, 19 and 9 more variants per sample than STAMPY/SAMtools, BWA/SAMtools, iCORN and BWA/PINDEL pipelines, and excluded on average 4, 2, 3 and 19 variants per sample, respectively. In total, using this iterative approach, we identified on average 87 variants including SNVs, short INDELs and long INDELs per strain when compared to the reference. Our program outperforms other methods especially for long INDEL calling.
The program is available at http://reviseq.sourceforge.net.
The genome sequences of microbial field isolates often contain a substantial number of loci different from the published references due to the high rate of mutation in bacterial replication (ca. 1/300 per genome per replication) . Fortunately variant calling in bacterial genomes is relatively straightforward compared to that for eukaryotic studies because bacterial genomes are haploid. Incorrect variant calling in bacterial genomes is often caused by structural variants or incorrect mapping due to sequence variants in diverse repeat sequences including tandem repeats and transposon elements. Sequencing errors rarely cause incorrect variant calling because they are easily identified by designing the study to have a high depth of raw sequence coverage (i.e. >20×). Variants occurring in repeat sequences can incorrectly fool mapping programs into assigning high quality scores to incorrectly mapped reads when the sequence reads from the repeat loci are significantly different from the reference sequence (e.g. length variation at two or more tandem repeat loci containing the same motif often causes incorrect mapping of sequence reads and high quality scores to the reads). This leads directly to invalid variant calls in repeat loci because the variation calling programs rely only on the mapping quality scores to filter out false positive variants from incorrectly mapped reads. Several programs have been developed to find structural variations such as insertions, deletions and copy number variation, but they also have a limitation in searching for long (i.e. > 8 bases) insertions or deletions when the number of incorrectly mapped sequences at a locus is high. An improved mapping post processing step is necessary to correct for this class of incorrect variant calls.
To address these issues in variant calling, we have developed ReviSeq which uses an iterative backbone remapping and local assembly method to generate and revise bacterial genome sequences from short sequence reads and a reference sequence. Previous iterative retrieval approaches used in several de novo assembly methods [2, 3] are limited in application to resequencing analysis because they do not assemble contigs into large structural sequences, especially in or near low complexity repetitive sequences. iCORN, which uses an iterative mapping approach to revise a genome sequence, was developed for resequencing, but it does not correct long INDELs because iCORN's approach uses simple iterative mapping and does not benefit from local re-assembly . Here, we report an advanced iterative remapping and local assembly approach which generates the revised whole genome sequence structure at each iteration based upon a backbone sequence structure.
We demonstrated the effectiveness of this approach for identifying accurate sequence variants found within the bacterial mutants of Brucella field isolates. Brucella is a gram-negative pathogenic bacteria that causes zoonotic disease in domestic animals  and has been designated as a category B priority pathogen. Consuming milk products from or having direct contact with infected animals may result in transmission to humans via penetration of skin or mucosal membranes . At the start of resequencing analysis project, it is important to choose a suitable reference genome sequence against which high probability variants can be identified. The variation identified is a foundation for many downstream analyses. Due to the pathogenicity of Brucella, the results of variation detection are the basis for developing assays that are critical to the detection and mitigation of Brucella, as both a potential bioterrorism threat and as an infectious agent.
The Brucella genome is composed of two circular chromosomes of 2.1 Mbp and 1.2 Mbp. The first fully sequenced organism in the genus Brucella was B. melitensis biovar 1 which was published in 2002 . Currently, the complete genome sequences of 11 additional Brucella organisms are publically available and many other genomes from other strains in the genus Brucella are in the process of being sequenced. Here we used the genome sequence of Brucella suis 1330 as a reference for detection of variants in six field isolates of Brucella collected from several different hosts that exhibited highly similar characteristics to Brucella suis 1330 in the ‘gold standard’ antibody diagnostic tests and biochemical tests . Currently, two different versions of Brucella suis 1330 genome sequences are available. The original sequence was published in 2002  and has been used as a reference in several resequencing studies [10, 11], and the revised sequence of the ‘same’ sample was published recently .
The Brucella genome contains an 842 bp IS711 transposon element  that is unique to Brucella and exists at several different locations in the genome. The published Brucella suis 1330 reference genomes have seven copies of IS711. If two or more close proximity variants exist within a transposon element, then this can lead to incorrect mapping of sequencing reads and therefore wrong variant calling in these regions.
The published Brucella suis 1330 genome sequences also have 10 loci containing 8-mer tandem repeats (≥ 3 motif copies) which are highly variable in its field isolate genomes. When the lengths of tandem repeats at these loci are dramatically different from the reference, the reads containing these elements can produce invalid alignments or be mapped to incorrect loci, leading again to incorrect variant calls.
Through this iterative remapping and local assembly approach, we identified an average of 87 variants in genomes of field isolate samples with respect to the reference sequence (Supplemental Table 1). Conversely, traditional mapping approaches called approximately 70 variants from genome sequences of each field isolate with respect to the revised Brucella suis 1330 genome . The largest differences observed between the reference genome and field isolates were observed within two long additional sequences (69 bp and 78 bp) in genomes of all isolates (Fig. 1), which were not detected by traditional mapping methods. Interestingly, since the sequences exist not only in genome sequences of all six field isolates but also in seven other sequenced Brucella species - Brucella abortus S19, Brucella abortus biovar 1, Brucella canis ATCC 23365, Brucella melitensis 16M, Brucella microti CCM 4915, Brucella ovis ATCC 25840 and Brucella suis ATCC 23445 - the sequences are likely to be deletions in the specific sample used to generate the Brucella suis 1330 reference sequences rather than new insertions in the field isolates or closely related species.
To address the limitation of the traditional resequencing methods in correcting mismapping/misalignment of sequence reads, variant calling results of ReviSeq were compared with that from the traditional resequencing analysis method which uses a simple mapping and variant calling pipeline. The most well known pipelines are combinations of BWA/SAMtools and STAMPY/SAMtools. Both mapping programs, BWA (ver. 0.5.9)  and STAMPY (ver. 1.0.13) , were executed with default options, and SAMtools  (mpileup) was used to generate pileup files. Since the purpose of this comparison was to evaluate the limitation of the methods in variant calling for haploid genomes due to incorrect read alignments, we used a simple filtering approach to call variants instead of using variant calling programs such as ‘bctools’ of SAMtools or SNVer . The variants were called from the pileup files only if they were supported by at least 40% of reads covering each locus which were themselves covered by at least 10 reads. For an insertion, we applied 40% × (read length-insertion length)/read length, considering the probability of a read completely covering the inserted sequence. Another program, iCORN (ver. 0.97), which uses an iterative mapping method, was also compared. To assess the reliability of variant calling of each pipeline, we generated a consensus sequence for the pipeline by replacing the reference bases with the variants identified by the pipeline using sample 13. Then we remapped sequence reads of sample 13 to the consensus sequence using BWA and counted the number of problematic reads including unmapped reads, reads with mismatches, clipped (partially aligned) reads, pair unmapped reads and long distance pairs (>500 bases, an average distance was 290 bases with 25 as a standard deviation) in the mapping result. The ReviSeq pipeline shows the smallest percentages for all types of problematic reads, which suggests that it is the most reliable among all compared pipelines (Fig. 2).
Average numbers of variants called by BWA/SAMtools, STAMPY/SAMtools pipelines and iCORN were 78, 74 and 72, respectively (Table 1). Compared to the results of ReviSeq, STAMPY/SAMtools, BWA/SAMtools and iCORN predicted on average of 4, 2 and 3 more variants, and called on average of 13, 15 and 19 less variants per sample respectively. Each variant region inconsistent with the results of ReviSeq and its aligned reads were visually inspected. All three pipelines showed limitations in identifying long INDELs, and called several incorrect SNVs and short INDELs mainly in long INDELs loci (Fig. 3). We also tested the ABYSS  assembly/BWASW  mapping/SAMtools pipeline and the BWA/SNVer  pipeline along with the local alignment program of GATK  applied to the BWA/SAMtools pipeline but did not observe significant improvement (Supplemental Table 2).
INDELs predicted by PINDEL , a program to search for structural variations, were also compared with our results. The mapping results of BWA were used as input data to PINDEL. To reduce false positive calls in the PINDEL results, INDELs supported by at least 10 reads were chosen for comparison. Compared to the results of the ReviSeq pipeline, PINDEL predicted an average of 19 more variants and called an average of 9 less variants per sample (Table 2). Many of insertions and deletions falsely called by PINDEL were predicted from mis-alignments around long INDELs. Interestingly, a few insertions and deletions attributable to possible chimeric DNA, mutants, reads from different strains or contaminants existing in the samples were predicted by PINDEL but not by our method. Since the loci were unique regions in the genome and the frequency of reads contributing to these INDEL predictions were much lower than that of the average read depth (lower than 20% of average depth), but consistent in all variant calling results, we concluded that they were not derived from mis-mapping or mis-alignment (mapped to correct position but partially misaligned) and were therefore appropriately not called by our method.
We additionally sequenced four 8-mer tandem repeat loci and two C homopolymer loci for samples 17, 22, 29, 34 and 35 using the Sanger method to confirm the variants called by our method and the other pipelines, which were different (Supplemental Table 3). Peaks in chromatograms for two C homopolymer loci at all samples were not clearly identified due to possible heterozygous alleles (Supplemental Fig. 1). ReviSeq could correctly identify alleles of 9 loci among 16 loci containing variants (4 in sample 17, 3 in sample 22, 3 in sample 29, 3 in sample 34 and 3 in sample 35), while STAMPY/SAMtools, BWA/SAMtools and iCORN pipelines identified 1, 0 and 0 loci. ReviSeq could not correctly identify alleles from seven tandem repeat loci of which allele lengths were close to or longer than the read lengths (76 bases), and this limitation could be reduced with longer reads.
All six field isolates have a similar number of variants when compared with the revised reference. Among them, the number of common variants in all isolates totaled 39, including 32 SNVs and 7 INDELs. The pairwise comparison of their sequence variants is illustrative of the number of common variants of each pair (Supplemental Table 4), which reveals the evolutionary relationship of the samples (Fig. 4). Specifically, samples 13 and 22 have an additional insertion site of an IS711 element at position 1,578,904 of chromosome 1, which has not yet been reported in any of the previously sequenced Brucella species. The insertion site immediately follows the stop codon of a protein coding locus annotated as a ribose ABC transporter at the Brucella suis 1330 reference.
As samples 29 and 34 were isolated from bovine tissue derived from the same herd, variants in their genomes were almost identical except for one SNV and the lengths of three 8-mer tandem repeat loci. Interestingly, although samples 13 and 22 were isolated from different hosts (samples 13 from equine tissue and 22 from bovine milk), their genomes have more common variants than seen in the other genomes. None of the variation in the genes in the six isolates have been reported to affect the host preferences of the Brucella genus, but approximately 24% of the genes have unknown functions which may be related to host preferences (Supplemental Table 5).
Since next-generation sequencing (NGS) technologies were invented, resequencing followed by mapping to a reference genome has become one of the most widely used approaches for comparative genomic analysis. Even though this advanced methodology has enabled robust comparison of multiple individuals in a time and cost efficient manner, the traditional resequencing analysis methods using a simple mapping and variant calling pipeline were susceptible to falsely calling variants in the vicinity of repeat sequences and structural variants. Here, we have employed an iterative remapping and local assembly approach to improve variant calling from sequencing data and illustrated its effectiveness via analysis of six Brucella suis field isolates whose genomes contain several IS711 transposon elements and 8-mer tandem repeats which are highly variable. As variation analysis results using a resequencing/mapping approach can affect the quality of downstream studies, it is important to correctly identify variants. Using the reliable results from this approach, we identified a number of interesting variants, some common, among the field isolates, which are helping to understand the role that these variants may play in important issues such as transmission, pathogenicity and host preference shifting. For example, we have identified sequence differences within a large number of genes which have unknown functions (Supplemental Table 5), so, for example, host preference of the isolates still remain unexplained, and accurate viariant calls may help target appropriate mechanistic studies necessary to explain this aspect of the host-pathogen behavior of this species. To further enhance the utility of this approach in bacterial genome research, inversion and rearrangement of a genome sequence with respect to the backbone, which are also common variations in bacterial genomes and remain challenging, will be addressed in future versions.
The Texas Animal Health Commission (TAHC) provided Brucella samples from six field isolates that exhibited characteristics highly similar to Brucella suis 1330 in standard antibody and biochemical diagnostic tests  for carbon dioxide utilization, production of hydrogen sulphide and dye (thionin and basic fuschin) sensitivity. The samples were collected from equine tissue (sample 13), porcine tissue (sample 17), bovine milk (sample 22), and bovine tissue (sample 29, 34 and 35). Genomic DNA for each sample was sequenced via the Illumina GAIIx sequencer and data acquired using the standard Illumina 101 (sample 13) and 76 (sample 17, 22, 29, 34, and 35) cycle paired-end protocols generating approximately 23,500,000 sequencing read pairs (47,000,000 reads) per sample.
The Iterative remapping and local assembly approach used by ReviSeq is illustrated in Figure 5. This process was performed on each sample independently. To begin the analysis, we trimmed all low quality base calls from the ends of sequencing reads to remove base calls that have a high probability of being incorrect. Trimming at each read started from the end base of each read and stopped when a high quality base call (≥ Q20, Phred quality score) was encountered. If one of two reads in a pair were shorter than 20 bases after removing the low quality base calls, both reads were excluded. We then mapped the reads to the revised reference genome, S0, using BWA. This approach resulted in more than 99.9% of the reads mapping to the reference, yielding average sequence coverage of 1,460× and 1,048× for the 101- and 76-cycle data, respectively.
A new consensus sequence, S1, was generated using SAMtools from the read mapping results. Concurrent with mapping, we also performed two independent de novo assemblies of the data using ABYSS (with k-mer 55)  and the CLC genomics workbench (with default options) to enable detection of long insertion and deletion events, and possible genomic re-arrangements. After assembly, we aligned the resulting contig sequences and the new consensus sequence, S1, to the reference sequence, S0, using BWASW  which is a mapping tool to align long sequences to a reference. A new consensus sequence, S2, was generated using the simple majority voting method from the mapping results of the de novo assembly contigs and S1. For this step, we gave higher priority to S1 than contigs. As a result, if S1 and contig sequences were different at a locus, S1 was chosen unless two de novo assembly contig sequences were consistent. This step allows for detection of large INDEL variants that are not correctly detected by a simple mapping method.
We next mapped the sequencing reads to the new consensus sequence, S2, using BWA to generate another consensus sequence, S3. If consensus sequences S2 and S3 were different, the reads were mapped to S3 to test whether all reads were correctly aligned to the same positions and a new consensus, S4, was generated from the newly aligned reads. The mapping/comparing was iterated until Si-1 and Si converged. The purpose of this iteration was to remap incorrectly mapped reads which would otherwise cause incorrect variant calling at IS711 transposon elements.
Next, partially mapped reads (clipped reads) to Si were counted to search for long INDELs which were not detected in prior alignment steps. When the number of partially mapped reads and abnormal pairs (distance of a pair > μ + 3σ or < μ - 3σ, where μ is the mean and σ is the standard deviation of the distance between pairs) was more than 10% of total mapped reads at a position in Si, the partially mapped reads were locally assembled by searching for exact matches only and their contig was subsequently aligned to Si. If a long INDEL was detected, Si was modified and used as a new reference sequence and resubmitted to the iterative remapping process until there was no change.
The final consensus sequence of each sample was aligned to the reference sequence by BWASW. Due to the varying lengths of the 8-mer tandem repeat loci and other long INDELs such as a new insertion of an IS711 element in the sample genome, BWASW could not map the whole query sequence to the reference sequence as a single alignment. Instead, it fragmented the query sequence into several pieces and aligned them to the reference separately. We extended the partial alignments adding INDEL information to merge them into one alignment and called the variants from the alignment. Further, the length variants (from 40 bases deletions to 80 bases insertions) of the 8-mer tandem repeat loci in field isolate genomes were fixed and confirmed by Sanger sequencing.
This work was supported by the Medical Informatics and Systems Division director's funds; the U.S. Department of Homeland Security through the National Center of Excellence for Foreign Animal and Zoonotic Disease Defense at Texas A&M University [subaward 570636 from DHS 2007-ST-061-000002]; and the National Institute of Health, National Human Genome Research Institute, 1000 Genomes Project Dataset Analysis Grant [grant number 1U01HG-005719-01]. Genomic DNAs of Brucella field isolates were provided by TAHC (Texas Animal Health Commission) and Texas A&M University. The Core Laboratory Facility and the Data Analysis Core at the Virginia Bioinformatics Institute sequenced the genomic DNAs and assembled contigs, respectively.
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.