The genome sequences of microbial field isolates often differ from published references at a substantial number of loci because of the high rate of mutation in bacterial replication (ca. 1/300 per genome per replication) [1]. Fortunately, variant calling in bacterial genomes is relatively straightforward compared with that in eukaryotic studies because bacterial genomes are haploid. Incorrect variant calls in bacterial genomes are most often caused by structural variants or by incorrect mapping due to sequence variants in diverse repeat sequences, including tandem repeats and transposon elements. Sequencing errors rarely cause incorrect variant calls because they are easily identified when the study is designed with a high depth of raw sequence coverage (i.e. >20×). Variants occurring in repeat sequences, however, can mislead mapping programs into assigning high quality scores to incorrectly mapped reads when the reads from the repeat loci differ significantly from the reference sequence (e.g. length variation at two or more tandem repeat loci containing the same motif often causes reads to be mapped incorrectly yet assigned high quality scores). This leads directly to invalid variant calls in repeat loci because variant calling programs rely only on mapping quality scores to filter out false positive variants from incorrectly mapped reads. Several programs have been developed to find structural variations such as insertions, deletions and copy number variation, but they are also limited in detecting long (i.e. >8 bases) insertions or deletions when the number of incorrectly mapped reads at a locus is high. An improved post-mapping processing step is therefore necessary to correct this class of incorrect variant calls.
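The effect of coverage depth on sequencing-error artefacts can be illustrated with a simple binomial model. The sketch below is not part of the method described here: the function name, the 1% per-base error rate, the assumption that errors are independent and spread evenly over the three wrong bases, and the majority threshold are all illustrative assumptions.

```python
from math import ceil, comb


def false_call_probability(depth: int, error_rate: float = 0.01,
                           min_fraction: float = 0.5) -> float:
    """Probability that sequencing errors alone make one specific wrong base
    appear in at least `min_fraction` of the reads covering a haploid site.

    Assumptions (illustrative, not taken from the text): errors are
    independent between reads, and an erroneous read shows each of the
    three wrong bases with equal probability.
    """
    # Chance that a single read shows one particular wrong base.
    p = error_rate / 3.0
    # Smallest number of agreeing erroneous reads that would win the call.
    k_min = max(1, ceil(depth * min_fraction))
    # Upper tail of the binomial distribution.
    return sum(comb(depth, k) * p**k * (1.0 - p) ** (depth - k)
               for k in range(k_min, depth + 1))


if __name__ == "__main__":
    for depth in (1, 5, 20, 40):
        print(depth, false_call_probability(depth))
```

Under these assumptions the probability falls steeply as depth grows, which is consistent with the statement that random sequencing errors are easy to filter at high coverage. Systematic mis-mapping in repeats is not independent between reads, so it is not suppressed in the same way.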
To address these issues in variant calling, we have developed ReviSeq, which uses an iterative backbone remapping and local assembly method to generate and revise bacterial genome sequences from short sequence reads and a reference sequence. Previous iterative retrieval approaches used in several de novo assembly methods [2] are limited in their application to resequencing analysis because they do not assemble contigs into large structural sequences, especially in or near low-complexity repetitive sequences. iCORN, which uses an iterative mapping approach to revise a genome sequence, was developed for resequencing, but it does not correct long INDELs because it relies on simple iterative mapping and does not benefit from local re-assembly [4]. Here, we report an advanced iterative remapping and local assembly approach that generates a revised whole-genome sequence structure at each iteration based upon a backbone sequence structure.
We demonstrated the effectiveness of this approach for accurately identifying sequence variants in field isolates of Brucella. Brucella is a genus of Gram-negative pathogenic bacteria that cause zoonotic disease in domestic animals [5] and has been designated a category B priority pathogen. Consuming milk products from infected animals, or having direct contact with them, may result in transmission to humans via penetration of the skin or mucosal membranes [6]. At the start of a resequencing analysis project, it is important to choose a suitable reference genome sequence against which high-probability variants can be identified, because the variation identified is the foundation for many downstream analyses. Given the pathogenicity of Brucella, the results of variation detection are the basis for developing assays that are critical to the detection and mitigation of Brucella, as both a potential bioterrorism threat and an infectious agent.
The Brucella genome is composed of two circular chromosomes of 2.1 Mbp and 1.2 Mbp. The first fully sequenced organism in the genus Brucella was B. melitensis biovar 1, which was published in 2002 [7]. Currently, the complete genome sequences of 11 additional Brucella organisms are publicly available, and many genomes from other strains in the genus are being sequenced. Here we used the genome sequence of Brucella suis 1330 as a reference for detecting variants in six field isolates of Brucella that were collected from several different hosts and that closely resembled Brucella suis 1330 in the ‘gold standard’ antibody diagnostic tests and biochemical tests [8]. Currently, two different versions of the Brucella suis 1330 genome sequence are available. The original sequence was published in 2002 [9] and has been used as a reference in several resequencing studies [10], and a revised sequence of the ‘same’ sample was published recently [12].
The Brucella genome contains an 842 bp IS711 transposon element [13] that is unique to Brucella and occurs at several different locations in the genome. The published Brucella suis 1330 reference genomes have seven copies of IS711. If two or more variants lie in close proximity within a transposon element, sequencing reads can be mapped incorrectly, which leads to wrong variant calls in these regions.
The published Brucella suis 1330 genome sequences also have 10 loci containing 8-mer tandem repeats (≥3 motif copies), which are highly variable among field isolate genomes. When the lengths of the tandem repeats at these loci differ dramatically from the reference, reads containing these elements can produce invalid alignments or be mapped to incorrect loci, again leading to incorrect variant calls.
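As an illustration of what such a locus looks like, the following minimal sketch scans a sequence for perfect tandem repeats of an 8-base motif with at least three copies. It is an assumption-laden example, not the procedure used to obtain the count above: the function name `find_tandem_repeats`, the requirement of perfect (mismatch-free) copies, and the exclusion of motifs that are themselves built from a shorter repeating unit are all choices made here for illustration.

```python
import re

MOTIF_LENGTH = 8   # motif size named in the text
MIN_COPIES = 3     # minimum number of motif copies named in the text

# One captured 8-mer followed by at least (MIN_COPIES - 1) exact copies of it.
_PATTERN = re.compile(r"([ACGT]{%d})\1{%d,}" % (MOTIF_LENGTH, MIN_COPIES - 1))


def find_tandem_repeats(sequence: str) -> list[tuple[int, str, int]]:
    """Return (0-based start, motif, copy number) for each perfect tandem
    repeat found. Hypothetical helper for illustration only."""
    hits = []
    for match in _PATTERN.finditer(sequence.upper()):
        motif = match.group(1)
        # Skip motifs with a shorter internal period (e.g. AAAAAAAA or
        # ACGTACGT); these are really repeats of a shorter unit.
        if motif in (motif + motif)[1:-1]:
            continue
        copies = len(match.group(0)) // MOTIF_LENGTH
        hits.append((match.start(), motif, copies))
    return hits


if __name__ == "__main__":
    toy = "TTTT" + "AGGCTTCA" * 4 + "GGG"  # made-up sequence, not Brucella
    print(find_tandem_repeats(toy))
```

Running the same scan on a reference and on an assembled isolate, then comparing copy numbers locus by locus, is one simple way to flag loci where repeat-length differences could be expected to disturb read mapping. Imperfect repeats would need an approximate-matching approach instead.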