Overview of SRMA
This method relies on short-read alignment algorithms to first align each read to a reference sequence [7]. After all reads are aligned, they are passed to SRMA for re-alignment. SRMA first builds a variant graph from these initial alignments. Once the variant graph is built, all reads are re-aligned to it. If a new alignment that differs from the original is found, it is reported and annotated as having been re-aligned by SRMA; otherwise the original alignment is reported. A novel aspect of this method is that the variant graph is built iteratively for each genomic region, while new alignments are reported for each read initially aligned within that region. Whereas de novo assembly (or re-assembly) algorithms report novel sequences without comparing the reads to a reference sequence, this method provides new, improved alignments relative to a reference sequence, improving downstream consensus calling. Iterative application of SRMA is possible, whereby further rounds of variant-graph construction and read re-alignment are performed, but it is not examined here.
Creating a variant graph from existing alignments
Here we seek to use individual sequence reads to create a set of candidate variants that includes the true variants present within the target genome being sequenced. Ultimately, the goal is to distinguish between true variants and sequencing errors genome-wide. Since, in the interest of novel mutation discovery, we must allow any base position to be variant, as well as an exponentially larger number of possible indels, we opt for an approach that builds a variant graph incorporating all aligned reads at a given position in the genome before performing re-alignment. This graph is a compact mathematical representation of the initially determined alignments. Each alignment is represented as a path through the graph, although not every path through the graph corresponds to an actual alignment.
The variant graph is composed of nodes, each representing a DNA base at a specific position relative to the forward strand of the reference genome. Two nodes share an undirected edge if they represent adjacent read bases in an existing alignment. For example, the variant graph of the reference sequence aligned perfectly to itself consists of one node per reference base, with edges connecting nodes that represent adjacent bases in the reference; in this case, the variant graph has exactly one path. To order the nodes in the graph relative to the reference, each node is also assigned a position and an offset; the offset is non-zero only if the node represents an insertion relative to the reference. An inserted base is given the reference position of the next non-inserted base with higher physical position on the forward strand, with its offset set to the number of bases from the beginning of the insertion. Insertions at the same reference position can be combined by merging the paths corresponding to their longest common prefix and longest common suffix, respectively. A single-nucleotide substitution is given the same position as the reference base it replaces. In summary, nodes fall into three distinct types: reference, substitution, and insertion. The combination of a node's position, base, type, and offset is unique among all nodes in the graph and defines a canonical ordering over them.
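The node description above can be captured in a small sketch; the class and field names are illustrative, not SRMA's actual implementation, but the field order mirrors the canonical ordering stated in the text (position, base, type, offset):

```python
from dataclasses import dataclass
from enum import IntEnum

class NodeType(IntEnum):
    """The three node types described in the text."""
    REFERENCE = 0
    SUBSTITUTION = 1
    INSERTION = 2

@dataclass(frozen=True, order=True)
class Node:
    """A DNA base at a position on the forward reference strand.

    Field order matches the canonical ordering given in the text:
    position, base, type, offset. The offset is non-zero only for
    insertions, which take the position of the next non-inserted base.
    """
    position: int
    base: str       # one of "ACGT"
    ntype: NodeType
    offset: int
```

With `order=True`, the generated comparison operators sort nodes field by field, giving the canonical ordering for free.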
Initially, the graph is empty. Bases that match the reference and variant bases are incorporated into the graph by adding new nodes and edges. Substitutions and insertions are represented as additional nodes in the graph. Deletions, on the other hand, are added as edges connecting nodes whose positional difference is greater than one. An example of creating a variant graph from four alignments is shown in Figure 6. The variant graph also stores the number of alignments that pass through each node and edge, corresponding to the coverage; this is useful for eliminating unlikely paths during re-alignment and is discussed later.
Figure 6 The creation of a variant graph. Four alignments (left) are successively used to create a variant graph (right). (a) An alignment of a read that matches the reference. The associated variant graph consists of nodes that represent each base of the read.
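The incremental construction described above can be sketched as follows. `VariantGraph` and its method are hypothetical simplifications: nodes are `(position, offset, base)` tuples, coverage is tracked per node and per edge, and a deletion appears only as an edge that skips reference positions:

```python
from collections import defaultdict

class VariantGraph:
    """Toy variant graph: coverage counts per node and edge.

    A deletion is not a node; it is an edge between two nodes whose
    positional difference is greater than one.
    """
    def __init__(self):
        self.node_cov = defaultdict(int)   # node -> number of alignments through it
        self.edge_cov = defaultdict(int)   # (node, node) -> number of alignments
        self.children = defaultdict(set)   # adjacency, in reference order

    def add_alignment(self, nodes):
        """nodes: the read's bases as (position, offset, base) tuples,
        in reference order."""
        for n in nodes:
            self.node_cov[n] += 1
        for a, b in zip(nodes, nodes[1:]):
            self.edge_cov[(a, b)] += 1
            self.children[a].add(b)
```

For example, adding a read that matches the reference at positions 1-3 and a second read whose alignment deletes position 2 yields a skip edge from position 1 to position 3 with its own coverage count.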
Alignment to a variant graph
Once the variant graph is constructed from all aligned reads, local re-alignment of the reads proceeds through a series of weighted steps to optimize the final alignments. The variant graph is not modified after re-alignment begins. A dynamic programming procedure is used to compare a read to the variant graph in a manner similar to the Smith-Waterman algorithm [24]. Each path through the graph represents a potential (new) alignment. All paths that begin within w base positions of the start of the existing alignment are considered as start nodes for a new alignment. A node in the graph is visited at most w times per re-alignment, even though every path reachable from a starting node is examined. Note that the direction of the paths through the graph matches the direction implied by the strand of the original alignment. Therefore, the graph is a directed acyclic graph (DAG) during each local re-alignment, with a partial ordering imposed on the nodes as explained earlier (position, base, type, and offset). All valid paths from the starting nodes can be efficiently examined using a breadth-first traversal backed by a heap data structure.

The heap stores nodes sorted by their partial order, the current path length, and the current alignment score, in that order; the path length and alignment score are stored alongside each node. Initially, the start nodes are added to the heap with a path length of one and an alignment score based on comparing the read's first base to the base represented by the start node. If the read base matches the start node's base, no penalty is added to the previous re-alignment score; otherwise, a negative score based on the original base quality of the read is added to produce the current re-alignment score. Other alignment scoring schemes are possible, but mismatched bases are scored using base quality since this has been shown to improve alignment quality [28].
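A minimal sketch of the quality-based scoring step, assuming a simple scheme in which a match adds no penalty and a mismatch subtracts the read base's Phred quality (SRMA's exact constants may differ):

```python
def score_step(prev_score, read_base, node_base, base_quality):
    """Extend a path score by one base: no penalty on a match; otherwise
    subtract the read base's Phred quality (assumed penalty scheme)."""
    if read_base == node_base:
        return prev_score
    return prev_score - base_quality
```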
The heap is polled while it is non-empty. Paths to a given node that have the same path length but a smaller alignment score can be discarded (from the top of the heap), pruning suboptimal alignment paths. Paths to the same node with different lengths arise from differing start nodes, deletions, and insertions. This pruning step uses a dynamic programming procedure, where the best paths to and from the current node are assumed to be conditionally independent given their respective path lengths (the number of read bases examined). Next, if the path length equals the length of the read, all of the bases in the read have been examined; the best (highest-scoring) complete path found so far, if any, is compared to the current path and updated accordingly. Otherwise, the path is extended to each child (successor) of the given node. For each child node, the child's base is compared to the corresponding base in the read (determined by the path length), with the alignment score modified as above; the child node, incremented path length, and updated alignment score are then added to the heap. Once the heap is empty, the path with the best score is returned as the new alignment. This new alignment may match or differ from the original alignment depending on the graph structure.
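Putting the pieces together, the heap-driven search over the DAG might look like the following sketch. All names (`realign`, `children`, `base_of`, `order_key`) are illustrative, nodes are plain tuples, and the scoring follows the simple quality-penalty scheme described in the text:

```python
import heapq

def realign(read, quals, start_nodes, children, base_of, order_key):
    """Heap-driven breadth-first search for the best path (sketch).

    read / quals: the read's bases and Phred qualities.
    start_nodes:  candidate first nodes (within w of the original start).
    children(n):  successors of node n in the DAG.
    base_of(n):   the base represented by node n.
    order_key(n): the canonical (position, base, type, offset) ordering.
    Returns (best_score, best_path), or (None, None) if no complete path.
    """
    def step(score, base, qual, node_base):
        # Match: no penalty. Mismatch: subtract the base's Phred quality.
        return score if base == node_base else score - qual

    heap, best_at, best = [], {}, (None, None)
    for n in start_nodes:
        s = step(0, read[0], quals[0], base_of(n))
        heapq.heappush(heap, (order_key(n), 1, -s, n, (n,)))
    while heap:
        _, length, neg_s, node, path = heapq.heappop(heap)
        s = -neg_s
        # Prune: an equal-length path already reached this node with a
        # score at least as good.
        if best_at.get((node, length), float("-inf")) >= s:
            continue
        best_at[(node, length)] = s
        if length == len(read):            # all read bases examined
            if best[0] is None or s > best[0]:
                best = (s, path)
            continue
        for c in children(node):           # extend to each successor
            s2 = step(s, read[length], quals[length], base_of(c))
            heapq.heappush(heap, (order_key(c), length + 1, -s2, c, path + (c,)))
    return best
```

In a toy graph where the reference offers `C` at position 2 but a substitution node offers `G`, a read ending in `G` is routed through the substitution node, recovering the variant-supporting alignment.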
As observed during graph creation, the original alignment is represented as a path through the graph and will therefore be reconsidered during re-alignment. In fact, the original alignment can be used to set a bound on the minimum re-alignment score. Since the alignment score implemented above decreases monotonically, any path with a lower alignment score than the original alignment's can be removed from the heap. If the original alignment is likely to be the best alignment after re-alignment, this bound significantly reduces the practical running time of local re-alignment.
The entire variant graph does not need to be constructed before beginning re-alignment, but rather only nodes in the graph that are reachable from the starting nodes need be considered. Therefore, only original alignments that pass through any of these reachable nodes need to be included when creating the variant graph for a specific alignment. Thus, the variant graph can be dynamically built from previous read alignments, with nodes removed from the graph when no longer reachable from the next read re-alignment. This allows only a small local window of the variant graph to be explicitly built and kept in memory, significantly reducing memory requirements.
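The windowed memory management can be illustrated with a small sketch; `prune_window` and its arguments are hypothetical, retaining only nodes still reachable from the next read's re-alignment window:

```python
def prune_window(node_cov, next_read_start, w):
    """Drop graph nodes that can no longer be reached.

    node_cov: {(position, offset, base): coverage} for the in-memory window.
    next_read_start: reference start of the next read to re-align.
    w: the re-alignment window size (same w as in the text).
    """
    cutoff = next_read_start - w
    return {n: cov for n, cov in node_cov.items() if n[0] >= cutoff}
```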
Accounting for sampling and coverage
Two input parameters prune potential alignment paths through the graph: a minimum node/edge coverage and a minimum edge probability. Given a minimum node/edge coverage c, only nodes observed in at least c original alignments are considered. The minimum edge probability p considers all edges through non-insertion nodes (that is, zero offset) at a given genomic position. The total number of observations N across all nodes with the same position (and zero offset), together with the minimum edge probability p, is used to bound paths through edges incoming to nodes at that position. Suppose an incoming edge to a node is observed n times; then the edge is pruned if Pr(x ≤ n | N) < p. This probability is modeled using the binomial cumulative distribution function under the assumption that two alleles (nodes) are possible at a given position:

Pr(x ≤ n | N) = Σ_{i=0}^{n} C(N, i) (1/2)^N
While this is a valid assumption if the genome has two copies of each chromosome (diploid), deviations from this do not greatly change the pruning strategy, as the two input parameters are combined with a logical OR: a path through a node/edge is included if it passes one or both of the filters. Within high-coverage regions, the coverage filter removes variants that occur by random chance due to sequencing error, while remaining sensitive to alleles that can be obscured by alignment ambiguities. Within low-coverage regions, however, the coverage filter would overly penalize variants that were not observed simply due to insufficient sampling. The probability filter is therefore designed to include variants in low-coverage regions while strictly penalizing variants that do not occur frequently in high-coverage regions. Together, these parameters are important both for removing spurious variants and for retaining potentially real variants in low-coverage regions.
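The two pruning filters and their OR combination can be sketched directly from the definitions above (function names are illustrative; the binomial uses success probability 1/2 per the two-allele assumption):

```python
from math import comb

def binom_cdf(n, N):
    """Pr(X <= n) for X ~ Binomial(N, 1/2): two equally likely alleles."""
    return sum(comb(N, i) for i in range(n + 1)) / 2 ** N

def keep_path(node_cov, edge_obs, total_at_pos, c, p):
    """Logical OR of the two filters: keep a path if the node was seen in
    at least c alignments, OR the edge is not improbably rare given the
    total coverage N at that position (i.e. Pr(x <= n | N) >= p)."""
    return node_cov >= c or binom_cdf(edge_obs, total_at_pos) >= p
```

Note how the OR behaves across coverage regimes: at low coverage a once-seen allele survives via the probability filter, while at high coverage a rare edge must clear the coverage threshold instead.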
Leveraging ABI SOLiD two-base encoded data
Special considerations need to be made to incorporate sequencing reads produced by the ABI SOLiD platform, which are generated in a two-base encoded form. When re-aligning such data to the variant graph, a modified version of the two-base encoded dynamic programming algorithm is used [12]. In this case, the decoded DNA sequence must exactly match the bases represented by a valid path in the variant graph. The re-alignment score is calculated from the original color sequence and quality scores by comparing each original color with the expected color. The expected color is determined by encoding the two bases connected by the previously examined edge in the path, or by encoding the known start adapter together with the first base in the path (for start nodes). By constraining the decoded bases to match the bases represented by nodes in the graph, the computational complexity of the original dynamic programming procedure is reduced to that of base (nucleotide) space sequence comparison.
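For illustration, the standard SOLiD two-base encoding maps each dinucleotide to one of four colors, the XOR of the bases' 2-bit codes; `expected_colors` below mirrors the expected-color computation described above (names are illustrative):

```python
# Standard SOLiD 2-bit base codes: the color of a dinucleotide is the
# XOR of the two codes (identical bases always encode to color 0).
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def color(b1, b2):
    """Color of the transition from base b1 to base b2."""
    return _CODE[b1] ^ _CODE[b2]

def expected_colors(adapter_base, path_bases):
    """Expected colors along a path through the graph, starting from the
    known adapter base (used for start nodes, as described in the text)."""
    seq = adapter_base + path_bases
    return [color(a, b) for a, b in zip(seq, seq[1:])]
```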
Simulated and empirical data
Two simulated datasets were created to evaluate SRMA: one simulating data from an Illumina sequencer and one simulating data from an ABI SOLiD sequencer. A uniform 1% per-base error rate was used for the Illumina dataset, while a uniform 5% per-color error rate was used for the ABI SOLiD dataset. In practice, sequencing error is not uniform along a read, tending to be low at the 5' end (the beginning) of the read and higher towards its end, but this is not modeled here. The distance between the two ends of each paired-end read was drawn from a normal distribution with mean 500 bases and standard deviation 50 bases. Polymorphisms were added to the simulated genome at a rate of 1/1,000, with a 1/3 probability of being a homozygous variant. Insertions and deletions each accounted for 5% of all polymorphisms. The probability of an insertion or deletion extending beyond one base was 0.3 per extended base, to approximate observed indel length distributions in the human genome. The whole-genome simulation program and subsequent accuracy evaluation can be found in the DNA Analysis (DNAA) package [29].
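The stated indel-extension and insert-size models can be sketched as follows (a toy illustration of the sampling distributions, not the DNAA simulator itself):

```python
import random

def indel_length(p_extend=0.3):
    """Indel length under the stated model: each additional base extends
    the indel with probability 0.3 (geometric extension)."""
    length = 1
    while random.random() < p_extend:
        length += 1
    return length

def insert_size():
    """Paired-end distance drawn from Normal(mean=500, sd=50)."""
    return random.gauss(500, 50)
```

Under this model the expected indel length is 1/(1 − 0.3) ≈ 1.43 bases.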
To empirically test the feasibility and utility of SRMA in a whole-genome context, a previously published human whole-genome brain cancer dataset was obtained from the Sequence Read Archive (SRA009912.1) [13]. The original alignments, which were performed by BFAST [7], were obtained, and the original color sequence and qualities were retained to allow for color space local re-alignment [12]. These alignments were locally re-aligned with SRMA using variant-inclusive settings (c = 2 and a correspondingly low minimum edge probability p). The alignments from BFAST and SRMA were then variant-called using the SOAP consensus model implemented in SAMtools (v.0.1.17) with default settings [10].
SAMtools reports three metrics for each variant position: SNP quality, consensus quality, and base coverage. The SNP quality is the Phred-scaled probability that the consensus is identical to the reference, while the consensus quality is the Phred-scaled likelihood that the called genotype is wrong. Typically, a minimum SNP quality filter can be used to reduce false positives while somewhat reducing sensitivity.
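As a reminder of the scale, a Phred quality is Q = -10 * log10(P), so a SNP quality of 30 corresponds to a 1-in-1,000 chance that the consensus is identical to the reference; a sketch:

```python
from math import log10

def phred(p_error):
    """Convert an error probability to the Phred scale: Q = -10 * log10(P)."""
    return -10 * log10(p_error)
```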