|Home | About | Journals | Submit | Contact Us | Français|
Summary: LAST-TRAIN improves sequence alignment accuracy by inferring substitution and gap scores that fit the frequencies of substitutions, insertions, and deletions in a given dataset. We have applied it to mapping DNA reads from IonTorrent and PacBio RS, and we show that it reduces reference bias for Oxford Nanopore reads.
Availability and Implementation: the source code is freely available at http://last.cbrc.jp/
Supplementary information: Supplementary data are available at Bioinformatics online.
The classic approach to pair-wise sequence alignment is to seek alignments that maximize a score, which is a sum of substitution and gap scores. This is equivalent to seeking alignments with maximum likelihood, using a statistical model with probabilities for each kind of substitution (e.g. ct), insertion, and deletion. This approach was developed several decades ago, mainly for proteins, but also for nucleotide sequences (Chiaromonte et al., 2002; States et al., 1991). It is arguably least suited to homology search, because different homologs of one protein have different levels of divergence, so that one set of parameters cannot be optimal for all homologs.
Here, we are interested in aligning nucleotide sequences that differ mainly by sequencing error. Compared to homology search, it is more likely that a single set of substitution and gap probabilities will be a universal good fit, for one version of one sequencing technology, applied to one type of DNA. On the other hand, these probabilities may be quite different for different technologies, and even for different versions of the same technology. Moreover, these probabilities will differ for unusual types of DNA, such as 80%-AT Plasmodium genomes or PAR-CLIP data (Kerpedjiev et al., 2014). Thus, it would be useful to have a tool that automatically determines suitable parameters for a given dataset.
Although the score/model-based approach to alignment is well-known and classic, it has been surprisingly neglected in recent high-throughput DNA aligners (Kerpedjiev et al., 2014). It is likely that accuracy is maximized by using scores that fit the substitution and gap frequencies in the data.
In this study, we introduce a novel tool, LAST-TRAIN, to train alignment parameters from sequence data. We use it to train parameters for PacBio RS, IonTorrent and Nanopore. Finally, we show that it mitigates reference bias (haplotypes appearing in the reference genome tend to be over-estimated) for Oxford Nanopore reads.
LAST-TRAIN’s input is query (e.g. DNA reads) and reference (e.g. a genome) sequence datasets. It uses a standard iterative approach: it first aligns the sequences using some initial score parameters, then infers better score parameters from the alignments, then re-aligns and repeats until the parameters stop changing (Durbin et al., 1998). It achieves adequate speed by an X-drop heuristic (Altschul et al., 1997; Zhang et al., 1998), it depletes paralogs using LAST-SPLIT (Frith and Kawaguchi, 2015), and it allows different insertion and deletion parameters and non-strand-symmetric substitution parameters. Details are in the Supplementary Material S1.
Long-read DNA sequencing is a promising way to determine phasing between DNA variants. Most human cells have two copies, maternal and paternal, of each chromosome (except Y). Suppose that a patient has two variants in different exons of one gene, where each variant destroys the gene’s function but is present in just one chromosome (maternal or paternal). It is important to know whether they are in the same chromosome.
A previous study attempted to determine the haplotypes of CYP2D6, a gene that affects metabolism of clinical drugs, in human sample NA12878, by PCR amplification of the relevant genomic region followed by Oxford Nanopore sequencing (Ammar et al., 2015). This sample is known to have the two haplotypes CYP2D6*3 and CYP2D6*4, shown in Supplementary Table S14 (Numanagi et al., 2015; Twist et al., 2016), however the original study found a prominent third haplotype. A later re-analysis (Laver et al., 2016) suggested two reasons for this. First, chimeric cross-overs between the two variants appeared during PCR amplification, producing two false haplotypes. Second, because one of the false haplotypes matches the reference genome, it was prominently detected after aligning the DNA reads to the reference. The latter phenomenon is termed reference alignment bias.
We reasoned that, if our trained parameters produce more accurate alignments, they should reduce reference bias. Following Ammar et al. (2015), we used high-quality (2D) reads (SRA1748415): 7540 reads with average length 3486. First, we trained alignment parameters using these reads and human reference genome hg19, leading to the following parameters; the substitution score matrix is
and the gap costs are 12 + 3k for a length-k deletion and 15 + 3k for a length-k insertion.
Second, the haplotypes for two target polymorphism sites (Supplementary Materials S14) were predicted in the following (direct and simple) manner. (i) The best alignment was taken for each read after mapping the read to the reference genome (hg19). (Note that multiple maps are expected because there is a paralog of CYP2D6, whose similarity is about 94%.) (ii) Among the obtained alignments, alignments covering both of the target polymorphism sites in CYP2D6 were taken and then the count and frequency of each haplotype were computed from those alignments.
The results are shown in Table 1, indicating that reference bias is lessened by aligning with trained parameters, compared to GraphMap (v0.3.0) (Sovic et al., 2016), BLASR (Chaisson and Tesler, 2012) and LAST with manually-determined parameters. Specifically, the frequency of the chimeric reference haplotype is reduced, whereas the frequency of the chimeric non-reference haplotype is increased.
In addition, we performed probabilistic alignment (Hamada et al., 2011) with trained parameters, because probabilistic alignment tends to estimate columns in the alignment more accurately than conventional alignment. Specifically, we used LAMA alignment (lastal option ‘-j6’) with γ = 2 (Hamada et al., 2011). In this case, we choose the best alignment using ‘forward’ scores (from summing the probabilities of all alignments in the X-drop algorithm) instead of conventional scores (from the single best alignment). Table 1 suggests that trained parameters with LAMA alignment improve the results further (lower frequency of TT:CC).
We applied LAST-TRAIN to PacBio, IonTorrent, and Oxford Nanopore DNA reads using several available datasets (Supplementary Materials S2). It successfully recovers known features, such as PacBio having more insertions than deletions (Supplementary Materials S4), and we established that a query sample size of 1–10 million bases is sufficient (Supplementary Materials S3). Moreover, evaluation on simulated datasets indicates that trained parameters slightly improved alignment accuracy (Supplementary Materials S5). Notice that all the trained parameters and their statistics are shown in supplementary information (Supplementary Materials S9). Finally, the Supplement has discussion of: last-train versus MarginAlign (Jain et al., 2015), resisting the temptation of over-alignment, and use of sequence quality data (Supplementary Materials S8).
We thank Toshiyuki Sato (Mizuho Information and Research Institute, Inc) for supporting the implementation of LAST-TRAIN. Computations were partially performed on the NIG supercomputer at ROIS National Institute of Genetics.
This work was supported by MEXT KAKENHI [grant numbers JP24680031, JP16H05879, JP16H01318 to M.H.; JP26700030 to M.C.F.; JP25240044 to M.H. and K.A.; JP221S0002 to K.A.].
Conflict of Interest: none declared.