Sequencing error along the genome
Under the quality filter we used (details in Materials and methods), the average genome-wide error rate (minor allele frequency) is 0.0009 for the PhiX174 genome, and 0.00167 for the mtDNA genome. This difference could be caused by heteroplasmy and/or alignment problems with mtDNA. Generally, the sequencing error rate fluctuated along the genome with some striking peaks (drops when converted to Phred quality score; Figure S1 in Additional file 1
), with two peaks in the PhiX174 genome corresponding to true 'polymorphic' positions (mixture of two different alleles) in this PhiX174 strain (positions 1401,1644). Outliers in the mtDNA genome are positions 309 to 311, 514, and 3,106 to 3,107, which are either caused by alignment problems or true length heteroplasmies.
Normally, positions with the highest sequencing error rate cause the most problems in distinguishing LLMs; hence, we retrieved the 30 positions with the highest error rate on the PhiX174 genome to visualize the distribution of error rates along reads, as well as 30 positions with the lowest error rate for comparison (Figure ). First, an obvious error rate difference was observed between the two strands: positions with high error rates were mostly dominated by reads mapped to one specific strand whereas reads mapped to the other strand showed a normal error rate. Additionally, the error rate also varied among different parts of the reads; although error rate tended to increase when closer to the end, the trend is much weaker on the reads from the low-error strand (Figure ).
Figure 2 Sequencing error rate for different parts of the read. Along the y-axis, the first 30 positions have the highest error rate on the PhiX174 genome, and the last 30 positions have the lowest error rate. The x-axis indicates the strand (F, forward; R, reverse) (more ...)
We used WebLogo [21
] to identify possible conserved motifs preceding the sequencing error hot spots (Figure S2A in Additional file 1
). Although 'GGT' was found to be the most abundant motif preceding sequencing error hot spots on both strands, there is substantial variation in sequencing error rates at positions following this motif in the PhiX174 genome (Figure S2B in Additional file 1
). Therefore, this motif alone is insufficient to trigger a higher error rate. This is also true for the 'AAA' motif associated with sequencing error cold spots (Figure S2C, D in Additional file 1
) and all the other 3-bp motifs (Figure S3 in Additional file 1
), as the error rate following the same motif shows ten-fold variation. Thus, only a small fraction of the variation in sequencing error variation can be explained by 3-bp motifs consisting of the 2-bp context and the nucleotide itself.
In our previous study, the same error hot spots were repeatedly observed in different individuals [3
], suggesting that the sequencing error rate could be predicted using population data. To test this hypotheses, PhiX174 data were divided into two subsets according to different sequencing runs, while the mtDNA data were also divided into two subsets according to different read lengths (36 bp versus 76 bp). Since the estimated sequencing error rate for a position varied among different sequencing runs and lanes (Figure S4 in Additional file 1
), the ranked error rate in each subset was used rather than the absolute rate. For each position we compared the error rate of the read that mapped to the same strand in different subsets (Figure ); for both PhiX174 and mtDNA, significant positive correlations between the two subsets were observed (P
< 0.0001; Figure ). This is particularly true for the positions having the highest error rate, as half of the positions in the rightmost column of Figure (which includes positions having the highest error rate in the first subset) were also included in the topmost row (which includes positions having the highest error rate in the second subset). In contrast, no specific correlations were observed between the error rate of reads mapped to different strands in the same subsets of the PhiX174 data (Figure ). For mtDNA data, a weak correlation was observed, likely to be caused by true heteroplasmic positions. When comparing our PhiX174 data with the PhiX174 data provided by Dr Ole Skovgaard, which were generated by different machines and base-calling programs, a positive correlation was observed (ρ = 0.40, P
< 0.0001), which suggests that the sequencing error profile observed here is platform-specific (that is, Illumina-specific) or genome-specific rather than machine/chemical/base-caller-specific. Overall, these results support the interpretation that sequencing error hotspots are true features of the data, and not artifacts of particular lanes, runs, or machines.
Figure 3 Sequencing error rate correlation between different subsets of the data. Sequencing errors are ranked from lowest to highest along each axis. The color in each bin represents the number of positions having the corresponding error rates in the two subsets. (more ...)
Using the features of sequencing error described above, we developed a new method that makes use of population re-sequencing data to distinguish real mutations from sequencing errors. Since there is uncertainty regarding the error distribution among different individuals, we used three different distributions (Poisson distribution, Fisher exact test, and the empirical distribution) to calculate the bias statistic and evaluate the performance of these three methods (referred to as 'Poisson method', 'Fisher exact method', 'Empirical method'; see details in Additional file 2
First, we used simulations to explore the distribution of the bias statistic under different sequencing depths and mutation levels. Figure shows the quality score distribution of the minor allele (both real mutations and sequencing errors) from the Poisson method. Here, quality score was positively correlated with sequencing depth and mutation level for the real mutation, but not for the sequencing error, thereby indicating that more reads and/or higher frequency of an LLM allow more accurate distinction of an LLM from sequencing error. Meanwhile, the quality score became lower for both real LLMs and sequencing errors by reducing the bin size from 10 bp to 5 bp (Figure S5 in Additional file 1
), especially when the sequencing depth is low. Smaller bins exhibit larger variation (because of limited reads in each bin) and hence weaken the sensitivity of the method, and therefore are not recommended.
Figure 4 Quality score distribution in simulations. Quality scores are based on the Poisson method (Figures S6 and S7 in Additional file 1 for the Fisher exact and Empirical methods). Red dots represent the real LLMs, black dots represent the sequencing errors, (more ...)
The Fisher exact method gave a similar quality score distribution (Figure S6 in Additional file 1
) to that of the Poisson method, whereas the Empirical method showed a different distribution (Figure S7 in Additional file 1
). This is because the Empirical method measures the rank of the minor allele frequency among all reference samples rather than the absolute difference between the observed minor allele frequency and that expected for an error allele. The expected quality score for an error allele was 0 for the Poisson and Fisher exact methods, but was higher and with a larger range (1 to 7) for the Empirical method. The quality score upper bound is 60 for the Poisson and Fisher exact methods, but depends on the reference sample size for the Empirical method, because the P
-value is estimated by the rank of the observed value among all individuals (see details in Additional file 2
Based on these results, a minimum quality score of 10 for all reads was used to distinguish LLMs from sequencing errors in our study. Applying these criteria to the simulation data results in an extremely low false discovery rate (< 1%), and an acceptable false negative rate (when sequencing depth is 500×, 50% of the rare mutations with minor allele frequency of 5% could be identified). Further details are shown in Tables S1, S2, and S3 in Additional file 3
Artificially mixed samples
To evaluate the performance of the new method, we applied it to three artificially mixed samples, comprising a total of 78 LLMs distributed along the mtDNA genome with minor allele frequencies ranging between 3.7% to 50%. The overall coverage ranged from 138× to 4,840× (median = 1,887×), and all LLMs were successfully identified by all three methods with no false positives (Figure ). Moreover, the minor allele quality score distributions of the three methods were similar, with the same position (position 13,708 in the 1:1 mixture, minor allele frequency of 0.42, coverage of 757×) having the lowest single strand quality score: (59,10) - with the first number the quality score for the forward strand, and the second the quality score for the reverse strand - for the Poisson method; (60,9.3) for the Fisher exact method; and (18,4) for the Empirical method. This position consistently has the lowest quality score because the minor allele is under-represented on the reverse strand: the minor allele was observed in only one bin (out of the six bins having reads) on the reverse strand, whereas it was observed in all eight bins on the forward strand.
Quality scores of the minor allele in the artificially mixed dataset. (a) Quality score distribution with the Poisson method. (b) Quality score distribution with the Fisher exact method. (c) Quality score distribution with the Empirical method.
Comparison with other methods
We compared our method with other available methods for detecting LLMs, using the artificially mixed libraries to evaluate the performance of the other methods; the results are described below and summarized in Table .
Comparison of the present method with other methods
CRISP, which also makes use of the population re-sequencing data, successfully identified all of the mixed positions, but also reported five false positives (when N = 20; four false positives when N = 50), which were caused by an elevated error rate in one sample (significantly different from that in other samples). Moreover, examination of the pileup file indicates that most of the error reads were located in one read bin and have the same starting coordinate on the reference genome, indicating that they probably are caused by duplicate reads.
Following the instructions for SPLINTER, we created a 2-bp context-dependent error matrix from the PhiX174 control data in the same run. When N was set to 20 and only the first 12 bp of the reads were used, 12 (18% of all the true positives) LLMs were missed, and 2 (3% of all the detected LLMs) false positives observed. A cutoff value of -2.64 removed all of the false positives but also added three extra false negatives. Increasing N to 50 did not recover the missed LLMs; examination of the pileup file revealed that these false negatives were caused by underrepresented minor alleles at the beginning of reads. We then extended the length to the first 30 bp, but 6 mutations were still missed (even with a cutoff of 0). Extending the length to 70 bp recovered all of the expected LLMs (cutoff of -6.6), but at the expense of 123 false positives.
When using VarScan, an extremely low P-value of 1 × 10-10 identified all of the expected LLMs, but also identified 190 (71% of all the detected LLMs) false positives. When we manually refined the result by requiring double strand validation (at least three reads on each strand to call the LLM), the number of false positives was reduced to 31 (28% of all the detected LLMs). Increasing the number of reads on each strand to six left only ten (11% of all the detected LLMs) false positives.
For MAQ, a reasonable cutoff for the quality score of 60 resulted in three false negatives but thousands of false positives. Applying a stringent cutoff for the quality score of 200 removed all of the false positives but left six false negatives (7.7% of all the true positives).
In summary, none of the above methods could identify all of the LLMs in the artificial mixtures without giving false positives. To be sure, additional customized filtering or preprocessing may improve their performance. Furthermore, these methods were intended to detect non-reference alleles, rather than actually verifying the minor allele; when the reference allele is the minor allele, the P-value does not reflect the certainty of the minor allele, which could be problematic if the reference allele contributes to the trait investigated.
The use of double indexes (that is, indexes located at both ends of the adaptors [22
]) allows chimeric reads (with mismatched indexes) to be detected. Such chimeric reads can reflect jumping PCR, index contamination, or cluster misidentification. By investigating data from four double indexed libraries, we found chimeric reads occurred at a rate of 10 to 15% when 40 to 90 samples were multiplexed and enriched for mtDNA (Figure ). Chimeric reads could potentially result in a large number of false positive LLMs when samples with different sequences are multiplexed in one sequencing library. Here, by assuming that 15% of the reads were derived evenly from other samples in the same library, for the four libraries prepared with single indexes we found that 65 to 79% of the minor allele could be explained by chimeric reads (P
, Chi-square test), and most of them have a minor allele frequency lower than 5%. However, not all of these LLMs are necessarily caused by chimeric reads, as heteroplasmies are also prone to happen at polymorphic sites [3
]. Nonetheless, these mutations were excluded from further analysis.
Figure 6 Chimeric read fraction estimated from double indexed libraries. P5 is the index inserted at the 5' end and P7 is the index inserted at the 3' end of the template DNA. Diamonds represent the fraction of the unmatched P5-P7 reads among all reads; the vertical (more ...)
Cross-contamination between samples is a potential problem when many samples are multiplexed in one sequencing library, or when many sequencing libraries are prepared at the same time. Although such low-level mixtures may not influence calling the consensus sequence, they may generate artificial LLMs. Our method can help to detect such cross-contamination, since the minor alleles would be identical to the major alleles in another sample.
Therefore, for all samples having more than five verified minor alleles, we examined the entire mtDNA genome dataset to see if any single sample could explain a significant fraction (≥3) of the minor alleles. Strong mixture signals were observed in two samples (Az46, Ir28) by all three methods, where all verified minor alleles (> 17) were identical to the consensus nucleotides of another sample in the same sequencing library, and represent more than 60% of the differences between the two samples (Figure ). By averaging the minor allele frequencies at all expected variable positions, the proportions of the minor component were estimated to be 4.2 ± 0.8% for Az46 and 4.5 ± 2% for Ir28 (mean ± standard deviation).
Figure 7 Mixture signal in the mtDNA dataset. The title of each plot shows the major component, minor component, average coverage, and mixture proportion. Expected minor alleles and observed minor alleles are shown in different colors. A, observed minor alleles (more ...)
Lower levels of contamination were detected in another four samples (Ir37, G29, Ir36, Ir40), where the mixture proportions were between 2.3% and 2.9% (Figure ); this low mixture level makes it more difficult to recover all of the resulting LLMs. Only 37 to 53% of the expected minor alleles were identified by the Empirical method, while 23 to 43% of the expected minor alleles were identified by the Poisson and Fisher exact methods.
From the putative mixtures involving Az46 and Ir48, we could infer a false negative rate of 0.16 (11 out of 68) by the Empirical method when the minor allele frequency is around 4%. Meanwhile, from the mixtures involving Ir37, G29, Ir36, and Ir40, the false negative rate was 0.52 (79 out of 151) when the minor allele frequency is around 2.5%. The overall false discovery rate is 0.8% (1 out of 130) inferred from the 6 mixed samples, and the method could successfully detect heteroplasmy down to a level of 2.3% (Figure ). The Poisson and Fisher exact methods both showed less sensitivity compared with the Empirical method (Figure ). Considering the average sequencing depth of 495×, our methods showed a better performance on real data than in the simulation (Tables S1, S2, and S3 in Additional file 3
). This is most likely because the empirical error rate is lower than that used in the simulations (0.0017 versus 0.01), thereby making it easier to distinguish real LLMs from sequencing errors.
In addition to these six mixtures that could be perfectly explained by contamination from one specific sample, eight samples (Arm20, Az17, Ir26, Ir29, Ir33, Ir55, Az4, Ir10) had more than five verified mutations but could only be partially explained by contamination from another sample. Therefore, they could be caused by contamination from multiple samples or by contamination from one sample as well as having true LLMs. Because of this uncertainty, we removed these samples from further analysis.
Application to mtDNA LLM detection
By analyzing the mtDNA genome sequencing reads from 117 individuals with an average coverage of 638×, 99 LLMs were identified by the Empirical method, 63 by the Poisson method, and 60 by the Fisher exact method, with minor allele frequencies ranging from 0.5 to 49.5%. To avoid potential low-level cross-contamination from similar mtDNA genomes, we limited the analysis to mutations with a minor allele frequency of at least 5%, which then leads to the same 33 LLMs detected by all 3 methods among 30 samples (Table S4 in Additional file 4
Of the 30 LLMs detected in the same individuals in our previous study, 19 were also identified by our new method (Table S4 in Additional file 4
). Another seven positions had quality scores satisfying our requirement of calling LLMs but were suspected to be caused by chimeric reads (Figure ). One position was detected in both studies but excluded here because of its lower minor allele frequency (4.6%); the frequency difference could be due to the more stringent quality control applied here or different mapping strategies (Mia versus BWA (Burrow-Wheeler Aligner)). An additional three LLMs were excluded due to low quality scores; two of these were located at position 3,492, for which all but one of the minor allele reads were mapped to the reverse strand.
Figure 8 LLMs in mtDNA detected by the new method. Circles represent the heteroplasmies detected in our previous study ; squares represent the LLMs only found by our new method; circles in red color represent the LLMs that could also be explained by chimeric (more ...)
All of the LLMs validated by SnaPshot assays in our previous study were recovered by our new method (Figure ), including one position (16,223 in sample G73) that could reflect chimeric reads. If so, the mixture must have occurred during the DNA purification from the samples, as the original sample DNA was used for the SnaPshot assay. Alternatively, this may be a true LLM that happened to occur at a known polymorphic position, which in turn suggests that the LLM number reported here (33 LLMs in 117 samples) is a lower bound, as some of the LLMs excluded because they may have been caused by chimeric reads could be true LLMs.
All of the 33 LLMs involve single-base substitutions. The ratio of transitions to transversions is 10, which is not significantly different from the ratio of 8.09 for polymorphic positions in the same individuals (P = 0.370, Fisher exact test). The ratio of non-synonymous to synonymous LLMs in the coding region is also not significantly different from that reported in our previous study (0.5 versus 1, P = 0.492, Fisher exact test).