The first step in next-generation sequencing (NGS) of genomic DNA is the massively parallel sequencing of millions to billions of short DNA fragments on a single platform, typically generating short sequences (termed ‘reads’) from each end of the DNA fragment. For quality control purposes, the NGS platforms also generate quality values for every sequenced base, in analogy to the capillary sequencing quality values that are named after the phred software (1
). The second step, which involves a high-performance workstation or a compute cluster, is to determine the most probable genomic origin of each fragment by aligning the reads to a reference, typically to the sequences of a whole genome. Automatic, fast and error-tolerant alignment methods such as the Burrows-Wheeler Aligner (BWA) exist (2
), enabling the huge numbers of reads to be aligned within a reasonable time span. The third step, also carried out on a workstation or a compute cluster, is the identification (‘calling’) of variants from the resulting alignments. This variant-calling is not straightforward because of experimental and platform differences, alignment ambiguities and biological particulars such as ploidy changes in tumors and in double minutes [tiny ‘extra’ chromosomes that may contain segmental copies of chromosomes and are replicated during cell division, see (3–5
)]. Typically, single-nucleotide variant (SNV)-calling algorithms, such as in SOLiD Bioscope, the SAMtools software (6
), the Genome Analysis Toolkit (GATK) (7
) and VarScan (8
), generate SNV-lists using filtering or probabilistic methods to exclude artifacts. These software tools generally contain pre-set filters to detect variations.
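The filtering logic that such callers apply can be illustrated with a toy example. The sketch below uses hypothetical depth and allele-fraction thresholds, not the actual defaults of any of the cited tools:

```python
from collections import Counter

def call_snv(ref_base, pileup_bases, min_depth=8, min_alt_frac=0.2):
    """Naive filter-based SNV call from one column of aligned bases.

    A cartoon of filtering-style callers: require minimum coverage and
    a minimum alternate-allele fraction before reporting a variant.
    Real callers (SAMtools, GATK, VarScan) use far richer models
    (base/mapping qualities, strand bias, priors).
    """
    depth = len(pileup_bases)
    if depth < min_depth:
        return None  # low coverage: no call possible
    counts = Counter(pileup_bases)
    alt, alt_n = max(
        ((b, n) for b, n in counts.items() if b != ref_base),
        key=lambda x: x[1], default=(None, 0),
    )
    if alt is None or alt_n / depth < min_alt_frac:
        return (ref_base, ref_base)  # homozygous reference
    if alt_n / depth > 1 - min_alt_frac:
        return (alt, alt)            # homozygous alternate
    return tuple(sorted((ref_base, alt)))  # heterozygous

# 12 reads: 7 reference 'A', 5 alternate 'G' -> heterozygous A/G
print(call_snv('A', list('AAAAAAAGGGGG')))  # ('A', 'G')
```

Note that the `None` return for low coverage is exactly the information that, as discussed below, many SNV-lists silently omit.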
Quality control (QC) of NGS SNV data is vital and, by definition, needs to be performed independently of the data production. For example, in clinical diagnostics, SNVs must usually be validated by visual inspection or by several independent SNV-callers. Human geneticists are normally obliged to store and present the raw sequence data for the mutation of interest. To this end, chromatograms are attached to clinical reports for Sanger-based tests. For NGS, pibase yields accurate read statistics for a genomic SNV of interest. Of note, the SNVs released by the 1000 Genomes Project (9
) were a consensus from at least two different groups, two different NGS platforms and two different bioinformatic pipelines, significantly reducing the risk of human errors, platform errors and software errors, respectively. Data exchange errors within the 1000 Genomes Project were mitigated by developing shared conventions, including the current standard alignment file format, Binary Sequence Alignment/Map (BAM) (6
) and the Variant Call Format (VCF) (10
). Further strategies and tools for QC, including contamination detection using the pibase tools, are discussed in the Supplementary Methods
. Currently, ‘one of the main uses of next-generation sequencing is to discover variation among large populations of related samples’ (10
) and for this purpose, probabilistic frameworks exist (7
) that help to separate good novel SNV candidates from likely false positives (artifacts) and to determine allele frequencies in populations.
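The essence of such probabilistic frameworks can be sketched with a simple binomial read-sampling model. The flat per-read error rate and the three diploid genotypes below are illustrative assumptions, not the actual model of any cited tool:

```python
from math import comb, log10

def genotype_likelihoods(ref_n, alt_n, err=0.01):
    """Log10 likelihoods of the three diploid genotypes given allele
    read counts, under a binomial sampling model with a flat per-read
    error rate `err` (a cartoon of probabilistic variant callers)."""
    n = ref_n + alt_n
    def loglik(p_alt):
        return (log10(comb(n, alt_n))
                + alt_n * log10(p_alt)
                + ref_n * log10(1.0 - p_alt))
    return {
        'ref/ref': loglik(err),        # alt reads are sequencing errors
        'ref/alt': loglik(0.5),        # half the reads carry the alt allele
        'alt/alt': loglik(1.0 - err),  # ref reads are sequencing errors
    }

# 10 reference reads plus a single alternate read are best explained
# as a homozygous-reference site with one sequencing error:
gl = genotype_likelihoods(10, 1)
best = max(gl, key=gl.get)
print(best)  # ref/ref
```

This is how likelihood-based frameworks separate a plausible heterozygote (roughly balanced counts) from a likely false positive (one or two stray alternate reads).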
Unfortunately, there are several challenges in faithfully applying these variation-discovery approaches to other uses, such as clinical diagnostics, forensics and targeted-sequencing-based phylogenetic analyses. To begin with, the filtered SNV-lists generated by these approaches do not include low-confidence genotypes, e.g. where both-stranded validation is missing, and the unwary data recipient may interpret the missing information as a reference sequence genotype. Also, the default filters sometimes eliminate obvious genotypes (Supplementary Tables S1
; Supplementary Figures S1
). The second problem is that available variant-calling tools usually do not list sequencing failures, where there is low coverage or no coverage at all, and the unwary data recipient may again interpret this omission as a reference sequence genotype. These two errors alone can amount to high error rates, e.g. 59.3% (Supplementary Table S3d
) in an older whole genome sequencing run, or 9.5% (Supplementary Table S4
) in a recent Illumina HiSeq 2000 exome sequencing run. We have noticed that targeted sequencing data are much more prone to these errors than recent whole genome sequencing data, which show an error rate of only 0.5% (Supplementary Table S3a–c
). A third problem is that SNV-lists usually include incorrectly identified heterozygotes (prompted by an occasional sequencing error, misalignment or contaminant sequence) where the pre-set quality filter for machine output or read-alignment is inappropriate. The fourth problem occurs when the user employs several different SNV-callers for a basic validation, intersecting the individual SNV-lists to separate cross-validated SNVs from less-validated ones. Because each of these individual tools is, as explained above, prone to filtering away valid SNVs, the intersected consensus genotypes will exclude even more valid SNVs.
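This fourth problem can be seen in a toy example with hypothetical SNV-lists from three callers, each of whose pre-set filters drops a different valid site:

```python
# Hypothetical SNV-lists (chromosome, position, ref, alt) from three
# callers run on the same sample; each caller's filters suppress a
# different valid SNV, so the intersection loses more than any
# individual list.
caller_a = {('chr1', 12345, 'A', 'G'), ('chr1', 99999, 'C', 'T'),
            ('chr2', 555, 'G', 'A')}
caller_b = {('chr1', 12345, 'A', 'G'), ('chr2', 555, 'G', 'A')}
caller_c = {('chr1', 12345, 'A', 'G'), ('chr1', 99999, 'C', 'T')}

cross_validated = caller_a & caller_b & caller_c
print(len(cross_validated))  # 1: only one of three valid SNVs survives
```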
When performing comparisons between healthy and affected cells/individuals, a fifth problem surfaces, as each of the first four problems will lead to false differences in the comparative analyses. In other words, for such comparisons, it may not be advisable to rely on derived SNV-lists. Instead, the underlying BAM files are needed. Going back to the BAM files also resolves the sixth and most important problem: a specific challenge in cell or proband comparisons is to detect significant changes of allelic balance in heterozygous SNVs, e.g. in heterogeneous tumor samples or at copy number variation loci. Only the primary BAM file, not the derived SNV-lists, can reproduce this proportion of alleles.
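A count-based comparison of allelic balance between two samples can be sketched with Fisher's exact test on a 2×2 table of reference/alternate read counts. This is purely illustrative of the idea and is not pibase's published comparison method:

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]],
    e.g. reference/alternate read counts at one SNV in two samples.
    Sums the hypergeometric probabilities of all tables at least as
    extreme (i.e. no more probable) than the observed one."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def p_table(x):  # probability of the table whose top-left cell is x
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = p_table(a)
    lo, hi = max(0, col1 - row2), min(col1, row1)
    return sum(p for x in range(lo, hi + 1)
               if (p := p_table(x)) <= p_obs * (1 + 1e-9))

# A balanced 50/50 heterozygote in one sample versus a skewed 80/20
# site in the other: the shift in allelic balance is highly significant.
print(fisher_exact_2x2(50, 50, 80, 20) < 0.001)  # True
```

Such a test operates on the raw read counts from the BAM files, which is exactly the information that derived SNV-lists discard.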
Finally, a communication bottleneck between NGS bioinformaticians (data producers) and other scientists/clinicians (data users) may result in unnecessary analysis reruns with new workflows or filtering parameters, especially when new people or new NGS experiments are involved.
We have therefore developed the pibase package (http://www.ikmb.uni-kiel.de/pibase
, 23 August 2012, date last accessed), which, instead of relying on a single set of filtering parameters, applies 10 sets of filtering parameters and then infers the best genotype or the best comparison; complements the available general data analysis tools; saves considerable manual validation work and, unlike the manual approach, can be integrated into a bioinformatic pipeline. pibase was developed as a consequence of our previous study (13
): there, we systematically evaluated the stability of SNV-calls by observing all reads and all unique start points, as well as seven randomly sampled subsets of reads and unique start points yielding average coverages of 100×, 80×, 60×, 40×, 20×, 10× and 5×, respectively. Independently of the sequencing platform, genotype changes suddenly began to occur when coverages were reduced to 20× or lower (13
). This coverage-related genotype instability ultimately affected ~10% of the SNVs when the coverage was lowered to 5× (13).
In Figure 1, we present pibase’s ‘essential’ workflow (the prerequisite for all other workflows), in which pibase accepts BAM files and then extracts and tabulates nucleotide signals at genomic coordinates of interest using 10 different observation methods or ‘filters’ (from which pibase infers its ‘best genotype’): pibase observes reads and unique start points using five distinct and increasingly stringent quality filters. The resulting information is not restricted to a single stringency or a single filter setting, and is therefore more complete and less biased than information from single-source SNV-lists or manual inspections. Additionally, the reference sequence information that pibase requires is included in this table and, as a bonus, reduces the need for manual inspections in a viewer.

Table 1 demonstrates pibase filtering and the resulting coverage at four positions, representing the bandwidth from acceptable coverage to unacceptably low coverage. Filtering improves genotyping accuracy by eliminating potential errors in the raw reads and the alignments, but at the cost of reducing the number of remaining reads. In Table 1, filtering stringency increases from left to right, as explained in the ‘Materials and Methods’ section (pibase_bamref). Coverage is not uniform over the genome, making the identification of some genotypes less confident than others. A genotype is inferred for each of the 10 filter methods, which ideally should all result in identical genotypes. A summary ‘best genotype’ and its quality are computed from these 10 observations. Table 2 shows the per-filter genotypes evaluated from the read counts in Table 1, and the resulting pibase consensus genotype and quality grade. The columns for Filters 1 and 3 are not shown in this example.

For directly comparing two data sets, e.g. patients versus healthy controls in disease association studies, pibase uses a statistical approach on filtered read counts (original data) with associated quality control criteria, rather than a simple comparison of SNV-lists (processed data). We implemented this comparison method because we observed that SNV-calls or allele-calls may be suppressed in one of the samples being compared, merely because of stringency filters and coverage differences. Our approach should not be confused with the one implemented in CRISP (14
), which dramatically improves the accuracy of rare variant-calling in pools using Fisher’s exact test on A- and B-allele counts and multiple pools of samples. It is also not the same as the genotype-free likelihood-based approaches for pairwise comparisons and family trio comparisons that have been recently implemented by Li (11
). In summary, pibase addresses major problems pertaining to the quality control, validation and accurate comparison of NGS variant data, which are a bottleneck in currently emerging translational uses of NGS. Furthermore, the pibase data tables facilitate the practical use of NGS data by non-bioinformaticians such as archaeogeneticists, biologists, clinicians and forensic scientists.
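The idea of calling a genotype under several filter stringencies and then summarizing the per-filter calls can be sketched as follows. The thresholds, the two-grade summary and the inference rules here are invented for illustration and do not reproduce pibase's actual filters or grading:

```python
def genotype_at_filter(bases, quals, min_qual, min_depth=4, het_frac=0.2):
    """Call a genotype from one pileup column after discarding bases
    below `min_qual` (one rung of a stringency ladder). Returns None
    when too few reads survive the filter."""
    kept = [b for b, q in zip(bases, quals) if q >= min_qual]
    if len(kept) < min_depth:
        return None
    counts = sorted(((kept.count(b), b) for b in set(kept)), reverse=True)
    n1, b1 = counts[0]
    if len(counts) > 1 and counts[1][0] / len(kept) >= het_frac:
        return tuple(sorted((b1, counts[1][1])))
    return (b1, b1)

def consensus_genotype(bases, quals, stringencies=(0, 10, 20, 30, 37)):
    """Apply increasingly stringent quality filters and report the
    per-filter genotypes plus a summary 'best genotype' (the call
    agreed on by the most filters) with a toy quality grade."""
    calls = [genotype_at_filter(bases, quals, q) for q in stringencies]
    informative = [c for c in calls if c is not None]
    best = max(set(informative), key=informative.count) if informative else None
    grade = 'A' if informative and len(set(informative)) == 1 else 'B'
    return calls, best, grade

# A heterozygous site where one low-quality 'T' error disappears under
# stricter filters, so all informative filters agree on A/G:
bases = list('AAAAGGGGT')
quals = [38, 35, 30, 32, 36, 33, 31, 30, 5]
calls, best, grade = consensus_genotype(bases, quals)
print(best, grade)  # ('A', 'G') A
```

The strictest filter here yields no call (too few surviving reads), which a summary over all stringencies can tolerate, whereas a single-stringency SNV-list would simply drop the site.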
Figure 1. Flow chart showing the standard NGS sequencing and bioinformatic analysis (gray) and the ‘essential workflow’ of pibase (yellow): pibase_bamref reads a list of genomic coordinates from a tab-separated text file, a VCF file, a SAMtools pileup …
Table 1. Remaining reads after successive filtering at four positions in a public BAM file
Table 2. Stable and unstable genotypes resulting from the filtering in Table 1