The pipeline is designed to process (multiplexed) amplicon resequencing experiments, a setup that is becoming increasingly common in diagnostic labs (as a replacement for classic Sanger sequencing), where several genes from several patients are tested together in one sequencing run.
The modular approach makes efficient planning possible. Time-consuming reporting steps can for example be postponed to a time when there is sufficient server power available. This is especially interesting when multiple runs need to be analyzed in a short period of time and efficient server usage is required.
The database approach makes it possible to easily generate reports and draw conclusions from a subset of sequences of a sequencing run or multiple runs together (Additional file 4). It also prevents the user from losing data as all the data is centralized in a single location.
The performance of the pipeline was initially assessed by analyzing two BRCA1/2 resequencing experiments (De Leeneer et al., in preparation). These two runs contained samples that had been analyzed before using classic HRM (high resolution melting) and consequently the variants in the samples were known. Both runs had a similar experimental design.
A total of 111 amplicons (44 BRCA1, 67 BRCA2) were equimolarly pooled together per sample (patient). PCR products of 11 patients (tagged with MIDs) were then equimolarly pooled together. Amplicon sizes ranged from 136 bp to 435 bp (mean 244 bp). The first run generated 542,532 reads (mean length 244 bp); the second run generated 261,646 reads (mean length 247 bp).
Processing the raw data
During the trimming step (processing the raw data), 97.37% of the raw sequences were split into MID sequence, linker sequence, amplicon sequence and trailing sequence. A small fraction of sequences had a start with too many sequencing errors to determine the correct MID, and these sequences were consequently not used in further analyses (but are nevertheless stored in the raw sequences database). At the moment there is no straightforward method to split sequences into MID, linker and amplicon sequence using the AVA software (AVA only allows MIDs to be split off), which makes it difficult to compare the splitting algorithms, but with a 97.37% yield one can assume that few improvements can be made.
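The splitting described above can be illustrated with a minimal sketch. It is not the pipeline's actual trimming code: the function names, the mismatch tolerance and the linker used in the example are assumptions, mismatches are counted position by position (real 454 errors are often insertions or deletions), and the trailing sequence is not handled.

```python
from typing import Dict, NamedTuple, Optional


class SplitRead(NamedTuple):
    mid_name: str   # name of the best-matching MID
    linker: str     # linker sequence that followed the MID
    amplicon: str   # remainder of the read after MID and linker


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def split_read(read: str, mids: Dict[str, str], linker: str,
               max_mismatches: int = 1) -> Optional[SplitRead]:
    """Split a read into MID, linker and amplicon part (hypothetical helper).

    Returns None when no MID matches the start of the read within
    max_mismatches, or when the linker does not directly follow the MID.
    """
    best = None  # (mid name, mismatches, MID length)
    for name, mid in mids.items():
        prefix = read[:len(mid)]
        if len(prefix) < len(mid):
            continue  # read is shorter than this MID
        distance = hamming(prefix, mid)
        if distance <= max_mismatches and (best is None or distance < best[1]):
            best = (name, distance, len(mid))
    if best is None:
        return None
    name, _, mid_length = best
    rest = read[mid_length:]
    if not rest.startswith(linker):
        return None
    return SplitRead(name, linker, rest[len(linker):])
```

Reads for which `split_read` returns `None` correspond to the small fraction that cannot be assigned to a MID; in the pipeline these remain stored in the raw sequences database but are excluded from further analysis.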
Mapping the reads
BLAT mapped 85.24% of the sequenced reads generated in the first run and 93.57% of the reads generated in the second run. Some of the mapped sequences were filtered out because they mapped to a reference amplicon in two pieces with a big gap in between and appeared to be clear primer-dimers (the so-called 'short' sequences). In the first run, 76% of all sequence reads mapped correctly and passed the filters, compared to 92% in the second run. The residual portion of the sequenced reads that did not map was further investigated and appeared to consist of PCR artefacts such as complex primer dimers. Only 4,653 reads mapped (completely or partially) outside the target regions when mapped genome-wide, and only 45 of these mapped completely (i.e. from start to stop) somewhere in the genome, from which we concluded that the PCR reactions were specific. Coverage per amplicon and per MID was heterogeneous, ranging between 0 and 5,134 (mean 310) for run 1 and between 0 and 3,019 (mean 175) for run 2. This heterogeneity (or 'spread factor') is mainly caused by differences in PCR efficiency and suboptimal pooling of samples, and can be lowered by improved pooling strategies and by incorporating normalization steps (Hellemans et al., in preparation). Errors in the labwork and/or PCR reactions not working as expected left some amplicons uncovered (2.98% in the first run, 1.51% in the second run).
Mapping on reference amplicons rather than on a reference genome certainly improves speed, but is somewhat debatable because one might miss paralogous amplification products. However, in a diagnostic resequencing setup one is interested in variants in a certain set of genes or even a certain exon, using thoroughly validated PCR reactions. Paralogous amplification products might become apparent when mapping genome-wide (and be absent when mapping only to reference amplicons), but they would lack diagnostic significance as they give no or incorrect information on the region that was intended to be screened. Users should strive to use validated PCR reactions and omit genome-wide mapping, rather than rely on genome-wide mapping. Nevertheless, the genome-wide approach is included in the package for users who want to screen for aspecific PCR products.
Calling true variants
The two BRCA1/2 runs contained samples with known variants (132 distinct sequence variants of which 90 were deletion/insertion mutations).
By setting the filters (using the 'generate reports' script) at the default values one can discriminate between true variants and sequencing errors. The recommended values are: variation frequency (>33% and <67%; >0.95%) and coverage (>20). However, the coverage and frequency filters are adjustable, and exact filter values can be calculated as a function of the desired detection power (Hellemans et al., in preparation).
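As an illustration of how such frequency and coverage filters combine, the sketch below classifies a candidate variant from its read counts. The function name and return values are hypothetical, and the lower bound for homozygous calls is taken here as 95%, which is an assumption about how the second frequency threshold is meant to be read; all thresholds are parameters so they can be adjusted.

```python
from typing import Optional, Tuple


def classify_variant(variant_reads: int, coverage: int,
                     min_coverage: int = 20,
                     het_window: Tuple[float, float] = (0.33, 0.67),
                     hom_min: float = 0.95) -> Optional[str]:
    """Apply coverage and frequency filters to one candidate variant.

    Returns "heterozygous", "homozygous", or None when the candidate is
    filtered out. hom_min = 0.95 is an assumed value (see lead-in).
    """
    if coverage <= min_coverage:
        return None  # coverage must exceed the minimum
    frequency = variant_reads / coverage
    if het_window[0] < frequency < het_window[1]:
        return "heterozygous"
    if frequency > hom_min:
        return "homozygous"
    return None  # frequency falls outside both accepted windows
```

For example, `classify_variant(50, 100)` falls in the heterozygous window, whereas a candidate seen in 10 of 100 reads is rejected as a likely sequencing error.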
It is difficult to reliably design a filter that only removes faulty homopolymer variants. Discarding every variation preceding or following a homopolymeric region is not an option, as real variants can also flank such regions. We have seen that for homopolymers < 6 bp there is no problem discriminating between correctly sequenced homopolymeric stretches and homopolymer related insertions or deletions (sequencing errors) by using the quality score (Q). When using the recommended filter value Q > 30, only a minor fraction of the homopolymer related sequencing errors pass the filter (homopolymers < 6 bp). When the stretches are 6 bp or longer, the distributions of the normal and the mismatch homopolymers start to overlap and even a correctly sequenced base has a low Q score (Figure 2). We recommend setting the homopolymer filter at 6 bp and keeping in mind that no real variants preceding or following a homopolymeric stretch of 6 bp or longer can reliably be detected using the default filter values.
Figure 2 Influence of quality score (Q) on homopolymer accuracy. Distribution of homopolymer related quality scores (Q score). The normal homopolymer Q score distribution is determined by making a distribution of the Q score of the homopolymer base; the mismatch ...
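The homopolymer rule described above can be sketched as a simple check on the reference sequence: is the variant position inside, or directly adjacent to, a homopolymer run of at least 6 bp? The helper names and the exact definition of "adjacent" (one base to either side) are assumptions made for this illustration, not the pipeline's implementation.

```python
def run_length_at(seq: str, i: int) -> int:
    """Length of the homopolymer run that contains index i (0-based)."""
    base = seq[i]
    left = i
    while left > 0 and seq[left - 1] == base:
        left -= 1
    right = i
    while right < len(seq) - 1 and seq[right + 1] == base:
        right += 1
    return right - left + 1


def near_long_homopolymer(seq: str, pos: int, min_run: int = 6) -> bool:
    """True if pos lies in, or directly next to, a run of >= min_run bases."""
    neighbours = [i for i in (pos - 1, pos, pos + 1) if 0 <= i < len(seq)]
    return max(run_length_at(seq, i) for i in neighbours) >= min_run
```

A variant flagged by `near_long_homopolymer` would be one for which the default filter values cannot reliably separate a real variant from a homopolymer related sequencing error.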
In total 97% of all known variants (homozygous and heterozygous) could be detected (sensitivity). Specificity (the portion of called variants that actually are real variants) was 98.5% which means that the false positive rate is only 1.5%. All non-detected variants were insertions/deletions in or near homopolymeric regions. We are aware of the fact that the given numbers may be overfitted to a BRCA1/2 screening, but very similar results were obtained in other experiments (20 genes, 4 runs) as well (data not shown).
A comparison of the data pre- and post-filtering is given in Figure 3. The figure clearly shows that the pre-filter data contains a lot of 'noise'. Post-filter data is concentrated around the 50% and 100% level as expected and allows easy discrimination between heterozygous and homozygous variants. At higher coverage, the data is concentrated in a band which is narrower than the specified filter settings. This indicates that at higher coverage, mainly low Q-score variants and homopolymer-related variants are filtered out.
Figure 3 Pre and post filtering variants data. Plot of the coverage (times a single sequence is read by the sequencing equipment) and the frequency of an observed variation. In reality, genomic variation occurs at a frequency of either 50% or 100% of total reads. ...
VIP compared with AVA
The performance of the VIP pipeline was compared with the performance of the AVA software (184.108.40.206). Reads from 1 patient (MID1) containing 67 different amplicons were analyzed in detail. The sample was known to contain 12 variants.
The AVA software does not handle the linker sequence between the MID adaptor and the PCR primer very well. The sequences of the MID/linker combinations (MID1' = MID1/linker, MID2' = MID2/linker etc.) were used as new MIDs in AVA to circumvent the inability of AVA to split off the linker sequences. The AVA analysis was carried out as described in the manufacturer's manual.
The VIP pipeline returned 50 variants compared to the 235 returned by AVA. This is explained by the fact that the VIP pipeline has an internal filter which filters out any variant with a frequency lower than 10% and with a coverage of only 1. These variants are considered random sequence errors.
With the minimum frequency filter in AVA set to 20%, 33 variants were identified, whereas 14 variants passed the different filters in the VIP pipeline. The VIP pipeline picked up all 12 known variants (including 6 difficult homopolymer related variants) and called only two false positives, both homopolymer related. The AVA software on the other hand missed all 6 homopolymer related variants and called 27 false positives (Table ). A detailed overview of all the variants detected by both AVA and VIP is given in additional file 5.
Comparison of AVA and VIP performance (1 sample, 67 amplicons, 12 known variants)
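The counts reported for this single-sample comparison can be turned into sensitivity and the proportion of called variants that are real (the quantity referred to as specificity in this paper). The short calculation below uses only the numbers given in the text; the function names are chosen for this illustration.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Fraction of known variants that were detected."""
    return true_positives / (true_positives + false_negatives)


def fraction_real_calls(true_positives: int, false_positives: int) -> float:
    """Fraction of called variants that are real variants."""
    return true_positives / (true_positives + false_positives)


# VIP: 14 calls, all 12 known variants found, 2 false positives.
vip_sens = sensitivity(12, 0)
vip_real = fraction_real_calls(12, 2)

# AVA (20% frequency filter): 33 calls, 27 false positives,
# so 6 true positives and 6 known variants missed.
ava_sens = sensitivity(6, 6)
ava_real = fraction_real_calls(6, 27)

print(f"VIP: sensitivity {vip_sens:.1%}, real calls {vip_real:.1%}")
# VIP: sensitivity 100.0%, real calls 85.7%
print(f"AVA: sensitivity {ava_sens:.1%}, real calls {ava_real:.1%}")
# AVA: sensitivity 50.0%, real calls 18.2%
```

These figures describe one sample only and are therefore not directly comparable with the run-wide sensitivity and specificity reported earlier.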
Besides generating an overview of the variants, the pipeline can also generate additional reports which make the pipeline very useful in a diagnostic setting. All the data is intelligently stored in a relational database and therefore custom analyses can be carried out using SQL statements in either an SQL browser such as HeidiSQL [17] or by writing custom scripts in any scripting or programming language that has the ability to communicate with a MySQL database.
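To give an idea of what such a custom analysis might look like, the sketch below counts coverage per MID and amplicon and lists the combinations at or below a chosen coverage. The table and column names are hypothetical (the real schema is described in the manual), and Python's built-in SQLite module is used in place of MySQL purely so that the example is self-contained; the SQL itself is standard.

```python
import sqlite3
from typing import List, Tuple

LOW_COVERAGE_QUERY = """
SELECT mid, amplicon, COUNT(*) AS coverage
FROM mapped_reads
WHERE run_id = ?
GROUP BY mid, amplicon
HAVING COUNT(*) <= ?
ORDER BY coverage
"""


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE mapped_reads (run_id INTEGER, mid TEXT, amplicon TEXT)"
    )
    rows = (
        [(1, "MID1", "BRCA1_amp1")] * 30    # well covered
        + [(1, "MID1", "BRCA2_amp2")] * 5   # underrepresented
        + [(1, "MID2", "BRCA1_amp1")] * 25  # well covered
    )
    conn.executemany("INSERT INTO mapped_reads VALUES (?, ?, ?)", rows)
    return conn


def low_coverage(conn: sqlite3.Connection, run_id: int,
                 max_coverage: int) -> List[Tuple[str, str, int]]:
    """Return (mid, amplicon, coverage) rows with coverage <= max_coverage."""
    return conn.execute(LOW_COVERAGE_QUERY, (run_id, max_coverage)).fetchall()
```

Note that amplicons without any mapped read do not appear in such a count; finding those would require a join against a table listing the expected MID and amplicon combinations.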
This database approach allowed optimization of the laboratory work of the BRCA1/2 sequencing experiments, especially the pre-sequencing PCR reactions. Using the data from the database, it was possible to detect which amplicons were underrepresented. It was also clear that there were a lot of short sequences in the first run. Run 1 had 8.73% of total reads flagged as 'short sequence', run 2 had 1.87% of total reads flagged as such. Using this information, the multiplex reactions were optimized and an additional length separation was carried out. These actions improved the efficiency by reducing undesired by-products and/or primer dimers from 24% to 8% of the total sequences (Figure 4). It is vital to reduce by-products such as dimers in the early PCR steps as these shorter sequences get preferentially amplified in the emulsion PCR and reduce efficiency even more.
Figure 4 Improving future sequencing efficiency using prior sequencing data. Example of the reporting possibilities. Run 1 had many unmappable and short, mapped sequences. Length distribution showed these were mainly 60-120 bp sequences. In Run 2 optimized PCR ...
There are 8 standard reports integrated in the reporting module. Each of the reports is described in detail in the manual and an overview is present in the additional files.
Validating variation using the VIP Validator
Random variation in the VIP Validator is defined as a random error that is introduced with a certain frequency at a certain position in a set of sequences that map on a random reference amplicon. The Validator introduces the variant into a number of sequences and then verifies in how many of these sequences this specific variant was detected at the exact location where it was introduced.
Random variation validation was initially used to optimize and validate the VIP analysis pipeline but has proven to be an effective instrument to detect problematic amplicons or problematic regions in certain amplicons. There are amplicons for which few variants have been reported yet, meaning that amplicon resequencing experiments can unravel new variants. A simulation with random variations can give a clue about regions where accurate variation detection is difficult; for example multiple variations close to each other or variations in repetitive regions can be problematic for the mapping software. The Validator does not explain why a certain region becomes a difficult region to detect variants but informs about variants that cannot be detected using the pipeline.
Random variation was introduced by choosing a random amplicon, a random position, a random variant, a random frequency, and a random coverage and then introducing the variant accordingly. This process was repeated 1000 times.
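A minimal sketch of this procedure for single nucleotide variants is shown below. It is not the Validator's code: the function names, the ranges from which frequency and coverage are drawn, and the simplification that every read is a full-length copy of the amplicon starting at position 0 are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Set, Tuple

BASES = "ACGT"


def random_snv(amplicons: Dict[str, str], rng: random.Random) -> dict:
    """Draw a random amplicon, position, alternative base, frequency, coverage."""
    name = rng.choice(sorted(amplicons))
    seq = amplicons[name]
    pos = rng.randrange(len(seq))
    alt = rng.choice([b for b in BASES if b != seq[pos]])
    return {
        "amplicon": name,
        "pos": pos,
        "ref": seq[pos],
        "alt": alt,
        "frequency": rng.uniform(0.05, 1.0),  # assumed range
        "coverage": rng.randint(20, 500),     # assumed range
    }


def introduce_snv(reads: List[str], pos: int, alt: str, frequency: float,
                  rng: random.Random) -> Tuple[List[str], Set[int]]:
    """Substitute alt at pos in a randomly chosen fraction of the reads.

    Returns the modified reads and the indices of the reads that were changed.
    """
    n_mutated = round(frequency * len(reads))
    chosen = set(rng.sample(range(len(reads)), n_mutated))
    mutated = [
        read[:pos] + alt + read[pos + 1:] if i in chosen else read
        for i, read in enumerate(reads)
    ]
    return mutated, chosen
```

Repeating `random_snv` followed by `introduce_snv` 1000 times, and running the modified reads through the pipeline each time, would mirror the simulation described above.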
The detection ratio is the number of sequences in which the variation was detected divided by the number of sequences in which it was introduced (the coverage, so to speak). This ratio is independent of the frequency at which a variant was introduced.
It is clear that SNV detection with the VIP is no problem as 100% of the introduced variants can be detected by the VIP Validator with a sufficiently high detection ratio. The 67% threshold is considered as sufficiently high because the detection ratio for a heterozygous variant would be at least 33.5% and still pass the 33% filter.
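The reasoning behind the 67% threshold is a one-line calculation: a heterozygous variant present in 50% of the reads and detected with a detection ratio of 67% is observed at 33.5%, just above the 33% frequency filter. The helper names below are chosen for this illustration only.

```python
def detection_ratio(n_introduced: int, n_detected: int) -> float:
    """Detected reads divided by reads in which the variant was introduced."""
    if n_introduced <= 0:
        raise ValueError("the variant must be introduced in at least one read")
    return n_detected / n_introduced


def observed_frequency(true_frequency: float, ratio: float) -> float:
    """Frequency at which a variant is expected to be observed."""
    return true_frequency * ratio


# Heterozygous variant (50% of reads) at the 67% detection ratio threshold.
het_observed = observed_frequency(0.5, 0.67)
passes_frequency_filter = het_observed > 0.33
```

With a detection ratio below roughly 66%, the same heterozygous variant would drop to 33% or less and be removed by the default frequency filter.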
Deletions (and insertions) are often not detected using the newest NGS mapping and variation detection packages (Additional file 5). The VIP pipeline can detect >99% of random 3 bp deletions with a detection ratio ≥ 67%. Determining the location of a gap appears to be relatively easy. The difficult part of gap detection is determining the exact length. Results are similar for longer (10 bp) deletions and insertions (both 3 bp and 10 bp). An overview of the detection ratios is given in Figure 5.
Figure 5 The VIP Validator. Detection ratios of 1000 known BRCA1/2 variants, random SNVs, random 3 bp and 10 bp deletions, and random 3 bp and 10 bp insertions. The grey horizontal lines indicate 67% detection ratio. The grey vertical lines indicate a 99% cumulative ...
These observations make it clear that deletions and insertions can be detected but one should keep in mind that the exact length of the insertion/deletion that passes through the filters is not necessarily the real length. This is one of the drawbacks of needing multiple reads to have a reliable call for a single nucleotide position. Nevertheless, it is detected that something is wrong, which is essential in a diagnostic setting. The alignment visualizer can be a useful resource to manually assess the exact deletion/insertion size.
Rather than using random variation, the VIP Validator can also use a list of known variations as input. Validating known variation is the most useful application of the VIP Validator in a diagnostic setting. It gives an answer to the question "If variation X is present in an amplicon, would it then be possible to detect it using the VIP pipeline?". The answer to this question is objectively addressed using the detection ratio parameter.
Around 95% of the known BRCA1/2 variants were detected by the Validator with a detection ratio of 100%, meaning that in every single read (wherein the variant was introduced) the variant was detected at the correct position. The other 5% of the variants were detected with a detection ratio that is lower than 100% but still > 67% (Figure 5).
The VIP Validator can also be used to determine minimal needed coverage in silico. The number of sequences wherein variation is introduced can be altered and one can find a minimal coverage that is needed to have a sufficiently high and reproducible detection frequency.
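As a complement to such in silico runs, a simple binomial model gives a rough idea of how coverage affects the chance that a heterozygous variant lands inside the 33% to 67% frequency window. This is a statistical sketch under stated assumptions (independent reads, a true variant frequency of 50%, perfect detection), not the procedure used by the Validator; the function names and the 99% target are hypothetical.

```python
from math import comb
from typing import Optional


def window_probability(coverage: int, p: float = 0.5,
                       low: float = 0.33, high: float = 0.67) -> float:
    """P(low < k / coverage < high) for k ~ Binomial(coverage, p)."""
    total = 0.0
    for k in range(coverage + 1):
        if low < k / coverage < high:
            total += comb(coverage, k) * p ** k * (1 - p) ** (coverage - k)
    return total


def minimal_coverage(target: float = 0.99, p: float = 0.5,
                     max_coverage: int = 1000) -> Optional[int]:
    """Smallest coverage at which the window probability reaches target.

    Because read counts are discrete, the probability is not strictly
    increasing with coverage; this returns the first coverage that
    reaches the target, or None if none does up to max_coverage.
    """
    for n in range(1, max_coverage + 1):
        if window_probability(n, p) >= target:
            return n
    return None
```

In practice the Validator's empirical detection ratios remain the more relevant measure, since they also capture mapping and homopolymer effects that a binomial model ignores.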
The Validator is useful for pipeline optimization and determination of the ideal cut-off values. It allows end-users to fine-tune the pipeline to their own needs and to objectively validate the results of a given pipeline modification. Moreover, it allows the analysis software to be validated for the detection of the specific variations targeted in a screening, which is very important in diagnostics, and it determines the detection limits of the pipeline prior to starting a diagnostic screening.