Results 1-8 (8)

1.  ViVaMBC: estimating viral sequence variation in complex populations from illumina deep-sequencing data using model-based clustering 
BMC Bioinformatics  2015;16(1):59.
Deep-sequencing allows for an in-depth characterization of sequence variation in complex populations. However, technology associated errors may impede a powerful assessment of low-frequency mutations. Fortunately, base calls are complemented with quality scores which are derived from a quadruplet of intensities, one channel for each nucleotide type for Illumina sequencing. The highest intensity of the four channels determines the base that is called. Mismatch bases can often be corrected by the second best base, i.e. the base with the second highest intensity in the quadruplet. A virus variant model-based clustering method, ViVaMBC, is presented that explores quality scores and second best base calls for identifying and quantifying viral variants. ViVaMBC is optimized to call variants at the codon level (nucleotide triplets) which enables immediate biological interpretation of the variants with respect to their antiviral drug responses.
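The base-calling rule described above can be illustrated with a minimal sketch. It assumes an intensity quadruplet ordered as (A, C, G, T) and uses the standard Phred relation between a quality score and an error probability; the function names are hypothetical and this is not the ViVaMBC implementation.

```python
BASES = "ACGT"  # assumed channel order of the intensity quadruplet


def call_bases(intensities):
    """Return (best, second_best) base for one (A, C, G, T) intensity quadruplet."""
    if len(intensities) != 4:
        raise ValueError("expected four channel intensities")
    # Rank the channels from highest to lowest intensity.
    ranked = sorted(zip(intensities, BASES), reverse=True)
    return ranked[0][1], ranked[1][1]


def phred_to_error_prob(q):
    """Standard Phred relation: P(error) = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    best, second = call_bases((120.0, 35.5, 980.2, 410.7))
    print(best, second)             # called base and second best base
    print(phred_to_error_prob(30))  # error probability for Q30
```

A clustering model of the kind described could then weight each observed base by its error probability, with the second best base serving as the most plausible alternative when the call is wrong.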
Using mixtures of HCV plasmids we show that our method accurately estimates frequencies down to 0.5%. The estimates are unbiased when average coverages of 25,000 are reached. A comparison with the SNP-callers V-Phaser2, ShoRAH, and LoFreq shows that ViVaMBC has a superb sensitivity and specificity for variants with frequencies above 0.4%. Unlike the competitors, ViVaMBC reports a higher number of false-positive findings with frequencies below 0.4% which might partially originate from picking up artificial variants introduced by errors in the sample and library preparation step.
ViVaMBC is the first method to call viral variants directly at the codon level. The strength of the approach lies in modeling the error probabilities based on the quality scores. Although the use of second best base calls appeared very promising in our data exploration phase, their utility was limited. They provided a slight increase in sensitivity, which however does not warrant the additional computational cost of running the offline base caller. Apparently a lot of information is already contained in the quality scores enabling the model based clustering procedure to adjust the majority of the sequencing errors. Overall the sensitivity of ViVaMBC is such that technical constraints like PCR errors start to form the bottleneck for low frequency variant detection.
Electronic supplementary material
The online version of this article (doi:10.1186/s12859-015-0458-7) contains supplementary material, which is available to authorized users.
PMCID: PMC4369097  PMID: 25887734
Illumina sequencing; Codon; Second best base call; Model-based clustering; Viral quasispecies
2.  Impact of variance components on reliability of absolute quantification using digital PCR 
BMC Bioinformatics  2014;15(1):283.
Digital polymerase chain reaction (dPCR) is an increasingly popular technology for detecting and quantifying target nucleic acids. Its advertised strength is high precision absolute quantification without needing reference curves. The standard data analytic approach follows a seemingly straightforward theoretical framework but ignores sources of variation in the data generating process. These stem from both technical and biological factors, where we distinguish features that are 1) hard-wired in the equipment, 2) user-dependent and 3) provided by manufacturers but may be adapted by the user. The impact of the corresponding variance components on the accuracy and precision of target concentration estimators presented in the literature is studied through simulation.
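The standard analytic approach referred to above is commonly written as a Poisson correction of the fraction of positive partitions. The sketch below assumes that textbook estimator, with hypothetical function names and a simple Wald-type interval; it captures only the partition sampling variation, not the additional variance components the study examines.

```python
import math


def dpcr_concentration(n_positive, n_partitions, partition_volume_ul):
    """Estimate target copies per microlitre from positive partition counts."""
    if not 0 <= n_positive < n_partitions:
        raise ValueError("need 0 <= n_positive < n_partitions")
    p = n_positive / n_partitions
    # Mean copies per partition under the Poisson assumption: lambda = -ln(1 - p).
    lam = -math.log(1.0 - p)
    return lam / partition_volume_ul


def dpcr_interval(n_positive, n_partitions, partition_volume_ul, z=1.96):
    """Wald interval on p, transformed to the concentration scale (an assumption)."""
    p = n_positive / n_partitions
    se = math.sqrt(p * (1.0 - p) / n_partitions)
    lo = max(p - z * se, 0.0)
    hi = min(p + z * se, 1.0 - 1e-12)  # keep the logarithm finite
    return tuple(-math.log(1.0 - b) / partition_volume_ul for b in (lo, hi))


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the paper.
    print(dpcr_concentration(5000, 20000, 0.00085))
    print(dpcr_interval(5000, 20000, 0.00085))
```

Because the interval reflects binomial sampling alone, it will tend to be too narrow whenever pipetting error, partition volume variation or threshold misclassification contribute to the total variance.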
We reveal how system-specific technical factors influence accuracy as well as precision of concentration estimates. We find that a well-chosen sample dilution level and modifiable settings such as the fluorescence cut-off for target copy detection have a substantial impact on reliability and can be adapted to the sample analysed in ways that matter. User-dependent technical variation, including pipette inaccuracy and specific sources of sample heterogeneity, leads to a steep increase in uncertainty of estimated concentrations. Users can discover this through replicate experiments and derived variance estimation. Finally, the detection performance can be improved by optimizing the fluorescence intensity cut point as suboptimal thresholds reduce the accuracy of concentration estimates considerably.
Like any other technology, dPCR is subject to variation induced by natural perturbations, systematic settings as well as user-dependent protocols. Corresponding uncertainty may be controlled with an adapted experimental design. Our findings point to modifiable key sources of uncertainty that form an important starting point for the development of guidelines on dPCR design and data analysis with correct precision bounds. Besides clever choices of sample dilution levels, experiment-specific tuning of machine settings can greatly improve results. Well-chosen data-driven fluorescence intensity thresholds in particular result in major improvements in target presence detection. We call on manufacturers to provide sufficiently detailed output data that allows users to maximize the potential of the method in their setting and obtain high precision and accuracy for their experiments.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2105-15-283) contains supplementary material, which is available to authorized users.
PMCID: PMC4261249  PMID: 25147026
Digital PCR; Absolute nucleic acid quantification; CNV; Variance component; Precision; Accuracy; Reliability; Experimental design; Polymerase chain reaction
3.  Fast Wavelet Based Functional Models for Transcriptome Analysis with Tiling Arrays 
For a better understanding of the biology of an organism, a complete description is needed of all regions of the genome that are actively transcribed. Tiling arrays are used for this purpose. They allow for the discovery of novel transcripts and the assessment of differential expression between two or more experimental conditions such as genotype, treatment, tissue, etc. In the tiling array literature, many efforts are devoted to transcript discovery, whereas more recent developments also focus on differential expression. To our knowledge, however, no methods for tiling arrays have been described that can simultaneously assess transcript discovery and identify differentially expressed transcripts. In this paper, we adapt wavelet-based functional models to the context of tiling arrays. The high dimensionality of the data led us to avoid inference based on Bayesian MCMC methods. Instead, we introduce a fast empirical Bayes method that provides adaptive regularization of the functional effects. A simulation study and a case study illustrate that our approach is well suited for the simultaneous assessment of transcript discovery and differential expression in tiling array studies, and that it outperforms methods that accomplish only one of these tasks.
PMCID: PMC3750750  PMID: 22499683
tiling microarray; wavelets; adaptive regularization; transcript discovery; differential expression; genomics; Arabidopsis thaliana
4.  QTL Analysis of High Thermotolerance with Superior and Downgraded Parental Yeast Strains Reveals New Minor QTLs and Converges on Novel Causative Alleles Involved in RNA Processing 
PLoS Genetics  2013;9(8):e1003693.
Revealing QTLs with a minor effect in complex traits remains difficult. Initial strategies had limited success because of interference by major QTLs and epistasis. New strategies focused on eliminating major QTLs in subsequent mapping experiments. Since genetic analysis of superior segregants from natural diploid strains usually also reveals QTLs linked to the inferior parent, we have extended this strategy for minor QTL identification by eliminating QTLs in both parent strains and repeating the QTL mapping with pooled-segregant whole-genome sequence analysis. We first mapped multiple QTLs responsible for high thermotolerance in a natural yeast strain, MUCL28177, compared to the laboratory strain, BY4742. Using single and bulk reciprocal hemizygosity analysis we identified MKT1 and PRP42 as causative genes in QTLs linked to the superior and inferior parent, respectively. We subsequently downgraded both parents by replacing their superior allele with the inferior allele of the other parent. QTL mapping using pooled-segregant whole-genome sequence analysis with the segregants from the cross of the downgraded parents, revealed several new QTLs. We validated the two most-strongly linked new QTLs by identifying NCS2 and SMD2 as causative genes linked to the superior downgraded parent and we found an allele-specific epistatic interaction between PRP42 and SMD2. Interestingly, the related function of PRP42 and SMD2 suggests an important role for RNA processing in high thermotolerance and underscores the relevance of analyzing minor QTLs. Our results show that identification of minor QTLs involved in complex traits can be successfully accomplished by crossing parent strains that have both been downgraded for a single QTL. 
This novel approach has the advantage of maintaining all relevant genetic diversity as well as enough phenotypic difference between the parent strains for the trait-of-interest and thus maximizes the chances of successfully identifying additional minor QTLs that are relevant for the phenotypic difference between the original parents.
Author Summary
Most traits of organisms are determined by an interplay of different genes interacting in a complex way. For instance, nearly all industrially-important traits of the yeast Saccharomyces cerevisiae are complex traits. We have analyzed high thermotolerance, which is important for industrial fermentations, reducing cooling costs and sustaining higher productivity. Whereas genetic analysis of complex traits has been cumbersome for many years, the development of pooled-segregant whole-genome sequence analysis now allows successful identification of underlying genetic loci with a major effect. On the other hand, identification of loci with a minor contribution remains a challenge. We now present a methodology for identifying minor loci, which is based on the finding that the inferior parent usually also harbours superior alleles. This allowed construction for the trait of high thermotolerance of two ‘downgraded parent strains’ by replacing in each parent a superior allele by the inferior allele from the other parent. Subsequent mapping with the downgraded parents revealed new minor loci, which we validated by identifying the causative genes. Hence, our results illustrate the power of this methodology for successfully identifying minor loci determining complex traits and with a high chance of being co-responsible for the phenotypic difference between the original parents.
PMCID: PMC3744412  PMID: 23966873
5.  Simultaneous Mapping of Multiple Gene Loci with Pooled Segregants 
PLoS ONE  2013;8(2):e55133.
The analysis of polygenic, phenotypic characteristics such as quantitative traits or inheritable diseases remains an important challenge. It requires reliable scoring of many genetic markers covering the entire genome. The advent of high-throughput sequencing technologies provides a new way to evaluate large numbers of single nucleotide polymorphisms (SNPs) as genetic markers. Combining the technologies with pooling of segregants, as performed in bulked segregant analysis (BSA), should, in principle, allow the simultaneous mapping of multiple genetic loci present throughout the genome. The gene mapping process, applied here, consists of three steps: First, a controlled crossing of parents with and without a trait. Second, selection based on phenotypic screening of the offspring, followed by the mapping of short offspring sequences against the parental reference. The final step aims at detecting genetic markers such as SNPs, insertions and deletions with next generation sequencing (NGS). Markers in close proximity of genomic loci that are associated to the trait have a higher probability to be inherited together. Hence, these markers are very useful for discovering the loci and the genetic mechanism underlying the characteristic of interest. Within this context, NGS produces binomial counts along the genome, i.e., the number of sequenced reads that matches with the SNP of the parental reference strain, which is a proxy for the number of individuals in the offspring that share the SNP with the parent. Genomic loci associated with the trait can thus be discovered by analyzing trends in the counts along the genome. We exploit the link between smoothing splines and generalized mixed models for estimating the underlying structure present in the SNP scatterplots.
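To make the idea of analyzing trends in binomial counts along the genome concrete, the following sketch pools reference-matching and total read counts over a sliding window around each SNP. This moving-window proportion is a deliberately crude stand-in for the smoothing splines fitted through generalized mixed models described above; the function name and the window parameter are hypothetical.

```python
def windowed_frequency(positions, ref_counts, total_counts, half_width):
    """Pooled reference-allele frequency around each SNP.

    positions     genomic coordinates of the SNPs (bp)
    ref_counts    reads matching the parental reference SNP
    total_counts  total reads covering each SNP
    half_width    window half-width in bp (a tuning choice, assumed here)
    """
    smoothed = []
    for centre in positions:
        k = n = 0
        for pos, ref, tot in zip(positions, ref_counts, total_counts):
            if abs(pos - centre) <= half_width:
                k += ref  # pool the binomial successes
                n += tot  # pool the binomial trials
        smoothed.append(k / n if n else float("nan"))
    return smoothed


if __name__ == "__main__":
    # Illustrative counts only.
    freq = windowed_frequency([100, 200, 10000], [8, 9, 5], [10, 10, 10], 500)
    print(freq)
```

Regions where the pooled frequency departs clearly from 0.5 would be candidates for linkage to the selected trait, although a fixed window gives no principled way to choose the amount of smoothing, which is what the mixed-model formulation addresses.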
PMCID: PMC3575411  PMID: 23441149
6.  Improved base-calling and quality scores for 454 sequencing based on a Hurdle Poisson model 
BMC Bioinformatics  2012;13:303.
454 pyrosequencing is a commonly used massively parallel DNA sequencing technology with a wide variety of application fields such as epigenetics, metagenomics and transcriptomics. A well-known problem of this platform is its sensitivity to base-calling insertion and deletion errors, particularly in the presence of long homopolymers. In addition, the base-call quality scores are not informative with respect to whether an insertion or a deletion error is more likely. Surprisingly, not much effort has been devoted to the development of improved base-calling methods and more intuitive quality scores for this platform.
We present HPCall, a 454 base-calling method based on a weighted Hurdle Poisson model. HPCall uses a probabilistic framework to call the homopolymer lengths in the sequence by modeling well-known 454 noise predictors. Base-calling quality is assessed based on estimated probabilities for each homopolymer length, which are easily transformed to useful quality scores.
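As a rough illustration of the model class, a hurdle Poisson distribution assigns a separate probability to a zero count and a zero-truncated Poisson to positive counts. The sketch below takes the zero probability and Poisson mean as given, whereas HPCall models them from noise predictors; the Phred-style transform of the winning probability is an assumption for illustration, and all names are hypothetical.

```python
import math


def hurdle_poisson_pmf(k, p_zero, lam):
    """P(Y = k) for a hurdle Poisson with zero probability p_zero and mean parameter lam."""
    if k == 0:
        return p_zero
    poisson_k = math.exp(-lam) * lam ** k / math.factorial(k)
    # Positive counts follow a zero-truncated Poisson, scaled by 1 - p_zero.
    return (1.0 - p_zero) * poisson_k / (1.0 - math.exp(-lam))


def call_homopolymer_length(p_zero, lam, max_len=20):
    """Return the most probable length and a Phred-style quality for that call."""
    probs = [hurdle_poisson_pmf(k, p_zero, lam) for k in range(max_len + 1)]
    best = max(range(len(probs)), key=probs.__getitem__)
    error_prob = max(1.0 - probs[best], 1e-10)  # guard against log10(0)
    return best, -10.0 * math.log10(error_prob)


if __name__ == "__main__":
    print(call_homopolymer_length(p_zero=0.01, lam=2.5))
```

Comparing the probabilities of the neighbouring lengths (best - 1 and best + 1) would indicate whether an insertion or a deletion error is the more likely alternative, which is the kind of information the abstract says conventional quality scores lack.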
Using a reference data set of the Escherichia coli K-12 strain, we show that HPCall produces superior quality scores that are very informative towards possible insertion and deletion errors, while maintaining a base-calling accuracy that is better than the current one. Given the generality of the framework, HPCall has the potential to also adapt to other homopolymer-sensitive sequencing technologies.
PMCID: PMC3534400  PMID: 23151247
7.  Analysis of tiling array expression studies with flexible designs in Bioconductor (waveTiling) 
BMC Bioinformatics  2012;13:234.
Existing statistical methods for tiling array transcriptome data either focus on transcript discovery in one biological or experimental condition or on the detection of differential expression between two conditions. Increasingly often, however, biologists are interested in time-course studies, studies with more than two conditions or even multiple-factor studies. As these studies are currently analyzed with the traditional microarray analysis techniques, they do not exploit the genome-wide nature of tiling array data to its full potential.
We present an R Bioconductor package, waveTiling, which implements a wavelet-based model for analyzing transcriptome data and extends it towards more complex experimental designs. With waveTiling the user is able to discover (1) group-wise expressed regions, (2) differentially expressed regions between any two groups in single-factor studies and (3) in multifactorial designs. Moreover, for time-course experiments it is also possible to detect (4) linear time effects and (5) a circadian rhythm of transcripts. By considering the expression values of the individual tiling probes as a function of genomic position, effect regions can be detected regardless of existing annotation. Three case studies with different experimental set-ups illustrate the use and the flexibility of the model-based transcriptome analysis.
The waveTiling package provides the user with a convenient tool for the analysis of tiling array transcriptome data for a multitude of experimental set-ups. Regardless of the study design, the probe-wise analysis allows for the detection of transcriptional effects in exonic, intronic and intergenic regions, without prior consultation of existing annotation.
PMCID: PMC3558343  PMID: 22974078
8.  Practical Tools to Implement Massive Parallel Pyrosequencing of PCR Products in Next Generation Molecular Diagnostics 
PLoS ONE  2011;6(9):e25531.
Despite improvements in terms of sequence quality and price per basepair, Sanger sequencing remains restricted to screening of individual disease genes. The development of massively parallel sequencing (MPS) technologies heralded an era in which molecular diagnostics for multigenic disorders becomes reality. Here, we outline different PCR amplification based strategies for the screening of a multitude of genes in a patient cohort. We performed a thorough evaluation in terms of set-up, coverage and sequencing variants on the data of 10 GS-FLX experiments (over 200 patients). Crucially, we determined the actual coverage that is required for reliable diagnostic results using MPS, and provide a tool to calculate the number of patients that can be screened in a single run. Finally, we provide an overview of factors contributing to false negative or false positive mutation calls and suggest ways to maximize sensitivity and specificity, both important in a routine setting. By describing practical strategies for screening of multigenic disorders in a multitude of samples and providing answers to questions about minimum required coverage, the number of patients that can be screened in a single run and the factors that may affect sensitivity and specificity we hope to facilitate the implementation of MPS technology in molecular diagnostics.
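The relationship between required coverage and the number of patients per run can be sketched as simple arithmetic. The calculation below is an assumption for illustration, not the tool provided with the paper: it divides the reads available in a run by the reads needed per patient, with a hypothetical `spread_factor` to allow for uneven coverage across amplicons.

```python
def patients_per_run(reads_per_run, n_amplicons, min_coverage, spread_factor=1.0):
    """Number of patients that fit in one run under a simple read budget.

    reads_per_run  usable reads produced by one sequencing run
    n_amplicons    amplicons sequenced per patient
    min_coverage   minimum reads required per amplicon
    spread_factor  oversampling factor for uneven coverage (assumed, >= 1)
    """
    if min(reads_per_run, n_amplicons, min_coverage) <= 0 or spread_factor < 1.0:
        raise ValueError("inputs must be positive and spread_factor >= 1")
    reads_per_patient = n_amplicons * min_coverage * spread_factor
    return int(reads_per_run // reads_per_patient)


if __name__ == "__main__":
    # Illustrative values only, not figures from the study.
    print(patients_per_run(1_000_000, 100, 40, spread_factor=2.0))
```

In practice the minimum coverage would be set from the detection sensitivity required for heterozygous variants, which is the quantity the evaluation described above aims to determine.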
PMCID: PMC3184136  PMID: 21980484
