Search tips
Search criteria

Results 1-6 (6)

Clipboard (0)

Select a Filter Below

Year of Publication
Document Types
1.  Effects of error-correction of heterozygous next-generation sequencing data 
BMC Bioinformatics  2014;15(Suppl 7):S3.
Error correction is an important step in increasing the quality of next-generation sequencing data for downstream analysis and use. Polymorphic datasets are a challenge for many bioinformatic software packages that are designed for or assume homozygosity of an input dataset. This assumption ignores the true genomic composition of many organisms that are diploid or polyploid. In this survey, two different error correction packages, Quake and ECHO, are examined to see how they perform on next-generation sequence data from heterozygous genomes.
Quake and ECHO perform well and were able to correct many errors found within the data. However, errors that occur at heterozygous positions had unique trends. Errors at these positions were sometimes corrected incorrectly, introducing errors into the dataset with the possibility of creating a chimeric read. Quake was much less likely to create chimeric reads. Quake's read trimming removed a large portion of the original data and often left reads with few heterozygous markers. ECHO resulted in more chimeric reads and introduced more errors than Quake but preserved heterozygous markers.
Using real E. coli sequencing data and their assemblies after error correction, the assembly statistics improved. It was also found that segregating reads by haplotype can improve the quality of an assembly.
These findings suggest that Quake and ECHO both have strengths and weaknesses when applied to heterozygous data. With the increased interest in haplotype specific analysis, new tools that are designed to be haplotype-aware are necessary that do not have the weaknesses of Quake and ECHO.
PMCID: PMC4110727  PMID: 25077414
2.  Probabilistic alignment leads to improved accuracy and read coverage for bisulfite sequencing data 
BMC Bioinformatics  2013;14:337.
DNA methylation has been linked to many important biological phenomena. Researchers have recently begun to sequence bisulfite treated DNA to determine its pattern of methylation. However, sequencing reads from bisulfite-converted DNA can vary significantly from the reference genome because of incomplete bisulfite conversion, genome variation, sequencing errors, and poor quality bases. Therefore, it is often difficult to align reads to the correct locations in the reference genome. Furthermore, bisulfite sequencing experiments have the additional complexity of having to estimate the DNA methylation levels within the sample.
Here, we present a highly accurate probabilistic algorithm, which is an extension of the Genomic Next-generation Universal MAPper to accommodate bisulfite sequencing data (GNUMAP-bs), that addresses the computational problems associated with aligning bisulfite sequencing data to a reference genome. GNUMAP-bs integrates uncertainty from read and mapping qualities to help resolve the difference between poor quality bases and the ambiguity inherent in bisulfite conversion. We tested GNUMAP-bs and other commonly-used bisulfite alignment methods using both simulated and real bisulfite reads and found that GNUMAP-bs and other dynamic programming methods were more accurate than the more heuristic methods.
The GNUMAP-bs aligner is a highly accurate alignment approach for processing the data from bisulfite sequencing experiments. The GNUMAP-bs algorithm is freely available for download at: The software runs on multiple threads and multiple processors to increase the alignment speed.
PMCID: PMC3924334  PMID: 24261665
DNA methylation; Bisulfite sequencing; Probabilistic alignment; Parallel processing
3.  Probabilistic inference and ranking of gene regulatory pathways as a shortest-path problem 
BMC Bioinformatics  2013;14(Suppl 13):S5.
Since the advent of microarray technology, numerous methods have been devised to infer gene regulatory relationships from gene expression data. Many approaches that infer entire regulatory networks. This produces results that are rich in information and yet so complex that they are often of limited usefulness for researchers. One alternative unit of regulatory interactions is a linear path between genes. Linear paths are more comprehensible than networks and still contain important information. Such paths can be extracted from inferred regulatory networks or inferred directly. Since criteria for inferring networks generally differs from criteria for inferring paths, indirect and direct inference of paths may achieve different results.
This paper explores a strategy to infer linear pathways by converting the path inference problem into a shortest-path problem. The edge weights used are the negative log-transformed probabilities of directness derived from the posterior joint distributions of pairwise mutual information between gene expression levels. Directness is inferred using the data processing inequality. The method was designed with two goals. One is to achieve better accuracy in path inference than extraction of paths from inferred networks. The other is to facilitate priorization of interactions for laboratory validation. A method is proposed for achieving this by ranking paths according to the joint probability of directness of each path's edges. The algorithm is evaluated using simulated expression data and is compared to extraction of shortest paths from networks inferred by two alternative methods, ARACNe and a minimum spanning tree algorithm.
Direct path inference appears to achieve accuracy competitive with that obtained by extracting paths from networks inferred by the other methods. Preliminary exploration of the use of joint edge probabilities to rank paths is largely inconclusive. Suggestions for a better framework for such comparisons are discussed.
PMCID: PMC3849606  PMID: 24266986
4.  Parallel Mapping Approaches for GNUMAP 
Mapping short next-generation reads to reference genomes is an important element in SNP calling and expression studies. A major limitation to large-scale whole-genome mapping is the large memory requirements for the algorithm and the long run-time necessary for accurate studies. Several parallel implementations have been performed to distribute memory on different processors and to equally share the processing requirements. These approaches are compared with respect to their memory footprint, load balancing, and accuracy. When using MPI with multi-threading, linear speedup can be achieved for up to 256 processors.
PMCID: PMC3565456  PMID: 23396612
5.  Targeted Amplicon Sequencing (TAS): A Scalable Next-Gen Approach to Multilocus, Multitaxa Phylogenetics 
Genome Biology and Evolution  2011;3:1312-1323.
Next-gen sequencing technologies have revolutionized data collection in genetic studies and advanced genome biology to novel frontiers. However, to date, next-gen technologies have been used principally for whole genome sequencing and transcriptome sequencing. Yet many questions in population genetics and systematics rely on sequencing specific genes of known function or diversity levels. Here, we describe a targeted amplicon sequencing (TAS) approach capitalizing on next-gen capacity to sequence large numbers of targeted gene regions from a large number of samples. Our TAS approach is easily scalable, simple in execution, neither time-nor labor-intensive, relatively inexpensive, and can be applied to a broad diversity of organisms and/or genes. Our TAS approach includes a bioinformatic application, BarcodeCrucher, to take raw next-gen sequence reads and perform quality control checks and convert the data into FASTA format organized by gene and sample, ready for phylogenetic analyses. We demonstrate our approach by sequencing targeted genes of known phylogenetic utility to estimate a phylogeny for the Pancrustacea. We generated data from 44 taxa using 68 different 10-bp multiplexing identifiers. The overall quality of data produced was robust and was informative for phylogeny estimation. The potential for this method to produce copious amounts of data from a single 454 plate (e.g., 325 taxa for 24 loci) significantly reduces sequencing expenses incurred from traditional Sanger sequencing. We further discuss the advantages and disadvantages of this method, while offering suggestions to enhance the approach.
PMCID: PMC3236605  PMID: 22002916
Next-gen sequencing; targeted amplicon sequencing; multiplex identifier; barcode; phylogenetics; population genetics; molecular systematics; Crustacea
6.  Taking the First Steps towards a Standard for Reporting on Phylogenies: Minimal Information about a Phylogenetic Analysis (MIAPA) 
In the eight years since phylogenomics was introduced as the intersection of genomics and phylogenetics, the field has provided fundamental insights into gene function, genome history and organismal relationships. The utility of phylogenomics is growing with the increase in the number and diversity of taxa for which whole genome and large transcriptome sequence sets are being generated. We assert that the synergy between genomic and phylogenetic perspectives in comparative biology would be enhanced by the development and refinement of minimal reporting standards for phylogenetic analyses. Encouraged by the development of the Minimum Information About a Microarray Experiment (MIAME) standard, we propose a similar roadmap for the development of a Minimal Information About a Phylogenetic Analysis (MIAPA) standard. Key in the successful development and implementation of such a standard will be broad participation by developers of phylogenetic analysis software, phylogenetic database developers, practitioners of phylogenomics, and journal editors.
PMCID: PMC3167193  PMID: 16901231

Results 1-6 (6)