Results 1-13 (13)
 

1.  Phyloscan: locating transcription-regulating binding sites in mixed aligned and unaligned sequence data 
Nucleic Acids Research  2010;38(Web Server issue):W268-W274.
The transcription of a gene from its DNA template into an mRNA molecule is the first, and most heavily regulated, step in gene expression. Especially in bacteria, regulation is typically achieved via the binding of a transcription factor (protein) or small RNA molecule to the chromosomal region upstream of a regulated gene. The protein or RNA molecule recognizes a short, approximately conserved sequence within a gene's promoter region and, by binding to it, either enhances or represses expression of the nearby gene. Since the sought-for motif (pattern) is short and accommodating to variation, computational approaches that scan for binding sites have trouble distinguishing functional sites from look-alikes. Many computational approaches are unable to find the majority of experimentally verified binding sites without also finding many false positives. Phyloscan overcomes this difficulty by exploiting two key features of functional binding sites: (i) these sites are typically more conserved evolutionarily than are non-functional DNA sequences; and (ii) these sites often occur two or more times in the promoter region of a regulated gene. The website is free and open to all users, and there is no login requirement. Address: (http://bayesweb.wadsworth.org/phyloscan/).
doi:10.1093/nar/gkq330
PMCID: PMC2896078  PMID: 20435683
2.  Exact Calculation of Distributions on Integers, with Application to Sequence Alignment 
Computational biology is replete with high-dimensional discrete prediction and inference problems. Dynamic programming recursions can be applied to several of the most important of these, including sequence alignment, RNA secondary-structure prediction, phylogenetic inference, and motif finding. In these problems, attention is frequently focused on some scalar quantity of interest, a score, such as an alignment score or the free energy of an RNA secondary structure. In many cases, the score is naturally defined on integers, such as a count of the number of pairing differences between two sequence alignments, or else an integer score has been adopted for computational reasons, such as in the test of significance of motif scores. The probability distribution of the score under an appropriate probabilistic model is of interest, such as in tests of significance of motif scores, or in calculation of Bayesian confidence limits around an alignment. Here we present three algorithms for calculating the exact distribution of a score of this type; then, in the context of pairwise local sequence alignments, we apply the approach so as to find the alignment score distribution and Bayesian confidence limits.
doi:10.1089/cmb.2008.0137
PMCID: PMC2858568  PMID: 19119992
algorithms; computational molecular biology; statistics; alignment; dynamic programming; genomics
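As an illustration of what an exact distribution of an integer score looks like, the sketch below computes the distribution of a motif score by convolving one position at a time. It is a generic dynamic-programming recursion, not any of the three algorithms the abstract refers to; the function names and the toy scoring columns are assumptions made for this example.

```python
from collections import defaultdict


def score_distribution(columns):
    """Exact distribution of a sum of independent integer scores.

    columns: one list per motif position, each a list of
    (integer_score, probability) pairs under the background model.
    Returns a dict mapping total score -> probability.
    """
    dist = {0: 1.0}  # before any position is scored, the total is 0
    for column in columns:
        nxt = defaultdict(float)
        for total, p in dist.items():
            for score, q in column:
                nxt[total + score] += p * q  # convolve with this position
        dist = dict(nxt)
    return dist


def tail_probability(dist, threshold):
    """P(score >= threshold), i.e. the p-value of a score under the model."""
    return sum(p for score, p in dist.items() if score >= threshold)


# Hypothetical two-position motif with a uniform background over A, C, G, T.
toy_columns = [
    [(2, 0.25), (-1, 0.25), (-1, 0.25), (0, 0.25)],
    [(3, 0.25), (0, 0.25), (-2, 0.25), (-2, 0.25)],
]
toy_dist = score_distribution(toy_columns)
```

Because the scores are integers, the number of distinct totals grows only linearly with the number of positions, which is what keeps the exact calculation tractable.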
3.  Exact Calculation of Distributions on Integers, with Application to Sequence Alignment 
Abstract
Computational biology is replete with high-dimensional discrete prediction and inference problems. Dynamic programming recursions can be applied to several of the most important of these, including sequence alignment, RNA secondary-structure prediction, phylogenetic inference, and motif finding. In these problems, attention is frequently focused on some scalar quantity of interest, a score, such as an alignment score or the free energy of an RNA secondary structure. In many cases, the score is naturally defined on integers, such as a count of the number of pairing differences between two sequence alignments, or else an integer score has been adopted for computational reasons, such as in the test of significance of motif scores. The probability distribution of the score under an appropriate probabilistic model is of interest, such as in tests of significance of motif scores, or in calculation of Bayesian confidence limits around an alignment. Here we present three algorithms for calculating the exact distribution of a score of this type; then, in the context of pairwise local sequence alignments, we apply the approach so as to find the alignment score distribution and Bayesian confidence limits.
doi:10.1089/cmb.2008.0137
PMCID: PMC2858568  PMID: 19119992
algorithms; computational molecular biology; statistics; alignment; dynamic programming; genomics
4.  Significance of Gapped Sequence Alignments 
Journal of Computational Biology  2008;15(9):1187-1194.
Abstract
Measurement of the statistical significance of extreme sequence alignment scores is key to many important applications, but it is difficult. To precisely approximate alignment score significance, we draw random samples directly from a well chosen, importance-sampling probability distribution. We apply our technique to pairwise local sequence alignment of nucleic acid and amino acid sequences of length up to 1000. For instance, using a BLOSUM62 scoring system for local sequence alignment, we compute that the p-value of a score of 6000 for the alignment of two sequences of length 1000 is (3.4 ± 0.3) × 10^−1314. Further, we show that the extreme value significance statistic for the local alignment model that we examine does not follow a Gumbel distribution. A web server for this application is available at http://bayesweb.wadsworth.org/alignmentSignificanceV1/.
doi:10.1089/cmb.2008.0125
PMCID: PMC2737730  PMID: 18973434
alignment; dynamic programming; genomics; phylogenetic trees; regulatory regions
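The idea of estimating a very small p-value by importance sampling can be shown on a toy problem. The sketch below estimates the chance that at least 16 of 20 independent trials succeed when each succeeds with probability 0.25, by sampling from a proposal in which successes are common and reweighting each sample by its likelihood ratio. The coin-flip model, the function names, and the parameter values are assumptions for illustration only; this is not the alignment sampler described in the abstract.

```python
import math
import random


def tail_by_importance_sampling(p_up, n, min_ups, q_up, samples, seed=0):
    """Estimate P(ups >= min_ups) for n trials with success probability p_up.

    Trials are drawn with the proposal probability q_up instead, and each
    qualifying sample is weighted by target probability / proposal probability.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        ups = sum(rng.random() < q_up for _ in range(n))
        if ups >= min_ups:
            weight = (p_up / q_up) ** ups * ((1 - p_up) / (1 - q_up)) ** (n - ups)
            total += weight
    return total / samples


def exact_tail(p_up, n, min_ups):
    """Exact binomial tail, used here only to check the estimate."""
    return sum(
        math.comb(n, k) * p_up**k * (1 - p_up) ** (n - k)
        for k in range(min_ups, n + 1)
    )


estimate = tail_by_importance_sampling(0.25, 20, 16, q_up=0.8, samples=20000, seed=1)
exact = exact_tail(0.25, 20, 16)
```

Naive sampling from the original model would almost never produce a qualifying sample here, whereas the tilted proposal produces them often and the weights correct for the bias.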
5.  Significance of Gapped Sequence Alignments 
Measurement of the statistical significance of extreme sequence alignment scores is key to many important applications, but it is difficult. To precisely approximate alignment score significance, we draw random samples directly from a well chosen, importance-sampling probability distribution. We apply our technique to pairwise local sequence alignment of nucleic acid and amino acid sequences of length up to 1000. For instance, using a BLOSUM62 scoring system for local sequence alignment, we compute that the p-value of a score of 6000 for the alignment of two sequences of length 1000 is (3.4 ± 0.3) × 10^−1314. Further, we show that the extreme value significance statistic for the local alignment model that we examine does not follow a Gumbel distribution. A web server for this application is available at http://bayesweb.wadsworth.org/alignmentSignificanceV1/.
doi:10.1089/cmb.2008.0125
PMCID: PMC2737730  PMID: 18973434
alignment; dynamic programming; genomics; phylogenetic trees; regulatory regions
6.  Error statistics of hidden Markov model and hidden Boltzmann model results 
BMC Bioinformatics  2009;10:212.
Background
Hidden Markov models and hidden Boltzmann models are employed in computational biology and a variety of other scientific fields for a variety of analyses of sequential data. Whether the associated algorithms are used to compute an actual probability or, more generally, an odds ratio or some other score, a frequent requirement is that the error statistics of a given score be known. What is the chance that random data would achieve that score or better? What is the chance that a real signal would achieve a given score threshold?
Results
Here we present a novel general approach to estimating these false positive and true positive rates that is significantly more efficient than are existing general approaches. We validate the technique via an implementation within the HMMER 3.0 package, which scans DNA or protein sequence databases for patterns of interest, using a profile-HMM.
Conclusion
The new approach is faster than general naïve sampling approaches, and more general than other current approaches. It provides an efficient mechanism by which to estimate error statistics for hidden Markov model and hidden Boltzmann model results.
doi:10.1186/1471-2105-10-212
PMCID: PMC2722652  PMID: 19589158
7.  Memory-efficient dynamic programming backtrace and pairwise local sequence alignment 
Bioinformatics (Oxford, England)  2008;24(16):1772-1778.
Motivation
A backtrace through a dynamic programming algorithm’s intermediate results in search of an optimal path, or to sample paths according to an implied probability distribution, or as the second stage of a forward–backward algorithm, is a task of fundamental importance in computational biology. When there is insufficient space to store all intermediate results in high-speed memory (e.g. cache) existing approaches store selected stages of the computation, and recompute missing values from these checkpoints on an as-needed basis.
Results
Here we present an optimal checkpointing strategy, and demonstrate its utility with pairwise local sequence alignment of sequences of length 10 000.
Availability
Sample C++ code for optimal backtrace is available in the Supplementary Materials.
doi:10.1093/bioinformatics/btn308
PMCID: PMC2668612  PMID: 18558620
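The checkpoint-and-recompute idea described in this abstract can be sketched as follows. The example stores only every k-th stage of a forward computation (with k near the square root of the number of stages) and regenerates the stages in between when the backtrace needs them. This evenly spaced scheme is a simple baseline chosen for illustration, not the optimal checkpointing strategy the paper presents; `reverse_states` and its parameters are hypothetical names.

```python
import math


def reverse_states(initial, step, n):
    """Yield state_n, state_{n-1}, ..., state_0 of the recurrence
    state_{i+1} = step(state_i), holding roughly 2 * sqrt(n) states at once.
    """
    block = max(1, math.isqrt(n))

    # Forward pass: keep only every `block`-th stage as a checkpoint.
    checkpoints = {0: initial}
    state = initial
    for i in range(1, n + 1):
        state = step(state)
        if i % block == 0:
            checkpoints[i] = state

    # Backward pass: rebuild one block at a time from its checkpoint.
    end = n
    while end >= 0:
        start = (end // block) * block
        states = [checkpoints[start]]
        for _ in range(end - start):
            states.append(step(states[-1]))
        for s in reversed(states):
            yield s
        end = start - 1


# Usage: stages of a simple counter, visited in reverse order.
visited = list(reverse_states(0, lambda s: s + 1, 10))
```

Each stage is computed at most twice, so the time cost roughly doubles while the memory held at any moment drops from n stages to about twice the square root of n.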
8.  Memory-efficient dynamic programming backtrace and pairwise local sequence alignment 
Bioinformatics  2008;24(16):1772-1778.
Motivation: A backtrace through a dynamic programming algorithm's intermediate results in search of an optimal path, or to sample paths according to an implied probability distribution, or as the second stage of a forward–backward algorithm, is a task of fundamental importance in computational biology. When there is insufficient space to store all intermediate results in high-speed memory (e.g. cache) existing approaches store selected stages of the computation, and recompute missing values from these checkpoints on an as-needed basis.
Results: Here we present an optimal checkpointing strategy, and demonstrate its utility with pairwise local sequence alignment of sequences of length 10 000.
Availability: Sample C++ code for optimal backtrace is available in the Supplementary Materials.
Contact: leen@cs.rpi.edu
Supplementary information: Supplementary data is available at Bioinformatics online.
doi:10.1093/bioinformatics/btn308
PMCID: PMC2668612  PMID: 18558620
9.  A phylogenetic Gibbs sampler that yields centroid solutions for cis-regulatory site prediction 
Bioinformatics (Oxford, England)  2007;23(14):1718-1727.
Motivation
Identification of functionally conserved regulatory elements in sequence data from closely related organisms is becoming feasible, due to the rapid growth of public sequence databases. Closely related organisms are most likely to have common regulatory motifs; however, the recent speciation of such organisms results in a high degree of correlation in their genome sequences, confounding the detection of functional elements. Additionally, alignment algorithms that use optimization techniques are limited to the detection of a single alignment that may not be representative. Comparative-genomics studies must be able to address the phylogenetic correlation in the data and efficiently explore the alignment space, in order to make specific and biologically relevant predictions.
Results
We describe here a Gibbs sampler that employs a full phylogenetic model and reports an ensemble centroid solution. We describe regulatory motif detection using both simulated and real data, and demonstrate that this approach achieves improved specificity, sensitivity, and positive predictive value over non-phylogenetic algorithms, and over phylogenetic algorithms that report a maximum likelihood solution.
doi:10.1093/bioinformatics/btm241
PMCID: PMC2268014  PMID: 17488758
10.  Mammalian Genomes Ease Location of Human DNA Functional Segments but Not Their Description 
Under the assumption that a significant motivation for sequencing the genomes of mammals is the resulting ability to help us locate and characterize functional DNA segments shared with humans, we have developed a statistical analysis to quantify the expected advantage. Examining uncertainty in terms of the width of a confidence interval, we show that uncertainty in the rate of nucleotide mutation can be shrunk by a factor of nearly four when nine mammals (human, chimpanzee, baboon, cat, dog, cow, pig, rat, and mouse) are used instead of just two (human and mouse). In contrast, we show confidence interval shrinkage by a factor of only 1.5 for measurements of the distribution of nucleotides at an aligned sequence site. These additional genomes should greatly help in identifying conserved DNA sites, but would be much less effective at precisely describing the expected pattern of nucleotides at those sites.
doi:10.2202/1544-6115.1065
PMCID: PMC1479771  PMID: 16646802
Comparative genomics; Sequence Conservation; Sequence pattern identification
11.  The Relative Inefficiency of Sequence Weights Approaches in Determining a Nucleotide Position Weight Matrix 
Approaches based upon sequence weights, to construct a position weight matrix of nucleotides from aligned inputs, are popular, but little effort has been expended to measure their quality.
We derive optimal sequence weights that minimize the sum of the variances of the estimators of base frequency parameters for sequences related by a phylogenetic tree. Using these we find that approaches based upon sequence weights can perform very poorly in comparison to approaches based upon a theoretically optimal maximum-likelihood method in the inference of the parameters of a position-weight matrix. Specifically, we find that among a collection of primate sequences, even an optimal sequence-weights approach is only 51% as efficient as the maximum-likelihood approach in inferences of base frequency parameters.
We also show how to employ the variance estimators to obtain a greedy ordering of species for sequencing. Application of this ordering for the weighted estimators to a primate collection yields a curve with a long plateau that is not observed with maximum-likelihood estimators. This plateau indicates that the use of weighted estimators on these data seriously limits the utility of obtaining the sequences of more than two or three additional species.
doi:10.2202/1544-6115.1135
PMCID: PMC1479456  PMID: 16646830
sequence weights; maximum likelihood; motifs; phylogeny; sequencing; consensus distribution
12.  The Gibbs Centroid Sampler 
Nucleic Acids Research  2007;35(Web Server issue):W232-W237.
The Gibbs Centroid Sampler is a software package designed for locating conserved elements in biopolymer sequences. The Gibbs Centroid Sampler reports a centroid alignment, i.e. an alignment that has the minimum total distance to the set of samples chosen from the a posteriori probability distribution of transcription factor binding-site alignments. In so doing, it garners information from the full ensemble of solutions, rather than only the single most probable point that is the target of many motif-finding algorithms, including its predecessor, the Gibbs Recursive Sampler. Centroid estimators have been shown to yield substantial improvements, in both sensitivity and positive predictive values, to the prediction of RNA secondary structure and motif finding. The Gibbs Centroid Sampler, along with interactive tutorials, an online user manual, and information on downloading the software, is available at: http://bayesweb.wadsworth.org/gibbs/gibbs.html.
doi:10.1093/nar/gkm265
PMCID: PMC1933196  PMID: 17483517
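The notion of a centroid, a solution with minimum total distance to a set of sampled solutions, can be illustrated with a small sketch. It assumes each sampled alignment is encoded as a binary vector (1 where a site is predicted, 0 elsewhere) and that distance is Hamming distance; under those assumptions, and ignoring any constraints on which vectors are valid alignments, the minimizer is a per-position majority vote. The encoding and function names are assumptions for this example and are not taken from the Gibbs Centroid Sampler itself.

```python
def centroid(samples):
    """Per-position majority vote over equal-length binary vectors.

    For Hamming distance this minimizes the total distance to the samples,
    because each position contributes to the total independently.
    """
    n = len(samples)
    length = len(samples[0])
    counts = [sum(s[j] for s in samples) for j in range(length)]
    return [1 if 2 * c > n else 0 for c in counts]


def total_distance(candidate, samples):
    """Sum of Hamming distances from candidate to every sample."""
    return sum(sum(a != b for a, b in zip(candidate, s)) for s in samples)


# Hypothetical samples: three sampled site-indicator vectors of length 4.
toy_samples = [[1, 0, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0]]
toy_centroid = centroid(toy_samples)
```

Unlike the single most probable sample, the centroid keeps only those sites that a majority of the sampled solutions agree on.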
13.  PhyloScan: identification of transcription factor binding sites using cross-species evidence 
Background
When transcription factor binding sites are known for a particular transcription factor, it is possible to construct a motif model that can be used to scan sequences for additional sites. However, few statistically significant sites are revealed when a transcription factor binding site motif model is used to scan a genome-scale database.
Methods
We have developed a scanning algorithm, PhyloScan, which combines evidence from matching sites found in orthologous data from several related species with evidence from multiple sites within an intergenic region, to better detect regulons. The orthologous sequence data may be multiply aligned, unaligned, or a combination of aligned and unaligned. In aligned data, PhyloScan statistically accounts for the phylogenetic dependence of the species contributing data to the alignment and, in unaligned data, the evidence for sites is combined assuming phylogenetic independence of the species. The statistical significance of the gene predictions is calculated directly, without employing training sets.
Results
In a test of our methodology on synthetic data modeled on seven Enterobacteriales, four Vibrionales, and three Pasteurellales species, PhyloScan produces better sensitivity and specificity than MONKEY, an advanced scanning approach that also searches a genome for transcription factor binding sites using phylogenetic information. The application of the algorithm to real sequence data from seven Enterobacteriales species identifies novel Crp and PurR transcription factor binding sites, thus providing several new potential sites for these transcription factors. These sites enable targeted experimental validation and thus further delineation of the Crp and PurR regulons in E. coli.
Conclusion
Better sensitivity and specificity can be achieved through a combination of (1) using mixed alignable and non-alignable sequence data and (2) combining evidence from multiple sites within an intergenic region.
doi:10.1186/1748-7188-2-1
PMCID: PMC1794230  PMID: 17244358
