1.  Improving Retrieval Efficacy of Homology Searches using the False Discovery Rate 
Over the past few decades, discovery based on sequence homology has become a widely accepted practice. Consequently, comparative accuracy of retrieval algorithms (e.g., BLAST) has been rigorously studied for improvement. Unlike most components of retrieval algorithms, the E-value threshold criterion has yet to be thoroughly investigated. An investigation of the threshold is important as it exclusively dictates which sequences are declared relevant and irrelevant. In this paper, we introduce the false discovery rate (FDR) statistic as a replacement for the uniform threshold criterion in order to improve efficacy in retrieval systems. Using NCBI’s BLAST and PSI-BLAST software packages, we demonstrate the applicability of such a replacement in both non-iterative (BLASTFDR) and iterative (PSI-BLASTFDR) homology searches. For each application, we performed an evaluation of retrieval efficacy with five different multiple testing methods on a large training database. For each algorithm, we choose the best performing method, Benjamini-Hochberg, as the default statistic. As measured by the Threshold Average Precision, BLASTFDR yielded 14.1% better retrieval performance than BLAST on a large (5,161 queries) test database and PSI-BLASTFDR attained 11.8% better retrieval performance than PSI-BLAST. The C++ source code specific to BLASTFDR and PSI-BLASTFDR and instructions are available at
PMCID: PMC4568567  PMID: 26357264
Homology search; false discovery rate; retrieval efficacy; uniform E-value thresholding
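The Benjamini-Hochberg procedure named in the abstract above can be sketched generically as follows. This is an illustration in Python, not the authors' C++ code; the function name `benjamini_hochberg`, the level `alpha`, and the sample P-values are assumptions made for this example, and it does not show how BLASTFDR derives P-values from alignment scores.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of hypotheses rejected by the BH step-up rule."""
    m = len(p_values)
    # Rank the hypotheses by ascending P-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= k * alpha / m.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            cutoff = rank
    # Reject the hypotheses holding the `cutoff` smallest P-values.
    return sorted(order[:cutoff])


# Hypothetical P-values for eight database hits of one query.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(example, alpha=0.05))
```

Unlike a uniform threshold, the cutoff here depends on the whole list of P-values returned for a query, so the same hit can be declared relevant in one result list and irrelevant in another.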
2.  Frameshift alignment: statistics and post-genomic applications 
Bioinformatics  2014;30(24):3575-3582.
Motivation: The alignment of DNA sequences to proteins, allowing for frameshifts, is a classic method in sequence analysis. It can help identify pseudogenes (which accumulate mutations), analyze raw DNA and RNA sequence data (which may have frameshift sequencing errors), investigate ribosomal frameshifts, etc. Often, however, only ad hoc approximations or simulations are available to provide the statistical significance of a frameshift alignment score.
Results: We describe a method to estimate statistical significance of frameshift alignments, similar to classic BLAST statistics. (BLAST presently does not permit its alignments to include frameshifts.) We also illustrate the continuing usefulness of frameshift alignment with two ‘post-genomic’ applications: (i) when finding pseudogenes within the human genome, frameshift alignments show that most anciently conserved non-coding human elements are recent pseudogenes with conserved ancestral genes; and (ii) when analyzing metagenomic DNA reads from polluted soil, frameshift alignments show that most alignable metagenomic reads contain frameshifts, suggesting that metagenomic analysis needs to use frameshift alignment to derive accurate results.
Availability and implementation: The statistical calculation is available in FALP, and giga-scale frameshift alignment is available in LAST.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4253828  PMID: 25172925
3.  Conserved Molecular Signatures in gp120 Are Associated with the Genetic Bottleneck during Simian Immunodeficiency Virus (SIV), SIV-Human Immunodeficiency Virus (SHIV), and HIV Type 1 (HIV-1) Transmission 
Journal of Virology  2015;89(7):3619-3629.
Human immunodeficiency virus (HIV) transmission typically results from infection by a single transmitted/founder (T/F) variant. Are T/F variants chosen uniformly at random from the donor pool, or are they selected based on advantageous traits facilitating transmission? Finding evidence for selection during transmission is of particular interest, because it would indicate that phenotypic and/or genetic properties of the viruses might be harnessed as potential vaccine targets or immunotherapies. Here, we systematically evaluated the differences between the Env proteins of simian immunodeficiency virus/simian HIV (SIV/SHIV) stock and T/F variants in search of “signature” sites of transmission. We also surveyed residue preferences in HIV at the SIV/SHIV signature sites. Four sites of gp120 showed significant selection, and an additional two sites showed a similar trend. Therefore, the six sites clearly differentiate T/F viruses from the majority of circulating variants in the stocks. The selection of SIV/SHIV could be inferred reasonably across both vaccinated and unvaccinated subjects, with infections resulting from vaginal, rectal, and intravenous routes of transmission and regardless of viral dosage. The evidence for selection in SIV and SHIV T/F variants is strong and plentiful, and in HIV the evidence is suggestive though commensurate with the availability of suitable data for analysis. Two of the signature residues are completely conserved across the SIV, SHIV, and HIV variants we examined. Five of the signature residues map to the C1 region of gp120 and one to the signal peptide. Our data raise the possibility that C1, while governing the association between gp120 and gp41, modulates transmission efficiency, replicative fitness, and/or host cell tropism at the level of virus-cell attachment and entry.
IMPORTANCE The present study finds significant evidence of selection on gp120 molecules of SIV/SHIV T/F viruses. The data provide ancillary evidence suggesting the same sites are under selection in HIV. Our findings suggest that the signature residues are involved in increasing the transmissibility of infecting viruses; therefore, they are potential targets for developing a vaccine or other protective measures. A recent study identified the same T/F signature motif but interpreted it as an effect of neutralization resistance. Here, we show that the T/F motif has broader functional significance beyond neutralization sensitivity, because it is present in nonimmune subjects. Also, a vaccine regimen popular in animal trials might have increased the transmission of variants with otherwise low transmission fitness. Our observations might explain why many animal vaccine trials have not faithfully predicted outcomes in human vaccine trials and suggest that current practices in vaccine design need to be reexamined accordingly.
PMCID: PMC4403421  PMID: 25589663
4.  Within a Sample from a Population, the Distribution of the Number of Descendants of a Subsample’s Most Recent Common Ancestor 
Sample n individuals uniformly at random from a population, and then sample m individuals uniformly at random from the sample. Consider the most recent common ancestor (MRCA) of the subsample of m individuals. Let the subsample MRCA have j descendants in the sample (m ≤ j ≤ n). Under a Moran or coalescent model (and therefore under many other models), the probability that j = n is known. In this case, the subsample MRCA is an ancestor of every sampled individual, and the subsample and sample MRCAs are identical. The probability that j = m is also known. In this case, the subsample MRCA is an ancestor of no sampled individual outside the subsample. This article derives the complete distribution of j, enabling inferences from the corresponding p-value. The text presents hypothetical statistical applications pertinent to taxonomy (the gene flow between Neanderthals and anatomically modern humans) and medicine (the association of genetic markers with disease).
PMCID: PMC3944932  PMID: 24321308
Most Recent Common Ancestor of a Subsample; Coalescent Theory
5.  A New Method for Estimating Species Age Supports the Coexistence of Malaria Parasites and Their Mammalian Hosts 
Molecular Biology and Evolution  2015;32(5):1354-1364.
Species in the genus Plasmodium cause malaria in humans and infect a variety of mammals and other vertebrates. Currently, estimated ages for several mammalian Plasmodium parasites differ by as much as one order of magnitude, an inaccuracy that frustrates reliable estimation of evolutionary rates of disease-related traits. We developed a novel statistical approach to dating the relative age of evolutionary lineages, based on Total Least Squares regression. We validated this lineage dating approach by applying it to the genus Drosophila. Using data from the Drosophila 12 Genomes project, our approach accurately reconstructs the age of well-established Drosophila clades, including the speciation event that led to the subgenera Drosophila and Sophophora, and age of the melanogaster species subgroup. We applied this approach to hundreds of loci from seven mammalian Plasmodium species. We demonstrate the existence of a molecular clock specific to individual Plasmodium proteins, and estimate the relative age of mammalian-infecting Plasmodium. These analyses indicate that: 1) the split between the human parasite Plasmodium vivax and P. knowlesi, from Old World monkeys, occurred 6.1 times earlier than that between P. falciparum and P. reichenowi, parasites of humans and chimpanzees, respectively; and 2) mammalian Plasmodium parasites originated 22 times earlier than the split between P. falciparum and P. reichenowi. Calibrating the absolute divergence times for Plasmodium with eukaryotic substitution rates, we show that the split between P. falciparum and P. reichenowi occurred 3.0–5.5 Ma, and that mammalian Plasmodium parasites originated over 64 Ma. Our results indicate that mammalian-infecting Plasmodium evolved contemporaneously with their hosts, with little evidence for parasite host-switching on an evolutionary scale, and provide a solid timeframe within which to place the evolution of new Plasmodium species.
PMCID: PMC4408405  PMID: 25589738
Plasmodium; molecular clock; speciation dates; total least squares; regression; malaria; Drosophila
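As background for the Total Least Squares regression mentioned in the abstract above, the following is a minimal sketch of orthogonal regression for a straight line in two variables. It illustrates the general technique only, not the authors' dating method; the function name `orthogonal_regression` and the sample points are assumptions, and the closed form requires a nonzero sample covariance.

```python
import math


def orthogonal_regression(xs, ys):
    """Fit y = intercept + slope * x by minimizing perpendicular distances."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Centered sums of squares and cross-products.
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if s_xy == 0:
        raise ValueError("slope is undefined when the sample covariance is zero")
    # Closed-form slope of the total least squares line.
    diff = s_yy - s_xx
    slope = (diff + math.sqrt(diff * diff + 4 * s_xy * s_xy)) / (2 * s_xy)
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical points lying exactly on y = 2x + 1.
print(orthogonal_regression([0, 1, 2, 3], [1, 3, 5, 7]))
```

Ordinary least squares treats only y as noisy, whereas this fit treats errors in both coordinates symmetrically, which is the usual reason to prefer it when both variables are measured quantities.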
6.  Searching for Repeats, as an Example of Using the Generalized Ruzzo-Tompa Algorithm to Find Optimal Subsequences with Gaps 
Some biological sequences contain subsequences of unusual composition, e.g., some proteins contain DNA binding domains, transmembrane regions, and charged regions; and some DNA sequences contain repeats. Requiring time linear in the length of an input sequence, the Ruzzo-Tompa (RT) Algorithm finds subsequences of unusual composition, using a sequence of scores as input and the corresponding “maximal segments” as output. (Loosely, maximal segments are the contiguous subsequences having greatest total score.) Just as gaps improved the sensitivity of BLAST, in principle gaps could help tune other tools, to improve sensitivity when searching for subsequences of unusual composition.
Call a graph whose vertices are totally ordered a “totally ordered graph”. In a totally ordered graph, call a path whose vertices are in increasing order an “increasing path”. The input of the RT Algorithm can be generalized to a finite, totally ordered, weighted graph, so the algorithm then locates maximal segments, corresponding to increasing paths of maximal weight. The generalization permits penalized deletion of unfavorable letters from contiguous subsequences, so the generalized Ruzzo-Tompa algorithm can find subsequences with greatest total gapped scores. The search for inexact simple repeats in DNA exemplifies some of the concepts. For some limited types of repeats, RepWords, a repeat-finding tool based on the principled use of the Ruzzo-Tompa algorithm, performed better than a similar extant tool.
With minimal programming effort, the generalization of the Ruzzo-Tompa algorithm given in this article could improve the performance of many programs for finding biological subsequences of unusual composition.
PMCID: PMC4135518  PMID: 24989859
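For orientation, the original (ungapped) Ruzzo-Tompa algorithm on a plain sequence of scores can be sketched as below. This is not the generalized graph version described in the abstract, and the function name `maximal_segments` is an assumption. For brevity the sketch scans the candidate list backwards instead of keeping the back-pointers that give the linear running time.

```python
def maximal_segments(scores):
    """Return all maximal scoring segments as (start, end, total), end exclusive."""
    segments = []  # each entry is [start, end, L, R]
    total = 0      # cumulative score of everything read so far
    for i, s in enumerate(scores):
        left = total   # cumulative score before position i
        total += s     # cumulative score through position i
        if s <= 0:
            continue   # non-positive scores never start a segment
        cand = [i, i + 1, left, total]
        while True:
            # Find the rightmost stored segment j with L_j < L of the candidate.
            j = len(segments) - 1
            while j >= 0 and segments[j][2] >= cand[2]:
                j -= 1
            if j < 0 or segments[j][3] >= cand[3]:
                segments.append(cand)
                break
            # Otherwise extend the candidate left to the start of segment j,
            # discard segments j onwards, and reconsider the merged candidate.
            cand = [segments[j][0], cand[1], segments[j][2], cand[3]]
            del segments[j:]
    return [(start, end, r - l) for start, end, l, r in segments]


# Hypothetical score sequence used only to exercise the function.
print(maximal_segments([4, -5, 3, -3, 1, 2, -2, 2, -2, 1, 5]))
```

Each stored entry keeps the cumulative score before its first element (L) and through its last element (R); comparing these two numbers is what lets the algorithm decide whether a new positive score starts its own segment or merges with earlier ones.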
7.  NEXT-peak: a normal-exponential two-peak model for peak-calling in ChIP-seq data 
BMC Genomics  2013;14:349.
Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) can locate transcription factor binding sites on a genomic scale. Although many models and programs are available to call peaks, none has dominated its competition in comparison studies.
We propose a rigorous statistical model, the normal-exponential two-peak (NEXT-peak) model, which parallels the physical processes generating the empirical data, and which can naturally incorporate mappability information. The model therefore estimates total strength of binding (even if some binding locations do not map uniquely into a reference genome, effectively censoring them); it also assigns an error to an estimated binding location. The comparison study with existing programs on real ChIP-seq datasets (STAT1, NRSF, and ZNF143) demonstrates that the NEXT-peak model performs well both in calling peaks and locating them. The model also provides a goodness-of-fit test, to screen out spurious peaks and to infer multiple binding events in a region.
The NEXT-peak program calls peaks on any test dataset about as accurately as any other, but provides unusual accuracy in the estimated location of the peaks it calls. NEXT-peak is based on rigorous statistics, so its model also provides a principled foundation for a more elaborate statistical analysis of ChIP-seq data.
PMCID: PMC3672025  PMID: 23706083
ChIP-seq; Normal-exponential distribution; Continuous mixture; Poisson regression; Goodness-of-fit
8.  Domain analysis of symbionts and hosts (DASH) in a genome-wide survey of pathogenic human viruses 
BMC Research Notes  2013;6:209.
In the coevolution of viruses and their hosts, viruses often capture host genes, gaining advantageous functions (e.g. immune system control). Identifying functional similarities shared by viruses and their hosts can help decipher mechanisms of pathogenesis and accelerate virus-targeted drug and vaccine development. Cellular homologs in viruses are usually documented using pairwise-sequence comparison methods. Yet, pairwise-sequence searches have limited sensitivity resulting in poor identification of divergent homologies.
Methods based on profiles from multiple sequences provide a more sensitive alternative to identify similarities in host-pathogen systems. The present work describes a profile-based bioinformatics pipeline that we call the Domain Analysis of Symbionts and Hosts (DASH). DASH provides a web platform for the functional analysis of viral and host genomes. This study uses Human Herpesvirus 8 (HHV-8) as a model to validate the methodology. Our results indicate that HHV-8 shares at least 29% of its genes with humans (fourteen immunomodulatory and ten metabolic genes). DASH also suggests functions for fifty-one additional HHV-8 structural and metabolic proteins. We also perform two other comparative genomics studies of human viruses: (1) a broad survey of eleven viruses of disparate sizes and transcription strategies; and (2) a closer examination of forty-one viruses of the order Mononegavirales. In the survey, DASH detects human homologs in 4/5 DNA viruses. None of the non-retro-transcribing RNA viruses in the survey showed evidence of homology to humans. The order Mononegavirales are also non-retro-transcribing RNA viruses, however, and DASH found homology in 39/41 of them. Mononegaviruses display larger fractions of human similarities (up to 75%) than any of the other RNA or DNA viruses (up to 55% and 29% respectively).
We conclude that gene sharing probably occurs between humans and both DNA and RNA viruses, in viral genomes of differing sizes, regardless of transcription strategies. Our method (DASH) simultaneously analyzes the genomes of two interacting species thereby mining functional information to identify shared as well as exclusive domains to each organism. Our results validate our approach, showing that DASH has potential as a pipeline for making therapeutic discoveries in other host-symbiont systems. DASH results are available at
PMCID: PMC3672079  PMID: 23706066
Functional similarity; Domain similarity; Host-pathogen interactions; Gene transfer; Comparative genomics; Cellular homolog; Host-virus coevolution
9.  Fungi in Thailand: A Case Study of the Efficacy of an ITS Barcode for Automatically Identifying Species within the Annulohypoxylon and Hypoxylon Genera 
PLoS ONE  2013;8(2):e54529.
Thailand, a part of the Indo-Burma biodiversity hotspot, has many endemic animals and plants. Some of its fungal species are difficult to recognize and separate, complicating assessments of biodiversity. We assessed species diversity within the fungal genera Annulohypoxylon and Hypoxylon, which produce biologically active and potentially therapeutic compounds, by applying classical taxonomic methods to 552 teleomorphs collected from across Thailand. Using probability of correct identification (PCI), we also assessed the efficacy of automated species identification with a fungal barcode marker, ITS, in the model system of Annulohypoxylon and Hypoxylon. The 552 teleomorphs yielded 137 ITS sequences; in addition, we examined 128 GenBank ITS sequences, to assess biases in evaluating a DNA barcode with GenBank data. The use of multiple sequence alignment in a barcode database like BOLD raises some concerns about non-protein barcode markers like ITS, so we also compared species identification using different alignment methods. Our results suggest the following. (1) Multiple sequence alignment of ITS sequences is competitive with pairwise alignment when identifying species, so BOLD should be able to preserve its present bioinformatics workflow for species identification for ITS, and possibly therefore with at least some other non-protein barcode markers. (2) Automated species identification is insensitive to a specific choice of evolutionary distance, contributing to resolution of a current debate in DNA barcoding. (3) Statistical methods are available to address, at least partially, the possibility of expert misidentification of species. Phylogenetic trees discovered a cryptic species and strongly supported monophyletic clades for many Annulohypoxylon and Hypoxylon species, suggesting that ITS can contribute usefully to a barcode for these fungi. 
The PCIs here, derived solely from ITS, suggest that a fungal barcode will require secondary markers in Annulohypoxylon and Hypoxylon, however. The URL contains computer programs and other supplementary material relevant to this article.
PMCID: PMC3563529  PMID: 23390499
10.  CBOL Protist Working Group: Barcoding Eukaryotic Richness beyond the Animal, Plant, and Fungal Kingdoms 
PLoS Biology  2012;10(11):e1001419.
A group of protist experts proposes a two-step DNA barcoding approach, comprising a universal eukaryotic pre-barcode followed by group-specific barcodes, to unveil the hidden biodiversity of microbial eukaryotes.
PMCID: PMC3491025  PMID: 23139639
11.  Coalescent: an open-source and scalable framework for exact calculations in coalescent theory 
BMC Bioinformatics  2012;13:257.
Currently, there is no open-source, cross-platform and scalable framework for coalescent analysis in population genetics. There is no scalable GUI based user application either. Such a framework and application would not only drive the creation of more complex and realistic models but also make them truly accessible.
As a first attempt, we built a framework and user application for the domain of exact calculations in coalescent analysis. The framework provides an API with the concepts of model, data, statistic, phylogeny, gene tree and recursion. Infinite-alleles and infinite-sites models are considered. It defines pluggable computations such as counting and listing all the ancestral configurations and genealogies and computing the exact probability of data. It can visualize a gene tree, trace and visualize the internals of the recursion algorithm for further improvement and attach dynamically a number of output processors. The user application defines jobs in a plug-in like manner so that they can be activated, deactivated, installed or uninstalled on demand. Multiple jobs can be run and their inputs edited. Job inputs are persisted across restarts and running jobs can be cancelled where applicable.
Coalescent theory plays an increasingly important role in analysing molecular population genetic data. Models involved are mathematically difficult and computationally challenging. An open-source, scalable framework that lets users immediately take advantage of the progress made by others will enable exploration of yet more difficult and realistic models. As models become more complex and mathematically less tractable, the need for an integrated computational approach is obvious. Object-oriented designs, though they have upfront costs, are practical now and can provide such an integrated approach.
PMCID: PMC3575375  PMID: 23033878
Population genetics; Object oriented design; Framework; Java; Netbeans platform; Coalescent; Recursion; Exact calculation
12.  The Practical Evaluation of DNA Barcode Efficacy* 
This chapter describes a workflow for measuring the efficacy of a barcode in identifying species. First, assemble individual sequence databases corresponding to each barcode marker. A controlled collection of taxonomic data is preferable to GenBank data, because GenBank data can be problematic, particularly when comparing barcodes based on more than one marker. To ensure proper controls when evaluating species identification, specimens not having a sequence in every marker database should be discarded. Second, select a computer algorithm for assigning species to barcode sequences. No algorithm has yet improved notably on assigning a specimen to the species of its nearest neighbor within a barcode database. Because global sequence alignments (e.g., with the Needleman–Wunsch algorithm, or some related algorithm) examine entire barcode sequences, they generally produce better species assignments than local sequence alignments (e.g., with BLAST). No neighboring method (e.g., global sequence similarity, global sequence distance, or evolutionary distance based on a global alignment) has yet shown a notable superiority in identifying species. Finally, “the probability of correct identification” (PCI) provides an appropriate measurement of barcode efficacy. The overall PCI for a data set is the average of the species PCIs, taken over all species in the data set. This chapter states explicitly how to calculate PCI, how to estimate its statistical sampling error, and how to use data on PCR failure to set limits on how much improvements in PCR technology can improve species identification.
PMCID: PMC3410705  PMID: 22684965
Barcode efficacy in species identification; Probability of correct identification; DNA barcode
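The nearest-neighbor workflow summarized in the abstract above can be sketched as follows. The chapter's exact PCI definition is not reproduced in this listing, so the sketch assumes that a species PCI is the fraction of that species' specimens whose nearest other specimen (leave-one-out) belongs to the same species. The names `hamming` and `overall_pci`, the mismatch-count distance on pre-aligned equal-length sequences, and the toy sequences are all hypothetical.

```python
def hamming(a, b):
    """Count mismatches between two aligned sequences of equal length."""
    return sum(x != y for x, y in zip(a, b))


def overall_pci(specimens, distance=hamming):
    """specimens: list of (species, aligned_sequence) pairs.

    Returns (overall PCI, per-species PCI) under the stated assumption.
    """
    correct, count = {}, {}
    for i, (species, seq) in enumerate(specimens):
        # Distances from this specimen to every other specimen.
        others = [(distance(seq, s), sp)
                  for j, (sp, s) in enumerate(specimens) if j != i]
        # Ties in distance are broken by species name, an arbitrary choice.
        nearest_species = min(others)[1]
        count[species] = count.get(species, 0) + 1
        correct[species] = correct.get(species, 0) + (nearest_species == species)
    species_pci = {sp: correct[sp] / count[sp] for sp in count}
    # The overall PCI averages over species, not over specimens.
    return sum(species_pci.values()) / len(species_pci), species_pci


toy = [("A", "AAAAAA"), ("A", "AAAAAT"),
       ("B", "CCCCCC"), ("B", "CCCCCG"),
       ("C", "AAATTT")]
print(overall_pci(toy))
```

Averaging over species rather than specimens keeps heavily sampled species from dominating the result; in this toy data the singleton species "C" can never be matched to itself, which lowers the overall figure.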
13.  New finite-size correction for local alignment score distributions 
BMC Research Notes  2012;5:286.
Local alignment programs often calculate the probability that a match occurred by chance. The calculation of this probability may require a “finite-size” correction to the lengths of the sequences, as an alignment that starts near the end of either sequence may run out of sequence before achieving a significant score.
We present an improved finite-size correction that considers the distribution of sequence lengths rather than simply the corresponding means. This approach improves sensitivity and avoids substituting an ad hoc length for short sequences that can underestimate the significance of a match. We use a test set derived from ASTRAL to show improved ROC scores, especially for shorter sequences.
The new finite-size correction improves the calculation of probabilities for a local alignment. It is now used in the BLAST+ package and at the NCBI BLAST web site.
PMCID: PMC3483159  PMID: 22691307
14.  Objective method for estimating asymptotic parameters, with an application to sequence alignment 
Sequence alignment is an indispensable computational tool in modern molecular biology. The model underlying biological sequence alignment is of interest to physicists because it approximates the statistical mechanics of DNA and protein annealing, while bearing an intimate relationship to models of directed polymers in random media. Recent methods for determining the statistics of random sequence alignments have reduced the computation time to less than 1 s, opening up some interesting possibilities for online computation with biological search engines. Before implementation, however, the methods required an objective technique for computing regression coefficients pertinent to an asymptotic regime. Typically, physicists estimate parameters pertinent to an asymptotic regime subjectively: They eyeball their data; estimate the asymptotic regime where the regression model holds with reasonable accuracy; and then regress data only within the estimated asymptotic regime. Our publicly available computer program ARRP replaces the subjective assessment of the asymptotic regime with an objective change-point detection method, increasing confidence in the scientific objectivity of the parameter estimates. Asymptotic regression has potential applications across most of physics.
PMCID: PMC3233989  PMID: 22060410
15.  Finding functional sequence elements by multiple local alignment 
Nucleic Acids Research  2004;32(1):189-200.
Algorithms that detect and align locally similar regions of biological sequences have the potential to discover a wide variety of functional motifs. Two theoretical contributions to this classic but unsolved problem are presented here: a method to determine the width of the aligned motif automatically; and a technique for calculating the statistical significance of alignments, i.e. an assessment of whether the alignments are stronger than those that would be expected to occur by chance among random, unrelated sequences. Upon exploring variants of the standard Gibbs sampling technique to optimize the alignment, we discovered that simulated annealing approaches perform more efficiently. Finally, we conduct failure tests by applying the algorithm to increasingly difficult test cases, and analyze the manner of and reasons for eventual failure. Detection of transcription factor-binding motifs is limited by the motifs’ intrinsic subtlety rather than by inadequacy of the alignment optimization procedure.
PMCID: PMC373279  PMID: 14704356
16.  Threshold Average Precision (TAP-k): a measure of retrieval designed for bioinformatics 
Bioinformatics  2010;26(14):1708-1713.
Motivation: Since database retrieval is a fundamental operation, the measurement of retrieval efficacy is critical to progress in bioinformatics. This article points out some issues with current methods of measuring retrieval efficacy and suggests some improvements. In particular, many studies have used the pooled receiver operating characteristic for n irrelevant records (ROCn) score, the area under the ROC curve (AUC) of a ‘pooled’ ROC curve, truncated at n irrelevant records. Unfortunately, the pooled ROCn score does not faithfully reflect actual usage of retrieval algorithms. Additionally, a pooled ROCn score can be very sensitive to retrieval results from as little as a single query.
Methods: To replace the pooled ROCn score, we propose the Threshold Average Precision (TAP-k), a measure closely related to the well-known average precision in information retrieval, but reflecting the usage of E-values in bioinformatics. Furthermore, in addition to conditions previously given in the literature, we introduce three new criteria that an ideal measure of retrieval efficacy should satisfy.
Results: PSI-BLAST, GLOBAL, HMMER and RPS-BLAST provided examples of using the TAP-k and pooled ROCn scores to evaluate sequence retrieval algorithms. In particular, compelling examples using real data highlight the drawbacks of the pooled ROCn score, showing that it can produce evaluations skewing far from intuitive expectations. In contrast, the TAP-k satisfies most of the criteria desired in an ideal measure of retrieval efficacy.
Availability and Implementation: The TAP-k web server and downloadable Perl script are freely available at
Supplementary Information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2894514  PMID: 20505002
Annals of statistics  2009;37(6A):3697.
The gapped local alignment score of two random sequences follows a Gumbel distribution. If computers could estimate the parameters of the Gumbel distribution within one second, the use of arbitrary alignment scoring schemes could increase the sensitivity of searching biological sequence databases over the web. Accordingly, this article gives a novel equation for the scale parameter of the relevant Gumbel distribution. We speculate that the equation is exact, although present numerical evidence is limited. The equation involves ascending ladder variates in the global alignment of random sequences. In global alignment simulations, the ladder variates yield stopping times specifying random sequence lengths. Because of the random lengths, and because our trial distribution for importance sampling occurs on a different sample space from our target distribution, our study led to a mapping theorem, which led naturally in turn to an efficient dynamic programming algorithm for the importance sampling weights. Numerical studies using several popular alignment scoring schemes then examined the efficiency and accuracy of the resulting simulations.
PMCID: PMC2818155  PMID: 20148197
Gumbel scale parameter estimation; gapped sequence alignment; importance sampling; stopping time; Markov renewal process; Markov additive process
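To show where the Gumbel scale parameter enters in practice, the sketch below evaluates the standard Karlin-Altschul approximation, in which a score S between sequences of lengths m and n has E-value K·m·n·exp(−λS) and P-value 1 − exp(−E). It does not implement the article's estimation method; the function name `gumbel_p_value` and the numeric values of λ and K are illustrative assumptions, and no finite-size correction is applied.

```python
import math


def gumbel_p_value(score, m, n, lam, k):
    """P-value of a local alignment score under the Gumbel approximation."""
    # Expected number of chance alignments scoring at least `score`.
    e_value = k * m * n * math.exp(-lam * score)
    # 1 - exp(-E), computed with expm1 so that tiny E-values keep precision.
    return -math.expm1(-e_value)


# Illustrative inputs: a 300-residue query against 1,000,000 database residues.
print(gumbel_p_value(score=60, m=300, n=1_000_000, lam=0.267, k=0.041))
```

Because λ sits in the exponent, a small error in its estimate changes the reported significance multiplicatively, which is why fast and accurate estimation of λ matters for arbitrary scoring schemes.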
18.  Promoter Analysis: Gene Regulatory Motif Identification with A-GLAM 
Reliable detection of cis-regulatory elements in promoter regions is a difficult and unsolved problem in computational biology. The intricacy of transcriptional regulation in higher eukaryotes, primarily in metazoans, could be a major driving force of organismal complexity. Eukaryotic genome annotations have improved greatly due to large-scale characterization of full-length cDNAs, transcriptional start sites (TSSs), and comparative genomics. Regulatory elements are identified in promoter regions using a variety of enumerative or alignment-based methods. Here we present a survey of recent computational methods for eukaryotic promoter analysis and describe the use of an alignment-based method implemented in the A-GLAM program.
PMCID: PMC2702474  PMID: 19378149
Promoter regions; transcription factor binding sites; enumerative methods; promoter comparison
19.  Inequalities on the Overshoot beyond a Boundary for Independent Summands with Differing Distributions 
Statistics & probability letters  2007;77(14):1486-1489.
Let $\{S_n : n \ge 0\}$ ($S_0 = 0$) denote the successive sums of independent non-negative random variates, of possibly differing distributions. Define: (1) the number $N(b) = \inf\{n \ge 0 : S_n > b\}$ of sums in the interval $[0, b]$; and (2) the overshoot $R_b = S_{N(b)} - b$. This paper bounds the tail $\mathbb{P}\{R_b > c\}$ and the moments $\mathbb{E} R_b^k$.
PMCID: PMC2683021  PMID: 19461943
Renewal theory; excess over the boundary; Lorden's inequality
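The two quantities defined in the abstract above, the first-passage index N(b) and the overshoot R_b, can be computed for one realized sequence of summands as in the sketch below. The function name `first_passage` and the sample values are assumptions; the sketch illustrates the definitions only and does not evaluate the paper's bounds.

```python
def first_passage(summands, b):
    """Return (N(b), R_b) for non-negative summands, or None if no sum exceeds b."""
    total = 0  # S_0 = 0
    if total > b:
        return 0, total - b  # a negative boundary is already exceeded at n = 0
    for n, x in enumerate(summands, start=1):
        total += x  # S_n = S_{n-1} + x_n
        if total > b:
            # First n with S_n > b; the overshoot is the excess over b.
            return n, total - b
    return None  # the supplied summands never cross the boundary


# Partial sums are 1, 3, 6, 10, so the boundary b = 5 is first exceeded at n = 3.
print(first_passage([1, 2, 3, 4], b=5))
```

Repeating this over many simulated sequences would give an empirical distribution of R_b against which tail or moment bounds could be compared.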
20.  The whole alignment and nothing but the alignment: the problem of spurious alignment flanks 
Nucleic Acids Research  2008;36(18):5863-5871.
Pairwise sequence alignment is a ubiquitous tool for inferring the evolution and function of DNA, RNA and protein sequences. It is therefore essential to identify alignments arising by chance alone, i.e. spurious alignments. On one hand, if an entire alignment is spurious, statistical techniques for identifying and eliminating it are well known. On the other hand, if only a part of the alignment is spurious, elimination is much more problematic. In practice, even the sizes and frequencies of spurious subalignments remain unknown. This article shows that some common scoring schemes tend to overextend alignments and generate spurious alignment flanks up to hundreds of base pairs/amino acids in length. In the UCSC genome database, e.g. spurious flanks probably comprise >18% of the human–fugu genome alignment. To evaluate the possibility that chance alone generated a particular flank on a particular pairwise alignment, we provide a simple ‘overalignment’ P-value. The overalignment P-value can identify spurious alignment flanks, thereby eliminating potentially misleading inferences about evolution and function. Moreover, by explicitly demonstrating the tradeoff between over- and under-alignment, our methods guide the rational choice of scoring schemes for various alignment tasks.
PMCID: PMC2566872  PMID: 18796526
21.  Finding sequence motifs with Bayesian models incorporating positional information: an application to transcription factor binding sites 
BMC Bioinformatics  2008;9:262.
Biologically active sequence motifs often have positional preferences with respect to a genomic landmark. For example, many known transcription factor binding sites (TFBSs) occur within an interval [-300, 0] bases upstream of a transcription start site (TSS). Although some programs for identifying sequence motifs exploit positional information, most of them model it only implicitly and with ad hoc methods, making them unsuitable for general motif searches.
A-GLAM, a user-friendly computer program for identifying sequence motifs, now incorporates a Bayesian model systematically combining sequence and positional information. A-GLAM's predictions with and without positional information were compared on two human TFBS datasets, each containing sequences corresponding to the interval [-2000, 0] bases upstream of a known TSS. A rigorous statistical analysis showed that positional information significantly improved the prediction of sequence motifs, and an extensive cross-validation study showed that A-GLAM's model was robust against mild misspecification of its parameters. As expected, when sequences in the datasets were successively truncated to the intervals [-1000, 0], [-500, 0] and [-250, 0], positional information aided motif prediction less and less, but never hurt it significantly.
Although sequence truncation is a viable strategy when searching for biologically active motifs with a positional preference, a probabilistic model (used reasonably) generally provides a superior and more robust strategy, particularly when the sequence motifs' positional preferences are not well characterized.
PMCID: PMC2432075  PMID: 18533028
22.  The biological function of some human transcription factor binding motifs varies with position relative to the transcription start site 
Nucleic Acids Research  2008;36(8):2777-2786.
A number of previous studies have predicted transcription factor binding sites (TFBSs) by exploiting the position of genomic landmarks like the transcriptional start site (TSS). The studies’ methods are generally too computationally intensive for genome-scale investigation, so the full potential of ‘positional regulomics’ to discover TFBSs and determine their function remains unknown. Because databases often annotate the genomic landmarks in DNA sequences, the methodical exploitation of positional regulomics has become increasingly urgent. Accordingly, we examined a set of 7914 human putative promoter regions (PPRs) with a known TSS. Our methods identified 1226 eight-letter DNA words with significant positional preferences with respect to the TSS, of which only 608 matched known TFBSs. Nevertheless, many groups of genes whose PPRs contained a common word displayed similar expression profiles and related biological functions. Most interestingly, our results included 78 words, each of which clustered significantly in two or three different positions relative to the TSS. Often, the gene groups corresponding to different positional clusters of the same word corresponded to diverse functions, e.g. activation or repression in different tissues. Thus, different clusters of the same word likely reflect the phenomenon of ‘positional regulation’, i.e. a word's regulatory function can vary with its position relative to a genomic landmark, a conclusion inaccessible to methods based purely on sequence. Further integrative analysis of words co-occurring in PPRs also yielded 24 different groups of genes, likely identifying cis-regulatory modules de novo. Whereas comparative genomics requires precise sequence alignments, positional regulomics exploits genomic landmarks to provide a ‘poor man's alignment’. By exploiting the phenomenon of positional regulation, it uses position to differentiate the biological functions of subsets of TFBSs sharing a common sequence motif.
PMCID: PMC2377430  PMID: 18367472
23.  The identification of complete domains within protein sequences using accurate E-values for semi-global alignment 
Nucleic Acids Research  2007;35(14):4678-4685.
The sequencing of complete genomes has created a pressing need for automated annotation of gene function. Because domains are the basic units of protein function and evolution, a gene can be annotated from a domain database by aligning domains to the corresponding protein sequence. Ideally, complete domains are aligned to protein subsequences, in a ‘semi-global alignment’. Local alignment, which aligns pieces of domains to subsequences, is common in high-throughput annotation applications, however. It is a mature technique, with the heuristics and accurate E-values required for screening large databases and evaluating the screening results. Hidden Markov models (HMMs) provide an alternative theoretical framework for semi-global alignment, but their use is limited because they lack heuristic acceleration and accurate E-values. Our new tool, GLOBAL, overcomes some limitations of previous semi-global HMMs: it has accurate E-values and the possibility of the heuristic acceleration required for high-throughput applications. Moreover, according to a standard of truth based on protein structure, two semi-global HMM alignment tools (GLOBAL and HMMer) had comparable performance in identifying complete domains, but distinctly outperformed two tools based on local alignment. When searching for complete protein domains, therefore, GLOBAL avoids disadvantages commonly associated with HMMs, yet maintains their superior retrieval performance.
PMCID: PMC1950549  PMID: 17596268
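The distinction this abstract draws between local and semi-global alignment can be made concrete with a small dynamic-programming sketch. It is not the GLOBAL tool's method, which the abstract describes as HMM-based. It is a plain score-based recurrence with hypothetical match, mismatch and gap scores, shown only to illustrate the boundary conditions: the whole domain must be aligned, while the start and end positions within the protein are free.

```python
def semi_global_score(domain, protein, match=1, mismatch=-1, gap=-2):
    """Best score for aligning ALL of `domain` to some substring of `protein`.

    The scoring values are illustrative assumptions, not published parameters.
    """
    # Row 0: zero cost to start the alignment anywhere in the protein.
    prev = [0] * (len(protein) + 1)
    for i in range(1, len(domain) + 1):
        # Column 0: domain residues with no protein residue must pay for gaps.
        cur = [prev[0] + gap]
        for j in range(1, len(protein) + 1):
            sub = match if domain[i - 1] == protein[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub,   # align the two residues
                           prev[j] + gap,       # domain residue against a gap
                           cur[j - 1] + gap))   # protein residue against a gap
        prev = cur
    # Last row: the alignment may end anywhere in the protein at no cost.
    return max(prev)


score = semi_global_score("ACD", "XXACDXX")
```

A local alignment would instead reset negative cell values to zero and take the maximum over the whole table, which is what allows it to report only a piece of a domain.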
24.  Scanning sequences after Gibbs sampling to find multiple occurrences of functional elements 
BMC Bioinformatics  2006;7:408.
Many DNA regulatory elements occur as multiple instances within a target promoter. Gibbs sampling programs for finding DNA regulatory elements de novo can be prohibitively slow in locating all instances of such an element in a sequence set.
We describe an improvement to the A-GLAM computer program, which predicts regulatory elements within DNA sequences with Gibbs sampling. The improvement adds an optional "scanning step" after Gibbs sampling. Gibbs sampling produces a position specific scoring matrix (PSSM). The new scanning step resembles an iterative PSI-BLAST search based on the PSSM. First, it assigns an "individual score" to each subsequence of appropriate length within the input sequences using the initial PSSM. Second, it computes an E-value from each individual score, to assess the agreement between the corresponding subsequence and the PSSM. Third, it permits subsequences with E-values falling below a threshold to contribute to the underlying PSSM, which is then updated using the Bayesian calculus. A-GLAM iterates its scanning step to convergence, at which point no new subsequences contribute to the PSSM. After convergence, A-GLAM reports predicted regulatory elements within each sequence in order of increasing E-values, so users have a statistical evaluation of the predicted elements in a convenient presentation. Thus, although the Gibbs sampling step in A-GLAM finds at most one regulatory element per input sequence, the scanning step can now rapidly locate further instances of the element in each sequence.
Datasets from experiments determining the binding sites of transcription factors were used to evaluate the improvement to A-GLAM. Typically, the datasets included several sequences containing multiple instances of a regulatory motif. The improvements to A-GLAM permitted it to predict the multiple instances.
PMCID: PMC1599759  PMID: 16961919
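The iterate-to-convergence structure of the scanning step described in this abstract can be sketched as follows. This is a simplified illustration, not A-GLAM's implementation: it replaces the E-value test with a fixed log-odds score threshold, uses simple pseudocounts instead of the Bayesian update, and all names, the uniform background and the threshold value are assumptions. Sequences are assumed to contain only the letters A, C, G and T.

```python
import math

ALPHABET = "ACGT"


def build_pssm(sites, pseudo=1.0, background=0.25):
    """Log-odds PSSM (one dict per column) from equal-length sites."""
    total = len(sites) + pseudo * len(ALPHABET)
    return [
        {a: math.log2((sum(s[k] == a for s in sites) + pseudo) / total / background)
         for a in ALPHABET}
        for k in range(len(sites[0]))
    ]


def window_score(pssm, window):
    return sum(column[letter] for column, letter in zip(pssm, window))


def scan_to_convergence(sequences, seeds, width, threshold):
    """Grow the set of accepted (sequence index, start) sites until stable."""
    accepted = set(seeds)
    while True:
        pssm = build_pssm([sequences[i][s:s + width] for i, s in sorted(accepted)])
        new = {
            (i, s)
            for i, seq in enumerate(sequences)
            for s in range(len(seq) - width + 1)
            if (i, s) not in accepted
            and window_score(pssm, seq[s:s + width]) >= threshold
        }
        if not new:                 # convergence: nothing new contributes
            return sorted(accepted)
        accepted |= new             # new sites update the PSSM next round


seqs = ["TTTACGTATT", "GGACGTAGGACGTA"]
sites = scan_to_convergence(seqs, seeds=[(0, 3), (1, 2)], width=5, threshold=4.5)
```

Starting from one seed site per sequence, as Gibbs sampling would supply, the loop picks up the second instance in the second sequence. The accepted set only ever grows and is bounded by the number of windows, so the loop terminates.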
25.  Alignments anchored on genomic landmarks can aid in the identification of regulatory elements 
Bioinformatics (Oxford, England)  2005;21(Suppl 1):i440-i448.
The transcription start site (TSS) has been located for an increasing number of genes across several organisms. Statistical tests have shown that some cis-acting regulatory elements have positional preferences with respect to the TSS, but few strategies have emerged for locating elements by their positional preferences. This paper elaborates such a strategy. First, we align promoter regions without gaps, anchoring the alignment on each promoter’s TSS. Second, we apply a novel word-specific mask. Third, we apply a clustering test related to gapless BLAST statistics. The test examines whether any specific word is placed unusually consistently with respect to the TSS. Finally, our program A-GLAM, an extension of the GLAM program, uses significant word positions as new ‘anchors’ to realign the sequences. A Gibbs sampling algorithm then locates putative cis-acting regulatory elements. Usually, Gibbs sampling requires a preliminary masking step, to avoid convergence onto a dominant but uninteresting signal from a DNA repeat. However, since the positional anchors focus A-GLAM on the motif of interest, masking DNA repeats during Gibbs sampling becomes unnecessary.
In a set of human DNA sequences with experimentally characterized TSSs, the placement of 791 octonucleotide words was unusually consistent (multiple test corrected P < 0.05). Alignments anchored on these words sometimes located statistically significant motifs inaccessible to GLAM or AlignACE.
The A-GLAM program and a list of statistically significant words are available at
PMCID: PMC1317086  PMID: 15961489
