Results 1-25 (36)

1.  Association studies for untyped markers with TUNA 
Bioinformatics (Oxford, England)  2007;24(3):435-437.
The software package TUNA (Testing UNtyped Alleles) implements a fast and efficient algorithm for testing association of genotyped and ungenotyped variants in genome-wide case-control studies. TUNA uses linkage disequilibrium (LD) information from existing comprehensive variation datasets such as HapMap to construct databases of frequency predictors, each a linear combination of haplotype frequencies of genotyped SNPs. The predictors are used to estimate untyped allele frequencies and to perform association tests. The methods incorporated in TUNA estimate these frequencies accurately, and the software is computationally efficient, with modest memory and CPU requirements.
PMCID: PMC4051297  PMID: 18057020
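As a rough illustration of the frequency predictors described in the TUNA abstract, the sketch below estimates an untyped allele frequency as a weighted sum of haplotype frequencies at genotyped SNPs. The function name, the weights and the frequencies are hypothetical. Reading each weight as the share of a haplotype that carries the untyped allele in a reference panel is an assumption; the abstract does not say how TUNA builds its predictors.

```python
def estimate_untyped_frequency(weights, haplotype_freqs):
    """Estimate an untyped allele frequency as a linear combination
    of haplotype frequencies (hypothetical helper, not TUNA's API)."""
    if len(weights) != len(haplotype_freqs):
        raise ValueError("weights and haplotype_freqs must have equal length")
    return sum(w * h for w, h in zip(weights, haplotype_freqs))


# Assumed weights: fraction of each two-SNP haplotype (00, 01, 10, 11)
# that carries the untyped allele in a reference panel.
weights = [0.9, 0.1, 0.5, 0.0]

# Made-up haplotype frequencies observed in cases and in controls.
case_haplotypes = [0.4, 0.3, 0.2, 0.1]
control_haplotypes = [0.3, 0.3, 0.3, 0.1]

case_estimate = estimate_untyped_frequency(weights, case_haplotypes)
control_estimate = estimate_untyped_frequency(weights, control_haplotypes)
print(case_estimate, control_estimate)
```

The two estimates could then be fed into any standard case-control allele frequency test; which test TUNA applies is not stated in the abstract.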
2.  TMpro web server and web service: transmembrane helix prediction through amino acid property analysis 
Bioinformatics  2007;23(20):2795-2796.
TMpro is a transmembrane (TM) helix prediction algorithm that uses language processing methodology for TM segment identification. It is primarily based on the analysis of statistical distributions of properties of amino acids in transmembrane segments. This article describes the availability of TMpro on the internet via a web interface. The key features of the interface are: (i) output is generated in multiple formats, including a user-interactive graphical chart that allows comparison of TMpro-predicted segment locations with other labeled segments input by the user, such as predictions from other methods; (ii) up to 5000 sequences can be submitted at a time for prediction; and (iii) TMpro is available as a web server and is published as a web service, so that the method can be accessed by users as well as by other services, depending on the need for data integration.
PMCID: PMC3263380  PMID: 17724062
3.  A novel non-overlapping bi-clustering algorithm for network generation using living cell array data 
Bioinformatics (Oxford, England)  2007;23(17):2306-2313.
The living cell array quantifies the contribution of activated transcription factors upon the expression levels of their target genes. The direct manipulation of the regulatory mechanisms offers enormous possibilities for deciphering the machinery that activates and controls gene expression. We propose a novel bi-clustering algorithm for generating non-overlapping clusters of reporter genes and conditions and demonstrate how this information can be interpreted in order to assist in the construction of transcription factor interaction networks.
PMCID: PMC3208260  PMID: 17827207
4.  Sliding MinPD: building evolutionary networks of serial samples via an automated recombination detection approach 
Bioinformatics (Oxford, England)  2007;23(22):2993-3000.
Traditional phylogenetic methods assume tree-like evolutionary models and are likely to perform poorly when provided with sequence data from fast-evolving, recombining viruses. Furthermore, these methods assume that all the sequence data are from contemporaneous taxa, which is not valid for serially-sampled data. A more general approach is proposed here, referred to as the Sliding MinPD method, that reconstructs evolutionary networks for serially-sampled sequences in the presence of recombination.
Sliding MinPD combines distance-based phylogenetic methods with automated recombination detection based on the best-known sliding window approaches to reconstruct serial evolutionary networks. Its performance was evaluated through comprehensive simulation studies and was also applied to a set of serially-sampled HIV sequences from a single patient. The resulting network organizations reveal unique patterns of viral evolution and may help explain the emergence of disease-associated mutants and drug-resistant strains with implications for patient prognosis and treatment strategies.
PMCID: PMC3187926  PMID: 17717035
5.  Mining experimental evidence of molecular function claims from the literature 
Bioinformatics (Oxford, England)  2007;23(23):3232-3240.
The rate at which gene-related findings appear in the scientific literature makes it difficult if not impossible for biomedical scientists to keep fully informed and up to date. The importance of these findings argues for the development of automated methods that can find, extract and summarize this information. This article reports on methods for determining the molecular function claims that are being made in a scientific article, specifically those that are backed by experimental evidence.
The most significant result is that for molecular function claims based on direct assays, our methods achieved recall of 70.7% and precision of 65.7%. Furthermore, our methods correctly identified in the text 44.6% of the specific molecular function claims backed up by direct assays, but with a precision of only 0.92%, a disappointing outcome that led to an examination of the different kinds of errors. These results were based on an analysis of 1823 articles from the literature of Saccharomyces cerevisiae (budding yeast).
The annotation files for S.cerevisiae are available from The draft protocol vocabulary is available by request from the first author.
PMCID: PMC3041023  PMID: 17942445
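The precision and recall pairs quoted in this abstract can be folded into a single F1 score (the harmonic mean of the two). The abstract does not report F1, so the calculation below is an added illustration that uses only the percentages given in the text; the function name is hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Claims based on direct assays: precision 65.7%, recall 70.7%.
f1_direct_assay = f1_score(0.657, 0.707)

# Specific molecular function claims: precision 0.92%, recall 44.6%.
f1_specific_claims = f1_score(0.0092, 0.446)

print(round(f1_direct_assay, 3), round(f1_specific_claims, 3))
```

The gap between the two scores shows how strongly the very low precision on specific claims dominates the harmonic mean.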
6.  Improved Recognition of Figures containing Fluorescence Microscope Images in Online Journal Articles using Graphical Models 
Bioinformatics (Oxford, England)  2007;24(4):569-576.
There is extensive interest in automating the collection, organization, and analysis of biological data. Data in the form of images in online literature present special challenges for such efforts. The first steps in understanding the contents of a figure are decomposing it into panels and determining the type of each panel. In biological literature, panel types include many kinds of images collected by different techniques, such as photographs of gels or images from microscopes. We have previously described the SLIF system that identifies panels containing fluorescence microscope images among figures in online journal articles as a prelude to further analysis of the subcellular patterns in such images. This system contains a pretrained classifier that uses image features to assign a type (class) to each separate panel. However, the types of panels in a figure are often correlated, so that we can consider the class of a panel to be dependent not only on its own features but also on the types of the other panels in a figure.
In this paper, we introduce the use of a type of probabilistic graphical model, a factor graph, to represent the structured information about the images in a figure, and permit more robust and accurate inference about their types. We obtain significant improvement over results for considering panels separately.
The code and data used for the experiments described here are available from
PMCID: PMC2901545  PMID: 18033795
7.  Information theory applied to the sparse gene ontology annotation network to predict novel gene function 
Bioinformatics (Oxford, England)  2007;23(13):i529-i538.
Despite advances in the gene annotation process, the functions of a large portion of the gene products remain insufficiently characterized. In addition, the “in silico” prediction of novel Gene Ontology (GO) annotations for partially characterized gene functions or processes is highly dependent on reverse genetic or functional genomics approaches.
We propose a novel approach, Information Theory-based Semantic Similarity (ITSS), to automatically predict molecular functions of genes based on Gene Ontology annotations. Using 10-fold cross-validation, we demonstrate that the ITSS algorithm obtains prediction accuracies (Precision 97%, Recall 77%) comparable to other machine learning algorithms when applied to similarly dense annotated portions of the GO datasets. In addition, the method can generate highly accurate predictions in sparsely annotated portions of GO, where previous algorithms failed to do so. As a result, our technique generates an order of magnitude more gene function predictions than previous methods. Further, this paper presents the first historical rollback validation for the predicted GO annotations, which may represent more realistic conditions for an evaluation than the cross-validation evaluations generally used. By manually assessing a random sample of 100 predictions conducted in a historical rollback evaluation, we estimate that a minimum precision of 51% (95% confidence interval: 43%–58%) can be achieved for the human GO Annotation file dated 2003.
The program is available on request. The 97,732 positive predictions of novel gene annotations from the 2005 GO Annotation dataset are available at
PMCID: PMC2882681  PMID: 17646340
8.  Genomix 
Bioinformatics (Oxford, England)  2007;23(12):1468-1475.
Correct gene predictions are crucial for most analyses of genomes. However, in the absence of transcript data, gene prediction is still challenging. One way to improve gene-finding accuracy in such genomes is to combine the exons predicted by several gene-finders, so that gene-finders that make uncorrelated errors can correct each other.
We present a method for combining gene-finders called Genomix. Genomix selects the predicted exons that are best conserved within and/or between species in terms of sequence and intron–exon structure, and combines them into a gene structure. Genomix was used to combine predictions from four gene-finders for Caenorhabditis elegans, by selecting the predicted exons that are best conserved with C.briggsae and C.remanei. On a set of ~1500 confirmed C.elegans genes, Genomix increased the exon-level specificity by 10.1% and sensitivity by 2.7% compared to the best input gene-finder.
PMCID: PMC2880447  PMID: 17483502
9.  Evaluation and Integration of 49 Genome-wide Experiments and the Prediction of Previously Unknown Obesity-related Genes 
Bioinformatics (Oxford, England)  2007;23(21):2910-2917.
Genome-wide experiments only rarely show resounding success in yielding genes associated with complex polygenic disorders. We evaluate 49 obesity-related genome-wide experiments with publicly-available findings, including microarray, genetics, proteomics and gene knock-down from human, mouse, rat and worm, in terms of their ability to rediscover a comprehensive set of genes previously found to be causally associated or having variants associated with obesity.
Individual experiments show poor predictive ability for rediscovering known obesity-associated genes. We show that intersecting the results of experiments significantly improves the sensitivity, specificity and precision of the prediction of obesity-associated genes. We create an integrative model that statistically significantly outperforms all 49 individual genome-wide experiments. We find that genes known to be associated with obesity are significantly implicated in more obesity-related experiments and use this to provide a list of genes that we predict to have the highest likelihood of association for obesity. The approach described here can include any number and type of genome-wide experiments and might be useful for other complex polygenic disorders as well.
PMCID: PMC2839901  PMID: 17921495
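The observation that known obesity genes turn up in more experiments suggests a simple counting baseline: rank each gene by the number of experiments that implicate it. The sketch below shows only that counting step, with placeholder experiment and gene names; it is not the integrative statistical model the authors built, whose details are not given in the abstract.

```python
from collections import Counter


def rank_genes_by_support(experiments):
    """Rank genes by how many experiments list them.

    `experiments` maps an experiment name to the genes it implicates.
    Ties are broken alphabetically so the ordering is deterministic.
    """
    counts = Counter()
    for hits in experiments.values():
        counts.update(set(hits))  # count each gene once per experiment
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Placeholder data, purely for illustration.
experiments = {
    "microarray_mouse": ["geneA", "geneB", "geneC"],
    "linkage_human": ["geneA", "geneD"],
    "knockdown_worm": ["geneA", "geneB", "geneB"],
}

ranking = rank_genes_by_support(experiments)
print(ranking)
```

Genes at the top of such a ranking are those supported by the most independent experiments, which is the intuition the abstract describes.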
10.  Choosing where to look next in a mutation sequence space: Active Learning of informative p53 cancer rescue mutants 
Bioinformatics (Oxford, England)  2007;23(13):i104-i114.
Many biomedical projects would benefit from reducing the time and expense of in vitro experimentation by using computer models for in silico predictions. These models may help determine which expensive biological data are most useful to acquire next. Active Learning techniques for choosing the most informative data enable biologists and computer scientists to optimize experimental data choices for rapid discovery of biological function. To explore design choices that affect this desirable behavior, five novel and five existing Active Learning techniques, together with three control methods, were tested on 57 previously unknown p53 cancer rescue mutants for their ability to build classifiers that predict protein function. The best of these techniques, Maximum Curiosity, improved the baseline accuracy of 56% to 77%. This paper shows that Active Learning is a useful tool for biomedical research, and provides a case study of interest to others facing similar discovery challenges.
PMCID: PMC2811495  PMID: 17646286
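For readers unfamiliar with Active Learning, the sketch below shows one common generic selection rule, uncertainty sampling: test next the candidate whose predicted probability is closest to 0.5. This is not the Maximum Curiosity technique named in the abstract, which is not described there; the mutant names and probabilities are invented for illustration.

```python
def most_uncertain(predicted_probs):
    """Return the candidate whose predicted probability of being
    functional is closest to 0.5 (generic uncertainty sampling)."""
    if not predicted_probs:
        raise ValueError("no candidates to choose from")
    # Sorting the keys makes tie-breaking deterministic.
    return min(sorted(predicted_probs),
               key=lambda name: abs(predicted_probs[name] - 0.5))


# Hypothetical classifier outputs for three untested mutants.
predicted_probs = {"mutant_1": 0.95, "mutant_2": 0.55, "mutant_3": 0.10}

next_to_test = most_uncertain(predicted_probs)
print(next_to_test)
```

After the chosen mutant is assayed, its label would be added to the training set and the classifier retrained before the next selection.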
11.  Correcting for gene-specific dye bias in DNA microarrays using the method of maximum likelihood 
In two-color microarray experiments, well-known differences exist in the labeling and hybridization efficiency of Cy3 and Cy5 dyes. Previous reports have revealed that these differences can vary on a gene-by-gene basis, an effect termed gene-specific dye bias. If uncorrected, this bias can influence the determination of differentially expressed genes.
We show that the magnitude of the bias scales multiplicatively with signal intensity and is dependent on which nucleotide has been conjugated to the fluorescent dye. A method is proposed to account for gene-specific dye bias within a maximum-likelihood error modeling framework. Using two different labeling schemes, we show that correcting for gene-specific dye bias results in the superior identification of differentially expressed genes within this framework. Improvement is also possible in related ANOVA approaches.
PMCID: PMC2811084  PMID: 17623705
12.  TAGster: Efficient Selection of LD tag SNPs in Single or Multiple Populations 
Bioinformatics (Oxford, England)  2007;23(23):3254-3255.
Genetic association studies increasingly rely on the use of linkage disequilibrium (LD) tag SNPs to reduce genotyping costs. We developed a software package, TAGster, to select, evaluate, and visualize LD tag SNPs for both single and multiple populations. We implement several strategies to improve the efficiency of current LD tag SNP selection algorithms: (1) we modify the tag SNP selection procedure of Carlson et al. (2004) to improve selection efficiency and further generalize it to multiple populations; (2) we propose a redundant SNP elimination step to speed up the exhaustive tag SNP search algorithm proposed by Qin et al. (2006); and (3) we present an additional multiple-population tag SNP selection algorithm based on the framework of Howie et al. (2006), but using our modified exhaustive search procedure. We evaluate these methods using resequenced candidate gene data from the Environmental Genome Project and show improvements in both computational and tagging efficiency.
PMCID: PMC2782964  PMID: 17827206
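The greedy binning idea behind LD tag SNP selection can be sketched as follows: repeatedly pick the SNP that exceeds an r² threshold with the most remaining SNPs, make it a tag, and remove it together with the SNPs it covers. This is a simplified single-population illustration in the spirit of the Carlson et al. (2004) procedure, not TAGster's modified algorithm; the function name, the SNP identifiers and the r² values are hypothetical.

```python
def greedy_tag_snps(snps, r2, threshold=0.8):
    """Select tag SNPs greedily from pairwise r^2 values.

    `r2` maps frozenset({snp_a, snp_b}) to the pairwise r^2;
    missing pairs are treated as r^2 = 0.
    """
    remaining = set(snps)
    tags = []

    def partners(snp):
        # Remaining SNPs in strong LD with `snp`.
        return {other for other in remaining
                if other != snp
                and r2.get(frozenset((snp, other)), 0.0) >= threshold}

    while remaining:
        # Sorted iteration makes tie-breaking deterministic.
        best = max(sorted(remaining), key=lambda s: len(partners(s)))
        tags.append(best)
        remaining -= partners(best) | {best}
    return tags


# Invented example: rs2 is in strong LD with both rs1 and rs3.
r2 = {
    frozenset(("rs1", "rs2")): 0.90,
    frozenset(("rs2", "rs3")): 0.85,
    frozenset(("rs1", "rs3")): 0.50,
}
tags = greedy_tag_snps(["rs1", "rs2", "rs3", "rs4"], r2)
print(tags)
```

Here two tags cover all four SNPs, which is the genotyping saving that tag SNP selection aims for.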
13.  Determination and validation of principal gene products 
Alternative splicing has the potential to generate a wide range of protein isoforms. For many computational applications and for experimental research, it is important to be able to concentrate on the isoform that retains the core biological function. For many genes this is far from clear.
We have combined five methods into a pipeline that allows us to detect the principal variant for a gene. Most of the methods were based on conservation between species, at the level of both gene and protein. The five methods used were the conservation of exonic structure, the detection of non-neutral evolution, the conservation of functional residues, the existence of a known protein structure and the abundance of vertebrate orthologues. The pipeline was able to determine a principal isoform for 83% of a set of well-annotated genes with multiple variants.
PMCID: PMC2734078  PMID: 18006548
14.  AVIS: AJAX viewer of interactive signaling networks 
Bioinformatics (Oxford, England)  2007;23(20):2803-2805.
Increasing complexity of cell signaling network maps requires sophisticated visualization technologies. Simple web-based visualization tools can allow for improved data presentation and collaboration. Researchers studying cell signaling would benefit from having the ability to embed dynamic cell signaling maps in web pages.
AVIS is a Google gadget-compatible web-based viewer of interactive cell signaling networks. AVIS is an implementation of AJAX (Asynchronous JavaScript and XML) that uses the libraries GraphViz, ImageMagick (PerlMagick) and overLib. AVIS provides web-based visualization of text-based signaling networks with dynamic zooming, panning and linking capabilities. AVIS is a cross-platform web-based tool that can be used to visualize network maps as embedded objects in any web page. AVIS was implemented for visualization of PathwayGenerator, a tool that displays over 4000 automatically generated mammalian cell signaling maps; NodeNeighborhood, a tool to visualize first and second interacting neighbors of yeast and mammalian proteins; and Genes2Networks, a tool to connect lists of genes and proteins using background protein interaction networks.
A demo page of AVIS and links to applications and distributions can be found at Detailed instructions for using and configuring AVIS can be found in the user manual at
PMCID: PMC2724864  PMID: 17855420
15.  A genotype calling algorithm for the Illumina BeadArray platform 
Bioinformatics (Oxford, England)  2007;23(20):2741-2746.
Large-scale genotyping relies on the use of unsupervised automated calling algorithms to assign genotypes to hybridization data. A number of such calling algorithms have recently been established for the Affymetrix GeneChip genotyping technology. Here, we present a fast and accurate genotype calling algorithm for the Illumina BeadArray genotyping platforms. As the technology moves towards assaying millions of genetic polymorphisms simultaneously, there is a need for integrated, easy-to-use software for calling genotypes.
We have introduced a model-based genotype calling algorithm which does not rely on having prior training data or require computationally intensive procedures. The algorithm can assign genotypes to hybridization data from thousands of individuals simultaneously and pools information across multiple individuals to improve the calling. The method can accommodate variations in hybridization intensities which result in dramatic shifts of the position of the genotype clouds by identifying the optimal coordinates to initialize the algorithm. By incorporating the process of perturbation analysis, we can obtain a quality metric measuring the stability of the assigned genotype calls. We show that this quality metric can be used to identify SNPs with low call rates and accuracy.
The C++ executable for the algorithm described here is available by request from the authors.
PMCID: PMC2666488  PMID: 17846035
16.  HaploBuild: an algorithm to construct non-contiguous associated haplotypes in family based genetic studies 
Bioinformatics (Oxford, England)  2007;23(16):2190-2192.
We have created a program that searches densely genotyped regions for associated non-contiguous haplotypes using a standard family based haplotype association test. This program was designed to expand upon the ‘sliding window’ methodologies commonly used for haplotype construction by allowing the association of subsets of single nucleotide polymorphisms (SNPs) to drive the construction of the haplotype. This strategy permits HaploBuild to construct more biologically relevant haplotypes that are not constrained by arbitrary length and contiguous orientation.
PMCID: PMC2665175  PMID: 17586827
17.  InterProSurf: a web server for predicting interacting sites on protein surfaces 
Bioinformatics (Oxford, England)  2007;23(24):3397-3399.
A new web server, InterProSurf, predicts interacting amino acid residues in proteins that are most likely to interact with other proteins, given the 3D structures of subunits of a protein complex. The prediction method is based on solvent accessible surface area of residues in the isolated subunits, a propensity scale for interface residues and a clustering algorithm to identify surface regions with residues of high interface propensities. Here we illustrate the application of InterProSurf to determine which areas of Bacillus anthracis toxins and measles virus hemagglutinin protein interact with their respective cell surface receptors. The computationally predicted regions overlap with those regions previously identified as interface regions by sequence analysis and mutagenesis experiments.
PMCID: PMC2636624  PMID: 17933856
18.  Flavitrack: an annotated database of flavivirus sequences 
Bioinformatics (Oxford, England)  2007;23(19):2645-2647.
Properly annotated sequence data for flaviviruses, which cause diseases, such as tick-borne encephalitis (TBE), dengue fever (DF), West Nile (WN) and yellow fever (YF), can aid in the design of antiviral drugs and vaccines to prevent their spread. Flavitrack was designed to help identify conserved sequence motifs, interpret mutational and structural data and track evolution of phenotypic properties.
Flavitrack contains over 590 complete flavivirus genome/protein sequences and information on known mutations and literature references. Each sequence has been manually annotated according to its date and place of isolation, phenotype and lethality. Internal tools are provided to rapidly determine relationships between viruses in Flavitrack and sequences provided by the user.
PMCID: PMC2629353  PMID: 17660525
19.  SCOOP: a simple method for identification of novel protein superfamily relationships 
Bioinformatics (Oxford, England)  2007;23(7):809-814.
Profile searches of sequence databases are a sensitive way to detect sequence relationships. Sophisticated profile-profile comparison algorithms that have been recently introduced increase search sensitivity even further.
In this article, a simpler approach than profile-profile comparison is presented that has a comparable performance to state-of-the-art tools such as COMPASS, HHsearch and PRC. This approach is called SCOOP (Simple Comparison Of Outputs Program), and is shown to find known relationships between families in the Pfam database as well as detect novel distant relationships between families. Several novel discoveries are presented including the discovery that a domain of unknown function (DUF283) found in Dicer proteins is related to double-stranded RNA-binding domains.
SCOOP is freely available under a GNU GPL license from
PMCID: PMC2603044  PMID: 17277330
20.  BLISS 2.0: a web-based tool for predicting conserved regulatory modules in distantly-related orthologous sequences 
Bioinformatics (Oxford, England)  2007;23(23):3249-3250.
BLISS 2.0 is a web-based application for identifying conserved regulatory modules in distantly related orthologous sequences. Unlike existing approaches, it performs the cross-genome comparison at the binding site level. Experimental results on simulated and real world data indicate that BLISS 2.0 can identify conserved regulatory modules from sequences with little overall similarity at the DNA sequence level.
PMCID: PMC2584781  PMID: 17660203
21.  Creating Protein Models from Electron-Density Maps using Particle-Filtering Methods 
Bioinformatics (Oxford, England)  2007;23(21):2851-2858.
One bottleneck in high-throughput protein crystallography is interpreting an electron-density map; that is, fitting a molecular model to the 3D picture crystallography produces. Previously, we developed Acmi, an algorithm that uses a probabilistic model to infer an accurate protein backbone layout. Here we use a sampling method known as particle filtering to produce a set of all-atom protein models. We use the output of Acmi to guide the particle filter's sampling, producing an accurate, physically feasible set of structures.
We test our algorithm on ten poor-quality experimental density maps. We show that particle filtering produces accurate all-atom models, resulting in fewer chains, lower sidechain RMS error, and reduced R factor, compared to simply placing the best-matching sidechains on Acmi's trace. We show that our approach produces a more accurate model than three leading methods – Textal, Resolve, and ARP/wARP – in terms of main chain completeness, sidechain identification, and crystallographic R factor.
PMCID: PMC2567142  PMID: 17933855
22.  Consensus Data Mining (CDM) Protein Secondary Structure Prediction Server: Combining GOR V and Fragment Database Mining (FDM) 
Bioinformatics (Oxford, England)  2007;23(19):2628-2630.
One of the challenges in protein secondary structure prediction is to overcome the cross-validated 80% prediction accuracy barrier. Here, we propose a novel approach to surpass this barrier. Instead of using a single algorithm that relies on a limited data set for training, we combine two complementary methods having different strengths: Fragment Database Mining (FDM) and GOR V. FDM harnesses the availability of the known protein structures in the Protein Data Bank and provides highly accurate secondary structure predictions when sequentially similar structural fragments are identified. In contrast, the GOR V algorithm is based on information theory, Bayesian statistics, and PSI-BLAST multiple sequence alignments to predict the secondary structure of residues inside a sliding window along a protein chain. A combination of these two different methods benefits from the large number of structures in the PDB and significantly improves the secondary structure prediction accuracy, resulting in Q3 ranging from 67.5 to 93.2%, depending on the availability of highly similar fragments in the Protein Data Bank.
PMCID: PMC2553684  PMID: 17660202
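The Q3 figure quoted in this abstract is the fraction of residues whose three-state secondary structure (helix H, strand E, coil C) is predicted correctly. The sketch below computes it for a pair of made-up strings; the function name and the example sequences are illustrative and are not taken from the CDM server.

```python
def q3_accuracy(predicted, observed):
    """Fraction of residues whose three-state label (H, E or C)
    matches between prediction and observation."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Invented nine-residue example with one mismatch.
observed = "HHHEEECCC"
predicted = "HHHEECCCC"

q3 = q3_accuracy(predicted, observed)
print(f"Q3 = {q3:.1%}")
```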
23.  Evaporative cooling feature selection for genotypic data involving interactions 
Bioinformatics  2007;23(16):2113-2120.
Motivation: The development of genome-wide capabilities for genotyping has led to the practical problem of identifying the minimum subset of genetic variants relevant to the classification of a phenotype. This challenge is especially difficult in the presence of attribute interactions, noise and small sample size.
Methods: Analogous to the physical mechanism of evaporation, we introduce an evaporative cooling (EC) feature selection algorithm that seeks to obtain a subset of attributes with the optimum information temperature (i.e. the least noise). EC uses an attribute quality measure analogous to thermodynamic free energy that combines Relief-F and mutual information to evaporate (i.e. remove) noise features, leaving behind a subset of attributes that contain DNA sequence variations associated with a given phenotype.
Results: EC is able to identify functional sequence variations that involve interactions (epistasis) between other sequence variations that influence their association with the phenotype. This ability is demonstrated on simulated genotypic data with attribute interactions and on real genotypic data from individuals who experienced adverse events following smallpox vaccination. The EC formalism allows us to combine information entropy, energy and temperature into a single information free energy attribute quality measure that balances interaction and main effects.
Availability: Open source software, written in Java, is freely available upon request.
PMCID: PMC3988427  PMID: 17586549
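One ingredient of the EC attribute quality measure is the mutual information between an attribute and the phenotype. The sketch below computes that quantity for a single discrete genotype attribute, using invented data. It does not implement Relief-F, the free-energy combination or the evaporation loop, none of which are specified in the abstract.

```python
import math
from collections import Counter


def mutual_information(attribute, phenotype):
    """Mutual information (in bits) between two discrete sequences."""
    if len(attribute) != len(phenotype):
        raise ValueError("sequences must have the same length")
    n = len(attribute)
    joint = Counter(zip(attribute, phenotype))
    attr_counts = Counter(attribute)
    pheno_counts = Counter(phenotype)
    mi = 0.0
    for (a, p), count in joint.items():
        p_joint = count / n
        p_indep = (attr_counts[a] / n) * (pheno_counts[p] / n)
        mi += p_joint * math.log2(p_joint / p_indep)
    return mi


# Invented case/control labels and two genotype attributes.
phenotype = [0, 0, 1, 1]
informative_snp = [0, 0, 1, 1]    # tracks the phenotype exactly
uninformative_snp = [0, 1, 0, 1]  # independent of the phenotype

mi_informative = mutual_information(informative_snp, phenotype)
mi_uninformative = mutual_information(uninformative_snp, phenotype)
print(mi_informative, mi_uninformative)
```

Mutual information of a single attribute captures main effects only, which is presumably why the abstract pairs it with Relief-F to account for interactions.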
24.  MutationFinder: a high-performance system for extracting point mutation mentions from text 
Bioinformatics (Oxford, England)  2007;23(14):1862-1865.
Discussion of point mutations is ubiquitous in biomedical literature, and manually compiling databases or literature on mutations in specific genes or proteins is tedious. We present an open-source, rule-based system, MutationFinder, for extracting point mutation mentions from text. On blind test data, it achieves nearly perfect precision and a markedly improved recall over a baseline.
MutationFinder, along with a high-quality gold standard data set, and a scoring script for mutation extraction systems have been made publicly available. Implementations, source code and unit tests are available in Python, Perl and Java. MutationFinder can be used as a stand-alone script, or imported by other applications.
PMCID: PMC2516306  PMID: 17495998
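To give a sense of what a rule-based extractor for point mutation mentions looks like, the sketch below matches the compact single-letter form (wild-type residue, position, mutant residue, as in "A123T"). The pattern and function name are illustrative assumptions, not MutationFinder's actual rules or API, and the pattern covers far fewer notations than a real system would.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Wild-type residue, 1-based position, mutant residue, e.g. "A123T".
POINT_MUTATION = re.compile(
    rf"\b([{AMINO_ACIDS}])([1-9][0-9]*)([{AMINO_ACIDS}])\b"
)


def find_point_mutations(text):
    """Return (wild_type, position, mutant) tuples found in `text`."""
    mutations = []
    for wild_type, position, mutant in POINT_MUTATION.findall(text):
        if wild_type != mutant:  # skip non-substitutions such as "A12A"
            mutations.append((wild_type, int(position), mutant))
    return mutations


sentence = "The A123T and G12V substitutions abolished binding."
found = find_point_mutations(sentence)
print(found)
```

A pattern this simple will also match strings such as plasmid or cell line names, so a practical system needs extra rules to keep precision high.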
25.  Manual curation is not sufficient for annotation of genomic databases 
Bioinformatics (Oxford, England)  2007;23(13):i41-i48.
Knowledge base construction has been an area of intense activity and great importance in the growth of computational biology. However, there is little or no history of work on the subject of evaluation of knowledge bases, either with respect to their contents or with respect to the processes by which they are constructed. This article proposes the application of a metric from software engineering known as the found/fixed graph to the problem of evaluating the processes by which genomic knowledge bases are built, as well as the completeness of their contents.
Well-understood patterns of change in the found/fixed graph are found to occur in two large publicly available knowledge bases. These patterns suggest that the current manual curation processes will take far too long to complete the annotations of even just the most important model organisms, and that at their current rate of production, they will never be sufficient for completing the annotation of all currently available proteomes.
PMCID: PMC2516305  PMID: 17646325
