The software package TUNA (Testing UNtyped Alleles) implements a fast and efficient algorithm for testing association of genotyped and ungenotyped variants in genome-wide case-control studies. TUNA uses Linkage Disequilibrium (LD) information from existing comprehensive variation datasets such as HapMap to construct databases of frequency predictors using linear combination of haplotype frequencies of genotyped SNPs. The predictors are used to estimate untyped allele frequencies, and to perform association tests. The methods incorporated in TUNA achieve great accuracy in estimation, and the software is computationally efficient and does not demand a lot of system memory and CPU resources.
TMpro is a transmembrane (TM) helix prediction algorithm that uses language processing methodology for TM segment identification. It is primarily based on the analysis of statistical distributions of properties of amino acids in transmembrane segments. This article describes the availability of TMpro on the internet via a web interface. The key features of the interface are: (i) output is generated in multiple formats including a user-interactive graphical chart which allows comparison of TMpro predicted segment locations with other labeled segments input by the user, such as predictions from other methods. (ii) Up to 5000 sequences can be submitted at a time for prediction. (iii) TMpro is available as a web server and is published as a web service so that the method can be accessed by users as well as other services depending on the need for data integration.
The living cell array quantifies the contribution of activated transcription factors upon the expression levels of their target genes. The direct manipulation of the regulatory mechanisms offers enormous possibilities for deciphering the machinery that activates and controls gene expression. We propose a novel bi-clustering algorithm for generating non-overlapping clusters of reporter genes and conditions and demonstrate how this information can be interpreted in order to assist in the construction of transcription factor interaction networks.
Traditional phylogenetic methods assume tree-like evolutionary models and are likely to perform poorly when provided with sequence data from fast-evolving, recombining viruses. Furthermore, these methods assume that all the sequence data are from contemporaneous taxa, which is not valid for serially-sampled data. A more general approach is proposed here, referred to as the Sliding MinPD method, that reconstructs evolutionary networks for serially-sampled sequences in the presence of recombination.
Sliding MinPD combines distance-based phylogenetic methods with automated recombination detection based on the best-known sliding window approaches to reconstruct serial evolutionary networks. Its performance was evaluated through comprehensive simulation studies and was also applied to a set of serially-sampled HIV sequences from a single patient. The resulting network organizations reveal unique patterns of viral evolution and may help explain the emergence of disease-associated mutants and drug-resistant strains with implications for patient prognosis and treatment strategies.
The rate at which gene-related findings appear in the scientific literature makes it difficult if not impossible for biomedical scientists to keep fully informed and up to date. The importance of these findings argues for the development of automated methods that can find, extract and summarize this information. This article reports on methods for determining the molecular function claims that are being made in a scientific article, specifically those that are backed by experimental evidence.
The most significant result is that for molecular function claims based on direct assays, our methods achieved recall of 70.7% and precision of 65.7%. Furthermore, our methods correctly identified in the text 44.6% of the specific molecular function claims backed up by direct assays, but with a precision of only 0.92%, a disappointing outcome that led to an examination of the different kinds of errors. These results were based on an analysis of 1823 articles from the literature of Saccharomyces cerevisiae (budding yeast).
The annotation files for S.cerevisiae are available from ftp://genome-ftp.stanford.edu/pub/yeast/data_download/literature_curation/gene_association.sgd.gz. The draft protocol vocabulary is available by request from the first author.
There is extensive interest in automating the collection, organization, and analysis of biological data. Data in the form of images in online literature present special challenges for such efforts. The first steps in understanding the contents of a figure are decomposing it into panels and determining the type of each panel. In biological literature, panel types include many kinds of images collected by different techniques, such as photographs of gels or images from microscopes. We have previously described the SLIF system (http://slif.cbi.cmu.edu) that identifies panels containing fluorescence microscope images among figures in online journal articles as a prelude to further analysis of the subcellular patterns in such images. This system contains a pretrained classifier that uses image features to assign a type (class) to each separate panel. However, the types of panels in a figure are often correlated, so that we can consider the class of a panel to be dependent not only on its own features but also on the types of the other panels in a figure.
In this paper, we introduce the use of a type of probabilistic graphical model, a factor graph, to represent the structured information about the images in a figure, and permit more robust and accurate inference about their types. We obtain significant improvement over results for considering panels separately.
The code and data used for the experiments described here are available from http://murphylab.web.cmu.edu/software. Contact: firstname.lastname@example.org
Despite advances in the gene annotation process, the functions of a large portion of the gene products remain insufficiently characterized. In addition, the “in silico” prediction of novel Gene Ontology (GO) annotations for partially characterized gene functions or processes is highly dependent on reverse genetic or function genomics approaches.
We propose a novel approach, Information Theory-based Semantic Similarity (ITSS), to automatically predict molecular functions of genes based on Gene Ontology annotations. We have demonstrated using a 10-fold cross-validation that the ITSS algorithm obtains prediction accuracies (Precision 97%, Recall 77%) comparable to other machine learning algorithms when applied to similarly dense annotated portions of the GO datasets. In addition, such method can generate highly accurate predictions in sparsely annotated portions of GO, in which previous algorithm failed to do so. As a result, our technique generates an order of magnitude more gene function predictions than previous methods. Further, this paper presents the first historical rollback validation for the predicted GO annotations, which may represent more realistic conditions for an evaluation than generally used cross-validations type of evaluations. By manually assessing a random sample of 100 predictions conducted in a historical roll-back evaluation, we estimate that a minimum precision of 51% (95% confidence interval: 43%–58%) can be achieved for the human GO Annotation file dated 2003.
The program is available on request. The 97,732 positive predictions of novel gene annotations from the 2005 GO Annotation dataset are available at http://phenos.bsd.uchicago.edu/mphenogo/prediction_result_2005.txt.
Correct gene predictions are crucial for most analyses of genomes. However, in the absence of transcript data, gene prediction is still challenging. One way to improve gene-finding accuracy in such genomes is to combine the exons predicted by several gene-finders, so that gene-finders that make uncorrelated errors can correct each other.
We present a method for combining gene-finders called Genomix. Genomix selects the predicted exons that are best conserved within and/or between species in terms of sequence and intron–exon structure, and combines them into a gene structure. Genomix was used to combine predictions from four gene-finders for Caenorhabditis elegans, by selecting the predicted exons that are best conserved with C.briggsae and C.remanei. On a set of ~1500 confirmed C.elegans genes, Genomix increased the exon-level specificity by 10.1% and sensitivity by 2.7% compared to the best input gene-finder.
Genome-wide experiments only rarely show resounding success in yielding genes associated with complex polygenic disorders. We evaluate 49 obesity-related genome-wide experiments with publicly-available findings, including microarray, genetics, proteomics and gene knock-down from human, mouse, rat and worm, in terms of their ability to rediscover a comprehensive set of genes previously found to be causally associated or having variants associated with obesity.
Individual experiments show poor predictive ability for rediscovering known obesity-associated genes. We show that intersecting the results of experiments significantly improves the sensitivity, specificity and precision of the prediction of obesity-associated genes. We create an integrative model that statistically significantly outperforms all 49 individual genome-wide experiments. We find that genes known to be associated with obesity are significantly implicated in more obesity-related experiments and use this to provide a list of genes that we predict to have the highest likelihood of association for obesity. The approach described here can include any number and type of genome-wide experiments and might be useful for other complex polygenic disorders as well.
Many biomedical projects would benefit from reducing the time and expense of in vitro experimentation by using computer models for in silico predictions. These models may help determine which expensive biological data are most useful to acquire next. Active Learning techniques for choosing the most informative data enable biologists and computer scientists to optimize experimental data choices for rapid discovery of biological function. To explore design choices that affect this desirable behavior, five novel and five existing Active Learning techniques, together with three control methods, were tested on 57 previously unknown p53 cancer rescue mutants for their ability to build classifiers that predict protein function. The best of these techniques, Maximum Curiosity, improved the baseline accuracy of 56% to 77%. This paper shows that Active Learning is a useful tool for biomedical research, and provides a case study of interest to others facing similar discovery challenges.
In two-color microarray experiments, well-known differences exist in the labeling and hybridization efficiency of Cy3 and Cy5 dyes. Previous reports have revealed that these differences can vary on a gene-by-gene basis, an effect termed gene-specific dye bias. If uncorrected, this bias can influence the determination of differentially expressed genes.
We show that the magnitude of the bias scales multiplicatively with signal intensity and is dependent on which nucleotide has been conjugated to the fluorescent dye. A method is proposed to account for gene-specific dye bias within a maximum-likelihood error modeling framework. Using two different labeling schemes, we show that correcting for gene-specific dye bias results in the superior identification of differentially expressed genes within this framework. Improvement is also possible in related ANOVA approaches.
Genetic association studies increasingly rely on the use of linkage disequilibrium (LD) tag SNPs to reduce genotyping costs. We developed a software package TAGster to select, evaluate, and visualize LD tag SNPs both for single and multiple populations. We implement several strategies to improve the efficiency of current LD tag SNP selection algorithms: 1) we modify the tag SNP selection procedure of Carlson et al. (2004) to improve selection efficiency and further generalize it to multiple populations. 2) We propose a redundant SNP elimination step to speed up the exhaustive tag SNP search algorithm proposed by Qin et al. (2006). 3) We present an additional multiple population tag SNP selection algorithm based on the framework of Howie et al. (2006), but using our modified exhaustive search procedure. We evaluate these methods using resequenced candidate gene data from the Environmental Genome Project and show improvements in both computational and tagging efficiency.
Alternative splicing has the potential to generate a wide range of protein isoforms. For many computational applications and for experimental research, it is important to be able to concentrate on the isoform that retains the core biological function. For many genes this is far from clear.
We have combined five methods into a pipeline that allows us to detect the principal variant for a gene. Most of the methods were based on conservation between species, at the level of both gene and protein. The five methods used were the conservation of exonic structure, the detection of non-neutral evolution, the conservation of functional residues, the existence of a known protein structure and the abundance of vertebrate orthologues. The pipeline was able to determine a principal isoform for 83% of a set of well-annotated genes with multiple variants.
Increasing complexity of cell signaling network maps requires sophisticated visualization technologies. Simple web-based visualization tools can allow for improved data presentation and collaboration. Researchers studying cell signaling would benefit from having the ability to embed dynamic cell signaling maps in web pages.
A demo page of AVIS and links to applications and distributions can be found at http://actin.pharm.mssm.edu/AVIS2. Detailed instructions for using and configuring AVIS can be found in the user manual at http://actin.pharm.mssm.edu/AVIS2/manual.pdf.
Large-scale genotyping relies on the use of unsupervised automated calling algorithms to assign genotypes to hybridization data. A number of such calling algorithms have been recently established for the Affymetrix GeneChip genotyping technology. Here, we present a fast and accurate genotype calling algorithm for the Illumina BeadArray genotyping platforms. As the technology moves towards assaying millions of genetic polymorphisms simultaneously, there is a need for an integrated and easy-to-use software for calling genotypes.
We have introduced a model-based genotype calling algorithm which does not rely on having prior training data or require computationally intensive procedures. The algorithm can assign genotypes to hybridization data from thousands of individuals simultaneously and pools information across multiple individuals to improve the calling. The method can accommodate variations in hybridization intensities which result in dramatic shifts of the position of the genotype clouds by identifying the optimal coordinates to initialize the algorithm. By incorporating the process of perturbation analysis, we can obtain a quality metric measuring the stability of the assigned genotype calls. We show that this quality metric can be used to identify SNPs with low call rates and accuracy.
The C++ executable for the algorithm described here is available by request from the authors.
We have created a program that searches densely genotyped regions for associated non-contiguous haplotypes using a standard family based haplotype association test. This program was designed to expand upon the ‘sliding window’ methodologies commonly used for haplotype construction by allowing the association of subsets of single nucleotide polymorphisms (SNPs) to drive the construction of the haplotype. This strategy permits HaploBuild to construct more biologically relevant haplotypes that are not constrained by arbitrary length and contiguous orientation.
A new web server, InterProSurf, predicts interacting amino acid residues in proteins that are most likely to interact with other proteins, given the 3D structures of subunits of a protein complex. The prediction method is based on solvent accessible surface area of residues in the isolated subunits, a propensity scale for interface residues and a clustering algorithm to identify surface regions with residues of high interface propensities. Here we illustrate the application of InterProSurf to determine which areas of Bacillus anthracis toxins and measles virus hemagglutinin protein interact with their respective cell surface receptors. The computationally predicted regions overlap with those regions previously identified as interface regions by sequence analysis and mutagenesis experiments.
Properly annotated sequence data for flaviviruses, which cause diseases, such as tick-borne encephalitis (TBE), dengue fever (DF), West Nile (WN) and yellow fever (YF), can aid in the design of antiviral drugs and vaccines to prevent their spread. Flavitrack was designed to help identify conserved sequence motifs, interpret mutational and structural data and track evolution of phenotypic properties.
Flavitrack contains over 590 complete flavivirus genome/protein sequences and information on known mutations and literature references. Each sequence has been manually annotated according to its date and place of isolation, phenotype and lethality. Internal tools are provided to rapidly determine relationships between viruses in Flavitrack and sequences provided by the user.
Profile searches of sequence databases are a sensitive way to detect sequence relationships. Sophisticated profile-profile comparison algorithms that have been recently introduced increase search sensitivity even further.
In this article, a simpler approach than profile-profile comparison is presented that has a comparable performance to state-of-the-art tools such as COMPASS, HHsearch and PRC. This approach is called SCOOP (Simple Comparison Of Outputs Program), and is shown to find known relationships between families in the Pfam database as well as detect novel distant relationships between families. Several novel discoveries are presented including the discovery that a domain of unknown function (DUF283) found in Dicer proteins is related to double-stranded RNA-binding domains.
SCOOP is freely available under a GNU GPL license from http://www.sanger.ac.uk/Users/agb/SCOOP/
BLISS 2.0 is a web-based application for identifying conserved regulatory modules in distantly related orthologous sequences. Unlike existing approaches, it performs the cross-genome comparison at the binding site level. Experimental results on simulated and real world data indicate that BLISS 2.0 can identify conserved regulatory modules from sequences with little overall similarity at the DNA sequence level.
One bottleneck in high-throughput protein crystallography is interpreting an electron-density map; that is, fitting a molecular model to the 3D picture crystallography produces. Previously, we developed Acmi, an algorithm that uses a probabilistic model to infer an accurate protein backbone layout. Here we use a sampling method known as particle filtering to produce a set of all-atom protein models. We use the output of Acmi to guide the particle filter's sampling, producing an accurate, physically feasible set of structures.
We test our algorithm on ten poor-quality experimental density maps. We show that particle filtering produces accurate all-atom models, resulting in fewer chains, lower sidechain RMS error, and reduced R factor, compared to simply placing the best-matching sidechains on Acmi's trace. We show that our approach produces a more accurate model than three leading methods – Textal, Resolve, and ARP/wARP – in terms of main chain completeness, sidechain identification, and crystallographic R factor.
One of the challenges in protein secondary structure prediction is to overcome the cross-validated 80% prediction accuracy barrier. Here, we propose a novel approach to surpass this barrier. Instead of using a single algorithm that relies on a limited data set for training, we combine two complementary methods having different strengths: Fragment Database Mining (FDM) and GOR V. FDM harnesses the availability of the known protein structures in the Protein Data Bank and provides highly accurate secondary structure predictions when sequentially similar structural fragments are identified. In contrast, the GOR V algorithm is based on information theory, Bayesian statistics, and PSI-BLAST multiple sequence alignments to predict the secondary structure of residues inside a sliding window along a protein chain. A combination of these two different methods benefits from the large number of structures in the PDB and significantly improves the secondary structure prediction accuracy, resulting in Q3 ranging from 67.5 to 93.2%, depending on the availability of highly similar fragments in the Protein Data Bank.
Motivation: The development of genome-wide capabilities for genotyping has led to the practical problem of identifying the minimum subset of genetic variants relevant to the classification of a phenotype. This challenge is especially difficult in the presence of attribute interactions, noise and small sample size.
Methods: Analogous to the physical mechanism of evaporation, we introduce an evaporative cooling (EC) feature selection algorithm that seeks to obtain a subset of attributes with the optimum information temperature (i.e. the least noise). EC uses an attribute quality measure analogous to thermodynamic free energy that combines Relief-F and mutual information to evaporate (i.e. remove) noise features, leaving behind a subset of attributes that contain DNA sequence variations associated with a given phenotype.
Results: EC is able to identify functional sequence variations that involve interactions (epistasis) between other sequence variations that influence their association with the phenotype. This ability is demonstrated on simulated genotypic data with attribute interactions and on real genotypic data from individuals who experienced adverse events following smallpox vaccination. The EC formalism allows us to combine information entropy, energy and temperature into a single information free energy attribute quality measure that balances interaction and main effects.
Availability: Open source software, written in Java, is freely available upon request.
Discussion of point mutations is ubiquitous in biomedical literature, and manually compiling databases or literature on mutations in specific genes or proteins is tedious. We present an open-source, rule-based system, MutationFinder, for extracting point mutation mentions from text. On blind test data, it achieves nearly perfect precision and a markedly improved recall over a baseline.
MutationFinder, along with a high-quality gold standard data set, and a scoring script for mutation extraction systems have been made publicly available. Implementations, source code and unit tests are available in Python, Perl and Java. MutationFinder can be used as a stand-alone script, or imported by other applications.
Knowledge base construction has been an area of intense activity and great importance in the growth of computational biology. However, there is little or no history of work on the subject of evaluation of knowledge bases, either with respect to their contents or with respect to the processes by which they are constructed. This article proposes the application of a metric from software engineering known as the found/fixed graph to the problem of evaluating the processes by which genomic knowledge bases are built, as well as the completeness of their contents.
Well-understood patterns of change in the found/fixed graph are found to occur in two large publicly available knowledge bases. These patterns suggest that the current manual curation processes will take far too long to complete the annotations of even just the most important model organisms, and that at their current rate of production, they will never be sufficient for completing the annotation of all currently available proteomes.