A challenging problem after a genome-wide association study (GWAS) is to balance the statistical evidence of geno-type-phenotype correlation with a priori evidence of biological relevance.
We introduce a method for systematically prioritizing single nucleotide polymorphisms (SNPs) for further study after a GWAS. The method combines evidence across multiple domains, including statistical evidence of genotype-phenotype correlation, known pathways in the pathologic development of disease, SNP/gene functional properties, comparative genomics, prior evidence of genetic linkage, and linkage disequilibrium. We apply this method to a GWAS of nicotine dependence, and use simulated data to test it on several commercial SNP microarrays.
A backtrace through a dynamic programming algorithm’s intermediate results in search of an optimal path, or to sample paths according to an implied probability distribution, or as the second stage of a forward–backward algorithm, is a task of fundamental importance in computational biology. When there is insufficient space to store all intermediate results in high-speed memory (e.g. cache) existing approaches store selected stages of the computation, and recompute missing values from these checkpoints on an as-needed basis.
Here we present an optimal checkpointing strategy, and demonstrate its utility with pairwise local sequence alignment of sequences of length 10 000.
Sample C++-code for optimal backtrace is available in the Supplementary Materials.
Tandem mass spectrometry (MS/MS) is an indispensable technology for identification of proteins from complex mixtures. Proteins are digested to peptides that are then identified by their fragmentation patterns in the mass spectrometer. Thus, at its core, MS/MS protein identification relies on the relative predictability of peptide fragmentation. Unfortunately, peptide fragmentation is complex and not fully understood, and what is understood is not always exploited by peptide identification algorithms.
We use a hybrid dynamic Bayesian network (DBN)/support vector machine (SVM) approach to address these two problems. We train a set of DBNs on high-confidence peptide-spectrum matches. These DBNs, known collectively as Riptide, comprise a probabilistic model of peptide fragmentation chemistry. Examination of the distributions learned by Riptide allows identification of new trends, such as prevalent a-ion fragmentation at peptide cleavage sites C-term to hydrophobic residues. In addition, Riptide can be used to produce likelihood scores that indicate whether a given peptide-spectrum match is correct. A vector of such scores is evaluated by an SVM, which produces a final score to be used in peptide identification. Using Riptide in this way yields improved discrimination when compared to other state-of-the-art MS/MS identification algorithms, increasing the number of positive identifications by as much as 12% at a 1% false discovery rate.
Python and C source code are available upon request from the authors. The curated training sets are available at http://noble.gs.washington.edu/proj/intense/. The Graphical Model Tool Kit (GMTK) is freely available at http://ssli.ee.washington.edu/bilmes/gmtk.
Mutating residues into alanine (alanine scanning) is one of the fastest experimental means of probing hypotheses about protein function. Alanine scans can reveal functional hot spots, i.e. residues that alter function upon mutation. In vitro mutagenesis is cumbersome and costly: probing all residues in a protein is typically as impossible as substituting by all non-native amino acids. In contrast, such exhaustive mutagenesis is feasible in silico.
Previously, we developed SNAP to predict functional changes due to non-synonymous single nucleotide polymorphisms. Here, we applied SNAP to all experimental mutations in the ASEdb database of alanine scans; we identified 70% of the hot spots (≥1kCal/mol change in binding energy); more severe changes were predicted more accurately. Encouraged, we carried out a complete all-against-all in silico mutagenesis for human glucokinase. Many of the residues predicted as functionally important have indeed been confirmed in the literature, others await experimental verification, and our method is ready to aid in the design of in vitro mutagenesis.
ASEdb and glucokinase scores are available at http://www.rostlab.org/services/SNAP. For submissions of large/whole proteins for processing please contact the author.
A typical PSI-BLAST search consists of iterative scanning and alignment of a large sequence database during which a scoring profile is progressively built and refined. Such a profile can also be stored and used to search against a different database of sequences. Using it to search against a database of consensus rather than native sequences is a simple add-on that boosts performance surprisingly well. The improvement comes at a price: we hypothesized that random alignment score statistics would differ between native and consensus sequences. Thus PSI-BLAST-based profile searches against consensus sequences might incorrectly estimate statistical significance of alignment scores. In addition, iterative searches against consensus databases may fail. Here, we addressed these challenges in an attempt to harness the full power of the combination of PSI-BLAST and consensus sequences.
We studied alignment score statistics for various types of consensus sequences. In general, the score distribution parameters of profile-based consensus sequence alignments differed significantly from those derived for the native sequences. PSI-BLAST partially compensated for the parameter variation. We have identified a protocol for building specialized consensus sequences that significantly improved search sensitivity and preserved score distribution parameters. As a result, PSI-BLAST profiles can be used to search specialized consensus sequences without sacrificing estimates of statistical significance. We also provided results indicating that iterative PSI-BLAST searches against consensus sequences could work very well. Overall, we showed how a widely popular and effective method could be used to identify significantly more relevant similarities among protein sequences.
Recent advances in genotyping technology has made data acquisition for whole-genome association study cost effective, and a current active area of research is developing efficient methods to analyze such large-scale data sets. Most sophisticated association mapping methods that are currently available take phased haplotype data as input. However, phase information is not readily available from sequencing methods and inferring the phase via computational approaches is time-consuming, taking days to phase a single chromosome.
In this paper, we devise an efficient method for scanning unphased whole-genome data for association. Our approach combines a recently found linear-time algorithm for phasing genotypes on trees with a recently proposed tree-based method for association mapping. From unphased genotype data, our algorithm builds local phylogenies along the genome, and scores each tree according to the clustering of cases and controls. We assess the performance of our new method on both simulated and real biological data sets.
Existing linkage-analysis methods address binary or quantitative traits. However, many complex diseases and human conditions, particularly behavioral disorders, are rated on ordinal scales. Herein, we introduce, LOT, a tool that performs linkage analysis of ordinal traits for pedigree data. It implements a latent-variable proportional-odds logistic model that relates inheritance patterns to the distribution of the ordinal trait. The likelihood-ratio test is used for testing evidence of linkage.
Three-dimensional RNA structure prediction and folding is of significant interest in the biological research community. Here, we present iFoldRNA, a novel web-based methodology for RNA structure prediction with near atomic resolution accuracy and analysis of RNA folding thermodynamics. iFoldRNA rapidly explores RNA conformations using discrete molecular dynamics simulations of input RNA sequences. Starting from simplified linear-chain conformations, RNA molecules (<50 nucleotides) fold to native-like structures within half an hour of simulation, facilitating rapid RNA structure prediction. All-atom reconstruction of energetically stable conformations generates iFoldRNA predicted RNA structures. The predicted RNA structures are within 2–5 Angstrom root mean square deviations from corresponding experimentally derived structures. RNA folding parameters including specific heat, contact maps, simulation trajectories, gyration radii, root mean square deviations from native state, fraction of native-like contacts are accessible from iFoldRNA. We expect iFoldRNA will serve as a useful resource for RNA structure prediction and folding thermodynamic analyses.
Microarray experiments typically analyze thousands to tens of thousands of genes from small numbers of biological replicates. The fact that genes are normally expressed in functionally relevant patterns suggests that gene-expression data can be stratified and clustered into relatively homogenous groups. Cluster-wise dimensionality reduction should make it feasible to improve screening power while minimizing information loss.
We propose a powerful and computationally simple method for finding differentially expressed genes in small microarray experiments. The method incorporates a novel stratification-based tight clustering algorithm, principal component analysis and information pooling. Comprehensive simulations show that our method is substantially more powerful than the popular SAM and eBayes approaches. We applied the method to three real microarray datasets: one from a Populus nitrogen stress experiment with 3 biological replicates; and two from public microarray datasets of human cancers with 10 to 40 biological replicates. In all three analyses, our method proved more robust than the popular alternatives for identification of differentially expressed genes.
The C++ code to implement the proposed method is available upon request for academic use.
Accurately predicting the binding affinities of large sets of diverse protein-ligand complexes is an extremely challenging task. The scoring functions that attempt such computational prediction are essential for analysing the outputs of Molecular Docking, which is in turn an important technique for drug discovery, chemical biology and structural biology. Each scoring function assumes a predetermined theory-inspired functional form for the relationship between the variables that characterise the complex, which also include parameters fitted to experimental or simulation data, and its predicted binding affinity. The inherent problem of this rigid approach is that it leads to poor predictivity for those complexes that do not conform to the modelling assumptions. Moreover, resampling strategies, such as cross-validation or bootstrapping, are still not systematically used to guard against the overfitting of calibration data in parameter estimation for scoring functions.
We propose a novel scoring function (RF-Score) that circumvents the need for problematic modelling assumptions via non-parametric machine learning. In particular, Random Forest was used to implicitly capture binding effects that are hard to model explicitly. RF-Score is compared with the state of the art on the demanding PDBbind benchmark. Results show that RF-Score is a very competitive scoring function. Importantly, RF-Score’s performance was shown to improve dramatically with training set size and hence the future availability of more high quality structural and interaction data is expected to lead to improved versions of RF-Score.
The statistical power or multiple Type II error rate in large scale multiple testing problems as, for example, in gene expression microarray experiments, depends on typically unknown parameters and is therefore difficult to assess a priori. However, it has been suggested to estimate the multiple Type II error rate post-hoc, based on the observed data.
We consider a class of post-hoc estimators that are functions of the estimated proportion of true null hypotheses among all hypotheses. Numerous estimators for this proportion have been proposed and we investigate the statistical properties of the derived multiple Type II error rate estimators in an extensive simulation study.
The performance of the estimators in terms of the mean squared error depends sensitively on the distributional scenario. Estimators based on empirical distributions of the null hypotheses are superior in the presence of strongly correlated test statistics.
Accurate prognosis of breast cancer can spare a significant number of breast cancer patients from receiving unnecessary adjuvant systemic treatment and its related expensive medical costs. Recent studies have demonstrated the potential value of gene expression signatures in assessing the risk of post-surgical disease recurrence. However, these studies all attempt to develop genetic marker-based prognostic systems to replace the existing clinical criteria, while ignoring the rich information contained in established clinical markers. Given the complexity of breast cancer prognosis, a more practical strategy would be to utilize both clinical and genetic marker information that may be complementary.
A computational study is performed on publicly available microarray data, which has spawned a 70-gene prognostic signature. The recently proposed I-RELIEF algorithm is used to identify a hybrid signature through the combination of both genetic and clinical markers. A rigorous experimental protocol is used to estimate the prognostic performance of the hybrid signature and other prognostic approaches. Survival data analyses is performed to compare different prognostic approaches.
The hybrid signature performs significantly better than other methods, including the 70-gene signature, clinical makers alone and the St. Gallen consensus criterion. At the 90% sensitivity level, the hybrid signature achieves 67% specificity, as compared to 47% for the 70-gene signature and 48% for the clinical makers. The odds ratio of the hybrid signature for developing distant meta-stases within five years between the patients with a good prognosis signature and the patients with a bad prognosis is 21.0 (95% CI: 6.5–68.3), far higher than either genetic or clinical markers alone.
The living cell array quantifies the contribution of activated transcription factors upon the expression levels of their target genes. The direct manipulation of the regulatory mechanisms offers enormous possibilities for deciphering the machinery that activates and controls gene expression. We propose a novel bi-clustering algorithm for generating non-overlapping clusters of reporter genes and conditions and demonstrate how this information can be interpreted in order to assist in the construction of transcription factor interaction networks.
Traditional phylogenetic methods assume tree-like evolutionary models and are likely to perform poorly when provided with sequence data from fast-evolving, recombining viruses. Furthermore, these methods assume that all the sequence data are from contemporaneous taxa, which is not valid for serially-sampled data. A more general approach is proposed here, referred to as the Sliding MinPD method, that reconstructs evolutionary networks for serially-sampled sequences in the presence of recombination.
Sliding MinPD combines distance-based phylogenetic methods with automated recombination detection based on the best-known sliding window approaches to reconstruct serial evolutionary networks. Its performance was evaluated through comprehensive simulation studies and was also applied to a set of serially-sampled HIV sequences from a single patient. The resulting network organizations reveal unique patterns of viral evolution and may help explain the emergence of disease-associated mutants and drug-resistant strains with implications for patient prognosis and treatment strategies.
Transcription regulation is a fundamental process in biology, and it is important to model the dynamic behavior of gene regulation networks. Many approaches have been proposed to specify the network structure. However, finding the network connectivity is not sufficient to understand the network dynamics. Instead, one needs to model the regulation reactions, usually with a set of ordinary differential equations (ODEs). Because some of the parameters involved in these ODEs are unknown, their values need to be inferred from the observed data.
In this article, we introduce the generalized profiling method to estimate ODE parameters in a gene regulation network from microarray gene expression data which can be rather noisy. Because numerically solving ODEs is computationally expensive, we apply the penalized smoothing technique, a fast and stable computational method to approximate ODE solutions. The ODE solutions with our parameter estimates fit the data well. A goodness-of-fit test of dynamic models is developed to identify gene regulation networks.
When analyzing microarray data, non-biological variation introduces uncertainty in the analysis and interpretation. In this paper we focus on the validation of significant differences in gene expression levels, or normalized channel intensity levels with respect to different experimental conditions and with replicated measurements. A myriad of methods have been proposed to study differences in gene expression levels and to assign significance values as a measure of confidence. In this paper we compare several methods, including SAM, regularized t-test, mixture modeling, Wilk’s lambda score and variance stabilization. From this comparison we developed a weighted resampling approach and applied it to gene deletions in Mycobacterium bovis.
We discuss the assumptions, model structure, computational complexity and applicability to microarray data. The results of our study justified the theoretical basis of the weighted resampling approach, which clearly outperforms the others.
Algorithms were implemented using the statistical programming language R and available on the author’s web-page.
Genome-wide association studies, which produce huge volumes of data, are now being carried out by many groups around the world, creating a need for user friendly tools for data quality control and analysis. One critical aspect of GWAS quality control is evaluating genotype cluster plots to verify sensible genotype calling in putatively associated SNPs. Evoker is a tool for visualizing genotype cluster plots, and provides a solution to the computational and storage problems related to working with such large datasets.
The rate at which gene-related findings appear in the scientific literature makes it difficult if not impossible for biomedical scientists to keep fully informed and up to date. The importance of these findings argues for the development of automated methods that can find, extract and summarize this information. This article reports on methods for determining the molecular function claims that are being made in a scientific article, specifically those that are backed by experimental evidence.
The most significant result is that for molecular function claims based on direct assays, our methods achieved recall of 70.7% and precision of 65.7%. Furthermore, our methods correctly identified in the text 44.6% of the specific molecular function claims backed up by direct assays, but with a precision of only 0.92%, a disappointing outcome that led to an examination of the different kinds of errors. These results were based on an analysis of 1823 articles from the literature of Saccharomyces cerevisiae (budding yeast).
The annotation files for S.cerevisiae are available from ftp://genome-ftp.stanford.edu/pub/yeast/data_download/literature_curation/gene_association.sgd.gz. The draft protocol vocabulary is available by request from the first author.
GO::TermFinder comprises a set of object-oriented Perl modules for accessing Gene Ontology (GO) information and evaluating and visualizing the collective annotation of a list of genes to GO terms. It can be used to draw conclusions from microarray and other biological data, calculating the statistical significance of each annotation. GO::TermFinder can be used on any system on which Perl can be run, either as a command line application, in single or batch mode, or as a web-based CGI script.
The full source code and documentation for GO::TermFinder are freely available from http://search.cpan.org/dist/GO-TermFinder/
The advent of sequencing and structural genomics projects has provided a dramatic boost in the number of protein structures and sequences. Due to the high-throughput nature of these projects, many of the molecules are uncharacterised and their functions unknown. This, in turn, has led to the need for a greater number and diversity of tools and databases providing annotation through transfer based on homology and prediction methods. Though many such tools to annotate protein sequence and structure exist, they are spread throughout the world, often with dedicated individual web pages. This situation does not provide a consensus view of the data and hinders comparison between methods. Integration of these methods is needed. So far this has not been possible since there was no common vocabulary available that could be used as a standard language. A variety of terms could be used to describe any particular feature ranging from different spellings to completely different terms. The Protein Feature Ontology (http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=BS) is a structured controlled vocabulary for features of a protein sequence or structure. It provides a common language for tools and methods to use, so that integration and comparison of their annotations is possible. The Protein Feature Ontology comprises approximately 100 positional terms (located in a particular region of the sequence), which have been integrated into the Sequence Ontology (SO). 40 non-positional terms which describe general protein properties have also been defined and, in addition, post-translational modifications are described by using an already existing ontology, the Protein Modification Ontology (MOD). The Protein Feature Ontology has been used by the BioSapiens Network of Excellence, a consortium comprising 19 partner sites in 14 European countries generating over 150 distinct annotation types for protein sequences and structures.
Visualizing relations among biological information to facilitate understanding is crucial to biological research during the post-genomic era. Although different systems have been developed to view gene-phenotype relations for specific databases, very few have been designed specifically as a general flexible tool for visualizing multidimensional genotypic and phenotypic information together. Our goal is to develop a method for visualizing multidimensional genotypic and phenotypic information and a model that unifies different biological databases in order to present the integrated knowledge using a uniform interface.
We developed a novel, flexible and generalizable visualization tool, called PhenoGenesviewer (PGviewer), which in this paper was used to display gene-phenotype relations from a human-curated database (OMIM) and from an automatic method using a Natural Language Processing tool called BioMedLEE. Data obtained from multiple databases were first integrated into a uniform structure and then organized by PGviewer. PGviewer provides a flexible query interface that allows dynamic selection and ordering of any desired dimension in the databases. Based on users’ queries, results can be visualized using hierarchical expandable trees that present views specified by users according to their research interests. We believe that this method, which allows users to dynamically organize and visualize multiple dimensions, is a potentially powerful and promising tool that should substantially facilitate biological research.
There is extensive interest in automating the collection, organization, and analysis of biological data. Data in the form of images in online literature present special challenges for such efforts. The first steps in understanding the contents of a figure are decomposing it into panels and determining the type of each panel. In biological literature, panel types include many kinds of images collected by different techniques, such as photographs of gels or images from microscopes. We have previously described the SLIF system (http://slif.cbi.cmu.edu) that identifies panels containing fluorescence microscope images among figures in online journal articles as a prelude to further analysis of the subcellular patterns in such images. This system contains a pretrained classifier that uses image features to assign a type (class) to each separate panel. However, the types of panels in a figure are often correlated, so that we can consider the class of a panel to be dependent not only on its own features but also on the types of the other panels in a figure.
In this paper, we introduce the use of a type of probabilistic graphical model, a factor graph, to represent the structured information about the images in a figure, and permit more robust and accurate inference about their types. We obtain significant improvement over results for considering panels separately.
The code and data used for the experiments described here are available from http://murphylab.web.cmu.edu/software. Contact: firstname.lastname@example.org
BMapBuilder builds maps of pairwise linkage disequilibrium (LD) in either two or three dimensions. The optimized resolution allows for graphical display of LD for single nucleotide polymorphisms (SNPs) in a whole chromosome.
Despite advances in the gene annotation process, the functions of a large portion of the gene products remain insufficiently characterized. In addition, the “in silico” prediction of novel Gene Ontology (GO) annotations for partially characterized gene functions or processes is highly dependent on reverse genetic or function genomics approaches.
We propose a novel approach, Information Theory-based Semantic Similarity (ITSS), to automatically predict molecular functions of genes based on Gene Ontology annotations. We have demonstrated using a 10-fold cross-validation that the ITSS algorithm obtains prediction accuracies (Precision 97%, Recall 77%) comparable to other machine learning algorithms when applied to similarly dense annotated portions of the GO datasets. In addition, such method can generate highly accurate predictions in sparsely annotated portions of GO, in which previous algorithm failed to do so. As a result, our technique generates an order of magnitude more gene function predictions than previous methods. Further, this paper presents the first historical rollback validation for the predicted GO annotations, which may represent more realistic conditions for an evaluation than generally used cross-validations type of evaluations. By manually assessing a random sample of 100 predictions conducted in a historical roll-back evaluation, we estimate that a minimum precision of 51% (95% confidence interval: 43%–58%) can be achieved for the human GO Annotation file dated 2003.
The program is available on request. The 97,732 positive predictions of novel gene annotations from the 2005 GO Annotation dataset are available at http://phenos.bsd.uchicago.edu/mphenogo/prediction_result_2005.txt.