Results 1-13 (13)

1.  EXACT2: the semantics of biomedical protocols 
BMC Bioinformatics  2014;15(Suppl 14):S5.
The reliability and reproducibility of experimental procedures is a cornerstone of scientific practice. There is a pressing technological need for the better representation of biomedical protocols to enable other agents (human or machine) to better reproduce results. A framework that ensures that all information required for the replication of experimental protocols is captured is essential to achieving reproducibility.
We have developed the ontology EXACT2 (EXperimental ACTions) that is designed to capture the full semantics of biomedical protocols required for their reproducibility.
To construct EXACT2 we manually inspected hundreds of published and commercial biomedical protocols from several areas of biomedicine. After establishing a clear pattern for extracting the required information we utilized text-mining tools to translate the protocols into a machine amenable format. We have verified the utility of EXACT2 through the successful processing of previously 'unseen' (not used for the construction of EXACT2) protocols.
The paper reports on a fundamentally new version, EXACT2, that supports the semantically-defined representation of biomedical protocols. The ability of EXACT2 to capture the semantics of biomedical procedures was verified through a text mining use case. In this use case, EXACT2 is used as a reference model for text mining tools to identify terms pertinent to experimental actions, and their properties, in biomedical protocols expressed in natural language. An EXACT2-based framework for the translation of biomedical protocols to a machine amenable format is proposed.
The EXACT2 ontology is sufficient to record, in a machine processable form, the essential information about biomedical protocols. EXACT2 defines explicit semantics of experimental actions, and can be used by various computer applications. It can serve as a reference model for the translation of biomedical protocols in natural language into a semantically-defined format.
PMCID: PMC4255744  PMID: 25472549
2.  The Use of Weighted Graphs for Large-Scale Genome Analysis 
PLoS ONE  2014;9(3):e89618.
There is an acute need for better tools to extract knowledge from the growing flood of sequence data. For example, thousands of complete genomes have been sequenced, and their metabolic networks inferred. Such data should enable a better understanding of evolution. However, most existing network analysis methods are based on pair-wise comparisons, and these do not scale to thousands of genomes. Here we propose the use of weighted graphs as a data structure to enable large-scale phylogenetic analysis of networks. We have developed three types of weighted graph for enzymes: taxonomic (these summarize phylogenetic importance), isoenzymatic (these summarize enzymatic variety/redundancy), and sequence-similarity (these summarize sequence conservation); and we applied these types of weighted graph to survey prokaryotic metabolism. To demonstrate the utility of this approach we have compared and contrasted the large-scale evolution of metabolism in Archaea and Eubacteria. Our results provide evidence for limits to the contingency of evolution.
PMCID: PMC3949676  PMID: 24619061
3.  Representation of probabilistic scientific knowledge 
Journal of Biomedical Semantics  2013;4(Suppl 1):S7.
The theory of probability is widely used in biomedical research for data analysis and modelling. In previous work the probabilities of the research hypotheses have been recorded as experimental metadata. The ontology HELO is designed to support probabilistic reasoning, and provides semantic descriptors for reporting on research that involves operations with probabilities. HELO explicitly links research statements such as hypotheses, models, laws, conclusions, etc. to the associated probabilities of these statements being true. HELO enables the explicit semantic representation and accurate recording of probabilities in hypotheses, as well as the inference methods used to generate and update those hypotheses. We demonstrate the utility of HELO on three worked examples: changes in the probability of the hypothesis that sirtuins regulate human life span; changes in the probability of hypotheses about gene functions in the S. cerevisiae aromatic amino acid pathway; and the use of active learning in drug design (quantitative structure activity relation learning), where a strategy for the selection of compounds with the highest probability of improving on the best known compound was used. HELO is open source and available at
PMCID: PMC3632998  PMID: 23734675
ontology; knowledge representation; probabilistic reasoning
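Entry 3 describes recording how the probability of a hypothesis changes as evidence arrives. A minimal sketch of one such update, assuming plain Bayes' rule for a single binary hypothesis, is given below. The abstract does not say that HELO prescribes this particular inference method; the function name and the numbers in the usage note are hypothetical and serve only as illustration.

```python
def bayes_update(prior, p_evidence_if_true, p_evidence_if_false):
    """Posterior probability of a hypothesis after one piece of evidence.

    Assumption: a single binary hypothesis updated with Bayes' rule.
    This is an illustrative choice, not a method taken from HELO itself.
    """
    numerator = prior * p_evidence_if_true
    denominator = numerator + (1.0 - prior) * p_evidence_if_false
    if denominator == 0.0:
        raise ValueError("evidence has zero probability under both outcomes")
    return numerator / denominator


def update_history(prior, likelihood_pairs):
    """Return the probability after each successive piece of evidence,
    starting with the prior, so the full trajectory can be recorded."""
    history = [prior]
    for p_true, p_false in likelihood_pairs:
        history.append(bayes_update(history[-1], p_true, p_false))
    return history
```

For example, `update_history(0.5, [(0.8, 0.2), (0.3, 0.6)])` returns the prior followed by the probability after each observation, which is the kind of trajectory the abstract says should be recorded together with the inference method that produced it.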
4.  Yeast-based automated high-throughput screens to identify anti-parasitic lead compounds 
Open Biology  2013;3(2):120158.
We have developed a robust, fully automated anti-parasitic drug-screening method that selects compounds specifically targeting parasite enzymes and not their host counterparts, thus allowing the early elimination of compounds with potential side effects. Our yeast system permits multiple parasite targets to be assayed in parallel owing to the strains’ expression of different fluorescent proteins. A strain expressing the human target is included in the multiplexed screen to exclude compounds that do not discriminate between host and parasite enzymes. This form of assay has the advantages of using known targets and not requiring the in vitro culture of parasites. We performed automated screens for inhibitors of parasite dihydrofolate reductases, N-myristoyltransferases and phosphoglycerate kinases, finding specific inhibitors of parasite targets. We found that our ‘hits’ have significant structural similarities to compounds with in vitro anti-parasitic activity, validating our screens and suggesting targets for hits identified in parasite-based assays. Finally, we demonstrate a 60 per cent success rate for our hit compounds in killing or severely inhibiting the growth of Trypanosoma brucei, the causative agent of African sleeping sickness.
PMCID: PMC3603448  PMID: 23446112
drug screening; parasites; yeast; automation; tropical diseases
5.  Functional Expression of Parasite Drug Targets and Their Human Orthologs in Yeast 
The exacting nutritional requirements and complicated life cycles of parasites mean that they are not always amenable to high-throughput drug screening using automated procedures. Therefore, we have engineered the yeast Saccharomyces cerevisiae to act as a surrogate for expressing anti-parasitic targets from a range of biomedically important pathogens, to facilitate the rapid identification of new therapeutic agents.
Methodology/Principal Findings
Using pyrimethamine/dihydrofolate reductase (DHFR) as a model parasite drug/drug target system, we explore the potential of engineered yeast strains (expressing DHFR enzymes from Plasmodium falciparum, P. vivax, Homo sapiens, Schistosoma mansoni, Leishmania major, Trypanosoma brucei and T. cruzi) to exhibit appropriate differential sensitivity to pyrimethamine. Here, we demonstrate that yeast strains (lacking the major drug efflux pump, Pdr5p) expressing yeast (ScDFR1), human (HsDHFR), Schistosoma (SmDHFR), and Trypanosoma (TbDHFR and TcDHFR) DHFRs are insensitive to pyrimethamine treatment, whereas yeast strains producing Plasmodium (PfDHFR and PvDHFR) DHFRs are hypersensitive. Reassuringly, yeast strains expressing field-verified, drug-resistant mutants of P. falciparum DHFR (Pfdhfr51I,59R,108N) are completely insensitive to pyrimethamine, further validating our approach to drug screening. We further show the versatility of the approach by replacing yeast essential genes with other potential drug targets, namely phosphoglycerate kinases (PGKs) and N-myristoyl transferases (NMTs).
We have generated a number of yeast strains that can be successfully harnessed for the rapid and selective identification of urgently needed anti-parasitic agents.
Author Summary
Parasites kill millions of people every year and leave countless others with chronic debilitating disease. These diseases, which include malaria and sleeping sickness, mainly affect people in developing countries. For this reason, few drugs have been developed to treat them. To make matters worse, many parasites are developing resistance to the drugs that are available. Thus, there is an urgent need to develop new drugs, but this is hampered by the fact that most parasites are difficult or impossible to grow in the laboratory. To address this, we have engineered baker's yeast to be dependent on the function of enzymes from either parasites or humans. In all, our engineered yeast constructs encompass six parasites (causing malaria, schistosomiasis, leishmaniasis, sleeping sickness, and Chagas disease) and three different enzymes that are known or potential drug targets. Further, we have increased yeast's sensitivity to drugs by deleting the gene for its major drug efflux pump. Because yeast is robust and easy to grow in the laboratory, we can use a robot to screen for drugs that will kill yeast dependent on a parasite enzyme, but not touch yeast dependent on the equivalent human enzyme.
PMCID: PMC3186757  PMID: 21991399
6.  On the formalization and reuse of scientific research 
The reuse of scientific knowledge obtained from one investigation in another investigation is basic to the advance of science. Scientific investigations should therefore be recorded in ways that promote the reuse of the knowledge they generate. The use of logical formalisms to describe scientific knowledge has potential advantages in facilitating such reuse. Here, we propose a formal framework for using logical formalisms to promote reuse. We demonstrate the utility of this framework by using it in a worked example from biology: demonstrating cycles of investigation formalization [F] and reuse [R] to generate new knowledge. We first used logic to formally describe a Robot scientist investigation into yeast (Saccharomyces cerevisiae) functional genomics [f1]. With Robot scientists, unlike human scientists, the production of comprehensive metadata about their investigations is a natural by-product of the way they work. We then demonstrated how this formalism enabled the reuse of the research in investigating yeast phenotypes [r1 = R(f1)]. This investigation found that the removal of non-essential enzymes generally resulted in enhanced growth. The phenotype investigation was then formally described using the same logical formalism as the functional genomics investigation [f2 = F(r1)]. We then demonstrated how this formalism enabled the reuse of the phenotype investigation to investigate yeast systems-biology modelling [r2 = R(f2)]. This investigation found that yeast flux-balance analysis models fail to predict the observed changes in growth. Finally, the systems biology investigation was formalized for reuse in future investigations [f3 = F(r2)]. These cycles of reuse are a model for the general reuse of scientific knowledge.
PMCID: PMC3163424  PMID: 21490004
semantic web; logic; Saccharomyces cerevisiae; ontology
7.  Further developments towards a genome-scale metabolic model of yeast 
BMC Systems Biology  2010;4:145.
To date, several genome-scale network reconstructions have been used to describe the metabolism of the yeast Saccharomyces cerevisiae, each differing in scope and content. The recent community-driven reconstruction, while rigorously evidenced and well annotated, under-represented metabolite transport, lipid metabolism and other pathways, and was not amenable to constraint-based analyses because of lack of pathway connectivity.
We have expanded the yeast network reconstruction to incorporate many new reactions from the literature and represented these in a well-annotated and standards-compliant manner. The new reconstruction comprises 1102 unique metabolic reactions involving 924 unique metabolites - significantly larger in scope than any previous reconstruction. The representation of lipid metabolism in particular has improved, with 234 out of 268 enzymes linked to lipid metabolism now present in at least one reaction. Connectivity is emphatically improved, with more than 90% of metabolites now reachable from the growth medium constituents. The present updates allow constraint-based analyses to be performed; viability predictions of single knockouts are comparable to results from in vivo experiments and to those of previous reconstructions.
We report the development of the most complete reconstruction of yeast metabolism to date that is based upon reliable literature evidence and richly annotated according to MIRIAM standards. The reconstruction is available in the Systems Biology Markup Language (SBML) and via a publicly accessible database
PMCID: PMC2988745  PMID: 21029416
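Entry 7 reports the share of metabolites reachable from the growth medium constituents. The abstract does not state how reachability was computed; the sketch below assumes one common notion, in which a reaction can fire only once all of its substrates are reachable, and its products then become reachable in turn. The reaction list and metabolite names are hypothetical toy data, not taken from the reconstruction.

```python
def reachable_metabolites(reactions, medium):
    """Expand the set of reachable metabolites from a growth medium.

    reactions: iterable of (substrates, products) pairs of metabolite names.
    medium:    iterable of metabolite names available at the start.
    Assumption: a reaction fires only when every substrate is reachable.
    """
    reachable = set(medium)
    changed = True
    while changed:  # repeat until a full pass adds nothing new
        changed = False
        for substrates, products in reactions:
            if set(substrates) <= reachable and not set(products) <= reachable:
                reachable |= set(products)
                changed = True
    return reachable


def reachable_fraction(reactions, medium):
    """Fraction of all metabolites in the network that are reachable."""
    everything = set(medium)
    for substrates, products in reactions:
        everything |= set(substrates) | set(products)
    return len(reachable_metabolites(reactions, medium)) / len(everything)
```

With the toy network `[(["glc"], ["g6p"]), (["g6p"], ["f6p"]), (["x", "f6p"], ["y"])]` and medium `["glc"]`, the metabolites `x` and `y` stay unreachable because `x` is never produced, which is the kind of connectivity gap the reconstruction aims to close.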
8.  Towards Robot Scientists for autonomous scientific discovery 
We review the main components of autonomous scientific discovery, and how they lead to the concept of a Robot Scientist. This is a system which uses techniques from artificial intelligence to automate all aspects of the scientific discovery process: it generates hypotheses from a computer model of the domain, designs experiments to test these hypotheses, runs the physical experiments using robotic systems, analyses and interprets the resulting data, and repeats the cycle. We describe our two prototype Robot Scientists: Adam and Eve. Adam has recently proven the potential of such systems by identifying twelve genes responsible for catalysing specific reactions in the metabolic pathways of the yeast Saccharomyces cerevisiae. This work has been formally recorded in great detail using logic. We argue that the reporting of science needs to become fully formalised and that Robot Scientists can help achieve this. This will make scientific information more reproducible and reusable, and promote the integration of computers in scientific reasoning. We believe the greater automation of both the physical and intellectual aspects of scientific investigations to be essential to the future of science. Greater automation improves the accuracy and reliability of experiments, increases the pace of discovery and, in common with conventional laboratory automation, removes tedious and repetitive tasks from the human scientist.
PMCID: PMC2813846  PMID: 20119518
9.  The EXACT description of biomedical protocols 
Bioinformatics  2008;24(13):i295-i303.
Motivation: Many published manuscripts contain experiment protocols which are poorly described or deficient in information. This means that the published results are very hard or impossible to repeat. This problem is being made worse by the increasing complexity of high-throughput/automated methods. There is therefore a growing need to represent experiment protocols in an efficient and unambiguous way.
Results: We have developed the Experiment ACTions (EXACT) ontology as the basis of a method of representing biological laboratory protocols. We provide example protocols that have been formalized using EXACT, and demonstrate the advantages and opportunities created by using this formalization. We argue that the use of EXACT will result in the publication of protocols with increased clarity and usefulness to the scientific community.
Availability: The ontology, examples and code can be downloaded from
Contact: Larisa Soldatova
PMCID: PMC2718634  PMID: 18586727
10.  An ontology of scientific experiments 
The formal description of experiments for efficient analysis, annotation and sharing of results is a fundamental part of the practice of science. Ontologies are required to achieve this objective. A few subject-specific ontologies of experiments currently exist. However, despite the unity of scientific experimentation, no general ontology of experiments exists. We propose the ontology EXPO to meet this need. EXPO links the SUMO (the Suggested Upper Merged Ontology) with subject-specific ontologies of experiments by formalizing the generic concepts of experimental design, methodology and results representation. EXPO is expressed in the W3C standard ontology language OWL-DL. We demonstrate the utility of EXPO and its ability to describe different experimental domains, by applying it to two experiments: one in high-energy physics and the other in phylogenetics. The use of EXPO made the goals and structure of these experiments more explicit, revealed ambiguities, and highlighted an unexpected similarity. We conclude that EXPO is of general value in describing experiments and a step towards the formalization of science.
PMCID: PMC1885356  PMID: 17015305
ontology; formalization; annotation; artificial intelligence; metadata
11.  Locational distribution of gene functional classes in Arabidopsis thaliana 
BMC Bioinformatics  2007;8:112.
We are interested in understanding the locational distribution of genes and their functions in genomes, as this distribution has both functional and evolutionary significance. Gene locational distribution is known to be affected by various evolutionary processes, with tandem duplication thought to be the main process producing clustering of homologous sequences. Recent research has found clustering of protein structural families in the human genome, even when genes identified as tandem duplicates have been removed from the data. However, this previous research was hindered by its inability to analyse small sample sizes. This is a challenge for bioinformatics as more specific functional classes have fewer examples and conventional statistical analyses of these small data sets often produce unsatisfactory results.
We have developed a novel bioinformatics method based on Monte Carlo methods and Greenwood's spacing statistic for the computational analysis of the distribution of individual functional classes of genes (from GO). We used this to make the first comprehensive statistical analysis of the relationship between gene functional class and location on a genome. Analysis of the distribution of all genes except tandem duplicates on the five chromosomes of A. thaliana reveals that the distribution on chromosomes I, II, IV and V is clustered at P = 0.001. Many functional classes are clustered, with the degree of clustering within an individual class generally consistent across all five chromosomes. A novel and surprising result was that the locational distribution of some functional classes was significantly more evenly spaced than would be expected by chance.
Analysis of the A. thaliana genome reveals evidence of unexplained order in the locational distribution of genes. The same general analysis method can be applied to any genome, and indeed any sequential data involving classes.
PMCID: PMC1855069  PMID: 17397552
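Entry 11 combines Greenwood's spacing statistic with Monte Carlo simulation. The sketch below shows the general idea under stated assumptions: gene positions are treated as points on a linear chromosome, the null model places the same number of points uniformly at random, and the statistic is the sum of squared gaps normalised by chromosome length. The function names, the null model and the trial count are illustrative choices; the paper's exact procedure is not given in the abstract.

```python
import random


def greenwood(positions, length):
    """Greenwood's spacing statistic: sum of squared normalised gaps.

    Gaps include the stretch before the first point and after the last.
    Larger values indicate clustering, smaller values even spacing.
    """
    pts = sorted(positions)
    edges = [0.0] + pts + [float(length)]
    return sum(((b - a) / length) ** 2 for a, b in zip(edges, edges[1:]))


def monte_carlo_test(positions, length, trials=2000, seed=0):
    """Compare the observed statistic with uniformly random placements.

    Returns (observed, p_clustered, p_even), where each p-value is the
    share of simulations at least as extreme as the observation in that
    direction, with the usual +1 correction.
    Assumption: a uniform null model on [0, length].
    """
    rng = random.Random(seed)
    observed = greenwood(positions, length)
    n = len(positions)
    at_least = at_most = 0
    for _ in range(trials):
        sim = greenwood([rng.uniform(0, length) for _ in range(n)], length)
        at_least += sim >= observed
        at_most += sim <= observed
    return observed, (at_least + 1) / (trials + 1), (at_most + 1) / (trials + 1)
```

Because the null distribution is simulated rather than derived analytically, the same test applies to classes with only a handful of genes, which is the small-sample setting the abstract highlights. A low `p_clustered` suggests clustering; a low `p_even` suggests spacing more even than chance.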
12.  Homology Induction: the use of machine learning to improve sequence similarity searches 
BMC Bioinformatics  2002;3:11.
The inference of homology between proteins is a key problem in molecular biology. The current best approaches only identify ~50% of homologies (with a false positive rate set at 1/1000).
We present Homology Induction (HI), a new approach to inferring homology. HI uses machine learning to bootstrap from standard sequence similarity search methods. First a standard method is run, then HI learns rules which are true for sequences of high similarity to the target (assumed homologues) and not true for general sequences; these rules are then used to discriminate sequences in the twilight zone. To learn the rules HI describes the sequences in a novel way based on a bioinformatic knowledge base, and the machine learning method of inductive logic programming. To evaluate HI we used the PDB40D benchmark which lists sequences of known homology but low sequence similarity. We compared the HI methodology with PSI-BLAST alone and found HI performed significantly better. In addition, Receiver Operating Characteristic (ROC) curve analysis showed that these improvements were robust for all reasonable error costs. The predictive homology rules learnt by HI can be interpreted biologically to provide insight into conserved features of homologous protein families.
HI is a new technique for the detection of remote protein homology, a central bioinformatic problem. HI with PSI-BLAST is shown to outperform PSI-BLAST for all error costs. It is expected that similar improvements would be obtained using HI with any sequence similarity method.
PMCID: PMC107726  PMID: 11972320
13.  Accurate Prediction of Protein Functional Class From Sequence in the Mycobacterium Tuberculosis and Escherichia Coli Genomes Using Data Mining 
Yeast (Chichester, England)  2000;17(4):283-293.
The analysis of genomics data needs to become as automated as its generation. Here we present a novel data-mining approach to predicting protein functional class from sequence. This method is based on a combination of inductive logic programming clustering and rule learning. We demonstrate the effectiveness of this approach on the M. tuberculosis and E. coli genomes, and identify biologically interpretable rules which predict protein functional class from information only available from the sequence. These rules predict 65% of the ORFs with no assigned function in M. tuberculosis and 24% of those in E. coli, with an estimated accuracy of 60–80% (depending on the level of functional assignment). The rules are founded on a combination of detection of remote homology, convergent evolution and horizontal gene transfer. We identify rules that predict protein functional class even in the absence of detectable sequence or structural homology. These rules give insight into the evolutionary history of M. tuberculosis and E. coli.
PMCID: PMC2448385  PMID: 11119305