Text mining applications typically involve statistical models that require accessing and updating model parameters in an iterative fashion. With the growing size of the data, such models become extremely parameter rich, and naive parallel implementations fail to address the scalability problem of maintaining a distributed look-up table that maps model parameters to their values.
We evaluate several existing alternatives to provide coordination among worker nodes in Hadoop clusters, and suggest a new multi-layered look-up architecture that is specifically optimized for certain problem domains. Our solution exploits the power-law distribution characteristics of the phrase or n-gram counts in large corpora while utilizing a Bloom filter, in-memory cache, and an HBase cluster at varying levels of abstraction.
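The layering described above can be sketched in a few lines of Python. This is a minimal, illustrative stand-in, not the paper's implementation: the HBase table is replaced by a plain dict, and the md5-based hash family and parameter values are assumptions. The Bloom filter short-circuits lookups for n-grams that were never counted (the long tail), while the in-memory cache absorbs repeated hits on the power-law head.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: probabilistic membership with no false negatives."""
    def __init__(self, size=8192, hashes=4):
        self.size, self.hashes, self.bits = size, hashes, bytearray(size)

    def _positions(self, key):
        # Derive several hash positions from md5 (an illustrative choice).
        for i in range(self.hashes):
            h = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        return all(self.bits[p] for p in self._positions(key))

class LayeredLookup:
    """Bloom filter -> in-memory cache -> backing store (stand-in for HBase)."""
    def __init__(self, store):
        self.store = store           # e.g. an HBase table; here a plain dict
        self.cache = {}              # hot n-grams: the power-law head
        self.bloom = BloomFilter()
        for key in store:
            self.bloom.add(key)

    def get(self, key):
        if not self.bloom.might_contain(key):
            return 0                 # definitely absent: skip the remote round-trip
        if key in self.cache:
            return self.cache[key]
        value = self.store.get(key, 0)   # remote lookup in a real cluster
        self.cache[key] = value
        return value
```

Because Bloom filters never produce false negatives, a negative answer is authoritative; only the rare false positive incurs a wasted store lookup.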
Whereas countless highly penetrant variants have been associated with Mendelian disorders, the genetic etiologies underlying complex diseases remain largely unresolved. Here, we examine the extent to which Mendelian variation contributes to complex disease risk by mining the medical records of over 110 million patients. We detect thousands of associations between Mendelian and complex diseases, revealing a non-degenerate, phenotypic code that links each complex disorder to a unique collection of Mendelian loci. Using genome-wide association results, we demonstrate that common variants associated with complex diseases are enriched in the genes indicated by this “Mendelian code.” Finally, we detect hundreds of comorbidity associations among Mendelian disorders, and we use probabilistic genetic modeling to demonstrate that Mendelian variants likely contribute non-additively to the risk for a subset of complex diseases. Overall, this study illustrates a complementary approach for mapping complex disease loci and provides unique predictions concerning the etiologies of specific diseases.
Synonymous relationships among biomedical terms are extensively annotated within specialized terminologies, implying that synonymy is important for practical computational applications within this field. It remains unclear, however, whether text mining actually benefits from documented synonymy and whether existing biomedical thesauri provide adequate coverage of these linguistic relationships. In this study, we examine the impact and extent of undocumented synonymy within a very large compendium of biomedical thesauri. First, we demonstrate that missing synonymy has a significant negative impact on named entity normalization, an important problem within the field of biomedical text mining. To estimate the amount of synonymy currently missing from thesauri, we develop a probabilistic model for the construction of synonym terminologies that is capable of handling a wide range of potential biases, and we evaluate its performance using the broader domain of near-synonymy among general English words. Our model predicts that over 90% of these relationships are currently undocumented, a result that we support experimentally through “crowd-sourcing.” Finally, we apply our model to biomedical terminologies and predict that they are missing the vast majority (>90%) of the synonymous relationships they intend to document. Overall, our results expose the dramatic incompleteness of current biomedical thesauri and suggest the need for “next-generation,” high-coverage lexical terminologies.
Automated systems that extract and integrate information from the research literature have become common in biomedicine. As the same meaning can be expressed in many distinct but synonymous ways, access to comprehensive thesauri may enable such systems to maximize their performance. Here, we establish the importance of synonymy for a specific text-mining task (named-entity normalization), and we suggest that current thesauri may be woefully inadequate in their documentation of this linguistic phenomenon. To test this claim, we develop a model for estimating the amount of missing synonymy. We apply our model to both biomedical terminologies and general-English thesauri, predicting massive amounts of missing synonymy for both lexicons. Furthermore, we verify some of our predictions for the latter domain through “crowd-sourcing.” Overall, our work highlights the dramatic incompleteness of current biomedical thesauri, and to mitigate this issue, we propose the creation of “living” terminologies, which would automatically harvest undocumented synonymy and help smart machines enrich biomedicine.
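The estimation problem behind this work can be illustrated with the classic Lincoln–Petersen capture–recapture estimator: treat two independently compiled thesauri as two "captures" of the same underlying population of synonym pairs, and infer the population size from their overlap. The paper's probabilistic model accounts for construction biases that this simple sketch ignores, and all numbers below are hypothetical.

```python
def lincoln_petersen(n1, n2, n_both):
    """Estimate the total population size from two independent samples:
    n1 and n2 items each, with n_both items appearing in both samples."""
    if n_both == 0:
        raise ValueError("samples must overlap for the estimate to exist")
    return n1 * n2 / n_both

# Hypothetical example: two thesauri documenting 5,000 and 4,000 synonym
# pairs, sharing only 400, imply roughly 50,000 true pairs -- so most
# synonymy would be undocumented by either resource.
total = lincoln_petersen(5000, 4000, 400)
documented_union = 5000 + 4000 - 400
undocumented_fraction = 1 - documented_union / total
```

The smaller the overlap relative to each sample, the larger the inferred hidden population, which is the intuition behind predicting massive undocumented synonymy from sparsely overlapping thesauri.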
DiseaseConnect (http://disease-connect.org) is a web server for the analysis and visualization of comprehensive knowledge on mechanism-based disease connectivity. The traditional disease classification system groups diseases with similar clinical symptoms and phenotypic traits. Thus, diseases with entirely different pathologies can be grouped together, leading to similar treatment designs. Such problems could be avoided if diseases were classified based on their molecular mechanisms. Connecting diseases with similar pathological mechanisms could inspire novel strategies for the effective repositioning of existing drugs and therapies. Although several studies have attempted to generate disease connectivity networks, they have not yet utilized the enormous and rapidly growing public repositories of disease-related omics data and literature, two primary resources capable of providing insights into disease connections at an unprecedented level of detail. DiseaseConnect, the first public web server of its kind, integrates comprehensive omics and literature data, including a large amount of gene expression data, the Genome-Wide Association Studies catalog, and text-mined knowledge, to discover disease–disease connectivity via common molecular mechanisms. Moreover, clinical comorbidity data and a comprehensive compilation of known drug–disease relationships are additionally utilized to advance understanding of the disease landscape and to facilitate the mechanism-based development of new drug treatments.
Many factors affect the risks for neurodevelopmental maladies such as autism spectrum disorders (ASD) and intellectual disability (ID). To compare environmental, phenotypic, socioeconomic, and state-policy factors in a unified geospatial framework, we analyzed the spatial incidence patterns of ASD and ID using an insurance claims dataset covering nearly one third of the US population. Following epidemiologic evidence, we used the rate of congenital malformations of the reproductive system as a surrogate for environmental exposure of parents to unmeasured developmental risk factors, including toxins. Adjusted for gender, ethnic, socioeconomic, and geopolitical factors, the ASD incidence rates were strongly linked to population-normalized rates of congenital malformations of the reproductive system in males (an increase in ASD incidence by 283% for every percent increase in incidence of malformations, 95% CI: [91%, 576%], p < 6×10⁻⁵). Such congenital malformations were only marginally significant for ID (94% increase, 95% CI: [1%, 250%], p = 0.0384). Other congenital malformations in males (excluding those affecting the reproductive system) appeared to significantly affect both phenotypes: 31.8% ASD rate increase (CI: [12%, 52%], p < 6×10⁻⁵), and 43% ID rate increase (CI: [23%, 67%], p < 6×10⁻⁵). Furthermore, the state-mandated rigor of diagnosis of ASD by a pediatrician or clinician for consideration in the special education system was predictive of a considerable decrease in ASD and ID incidence rates (98.6%, CI: [28%, 99.99%], p = 0.02475 and 99%, CI: [68%, 99.99%], p = 0.00637, respectively). Thus, the observed spatial variability of both ID and ASD rates is associated with environmental and state-level regulatory factors; the magnitude of influence of compound environmental predictors was approximately three times greater than that of state-level incentives.
The estimated county-level random effects exhibited marked spatial clustering, strongly indicating the existence of as-yet-unidentified localized factors driving apparent disease incidence. Finally, we found that the rates of ASD and ID at the county level were weakly but significantly correlated (Pearson product-moment correlation 0.0589, p = 0.00101), while for females the correlation was much stronger (0.197, p < 2.26×10⁻¹⁶).
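Effect sizes such as "an increase in ASD incidence by 283% for every percent increase in incidence of malformations" arise from exponentiating a regression coefficient. The sketch below shows the arithmetic under the assumption of a log-linear (Poisson-type) incidence model; the coefficient value is hypothetical, chosen only to reproduce the reported scale.

```python
import math

# Hypothetical log-rate coefficient: in a log-linear model, a one-unit
# (one percentage point) increase in the malformation rate multiplies
# the expected ASD incidence rate by exp(beta).
beta = 1.343
rate_ratio = math.exp(beta)           # multiplicative rate ratio
pct_increase = (rate_ratio - 1) * 100  # expressed as a percent increase
```

A rate ratio of ~3.83 corresponds to a 283% increase, which is how a single fitted coefficient translates into the headline percentage (and its confidence interval, via the interval on beta).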
Disease clusters are defined as geographically compact areas where a particular disease, such as a cancer, shows a significantly increased rate. It is presently unclear how common such clusters are for neurodevelopmental maladies, such as autism spectrum disorders (ASD) and intellectual disability (ID). In this study, examining data for one third of the whole US population, the authors show that (1) ASD and ID display strong clustering across US counties; (2) counties with high ASD rates also appear to have high ID rates; and (3) the spatial variation of both phenotypes appears to be driven by environmental and, to a lesser extent, economic incentives at the state level.
Current drug discovery is impossible without sophisticated modeling and computation. In this review we outline previous advances in computational biology and, by tracing the steps involved in pharmaceutical development, explore a range of novel, high-value opportunities for computational innovation in modeling the biological process of disease and the social process of drug discovery. These opportunities include text mining for new drug leads, modeling molecular pathways and predicting the efficacy of drug cocktails, analyzing genetic overlap between diseases and predicting alternative drug use. Computation can also be used to model research teams and innovative regions and to estimate the value of academy–industry links for scientific and human benefit. Attention to these opportunities could promise punctuated advance and will complement the well-established computational work on which drug discovery currently relies.
The theory of probability is widely used in biomedical research for data analysis and modelling. In previous work, the probabilities of research hypotheses have been recorded as experimental metadata. The ontology HELO is designed to support probabilistic reasoning and provides semantic descriptors for reporting on research that involves operations with probabilities. HELO explicitly links research statements such as hypotheses, models, laws, and conclusions to the associated probabilities of these statements being true. HELO enables the explicit semantic representation and accurate recording of probabilities in hypotheses, as well as the inference methods used to generate and update those hypotheses. We demonstrate the utility of HELO on three worked examples: changes in the probability of the hypothesis that sirtuins regulate human life span; changes in the probability of hypotheses about gene functions in the S. cerevisiae aromatic amino acid pathway; and the use of active learning in drug design (quantitative structure–activity relationship learning), where a strategy for selecting compounds with the highest probability of improving on the best known compound was used. HELO is open source and available at https://github.com/larisa-soldatova/HELO.
ontology; knowledge representation; probabilistic reasoning
Computational models in biomedicine rely on biological and clinical assumptions. The selection of these assumptions contributes substantially to modeling success or failure. Assumptions used by experts at the cutting edge of research, however, are rarely explicitly described in scientific publications. One can directly collect and assess some of these assumptions through interviews and surveys. Here we investigate diversity in expert views about a complex biological phenomenon, the process of cancer metastasis. We harvested individual viewpoints from 28 experts in clinical and molecular aspects of cancer metastasis and summarized them computationally. While experts predominantly agreed on the definition of individual steps involved in metastasis, no two expert scenarios for metastasis were identical. We computed the probability that any two experts would disagree on k or fewer metastatic stages and found that any two randomly selected experts are likely to disagree about several assumptions. Considering the probability that two or more of these experts review an article or a proposal about metastatic cascades, the probability that they will disagree with elements of a proposed model approaches 1. This diversity of conceptions has clear consequences for advance and deadlock in the field. We suggest that strong, incompatible views are common in biomedicine but largely invisible to biomedical experts themselves. We built a formal Markov model of metastasis to encapsulate expert convergence and divergence regarding the entire sequence of metastatic stages. This model revealed stages of greatest disagreement, including the points at which cancer enters and leaves the bloodstream. The model provides a formal probabilistic hypothesis against which researchers can evaluate data on the process of metastasis. This would enable subsequent improvement of the model through Bayesian probabilistic update. 
Practically, we propose that model assumptions and hunches be harvested systematically and made available for modelers and scientists.
Mathematical models and scientific theories fail not only from internal inconsistency, but also from the poor selection of basic assumptions. Assumptions in computational models of biomedicine are typically provided by scientists who interact directly with empirical data. If we seek to model the dynamics of cancer metastasis and ask experts regarding valid assumptions, how widely will they agree and on which assumptions? To answer this question, we queried 28 faculty-level experts about the progression of metastasis. We demonstrate an unexpected diversity of assumptions across experts leading to a striking lack of agreement over the basic stages and sequence of metastasis. We suggest a formal model and framework that builds on this diversity and enables researchers to evaluate divergent hypotheses about metastasis with experimental data. We conclude that modeling biomedical processes could be substantially improved by harvesting scientific assumptions and exposing them for formalization and experiment.
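The headline probability calculation ("any two randomly selected experts are likely to disagree") can be illustrated with a binomial sketch. This is a simplification of the paper's Markov model: it assumes a single, common per-stage agreement probability and independent stages, and the numbers are hypothetical.

```python
from math import comb

def p_disagree_at_most(n_stages, p_agree, k):
    """P(two experts disagree on at most k of n_stages), assuming each
    stage is agreed upon independently with probability p_agree."""
    q = 1 - p_agree
    return sum(comb(n_stages, d) * q**d * p_agree**(n_stages - d)
               for d in range(k + 1))

# With 10 metastatic stages and a generous 85% per-stage agreement
# (both hypothetical), the chance that two experts agree on every
# stage is already below 20%.
p_full_agreement = p_disagree_at_most(10, 0.85, 0)
```

Extending this to a panel of reviewers only sharpens the conclusion: the probability that at least one pair disagrees somewhere approaches 1 as the panel grows.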
The use of structured knowledge representations—ontologies and terminologies—has become standard in biomedicine. Definitions of ontologies vary widely, as do the values and philosophies that underlie them. In seeking to make these views explicit, we conducted and summarized interviews with a dozen leading ontologists. Their views clustered into three broad perspectives that we summarize as mathematics, computer code, and Esperanto. Ontology as mathematics puts the ultimate premium on rigor and logic, symmetry and consistency of representation across scientific subfields, and the inclusion of only established, non-contradictory knowledge. Ontology as computer code focuses on utility and cultivates diversity, fitting ontologies to their purpose. Like computer languages C++, Prolog, and HTML, the code perspective holds that diverse applications warrant custom designed ontologies. Ontology as Esperanto focuses on facilitating cross-disciplinary communication, knowledge cross-referencing, and computation across datasets from diverse communities. We show how these views align with classical divides in science and suggest how a synthesis of their concerns could strengthen the next generation of biomedical ontologies.
Hypotheses are now produced on an industrial scale by computers in biology: the annotation of a genome, for example, is essentially a large set of hypotheses generated by sequence-similarity programs, and robot scientists enable the full automation of a scientific investigation, including the generation and testing of research hypotheses.
This paper proposes a logically defined way of recording automatically generated hypotheses in a machine-amenable form. The proposed formalism allows the description of complete hypothesis sets as specified input and output for scientific investigations. The formalism supports the decomposition of research hypotheses into more specialised hypotheses where an application requires it. Hypotheses are represented in an operational way: it is possible to design an experiment to test them. The explicit formal description of research hypotheses promotes the explicit formal description of the results and conclusions of an investigation. The paper also proposes a framework for automated hypothesis generation. We demonstrate how the key components of the proposed framework are implemented in the Robot Scientist “Adam”.
A formal representation of automatically generated research hypotheses can help to improve the way humans produce, record, and validate research hypotheses.
Life scientists today cannot hope to read everything relevant to their research. Emerging text-mining tools can help by identifying topics and distilling statements from books and articles with increased accuracy. Researchers often organize these statements into ontologies, consistent systems of reality claims. Like scientific thinking and interchange, however, text-mined information (even when accurately captured) is complex, redundant, sometimes incoherent, and often contradictory: it is rooted in a mixture of only partially consistent ontologies. We review work that models scientific reason and suggest how computational reasoning across ontologies and the broader distribution of textual statements can assess the certainty of statements and the process by which statements become certain. With the emergence of digitized data regarding networks of scientific authorship, institutions, and resources, we explore the possibility of accounting for social dependences and cultural biases in reasoning models. Computational reasoning is starting to fill out ontologies and flag internal inconsistencies in several areas of bioscience. In the not too distant future, scientists may be able to use statements and rich models of the processes that produced them to identify underexplored areas, resurrect forgotten findings and ideas, deconvolute the spaghetti of underlying ontologies, and synthesize novel knowledge and hypotheses.
Biophysics; Computation; Computer Modeling; Drug Design; Epigenetics; Computational Biology; Information Cascade; Sociology of Science; Text Mining
BioPAX (Biological Pathway Exchange) is a standard language to represent biological pathways at the molecular and cellular level. Its major use is to facilitate the exchange of pathway data (http://www.biopax.org). Pathway data captures our understanding of biological processes, but its rapid growth necessitates development of databases and computational tools to aid interpretation. However, the current fragmentation of pathway information across many databases with incompatible formats presents barriers to its effective use. BioPAX solves this problem by making pathway data substantially easier to collect, index, interpret and share. BioPAX can represent metabolic and signaling pathways, molecular and genetic interactions and gene regulation networks. BioPAX was created through a community process. Through BioPAX, millions of interactions organized into thousands of pathways across many organisms, from a growing number of sources, are available. Thus, large amounts of pathway data are available in a computable form to support visualization, analysis and biological discovery.
pathway data integration; pathway database; standard exchange format; ontology; information system
A scientific ontology is a formal representation of knowledge within a domain, typically including central concepts, their properties, and relations. With the rise of computers and high-throughput data collection, ontologies have become essential to data mining and sharing across communities in the biomedical sciences. Powerful approaches exist for testing the internal consistency of an ontology, but not for assessing the fidelity of its domain representation. We introduce a family of metrics that describe the breadth and depth with which an ontology represents its knowledge domain. We then test these metrics using (1) four of the most common medical ontologies with respect to a corpus of medical documents and (2) seven of the most popular English thesauri with respect to three corpora that sample language from medicine, news, and novels. Here we show that our approach captures the quality of ontological representation and guides efforts to narrow the breach between ontology and collective discourse within a domain. Our results also demonstrate key features of medical ontologies, English thesauri, and discourse from different domains. Medical ontologies have a small intersection, as do English thesauri. Moreover, dialects characteristic of distinct domains vary strikingly as many of the same words are used quite differently in medicine, news, and novels. As ontologies are intended to mirror the state of knowledge, our methods to tighten the fit between ontology and domain will increase their relevance for new areas of biomedical science and improve the accuracy and power of inferences computed across them.
An ontology represents the concepts and their interrelation within a knowledge domain. Several ontologies have been developed in biomedicine, which provide standardized vocabularies to describe diseases, genes and gene products, physiological phenotypes, anatomical structures, and many other phenomena. Scientists use them to encode the results of complex experiments and observations and to perform integrative analysis to discover new knowledge. A remaining challenge in ontology development is how to evaluate an ontology's representation of knowledge within its scientific domain. Building on classic measures from information retrieval, we introduce a family of metrics including breadth and depth that capture the conceptual coverage and parsimony of an ontology. We test these measures using (1) four commonly used medical ontologies in relation to a corpus of medical documents and (2) seven popular English thesauri (ontologies of synonyms) with respect to text from medicine, news, and novels. Results demonstrate that both medical ontologies and English thesauri have a small overlap in concepts and relations. Our methods suggest efforts to tighten the fit between ontologies and biomedical knowledge.
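One simple way to instantiate a coverage ("breadth") metric of the kind described above is to measure the fraction of a corpus's token occurrences that fall under some ontology term. This is an illustrative proxy only, not the paper's exact definition, and the toy corpus below is invented; the full family of metrics also weighs depth and parsimony.

```python
from collections import Counter

def breadth(ontology_terms, corpus_tokens):
    """Fraction of corpus token occurrences covered by ontology terms
    (a simple coverage proxy; real metrics also handle multi-word
    concepts, synonyms, and the ontology's relational structure)."""
    counts = Counter(corpus_tokens)
    covered = sum(n for term, n in counts.items() if term in ontology_terms)
    return covered / sum(counts.values())

# Hypothetical mini-corpus and mini-ontology of symptom terms.
tokens = "fever cough fever fatigue rash cough fever".split()
ontology = {"fever", "cough"}
coverage = breadth(ontology, tokens)  # 5 of 7 token occurrences covered
```

Weighting by token frequency (rather than unique terms) rewards ontologies that cover the concepts a community actually uses most, which is the intuition behind fitting an ontology to collective discourse.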
We have generated and made publicly available two very large networks of molecular interactions: 49,493 mouse-specific and 52,518 human-specific interactions. These networks were generated through automated analysis of 368,331 full-text research articles and 8,039,972 article abstracts from the PubMed database, using the GeneWays system. Our networks cover a wide spectrum of molecular interactions, such as bind, phosphorylate, glycosylate, and activate; 207 of these interaction types occur more than 1,000 times in our unfiltered, multi-species data set. Because mouse and human genes are linked through an orthological relationship, human and mouse networks are amenable to straightforward, joint computational analysis. Using our newly generated networks and known associations between mouse genes and cerebellar malformation phenotypes, we predicted a number of new associations between genes and five cerebellar phenotypes (small cerebellum, absent cerebellum, cerebellar degeneration, abnormal foliation, and abnormal vermis). Using a battery of statistical tests, we showed that genes that are associated with cerebellar phenotypes tend to form compact network clusters. Further, we observed that cerebellar malformation phenotypes tend to be associated with highly connected genes. This tendency was stronger for developmental phenotypes and weaker for cerebellar degeneration.
We described and made publicly available the largest existing set of text-mined statements; we also presented its application to an important biological problem. We have extracted and purified two large molecular networks, one for humans and one for mouse. We characterized the data sets, described the methods we used to generate them, and presented a novel biological application of the networks to study the etiology of five cerebellum phenotypes. We demonstrated quantitatively that the development-related malformations differ in their system-level properties from degeneration-related genes. We showed that there is a high degree of overlap among the genes implicated in the developmental malformations, that these genes have a strong tendency to be highly connected within the molecular network, and that they also tend to be clustered together, forming a compact molecular network neighborhood. In contrast, the genes involved in malformations due to degeneration do not have a high degree of connectivity, are not strongly clustered in the network, and do not overlap significantly with the development related genes. In addition, taking into account the above-mentioned system-level properties and the gene-specific network interactions, we made highly confident predictions about novel genes that are likely also involved in the etiology of the analyzed phenotypes.
We constructed a large-scale functional network model in Drosophila melanogaster built around two key transcription factors involved in the process of embryonic segmentation. Analysis of the model allowed the identification of a new role for the ubiquitin E3 ligase complex factor SPOP. In Drosophila, the gene encoding SPOP is a target of segmentation transcription factors. Drosophila SPOP mediates degradation of the Jun-kinase phosphatase Puckered, thereby inducing TNF/Eiger-dependent apoptosis. In humans, we found that SPOP plays a conserved role in TNF-mediated JNK signaling and was highly expressed in 99% of clear cell renal cell carcinomas (RCC), the most prevalent form of kidney cancer. SPOP expression distinguished histological subtypes of RCC and facilitated identification of clear cell RCC as the primary tumor for metastatic lesions.
Tens of thousands of biomedical journals exist, and the deluge of new articles in the biomedical sciences is leading to information overload. Hence, there is much interest in text mining, the use of computational tools to enhance the human ability to parse and understand complex text.
Large-scale annotation efforts typically involve several experts who may disagree with each other. We propose an approach for modeling disagreements among experts that allows each annotation to be assigned a confidence value (i.e., the posterior probability that it is correct). Our approach allows computing a certainty level for individual annotations, given annotator-specific parameters estimated from data. We developed two probabilistic models for performing this analysis, compared these models using computer simulation, and tested each model's actual performance, based on a large data set generated by human annotators specifically for this study. We show that even in the worst-case scenario, when all annotators disagree, our approach allows us to significantly increase the probability of choosing the correct annotation. Along with this publication we make publicly available a corpus of 10,000 sentences annotated according to several cardinal dimensions that we have introduced in earlier work. The 10,000 sentences were all 3-fold annotated by a group of eight experts, while a 1,000-sentence subset was further 5-fold annotated by five new experts. While the presented data represent a specialized curation task, our modeling approach is general; most data annotation studies could benefit from our methodology.
Data annotation (manual data curation) tasks are at the very heart of modern biology. Experts performing curation obviously differ in their efficiency, attitude, and precision, but directly measuring their performance is not easy. We propose an experimental design schema and associated mathematical models with which to estimate annotator-specific correctness in large multi-annotator efforts. With these, we can compute confidence in every annotation, facilitating the effective use of all annotated data, even when annotations are conflicting. Our approach retains all annotations with computed confidence values, and provides more comprehensive training data for machine learning algorithms than approaches where only perfect-agreement annotations are used. We provide results of independent testing that demonstrate that our methodology works. We believe these models can be applied to and improve upon a wide variety of annotation tasks that involve multiple annotators.
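The core computation, a posterior probability that an annotation is correct given the annotators' votes, can be sketched with a naive-Bayes simplification of the annotator models described above: binary labels, independent annotators, and known per-annotator accuracies. All of these are simplifying assumptions and the numbers are illustrative.

```python
def posterior_correct(votes, accuracies, prior=0.5):
    """P(true label = 1 | binary votes), treating annotators as
    independent with known per-annotator accuracies."""
    p1, p0 = prior, 1.0 - prior
    for vote, acc in zip(votes, accuracies):
        # An annotator votes with the truth with probability acc.
        p1 *= acc if vote == 1 else 1.0 - acc
        p0 *= (1.0 - acc) if vote == 1 else acc
    return p1 / (p1 + p0)

# Even when one of three equally reliable annotators dissents,
# the majority label still carries substantial confidence.
confidence = posterior_correct([1, 1, 0], [0.9, 0.9, 0.9])
```

In practice the annotator-specific accuracies are themselves estimated from the multiply-annotated data, which is what lets conflicting annotations be retained with graded confidence rather than discarded.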
Motivation: Much current research in biomedical text mining is concerned with serving biologists by extracting certain information from scientific text. We note that there is no ‘average biologist’ client; different users have distinct needs. For instance, as noted in past evaluation efforts (BioCreative, TREC, KDD), database curators are often interested in sentences showing experimental evidence and methods. Conversely, lab scientists searching for known information about a protein may seek facts, typically stated with high confidence. Text-mining systems can target specific end users and become more effective if they first identify text regions rich in the type of scientific content that is of interest to the user, retrieve documents that have many such regions, and focus on fact extraction from these regions. Here, we study the ability to characterize and classify such text automatically. We have recently introduced a multi-dimensional categorization and annotation scheme, developed to be applicable to a wide variety of biomedical documents and scientific statements, while intended to support specific biomedical retrieval and extraction tasks.
Results: The annotation scheme was applied to a large corpus in a controlled effort by eight independent annotators, where three individual annotators independently tagged each sentence. We then trained and tested machine learning classifiers to automatically categorize sentence fragments based on the annotation. We discuss here the issues involved in this task, and present an overview of the results. The latter strongly suggest that automatic annotation along most of the dimensions is highly feasible, and that this new framework for scientific sentence categorization is applicable in practice.
What makes an article high impact?
Our analysis highlights common statistical features of high-impact articles; we also show how information flows among various publication types.
Reliable and comprehensive maps of molecular pathways are indispensable for guiding complex biomedical experiments. Such maps are typically assembled from myriad disparate research reports and are replete with inconsistencies due to variations in experimental conditions and/or errors. It is often an intractable task to manually verify internal consistency over a large collection of experimental statements. To automate large-scale reconciliation efforts, we propose a random-arcs-and-nodes model where both nodes (tissue-specific states of biological molecules) and arcs (interactions between them) are represented with random variables. We show how to obtain a non-contradictory model of a molecular network by computing the joint distribution for arc and node variables, and then apply our methodology to a realistic network, generating a set of experimentally testable hypotheses. This network, derived from an automated analysis of over 3,000 full-text research articles, includes genes that have been hypothetically linked to four neurological disorders: Alzheimer's disease, autism, bipolar disorder, and schizophrenia. We estimated that approximately 10% of the published molecular interactions are logically incompatible. Our approach can be directly applied to an array of diverse problems including those encountered in molecular biology, ecology, economics, politics, and sociology.
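The joint-distribution machinery is beyond a short sketch, but the kind of logical incompatibility being counted can be illustrated with a toy check over signed interaction statements. The data, gene names, and sign-based representation below are hypothetical simplifications; the actual model reconciles richer, tissue-specific node and arc variables.

```python
def find_contradictions(statements):
    """Return (source, target) pairs reported with incompatible signs
    (+1 = activates, -1 = inhibits) across different articles."""
    signs = {}
    conflicts = set()
    for src, dst, sign in statements:
        pair = (src, dst)
        if pair in signs and signs[pair] != sign:
            conflicts.add(pair)
        signs.setdefault(pair, sign)
    return sorted(conflicts)

# Hypothetical text-mined statements from different articles.
reports = [
    ("APP", "CASP3", +1),   # article A: APP activates CASP3
    ("APP", "CASP3", -1),   # article B: APP inhibits CASP3
    ("GSK3B", "TAU", +1),
]
conflicts = find_contradictions(reports)  # [("APP", "CASP3")]
```

Counting flagged pairs over the whole text-mined corpus is a crude analogue of the reported ~10% incompatibility estimate; the probabilistic model goes further by resolving such conflicts into a single non-contradictory network.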
Text-mining algorithms make mistakes in extracting facts from natural-language texts. In biomedical applications, which rely on text-mined data, it is critical to assess the quality of individual facts (the probability that each fact is correctly extracted) in order to resolve data conflicts and inconsistencies. Using a large set of almost 100,000 manually produced evaluations (most facts were independently reviewed more than once), we implemented and tested a collection of algorithms that mimic human evaluation of facts provided by an automated information-extraction system. The performance of our best automated classifiers closely approached that of our human evaluators (ROC score close to 0.95). Our hypothesis is that, were we to use a larger number of human experts to evaluate any given sentence, we could implement an artificial-intelligence curator that would perform the classification job at least as accurately as an average individual human evaluator. We illustrated our analysis by visualizing the predicted accuracy of the text-mined relations involving the term cocaine.
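An ROC score of the kind reported above can be computed directly from a classifier's confidence scores using the rank-sum (Mann-Whitney) formulation: the probability that a randomly chosen correctly extracted fact is scored higher than a randomly chosen incorrect one. A minimal sketch in Python; the labels and scores below are invented for illustration and this is not the authors' evaluation code:

```python
def roc_auc(labels, scores):
    """ROC AUC via the rank-sum formulation: the probability that a
    randomly chosen positive item receives a higher score than a
    randomly chosen negative one (ties count as half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: hypothetical classifier confidence that each
# extracted fact is correct (1 = correct per human review).
labels = [1, 1, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.75, 0.4, 0.95, 0.2, 0.1]
print(roc_auc(labels, scores))  # -> 0.9375
```

A score near 1.0 means the classifier almost always ranks correct extractions above incorrect ones; 0.5 corresponds to random ranking.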
Current automated approaches for extracting biologically important facts from scientific articles are imperfect: while being capable of efficient, fast, and inexpensive analysis of enormous quantities of scientific prose, they make errors. To emulate the human experts evaluating the quality of the automatically extracted facts, we have developed an artificial intelligence program (“a robotic curator”) that closely approaches human experts in the quality of distinguishing the correctly extracted facts from the incorrectly extracted ones.
While biomedical text mining is emerging as an important research area, practical results have proven difficult to achieve. We believe that an important first step towards more accurate text-mining lies in the ability to identify and characterize text that satisfies various types of information needs. We report here the results of our inquiry into properties of scientific text that have sufficient generality to transcend the confines of a narrow subject area, while supporting practical mining of text for factual information. Our ultimate goal is to annotate a significant corpus of biomedical text and train machine learning methods to automatically categorize such text along certain dimensions that we have defined.
We have identified five qualitative dimensions that we believe characterize a broad range of scientific sentences, and are therefore useful for supporting a general approach to text-mining: focus, polarity, certainty, evidence, and directionality. We define these dimensions and describe the guidelines we have developed for annotating text with regard to them.
To examine the effectiveness of the guidelines, twelve annotators independently annotated the same set of 101 sentences that were randomly selected from current biomedical periodicals. Analysis of these annotations shows 70–80% inter-annotator agreement, suggesting that our guidelines indeed present a well-defined, executable and reproducible task.
We present our guidelines defining a text annotation task, along with annotation results from multiple independently produced annotations, demonstrating the feasibility of the task. The annotation of a very large corpus of documents along these guidelines is currently ongoing. These annotations form the basis for the categorization of text along multiple dimensions, to support viable text mining for experimental results, methodology statements, and other forms of information. We are currently developing machine learning methods, to be trained and tested on the annotated corpus, that would allow for the automatic categorization of biomedical text along the general dimensions that we have presented. The guidelines in full detail, along with annotated examples, are publicly available.
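Inter-annotator agreement figures like the 70–80% reported above can be estimated as mean pairwise percent agreement: for every pair of annotators, the fraction of sentences on which they chose the same label, averaged over all pairs. A minimal illustrative sketch with invented labels, not the study's data or code:

```python
from itertools import combinations

def pairwise_agreement(annotations):
    """Mean pairwise agreement. `annotations` is a list of label
    sequences, one per annotator, all over the same items."""
    scores = []
    for a, b in combinations(annotations, 2):
        matches = sum(x == y for x, y in zip(a, b))
        scores.append(matches / len(a))
    return sum(scores) / len(scores)

# Toy example: three annotators labeling four sentences along a
# single hypothetical dimension (e.g. certainty).
labels = [
    ["high", "low", "high", "high"],
    ["high", "low", "low", "high"],
    ["high", "high", "low", "high"],
]
print(pairwise_agreement(labels))  # -> 0.666...
```

Raw percent agreement does not correct for chance; chance-corrected statistics such as Fleiss' kappa are often reported alongside it.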
Power-law distributions appear in numerous biological, physical and other contexts that seem fundamentally different. In biology, power laws have been claimed to describe the distributions of the connections of enzymes and metabolites in metabolic networks, the number of interaction partners of a given protein, the number of members in paralogous families, and other quantities. In network analysis, power laws imply evolution of the network with preferential attachment, i.e. a greater likelihood of nodes being added to pre-existing hubs. Exploring different types of evolutionary models to determine which of them lead to power-law distributions has the potential to reveal non-trivial aspects of genome evolution.
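As a quick illustration of preferential attachment, the following sketch (not drawn from the paper) grows a network in which each new node attaches to an existing node with probability proportional to that node's degree; hubs emerge and the degree distribution becomes heavy-tailed:

```python
import random

def preferential_attachment(n_nodes, seed=0):
    """Grow a network where each new node attaches to one existing
    node chosen with probability proportional to its degree."""
    rng = random.Random(seed)
    degree = [1, 1]   # start from a single edge between nodes 0 and 1
    # `stubs` holds one entry per edge endpoint, so a uniform draw
    # from it samples nodes proportionally to their degree.
    stubs = [0, 1]
    for new in range(2, n_nodes):
        target = rng.choice(stubs)
        degree.append(1)
        degree[target] += 1
        stubs.extend([new, target])
    return degree

degrees = preferential_attachment(5000)
# A few hubs accumulate degrees far above the typical node's,
# the hallmark of the heavy-tailed (power-law) regime.
print(max(degrees), sorted(degrees)[len(degrees) // 2])
```

In contrast, attaching uniformly at random (ignoring degree) yields an exponentially decaying degree distribution with no comparable hubs.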
A simple model of evolution of the domain composition of proteomes was developed, with the following elementary processes: i) domain birth (duplication with divergence), ii) death (inactivation and/or deletion), and iii) innovation (emergence from non-coding or non-globular sequences or acquisition via horizontal gene transfer). This formalism can be described as a birth, death and innovation model (BDIM). The formulas for equilibrium frequencies of domain families of different size and the total number of families at equilibrium are derived for a general BDIM. All asymptotics of equilibrium frequencies of domain families possible for the given type of models are found, and their appearance depending on model parameters is investigated. It is proved that power-law asymptotics appear if and only if the model is balanced, i.e. domain duplication and deletion rates are asymptotically equal up to the second order. It is further proved that any power-law asymptotic with degree other than -1 can appear only if the duplication/deletion rates depend on the size of a domain family. Specific cases of BDIMs, namely simple, linear, polynomial and rational models, are considered in detail, and the distributions of the equilibrium frequencies of domain families of different size are determined for each case. We apply the BDIM formalism to the analysis of the domain family size distributions in prokaryotic and eukaryotic proteomes and show an excellent fit between these empirical data and a particular form of the model, the second-order balanced linear BDIM. Calculation of the parameters of these models suggests surprisingly high innovation rates, comparable to the total domain birth (duplication) and elimination rates, particularly for prokaryotic genomes.
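For intuition about the balance condition, in the simplest BDIM with size-independent per-domain rates the equilibrium frequencies take the form f_i proportional to theta**i / i, where theta is the ratio of the per-domain duplication rate to the deletion rate. The balanced case theta = 1 gives f_i proportional to 1/i, the power law with degree -1; theta < 1 instead decays exponentially. A small sketch tabulating this behavior, an illustration under this assumed closed form rather than the authors' code:

```python
def simple_bdim_equilibrium(theta, n_max):
    """Unnormalized equilibrium frequencies f_i of domain-family
    sizes i = 1..n_max in the simple BDIM, assuming the closed
    form f_i ~ theta**i / i (theta = duplication/deletion ratio)."""
    return [theta**i / i for i in range((1), n_max + 1)]

balanced = simple_bdim_equilibrium(1.0, 1000)    # theta = 1
unbalanced = simple_bdim_equilibrium(0.9, 1000)  # theta < 1

# Balanced model: doubling the family size (100 -> 200) halves the
# frequency, the signature of a power law with degree -1.
print(balanced[199] / balanced[99])    # -> 0.5
# Unbalanced model: the same ratio is tiny (exponential decay).
print(unbalanced[199] / unbalanced[99])
```

This is why any power-law degree other than -1 forces size-dependent rates: with size-independent rates the only non-exponential regime is the balanced 1/i law.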
We show that a straightforward model of genome evolution, which does not explicitly include selection, is sufficient to explain the observed distributions of domain family sizes, in which power laws appear as asymptotics. However, for the model to be compatible with the data, there must be a precise balance between domain birth, death and innovation rates, and this balance is likely to be maintained by selection. The developed approach is oriented toward a mathematical description of the evolution of the domain composition of proteomes, but a simple reformulation could be applied to models of other evolving networks with preferential attachment.