Identifying genetic variants that affect drug response or play a role in disease is an important task for clinicians and researchers. Before individual variants can be explored efficiently for effect on drug response or disease relationships, specific candidate genes must be identified. While many methods rank candidate genes through the use of sequence features and network topology, only a few exploit the information contained in the biomedical literature. In this work, we train and test a classifier on known pharmacogenes from PharmGKB; the resulting classifier predicts pharmacogenes on a genome-wide scale using only Gene Ontology annotations and simple features mined from the biomedical literature. The classifier achieves an F-measure of 0.86 and an AUC of 0.860. The top 10 predicted genes are analyzed. Additionally, a set of enriched pharmacogenic Gene Ontology concepts is produced.
Many colleges and universities across the globe now offer bachelor's, master's, and doctoral degrees, along with certificate programs, in bioinformatics. While there is some consensus surrounding curricular competencies, programs vary greatly in their core foci, with some leaning heavily toward the biological sciences and others toward quantitative areas. This allows prospective students to choose a program that best fits their interests and career goals. In the digital age, most scientific fields are facing an enormous growth of data, and as a consequence, the goals and challenges of bioinformatics are rapidly changing; this requires that bioinformatics education also change. In this workshop, we seek to ascertain current trends in bioinformatics education by asking the question, “What are the core competencies all bioinformaticians should have at the end of their training, and how successful have programs been in placing students in desired careers?”
Ontological concepts are useful for many different biomedical tasks. Concepts are difficult to recognize in text due to a disconnect between what is captured in an ontology and how the concepts are expressed in text. There are many recognizers for specific ontologies, but a general approach for concept recognition is an open problem.
Three dictionary-based systems (MetaMap, NCBO Annotator, and ConceptMapper) are evaluated on eight biomedical ontologies in the Colorado Richly Annotated Full-Text (CRAFT) Corpus. Over 1,000 parameter combinations are examined, and best-performing parameters for each system-ontology pair are presented.
Baselines for concept recognition by three systems on eight biomedical ontologies are established (F-measures range from 0.14 to 0.83). Of the three systems tested, ConceptMapper is generally the best-performing; it produces the highest F-measure on seven of the eight ontologies. Default parameters are not ideal for most systems on most ontologies; by changing parameters, F-measure can be increased by up to 0.4. In addition to the best-performing parameters, suggestions for choosing parameters based on ontology characteristics are presented.
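The kind of parameter sweep summarized above can be sketched as follows. This is only an illustration, not the evaluation code used in the study: the parameter names in the usage note and the `evaluate` callback are hypothetical, and the sketch assumes the balanced F-measure (the harmonic mean of precision and recall) computed from true-positive, false-positive and false-negative counts.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple


def f_measure(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-measure from annotation counts; beta=1 gives the balanced F1."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def best_parameters(
    grid: Dict[str, List],
    evaluate: Callable[[dict], Tuple[int, int, int]],
) -> Tuple[Optional[dict], float]:
    """Try every combination in `grid` and keep the one with the highest F.

    `evaluate` is a hypothetical callback that runs a recognizer with one
    parameter setting and returns (tp, fp, fn) against a gold standard.
    """
    names = sorted(grid)
    best_setting, best_score = None, -1.0
    for values in product(*(grid[name] for name in names)):
        setting = dict(zip(names, values))
        score = f_measure(*evaluate(setting))
        if score > best_score:
            best_setting, best_score = setting, score
    return best_setting, best_score
```

For example, a grid such as `{"case_sensitive": [True, False], "stemmer": ["none", "porter"]}` (illustrative names only) yields four settings, each scored once; real sweeps over a thousand combinations follow the same pattern.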
Though the annotation of digital artifacts with metadata has a long history, the bulk of that work focuses on the association of single terms or concepts to single targets. As annotation efforts expand to capture more complex information, annotations will need to be able to refer to knowledge structures formally defined in terms of more atomic knowledge structures. Existing provenance efforts in the Semantic Web domain primarily focus on tracking provenance at the level of whole triples and do not provide enough detail to track how individual triple elements of annotations were derived from triple elements of other annotations.
We present a task- and domain-independent ontological model for capturing annotations and their linkage to their denoted knowledge representations, which can be singular concepts or more complex sets of assertions. We have implemented this model as an extension of the Information Artifact Ontology in OWL and made it freely available, and we show how it can be integrated with several prominent annotation and provenance models. We present several application areas for the model, ranging from linguistic annotation of text to the annotation of disease-associations in genome sequences.
With this model, progressively more complex annotations can be composed from other annotations, and the provenance of compositional annotations can be represented at the annotation level or at the level of individual elements of the RDF triples composing the annotations. This in turn allows for progressively richer annotations to be constructed from previous annotation efforts, the precise provenance recording of which facilitates evidence-based inference and error tracking.
Ontology; Conceptual data modeling; Annotation; Markup; Provenance; OWL; RDF
The AMIA biomedical informatics (BMI) core competencies have been designed to support and guide graduate education in BMI, the core scientific discipline underlying the breadth of the field's research, practice, and education. The core definition of BMI adopted by AMIA specifies that BMI is ‘the interdisciplinary field that studies and pursues the effective uses of biomedical data, information, and knowledge for scientific inquiry, problem solving and decision making, motivated by efforts to improve human health.’ Application areas range from bioinformatics to clinical and public health informatics and span the spectrum from the molecular to population levels of health and biomedicine. The shared core informatics competencies of BMI draw on the practical experience of many specific informatics sub-disciplines. The AMIA BMI analysis highlights the central shared set of competencies that should guide curriculum design and that graduate students should be expected to master.
President and CEO; preparedness; wireless; preferences; population health; primary care; collaborative technologies; knowledge representations; knowledge acquisition and knowledge management; controlled terminologies and vocabularies; ontologies; AMIA
This study investigated the utility of applying advanced computational techniques to large-scale genome-based data to identify novel genes that govern murine pancreatic development.
An expression dataset for mouse pancreatic development was complemented with “Hanalyzer” (high-throughput data analyzer) to identify and prioritize novel genes. Quantitative real-time polymerase chain reaction, in situ hybridization and immunohistochemistry were used to validate selected genes.
Four new genes whose roles in the development of murine pancreas have not previously been established were identified: Cystathionine beta synthase (Cbs), Meis homeobox 1, Growth factor independent 1 and Aldehyde dehydrogenase 18 family, member A1. Their temporal expression during development was documented. Cbs was localized in the cytoplasm of the tip cells of the epithelial cords of the undifferentiated progenitor cells at E12.5 and was co-expressed with the Pancreatic and duodenal homeobox 1 and Pancreas specific transcription factor, 1a positive cells. In the adult pancreas, Cbs was localized primarily within the acinar compartment.
In silico analysis of high-throughput microarray data in combination with background knowledge about genes provides an additional reliable method of identifying novel genes. To our knowledge, the expression and localization of Cbs has not been previously documented during mouse pancreatic development.
Pancreas; Development; Microarray; Bioinformatics; Hanalyzer
Hepatic oxidative stress and subsequent lipid peroxidation are well-recognized consequences of sustained ethanol consumption. The covalent adduction of nucleophilic amino acid side-chains by lipid electrophiles is significantly increased in patients with alcoholic liver disease (ALD); a global assessment of in vivo protein targets and the consequences of these modifications, however, has not been conducted. In this report, we describe identification of novel protein targets for covalent adduction in a 6-week murine model for ALD. Ethanol-fed mice displayed a 2-fold increase in hepatic TBARS while immunohistochemical analysis for the reactive aldehydes 4-hydroxynonenal (4-HNE), 4-oxononenal (4-ONE), acrolein (ACR) and malondialdehyde (MDA) revealed a marked increase in the staining of modified proteins in the ethanol-treated mice. Increased protein carbonyl content was confirmed utilizing subcellular fractionation of liver homogenates followed by biotin-tagging through hydrazide chemistry, where approximately a 2-fold increase in modified proteins was observed in microsomal and cytosolic fractions. To determine targets of protein carbonylation, a secondary hydrazide method coupled to a highly sensitive 2-dimensional liquid chromatography tandem mass spectrometry (2D LC-MS/MS or MuDPIT) technique was utilized. Our results have identified 414 protein targets for modification by reactive aldehydes in ALD. The presence of novel in vivo sites of protein modification by 4-HNE (2), 4-ONE (4) and ACR (2) was also confirmed in our data set. While the precise impact of protein carbonylation in ALD remains unknown, a bioinformatic analysis of the data set has revealed key pathways associated with disease progression, including fatty acid metabolism, drug metabolism, oxidative phosphorylation and the TCA cycle. These data suggest a major role for aldehyde adduction in the pathogenesis of ALD.
ROS; ALD; 4-HNE; 4-hydroxynonenal; 4-ONE; 4-oxononenal; acrolein; MDA; malondialdehyde; MuDPIT; 2D-LC-MS/MS; DAVID; protein adducts; protein carbonyl; TBARS; lipid peroxidation; electrophile; biotin hydrazide; in vivo adduct; GRP78; bioinformatics; steatosis; fatty liver; ethanol
Adipose tissue located in the viscera is considered to be functionally and metabolically different from that found in the subcutaneous depot. However, subcutaneous adipose tissue in generalized regions is considered to be homogeneous in nature. Affymetrix GeneChip Human Exon 1.0 ST Arrays were used to determine differential gene expression in four subcutaneous adipose depots (upper abdomen, lower abdomen, flank and hip) in normal weight women. A total of 2890/24,409 transcripts were differentially expressed between all sites. When comparing the hip and flank to the lower abdomen, 248 and 83 genes were differentially expressed, respectively. When comparing the hip and flank to the upper abdomen, 2480 and 79 genes were differentially expressed, respectively. No genes were significantly different when the lower abdomen was compared to the upper abdomen and the hip to the flank. Genes involved in the complement and coagulation cascades and immune responses showed increased expression in the lower abdomen compared to the flank. In addition, two genes involved in the complement and coagulation cascade, CR1 and C7, were expressed more highly in the lower abdomen compared to the hip. Genes involved in basic biochemical metabolism including insulin signaling, the urea cycle, glutamate metabolism, arginine and proline metabolism and aminosugar metabolism had higher expression in the lower abdomen compared to the hip. These results in normal weight healthy women provide a new perspective on regional differences in subcutaneous adipose tissue biology that may have pathophysiologic implications when adiposity increases.
Text mining for translational bioinformatics is a new field with tremendous research potential. It is a subfield of biomedical natural language processing that concerns itself directly with the problem of relating basic biomedical research to clinical practice, and vice versa. Applications of text mining fall both into the category of T1 translational research—translating basic science results into new interventions—and T2 translational research, or translational research for public health. Potential use cases include better phenotyping of research subjects, and pharmacogenomic research. A variety of methods for evaluating text mining applications exist, including corpora, structured test suites, and post hoc judging. Two basic principles of linguistic structure are relevant for building text mining applications. One is that linguistic structure consists of multiple levels. The other is that every level of linguistic structure is characterized by ambiguity. There are two basic approaches to text mining: rule-based, also known as knowledge-based; and machine-learning-based, also known as statistical. Many systems are hybrids of the two approaches. Shared tasks have had a strong effect on the direction of the field. Like all translational bioinformatics software, text mining software for translational bioinformatics can be considered health-critical and should be subject to the strictest standards of quality assurance and software testing.
The biomedical text mining community has focused on developing techniques to automatically extract important relations between biological components and semantic events involving genes or proteins from literature. In this paper, we propose a novel approach for mining relations and events in the biomedical literature using approximate subgraph matching. Extraction of such knowledge is performed by searching for an approximate subgraph isomorphism between key contextual dependencies and input sentence graphs. Our approach significantly increases the chance of retrieving relations or events encoded within complex dependency contexts by introducing error tolerance into the graph matching process, while maintaining the extraction precision at a high level. When evaluated on practical tasks, it achieves a 51.12% F-score in extracting nine types of biological events on the GE task of the BioNLP-ST 2011 and an 84.22% F-score in detecting protein-residue associations. The performance is comparable to the reported systems across these tasks, and thus demonstrates the generalizability of our proposed approach.
In this Commentary, we describe a cryptographic method for returning research results to individuals who participate in clinical studies. Controlled use of this method, which relaxes the typical anonymization guarantee, can ensure that clinically actionable results reach participants while also addressing most privacy concerns.
Advances in concept recognition and natural language parsing have led to the development of various tools that enable the identification of biomedical entities and relationships between them in text. The aim of the Genotype-Phenotype-Drug Relationship Extraction from Text workshop (or GPD-Rx workshop) is to examine the current state of the art and discuss the next steps for making the extraction of relationships between biomedical entities integral to the curation and knowledge management workflow in Pharmacogenomics. The workshop will focus particularly on the extraction of Genotype-Phenotype, Genotype-Drug, and Phenotype-Drug relationships that are of interest to Pharmacogenomics. Extracting and structuring such text-mined relationships is key to supporting the evaluation and validation of multiple hypotheses that emerge from high-throughput translational studies spanning multiple measurement modalities. In order to advance this agenda, it is essential that existing relationship extraction methods be compared to one another and that a community-wide benchmark corpus emerge, against which future methods can be compared. The workshop aims to bring together researchers working on the automatic or semi-automatic extraction of relationships between biomedical entities from research literature in order to identify the key groups interested in creating such a benchmark.
NLP; Pharmacogenomics; Entity Recognition; Event Extraction; Genotype-Phenotype-Drug Relationships
Alcoholic liver disease (ALD) is a prominent cause of morbidity and mortality in the United States. Alterations in protein folding occur in numerous disease states, including ALD. The endoplasmic reticulum (ER) is the primary site of post-translational modifications (PTM) within the cell. Glycosylation, the most abundant PTM, affects protein stability, structure, localization and activity. Decreases in hepatic glycosylation machinery have been observed in rodent models of ALD, but specific protein targets have not been identified. Utilizing two-dimensional gel electrophoresis and liquid chromatography tandem mass spectrometry, glycoproteins were identified in hepatic microsomal fractions from control and ethanol-fed mice. This study reports for the first time a global decrease in ER glycosylation. Additionally, the identification of 30 glycoproteins within this fraction elucidates pathway-specific alterations in ALD impaired glycosylation. Among the identified proteins, triacylglycerol hydrolase (TGH) is positively affected by glycosylation, showing increased activity following the addition of sugar moieties. Impaired TGH activity is associated with increased cellular storage of lipids and provides a potential mechanism for the observed pathologies associated with ALD.
Alcoholic liver disease; glycoproteomics; UPR; ER stress; 4-HNE; DAVID
The heterogeneous and chaotic nature of osteosarcoma has confounded accurate molecular classification, prognosis, and prediction for this tumor. The occurrence of spontaneous osteosarcoma is largely confined to humans and dogs. While the clinical features are remarkably similar in both species, the organization of dogs into defined breeds provides a more homogeneous genetic background that may increase the likelihood of uncovering molecular subtypes for this complex disease. We thus hypothesized that molecular profiles derived from canine osteosarcoma would aid in molecular subclassification of this disease when applied to humans. To test the hypothesis, we performed genome-wide gene expression profiling in a cohort of dogs with osteosarcoma, primarily from high-risk breeds. To further reduce inter-sample heterogeneity, we assessed tumor-intrinsic properties through use of an extensive panel of osteosarcoma-derived cell lines. We observed strong differential gene expression that segregated samples into two groups with differential survival probabilities. Groupings were characterized by the inversely correlated expression of genes associated with G2/M transition and DNA damage checkpoint and microenvironment-interaction categories. This signature was preserved in data from whole tumor samples of three independent dog osteosarcoma cohorts, with stratification into the two expected groups. Significantly, this restricted signature partially overlapped a previously defined, predictive signature for soft tissue sarcomas, and it unmasked orthologous molecular subtypes and their corresponding natural histories in five independent data sets from human patients with osteosarcoma. Our results indicate that the narrower genetic diversity of dogs can be utilized to group complex human osteosarcoma into biologically and clinically relevant molecular subtypes. This in turn may enhance prognosis and prediction, and identify relevant therapeutic targets.
osteosarcoma; gene expression profiling; cell cycle; DNA damage checkpoint; tumor microenvironment; canine
We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for performing sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus.
Many biomedical natural language processing systems demonstrated large differences between their previously published results and their performance on the CRAFT corpus when tested with the publicly available models or rule sets. Trainable systems differed widely with respect to their ability to build high-performing models based on this data.
The finding that some systems were able to train high-performing models based on this corpus is additional evidence, beyond high inter-annotator agreement, that the quality of the CRAFT corpus is high. The overall poor performance of various systems indicates that considerable work needs to be done to enable natural language processing systems to work well when the input is full-text journal articles. The CRAFT corpus provides a valuable resource to the biomedical natural language processing community for evaluation and training of new models for biomedical full text publications.
Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text.
This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement.
As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. The corpus, annotation guidelines, and other associated resources are freely available at http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml.
Exact analytic expressions are developed for the average power of the Benjamini and Hochberg false discovery control procedure. The result is based on explicit computation of the joint probability distribution of the total number of rejections and the number of false rejections, and expressed in terms of the cumulative distribution functions of the p-values of the hypotheses. An example of analytic evaluation of the average power is given. The result is confirmed by numerical experiments and applied to a meta-analysis of three clinical studies in mammography.
hypothesis testing; multiple comparisons; false discovery; distribution of rejections; meta-analysis
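For readers unfamiliar with the procedure whose power is analyzed above, the Benjamini and Hochberg step-up rule can be sketched in a few lines. This is a generic illustration rather than the authors' code, and it does not reproduce their analytic expressions; `power_in_one_run` assumes that average power means the expected fraction of false null hypotheses that are rejected, which would then be estimated by averaging this quantity over many simulated runs.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], q: float = 0.05) -> List[bool]:
    """Return a reject (True) / retain (False) decision for each p-value."""
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the hypotheses, ordered by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the LARGEST rank k (1-based) with p_(k) <= k * q / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values.
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


def power_in_one_run(reject: Sequence[bool], is_alternative: Sequence[bool]) -> float:
    """Fraction of false nulls rejected in a single run (assumed definition)."""
    n_alt = sum(1 for a in is_alternative if a)
    if n_alt == 0:
        return 0.0
    hits = sum(1 for r, a in zip(reject, is_alternative) if r and a)
    return hits / n_alt
```

Note that the rule is step-up: a p-value that misses its own threshold can still be rejected if a larger p-value further down the sorted list meets its threshold.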
The contents of parentheses in biomedical text have many potential uses in text mining applications. However, making use of them requires the ability to determine what class of contents they are. A system that automatically classifies parenthesized text into one of 20 categories is presented and evaluated here. It performs at a micro-averaged accuracy of 68% and a macro-averaged accuracy of 60% on an annotated corpus. The application is available as a Java class and as a Perl module.
The etiology of hemangiosarcoma remains incompletely understood. Its common occurrence in dogs suggests predisposing factors favor its development in this species. These factors could represent a constellation of heritable characteristics that promote transformation events and/or facilitate the establishment of a microenvironment that is conducive for survival of malignant blood vessel-forming cells. The hypothesis for this study was that characteristic molecular features distinguish hemangiosarcoma from non-malignant endothelial cells, and that such features are informative for the etiology of this disease.
We first investigated mutations of VHL and Ras family genes that might drive hemangiosarcoma by sequencing tumor DNA and mRNA (cDNA). Protein expression was examined using immunostaining. Next, we evaluated genome-wide gene expression profiling using the Affymetrix Canine 2.0 platform as a global approach to test the hypothesis. Data were evaluated using routine bioinformatics and validation was done using quantitative real time RT-PCR.
Each of 10 tumor and four non-tumor samples analyzed had wild type sequences for these genes. At the genome wide level, hemangiosarcoma cells clustered separately from non-malignant endothelial cells based on a robust signature that included genes involved in inflammation, angiogenesis, adhesion, invasion, metabolism, cell cycle, signaling, and patterning. This signature did not simply reflect a cancer-associated angiogenic phenotype, as it also distinguished hemangiosarcoma from non-endothelial, moderately to highly angiogenic bone marrow-derived tumors (lymphoma, leukemia, osteosarcoma).
The data show that inflammation and angiogenesis are important processes in the pathogenesis of vascular tumors, but a definitive ontogeny of the cells that give rise to these tumors remains to be established. The data do not yet distinguish whether functional or ontogenetic plasticity creates this phenotype, although they suggest that cells which give rise to hemangiosarcoma modulate their microenvironment to promote tumor growth and survival. We propose that the frequent occurrence of canine hemangiosarcoma in defined dog breeds, as well as its similarity to homologous tumors in humans, offers unique models to solve the dilemma of stem cell plasticity and whether angiogenic endothelial cells and hematopoietic cells originate from a single cell or from distinct progenitor cells.
An increase in work on the full text of journal articles and the growth of PubMedCentral have the potential to create a major paradigm shift in how biomedical text mining is done. However, there has so far been no comprehensive characterization of how the bodies of full text journal articles differ from the abstracts that have been the subject of most biomedical text mining research to date.
We examined the structural and linguistic aspects of abstracts and bodies of full text articles, the performance of text mining tools on both, and the distribution of a variety of semantic classes of named entities between them. We found marked structural differences, with longer sentences in the article bodies and much heavier use of parenthesized material in the bodies than in the abstracts. We found content differences with respect to linguistic features. Three out of four of the linguistic features that we examined were statistically significantly differently distributed between the two genres. We also found content differences with respect to the distribution of semantic features. There were significantly different densities per thousand words for three out of four semantic classes, and clear differences in the extent to which they appeared in the two genres. With respect to the performance of text mining tools, we found that a mutation finder performed equally well in both genres, but that a wide variety of gene mention systems performed much worse on article bodies than they did on abstracts. POS tagging was also more accurate in abstracts than in article bodies.
Aspects of structure and content differ markedly between article abstracts and article bodies. A number of these differences may pose problems as the text mining field moves more into the area of processing full-text articles. However, these differences also present a number of opportunities for the extraction of data types, particularly those found in parenthesized text, that are present in article bodies but not in article abstracts.
We introduce a system developed for the BioCreativeII.5 community evaluation of information extraction of proteins and protein interactions. The paper focuses primarily on the gene normalization task of recognizing protein mentions in text and mapping them to the appropriate database identifiers based on contextual clues. We outline a “fuzzy” dictionary lookup approach to protein mention detection that matches regularized text to similarly-regularized dictionary entries. We describe several different strategies for gene normalization that focus on species or organism mentions in the text, both globally throughout the document and locally in the immediate vicinity of a protein mention, and present the results of experimentation with a series of system variations that explore the effectiveness of the various normalization strategies, as well as the role of external knowledge sources. While our system was neither the best nor the worst performing system in the evaluation, the gene normalization strategies show promise and the system affords the opportunity to explore some of the variables affecting performance on the BCII.5 tasks.
biomedical natural language processing; information extraction; gene normalization; text mining
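The “fuzzy” dictionary lookup mentioned in the abstract above, in which regularized text is matched against similarly regularized dictionary entries, might look roughly like the sketch below. The regularization rule used here (lowercasing and dropping every character other than ASCII letters and digits), the three-token window and the identifiers in the usage note are assumptions made for illustration; the abstract does not specify the system's actual rules, and the species-based normalization strategies are not covered.

```python
import re
from typing import Dict, List, Set, Tuple


def regularize(text: str) -> str:
    """Assumed regularization: lowercase, keep only ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def build_index(dictionary: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each regularized name to the identifiers that carry it."""
    index: Dict[str, Set[str]] = {}
    for identifier, names in dictionary.items():
        for name in names:
            key = regularize(name)
            if key:
                index.setdefault(key, set()).add(identifier)
    return index


def find_mentions(
    text: str, index: Dict[str, Set[str]], max_tokens: int = 3
) -> List[Tuple[str, int, int, List[str]]]:
    """Return (span, start, end, identifiers) for every window that matches."""
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    hits = []
    for i in range(len(tokens)):
        # Try windows of 1..max_tokens whitespace-delimited tokens.
        for n in range(1, max_tokens + 1):
            if i + n > len(tokens):
                break
            start, end = tokens[i][0], tokens[i + n - 1][1]
            key = regularize(text[start:end])
            if key in index:
                hits.append((text[start:end], start, end, sorted(index[key])))
    return hits
```

With a toy dictionary such as `{"ID1": ["p53", "TP53"], "ID2": ["BRCA-1"]}` (made-up identifiers), the text "Loss of Brca 1 and TP-53 was observed." produces matches for "Brca 1" and "TP-53", because both regularize to keys in the index. A name that maps to several identifiers is returned with all of them, which is where contextual normalization would have to take over.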