This study investigated the utility of applying advanced computational techniques to large-scale, genome-based data to identify novel genes that govern murine pancreatic development.
An expression dataset for mouse pancreatic development was complemented with “Hanalyzer” (high-throughput data analyzer) to identify and prioritize novel genes. Quantitative real-time polymerase chain reaction, in situ hybridization and immunohistochemistry were used to validate selected genes.
Four new genes whose roles in the development of the murine pancreas have not previously been established were identified: Cystathionine beta synthase (Cbs), Meis homeobox 1, Growth factor independent 1 and Aldehyde dehydrogenase 18 family, member A1. Their temporal expression during development was documented. Cbs was localized in the cytoplasm of the tip cells of the epithelial cords of the undifferentiated progenitor cells at E12.5 and was co-expressed in cells positive for Pancreatic and duodenal homeobox 1 and Pancreas specific transcription factor, 1a. In the adult pancreas, Cbs was localized primarily within the acinar compartment.
In silico analysis of high-throughput microarray data in combination with background knowledge about genes provides an additional reliable method of identifying novel genes. To our knowledge, the expression and localization of Cbs have not been previously documented during mouse pancreatic development.
Pancreas; Development; Microarray; Bioinformatics; Hanalyzer
Hepatic oxidative stress and subsequent lipid peroxidation are well-recognized consequences of sustained ethanol consumption. The covalent adduction of nucleophilic amino acid side-chains by lipid electrophiles is significantly increased in patients with alcoholic liver disease (ALD); a global assessment of in vivo protein targets and the consequences of these modifications, however, has not been conducted. In this report, we describe identification of novel protein targets for covalent adduction in a 6-week murine model for ALD. Ethanol-fed mice displayed a 2-fold increase in hepatic TBARS while immunohistochemical analysis for the reactive aldehydes 4-hydroxynonenal (4-HNE), 4-oxononenal (4-ONE), acrolein (ACR) and malondialdehyde (MDA) revealed a marked increase in the staining of modified proteins in the ethanol-treated mice. Increased protein carbonyl content was confirmed utilizing subcellular fractionation of liver homogenates followed by biotin-tagging through hydrazide chemistry, where approximately a 2-fold increase in modified proteins was observed in microsomal and cytosolic fractions. To determine targets of protein carbonylation, a secondary hydrazide method coupled to a highly sensitive 2-dimensional liquid chromatography tandem mass spectrometry (2D LC-MS/MS or MuDPIT) technique was utilized. Our results have identified 414 protein targets for modification by reactive aldehydes in ALD. The presence of novel in vivo sites of protein modification by 4-HNE (2), 4-ONE (4) and ACR (2) was also confirmed in our data set. While the precise impact of protein carbonylation in ALD remains unknown, a bioinformatic analysis of the data set has revealed key pathways associated with disease progression, including fatty acid metabolism, drug metabolism, oxidative phosphorylation and the TCA cycle. These data suggest a major role for aldehyde adduction in the pathogenesis of ALD.
ROS; ALD; 4-HNE; 4-hydroxynonenal; 4-ONE; 4-oxononenal; acrolein; MDA; malondialdehyde; MuDPIT; 2D-LC-MS/MS; DAVID; protein adducts; protein carbonyl; TBARS; lipid peroxidation; electrophile; biotin hydrazide; in vivo adduct; GRP78; bioinformatics; steatosis; fatty liver; ethanol
Adipose tissue located in the viscera is considered to be functionally and metabolically different from that found in the subcutaneous depot. However, subcutaneous adipose tissue in generalized regions is considered to be homogeneous in nature. Affymetrix GeneChip Human Exon 1.0 ST Arrays were used to determine differential gene expression in four subcutaneous adipose depots (upper abdomen, lower abdomen, flank and hip) in normal-weight women. A total of 2,890 of 24,409 transcripts were differentially expressed between all sites. When comparing the hip and flank to the lower abdomen, 248 and 83 genes were differentially expressed, respectively. When comparing the hip and flank to the upper abdomen, 2,480 and 79 genes were differentially expressed, respectively. No genes were significantly different when the lower abdomen was compared to the upper abdomen or when the hip was compared to the flank. Genes involved in the complement and coagulation cascades and immune responses showed increased expression in the lower abdomen compared to the flank. In addition, two genes involved in the complement and coagulation cascade, CR1 and C7, were expressed more highly in the lower abdomen compared to the hip. Genes involved in basic biochemical metabolism including insulin signaling, the urea cycle, glutamate metabolism, arginine and proline metabolism and aminosugar metabolism had higher expression in the lower abdomen compared to the hip. These results in normal-weight healthy women provide a new perspective on regional differences in subcutaneous adipose tissue biology that may have pathophysiologic implications when adiposity increases.
Text mining for translational bioinformatics is a new field with tremendous research potential. It is a subfield of biomedical natural language processing that concerns itself directly with the problem of relating basic biomedical research to clinical practice, and vice versa. Applications of text mining fall both into the category of T1 translational research—translating basic science results into new interventions—and T2 translational research, or translational research for public health. Potential use cases include better phenotyping of research subjects, and pharmacogenomic research. A variety of methods for evaluating text mining applications exist, including corpora, structured test suites, and post hoc judging. Two basic principles of linguistic structure are relevant for building text mining applications. One is that linguistic structure consists of multiple levels. The other is that every level of linguistic structure is characterized by ambiguity. There are two basic approaches to text mining: rule-based, also known as knowledge-based; and machine-learning-based, also known as statistical. Many systems are hybrids of the two approaches. Shared tasks have had a strong effect on the direction of the field. Like all translational bioinformatics software, text mining software for translational bioinformatics can be considered health-critical and should be subject to the strictest standards of quality assurance and software testing.
The biomedical text mining community has focused on developing techniques to automatically extract important relations between biological components and semantic events involving genes or proteins from literature. In this paper, we propose a novel approach for mining relations and events in the biomedical literature using approximate subgraph matching. Extraction of such knowledge is performed by searching for an approximate subgraph isomorphism between key contextual dependencies and input sentence graphs. Our approach significantly increases the chance of retrieving relations or events encoded within complex dependency contexts by introducing error tolerance into the graph matching process, while maintaining the extraction precision at a high level. When evaluated on practical tasks, it achieves a 51.12% F-score in extracting nine types of biological events on the GE task of the BioNLP-ST 2011 and an 84.22% F-score in detecting protein-residue associations. The performance is comparable to the reported systems across these tasks, and thus demonstrates the generalizability of our proposed approach.
In this Commentary, we describe a cryptographic method for returning research results to individuals who participate in clinical studies. Controlled use of this method, which relaxes the typical anonymization guarantee, can ensure that clinically actionable results reach participants while also addressing most privacy concerns.
Advances in concept recognition and natural language parsing have led to the development of various tools that enable the identification of biomedical entities and relationships between them in text. The aim of the Genotype-Phenotype-Drug Relationship Extraction from Text workshop (or GPD-Rx workshop) is to examine the current state of the art and discuss the next steps for making the extraction of relationships between biomedical entities integral to the curation and knowledge management workflow in Pharmacogenomics. The workshop will focus particularly on the extraction of Genotype-Phenotype, Genotype-Drug, and Phenotype-Drug relationships that are of interest to Pharmacogenomics. Extracting and structuring such text-mined relationships is key to supporting the evaluation and validation of multiple hypotheses that emerge from high-throughput translational studies spanning multiple measurement modalities. In order to advance this agenda, it is essential that existing relationship extraction methods be compared to one another and that a community-wide benchmark corpus emerge, against which future methods can be compared. The workshop aims to bring together researchers working on the automatic or semi-automatic extraction of relationships between biomedical entities from research literature in order to identify the key groups interested in creating such a benchmark.
NLP; Pharmacogenomics; Entity Recognition; Event Extraction; Genotype-Phenotype-Drug Relationships
Alcoholic liver disease (ALD) is a prominent cause of morbidity and mortality in the United States. Alterations in protein folding occur in numerous disease states, including ALD. The endoplasmic reticulum (ER) is the primary site of post-translational modifications (PTM) within the cell. Glycosylation, the most abundant PTM, affects protein stability, structure, localization and activity. Decreases in hepatic glycosylation machinery have been observed in rodent models of ALD, but specific protein targets have not been identified. Utilizing two-dimensional gel electrophoresis and liquid chromatography tandem mass spectrometry, glycoproteins were identified in hepatic microsomal fractions from control and ethanol-fed mice. This study reports for the first time a global decrease in ER glycosylation. Additionally, the identification of 30 glycoproteins within this fraction elucidates pathway-specific alterations in ALD-impaired glycosylation. Among the identified proteins, triacylglycerol hydrolase (TGH) is positively affected by glycosylation, showing increased activity following the addition of sugar moieties. Impaired TGH activity is associated with increased cellular storage of lipids and provides a potential mechanism for the observed pathologies associated with ALD.
Alcoholic liver disease; glycoproteomics; UPR; ER stress; 4-HNE; DAVID
The heterogeneous and chaotic nature of osteosarcoma has confounded accurate molecular classification, prognosis, and prediction for this tumor. The occurrence of spontaneous osteosarcoma is largely confined to humans and dogs. While the clinical features are remarkably similar in both species, the organization of dogs into defined breeds provides a more homogeneous genetic background that may increase the likelihood of uncovering molecular subtypes for this complex disease. We thus hypothesized that molecular profiles derived from canine osteosarcoma would aid in molecular subclassification of this disease when applied to humans. To test the hypothesis, we performed genome-wide gene expression profiling in a cohort of dogs with osteosarcoma, primarily from high-risk breeds. To further reduce inter-sample heterogeneity, we assessed tumor-intrinsic properties through use of an extensive panel of osteosarcoma-derived cell lines. We observed strong differential gene expression that segregated samples into two groups with differential survival probabilities. Groupings were characterized by the inversely correlated expression of genes associated with G2/M transition and DNA damage checkpoint and microenvironment-interaction categories. This signature was preserved in data from whole tumor samples of three independent dog osteosarcoma cohorts, with stratification into the two expected groups. Significantly, this restricted signature partially overlapped a previously defined, predictive signature for soft tissue sarcomas, and it unmasked orthologous molecular subtypes and their corresponding natural histories in five independent data sets from human patients with osteosarcoma. Our results indicate that the narrower genetic diversity of dogs can be utilized to group complex human osteosarcoma into biologically and clinically relevant molecular subtypes. This in turn may enhance prognosis and prediction, and identify relevant therapeutic targets.
osteosarcoma; gene expression profiling; cell cycle; DNA damage checkpoint; tumor microenvironment; canine
We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for performing sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus.
Many biomedical natural language processing systems demonstrated large differences between their previously published results and their performance on the CRAFT corpus when tested with the publicly available models or rule sets. Trainable systems differed widely with respect to their ability to build high-performing models based on this data.
The finding that some systems were able to train high-performing models based on this corpus is additional evidence, beyond high inter-annotator agreement, that the quality of the CRAFT corpus is high. The overall poor performance of various systems indicates that considerable work needs to be done to enable natural language processing systems to work well when the input is full-text journal articles. The CRAFT corpus provides a valuable resource to the biomedical natural language processing community for evaluation and training of new models for biomedical full text publications.
Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text.
This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement.
As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. The corpus, annotation guidelines, and other associated resources are freely available at http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml.
There have been several reports on the varying rates of progression among Alzheimer's Disease (AD) patients; however, there has been no quantitative study of the amount of heterogeneity in AD. Obtaining a reliable quantitative measure of AD progression rates and their variances among patients for each stage of AD is essential for evaluating the results of any clinical study. The Global Deterioration Scale (GDS) and Functional Assessment Staging procedure (FAST) characterize seven stages in the course of AD from normal aging to severe dementia. Each GDS/FAST stage has a published mean duration, but the variance is unknown. We use statistical analysis to reconstruct GDS/FAST stage durations in a cohort of 648 AD patients with an average follow-up time of 4.78 years. Calculations for GDS/FAST stages 4–6 reveal that the standard deviations for stage durations are comparable with their mean values, indicating the presence of large variations in AD progression among patients. This degree of heterogeneity in the course of AD progression is consistent with the existence of several sub-groups of AD patients, which differ by their patterns of decline.
In recent decades, our understanding of Alzheimer's disease (AD) has increased; however, some basic questions still remain unresolved. One of them is: how homogeneous is AD? Is the course of progression more or less the same for most patients, or are there large variations? Our paper studies a large cohort of AD patients which comes from a 23-year-long study, and performs a statistical analysis of progression speed. We quantify the amount of spread in GDS/FAST stage durations (a staging system widely used by clinicians). We arrive at an astonishing conclusion that the mean length of AD stages is comparable with their standard deviation! This means that individual courses of AD progression may differ very much from each other, and from the textbook mean values. This has implications both for clinical trials (how do we assess if a new drug is effective, if the amount of natural spread is so large in untreated patients?), and for our understanding of this disease, which appears to be comprised of sub-diseases with different patterns of decline.
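The comparison between mean stage duration and its standard deviation can be expressed as a coefficient of variation. The sketch below is illustrative only: the function name and its input are hypothetical, it uses the ordinary sample standard deviation, and it does not reproduce the reconstruction of stage durations used in the study.

```python
from statistics import mean, stdev
from typing import Sequence


def coefficient_of_variation(durations_years: Sequence[float]) -> float:
    """Return sample standard deviation divided by mean.

    A value near 1 means the spread is comparable with the mean,
    which is the situation the abstract describes for stages 4-6.
    """
    if len(durations_years) < 2:
        raise ValueError("need at least two durations to estimate spread")
    average = mean(durations_years)
    if average == 0:
        raise ValueError("mean duration is zero; ratio is undefined")
    return stdev(durations_years) / average
```

Given a list of per-patient durations for one stage, a returned value close to 1 would correspond to the "standard deviation comparable with the mean" finding.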
Exact analytic expressions are developed for the average power of the Benjamini and Hochberg false discovery control procedure. The result is based on explicit computation of the joint probability distribution of the total number of rejections and the number of false rejections, and expressed in terms of the cumulative distribution functions of the p-values of the hypotheses. An example of analytic evaluation of the average power is given. The result is confirmed by numerical experiments and applied to a meta-analysis of three clinical studies in mammography.
hypothesis testing; multiple comparisons; false discovery; distribution of rejections; meta-analysis
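For readers unfamiliar with the procedure whose power is analyzed, the following is a minimal sketch of the standard Benjamini and Hochberg step-up rule, together with an empirical power count for a single set of p-values. It does not implement the exact analytic expressions of the paper; the function names and the `is_alternative` labels are assumptions introduced for illustration.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], q: float = 0.05) -> List[bool]:
    """Return one reject flag per hypothesis under the BH step-up rule."""
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the hypotheses ordered by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k such that p_(k) <= k * q / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank
    # Reject every hypothesis whose rank is at most k_max.
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


def empirical_power(reject: Sequence[bool], is_alternative: Sequence[bool]) -> float:
    """Fraction of true alternatives rejected in this one realization."""
    n_alt = sum(is_alternative)
    if n_alt == 0:
        raise ValueError("no true alternatives; power is undefined")
    hits = sum(1 for r, a in zip(reject, is_alternative) if r and a)
    return hits / n_alt
```

Averaging `empirical_power` over many simulated sets of p-values would give a Monte Carlo estimate of the average power, which is the quantity the analytic result computes exactly.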
The contents of parentheses in biomedical text have many potential uses in text mining applications. However, making use of them requires the ability to determine what class of contents they are. A system that automatically classifies parenthesized text into one of 20 categories is presented and evaluated here. It performs at a micro-averaged accuracy of 68% and a macro-averaged accuracy of 60% on an annotated corpus. The application is available as a Java class and as a Perl module.
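The difference between the two reported figures can be illustrated with a short sketch. It assumes that micro-averaged accuracy pools all instances and that macro-averaged accuracy is the unweighted mean of per-category accuracies; the function name is hypothetical and this is not the evaluation code of the system described.

```python
from collections import defaultdict
from typing import Sequence, Tuple


def micro_macro_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> Tuple[float, float]:
    """Return (micro, macro) accuracy over gold-standard categories."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal in length")
    total = defaultdict(int)
    correct = defaultdict(int)
    for g, p in zip(gold, predicted):
        total[g] += 1
        if g == p:
            correct[g] += 1
    # Micro: every instance counts equally, so frequent categories dominate.
    micro = sum(correct.values()) / len(gold)
    # Macro: every category counts equally, regardless of its frequency.
    macro = sum(correct[c] / total[c] for c in total) / len(total)
    return micro, macro
```

A micro-averaged score above the macro-averaged score, as reported here, typically indicates that the classifier does better on frequent categories than on rare ones.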
The etiology of hemangiosarcoma remains incompletely understood. Its common occurrence in dogs suggests predisposing factors favor its development in this species. These factors could represent a constellation of heritable characteristics that promote transformation events and/or facilitate the establishment of a microenvironment that is conducive for survival of malignant blood vessel-forming cells. The hypothesis for this study was that characteristic molecular features distinguish hemangiosarcoma from non-malignant endothelial cells, and that such features are informative for the etiology of this disease.
We first investigated mutations of VHL and Ras family genes that might drive hemangiosarcoma by sequencing tumor DNA and mRNA (cDNA). Protein expression was examined using immunostaining. Next, we evaluated genome-wide gene expression profiling using the Affymetrix Canine 2.0 platform as a global approach to test the hypothesis. Data were evaluated using routine bioinformatics and validation was done using quantitative real time RT-PCR.
Each of 10 tumor and four non-tumor samples analyzed had wild-type sequences for these genes. At the genome-wide level, hemangiosarcoma cells clustered separately from non-malignant endothelial cells based on a robust signature that included genes involved in inflammation, angiogenesis, adhesion, invasion, metabolism, cell cycle, signaling, and patterning. This signature did not simply reflect a cancer-associated angiogenic phenotype, as it also distinguished hemangiosarcoma from non-endothelial, moderately to highly angiogenic bone marrow-derived tumors (lymphoma, leukemia, osteosarcoma).
The data show that inflammation and angiogenesis are important processes in the pathogenesis of vascular tumors, but a definitive ontogeny of the cells that give rise to these tumors remains to be established. The data do not yet distinguish whether functional or ontogenetic plasticity creates this phenotype, although they suggest that cells which give rise to hemangiosarcoma modulate their microenvironment to promote tumor growth and survival. We propose that the frequent occurrence of canine hemangiosarcoma in defined dog breeds, as well as its similarity to homologous tumors in humans, offers unique models to solve the dilemma of stem cell plasticity and whether angiogenic endothelial cells and hematopoietic cells originate from a single cell or from distinct progenitor cells.
An increase in work on the full text of journal articles and the growth of PubMedCentral have the potential to create a major paradigm shift in how biomedical text mining is done. However, there has so far been no comprehensive characterization of how the bodies of full-text journal articles differ from the abstracts that have been the subject of most biomedical text mining research to date.
We examined the structural and linguistic aspects of abstracts and bodies of full text articles, the performance of text mining tools on both, and the distribution of a variety of semantic classes of named entities between them. We found marked structural differences, with longer sentences in the article bodies and much heavier use of parenthesized material in the bodies than in the abstracts. We found content differences with respect to linguistic features. Three out of four of the linguistic features that we examined were statistically significantly differently distributed between the two genres. We also found content differences with respect to the distribution of semantic features. There were significantly different densities per thousand words for three out of four semantic classes, and clear differences in the extent to which they appeared in the two genres. With respect to the performance of text mining tools, we found that a mutation finder performed equally well in both genres, but that a wide variety of gene mention systems performed much worse on article bodies than they did on abstracts. POS tagging was also more accurate in abstracts than in article bodies.
Aspects of structure and content differ markedly between article abstracts and article bodies. A number of these differences may pose problems as the text mining field moves more into the area of processing full-text articles. However, these differences also present a number of opportunities for the extraction of data types, particularly those found in parenthesized text, that are present in article bodies but not in article abstracts.
We introduce a system developed for the BioCreativeII.5 community evaluation of information extraction of proteins and protein interactions. The paper focuses primarily on the gene normalization task of recognizing protein mentions in text and mapping them to the appropriate database identifiers based on contextual clues. We outline a “fuzzy” dictionary lookup approach to protein mention detection that matches regularized text to similarly-regularized dictionary entries. We describe several different strategies for gene normalization that focus on species or organism mentions in the text, both globally throughout the document and locally in the immediate vicinity of a protein mention, and present the results of experimentation with a series of system variations that explore the effectiveness of the various normalization strategies, as well as the role of external knowledge sources. While our system was neither the best nor the worst performing system in the evaluation, the gene normalization strategies show promise and the system affords the opportunity to explore some of the variables affecting performance on the BCII.5 tasks.
biomedical natural language processing; information extraction; gene normalization; text mining
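The idea of matching regularized text against similarly regularized dictionary entries can be sketched as follows. The regularization rule used here (lowercasing and dropping every non-alphanumeric character) is an assumption, as are the function names and the identifiers in the usage note; the system's actual rules are not given in the abstract.

```python
import re
from typing import Dict, Iterable, Set


def regularize(text: str) -> str:
    """Lowercase the text and remove everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def build_index(dictionary: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Map each regularized name to the set of identifiers that carry it."""
    index: Dict[str, Set[str]] = {}
    for identifier, names in dictionary.items():
        for name in names:
            index.setdefault(regularize(name), set()).add(identifier)
    return index


def lookup(mention: str, index: Dict[str, Set[str]]) -> Set[str]:
    """Return candidate identifiers for a mention, or an empty set."""
    return index.get(regularize(mention), set())
```

With a hypothetical dictionary such as `{"ID:0001": ["IL-2", "interleukin 2"]}`, the mentions "IL2" and "Interleukin-2" both resolve to `ID:0001`. A mention that maps to several identifiers would then be disambiguated by the species-based normalization strategies the abstract describes.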
Copy number variants (CNVs) create a major source of variation among individuals and populations. Array-based comparative genomic hybridisation (aCGH) is a powerful method used to detect and compare the copy numbers of DNA sequences at high resolution along the genome. In recent years, several informatics tools for accurate and efficient CNV detection and assessment have been developed. In this paper, most of the well known algorithms, analysis software and the limitations of that software will be briefly reviewed.
copy number variants; CNV; deletion; insertion; duplication; aCGH
Summary: The Unstructured Information Management Architecture (UIMA) framework and web services are emerging as useful tools for integrating biomedical text mining tools. This note describes our work, which wraps the National Center for Biomedical Ontology (NCBO) Annotator—an ontology-based annotation service—to make it available as a component in UIMA workflows.
Availability: This wrapper is freely available on the web at http://bionlp-uima.sourceforge.net/ as part of the UIMA tools distribution from the Center for Computational Pharmacology (CCP) at the University of Colorado School of Medicine. It has been implemented in Java for support on Mac OS X, Linux and MS Windows.
In recent years, there has been an explosion in the range of software available for annotation enrichment analysis. Three classes of enrichment algorithms and their associated software implementations are introduced here.
Their limitations and caveats are discussed, and direction for tool selection is given.
software; enrichment; gene enrichment analysis; GSEA; pathway analysis; gene ontology; gene list
Orofacial malformations resulting from genetic and/or environmental causes are frequent human birth defects yet their etiology is often unclear because of insufficient information concerning the molecular, cellular and morphogenetic processes responsible for normal facial development. We have, therefore, derived a comprehensive expression dataset for mouse orofacial development, interrogating three distinct regions – the mandibular, maxillary and frontonasal prominences. To capture the dynamic changes in the transcriptome during face formation, we sampled five time points between E10.5 and E12.5, spanning the developmental period from establishment of the prominences to their fusion to form the mature facial platform. Seven independent biological replicates were used for each sample, ensuring robustness and quality of the dataset. Here, we provide a general overview of the dataset, characterizing aspects of gene expression changes at both the spatial and temporal level. Considerable coordinate regulation occurs across the three prominences during this period of facial growth and morphogenesis, with a switch from expression of genes involved in cell proliferation to those associated with differentiation. An accompanying shift in the expression of polycomb and trithorax genes presumably maintains appropriate patterns of gene expression in precursor or differentiated cells, respectively. Superimposed on the many coordinated changes are prominence-specific differences in the expression of genes encoding transcription factors, extracellular matrix components, and signaling molecules. Thus, the elaboration of each prominence will be driven by particular combinations of transcription factors coupled with specific cell:cell and cell:matrix interactions. The dataset also reveals several prominence-specific genes not previously associated with orofacial development, a subset of which we externally validate. Several of these latter genes are components of bidirectional transcription units that likely share cis-acting sequences with well-characterized genes. Overall, our studies provide a valuable resource for probing orofacial development and a robust dataset for bioinformatic analysis of spatial and temporal gene expression changes during embryogenesis.
Gapped and ungapped sequence alignments were tested as possible methods to classify proteins into the functional classes defined by the International Enzyme Commission (EC). We exhaustively tested all 15,208 proteins labeled with any EC class in a recent release of the SwissProt database, evaluating all 1,327 relevant EC classes. We effectively tested all possible similarity thresholds that could be used for this assignment through the use of the ROC statistic. Approximately 60% of Enzyme Commission classes containing two or more proteins could not be perfectly discriminated by sequence similarity at any threshold. An analysis of the errors indicates that false positive matches dominate, and that various error mechanisms can be identified, including the multidomain nature of many proteins and polyproteins, convergent evolution, variation in enzyme specificity, and other factors. Many of the putative false positives are in fact biologically relevant. This work strongly suggests that functional assignment of enzymes should attempt to delimit functionally significant subregions, or domains, before matching to EC classes.
protein function; enzyme classes; sequence alignment; pairwise; Enzyme Commission; receiver operating characteristic
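The statement that the ROC statistic effectively tests all possible thresholds can be made concrete with a small sketch. It assumes the area under the ROC curve computed by pairwise comparison of similarity scores, with ties counted as one half; the function names are hypothetical and this is not the evaluation code of the study.

```python
from typing import Sequence


def roc_auc(member_scores: Sequence[float], nonmember_scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores.

    member_scores: similarity scores for proteins inside the class.
    nonmember_scores: similarity scores for proteins outside the class.
    """
    if not member_scores or not nonmember_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for s_in in member_scores:
        for s_out in nonmember_scores:
            if s_in > s_out:
                wins += 1.0
            elif s_in == s_out:
                wins += 0.5  # ties count as half a correct ordering
    return wins / (len(member_scores) * len(nonmember_scores))


def perfectly_discriminated(member_scores: Sequence[float], nonmember_scores: Sequence[float]) -> bool:
    """True when some threshold separates members from non-members exactly."""
    return min(member_scores) > max(nonmember_scores)
```

An area of exactly 1 is equivalent to `perfectly_discriminated` returning true, so a class with an area below 1 cannot be separated at any single similarity threshold.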