
Results 1-5 (5)

author:("hakalau, Kai")
1.  An expanded evaluation of protein function prediction methods shows an improvement in accuracy 
Jiang, Yuxiang | Oron, Tal Ronnen | Clark, Wyatt T. | Bankapur, Asma R. | D’Andrea, Daniel | Lepore, Rosalba | Funk, Christopher S. | Kahanda, Indika | Verspoor, Karin M. | Ben-Hur, Asa | Koo, Da Chen Emily | Penfold-Brown, Duncan | Shasha, Dennis | Youngs, Noah | Bonneau, Richard | Lin, Alexandra | Sahraeian, Sayed M. E. | Martelli, Pier Luigi | Profiti, Giuseppe | Casadio, Rita | Cao, Renzhi | Zhong, Zhaolong | Cheng, Jianlin | Altenhoff, Adrian | Skunca, Nives | Dessimoz, Christophe | Dogan, Tunca | Hakala, Kai | Kaewphan, Suwisa | Mehryary, Farrokh | Salakoski, Tapio | Ginter, Filip | Fang, Hai | Smithers, Ben | Oates, Matt | Gough, Julian | Törönen, Petri | Koskinen, Patrik | Holm, Liisa | Chen, Ching-Tai | Hsu, Wen-Lian | Bryson, Kevin | Cozzetto, Domenico | Minneci, Federico | Jones, David T. | Chapman, Samuel | BKC, Dukka | Khan, Ishita K. | Kihara, Daisuke | Ofer, Dan | Rappoport, Nadav | Stern, Amos | Cibrian-Uhalte, Elena | Denny, Paul | Foulger, Rebecca E. | Hieta, Reija | Legge, Duncan | Lovering, Ruth C. | Magrane, Michele | Melidoni, Anna N. | Mutowo-Meullenet, Prudence | Pichler, Klemens | Shypitsyna, Aleksandra | Li, Biao | Zakeri, Pooya | ElShal, Sarah | Tranchevent, Léon-Charles | Das, Sayoni | Dawson, Natalie L. | Lee, David | Lees, Jonathan G. | Sillitoe, Ian | Bhat, Prajwal | Nepusz, Tamás | Romero, Alfonso E. | Sasidharan, Rajkumar | Yang, Haixuan | Paccanaro, Alberto | Gillis, Jesse | Sedeño-Cortés, Adriana E. | Pavlidis, Paul | Feng, Shou | Cejuela, Juan M. | Goldberg, Tatyana | Hamp, Tobias | Richter, Lothar | Salamov, Asaf | Gabaldon, Toni | Marcet-Houben, Marina | Supek, Fran | Gong, Qingtian | Ning, Wei | Zhou, Yuanpeng | Tian, Weidong | Falda, Marco | Fontana, Paolo | Lavezzo, Enrico | Toppo, Stefano | Ferrari, Carlo | Giollo, Manuel | Piovesan, Damiano | Tosatto, Silvio C.E. | del Pozo, Angela | Fernández, José M. | Maietta, Paolo | Valencia, Alfonso | Tress, Michael L. 
| Benso, Alfredo | Di Carlo, Stefano | Politano, Gianfranco | Savino, Alessandro | Rehman, Hafeez Ur | Re, Matteo | Mesiti, Marco | Valentini, Giorgio | Bargsten, Joachim W. | van Dijk, Aalt D. J. | Gemovic, Branislava | Glisic, Sanja | Perovic, Vladmir | Veljkovic, Veljko | Veljkovic, Nevena | Almeida-e-Silva, Danillo C. | Vencio, Ricardo Z. N. | Sharan, Malvika | Vogel, Jörg | Kansakar, Lakesh | Zhang, Shanshan | Vucetic, Slobodan | Wang, Zheng | Sternberg, Michael J. E. | Wass, Mark N. | Huntley, Rachael P. | Martin, Maria J. | O’Donovan, Claire | Robinson, Peter N. | Moreau, Yves | Tramontano, Anna | Babbitt, Patricia C. | Brenner, Steven E. | Linial, Michal | Orengo, Christine A. | Rost, Burkhard | Greene, Casey S. | Mooney, Sean D. | Friedberg, Iddo | Radivojac, Predrag
Genome Biology  2016;17(1):184.
A major bottleneck in our understanding of the molecular underpinnings of life is the assignment of function to proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, assessing methods for protein function prediction and tracking progress in the field remain challenging.
We conducted the second critical assessment of functional annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. We evaluated 126 methods from 56 research groups for their ability to predict biological functions using Gene Ontology and gene-disease associations using Human Phenotype Ontology on a set of 3681 proteins from 18 species. CAFA2 featured expanded analysis compared with CAFA1, with regard to data set size, variety, and assessment metrics. To review progress in the field, the analysis compared the best methods from CAFA1 to those of CAFA2.
The top-performing methods in CAFA2 outperformed those from CAFA1. This increased accuracy can be attributed to a combination of the growing number of experimental annotations and improved methods for function prediction. The assessment also revealed that the definition of top-performing algorithms is ontology specific, that different performance metrics can be used to probe the nature of accurate predictions, and the relative diversity of predictions in the biological process and human phenotype ontologies. While there was methodological improvement between CAFA1 and CAFA2, the interpretation of results and usefulness of individual methods remain context-dependent.
Electronic supplementary material
The online version of this article (doi:10.1186/s13059-016-1037-6) contains supplementary material, which is available to authorized users.
PMCID: PMC5015320  PMID: 27604469
Protein function prediction; Disease gene prioritization
2.  Filtering large-scale event collections using a combination of supervised and unsupervised learning for event trigger classification 
Biomedical event extraction is one of the key tasks in biomedical text mining, supporting various applications such as database curation and hypothesis generation. Several systems, some of which have been applied at a large scale, have been introduced to solve this task.
Past studies have shown that the identification of the phrases describing biological processes, also known as trigger detection, is a crucial part of event extraction, and notable overall performance gains can be obtained by solely focusing on this sub-task. In this paper we propose a novel approach for filtering falsely identified triggers from large-scale event databases, thus improving the quality of knowledge extraction.
Our method relies on state-of-the-art word embeddings, event statistics gathered from the whole biomedical literature, and both supervised and unsupervised machine learning techniques. We focus on EVEX, an event database covering the whole PubMed and PubMed Central Open Access literature and containing more than 40 million extracted events. The most frequent EVEX trigger words are hierarchically clustered, and the resulting cluster tree is pruned to identify words that can never act as triggers regardless of their context. For rarely occurring trigger words we introduce a supervised approach trained on the combination of the trigger word classification produced by the unsupervised clustering method and manual annotation.
The method is evaluated on the official test set of the BioNLP Shared Task on Event Extraction. The evaluation shows that the method can be used to improve the performance of state-of-the-art event extraction systems. This successful effort also translates into removing 1,338,075 potentially incorrect events from EVEX, thus greatly improving the quality of the data. The method is not bound solely to the EVEX resource and can thus be used to improve the quality of any event extraction system or database.
The data and source code for this work are available at:
Electronic supplementary material
The online version of this article (doi:10.1186/s13326-016-0070-4) contains supplementary material, which is available to authorized users.
PMCID: PMC4864999  PMID: 27175227
BioNLP; Event extraction; Trigger detection; Word embeddings
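The unsupervised step described in the abstract above (cluster frequent trigger words by their embeddings, then discard clusters that look like non-triggers) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy two-dimensional vectors, the function names, the average-linkage criterion, the distance cutoff, and the seed-based pruning rule are invented and are not taken from the paper, which prunes a full cluster tree rather than cutting at a fixed distance.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def agglomerate(vectors, max_distance):
    """Average-linkage agglomerative clustering of the words in `vectors`.

    Merging stops once the closest pair of clusters is farther apart
    than `max_distance` (a hypothetical cutoff, not a published value).
    """
    clusters = [[word] for word in vectors]

    def linkage(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        total = sum(cosine_distance(vectors[a], vectors[b]) for a, b in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        (i, j), dist = min(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if dist > max_distance:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def words_to_prune(clusters, known_non_triggers, min_fraction=0.5):
    """Collect every word in clusters dominated by known non-trigger seeds."""
    pruned = set()
    for cluster in clusters:
        bad = sum(1 for word in cluster if word in known_non_triggers)
        if bad / len(cluster) >= min_fraction:
            pruned.update(cluster)
    return pruned


# Toy embeddings, invented for this example only.
toy_vectors = {
    "phosphorylates": (1.0, 0.1),
    "binds": (0.9, 0.2),
    "figure": (0.1, 1.0),
    "table": (0.2, 0.9),
}
clusters = agglomerate(toy_vectors, max_distance=0.1)
pruned = words_to_prune(clusters, known_non_triggers={"figure"})
```

With these toy inputs the two verb-like words end up in one cluster and the two layout words in another, so marking "figure" as a known non-trigger also flags "table". A real system would use learned embeddings and a library implementation of hierarchical clustering.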
3.  Application of the EVEX resource to event extraction and network construction: Shared Task entry and result analysis 
BMC Bioinformatics  2015;16(Suppl 16):S3.
Modern methods for mining biomolecular interactions from literature typically make predictions based solely on the immediate textual context, in effect a single sentence. No prior work has been published on extending this context to the information automatically gathered from the whole biomedical literature. Thus, our motivation for this study is to explore whether mutually supporting evidence, aggregated across several documents, can be utilized to improve the performance of state-of-the-art event extraction systems.
In this paper, we describe our participation in the latest BioNLP Shared Task using the large-scale text mining resource EVEX. We participated in the Genia Event Extraction (GE) and Gene Regulation Network (GRN) tasks with two separate systems. In the GE task, we implemented a re-ranking approach to improve the precision of an existing event extraction system, incorporating features from the EVEX resource. In the GRN task, our system relied solely on the EVEX resource and utilized a rule-based conversion algorithm between the EVEX and GRN formats.
In the GE task, our re-ranking approach led to a modest performance increase and placed first in the official Shared Task results with a 50.97% F-score. Additionally, in this paper we explore and evaluate the usage of distributed vector representations for this challenge.
In the GRN task, we ranked fifth in the official results with strict/relaxed SER scores of 0.92/0.81, respectively. To try to improve upon these results, we have implemented a novel machine learning based conversion system and benchmarked its performance against the original rule-based system.
For the GRN task, we were able to produce a gene regulatory network from the EVEX data, warranting the use of such generic large-scale text mining data in network biology settings. A detailed performance and error analysis provides more insight into the relatively low recall rates.
In the GE task we demonstrate that both the re-ranking approach and the word vectors can provide slight performance improvement. A manual evaluation of the re-ranking results pinpoints some of the challenges faced in applying large-scale text mining knowledge to event extraction.
PMCID: PMC4642107  PMID: 26551766
Text mining; Event extraction; Network construction; Large-scale data; Distributed vector representations of words
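The two figures quoted in the abstract above are of different kinds: the F-score is higher-is-better, while SER is an error rate where lower is better. The sketch below shows the standard definitions, assuming that SER denotes the slot error rate (substitutions, deletions, and insertions divided by the number of reference items). The function names and the counts in the example are invented for illustration and are not results from the paper.

```python
def f_score(true_pos, false_pos, false_neg):
    """Balanced F-score: harmonic mean of precision and recall."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def slot_error_rate(substitutions, deletions, insertions, n_reference):
    """Slot error rate: total errors relative to the reference size.

    Unlike the F-score, this can exceed 1.0 when a system makes more
    errors than there are reference items.
    """
    if n_reference <= 0:
        raise ValueError("n_reference must be positive")
    return (substitutions + deletions + insertions) / n_reference


# Invented counts, for illustration only.
example_f = f_score(true_pos=50, false_pos=50, false_neg=50)
example_ser = slot_error_rate(substitutions=2, deletions=5,
                              insertions=1, n_reference=10)
```

Under this definition, a system that predicts few relations accumulates deletions, which is consistent with the abstract linking its SER results to relatively low recall.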
4.  Large-Scale Event Extraction from Literature with Multi-Level Gene Normalization 
PLoS ONE  2013;8(4):e55814.
Text mining for the life sciences aims to aid database curation, knowledge summarization and information retrieval through the automated processing of biomedical texts. To provide comprehensive coverage and enable full integration with existing biomolecular database records, it is crucial that text mining tools scale up to millions of articles and that their analyses can be unambiguously linked to information recorded in resources such as UniProt, KEGG, BioGRID and NCBI databases. In this study, we investigate how fully automated text mining of complex biomolecular events can be augmented with a normalization strategy that identifies biological concepts in text, mapping them to identifiers at varying levels of granularity, ranging from canonicalized symbols to unique genes and proteins and broad gene families. To this end, we have combined two state-of-the-art text mining components, previously evaluated on two community-wide challenges, and have extended and improved upon these methods by exploiting their complementary nature. Using these systems, we perform normalization and event extraction to create a large-scale resource that is publicly available, unique in semantic scope, and covers all 21.9 million PubMed abstracts and 460 thousand PubMed Central open access full-text articles. This dataset contains 40 million biomolecular events involving 76 million gene/protein mentions, linked to 122 thousand distinct genes from 5032 species across the full taxonomic tree. Detailed evaluations and analyses reveal promising results for application of this data in database and pathway curation efforts. The main software components used in this study are released under an open-source license. 
Further, the resulting dataset is freely accessible through a novel API, providing programmatic and customized access ( Finally, to allow for large-scale bioinformatic analyses, the entire resource is available for bulk download from, under the Creative Commons – Attribution – Share Alike (CC BY-SA) license.
PMCID: PMC3629104  PMID: 23613707
5.  Exploring Biomolecular Literature with EVEX: Connecting Genes through Events, Homology, and Indirect Associations 
Advances in Bioinformatics  2012;2012:582765.
Technological advancements in the field of genetics have not only led to an abundance of experimental data but also caused an exponential increase in the number of published biomolecular studies. Text mining is widely accepted as a promising technique to help researchers in the life sciences deal with the amount of available literature. This paper presents a freely available web application built on top of 21.3 million detailed biomolecular events extracted from all PubMed abstracts. These text mining results were generated by a state-of-the-art event extraction system and enriched with gene family associations and abstract generalizations, accounting for lexical variants and synonymy. The EVEX resource locates relevant literature on phosphorylation, regulation targets, binding partners, and several other biomolecular events and assigns confidence values to these events. The search function accepts official gene/protein symbols as well as common names from all species. Finally, the web application is a powerful tool for generating homology-based hypotheses as well as novel, indirect associations between genes and proteins such as coregulators.
PMCID: PMC3375141  PMID: 22719757
