Results 1-14 (14)
author:("Prins, piotr")
1.  Community-driven development for computational biology at Sprints, Hackathons and Codefests 
BMC Bioinformatics  2014;15(Suppl 14):S7.
Computational biology comprises a wide range of technologies and approaches. Multiple technologies can be combined to create more powerful workflows if the individuals contributing the data or providing tools for its interpretation can find mutual understanding and consensus. Much conversation and joint investigation are required in order to identify and implement the best approaches.
Traditionally, scientific conferences feature talks presenting novel technologies or insights, followed up by informal discussions during coffee breaks. In multi-institution collaborations, in order to reach agreement on implementation details or to transfer deeper insights into a technology and practical skills, a representative of one group typically visits the other. However, this does not scale well when the number of technologies or research groups is large.
Conferences have responded to this issue by introducing Birds-of-a-Feather (BoF) sessions, which offer an opportunity for individuals with common interests to intensify their interaction. However, parallel BoF sessions often make it hard for participants to join multiple BoFs and find common ground between the different technologies, and BoFs are generally too short to allow time for participants to program together.
This report summarises our experience with computational biology Codefests, Hackathons and Sprints, which are interactive developer meetings. They are structured to reduce the limitations of traditional scientific meetings described above by strengthening the interaction among peers and letting the participants determine the schedule and topics. These meetings are commonly run as loosely scheduled "unconferences" (self-organized identification of participants and topics for meetings) over at least two days, with early introductory talks to welcome and organize contributors, followed by intensive collaborative coding sessions. We summarise some prominent achievements of those meetings and describe differences in how these are organised, how their audience is addressed, and their outreach to their respective communities.
Hackathons, Codefests and Sprints share a stimulating atmosphere that encourages participants to jointly brainstorm and tackle problems of shared interest in a self-driven proactive environment, as well as providing an opportunity for new participants to get involved in collaborative projects.
PMCID: PMC4255748  PMID: 25472764
2.  BioHackathon series in 2011 and 2012: penetration of ontology and linked data in life science domains 
Katayama, Toshiaki | Wilkinson, Mark D | Aoki-Kinoshita, Kiyoko F | Kawashima, Shuichi | Yamamoto, Yasunori | Yamaguchi, Atsuko | Okamoto, Shinobu | Kawano, Shin | Kim, Jin-Dong | Wang, Yue | Wu, Hongyan | Kano, Yoshinobu | Ono, Hiromasa | Bono, Hidemasa | Kocbek, Simon | Aerts, Jan | Akune, Yukie | Antezana, Erick | Arakawa, Kazuharu | Aranda, Bruno | Baran, Joachim | Bolleman, Jerven | Bonnal, Raoul JP | Buttigieg, Pier Luigi | Campbell, Matthew P | Chen, Yi-an | Chiba, Hirokazu | Cock, Peter JA | Cohen, K Bretonnel | Constantin, Alexandru | Duck, Geraint | Dumontier, Michel | Fujisawa, Takatomo | Fujiwara, Toyofumi | Goto, Naohisa | Hoehndorf, Robert | Igarashi, Yoshinobu | Itaya, Hidetoshi | Ito, Maori | Iwasaki, Wataru | Kalaš, Matúš | Katoda, Takeo | Kim, Taehong | Kokubu, Anna | Komiyama, Yusuke | Kotera, Masaaki | Laibe, Camille | Lapp, Hilmar | Lütteke, Thomas | Marshall, M Scott | Mori, Takaaki | Mori, Hiroshi | Morita, Mizuki | Murakami, Katsuhiko | Nakao, Mitsuteru | Narimatsu, Hisashi | Nishide, Hiroyo | Nishimura, Yosuke | Nystrom-Persson, Johan | Ogishima, Soichi | Okamura, Yasunobu | Okuda, Shujiro | Oshita, Kazuki | Packer, Nicki H | Prins, Pjotr | Ranzinger, Rene | Rocca-Serra, Philippe | Sansone, Susanna | Sawaki, Hiromichi | Shin, Sung-Ho | Splendiani, Andrea | Strozzi, Francesco | Tadaka, Shu | Toukach, Philip | Uchiyama, Ikuo | Umezaki, Masahito | Vos, Rutger | Whetzel, Patricia L | Yamada, Issaku | Yamasaki, Chisato | Yamashita, Riu | York, William S | Zmasek, Christian M | Kawamoto, Shoko | Takagi, Toshihisa
The application of semantic technologies to the integration of biological data and the interoperability of bioinformatics analysis and visualization tools has been the common theme of a series of annual BioHackathons hosted in Japan for the past five years. Here we provide a review of the activities and outcomes from the BioHackathons held in 2011 in Kyoto and 2012 in Toyama. In order to efficiently implement semantic technologies in the life sciences, participants formed various sub-groups and worked on the following topics: Resource Description Framework (RDF) models for specific domains, text mining of the literature, ontology development, essential metadata for biological databases, platforms to enable efficient Semantic Web technology development and interoperability, and the development of applications for Semantic Web data. In this review, we briefly introduce the themes covered by these sub-groups. The observations made, conclusions drawn, and software development projects that emerged from these activities are discussed.
PMCID: PMC3978116  PMID: 24495517
BioHackathon; Bioinformatics; Semantic Web; Web services; Ontology; Visualization; Knowledge representation; Databases; Semantic interoperability; Data models; Data sharing; Data integration
3.  Fast probabilistic file fingerprinting for big data 
BMC Genomics  2013;14(Suppl 2):S8.
Biological data acquisition is raising new challenges, both in data analysis and handling. Not only is it proving hard to analyze the data at the rate it is generated today, but simply reading and transferring data files can be prohibitively slow due to their size. This primarily concerns logistics within and between data centers, but is also important for workstation users in the analysis phase. Common usage patterns, such as comparing and transferring files, are proving computationally expensive and are tying down shared resources.
We present an efficient method for calculating file uniqueness for large scientific data files that takes less computational effort than existing techniques. This method, called Probabilistic Fast File Fingerprinting (PFFF), exploits the variation present in biological data and computes file fingerprints by sampling randomly from the file instead of reading it in full. Consequently, it has a flat performance characteristic, correlated with data variation rather than file size. We demonstrate that probabilistic fingerprinting can be as reliable as existing hashing techniques, with provably negligible risk of collisions. We measure the performance of the algorithm on a number of data storage and access technologies, identifying its strengths as well as limitations.
Probabilistic fingerprinting may significantly reduce the use of computational resources when comparing very large files. Utilisation of probabilistic fingerprinting techniques can increase the speed of common file-related workflows, both in the data center and for workbench analysis. The implementation of the algorithm is available as an open-source tool named pfff, as a command-line tool as well as a C library. The tool can be downloaded from
PMCID: PMC3582436  PMID: 23445565
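The sampling idea described in entry 3 can be illustrated with a minimal Python sketch. This is not the pfff tool's algorithm or interface: the function name `sampled_fingerprint`, its parameters, the use of SHA-256 and the small-file fallback are all assumptions made for illustration. It hashes the file size plus a fixed number of small blocks read at pseudo-random offsets, so the work done depends on the number of samples rather than on the file size.

```python
import hashlib
import os
import random


def sampled_fingerprint(path, seed, samples=64, block_size=16):
    """Hash the file size plus `samples` blocks read at pseudo-random offsets.

    Hypothetical helper for illustration only; not the pfff implementation.
    """
    size = os.path.getsize(path)
    digest = hashlib.sha256()
    # Including the size means files of different length never share a fingerprint input.
    digest.update(str(size).encode("ascii"))

    if size <= samples * block_size:
        # Small file: sampling saves nothing, so hash the full content.
        with open(path, "rb") as handle:
            digest.update(handle.read())
        return digest.hexdigest()

    # The same seed and the same file size always give the same offsets,
    # so two copies of a file are sampled at identical positions.
    rng = random.Random(seed)
    offsets = sorted(rng.randrange(0, size - block_size + 1) for _ in range(samples))
    with open(path, "rb") as handle:
        for offset in offsets:
            handle.seek(offset)
            digest.update(handle.read(block_size))
    return digest.hexdigest()
```

Two files can be compared by computing both fingerprints with the same seed. A sketch like this can miss a difference that falls entirely outside the sampled blocks, which is why the approach relies on the data being highly variable.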
4.  The 3rd DBCLS BioHackathon: improving life science data integration with Semantic Web technologies 
BioHackathon 2010 was the third in a series of meetings hosted by the Database Center for Life Sciences (DBCLS) in Tokyo, Japan. The overall goal of the BioHackathon series is to improve the quality and accessibility of life science research data on the Web by bringing together representatives from public databases, analytical tool providers, and cyber-infrastructure researchers to jointly tackle important challenges in the area of in silico biological research.
The theme of BioHackathon 2010 was the 'Semantic Web', and all attendees gathered with the shared goal of producing Semantic Web data from their respective resources, and/or consuming or interacting with those data using their tools and interfaces. We discussed topics including guidelines for designing semantic data and interoperability of resources. We consequently developed tools and clients for analysis and visualization.
We provide a meeting report from BioHackathon 2010, in which we describe the discussions, decisions, and breakthroughs made as we moved towards compliance with Semantic Web technologies - from source provider, through middleware, to the end-consumer.
PMCID: PMC3598643  PMID: 23398680
BioHackathon; Open source; Software; Semantic Web; Databases; Data integration; Data visualization; Web services; Interfaces
5.  Biogem: an effective tool-based approach for scaling up open source software development in bioinformatics 
Bioinformatics  2012;28(7):1035-1037.
Summary: Biogem provides a software development environment for the Ruby programming language, which encourages community-based software development for bioinformatics while lowering the barrier to entry and encouraging best practices.
Biogem, with its targeted modular and decentralized approach, software generator, tools and tight web integration, is an improved general model for scaling up collaborative open source software development in bioinformatics.
Availability: Biogem and modules are free and are OSS. Biogem runs on all systems that support recent versions of Ruby, including Linux, Mac OS X and Windows. Further information at A tutorial is available at
PMCID: PMC3315718  PMID: 22332238
6.  xQTL workbench: a scalable web environment for multi-level QTL analysis 
Bioinformatics  2012;28(7):1042-1044.
Summary: xQTL workbench is a scalable web platform for the mapping of quantitative trait loci (QTLs) at multiple levels: for example gene expression (eQTL), protein abundance (pQTL), metabolite abundance (mQTL) and phenotype (phQTL) data. Popular QTL mapping methods for model organism and human populations are accessible via the web user interface. Large calculations scale easily on to multi-core computers, clusters and Cloud. All data involved can be uploaded and queried online: markers, genotypes, microarrays, NGS, LC-MS, GC-MS, NMR, etc. When new data types become available, xQTL workbench is quickly customized using the Molgenis software generator.
Availability: xQTL workbench runs on all common platforms, including Linux, Mac OS X and Windows. An online demo system, installation guide, tutorials, software and source code are available under the LGPL3 license from
PMCID: PMC3315722  PMID: 22308096
7.  R/qtl: high-throughput multiple QTL mapping 
Bioinformatics  2010;26(23):2990-2992.
Motivation: R/qtl is free and powerful software for mapping and exploring quantitative trait loci (QTL). R/qtl provides a fully comprehensive range of methods for a wide range of experimental cross types. We recently added multiple QTL mapping (MQM) to R/qtl. MQM adds higher statistical power to detect and disentangle the effects of multiple linked and unlinked QTL compared with many other methods. MQM for R/qtl adds many new features including improved handling of missing data, analysis of 10 000s of molecular traits, permutation for determining significance thresholds for QTL and QTL hot spots, and visualizations for cis–trans and QTL interaction effects. MQM for R/qtl is the first free and open source implementation of MQM that is multi-platform, scalable and suitable for automated procedures and large genetical genomics datasets.
Availability: R/qtl is free and open source multi-platform software for the statistical language R, and is made available under the GPLv3 license. R/qtl can be installed from R/qtl queries should be directed at the mailing list, see
PMCID: PMC2982156  PMID: 20966004
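Entry 7 mentions permutation for determining significance thresholds. The general idea can be sketched in Python as follows. This is not R/qtl or MQM code: the function names, the 0/1 genotype coding and the simple difference-in-means statistic (standing in for a LOD score) are assumptions chosen to keep the example short. Phenotypes are shuffled relative to genotypes, the genome-wide maximum score is recorded for each shuffle, and an upper quantile of those maxima serves as the threshold.

```python
import random
from statistics import mean


def marker_score(genotypes, phenotypes):
    """Absolute difference in mean phenotype between genotype classes 0 and 1.

    Illustrative stand-in for a real QTL test statistic such as a LOD score.
    """
    group0 = [p for g, p in zip(genotypes, phenotypes) if g == 0]
    group1 = [p for g, p in zip(genotypes, phenotypes) if g == 1]
    if not group0 or not group1:
        return 0.0
    return abs(mean(group0) - mean(group1))


def permutation_threshold(markers, phenotypes, n_perm=1000, alpha=0.05, seed=0):
    """Estimate a genome-wide threshold from the maxima of permuted scans."""
    rng = random.Random(seed)
    shuffled = list(phenotypes)
    maxima = []
    for _ in range(n_perm):
        # Shuffling breaks any real genotype-phenotype association.
        rng.shuffle(shuffled)
        maxima.append(max(marker_score(m, shuffled) for m in markers))
    maxima.sort()
    # Take the (1 - alpha) quantile of the null distribution of maxima.
    index = min(n_perm - 1, int((1 - alpha) * n_perm))
    return maxima[index]
```

A marker whose observed score exceeds the returned value would be called significant at the chosen genome-wide level in this simplified setting.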
8.  Identification of imprinted genes subject to parent-of-origin specific expression in Arabidopsis thaliana seeds 
BMC Plant Biology  2011;11:113.
Epigenetic regulation of gene dosage by genomic imprinting of some autosomal genes facilitates normal reproductive development in both mammals and flowering plants. While many imprinted genes have been identified and intensively studied in mammals, smaller numbers have been characterized in flowering plants, mostly in Arabidopsis thaliana. Identification of additional imprinted loci in flowering plants by genome-wide screening for parent-of-origin specific uniparental expression in seed tissues will facilitate our understanding of the origins and functions of imprinted genes in flowering plants.
cDNA-AFLP can detect allele-specific expression that is parent-of-origin dependent for expressed genes in which restriction site polymorphisms exist in the transcripts derived from each allele. Using a genome-wide cDNA-AFLP screen surveying allele-specific expression of 4500 transcript-derived fragments, we report the identification of 52 maternally expressed genes (MEGs) displaying parent-of-origin dependent expression patterns in Arabidopsis siliques containing F1 hybrid seeds (3, 4 and 5 days after pollination). We identified these MEGs by developing a bioinformatics tool (GenFrag) which can directly determine the identities of transcript-derived fragments from (i) their size and (ii) which selective nucleotides were added to the primers used to generate them. Hence, GenFrag facilitates increased throughput for genome-wide cDNA-AFLP fragment analyses. The 52 MEGs we identified were further filtered for high expression levels in the endosperm relative to the seed coat to identify the candidate genes most likely representing novel imprinted genes expressed in the endosperm of Arabidopsis thaliana. Expression in seed tissues of the three top-ranked candidate genes, ATCDC48, PDE120 and MS5-like, was confirmed by Laser-Capture Microdissection and qRT-PCR analysis. Maternal-specific expression of these genes in Arabidopsis thaliana F1 seeds was confirmed via allele-specific transcript analysis across a range of different accessions. Differentially methylated regions were identified adjacent to ATCDC48 and PDE120, which may represent candidate imprinting control regions. Finally, we demonstrate that expression levels of these three genes in vegetative tissues are MET1-dependent, while their uniparental maternal expression in the seed is not dependent on MET1.
Using a cDNA-AFLP transcriptome profiling approach, we have identified three genes, ATCDC48, PDE120 and MS5-like which represent novel maternally expressed imprinted genes in the Arabidopsis thaliana seed. The extent of overlap between our cDNA-AFLP screen for maternally expressed imprinted genes, and other screens for imprinted and endosperm-expressed genes is discussed.
PMCID: PMC3174879  PMID: 21838868
9.  Bioinformatics tools and database resources for systems genetics analysis in mice—a short review and an evaluation of future needs 
Briefings in Bioinformatics  2011;13(2):135-142.
During a meeting of the SYSGENET working group ‘Bioinformatics’, currently available software tools and databases for systems genetics in mice were reviewed and the needs for future developments discussed. The group evaluated interoperability and performed initial feasibility studies. To aid future compatibility of software and exchange of already developed software modules, a strong recommendation was made by the group to integrate HAPPY and R/qtl analysis toolboxes, GeneNetwork and XGAP database platforms, and TIQS and xQTL processing platforms. R should be used as the principal computer language for QTL data analysis in all platforms and a ‘cloud’ should be used for software dissemination to the community. Furthermore, the working group recommended that all data models and software source code should be made visible in public repositories to allow a coordinated effort on the use of common data structures and file formats.
PMCID: PMC3294237  PMID: 22396485
QTL mapping; database; mouse; systems genetics
10.  A genome-wide genetic map of NB-LRR disease resistance loci in potato 
Like all plants, potato has evolved a surveillance system consisting of a large array of genes encoding immune receptors that confer resistance to pathogens and pests. The majority of these so-called resistance or R proteins belong to the super-family whose members harbour a nucleotide-binding domain and a leucine-rich-repeat domain (NB-LRR). Here, sequence information of the conserved NB domain was used to investigate the genome-wide genetic distribution of the NB-LRR resistance gene loci in potato. We analysed the sequences of 288 unique BAC clones selected using filter hybridisation screening of a BAC library of the diploid potato clone RH89-039-16 (S. tuberosum ssp. tuberosum) and a physical map of this BAC library. This resulted in the identification of 738 partial and full-length NB-LRR sequences. Based on homology of these sequences with known resistance genes, 280 and 448 sequences were classified as TIR-NB-LRR (TNL) and CC-NB-LRR (CNL) sequences, respectively. Genetic mapping revealed the presence of 15 TNL and 32 CNL loci. Thirty-six are novel, while three TNL loci and eight CNL loci are syntenic with previously identified functional resistance genes. The genetic map was complemented with 68 universal CAPS markers and 82 disease resistance trait loci described in literature, providing an excellent template for genetic studies and applied research in potato.
Electronic supplementary material
The online version of this article (doi:10.1007/s00122-011-1602-z) contains supplementary material, which is available to authorized users.
PMCID: PMC3135832  PMID: 21590328
11.  BioRuby: bioinformatics software for the Ruby programming language 
Bioinformatics  2010;26(20):2617-2619.
Summary: The BioRuby software toolkit contains a comprehensive set of free development tools and libraries for bioinformatics and molecular biology, written in the Ruby programming language. BioRuby has components for sequence analysis, pathway analysis, protein modelling and phylogenetic analysis; it supports many widely used data formats and provides easy access to databases, external programs and public web services, including BLAST, KEGG, GenBank, MEDLINE and GO. BioRuby comes with a tutorial, documentation and an interactive environment, which can be used in the shell, and in the web browser.
Availability: BioRuby is free and open source software, made available under the Ruby license. BioRuby runs on all platforms that support Ruby, including Linux, Mac OS X and Windows. And, with JRuby, BioRuby runs on the Java Virtual Machine. The source code is available from
PMCID: PMC2951089  PMID: 20739307
12.  The DBCLS BioHackathon: standardization and interoperability for bioinformatics web services and workflows. The DBCLS BioHackathon Consortium* 
Web services have become a key technology for bioinformatics, since life science databases are globally decentralized and the exponential increase in the amount of available data demands efficient systems without the need to transfer entire databases for every step of an analysis. However, various incompatibilities among database resources and analysis services make it difficult to connect and integrate these into interoperable workflows. To resolve this situation, we invited domain specialists from web service providers, client software developers, Open Bio* projects, the BioMoby project and researchers of emerging areas where a standard exchange data format is not well established, for an intensive collaboration entitled the BioHackathon 2008. The meeting was hosted by the Database Center for Life Science (DBCLS) and Computational Biology Research Center (CBRC) and was held in Tokyo from February 11th to 15th, 2008. In this report we highlight the work accomplished and the common issues that arose from this event, including the standardization of data exchange formats and services in the emerging fields of glycoinformatics, biological interaction networks, text mining, and phyloinformatics. In addition, common shared object development based on BioSQL, as well as technical challenges in large data management, asynchronous services, and security are discussed. Consequently, we improved interoperability of web services in several fields; however, further cooperation among major database centers and continued collaborative efforts between service providers and software developers are still necessary for an effective advance in bioinformatics web service technologies.
PMCID: PMC2939597  PMID: 20727200
13.  Mapping Determinants of Gene Expression Plasticity by Genetical Genomics in C. elegans 
PLoS Genetics  2006;2(12):e222.
Recent genetical genomics studies have provided intimate views on gene regulatory networks. Gene expression variations between genetically different individuals have been mapped to the causal regulatory regions, termed expression quantitative trait loci. Whether the environment-induced plastic response of gene expression also shows heritable difference has not yet been studied. Here we show that differential expression induced by temperatures of 16 °C and 24 °C has a strong genetic component in Caenorhabditis elegans recombinant inbred strains derived from a cross between strains CB4856 (Hawaii) and N2 (Bristol). No less than 59% of 308 trans-acting genes showed a significant eQTL-by-environment interaction, here termed plasticity quantitative trait loci. In contrast, only 8% of an estimated 188 cis-acting genes showed such interaction. This indicates that heritable differences in plastic responses of gene expression are largely regulated in trans. This regulation is spread over many different regulators. However, for one group of trans-genes we found prominent evidence for a common master regulator: a transband of 66 coregulated genes appeared at 24 °C. Our results suggest widespread genetic variation of differential expression responses to environmental impacts and demonstrate the potential of genetical genomics for mapping the molecular determinants of phenotypic plasticity.
It is widely documented that environmental changes will induce differential expression of genes, yet it is unknown how these patterns of environment-induced expression plasticity are inherited and how they differ between genetically divergent individuals of a biological species. In this paper the authors used recombinant inbred lines of the nematode worm C. elegans that were derived from parental lines originally collected in Bristol (United Kingdom) and Hawaii, and measured genome-wide gene expression at two different temperatures. Using statistical analysis tools developed for quantitative trait locus mapping, they found genes with genetically determined differences in their plastic response to temperature changes. A majority of them were found to be regulated by genes at a different genome position (regulated in trans). A striking observation was a group of 66 genes that share a common potential regulator and may be related to differences in fertility plasticity. These results show that differential responses of different genotypes to environmental changes are widespread. Because all species are subjected to environmental change, both at individual and evolutionary time scales, the authors' work calls for studying the heritable component of plasticity of gene regulation in other organisms to enhance understanding of the environmental forces that drive evolutionary adaptation.
PMCID: PMC1756913  PMID: 17196041
14.  GenEST, a powerful bidirectional link between cDNA sequence data and gene expression profiles generated by cDNA-AFLP 
Nucleic Acids Research  2001;29(7):1616-1622.
The release of vast quantities of DNA sequence data by large-scale genome and expressed sequence tag (EST) projects underlines the necessity for the development of efficient and inexpensive ways to link sequence databases with temporal and spatial expression profiles. Here we demonstrate the power of linking cDNA sequence data (including EST sequences) with transcript profiles revealed by cDNA-AFLP, a highly reproducible differential display method based on restriction enzyme digests and selective amplification under high stringency conditions. We have developed a computer program (GenEST) that predicts the sizes of virtual transcript-derived fragments (TDFs) of in silico-digested cDNA sequences retrieved from databases. The vast majority of the resulting virtual TDFs could be traced back among the thousands of TDFs displayed on cDNA-AFLP gels. Sequencing of the corresponding bands excised from cDNA-AFLP gels revealed no inconsistencies. As a consequence, cDNA sequence databases can be screened very efficiently to identify genes with relevant expression profiles. Conversely, it is possible to switch from cDNA-AFLP gels to sequences in the databases. Using the restriction enzyme recognition sites, the primer extensions and the estimated TDF size as identifiers, the DNA sequence(s) corresponding to a TDF with an interesting expression pattern can be identified. In this paper we show examples in both directions by analyzing the plant parasitic nematode Globodera rostochiensis. Various novel pathogenicity factors were identified by combining ESTs from the infective stage juveniles with expression profiles of ∼4000 genes in five developmental stages produced by cDNA-AFLP.
PMCID: PMC31277  PMID: 11266565
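The in silico digestion step described in entry 14 can be sketched in Python under strong simplifying assumptions. This is not GenEST itself: the function names, the enzyme tuples and the rule of keeping only fragments flanked by one cut of each enzyme are illustrative choices, and adapter or primer lengths, which would add to the size seen on a gel, are ignored. Recognition sites are searched on the given strand only, which is sufficient for palindromic sites.

```python
def find_cuts(seq, site, cut_offset):
    """Return cut positions (index of the first base after each cut)."""
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        # Step one base forward so overlapping sites are also found.
        start = seq.find(site, start + 1)
    return cuts


def virtual_fragments(seq, enzyme_a, enzyme_b):
    """List (start, end, length) of fragments bounded by one cut of each enzyme.

    Each enzyme is a hypothetical (name, recognition_site, cut_offset) tuple.
    """
    cuts = sorted(
        [(pos, enzyme_a[0]) for pos in find_cuts(seq, enzyme_a[1], enzyme_a[2])]
        + [(pos, enzyme_b[0]) for pos in find_cuts(seq, enzyme_b[1], enzyme_b[2])]
    )
    fragments = []
    # Adjacent cuts delimit fragments of a complete double digest.
    for (left, left_name), (right, right_name) in zip(cuts, cuts[1:]):
        if left_name != right_name:
            fragments.append((left, right, right - left))
    return fragments


# Example enzymes, chosen for illustration: EcoRI (G^AATTC) and MseI (T^TAA).
ECORI = ("EcoRI", "GAATTC", 1)
MSEI = ("MseI", "TTAA", 1)
```

The predicted lengths could then be matched against band sizes estimated from a gel, which is the direction of lookup the abstract describes.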