Related Articles

1.  A System for Information Management in BioMedical Studies—SIMBioMS 
Bioinformatics  2009;25(20):2768-2769.
Summary: SIMBioMS is a web-based open source software system for managing data and information in biomedical studies. It provides a solution for the collection, storage, management and retrieval of information about research subjects and biomedical samples, as well as experimental data obtained using a range of high-throughput technologies, including gene expression, genotyping, proteomics and metabonomics. The system can easily be customized and has proven to be successful in several large-scale multi-site collaborative projects. It is compatible with emerging functional genomics data standards and provides data import and export in accepted standard formats. Protocols for transferring data to durable archives at the European Bioinformatics Institute have been implemented.
Availability: The source code, documentation and initialization scripts are available at
PMCID: PMC2759553  PMID: 19633095
2.  The Ensembl REST API: Ensembl Data for Any Language 
Bioinformatics  2014;31(1):143-145.
Motivation: We present a Web service to access Ensembl data using Representational State Transfer (REST). The Ensembl REST server enables the easy retrieval of a wide range of Ensembl data by most programming languages, using standard formats such as JSON and FASTA while minimizing client work. We also introduce bindings to the popular Ensembl Variant Effect Predictor tool permitting large-scale programmatic variant analysis independent of any specific programming language.
Availability and implementation: The Ensembl REST API can be accessed at and source code is freely available under an Apache 2.0 license from
Contact: or
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4271150  PMID: 25236461
3.  Brain: biomedical knowledge manipulation 
Bioinformatics  2013;29(9):1238-1239.
Summary: Brain is a Java software library facilitating the manipulation and creation of ontologies and knowledge bases represented with the Web Ontology Language (OWL).
Availability and implementation: The Java source code and the library are freely available at and on the Maven Central repository (GroupId: The documentation is available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3634181  PMID: 23505292
4.  The functional therapeutic chemical classification system 
Bioinformatics  2013;30(6):876-883.
Motivation: Drug repositioning is the discovery of new indications for compounds that have already been approved and used in a clinical setting. Recently, some computational approaches have been suggested to unveil new opportunities in a systematic fashion, by taking into consideration gene expression signatures or chemical features for instance. We present here a novel method based on knowledge integration using semantic technologies, to capture the functional role of approved chemical compounds.
Results: In order to computationally generate repositioning hypotheses, we used the Web Ontology Language to formally define the semantics of over 20 000 terms with axioms to correctly denote various modes of action (MoA). Based on an integration of public data, we have automatically assigned over a thousand approved drugs to these MoA categories. The resulting new resource is called the Functional Therapeutic Chemical Classification System and was further evaluated against the content of the traditional Anatomical Therapeutic Chemical Classification System. We illustrate how the new classification can be used to generate drug repurposing hypotheses, using Alzheimer's disease as a use-case.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3957075  PMID: 24177719
5.  Public’s attitudes on participation in a biobank for research: an Italian survey 
BMC Medical Ethics  2014;15(1):81.
The creation of biobanks depends upon people’s willingness to donate their samples for research purposes and to agree to sample storage. Moreover, biobanks are a public good that requires active participation by all interested stakeholders at every stage of development. Therefore, knowing the public’s attitudes towards participation in a biobank and towards biobank management is important and deserves investigation.
A survey was conducted among family members of patients attending the outpatient department of our institute for a geriatric or neurological visit, documenting their willingness to participate in a biobank and their views on the legal-ethical aspects of biobank management. Information regarding subjects’ attitudes on biomedical research in general and genetic research in particular was also collected. Participants’ data on biobanks were compared with data previously collected from the Italian ethics committees (ECs) to evaluate the extent to which lay people and ethics committees share views and concerns regarding biobanks.
One hundred forty-five subjects took part in the survey. Willingness to give biological samples to a biobank set up for research purposes was declared by 86% of subjects and was modulated by subjects’ education. People in favour of providing biological samples for a biobank expressed a more positive view on biomedical research than did people who were not in favour; attitude towards genetic research in dementia was the strongest predictor of participation. Unlike ECs, which prefer specific consent (52%) and rarely choose broad consent (8%) for sample collection in a biobank, participants show a clear preference for broad consent (57%), followed by partially restricted consent (16%), specific consent (15%), and multi-layered consent (12%). Almost all of the subjects willing to contribute to a biobank wish to receive both individual research results and research results of general value, while around fifty per cent of ECs require results communication.
Family members showed willingness to participate in a biobank for research and expressed views on the ethical aspects of biobank management that differ on several issues from the Italian ECs’ opinion. Laypersons’ views should be taken into account in developing biobank regulations.
Electronic supplementary material
The online version of this article (doi:10.1186/1472-6939-15-81) contains supplementary material, which is available to authorized users.
PMCID: PMC4258254  PMID: 25425352
Public attitudes; Biobanks; Genetic research; Bioethics; Ethical policy
6.  SPEX2: automated concise extraction of spatial gene expression patterns from Fly embryo ISH images 
Bioinformatics  2010;26(12):i47-i56.
Motivation: Microarray profiling of mRNA abundance is often ill suited for temporal–spatial analysis of gene expressions in multicellular organisms such as Drosophila. Recent progress in image-based genome-scale profiling of whole-body mRNA patterns via in situ hybridization (ISH) calls for development of accurate and automatic image analysis systems to facilitate efficient mining of complex temporal–spatial mRNA patterns, which will be essential for functional genomics and network inference in higher organisms.
Results: We present SPEX2, an automatic system for embryonic ISH image processing, which can extract, transform, compare, classify and cluster spatial gene expression patterns in Drosophila embryos. Our pipeline for gene expression pattern extraction outputs the precise spatial locations and strengths of the gene expression. We performed experiments on the largest publicly available collection of Drosophila ISH images, and show that our method achieves excellent performance in automatic image annotation, and also finds clusters that are significantly enriched, both for gene ontology functional annotations, and for annotation terms from a controlled vocabulary used by human curators to describe these images.
Availability: Software will be available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2881357  PMID: 20529936
7.  bammds: a tool for assessing the ancestry of low-depth whole-genome data using multidimensional scaling (MDS) 
Bioinformatics  2014;30(20):2962-2964.
Summary: We present bammds, a practical tool that allows visualization of samples sequenced by second-generation sequencing when compared with a reference panel of individuals (usually genotypes) using a multidimensional scaling algorithm. Our tool is aimed at determining the ancestry of unknown samples—typical of ancient DNA data—particularly when only low amounts of data are available for those samples.
Availability and implementation: The software package is available under GNU General Public License v3 and is freely available together with test datasets. It uses R, parallel and samtools.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4184259  PMID: 24974206
8.  graph2tab, a library to convert experimental workflow graphs into tabular formats 
Bioinformatics  2012;28(12):1665-1667.
Motivations: Spreadsheet-like tabular formats are ever more popular in the biomedical field as a means of experimental reporting. The problem of converting the graph of an experimental workflow into a table-based representation occurs in many such formats and is not easy to solve.
Results: We describe graph2tab, a library that implements methods to realise such a conversion in a size-optimised way. Our solution is generic and can be adapted to specific cases of data exporters or data converters that need to be implemented.
Availability and Implementation: The library source code and documentation are available at
Supplementary Information: A supplementary document describes the theoretical and technical details about the library implementation.
PMCID: PMC3371871  PMID: 22556367
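The graph-to-table conversion described in the graph2tab entry above can be illustrated with a deliberately naive sketch. It is not the library's size-optimised method and does not use its API: it simply emits one table row per source-to-sink path of a workflow graph, which can repeat shared steps across rows. The function name `paths_to_rows` and the toy workflow are hypothetical, and the graph is assumed to be acyclic.

```python
from typing import Dict, List


def paths_to_rows(graph: Dict[str, List[str]]) -> List[List[str]]:
    """Return one row per source-to-sink path of an acyclic workflow graph.

    Naive illustration only: shared steps are repeated in every row that
    passes through them, so the table is not size-optimised.
    """
    targets = {t for successors in graph.values() for t in successors}
    nodes = set(graph) | targets
    # Sources are nodes that no edge points to (e.g. the starting samples).
    sources = sorted(n for n in nodes if n not in targets)

    rows: List[List[str]] = []

    def walk(node: str, prefix: List[str]) -> None:
        path = prefix + [node]
        successors = graph.get(node, [])
        if not successors:  # a sink ends the row
            rows.append(path)
            return
        for nxt in successors:
            walk(nxt, path)

    for source in sources:
        walk(source, [])

    # Pad shorter rows with empty cells so every row has the same width.
    width = max((len(r) for r in rows), default=0)
    return [r + [""] * (width - len(r)) for r in rows]


# Hypothetical workflow: two samples pooled into one extract, then two assays.
workflow = {
    "sample1": ["extract1"],
    "sample2": ["extract1"],
    "extract1": ["assay1", "assay2"],
}
table = paths_to_rows(workflow)
for row in table:
    print("\t".join(row))
```

Even this small workflow yields four rows for five nodes, which hints at why a size-optimised conversion is worth having.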
9.  The SAIL Databank: building a national architecture for e-health research and evaluation 
Vast quantities of electronic data are collected about patients and service users as they pass through health service and other public sector organisations, and these data present enormous potential for research and policy evaluation. The Health Information Research Unit (HIRU) aims to realise the potential of electronically-held, person-based, routinely-collected data to conduct and support health-related studies. However, there are considerable challenges that must be addressed before such data can be used for these purposes, to ensure compliance with the legislation and guidelines generally known as Information Governance.
A set of objectives was identified to address the challenges and establish the Secure Anonymised Information Linkage (SAIL) system in accordance with Information Governance. These were to: 1) ensure data transportation is secure; 2) operate a reliable record matching technique to enable accurate record linkage across datasets; 3) anonymise and encrypt the data to prevent re-identification of individuals; 4) apply measures to address disclosure risk in data views created for researchers; 5) ensure data access is controlled and authorised; 6) establish methods for scrutinising proposals for data utilisation and approving output; and 7) gain external verification of compliance with Information Governance.
The SAIL databank has been established and it operates on a DB2 platform (Data Warehouse Edition on AIX) running on an IBM 'P' series Supercomputer: Blue-C. The findings of an independent internal audit were favourable and concluded that the systems in place provide adequate assurance of compliance with Information Governance. This expanding databank already holds over 500 million anonymised and encrypted individual-level records from a range of sources relevant to health and well-being. This includes national datasets covering the whole of Wales (approximately 3 million population) and local provider-level datasets, with further growth in progress. The utility of the databank is demonstrated by increasing engagement in high quality research studies.
Through the pragmatic approach that has been adopted, we have been able to address the key challenges in establishing a national databank of anonymised person-based records, so that the data are available for research and evaluation whilst meeting the requirements of Information Governance.
PMCID: PMC2744675  PMID: 19732426
10.  Suicide Information Database-Cymru: a protocol for a population-based, routinely collected data linkage study to explore risks and patterns of healthcare contact prior to suicide to identify opportunities for intervention 
BMJ Open  2014;4(11):e006780.
Prevention of suicide is a global public health challenge extending beyond mental health services. Linking routinely collected health and social care system data records for the same individual across different services and over time has enormous potential in suicide research. Most previous research linking suicide mortality data with routinely collected electronic health records involves only one or two domains of healthcare provision such as psychiatric inpatient care. This protocol paper describes the development of a population-based, routinely collected data linkage study: the Suicide Information Database Cymru (SID-Cymru). SID-Cymru aims to contribute to the information available on people who complete suicide.
Methods and analysis
SID-Cymru will facilitate a series of electronic case–control studies based in the Secure Anonymised Information Linkage (SAIL) Databank. We have identified 2664 cases of suicide in Wales between 2003 and 2011 from routinely collected mortality data using International Classification of Diseases, Tenth Revision, codes X60–X84 (intentional self-harm) and Y10–Y34 (undetermined intent). Each case will be matched by age and sex to at least five controls. Records will be collated and linked from routinely collected health and social data in Wales for each individual. Conditional logistic regression will be applied to produce crude and confounder (including general practice, socioeconomic status) adjusted ORs.
Ethics and dissemination
The SAIL Databank has the required ethical permissions in place to analyse anonymised data. Ethical approval has been granted by the Information Governance Review Panel (IGRP). Findings will be disseminated through peer-reviewed publications, consultations with stakeholders and national/international conference presentations. The improved understanding of the prior health, nature of previous contacts with services and wider social circumstances of those who complete suicide will assist in prevention policy, service organisation and delivery. SID-Cymru is funded through the National Institute for Social Care and Health Research, Welsh Government (RFS-12-25).
PMCID: PMC4248097  PMID: 25424996
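The matching step in the SID-Cymru protocol above (each case matched by age and sex to at least five controls) might be sketched as follows. This is a minimal illustration under stated assumptions, not the study's procedure: exact-age matching, sampling without replacement, the record fields (`id`, `age`, `sex`) and the function name `match_controls` are all hypothetical.

```python
import random
from collections import defaultdict


def match_controls(cases, pool, n_controls=5, seed=0):
    """Pick n_controls people from pool for each case, matched on age and sex.

    Exact-age matching and sampling without replacement are assumptions
    made for this sketch.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for person in pool:
        strata[(person["age"], person["sex"])].append(person)
    for members in strata.values():
        rng.shuffle(members)  # random order, so pop() below draws at random

    matched = {}
    for case in cases:
        candidates = strata[(case["age"], case["sex"])]
        if len(candidates) < n_controls:
            raise ValueError(f"too few controls for case {case['id']}")
        # pop() removes each control from the pool, so nobody is reused.
        matched[case["id"]] = [candidates.pop() for _ in range(n_controls)]
    return matched


# Hypothetical toy data: two cases and a pool of twelve potential controls.
cases = [
    {"id": "case1", "age": 40, "sex": "F"},
    {"id": "case2", "age": 55, "sex": "M"},
]
pool = [{"id": f"p{i}", "age": 40, "sex": "F"} for i in range(6)]
pool += [{"id": f"p{i}", "age": 55, "sex": "M"} for i in range(6, 12)]

matched = match_controls(cases, pool)
```

The matched sets would then feed a conditional logistic regression, which this sketch does not attempt.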
11.  BioServices: a common Python package to access biological Web Services programmatically 
Bioinformatics  2013;29(24):3241-3242.
Motivation: Web interfaces provide access to numerous biological databases. Many can be accessed programmatically thanks to Web Services. Building applications that combine several of them would benefit from a single framework.
Results: BioServices is a comprehensive Python framework that provides programmatic access to major bioinformatics Web Services (e.g. KEGG, UniProt, BioModels, ChEMBLdb). Wrapping additional Web Services based either on Representational State Transfer or Simple Object Access Protocol/Web Services Description Language technologies is eased by the usage of object-oriented programming.
Availability and implementation: BioServices releases and documentation are available at under a GPL-v3 license.
Contact: or
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3842755  PMID: 24064416
12.  Drug susceptibility prediction against a panel of drugs using kernelized Bayesian multitask learning 
Bioinformatics  2014;30(17):i556-i563.
Motivation: Human immunodeficiency virus (HIV) and cancer require personalized therapies owing to their inherent heterogeneous nature. For both diseases, large-scale pharmacogenomic screens of molecularly characterized samples have been generated with the hope of identifying genetic predictors of drug susceptibility. Thus, computational algorithms capable of inferring robust predictors of drug responses from genomic information are of great practical importance. Most of the existing computational studies that consider drug susceptibility prediction against a panel of drugs formulate a separate learning problem for each drug, which cannot make use of commonalities between subsets of drugs.
Results: In this study, we propose to solve the problem of drug susceptibility prediction against a panel of drugs in a multitask learning framework by formulating a novel Bayesian algorithm that combines kernel-based non-linear dimensionality reduction and binary classification (or regression). The main novelty of our method is the joint Bayesian formulation of projecting data points into a shared subspace and learning predictive models for all drugs in this subspace, which helps us to eliminate off-target effects and drug-specific experimental noise. Another novelty of our method is the ability of handling missing phenotype values owing to experimental conditions and quality control reasons. We demonstrate the performance of our algorithm via cross-validation experiments on two benchmark drug susceptibility datasets of HIV and cancer. Our method obtains statistically significantly better predictive performance on most of the drugs compared with baseline single-task algorithms that learn drug-specific models. These results show that predicting drug susceptibility against a panel of drugs simultaneously within a multitask learning framework improves overall predictive performance over single-task learning approaches.
Availability and implementation: Our Matlab implementations for binary classification and regression are available at
Supplementary Information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4147917  PMID: 25161247
13.  SlideToolkit: An Assistive Toolset for the Histological Quantification of Whole Slide Images 
PLoS ONE  2014;9(11):e110289.
The demand for accurate and reproducible phenotyping of a disease trait increases with the rising number of biobanks and genome wide association studies. Detailed analysis of histology is a powerful way of phenotyping human tissues. Nonetheless, purely visual assessment of histological slides is time-consuming and liable to sampling variation and optical illusions and thereby observer variation, and external validation may be cumbersome. Therefore, within our own biobank, computerized quantification of digitized histological slides is often preferred as a more precise and reproducible, and sometimes more sensitive approach. Relatively few free toolkits are, however, available for fully digitized microscopic slides, usually known as whole slide images. To meet this need, we developed the slideToolkit as a fast method to handle large quantities of low contrast whole slide images using advanced cell detecting algorithms. The slideToolkit has been developed for modern personal computers and high-performance clusters (HPCs) and is available as an open-source project on We here illustrate the power of slideToolkit by a repeated measurement of 303 digital slides containing CD3 stained (DAB) abdominal aortic aneurysm tissue from a tissue biobank. Our workflow consists of four consecutive steps. In the first step (acquisition), whole slide images are collected and converted to TIFF files. In the second step (preparation), files are organized. The third step (tiles) creates multiple manageable tiles to count. In the fourth step (analysis), tissue is analyzed and results are stored in a data set. Using this method, two consecutive measurements of 303 slides showed an intraclass correlation of 0.99. In conclusion, slideToolkit provides a free, powerful and versatile collection of tools for automated feature analysis of whole slide images to create reproducible and meaningful phenotypic data sets.
PMCID: PMC4220929  PMID: 25372389
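The third step of the slideToolkit workflow above (cutting a whole slide image into manageable tiles) can be illustrated by computing the pixel boxes of a regular grid. This is a sketch of the general idea only, not slideToolkit's code; the function name `tile_boxes`, the image dimensions and the tile size are hypothetical.

```python
def tile_boxes(width, height, tile_size):
    """Return (left, top, right, bottom) boxes that cover a width x height image.

    Tiles on the right and bottom edges are clipped to the image bounds,
    so the boxes never overlap and never extend past the image.
    """
    if tile_size <= 0:
        raise ValueError("tile_size must be positive")
    boxes = []
    for top in range(0, height, tile_size):
        for left in range(0, width, tile_size):
            right = min(left + tile_size, width)
            bottom = min(top + tile_size, height)
            boxes.append((left, top, right, bottom))
    return boxes


# Hypothetical 5000 x 3000 pixel image cut into tiles of at most 2000 pixels.
boxes = tile_boxes(5000, 3000, 2000)
```

Each box could then be passed to an image library's crop function and analysed independently, which is what makes the per-tile analysis step easy to parallelise.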
14.  RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies 
Bioinformatics  2014;30(9):1312-1313.
Motivation: Phylogenies are increasingly used in all fields of medical and biological research. Moreover, because of the next-generation sequencing revolution, datasets used for conducting phylogenetic analyses grow at an unprecedented pace. RAxML (Randomized Axelerated Maximum Likelihood) is a popular program for phylogenetic analyses of large datasets under maximum likelihood. Since the last RAxML paper in 2006, it has been continuously maintained and extended to accommodate the increasingly growing input datasets and to serve the needs of the user community.
Results: I present some of the most notable new features and extensions of RAxML, such as a substantial extension of substitution models and supported data types, the introduction of SSE3, AVX and AVX2 vector intrinsics, techniques for reducing the memory requirements of the code and a plethora of operations for conducting post-analyses on sets of trees. In addition, an up-to-date 50-page user manual covering all new RAxML options is available.
Availability and implementation: The code is available under GNU GPL at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3998144  PMID: 24451623
15.  ChromoZoom: a flexible, fluid, web-based genome browser 
Bioinformatics  2012;29(3):384-386.
Summary: Current web-based genome browsers require repetitious user input to scroll over long distances, alter the drawing density of elements or zoom through multiple orders of magnitude. Generally, either the server or the client is responsible for the majority of data processing, resulting in either servers having to receive and handle data relevant only to one user, or clients redundantly processing widely viewed data. ChromoZoom pre-renders and caches general-use tracks into tiled images on the server and serves them in an interactive web interface with inertial scrolling and precise, fluent zooming via the mouse wheel or trackpad. Custom tracks in several formats can be rendered by client-side code alongside the pre-rendered tracks, minimizing server load because of user-specific rendering and eliminating the need to transmit private data. ChromoZoom thereby enables rapid and simultaneous exploration of curated, experimental and personal genomic datasets.
Availability: Human and yeast genome researchers may browse recent assemblies within ChromoZoom at Source code is available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3562068  PMID: 23220575
16.  GINI: From ISH Images to Gene Interaction Networks 
PLoS Computational Biology  2013;9(10):e1003227.
Accurate inference of molecular and functional interactions among genes, especially in multicellular organisms such as Drosophila, often requires statistical analysis of correlations not only between the magnitudes of gene expressions, but also between their temporal-spatial patterns. The ISH (in-situ-hybridization)-based gene expression micro-imaging technology offers an effective approach to perform large-scale spatial-temporal profiling of whole-body mRNA abundance. However, analytical tools for discovering gene interactions from such data remain an open challenge due to various reasons, including difficulties in extracting canonical representations of gene activities from images, and in inference of statistically meaningful networks from such representations. In this paper, we present GINI, a machine learning system for inferring gene interaction networks from Drosophila embryonic ISH images. GINI builds on a computer-vision-inspired vector-space representation of the spatial pattern of gene expression in ISH images, enabled by our recently developed system; and a new multi-instance-kernel algorithm that learns a sparse Markov network model, in which, every gene (i.e., node) in the network is represented by a vector-valued spatial pattern rather than a scalar-valued gene intensity as in conventional approaches such as a Gaussian graphical model. By capturing the notion of spatial similarity of gene expression, and at the same time properly taking into account the presence of multiple images per gene via multi-instance kernels, GINI is well-positioned to infer statistically sound, and biologically meaningful gene interaction networks from image data. Using both synthetic data and a small manually curated data set, we demonstrate the effectiveness of our approach in network building. 
Furthermore, we report results on a large publicly available collection of Drosophila embryonic ISH images from the Berkeley Drosophila Genome Project, where GINI makes novel and interesting predictions of gene interactions. Software for GINI is available at
Author Summary
As high-throughput technologies for molecular abundance profiling are becoming more inexpensive and accessible, computational inference of gene interaction networks from such data based on well-founded statistical principles is imperative to advance the understanding of regulatory mechanisms in various biological systems. Reverse engineering of gene networks has traditionally relied on analysis of whole-genome microarray data; here we present a new method, GINI, to infer gene networks from ISH images, thereby enabling exploration of spatial characteristics of gene expressions for network inference. Our method generates a Markov network, which encapsulates globally meaningful statistical dependencies from vector-valued gene spatial patterns. In other words, we advance the state of the art in both the usage of richer forms of expression data, and the employment of principled statistical methodology for sound network inference on this new form of data. Our results show that analyzing the spatial distribution of gene expression enables us to capture information not available from microarray data. Such an analysis is especially important in analyzing genes involved in embryonic development of Drosophila to reveal specific spatial patterning that determines the development of the 14 segments of the adult fly.
PMCID: PMC3794902  PMID: 24130465
17.  Metingear: a development environment for annotating genome-scale metabolic models 
Bioinformatics  2013;29(17):2213-2215.
Summary: Genome-scale metabolic models often lack annotations that would allow them to be used for further analysis. Previous efforts have focused on associating metabolites in the model with a cross reference, but this can be problematic if the reference is not freely available, multiple resources are used or the metabolite is added from a literature review. Associating each metabolite with chemical structure provides unambiguous identification of the components and a more detailed view of the metabolism. We have developed an open-source desktop application that simplifies the process of adding database cross references and chemical structures to genome-scale metabolic models. Annotated models can be exported to the Systems Biology Markup Language open interchange format.
Availability: Source code, binaries, documentation and tutorials are freely available at The application is implemented in Java with bundles available for MS Windows and Macintosh OS X.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3740624  PMID: 23766418
18.  Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor 
Bioinformatics  2010;26(16):2069-2070.
Summary: A tool to predict the effect that newly discovered genomic variants have on known transcripts is indispensable in prioritizing and categorizing such variants. In Ensembl, a web-based tool (the SNP Effect Predictor) and API interface can now functionally annotate variants in all Ensembl and Ensembl Genomes supported species.
Availability: The Ensembl SNP Effect Predictor can be accessed via the Ensembl website at The Ensembl API ( for installation instructions) is open source software.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2916720  PMID: 20562413
19.  Data exploration, quality control and testing in single-cell qPCR-based gene expression experiments 
Bioinformatics  2012;29(4):461-467.
Motivation: Cell populations are never truly homogeneous; individual cells exist in biochemical states that define functional differences between them. New technology based on microfluidic arrays combined with multiplexed quantitative polymerase chain reactions now enables high-throughput single-cell gene expression measurement, allowing assessment of cellular heterogeneity. However, few analytic tools have been developed specifically for the statistical and analytical challenges of single-cell quantitative polymerase chain reactions data.
Results: We present a statistical framework for the exploration, quality control and analysis of single-cell gene expression data from microfluidic arrays. We assess accuracy and within-sample heterogeneity of single-cell expression and develop quality control criteria to filter unreliable cell measurements. We propose a statistical model accounting for the fact that genes at the single-cell level can be on (and a continuous expression measure is recorded) or dichotomously off (and the recorded expression is zero). Based on this model, we derive a combined likelihood ratio test for differential expression that incorporates both the discrete and continuous components. Using an experiment that examines treatment-specific changes in expression, we show that this combined test is more powerful than either the continuous or dichotomous component in isolation, or a t-test on the zero-inflated data. Although developed for measurements from a specific platform (Fluidigm), these tools are generalizable to other multi-parametric measures over large numbers of events.
Availability: All results presented here were obtained using the SingleCellAssay R package available on GitHub (
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3570210  PMID: 23267174
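The combined likelihood ratio test in the single-cell qPCR entry above can be sketched numerically. The sketch assumes that the discrete (on/off) and continuous components have likelihoods that factorise, so their likelihood ratio statistics add, and that each contributes one degree of freedom in a two-group comparison. These assumptions, the function names and the example counts are illustrative and are not taken from the SingleCellAssay package.

```python
import math


def binomial_loglik(k, n):
    """Maximised Bernoulli log-likelihood for k expressing cells out of n.

    The binomial coefficient is omitted because it cancels in the ratio.
    """
    if k in (0, n):
        return 0.0  # the fitted probability is 0 or 1, so the likelihood is 1
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1 - p)


def discrete_lr(k1, n1, k2, n2):
    """LR statistic for a difference in the fraction of expressing cells."""
    alternative = binomial_loglik(k1, n1) + binomial_loglik(k2, n2)
    null = binomial_loglik(k1 + k2, n1 + n2)
    return 2.0 * (alternative - null)


def combined_p_value(lr_discrete, lr_continuous):
    """Sum the two statistics and compare with chi-square on 2 df.

    For 2 degrees of freedom the chi-square survival function is exp(-x / 2).
    """
    stat = lr_discrete + lr_continuous
    return stat, math.exp(-stat / 2.0)


# Hypothetical example: 9 of 10 treated cells express the gene against 2 of 10
# control cells; the continuous-part statistic (4.0) is simply assumed here.
stat, p = combined_p_value(discrete_lr(9, 10, 2, 10), 4.0)
```

In practice the continuous statistic would come from comparing expression levels among the expressing cells only, for example with a normal-theory likelihood ratio.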
20.  Dietary supplementation and doping-related factors in high-level sailing 
Although dietary supplements (DSs) in sports are considered a natural need resulting from athletes’ increased physical demands, and although they are often consumed by athletes, data on DS usage in Olympic sailing are scarce. The aim of this study was to examine the use of and attitudes towards DSs and doping problems in high-level competitive sailing.
The sample consisted of 44 high-level sailing athletes (5 of whom were female; total mean age 24.13 ± 6.67 years) and 34 coaches (1 of whom was female; total mean age 37.01 ± 11.70). An extensive, self-administered questionnaire of substance use was used, and the subjects were asked about sociodemographic data, sport-related factors, DS-related factors (i.e., usage of and knowledge about DSs, sources of information), and doping-related factors. The Kruskal-Wallis ANOVA was used to determine the differences in group characteristics, and Spearman’s rank order correlation and a logistic regression analysis were used to define the relationships between the studied variables.
DS usage is relatively high. More than 77% of athletes consume DSs, and 38% do so on a regular basis (daily). The athletes place a high degree of trust in their coaches and/or physicians regarding DSs and doping. The most important reasons for not consuming DSs are the opinion that DSs are useless and a lack of knowledge about DSs. The likelihood of doping is low, and one-third of the subjects believe that doping occurs in sailing (no significant differences between athletes and coaches). The logistic regression found crew number (i.e., single vs. double crew) to be the single significant predictor of DS usage, with a higher probability of DS consumption among single crews.
Because of the high consumption of DSs, future investigations should focus on real nutritional needs in sailing. Also, since athletes reported that their coaches are the primary source of information about nutrition and DSs, further studies are necessary to determine the knowledge about nutrition, DSs and doping problems among athletes and their support teams (i.e., coaches, physicians, and strength and conditioning specialists).
PMCID: PMC3536606  PMID: 23217197
Nutritional supplementation; Substances; Testing design; Athlete; Coach
21.  Model requirements for Biobank Software Systems 
Bioinformation  2012;8(6):290-292.
Biobanks are essential tools in diagnostics and therapeutics research and development related to personalized medicine. Several international recommendations, standards and guidelines exist that discuss the legal, ethical, technological, and management requirements of biobanks. Today's biobanks are much more than just collections of biospecimens. They also store a huge amount of data related to biological samples which can be either clinical data or data coming from biochemical experiments. A well-designed biobank software system also provides the possibility of finding associations between stored elements. Modern research biobanks are able to manage multicenter sample collections while fulfilling all requirements of data protection and security. While developing several biobanks and analyzing the data stored in them, our research group recognized the need for a well-organized, easy-to-check requirements guideline that can be used to develop biobank software systems. International best practices along with relevant ICT standards were integrated into a comprehensive guideline: The Model Requirements for the Management of Biological Repositories (BioReq), which covers the full range of activities related to biobank development. The guideline is freely available on the Internet for the research community.
The database is available for free at
PMCID: PMC3321242  PMID: 22493540
Biobank Software System; guideline; model requirement; personalized medicine
22.  RAxML-Light: a tool for computing terabyte phylogenies 
Bioinformatics  2012;28(15):2064-2066.
Motivation: Due to advances in molecular sequencing and the increasingly rapid collection of molecular data, the field of phyloinformatics is transforming into a computational science. Therefore, new tools are required that can be deployed in supercomputing environments and that scale to hundreds or thousands of cores.
Results: We describe RAxML-Light, a tool for large-scale phylogenetic inference on supercomputers under maximum likelihood. It implements a light-weight checkpointing mechanism, deploys 128-bit (SSE3) and 256-bit (AVX) vector intrinsics, offers two orthogonal memory saving techniques and provides a fine-grain production-level message passing interface parallelization of the likelihood function. To demonstrate scalability and robustness of the code, we inferred a phylogeny on a simulated DNA alignment (1481 taxa, 20 000 000 bp) using 672 cores. This dataset requires one terabyte of RAM to compute the likelihood score on a single tree.
Code Availability:
Data Availability:
Supplementary Information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3400957  PMID: 22628519
23.  Fast randomization of large genomic datasets while preserving alteration counts 
Bioinformatics  2014;30(17):i617-i623.
Motivation: Studying combinatorial patterns in cancer genomic datasets has recently emerged as a tool for identifying novel cancer driver networks. Approaches have been devised to quantify, for example, the tendency of a set of genes to be mutated in a ‘mutually exclusive’ manner. The significance of the proposed metrics is usually evaluated by computing P-values under appropriate null models. To this end, a Monte Carlo method (the switching-algorithm) is used to sample simulated datasets under a null model that preserves patient- and gene-wise mutation rates. In this method, a genomic dataset is represented as a bipartite network, to which Markov chain updates (switching-steps) are applied. These steps modify the network topology, and a minimal number of them must be executed to draw simulated datasets independently under the null model. This number has previously been deduced empirically to be a linear function of the total number of variants, making this process computationally expensive.
Results: We present a novel approximate lower bound for the number of switching-steps, derived analytically. Additionally, we have developed the R package BiRewire, including new efficient implementations of the switching-algorithm. We illustrate the performance of BiRewire by applying it to large real cancer genomics datasets. We report vast reductions in time requirements, with respect to existing implementations/bounds and equivalent P-value computations. Thus, we propose BiRewire to study statistical properties in genomic datasets, and other data that can be modeled as bipartite networks.
Availability and implementation: BiRewire is available on BioConductor at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4147926  PMID: 25161255
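The switching-step described in the abstract above can be illustrated with a minimal Python sketch. This is not BiRewire's implementation (BiRewire is an R package); the function names `switching_steps` and `degrees`, the representation of the dataset as a list of unique (gene, patient) pairs, and the choice to skip rejected proposals are all assumptions made for illustration. The number of steps is left to the caller, since the abstract does not state the bound.

```python
import random
from collections import Counter


def degrees(edges):
    """Return (gene-wise, patient-wise) alteration counts for a list of edges."""
    genes = Counter(g for g, _ in edges)
    patients = Counter(p for _, p in edges)
    return genes, patients


def switching_steps(edges, n_steps, seed=None):
    """Randomize a bipartite network by repeated edge switching.

    Each step picks two edges (a, b) and (c, d) and proposes replacing them
    with (a, d) and (c, b). The proposal is rejected if it would create a
    duplicate edge, so every node keeps its original degree.
    """
    rng = random.Random(seed)
    edge_list = list(edges)
    edge_set = set(edge_list)
    if len(edge_list) < 2:
        return edge_list
    for _ in range(n_steps):
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        # Swapping edges that share an endpoint would change nothing.
        if a == c or b == d:
            continue
        # Reject proposals that would duplicate an existing edge.
        if (a, d) in edge_set or (c, b) in edge_set:
            continue
        edge_set.discard((a, b))
        edge_set.discard((c, d))
        edge_set.add((a, d))
        edge_set.add((c, b))
        edge_list[i], edge_list[j] = (a, d), (c, b)
    return edge_list
```

Calling `switching_steps(edges, n_steps)` on a list of (gene, patient) pairs returns a rewired list in which `degrees` gives the same counts as for the input, which is the property the null model requires.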
24.  FSelector: a Ruby gem for feature selection 
Bioinformatics  2012;28(21):2851-2852.
Summary: The FSelector package contains a comprehensive list of feature selection algorithms for supporting bioinformatics and machine learning research. FSelector primarily collects and implements the filter type of feature selection techniques, which are computationally efficient for mining large datasets. In particular, FSelector allows ensemble feature selection that takes advantage of multiple feature selection algorithms to yield more robust results. FSelector also provides many useful auxiliary tools, including normalization, discretization and missing data imputation.
Availability: FSelector, written in the Ruby programming language, is free and open-source software that runs on all Ruby supporting platforms, including Windows, Linux and Mac OS X. FSelector is available from and can be installed easily via the command gem install fselector. The source code is available ( and is fully documented (
Contact: or
Supplementary Information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3476337  PMID: 22942017
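To make the idea of filter-type and ensemble feature selection concrete, here is a minimal Python sketch. It does not use or reproduce FSelector's Ruby API; the function names, the choice of information gain and the chi-squared statistic as the two filters, and the mean-rank aggregation are all assumptions chosen for illustration. Features are assumed to be discrete.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(column, labels):
    """Filter score: reduction in label entropy after splitting on the feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def chi_squared(column, labels):
    """Filter score: chi-squared statistic of the feature/label contingency table."""
    n = len(labels)
    observed = Counter(zip(column, labels))
    value_totals = Counter(column)
    label_totals = Counter(labels)
    stat = 0.0
    for value, n_value in value_totals.items():
        for label, n_label in label_totals.items():
            expected = n_value * n_label / n
            stat += (observed[(value, label)] - expected) ** 2 / expected
    return stat


def ranks(scores):
    """Map feature name to rank (1 = highest score); ties broken by name."""
    ordered = sorted(scores, key=lambda f: (-scores[f], f))
    return {f: r for r, f in enumerate(ordered, start=1)}


def ensemble_rank(features, labels, scorers=(information_gain, chi_squared)):
    """Order features by their mean rank across several filter scorers."""
    per_scorer = [
        ranks({name: scorer(col, labels) for name, col in features.items()})
        for scorer in scorers
    ]
    mean_rank = {f: sum(r[f] for r in per_scorer) / len(per_scorer) for f in features}
    return sorted(features, key=lambda f: (mean_rank[f], f))
```

Here `features` is a dictionary mapping a feature name to its column of discrete values, and `ensemble_rank` returns the feature names from best to worst. Averaging ranks rather than raw scores avoids having to put the different filter scores on a common scale.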
25.  METAREP: JCVI metagenomics reports—an open source tool for high-performance comparative metagenomics 
Bioinformatics  2010;26(20):2631-2632.
Summary: JCVI Metagenomics Reports (METAREP) is a Web 2.0 application designed to help scientists analyze and compare annotated metagenomics datasets. It utilizes Solr/Lucene, a high-performance scalable search engine, to quickly query large data collections. Furthermore, users can use its SQL-like query syntax to filter and refine datasets. METAREP provides graphical summaries for top taxonomic and functional classifications as well as a GO, NCBI Taxonomy and KEGG Pathway Browser. Users can compare absolute and relative counts of multiple datasets at various functional and taxonomic levels. Advanced comparative features comprise statistical tests as well as multidimensional scaling, heatmap and hierarchical clustering plots. Summaries can be exported as tab-delimited files and as publication-quality plots in PDF format. A data management layer allows collaborative data analysis and result sharing.
Availability: Web site; source code
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2951084  PMID: 20798169
