1.  CheNER: chemical named entity recognizer 
Bioinformatics  2013;30(7):1039-1040.
Motivation: Chemical named entity recognition is used to automatically identify mentions of chemical compounds in text and is the basis for more elaborate information extraction. However, only a small number of applications are freely available to identify such mentions. Particularly challenging and useful is the identification of International Union of Pure and Applied Chemistry (IUPAC) chemical compounds, which, owing to the complex morphology of IUPAC names, requires more advanced techniques than the identification of brand names.
Results: We present CheNER, a tool for automated identification of systematic IUPAC chemical mentions. We evaluated different systems using an established literature corpus to show that CheNER performs better at identifying IUPAC names specifically, and that it makes better use of computational resources.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3967102  PMID: 24227678
2.  RUbioSeq: a suite of parallelized pipelines to automate exome variation and bisulfite-seq analyses 
Bioinformatics  2013;29(13):1687-1689.
Motivation: RUbioSeq has been developed to facilitate the primary and secondary analysis of re-sequencing projects by providing an integrated software suite of parallelized pipelines to detect exome variants (single-nucleotide variants and copy number variations) and to perform bisulfite-seq analyses automatically. RUbioSeq’s variant analysis results have already been validated and published.
PMCID: PMC3694642  PMID: 23630175
3.  EnrichNet: network-based gene set enrichment analysis 
Bioinformatics  2012;28(18):i451-i457.
Motivation: Assessing functional associations between an experimentally derived gene or protein set of interest and a database of known gene/protein sets is a common task in the analysis of large-scale functional genomics data. For this purpose, a frequently used approach is to apply an over-representation-based enrichment analysis. However, this approach has four drawbacks: (i) it can only score functional associations of overlapping gene/protein sets; (ii) it disregards genes with missing annotations; (iii) it does not take into account the network structure of physical interactions between the gene/protein sets of interest; and (iv) tissue-specific gene/protein set associations cannot be recognized.
Results: To address these limitations, we introduce an integrative analysis approach and web-application called EnrichNet. It combines a novel graph-based statistic with an interactive sub-network visualization to accomplish two complementary goals: improving the prioritization of putative functional gene/protein set associations by exploiting information from molecular interaction networks and tissue-specific gene expression data and enabling a direct biological interpretation of the results. By using the approach to analyse sets of genes with known involvement in human diseases, new pathway associations are identified, reflecting a dense sub-network of interactions between their corresponding proteins.
Availability: EnrichNet is freely available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3436816  PMID: 22962466
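The over-representation-based enrichment analysis that the EnrichNet abstract contrasts itself with is commonly computed as an upper-tail hypergeometric probability. The following is a minimal sketch of that classical test only, not of EnrichNet's graph-based statistic; the function name, parameter names and toy numbers are illustrative assumptions and do not come from the paper.

```python
from math import comb


def overrepresentation_pvalue(universe_size, pathway_size, query_size, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    universe_size: number of annotated genes in the background (assumed)
    pathway_size:  number of background genes in the reference gene set
    query_size:    number of genes in the experimentally derived set
    overlap:       number of genes shared by the two sets
    """
    total = comb(universe_size, query_size)
    upper = min(pathway_size, query_size)
    # Sum the probability of every overlap at least as large as the observed one.
    tail = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, query_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers (hypothetical): 10 background genes, a 5-gene pathway, a 5-gene query.
p_full_overlap = overrepresentation_pvalue(10, 5, 5, 5)
p_no_overlap = overrepresentation_pvalue(10, 5, 5, 0)
```

With no shared genes the tail covers every possible outcome, so the p-value is 1 regardless of how closely the two sets are connected in an interaction network. This illustrates drawback (i) in the abstract: non-overlapping sets cannot be scored.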
4.  Novel domain combinations in proteins encoded by chimeric transcripts 
Bioinformatics  2012;28(12):i67-i74.
Motivation: Chimeric RNA transcripts are generated by different mechanisms including pre-mRNA trans-splicing, chromosomal translocations and/or gene fusions. It was shown recently that at least some chimeric transcripts can be translated into functional chimeric proteins.
Results: To gain a better understanding of the design principles underlying chimeric proteins, we have analyzed 7,424 chimeric RNAs from humans. We focused on the specific domains present in these proteins, comparing their permutations with those of known human proteins. Our method uses genomic alignments of the chimeras, identification of the gene–gene junction sites and prediction of the protein domains. We found that chimeras contain complete protein domains significantly more often than in random data sets. Specifically, we show that eight different types of domains are over-represented among all chimeras as well as in those chimeras confirmed by RNA-seq experiments. Moreover, we discovered that some chimeras potentially encode proteins with novel and unique domain combinations. Given the observed prevalence of entire protein domains in chimeras, we predict that certain putative chimeras that lack activation domains may actively compete with their parental proteins, thereby exerting dominant negative effects. More generally, the production of chimeric transcripts enables a combinatorial increase in the number of protein products available, which may disturb the function of parental genes and influence their protein–protein interaction network.
Availability: Our scripts are available upon request.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3371848  PMID: 22689780
5.  Bioimage informatics: a new category in Bioinformatics 
Bioinformatics  2012;28(8):1057.
PMCID: PMC3324521  PMID: 22399678
6.  TopoGSA: network topological gene set analysis 
Bioinformatics  2010;26(9):1271-1272.
Summary: TopoGSA (Topology-based Gene Set Analysis) is a web-application dedicated to the computation and visualization of network topological properties for gene and protein sets in molecular interaction networks. Different topological characteristics, such as the centrality of nodes in the network or their tendency to form clusters, can be computed and compared with those of known cellular pathways and processes.
Availability: Freely available at
PMCID: PMC2859135  PMID: 20335277
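The two topological characteristics named in the TopoGSA summary, node centrality and the tendency to form clusters, can be illustrated on a toy network. This is a minimal sketch using degree as the centrality measure and the local clustering coefficient; it is not TopoGSA's implementation, and the edge list, gene names and helper functions are hypothetical.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbours)}."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def degree(adj, node):
    """Degree centrality in its simplest form: the number of neighbours."""
    return len(adj[node])


def clustering_coefficient(adj, node):
    """Fraction of neighbour pairs of `node` that are themselves connected."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


def mean_property(adj, nodes, prop):
    """Average a per-node topological property over a set of nodes."""
    nodes = list(nodes)
    return sum(prop(adj, n) for n in nodes) / len(nodes)


# Hypothetical interaction network and gene set of interest.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]
adj = build_adjacency(edges)
gene_set = ["A", "B", "C"]

set_degree = mean_property(adj, gene_set, degree)
network_degree = mean_property(adj, adj, degree)
set_clustering = mean_property(adj, gene_set, clustering_coefficient)
network_clustering = mean_property(adj, adj, clustering_coefficient)
```

Comparing the gene-set averages with the whole-network averages gives a simple version of the comparison the summary describes; a real analysis would use a curated interaction network and reference pathways rather than this toy graph.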
7.  Determination and validation of principal gene products 
Alternative splicing has the potential to generate a wide range of protein isoforms. For many computational applications and for experimental research, it is important to be able to concentrate on the isoform that retains the core biological function. For many genes this is far from clear.
We have combined five methods into a pipeline that allows us to detect the principal variant for a gene. Most of the methods were based on conservation between species, at the level of both gene and protein. The five methods used were the conservation of exonic structure, the detection of non-neutral evolution, the conservation of functional residues, the existence of a known protein structure and the abundance of vertebrate orthologues. The pipeline was able to determine a principal isoform for 83% of a set of well-annotated genes with multiple variants.
PMCID: PMC2734078  PMID: 18006548
8.  Editorial 
Bioinformatics  2008;24(13):i1.
PMCID: PMC2718667  PMID: 18689809

