1.  Genetic Variations and Diseases in UniProtKB/Swiss-Prot: The Ins and Outs of Expert Manual Curation 
Human Mutation  2014;35(8):927-935.
During the last few years, next-generation sequencing (NGS) technologies have accelerated the detection of genetic variants resulting in the rapid discovery of new disease-associated genes. However, the wealth of variation data made available by NGS alone is not sufficient to understand the mechanisms underlying disease pathogenesis and manifestation. Multidisciplinary approaches combining sequence and clinical data with prior biological knowledge are needed to unravel the role of genetic variants in human health and disease. In this context, it is crucial that these data are linked, organized, and made readily available through reliable online resources. The Swiss-Prot section of the Universal Protein Knowledgebase (UniProtKB/Swiss-Prot) provides the scientific community with a collection of information on protein functions, interactions, biological pathways, as well as human genetic diseases and variants, all manually reviewed by experts. In this article, we present an overview of the information content of UniProtKB/Swiss-Prot to show how this knowledgebase can support researchers in the elucidation of the mechanisms leading from a molecular defect to a disease phenotype.
PMCID: PMC4107114  PMID: 24848695
UniProtKB/Swiss-Prot; database; manual curation; genetic variants; disease; functional annotation; controlled vocabulary
2.  Transcriptional response to cardiac injury in the zebrafish: systematic identification of genes with highly concordant activity across in vivo models 
BMC Genomics  2014;15(1):852.
Zebrafish is a clinically-relevant model of heart regeneration. Unlike mammals, it has a remarkable heart repair capacity after injury, and promises novel translational applications. Amputation and cryoinjury models are key research tools for understanding injury response and regeneration in vivo. An understanding of the transcriptional responses following injury is needed to identify key players of heart tissue repair, as well as potential targets for boosting this property in humans.
We investigated amputation and cryoinjury in vivo models of heart damage in the zebrafish through unbiased, integrative analyses of independent molecular datasets. To detect genes with potential biological roles, we derived computational prediction models with microarray data from heart amputation experiments. We focused on a top-ranked set of genes highly activated in the early post-injury stage, whose activity was further verified in independent microarray datasets. Next, we performed independent validations of expression responses with qPCR in a cryoinjury model. Across in vivo models, the top candidates showed highly concordant responses at 1 and 3 days post-injury, which highlights the predictive power of our analysis strategies and the possible biological relevance of these genes. Top candidates are significantly involved in cell fate specification and differentiation, and include heart failure markers such as periostin, as well as potential new targets for heart regeneration. For example, ptgis and ca2 were overexpressed, while usp2a, a regulator of the p53 pathway, was down-regulated in our in vivo models. Interestingly, a high activity of ptgis and ca2 has been previously observed in failing hearts from rats and humans.
We identified genes with potential critical roles in the response to cardiac damage in the zebrafish. Their transcriptional activities are reproducible in different in vivo models of cardiac injury.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2164-15-852) contains supplementary material, which is available to authorized users.
PMCID: PMC4197235  PMID: 25280539
Myocardial infarction; Zebrafish; Ventricular amputation; Ventricular cryoinjury; Heart regeneration; Transcriptional responses; Transcriptional association networks
3.  An Integrated Ontology Resource to Explore and Study Host-Virus Relationships 
PLoS ONE  2014;9(9):e108075.
Our growing knowledge of viruses reveals how these pathogens manage to evade innate host defenses. A global scheme emerges in which many viruses usurp key cellular defense mechanisms and often inhibit the same components of antiviral signaling. To accurately describe these processes, we have generated a comprehensive dictionary for eukaryotic host-virus interactions. This controlled vocabulary has been detailed in 57 ViralZone resource web pages which contain a global description of all molecular processes. In order to annotate viral gene products with this vocabulary, an ontology has been built in a hierarchy of UniProt Knowledgebase (UniProtKB) keyword terms, and corresponding Gene Ontology (GO) terms have been developed in parallel. The results are 65 UniProtKB keywords related to 57 GO terms, which have been used in 14,390 manual annotations and 908,723 automatic annotations, and propagated to an estimated 922,941 GO annotations. ViralZone pages, UniProtKB keywords and GO terms provide complementary tools to users, and the three resources have been linked to each other through host-virus vocabulary.
PMCID: PMC4169452  PMID: 25233094
5.  Extensive remodeling of DC function by rapid maturation-induced transcriptional silencing 
Nucleic Acids Research  2014;42(15):9641-9655.
The activation, or maturation, of dendritic cells (DCs) is crucial for the initiation of adaptive T-cell mediated immune responses. Research on the molecular mechanisms implicated in DC maturation has focused primarily on inducible gene-expression events promoting the acquisition of new functions, such as cytokine production and enhanced T-cell-stimulatory capacity. In contrast, mechanisms that modulate DC function by inducing widespread gene-silencing remain poorly understood. Yet the termination of key functions is known to be critical for the function of activated DCs. Genome-wide analysis of activation-induced histone deacetylation, combined with genome-wide quantification of activation-induced silencing of nascent transcription, led us to identify a novel inducible transcriptional-repression pathway that makes major contributions to the DC-maturation process. This silencing response is a rapid primary event distinct from repression mechanisms known to operate at later stages of DC maturation. The repressed genes function in pivotal processes—including antigen-presentation, extracellular signal detection, intracellular signal transduction and lipid-mediator biosynthesis—underscoring the central contribution of the silencing mechanism to rapid reshaping of DC function. Interestingly, promoters of the repressed genes exhibit a surprisingly high frequency of PU.1-occupied sites, suggesting a novel role for this lineage-specific transcription factor in marking genes poised for inducible repression.
PMCID: PMC4150779  PMID: 25104025
6.  Analysis of Stop-Gain and Frameshift Variants in Human Innate Immunity Genes 
PLoS Computational Biology  2014;10(7):e1003757.
Loss-of-function variants in innate immunity genes are associated with Mendelian disorders in the form of primary immunodeficiencies. Recent resequencing projects report that stop-gains and frameshifts are collectively prevalent in humans and could be responsible for some of the inter-individual variability in innate immune response. Current computational approaches evaluating loss-of-function in genes carrying these variants rely on gene-level characteristics such as evolutionary conservation and functional redundancy across the genome. However, innate immunity genes represent a particular case because they are more likely to be under positive selection and duplicated. To create a ranking of severity that would be applicable to innate immunity genes we evaluated 17,764 stop-gain and 13,915 frameshift variants from the NHLBI Exome Sequencing Project and 1,000 Genomes Project. Sequence-based features such as loss of functional domains, isoform-specific truncation and nonsense-mediated decay were found to correlate with variant allele frequency and validated with gene expression data. We integrated these features in a Bayesian classification scheme and benchmarked its use in predicting pathogenic variants against Online Mendelian Inheritance in Man (OMIM) disease stop-gains and frameshifts. The classification scheme was applied in the assessment of 335 stop-gains and 236 frameshifts affecting 227 interferon-stimulated genes. The sequence-based score ranks variants in innate immunity genes according to their potential to cause disease, and complements existing gene-based pathogenicity scores. Specifically, the sequence-based score improves measurement of functional gene impairment, discriminates across different variants in a given gene and appears particularly useful for analysis of less conserved genes.
Author Summary
There are well-characterized severe immunodeficiencies associated with loss-of-function variants in innate immunity genes. Genome sequencing projects identify rare stop-gain and frameshift variants in innate immunity genes whose phenotype is uncharacterized. Current methods to estimate the severity of rare stop-gains and frameshifts are based on evolutionary conservation of the gene, the likelihood for redundancy in its function or mutational burden. These parameters are not always applicable to innate immunity genes. We evaluated sequence-level characteristics of more than 30,000 stop-gains and frameshifts and prioritized variants according to their predicted functional consequences. Our scoring approach complements existing tools in the prediction of innate immunity OMIM disease variants and associates with functional readouts such as gene expression. In this framework, we show that many individuals do carry highly pathogenic variants in genes participating in antiviral defense. The clinical assessment of these variants is of significant interest.
PMCID: PMC4110073  PMID: 25058640
7.  The mzTab Data Exchange Format: Communicating Mass-spectrometry-based Proteomics and Metabolomics Experimental Results to a Wider Audience
Molecular & Cellular Proteomics : MCP  2014;13(10):2765-2775.
The HUPO Proteomics Standards Initiative has developed several standardized data formats to facilitate data sharing in mass spectrometry (MS)-based proteomics. These allow researchers to report their complete results in a unified way. However, at present, there is no format to describe the final qualitative and quantitative results for proteomics and metabolomics experiments in a simple tabular format. Many downstream analysis use cases are only concerned with the final results of an experiment and require an easily accessible format, compatible with tools such as Microsoft Excel or R.
We developed the mzTab file format for MS-based proteomics and metabolomics results to meet this need. mzTab is intended as a lightweight supplement to the existing standard XML-based file formats (mzML, mzIdentML, mzQuantML), providing a comprehensive summary, similar in concept to the supplemental material of a scientific publication. mzTab files can contain protein, peptide, and small molecule identifications together with experimental metadata and basic quantitative information. The format is not intended to store the complete experimental evidence but provides mechanisms to report results at different levels of detail. These range from a simple summary of the final results to a representation of the results including the experimental design. This format is ideally suited to make MS-based proteomics and metabolomics results available to a wider biological community outside the field of MS. Several software tools for proteomics and metabolomics have already adopted the format as an output format. The comprehensive mzTab specification document and extensive additional documentation can be found online.
PMCID: PMC4189001  PMID: 24980485
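The tabular layout described in the mzTab abstract above can be illustrated with a minimal reader. This is only a sketch under stated assumptions: the three-letter line prefixes (MTD, PRH/PRT, PEH/PEP, PSH/PSM, SMH/SML, COM) follow our reading of the mzTab specification and are not given in the abstract, the function name `read_sections` is hypothetical, and the sample text is abbreviated and would not validate as a complete mzTab file.

```python
import csv
import io

# Abbreviated, hypothetical sample; a real mzTab file has many more mandatory columns.
SAMPLE = (
    "MTD\tmzTab-version\t1.0.0\n"
    "PRH\taccession\tdescription\n"
    "PRT\tP12345\tExample protein\n"
    "COM\tfree-text comment\n"
)

# Assumed mapping from a data-row prefix to the prefix of its header line.
HEADER_FOR = {"PRT": "PRH", "PEP": "PEH", "PSM": "PSH", "SML": "SMH"}


def read_sections(text):
    """Split tab-separated lines into a metadata dict and per-section row dicts."""
    metadata, headers, rows = {}, {}, {}
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        prefix, rest = fields[0], fields[1:]
        if prefix == "MTD" and len(rest) >= 2:
            metadata[rest[0]] = rest[1]
        elif prefix in HEADER_FOR.values():
            headers[prefix] = rest  # remember the column names of this section
        elif prefix in HEADER_FOR:
            columns = headers.get(HEADER_FOR[prefix], [])
            rows.setdefault(prefix, []).append(dict(zip(columns, rest)))
        # any other prefix (e.g. COM comment lines) is ignored
    return metadata, rows


metadata, rows = read_sections(SAMPLE)
```

Because every line is plain tab-separated text, the same file can also be opened directly in a spreadsheet or read with a generic table reader, which is the accessibility the abstract emphasizes.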
8.  Fifteen years SIB Swiss Institute of Bioinformatics: life science databases, tools and support 
Nucleic Acids Research  2014;42(Web Server issue):W436-W441.
The SIB Swiss Institute of Bioinformatics was created in 1998 as an institution to foster excellence in bioinformatics. It is renowned worldwide for its databases and software tools, such as UniProtKB/Swiss-Prot, PROSITE, SWISS-MODEL, STRING, etc., that are all accessible on SIB's Bioinformatics Resource Portal. This article provides an overview of the scientific and training resources SIB has consistently been offering to the life science community for more than 15 years.
PMCID: PMC4086091  PMID: 24792157
9.  Genome-wide profiling of the cardiac transcriptome after myocardial infarction identifies novel heart-specific long non-coding RNAs 
European Heart Journal  2014;36(6):353-368.
Heart disease is recognized as a consequence of dysregulation of cardiac gene regulatory networks. Previously unappreciated components of such networks are the long non-coding RNAs (lncRNAs). Their roles in the heart remain to be elucidated. Thus, this study aimed to systematically characterize the cardiac long non-coding transcriptome post-myocardial infarction and to elucidate their potential roles in cardiac homoeostasis.
Methods and results
We annotated the mouse transcriptome after myocardial infarction via RNA sequencing and ab initio transcript reconstruction, and integrated genome-wide approaches to associate specific lncRNAs with developmental processes and physiological parameters. Expression of specific lncRNAs strongly correlated with defined parameters of cardiac dimensions and function. Using chromatin maps to infer lncRNA function, we identified many with potential roles in cardiogenesis and pathological remodelling. The vast majority was associated with active cardiac-specific enhancers. Importantly, oligonucleotide-mediated knockdown implicated novel lncRNAs in controlling expression of key regulatory proteins involved in cardiogenesis. Finally, we identified hundreds of human orthologues and demonstrate that particular candidates were differentially modulated in human heart disease.
These findings reveal hundreds of novel heart-specific lncRNAs with unique regulatory and functional characteristics relevant to maladaptive remodelling, cardiac function and possibly cardiac regeneration. This new class of molecules represents potential therapeutic targets for cardiac disease. Furthermore, their exquisite correlation with cardiac physiology renders them attractive candidate biomarkers to be used in the clinic.
PMCID: PMC4320320  PMID: 24786300
Myocardial infarction; Heart failure; Transcriptome; Long non-coding RNAs; Next-generation sequencing
10.  The EMPRES-i genetic module: a novel tool linking epidemiological outbreak information and genetic characteristics of influenza viruses 
Combining epidemiological information, genetic characterization and geomapping in the analysis of influenza can contribute to a better understanding and description of influenza epidemiology and ecology, including possible virus reassortment events. Furthermore, integration of information such as agroecological farming system characteristics can provide new knowledge on risk factors of influenza emergence and spread. Integrating viral characteristics into an animal disease information system is therefore expected to provide a unique tool to trace-and-track particular virus strains; generate clade distributions and spatiotemporal clusters; screen for distribution of viruses with specific molecular markers; identify potential risk factors; and analyze or map viral characteristics related to vaccines used for control and/or prevention. For this purpose, a genetic module was developed within EMPRES-i (FAO’s global animal disease information system) linking epidemiological information from influenza events with virus characteristics and enabling combined analysis. An algorithm was developed to act as the interface between EMPRES-i disease event data and publicly available influenza virus sequences in OpenfluDB. This algorithm automatically computes potential links between outbreak events and sequences, which are subsequently manually validated by experts. Other virus characteristics, such as antiviral resistance, can then be associated with outbreak data. To visualize such characteristics on a geographic map, shape files with virus characteristics to overlay on other EMPRES-i map layers (e.g. animal densities) can be generated. The genetic module allows export of associated epidemiological and sequence data for further analysis. FAO has made this tool available for scientists and policy makers. Contributions are expected from users to improve and validate the number of linked influenza events and isolate information as well as the quality of information.
Possibilities to interconnect with other influenza sequence databases or to expand the genetic module to other viral diseases (e.g. foot and mouth disease) are being explored.
PMCID: PMC3945526  PMID: 24608033
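The linking step described in the EMPRES-i abstract above, where candidate links between outbreak events and sequences are computed automatically and then validated by experts, might be sketched as follows. The abstract does not state the matching criteria, so everything here is an assumption: the record fields, the rule (same country and subtype, dates within a window), the `max_days` parameter and all names are hypothetical and do not reflect the actual EMPRES-i or OpenfluDB schemas.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class OutbreakEvent:  # hypothetical stand-in for a disease event record
    event_id: str
    country: str
    subtype: str
    observed: date


@dataclass(frozen=True)
class SequenceRecord:  # hypothetical stand-in for a sequence record
    accession: str
    country: str
    subtype: str
    collected: date


def candidate_links(events, sequences, max_days=30):
    """Return (event_id, accession, day_gap) tuples, closest dates first.

    Assumed rule: same country, same subtype, dates at most `max_days` apart.
    The output is a list of candidates for manual review, not confirmed links.
    """
    links = []
    for ev in events:
        for seq in sequences:
            if ev.country != seq.country or ev.subtype != seq.subtype:
                continue
            gap = abs((seq.collected - ev.observed).days)
            if gap <= max_days:
                links.append((ev.event_id, seq.accession, gap))
    return sorted(links, key=lambda link: link[2])


# Illustrative, made-up records.
events = [OutbreakEvent("E1", "Egypt", "H5N1", date(2012, 3, 1))]
sequences = [
    SequenceRecord("S1", "Egypt", "H5N1", date(2012, 3, 10)),
    SequenceRecord("S2", "Egypt", "H5N1", date(2012, 6, 1)),
    SequenceRecord("S3", "Vietnam", "H5N1", date(2012, 3, 2)),
]
links = candidate_links(events, sequences)
```

Keeping the day gap in each tuple gives a reviewer a simple way to rank candidates before accepting or rejecting them.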
11.  Automated quantitative histology reveals vascular morphodynamics during Arabidopsis hypocotyl secondary growth 
eLife  2014;3:e01567.
Among various advantages, their small size makes model organisms preferred subjects of investigation. Yet, even in model systems detailed analysis of numerous developmental processes at cellular level is severely hampered by their scale. For instance, secondary growth of Arabidopsis hypocotyls creates a radial pattern of highly specialized tissues that comprises several thousand cells starting from a few dozen. This dynamic process is difficult to follow because of its scale and because it can only be investigated invasively, precluding comprehensive understanding of the cell proliferation, differentiation, and patterning events involved. To overcome this limitation, we established an automated quantitative histology approach. We acquired hypocotyl cross-sections from tiled high-resolution images and extracted their information content using custom high-throughput image processing and segmentation. Coupled with automated cell type recognition through machine learning, we could establish a cellular resolution atlas that reveals vascular morphodynamics during secondary growth, for example equidistant phloem pole formation.
eLife digest
Our understanding of the living world has been advanced greatly by studies of ‘model organisms’, such as mice, zebrafish, and fruit flies. Studying these creatures has been crucial to uncovering the genes that control how our bodies develop and grow, and also to discover the genetic basis of diseases such as cancer.
Thale cress, or Arabidopsis thaliana to give its formal name, is the model organism of choice for many plant biologists. This tiny weed has been widely studied because it can complete its lifecycle, from seed to seed, in about 6 weeks, and because its relatively small genome simplifies the search for genes that control specific traits. However, as with other much-studied model systems, understanding the changes that underpin the development of some of the more complex tissues in Arabidopsis has been severely hampered by the sheer number of cells involved.
After it has emerged from the seed, the plant’s first stem will develop from a few dozen cells in width to several thousand cells with highly specialized tissues arranged in a complex pattern of concentric circles. Although this stem thickening process represents a major developmental change in many plants—from Arabidopsis to oak trees—it has been under-researched. This is partly because it involves so many different cells, and also because it can only be observed in thin sections cut out of the plant’s stem.
Now Sankar, Nieminen, Ragni et al. have developed a novel approach, termed ‘automated quantitative histology’, to overcome these problems. This strategy involves ‘teaching’ a computer to automatically recognize different plant cells and to measure their important features in high-resolution images of tissue sections. The resulting ‘map’ of the developing stem—which required over 800 hr of computing time to complete—reveals the changes to cells and tissues as they develop that allow the transport of water, sugars and nutrients between the above- and below-ground organs. Sankar, Nieminen, Ragni et al. suggest that their novel approach could, in the future, also be applied to study the development of other tissues and organisms, including animals.
PMCID: PMC3917233  PMID: 24520159
secondary growth; machine learning; image segmentation; hypocotyl; phloem; xylem; Arabidopsis
12.  Database resources for the Tuberculosis community 
Access to online repositories for genomic and associated “-omics” datasets is now an essential part of everyday research activity. It is therefore important that the Tuberculosis community is aware of the databases and tools available to them online, and that the database hosts know what the research community needs. One of the goals of the Tuberculosis Annotation Jamboree, held in Washington DC on March 7th–8th 2012, was therefore to provide an overview of the current status of three key Tuberculosis resources: TubercuList, TB Database, and the Pathosystems Resource Integration Center (PATRIC). Here we summarize some key updates and upcoming features in TubercuList, and provide an overview of the PATRIC site and its online tools for pathogen RNA-Seq analysis.
PMCID: PMC3592388  PMID: 23332401
13.  Efficient computation of minimal perturbation sets in gene regulatory networks 
In the last few decades, technological and experimental advancements have enabled a more precise understanding of the mode of action of drugs with respect to human cell signaling pathways and have positively influenced the design of new drug compounds. However, as the design of compounds has become increasingly target-specific, the overall effects of a drug on adjacent cellular signaling pathways remain difficult to predict because of the complexity of the interactions involved. Off-target effects of drugs are known to influence their efficacy and safety. Similarly, drugs which are more target-specific also suffer from lack of efficacy because their scope might be too limited in the context of cellular signaling. Even in situations where the signaling pathways targeted by a drug are known, the presence of point mutations in some of the components of the pathways can render a therapy ineffective in a considerable target subpopulation. Some of these issues can be addressed by predicting Minimal Intervention Sets (MIS) of elements of the signaling pathways that when perturbed give rise to a pre-defined cellular phenotype. These minimal gene perturbation sets can then be further used to screen a library of drug compounds in order to discover effective drug therapies. This manuscript describes algorithms that can be used to discover MIS in a gene regulatory network that can lead to a defined cellular phenotype. Algorithms are implemented in our Boolean modeling toolbox, GenYsis. The software binaries of GenYsis are available for download from
PMCID: PMC3867968  PMID: 24391592
boolean modeling; GRN; MIS; miRNA; algorithms; qualitative modeling; T-Helper; cancer pathways
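The notion of a Minimal Intervention Set from the abstract above can be illustrated on a toy Boolean network. The sketch below is a brute-force search, not the algorithm implemented in GenYsis: the three-gene network, the synchronous update scheme, and all function names are assumptions made for illustration only.

```python
from itertools import combinations, product

# Hypothetical network: A sustains itself, B needs A and is blocked by C,
# and C is repressed by B (so B and C form a bistable switch).
RULES = {
    "A": lambda s: s["A"],
    "B": lambda s: s["A"] and not s["C"],
    "C": lambda s: not s["B"],
}


def step(state, fixed):
    """One synchronous update; genes in `fixed` are clamped to their perturbed value."""
    nxt = {gene: bool(rule(state)) for gene, rule in RULES.items()}
    nxt.update(fixed)
    return nxt


def attractor_states(start, fixed):
    """Iterate from `start` until a state repeats; return the states on that cycle."""
    state = {**start, **fixed}
    seen = []
    while state not in seen:
        seen.append(state)
        state = step(state, fixed)
    return seen[seen.index(state):]


def minimal_intervention_sets(start, target):
    """Smallest sets of clamped genes whose attractor satisfies `target`.

    Target genes themselves are excluded from perturbation, and subsets are
    tried in order of increasing size, so the first size with a hit is minimal.
    """
    genes = sorted(g for g in RULES if g not in target)
    for size in range(len(genes) + 1):
        hits = []
        for subset in combinations(genes, size):
            for values in product([False, True], repeat=size):
                fixed = dict(zip(subset, values))
                cycle = attractor_states(start, fixed)
                if all(s[g] == v for s in cycle for g, v in target.items()):
                    hits.append(fixed)
        if hits:
            return hits
    return []


start = {"A": False, "B": False, "C": True}
mis = minimal_intervention_sets(start, {"B": True})
```

For this toy network the search returns `[{'A': True, 'C': False}]`: switching A on alone is not enough, because C keeps B repressed until C is also knocked down. Exhaustive enumeration grows exponentially with network size, which is why dedicated algorithms are needed for realistic networks.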
14.  SBML qualitative models: a model representation format and infrastructure to foster interactions between qualitative modelling formalisms and tools 
BMC Systems Biology  2013;7:135.
Qualitative frameworks, especially those based on the logical discrete formalism, are increasingly used to model regulatory and signalling networks. A major advantage of these frameworks is that they do not require precise quantitative data, and that they are well-suited for studies of large networks. While numerous groups have developed specific computational tools that provide original methods to analyse qualitative models, a standard format to exchange qualitative models has been missing.
We present the Systems Biology Markup Language (SBML) Qualitative Models Package (“qual”), an extension of the SBML Level 3 standard designed for computer representation of qualitative models of biological networks. We demonstrate the interoperability of models via SBML qual through the analysis of a specific signalling network by three independent software tools. Furthermore, the collective effort to define the SBML qual format paved the way for the development of LogicalModel, an open-source model library, which will facilitate the adoption of the format as well as the collaborative development of algorithms to analyse qualitative models.
SBML qual allows the exchange of qualitative models among a number of complementary software tools. SBML qual has the potential to promote collaborative work on the development of novel computational approaches, as well as on the specification and the analysis of comprehensive qualitative models of regulatory and signalling networks.
PMCID: PMC3892043  PMID: 24321545
15.  Qualitative modeling identifies IL-11 as a novel regulator in maintaining self-renewal in human pluripotent stem cells 
Pluripotency in human embryonic stem cells (hESCs) and induced pluripotent stem cells (iPSCs) is regulated by three transcription factors: OCT3/4, SOX2, and NANOG. To fully exploit the therapeutic potential of these cells it is essential to have a good mechanistic understanding of the maintenance of self-renewal and pluripotency. In this study, we demonstrate a powerful systems biology approach in which we first expand a literature-based network encompassing the core regulators of pluripotency by assessing the behavior of genes targeted by perturbation experiments. We focused our attention on highly regulated genes encoding cell surface and secreted proteins as these can be more easily manipulated by the use of inhibitors or recombinant proteins. Qualitative modeling based on combining boolean networks and in silico perturbation experiments was employed to identify novel pluripotency-regulating genes. We validated Interleukin-11 (IL-11) and demonstrate that this cytokine is a novel pluripotency-associated factor capable of supporting self-renewal in the absence of exogenously added bFGF in culture. To date, the various protocols for hESC maintenance require supplementation with bFGF to activate the Activin/Nodal branch of the TGFβ signaling pathway. Additional evidence supporting our findings is that IL-11 belongs to the same protein family as LIF, which is known to be necessary for maintaining pluripotency in mouse but not in human ESCs. These cytokines operate through the same gp130 receptor which interacts with Janus kinases. Our finding might explain why mESCs are in a more naïve cell state compared to hESCs and how to convert primed hESCs back to the naïve state. Taken together, our integrative modeling approach has identified novel genes as putative candidates to be incorporated into the expansion of the current gene regulatory network responsible for inducing and maintaining pluripotency.
PMCID: PMC3809568  PMID: 24194720
embryonic stem cells; boolean modeling; regulatory networks; pluripotency; self-renewal
16.  The UniProtKB/Swiss-Prot Tox-Prot program: a central hub of integrated venom protein data 
Toxicon  2012;60(4):551-557.
Animal toxins are of interest to a wide range of scientists, due to their numerous applications in pharmacology, neurology, hematology, medicine, and drug research. This, and to a lesser extent the development of new high-performance tools in transcriptomics and proteomics, has led to an increase in toxin discovery. In this context, providing publicly available data on animal toxins has become essential. The UniProtKB/Swiss-Prot Tox-Prot program plays a crucial role by providing access to venom protein sequences and functions from all venomous species. This program has up to now curated more than 5,000 venom proteins to the high-quality standards of UniProtKB/Swiss-Prot (release 2012_02). Proteins targeted by these toxins are also available in the knowledgebase. This paper describes in detail the type of information provided by UniProtKB/Swiss-Prot for toxins, as well as the structured format of the knowledgebase.
PMCID: PMC3393831  PMID: 22465017
UniProtKB/Swiss-Prot Tox-Prot program; Database; Curation; Venom protein; Animal toxin; Bioinformatics
17.  Protein Interaction Data Curation - The International Molecular Exchange Consortium (IMEx) 
Nature methods  2012;9(4):345-350.
The IMEx consortium is an international collaboration between major public interaction data providers to share curation effort and make a non-redundant set of protein interactions available in a single search interface on a common website. Common curation rules have been developed and a central registry is used to manage the selection of articles to enter into the dataset. The advantages of such a service to the user, quality control measures adopted and data distribution practices are discussed.
PMCID: PMC3703241  PMID: 22453911
18.  Hard-wired heterogeneity in blood stem cells revealed using a dynamic regulatory network model 
Bioinformatics  2013;29(13):i80-i88.
Motivation: Combinatorial interactions of transcription factors with cis-regulatory elements control the dynamic progression through successive cellular states and thus underpin all metazoan development. The construction of network models of cis-regulatory elements, therefore, has the potential to generate fundamental insights into cellular fate and differentiation. Haematopoiesis has long served as a model system to study mammalian differentiation, yet modelling based on experimentally informed cis-regulatory interactions has so far been restricted to pairs of interacting factors. Here, we have generated a Boolean network model based on detailed cis-regulatory functional data connecting 11 haematopoietic stem/progenitor cell (HSPC) regulator genes.
Results: Despite its apparent simplicity, the model exhibits surprisingly complex behaviour that we charted using strongly connected components and shortest-path analysis in its Boolean state space. This analysis of our model predicts that HSPCs display heterogeneous expression patterns and possess many intermediate states that can act as ‘stepping stones’ for the HSPC to achieve a final differentiated state. Importantly, an external perturbation or ‘trigger’ is required to exit the stem cell state, with distinct triggers characterizing maturation into the various different lineages. By focusing on intermediate states occurring during erythrocyte differentiation, from our model we predicted a novel negative regulation of Fli1 by Gata1, which we confirmed experimentally thus validating our model. In conclusion, we demonstrate that an advanced mammalian regulatory network model based on experimentally validated cis-regulatory interactions has allowed us to make novel, experimentally testable hypotheses about transcriptional mechanisms that control differentiation of mammalian stem cells.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3694641  PMID: 23813012
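The state-space analysis mentioned in the abstract above (strongly connected components and shortest paths in a Boolean state space) can be sketched on a toy network. This is not the 11-gene HSPC model: the two-gene feedback loop, the `trigger` input, the asynchronous update scheme, and all function names are assumptions chosen only to show the technique.

```python
from collections import deque
from itertools import product

GENES = ("trigger", "X", "Y")  # hypothetical genes; a state is a tuple of booleans
RULES = {
    "trigger": lambda s: s["trigger"],              # external input, never changes itself
    "X": lambda s: not s["Y"] and not s["trigger"],
    "Y": lambda s: s["X"],
}


def successors(state):
    """Asynchronous update: each gene whose rule disagrees with its value may flip."""
    s = dict(zip(GENES, state))
    out = []
    for i, gene in enumerate(GENES):
        new = bool(RULES[gene](s))
        if new != state[i]:
            out.append(state[:i] + (new,) + state[i + 1:])
    return out


def strongly_connected_components(nodes):
    """Tarjan's algorithm (recursive; adequate for this 8-state example)."""
    index, low, stack, on_stack, comps = {}, {}, [], set(), []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in successors(v):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of a component
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            comps.append(comp)

    for v in nodes:
        if v not in index:
            visit(v)
    return comps


def shortest_path(start, goal):
    """Breadth-first search through the state transition graph."""
    prev, queue = {start: None}, deque([start])
    while queue:
        v = queue.popleft()
        if v == goal:
            path = []
            while v is not None:
                path.append(v)
                v = prev[v]
            return path[::-1]
        for w in successors(v):
            if w not in prev:
                prev[w] = v
                queue.append(w)
    return None


states = list(product([False, True], repeat=len(GENES)))
components = strongly_connected_components(states)
path = shortest_path((True, True, True), (True, False, False))
```

In this toy example the four states with the trigger off form a single component that the system cycles through indefinitely, loosely analogous to a heterogeneous but self-maintaining population, while switching the trigger on funnels every state into one stable end state.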
19.  A 2D/3D image analysis system to track fluorescently labeled structures in rod-shaped cells: application to measure spindle pole asymmetry during mitosis 
Cell Division  2013;8:6.
The yeast Schizosaccharomyces pombe is frequently used as a model for studying the cell cycle. The cells are rod-shaped and divide by medial fission. The process of cell division, or cytokinesis, is controlled by a network of signaling proteins called the Septation Initiation Network (SIN); SIN proteins associate with the spindle pole bodies (SPBs) during nuclear division (mitosis). Some SIN proteins associate with both SPBs early in mitosis, and then display strongly asymmetric signal intensity at the SPBs in late mitosis, just before cytokinesis. This asymmetry is thought to be important for correct regulation of SIN signaling, and coordination of cytokinesis and mitosis. In order to study the dynamics of organelles or large protein complexes such as the SPB, which have been labeled with a fluorescent protein tag in living cells, a number of image analysis problems must be solved: the cell outline must be detected automatically, and the position and signal intensity associated with the structures of interest within the cell must be determined.
We present a new 2D and 3D image analysis system that permits versatile and robust analysis of motile, fluorescently labeled structures in rod-shaped cells. We have implemented it as a user-friendly software package allowing fast and robust image analysis of large numbers of rod-shaped cells. We have developed new robust algorithms, which we combined with existing methodologies to facilitate fast and accurate analysis. Our software permits the detection and segmentation of rod-shaped cells in either static or dynamic (i.e. time lapse) multi-channel images. It enables tracking of two structures (for example SPBs) in two different image channels. For 2D or 3D static images, the locations of the structures are identified, and then intensity values are extracted together with several quantitative parameters, such as length, width, cell orientation, background fluorescence and the distance between the structures of interest. Furthermore, two kinds of kymographs of the tracked structures can be generated, one representing the migration with respect to their relative position, the other representing their individual trajectories inside the cell. This software package, called “RodCellJ”, allowed us to analyze a large number of S. pombe cells to understand the rules that govern SIN protein asymmetry.
“RodCellJ” is freely available to the community as a package of several ImageJ plugins for the simultaneous, extensive analysis of the behavior of a large number of rod-shaped cells. The integration of different image-processing techniques in a single package, together with the development of novel algorithms, not only speeds up the analysis relative to existing tools but also yields higher accuracy. Its utility was demonstrated on both 2D and 3D static and dynamic images to study the septation initiation network of the yeast Schizosaccharomyces pombe. More generally, it can be used in any biological context where fluorescent-protein labeled structures need to be analyzed in rod-shaped cells.
RodCellJ is freely available under
PMCID: PMC3693874  PMID: 23622681
Cell segmentation; Protein tracking; Rod shape; Kymograph; Asymmetry; Fluorescence time-lapse microscopy
20.  Density-based hierarchical clustering of pyro-sequences on a large scale—the case of fungal ITS1 
Bioinformatics  2013;29(10):1268-1274.
Motivation: Analysis of millions of pyro-sequences is currently playing a crucial role in the advance of environmental microbiology. Taxonomy-independent, i.e. unsupervised, clustering of these sequences is essential for the definition of Operational Taxonomic Units. For this application, reproducibility and robustness should be the most sought after qualities, but have thus far largely been overlooked.
Results: More than 1 million hyper-variable internal transcribed spacer 1 (ITS1) sequences of fungal origin have been analyzed. The ITS1 sequences were first properly extracted from 454 reads using generalized profiles. Then, otupipe, cd-hit-454, ESPRIT-Tree and DBC454, a new algorithm presented here, were used to analyze the sequences. A numerical assay was developed to measure the reproducibility and robustness of these algorithms. DBC454 was the most robust, closely followed by ESPRIT-Tree. DBC454 features density-based hierarchical clustering, which complements the other methods by providing insights into the structure of the data.
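To make the idea of density-based clustering at several resolutions concrete, here is a minimal generic sketch. It is not DBC454 and does not reproduce its algorithm; the Hamming distance on equal-length toy sequences, the parameter names `radius` and `min_neighbors`, and the function names are all assumptions chosen for illustration.

```python
from itertools import combinations

def hamming(a, b):
    """Number of mismatching positions (assumes equal-length sequences)."""
    return sum(x != y for x, y in zip(a, b))

def density_clusters(seqs, radius, min_neighbors):
    """Generic density-based clustering sketch (not DBC454).

    A sequence is a 'core' point if at least `min_neighbors` other
    sequences lie within `radius`. Clusters are the connected groups
    of core points; non-core sequences are left unassigned.
    """
    n = len(seqs)
    neighbors = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if hamming(seqs[i], seqs[j]) <= radius:
            neighbors[i].add(j)
            neighbors[j].add(i)

    core = {i for i in range(n) if len(neighbors[i]) >= min_neighbors}
    clusters, seen = [], set()
    for start in sorted(core):
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:  # depth-first walk restricted to core points
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(w for w in neighbors[v] if w in core and w not in comp)
        seen |= comp
        clusters.append(sorted(comp))
    return clusters

def cluster_hierarchy(seqs, radii, min_neighbors):
    """Run the clustering at increasing radii to obtain nested levels."""
    return {r: density_clusters(seqs, r, min_neighbors) for r in sorted(radii)}
```

Running the same procedure at increasing radii gives nested partitions, which is one simple way to view the data hierarchically; a real tool operating on millions of reads would need a far more scalable neighbor search than the all-pairs comparison used here.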
Availability: An executable is freely available for non-commercial users at It is designed to run under MPI on a cluster of 64-bit Linux machines running Red Hat 4.x, or on a multi-core OSX system.
PMCID: PMC3654712  PMID: 23539304
21.  Application of text-mining for updating protein post-translational modification annotation in UniProtKB 
BMC Bioinformatics  2013;14:104.
The annotation of protein post-translational modifications (PTMs) is an important task of UniProtKB curators and, with continuing improvements in experimental methodology, an ever greater number of articles are being published on this topic. To help curators cope with this growing body of information we have developed a system which extracts information from the scientific literature for the most frequently annotated PTMs in UniProtKB.
The procedure uses a pattern-matching and rule-based approach to extract sentences with information on the type and site of modification. A ranked list of protein candidates for the modification is also provided. For PTM extraction, precision varies from 57% to 94%, and recall from 75% to 95%, according to the type of modification. The procedure was used to track new publications on PTMs and to recover potential supporting evidence for phosphorylation sites annotated based on the results of large scale proteomics experiments.
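A pattern-matching extraction step of the kind described above, and the precision and recall figures used to evaluate it, can be sketched as follows. The regular expression, the short lists of modification stems and residue names, and the function names are hypothetical choices for this example; they are not the patterns or rules of the published system.

```python
import re

# Hypothetical vocabulary: word stems mapped to a normalized PTM type.
PTM_STEMS = {
    "phosphorylat": "phosphorylation",
    "acetylat": "acetylation",
    "methylat": "methylation",
    "ubiquitinat": "ubiquitination",
}
RESIDUES = ("Ser", "Thr", "Tyr", "Lys", "Arg", "Asn", "Cys")

# A PTM word followed, within the same sentence, by a residue and position
# such as "Ser-15", "Ser 15" or "Ser15".
SITE_PATTERN = re.compile(
    r"\b(?P<stem>{stems})\w*[^.]*?\b(?P<res>{residues})[- ]?(?P<pos>\d+)\b".format(
        stems="|".join(PTM_STEMS), residues="|".join(RESIDUES)
    ),
    re.IGNORECASE,
)

def extract_ptm_sites(sentence):
    """Return (ptm_type, residue, position) tuples found in a sentence."""
    return [
        (
            PTM_STEMS[m.group("stem").lower()],
            m.group("res").capitalize(),
            int(m.group("pos")),
        )
        for m in SITE_PATTERN.finditer(sentence)
    ]

def precision_recall(predicted, gold):
    """Precision = TP / predicted, recall = TP / gold (0.0 if undefined)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

For example, `extract_ptm_sites("Protein X is phosphorylated at Ser-15.")` yields a single phosphorylation site at Ser 15, and comparing a set of extracted sites with a manually curated set through `precision_recall` gives the two measures reported per modification type.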
The information retrieval and extraction method we have developed in this study forms the basis of a simple tool for the manual curation of protein post-translational modifications in UniProtKB/Swiss-Prot. Our work demonstrates that even simple text-mining tools can be effectively adapted for database curation tasks, provided that a thorough understanding of the working process and requirements is first obtained. This system can be accessed at
PMCID: PMC3660268  PMID: 23517090
22.  pfsearchV3: a code acceleration and heuristic to search PROSITE profiles 
Bioinformatics  2013;29(9):1215-1217.
Summary: The PROSITE resource provides a rich and well annotated source of signatures in the form of generalized profiles that allow protein domain detection and functional annotation. One of the major limiting factors in the application of PROSITE in genome and metagenome annotation pipelines is the time required to search protein sequence databases for putative matches. We describe an improved and optimized implementation of the PROSITE search tool pfsearch that, combined with a newly developed heuristic, addresses this limitation. On a modern x86_64 hyper-threaded quad-core desktop computer, the new pfsearchV3 is two orders of magnitude faster than the original algorithm.
Availability and implementation: Source code and binaries of pfsearchV3 are freely available for download at, implemented in C and supported on Linux. PROSITE generalized profiles including the heuristic cut-off scores are available at the same address.
PMCID: PMC3634184  PMID: 23505298
23.  Evolution of the Ferric Reductase Domain (FRD) Superfamily: Modularity, Functional Diversification, and Signature Motifs 
PLoS ONE  2013;8(3):e58126.
A heme-containing transmembrane ferric reductase domain (FRD) is found in bacterial and eukaryotic protein families, including ferric reductases (FRE) and NADPH oxidases (NOX). The aim of this study was to understand the phylogeny of the FRD superfamily. Bacteria contain FRD proteins consisting only of the ferric reductase domain, such as YedZ and short bFRE proteins. Full-length FRE and NOX enzymes are mostly found in eukaryotic cells and all possess a dehydrogenase domain, allowing them to catalyze electron transfer from cytosolic NADPH to extracellular metal ions (FRE) or oxygen (NOX). Metazoa possess YedZ-related STEAP proteins, possibly derived from bacteria through horizontal gene transfer. Phylogenetic analyses suggest that FRE enzymes appeared early in evolution, followed by a transition towards EF-hand containing NOX enzymes (NOX5- and DUOX-like). An ancestral gene of the NOX(1-4) family probably lost the EF-hands, and new regulatory mechanisms of increasing complexity evolved in this clade. Two signature motifs were identified: NOX enzymes are distinguished from FRE enzymes through a four amino acid motif spanning from transmembrane domain 3 (TM3) to TM4, and YedZ/STEAP proteins are identified by the replacement of the first canonical heme-spanning histidine by a highly conserved arginine. The FRD superfamily most likely originated in bacteria.
PMCID: PMC3591440  PMID: 23505460
24.  Plant species distributions along environmental gradients: do belowground interactions with fungi matter? 
The distribution of plants along environmental gradients is constrained by abiotic and biotic factors. Cumulative evidence attests to the impact of biotic factors on plant distributions, but only a few studies discuss the role of belowground communities. Soil fungi, in particular, are thought to play an important role in how plant species assemble locally into communities. We first review existing evidence, and then test the effect of the number of soil fungal operational taxonomic units (OTUs) on plant species distributions using a recently collected dataset of plant and metagenomic information on soil fungi in the Western Swiss Alps. Using species distribution models (SDMs), we investigated whether the distribution of individual plant species is correlated with the number of OTUs of two important soil fungal classes known to interact with plants: the Glomeromycetes, which are obligate symbionts of plants, and the Agaricomycetes, which may be facultative plant symbionts, pathogens, or wood decayers. We show that including the fungal richness information in the models of plant species distributions improves predictive accuracy. The number of fungal OTUs is especially correlated with the distribution of high-elevation plant species. We suggest that high-elevation soils show greater variation in fungal assemblages, which may in turn impact plant turnover among communities. We finally discuss how to move beyond correlative analyses through the design of field experiments manipulating plant and fungal communities along environmental gradients.
PMCID: PMC3857535  PMID: 24339830
fungal communities; plant assemblage; elevation; 454 pyrosequencing; species distribution models
25.  ViralZone: recent updates to the virus knowledge resource 
Nucleic Acids Research  2012;41(Database issue):D579-D583.
ViralZone is a knowledge repository that allows users to learn about viruses including their virion structure, replication cycle and host–virus interactions. The information is divided into viral fact sheets that describe virion shape, molecular biology and epidemiology for each viral genus, with links to the corresponding annotated proteomes of UniProtKB. Each viral genus page contains detailed illustrations, text and PubMed references. This new update provides a linked view of viral molecular biology through 133 new viral ontology pages that describe common steps of viral replication cycles shared by several viral genera. This viral cell-cycle ontology is also represented in UniProtKB in the form of annotated keywords. In this way, users can navigate from the description of a replication-cycle event, to the viral genus concerned, and the associated UniProtKB protein records.
PMCID: PMC3531065  PMID: 23193299

