Katayama, Toshiaki | Wilkinson, Mark D | Micklem, Gos | Kawashima, Shuichi | Yamaguchi, Atsuko | Nakao, Mitsuteru | Yamamoto, Yasunori | Okamoto, Shinobu | Oouchida, Kenta | Chun, Hong-Woo | Aerts, Jan | Afzal, Hammad | Antezana, Erick | Arakawa, Kazuharu | Aranda, Bruno | Belleau, Francois | Bolleman, Jerven | Bonnal, Raoul JP | Chapman, Brad | Cock, Peter JA | Eriksson, Tore | Gordon, Paul MK | Goto, Naohisa | Hayashi, Kazuhiro | Horn, Heiko | Ishiwata, Ryosuke | Kaminuma, Eli | Kasprzyk, Arek | Kawaji, Hideya | Kido, Nobuhiro | Kim, Young Joo | Kinjo, Akira R | Konishi, Fumikazu | Kwon, Kyung-Hoon | Labarga, Alberto | Lamprecht, Anna-Lena | Lin, Yu | Lindenbaum, Pierre | McCarthy, Luke | Morita, Hideyuki | Murakami, Katsuhiko | Nagao, Koji | Nishida, Kozo | Nishimura, Kunihiro | Nishizawa, Tatsuya | Ogishima, Soichi | Ono, Keiichiro | Oshita, Kazuki | Park, Keun-Joon | Prins, Pjotr | Saito, Taro L | Samwald, Matthias | Satagopam, Venkata P | Shigemoto, Yasumasa | Smith, Richard | Splendiani, Andrea | Sugawara, Hideaki | Taylor, James | Vos, Rutger A | Withers, David | Yamasaki, Chisato | Zmasek, Christian M | Kawamoto, Shoko | Okubo, Kosaku | Asai, Kiyoshi | Takagi, Toshihisa
Background
BioHackathon 2010 was the third in a series of meetings hosted by the Database Center for Life Sciences (DBCLS) in Tokyo, Japan. The overall goal of the BioHackathon series is to improve the quality and accessibility of life science research data on the Web by bringing together representatives from public databases, analytical tool providers, and cyber-infrastructure researchers to jointly tackle important challenges in the area of in silico biological research.
Results
The theme of BioHackathon 2010 was the 'Semantic Web', and all attendees gathered with the shared goal of producing Semantic Web data from their respective resources, and/or consuming or interacting with those data using their tools and interfaces. We discussed topics including guidelines for designing semantic data and interoperability of resources. We consequently developed tools and clients for analysis and visualization.
Conclusion
We provide a meeting report from BioHackathon 2010, in which we describe the discussions, decisions, and breakthroughs made as we moved towards compliance with Semantic Web technologies - from source provider, through middleware, to the end-consumer.
doi:10.1186/2041-1480-4-6
PMCID: PMC3598643
PMID: 23398680
BioHackathon; Open source; Software; Semantic Web; Databases; Data integration; Data visualization; Web services; Interfaces
Yao, Fei | Ariyaratne, Pramila N. | Hillmer, Axel M. | Lee, Wah Heng | Li, Guoliang | Teo, Audrey S. M. | Woo, Xing Yi | Zhang, Zhenshui | Chen, Jieqi P. | Poh, Wan Ting | Zawack, Kelson F. B. | Chan, Chee Seng | Leong, See Ting | Neo, Say Chuan | Choi, Poh Sum D. | Gao, Song | Nagarajan, Niranjan | Thoreau, Hervé | Shahab, Atif | Ruan, Xiaoan | Cacheux-Rataboul, Valère | Wei, Chia-Lin | Bourque, Guillaume | Sung, Wing-Kin | Liu, Edison T. | Ruan, Yijun | Aerts, Jan
Structural variations (SVs) contribute significantly to the variability of the human genome and extensive genomic rearrangements are a hallmark of cancer. While genomic DNA paired-end-tag (DNA-PET) sequencing is an attractive approach to identify genomic SVs, the current application of PET sequencing with short insert size DNA can be insufficient for the comprehensive mapping of SVs in low complexity and repeat-rich genomic regions. We employed a recently developed procedure to generate PET sequencing data using large DNA inserts of 10–20 kb and compared their characteristics with short insert (1 kb) libraries for their ability to identify SVs. Our results suggest that although short insert libraries bear an advantage in identifying small deletions, they do not provide significantly better breakpoint resolution. In contrast, large inserts are superior to short inserts in providing higher physical genome coverage for the same sequencing cost and achieve greater sensitivity, in practice, for the identification of several classes of SVs, such as copy number neutral and complex events. Furthermore, our results confirm that large insert libraries allow for the identification of SVs within repetitive sequences, which cannot be spanned by short inserts. This provides a key advantage in studying rearrangements in cancer, and we show how it can be used in a fusion-point-guided-concatenation algorithm to study focally amplified regions in cancer.
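The physical-coverage argument above can be illustrated with a little arithmetic. This is a minimal sketch, assuming the usual definition of physical coverage (number of paired-end tags multiplied by insert size, divided by genome size). The read-pair count and the round 3 Gb genome size are hypothetical values chosen for illustration, not figures from the study.

```python
def physical_coverage(read_pairs, insert_size_bp, genome_size_bp):
    """Average number of inserts spanning each base of the genome."""
    return read_pairs * insert_size_bp / genome_size_bp


# Hypothetical inputs: the same number of sequenced pairs for both libraries.
GENOME_SIZE_BP = 3_000_000_000   # assumed round figure for a human genome
READ_PAIRS = 30_000_000          # assumed sequencing effort

short_insert = physical_coverage(READ_PAIRS, 1_000, GENOME_SIZE_BP)
large_insert = physical_coverage(READ_PAIRS, 10_000, GENOME_SIZE_BP)

print(f"1 kb inserts:  {short_insert:.0f}x physical coverage")
print(f"10 kb inserts: {large_insert:.0f}x physical coverage")
```

With the number of sequenced pairs held fixed, physical coverage scales linearly with insert size, which is the sense in which large inserts give more coverage for the same sequencing cost.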
doi:10.1371/journal.pone.0046152
PMCID: PMC3461012
PMID: 23029419
Sifrim, Alejandro | Van Houdt, Jeroen KJ | Tranchevent, Leon-Charles | Nowakowska, Beata | Sakai, Ryo | Pavlopoulos, Georgios A | Devriendt, Koen | Vermeesch, Joris R | Moreau, Yves | Aerts, Jan
The increasing size and complexity of exome/genome sequencing data requires new tools for clinical geneticists to discover disease-causing variants. Bottlenecks in identifying the causative variation include poor cross-sample querying, constantly changing functional annotation and not considering existing knowledge concerning the phenotype. We describe a methodology that facilitates exploration of patient sequencing data towards identification of causal variants under different genetic hypotheses. Annotate-it facilitates handling, analysis and interpretation of high-throughput single nucleotide variant data. We demonstrate our strategy using three case studies. Annotate-it is freely available and test data are accessible to all users at http://www.annotate-it.org.
doi:10.1186/gm374
PMCID: PMC3580443
PMID: 23013645
Background
The University of California, Santa Cruz (UCSC) genome database is among the most used sources of genomic annotation in human and other organisms. The database offers an excellent web-based graphical user interface (the UCSC genome browser) and several means for programmatic queries. A simple application programming interface (API) in a scripting language aimed at the biologist was however not yet available. Here, we present the Ruby UCSC API, a library to access the UCSC genome database using Ruby.
Results
The API is designed as a BioRuby plug-in and built on the ActiveRecord 3 framework for the object-relational mapping, making writing SQL statements unnecessary. The current version of the API supports databases of all organisms in the UCSC genome database including human, mammals, vertebrates, deuterostomes, insects, nematodes, and yeast.
The API uses the bin index—if available—when querying for genomic intervals. The API also supports genomic sequence queries using locally downloaded *.2bit files that are not stored in the official MySQL database. The API is implemented in pure Ruby and is therefore available in different environments and with different Ruby interpreters (including JRuby).
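The bin index mentioned above can be sketched as follows. This is an illustrative Python rendering of the standard five-level UCSC binning scheme (128 kb finest bins, each level eight times coarser), assuming zero-based half-open coordinates. It is not code from the Ruby UCSC API, and the function names are hypothetical.

```python
# Offsets of the first bin at each level, from finest (128 kb) to coarsest (512 Mb).
BIN_OFFSETS = [512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0]
BIN_FIRST_SHIFT = 17  # 2**17 bases = 128 kb, size of the finest bins
BIN_NEXT_SHIFT = 3    # each level up is 2**3 = 8 times larger


def bin_from_range(start, end):
    """Return the smallest bin that fully contains the interval [start, end)."""
    start_bin = start >> BIN_FIRST_SHIFT
    end_bin = (end - 1) >> BIN_FIRST_SHIFT
    for offset in BIN_OFFSETS:
        if start_bin == end_bin:
            return offset + start_bin
        start_bin >>= BIN_NEXT_SHIFT
        end_bin >>= BIN_NEXT_SHIFT
    raise ValueError("interval exceeds the 512 Mb range of the standard scheme")


def overlapping_bins(start, end):
    """Return every bin that may hold a feature overlapping [start, end)."""
    bins = []
    start_bin = start >> BIN_FIRST_SHIFT
    end_bin = (end - 1) >> BIN_FIRST_SHIFT
    for offset in BIN_OFFSETS:
        bins.extend(range(offset + start_bin, offset + end_bin + 1))
        start_bin >>= BIN_NEXT_SHIFT
        end_bin >>= BIN_NEXT_SHIFT
    return bins


print(bin_from_range(0, 1))
print(overlapping_bins(0, 1))
```

An interval query then restricts the SQL lookup to rows whose `bin` column is in `overlapping_bins(start, end)` before comparing start and end coordinates, which avoids scanning every feature on the chromosome.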
Conclusions
Assisted by the straightforward object-oriented design of Ruby and ActiveRecord, the Ruby UCSC API will help biologists query the UCSC genome database programmatically. The API is available through the RubyGem system. Source code and documentation are available at https://github.com/misshie/bioruby-ucsc-api/ under the Ruby license. Feedback and help are provided via the website at http://rubyucscapi.userecho.com/.
doi:10.1186/1471-2105-13-240
PMCID: PMC3542311
PMID: 22994508
The emergence of benchtop sequencers has made clinical genetic testing using next-generation sequencing more feasible. Ion Torrent's PGM™ is one such benchtop sequencer that shows clinical promise in detecting single nucleotide variations (SNVs) and microindel variations (indels). However, the large number of false positive indels caused by the high frequency of homopolymer sequencing errors has impeded PGM™'s usage for clinical genetic testing. An extensive analysis of PGM™ data from the sequencing reads of the well-characterized genome of the Escherichia coli DH10B strain and sequences of the BRCA1 and BRCA2 genes from six germline samples was done. Three commonly used variant detection tools, SAMtools, Dindel, and GATK's Unified Genotyper, all had substantial false positive rates for indels. By incorporating filters on two major measures we could dramatically improve false positive rates without sacrificing sensitivity. The two measures were: B-Allele Frequency (BAF) and VARiation of the Width of gaps and inserts (VARW) per indel position. A BAF threshold applied to indels detected by UnifiedGenotyper removed ∼99% of the indel errors detected in both the DH10B and BRCA sequences. The optimum BAF threshold for BRCA sequences was determined by requiring 100% detection sensitivity and minimum false discovery rate, using variants detected from Sanger sequencing as reference. This resulted in 15 indel errors remaining, of which 7 indel errors were removed by selecting a VARW threshold of zero. VARW specific errors increased in frequency with higher read depth in the BRCA datasets, suggesting that homopolymer-associated indel errors cannot be reduced by increasing the depth of coverage. Thus, using a VARW threshold is likely to be important in reducing indel errors from data with higher coverage.
In conclusion, BAF and VARW thresholds provide simple and effective filtering criteria that can improve the specificity of indel detection in PGM™ data without compromising sensitivity.
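The two-threshold filtering described above might look like the following in practice. This is only a sketch under stated assumptions: BAF and VARW are taken as precomputed per-indel values, the direction of each comparison (keep high BAF, keep VARW at or below the threshold) is inferred from the abstract, and the BAF cutoff of 0.3 and the `IndelCall` record are hypothetical, since the abstract does not give the optimum value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndelCall:
    """Hypothetical record for one candidate indel position."""
    position: int
    baf: float   # B-allele frequency of the indel, between 0.0 and 1.0
    varw: float  # precomputed variation of gap/insert width at this position


def filter_indels(calls, min_baf, max_varw=0.0):
    """Keep calls that pass both the BAF and the VARW threshold."""
    return [c for c in calls if c.baf >= min_baf and c.varw <= max_varw]


# Made-up example calls, not data from the study.
calls = [
    IndelCall(position=101, baf=0.48, varw=0.0),  # passes both filters
    IndelCall(position=202, baf=0.05, varw=0.0),  # fails BAF
    IndelCall(position=303, baf=0.45, varw=1.5),  # fails VARW
]

kept = filter_indels(calls, min_baf=0.3)  # 0.3 is an assumed cutoff
print([c.position for c in kept])
```

Keeping the two thresholds as separate parameters mirrors the abstract's order of application: BAF first, then VARW for the errors that survive it.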
doi:10.1371/journal.pone.0045798
PMCID: PMC3446914
PMID: 23029247
Kalay, Ersan | Yigit, Gökhan | Aslan, Yakup | Brown, Karen E | Pohl, Esther | Bicknell, Louise S | Kayserili, Hülya | Li, Yun | Tüysüz, Beyhan | Nürnberg, Gudrun | Kiess, Wieland | Koegl, Manfred | Baessmann, Ingelore | Buruk, Kurtulus | Toraman, Bayram | Kayipmaz, Saadettin | Kul, Sibel | Ikbal, Mevlit | Turner, Daniel J | Taylor, Martin S | Aerts, Jan | Scott, Carol | Milstein, Karen | Dollfus, Helene | Wieczorek, Dagmar | Brunner, Han G | Hurles, Matthew | Jackson, Andrew P | Rauch, Anita | Nürnberg, Peter | Karagüzel, Ahmet | Wollnik, Bernd
Functional impairment of DNA damage response pathways leads to increased genomic instability. Here we describe the centrosomal protein CEP152 as a new regulator of genomic integrity and cellular response to DNA damage. Using homozygosity mapping and exome sequencing, we identified CEP152 mutations in Seckel syndrome and showed that impaired CEP152 function leads to accumulation of genomic defects resulting from replicative stress through enhanced activation of ATM signaling and increased H2AX phosphorylation.
doi:10.1038/ng.725
PMCID: PMC3430850
PMID: 21131973
Salmona, Jordi | Salamolard, Marc | Fouillot, Damien | Ghestemme, Thomas | Larose, Jerry | Centon, Jean-François | Sousa, Vitor | Dawson, Deborah A. | Thebaud, Christophe | Chikhi, Lounès | Aerts, Jan
The exceptional biodiversity of Reunion Island is threatened by anthropogenic landscape changes that took place during the 350 years of human colonization. During this period the human population size increased dramatically from 250 to 800,000. The arrival of humans together with the development of agriculture, invasive species such as rats and cats, and deforestation has led to the extinction of more than half of the original vertebrate species of the island. For the remaining species, significant work is being carried out to identify threats and conservation status, but little genetic work has been carried out on some of the most endangered species. In the last decade theoretical studies have shown the ability of neutral genetic markers to infer the demographic history of endangered species and identify and date past population size changes (expansions or bottlenecks). In this study we provide the first genetic data on the critically endangered Reunion cuckoo-shrike, Coracina newtoni. The Reunion cuckoo-shrike is a rare endemic forest bird surviving in a restricted 12-km² area of forested uplands and mountains. The total known population consists of fewer than one hundred individuals, of which 45 were genotyped using seventeen polymorphic microsatellite loci. We found a limited level of genetic variability and weak population structure, probably due to the limited geographic distribution. Using Bayesian methods, we identified a strong decline in population size during the Holocene, most likely caused by an ancient climatic or volcanic event around 5000 years ago. This result was surprising as it appeared to contradict the accepted theory of recent population collapse due to deforestation and predator introduction. These results suggest that new methods allowing for more complex demographic models are necessary to reconstruct the demographic history of populations.
doi:10.1371/journal.pone.0043524
PMCID: PMC3423348
PMID: 22916272
The complete mitochondrial DNA (mtDNA) of Gracilariopsis lemaneiformis was sequenced (25883 bp) and mapped to a circular model. The A+T composition was 72.5%. Forty-six genes and two potentially functional open reading frames were identified. They include 24 protein-coding genes, 2 rRNA genes, 20 tRNA genes and 2 ORFs (orf60, orf142). There is considerable sequence synteny across the five red algal mtDNAs falling into Florideophyceae, including Gr. lemaneiformis in this study and previously sequenced species. A long stem-loop and a hairpin structure were identified in intergenic regions of the mt genome of Gr. lemaneiformis, which are believed to be involved in transcription and replication. In addition, the mtDNAs of two mutagenic cultivated breeds (“981” and “07-2”) were also sequenced. Compared with the mtDNA of wild Gr. lemaneiformis, the genome size, gene length and gene order of the three strains were completely identical except for nine base mutations, including eight in the protein-coding genes and one in a tRNA gene. None of the base mutations caused a frameshift or a premature stop codon in the mtDNA genes. Phylogenetic analyses based on mitochondrial protein-coding genes and rRNA genes demonstrated that Gracilariopsis andersonii had a closer phylogenetic relationship with its parasite Gracilariophila oryzoides than with Gracilariopsis lemaneiformis, which belongs to the same genus.
doi:10.1371/journal.pone.0040241
PMCID: PMC3386957
PMID: 22768261
As suggested in recent studies, species recognition and disambiguation is one of the most critical and challenging steps in many downstream text-mining applications such as the gene normalization task and protein-protein interaction extraction. We report SR4GN: an open source tool for species recognition and disambiguation in biomedical text. In addition to the species detection function in existing tools, SR4GN is optimized for the Gene Normalization task. As such it is developed to link detected species with corresponding gene mentions in a document. SR4GN achieves an accuracy of 85.42% and compares favorably to the other state-of-the-art techniques in benchmark experiments. Finally, SR4GN is implemented as a standalone software tool, thus making it convenient and robust for use in many text-mining applications. SR4GN can be downloaded at: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/downloads/SR4GN
doi:10.1371/journal.pone.0038460
PMCID: PMC3367953
PMID: 22679507
Neylon, Cameron | Aerts, Jan | Brown, C Titus | Coles, Simon J | Hatton, Les | Lemire, Daniel | Millman, K Jarrod | Murray-Rust, Peter | Perez, Fernando | Saunders, Neil | Shah, Nigam | Smith, Arfon | Varoquaux, Gaël | Willighagen, Egon
doi:10.1186/1751-0473-7-2
PMCID: PMC3441321
PMID: 22640749
Johansson, Stefan | Irgens, Henrik | Chudasama, Kishan K. | Molnes, Janne | Aerts, Jan | Roque, Francisco S. | Jonassen, Inge | Levy, Shawn | Lima, Kari | Knappskog, Per M. | Bell, Graeme I. | Molven, Anders | Njølstad, Pål R. | Prokunina-Olsson, Ludmila
Context
Genetic testing for monogenic diabetes is important for patient care. Given the extensive genetic and clinical heterogeneity of diabetes, exome sequencing might provide additional diagnostic potential when standard Sanger sequencing-based diagnostics is inconclusive.
Objective
The aim of the study was to examine the performance of exome sequencing for a molecular diagnosis of MODY in patients who have undergone conventional diagnostic sequencing of candidate genes with negative results.
Research Design and Methods
We performed exome enrichment followed by high-throughput sequencing in nine patients with suspected MODY. They were Sanger sequencing-negative for mutations in the HNF1A, HNF4A, GCK, HNF1B and INS genes. We excluded common, non-coding and synonymous gene variants, and performed in-depth analysis on filtered sequence variants in a pre-defined set of 111 genes implicated in glucose metabolism.
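The filtering steps described above can be sketched as a small pipeline. This is an illustration only: the `Variant` record, the consequence labels, the 1% frequency cutoff for "common" and the gene list in the example are assumptions made for the sketch, not details taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical annotated variant record."""
    gene: str
    consequence: str        # e.g. "missense", "nonsense", "synonymous", "intronic"
    population_freq: float  # allele frequency in a reference population


# Assumed labels for coding, non-synonymous consequences.
KEPT_CONSEQUENCES = {"missense", "nonsense"}


def filter_variants(variants, candidate_genes, max_freq=0.01):
    """Drop common, non-coding and synonymous variants, then keep candidate genes."""
    return [
        v for v in variants
        if v.population_freq <= max_freq          # exclude common variants
        and v.consequence in KEPT_CONSEQUENCES    # exclude non-coding and synonymous
        and v.gene in candidate_genes             # restrict to the pre-defined gene set
    ]


# Made-up example input; the real candidate set contained 111 genes.
candidate_genes = {"ABCC8", "HNF4A", "PPARG"}
variants = [
    Variant("ABCC8", "missense", 0.0),
    Variant("ABCC8", "synonymous", 0.0),
    Variant("HNF4A", "missense", 0.20),
    Variant("TTN", "nonsense", 0.0),
]

for v in filter_variants(variants, candidate_genes):
    print(v.gene, v.consequence)
```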
Results
On average, we obtained 45 X median coverage of the entire targeted exome and found 199 rare coding variants per individual. We identified 0–4 rare non-synonymous and nonsense variants per individual in our a priori list of 111 candidate genes. Three of the variants were considered pathogenic (in ABCC8, HNF4A and PPARG, respectively), thus exome sequencing led to a genetic diagnosis in at least three of the nine patients. Approximately 91% of known heterozygous SNPs in the target exomes were detected, but we also found low coverage in some key diabetes genes using our current exome sequencing approach. Novel variants in the genes ARAP1, GLIS3, MADD, NOTCH2 and WFS1 need further investigation to reveal their possible role in diabetes.
Conclusion
Our results demonstrate that exome sequencing can improve molecular diagnostics of MODY when used as a complement to Sanger sequencing. However, improvements will be needed, especially concerning coverage, before the full potential of exome sequencing can be realized.
doi:10.1371/journal.pone.0038050
PMCID: PMC3360646
PMID: 22662265
Bartlett, Christopher W | Yeon Cheong, Soo | Hou, Liping | Paquette, Jesse | Yee Lum, Pek | Jäger, Günter | Battke, Florian | Vehlow, Corinna | Heinrich, Julian | Nieselt, Kay | Sakai, Ryo | Aerts, Jan | Ray, William C
In 2011, the IEEE VisWeek conferences inaugurated a symposium on Biological Data Visualization. Like other domain-oriented Vis symposia, this symposium's purpose was to explore the unique characteristics and requirements of visualization within the domain, and to enhance both the Visualization and Bio/Life-Sciences communities by pushing Biological data sets and domain understanding into the Visualization community, and well-informed Visualization solutions back to the Biological community. Amongst several other activities, the BioVis symposium created a data analysis and visualization contest. Unlike many contests in other venues, where the purpose is primarily to allow entrants to demonstrate tour-de-force programming skills on sample problems with known solutions, the BioVis contest was intended to whet the participants' appetites for a tremendously challenging biological domain, and simultaneously produce viable tools for a biological grand challenge domain with no extant solutions. For this purpose expression Quantitative Trait Locus (eQTL) data analysis was selected. In the BioVis 2011 contest, we provided contestants with a synthetic eQTL data set containing real biological variation, as well as a spiked-in gene expression interaction network influenced by single nucleotide polymorphism (SNP) DNA variation and a hypothetical disease model. Contestants were asked to elucidate the pattern of SNPs and interactions that predicted an individual's disease state. Nine teams competed in the contest using a mixture of methods, some analytical and others through visual exploratory methods. Independent panels of visualization and biological experts judged entries. Awards were given for each panel's favorite entry, and an overall best entry agreed upon by both panels. Three special mention awards were given for particularly innovative and useful aspects of those entries. Further recognition was given to entries that correctly answered a bonus question about how a proposed "gene therapy" change to a SNP might change an individual's disease status, which served as a calibration for each approach's applicability to a typical domain question. In the future, BioVis will continue the data analysis and visualization contest, maintaining the philosophy of providing new challenging questions in open-ended and dramatically underserved Bio/Life Sciences domains.
doi:10.1186/1471-2105-13-S8-S8
PMCID: PMC3355334
PMID: 22607587
To facilitate genome-guided breeding in potato, we developed an 8303 Single Nucleotide Polymorphism (SNP) marker array using potato genome and transcriptome resources. To validate the Infinium 8303 Potato Array, we developed linkage maps from two diploid populations (DRH and D84) and compared these maps with the assembled potato genome sequence. Both populations used the doubled monoploid reference genotype DM1-3 516 R44 as the female parent but had different heterozygous diploid male parents (RH89-039-16 and 84SD22). Over 4,400 markers were mapped (1,960 in DRH and 2,454 in D84, 787 in common) resulting in map sizes of 965 (DRH) and 792 (D84) cM, covering 87% (DRH) and 88% (D84) of genome sequence length. Of the mapped markers, 33.5% were in candidate genes selected for the array, 4.5% were markers from existing genetic maps, and 61% were selected based on distribution across the genome. Markers with distorted segregation ratios occurred in blocks in both linkage maps, accounting for 4% (DRH) and 9% (D84) of mapped markers. Markers with distorted segregation ratios were unique to each population with blocks on chromosomes 9 and 12 in DRH and 3, 4, 6 and 8 in D84. Chromosome assignment of markers based on linkage mapping differed from sequence alignment with the Potato Genome Sequencing Consortium (PGSC) pseudomolecules for 1% of the mapped markers, with some discordant markers attributable to paralogs. In total, 126 (DRH) and 226 (D84) mapped markers were not anchored to the pseudomolecules and provide new scaffold anchoring data to improve the potato genome assembly. The high degree of concordance between the linkage maps and the pseudomolecules demonstrates both the quality of the potato genome sequence and the functionality of the Infinium 8303 Potato Array. The broad genome coverage of the Infinium 8303 Potato Array compared to other marker sets will enable numerous downstream applications.
doi:10.1371/journal.pone.0036347
PMCID: PMC3338666
PMID: 22558443
Conrad, Donald F. | Pinto, Dalila | Redon, Richard | Feuk, Lars | Gokcumen, Omer | Zhang, Yujun | Aerts, Jan | Andrews, T. Daniel | Barnes, Chris | Campbell, Peter | Fitzgerald, Tomas | Hu, Min | Ihm, Chun Hwa | Kristiansson, Kati | MacArthur, Daniel G. | MacDonald, Jeffrey R. | Onyiah, Ifejinelo | Pang, Andy Wing Chun | Robson, Sam | Stirrups, Kathy | Valsesia, Armand | Walter, Klaudia | Wei, John | Tyler-Smith, Chris | Carter, Nigel P. | Lee, Charles | Scherer, Stephen W. | Hurles, Matthew E.
Nature
2009;464(7289):704-712.
Structural variations of DNA greater than 1 kilobase in size account for most bases that vary among human genomes, but are still relatively under-ascertained. Here we use tiling oligonucleotide microarrays, comprising 42 million probes, to generate a comprehensive map of 11,700 copy number variations (CNVs) greater than 443 base pairs, of which most (8,599) have been validated independently. For 4,978 of these CNVs, we generated reference genotypes from 450 individuals of European, African or East Asian ancestry. The predominant mutational mechanisms differ among CNV size classes. Retrotransposition has duplicated and inserted some coding and non-coding DNA segments randomly around the genome. Furthermore, by correlation with known trait-associated single nucleotide polymorphisms (SNPs), we identified 30 loci with CNVs that are candidates for influencing disease susceptibility. Despite this, having assessed the completeness of our map and the patterns of linkage disequilibrium between CNVs and SNPs, we conclude that, for complex traits, the heritability void left by genome-wide association studies will not be accounted for by common CNVs.
doi:10.1038/nature08516
PMCID: PMC3330748
PMID: 19812545
Background
Elucidating the genotype-phenotype connection is one of the big challenges of modern molecular biology. To fully understand this connection, it is necessary to consider the underlying networks and the time factor. In this context of data deluge and heterogeneous information, visualization plays an essential role in interpreting complex and dynamic topologies. Thus, software that is able to bring the network, phenotypic and temporal information together is needed. Arena3D has been previously introduced as a tool that facilitates link discovery between processes. It uses a layered display to separate different levels of information while emphasizing the connections between them. We present novel developments of the tool for the visualization and analysis of dynamic genotype-phenotype landscapes.
Results
Version 2.0 introduces novel features that allow handling time course data in a phenotypic context. Gene expression levels or other measures can be loaded and visualized at different time points and phenotypic comparison is facilitated through clustering and correlation display or highlighting of impacting changes through time. Similarity scoring allows the identification of global patterns in dynamic heterogeneous data. In this paper we demonstrate the utility of the tool on two distinct biological problems of different scales. First, we analyze a medium scale dataset that looks at perturbation effects of the pluripotency regulator Nanog in murine embryonic stem cells. Dynamic cluster analysis suggests alternative indirect links between Nanog and other proteins in the core stem cell network. Moreover, recurrent correlations from the epigenetic to the translational level are identified. Second, we investigate a large scale dataset consisting of genome-wide knockdown screens for human genes essential in the mitotic process. Here, a potential new role for the gene lsm14a in cytokinesis is suggested. We also show how phenotypic patterning allows for extensive comparison and identification of high impact knockdown targets.
Conclusions
We present a new visualization approach for perturbation screens with multiple phenotypic outcomes. The novel functionality implemented in Arena3D enables effective understanding and comparison of temporal patterns within morphological layers, to help with the system-wide analysis of dynamic processes. Arena3D is available free of charge for academics as a downloadable standalone application from: http://arena3d.org/.
doi:10.1186/1471-2105-13-45
PMCID: PMC3368716
PMID: 22439608
Bonnal, Raoul J.P. | Aerts, Jan | Githinji, George | Goto, Naohisa | MacLean, Dan | Miller, Chase A. | Mishima, Hiroyuki | Pagani, Massimiliano | Ramirez-Gonzalez, Ricardo | Smant, Geert | Strozzi, Francesco | Syme, Rob | Vos, Rutger | Wennblom, Trevor J. | Woodcroft, Ben J. | Katayama, Toshiaki | Prins, Pjotr
Summary: Biogem provides a software development environment for the Ruby programming language, which encourages community-based software development for bioinformatics while lowering the barrier to entry and encouraging best practices.
Biogem, with its targeted modular and decentralized approach, software generator, tools and tight web integration, is an improved general model for scaling up collaborative open source software development in bioinformatics.
Availability: Biogem and modules are free and are OSS. Biogem runs on all systems that support recent versions of Ruby, including Linux, Mac OS X and Windows. Further information at http://www.biogems.info. A tutorial is available at http://www.biogems.info/howto.html
Contact:
bonnal@ingm.org
doi:10.1093/bioinformatics/bts080
PMCID: PMC3315718
PMID: 22332238
Background
Protein-protein interactions (PPI) play a key role in determining the outcome of most cellular processes. The correct identification and characterization of protein interactions and the networks that they comprise is critical for understanding the molecular mechanisms within the cell. Large-scale techniques such as pull-down assays and tandem affinity purification are used to detect protein interactions in an organism. Today, relatively new high-throughput methods like yeast two-hybrid, mass spectrometry, microarrays, and phage display are also used to reveal protein interaction networks.
Results
In this paper we evaluated four different clustering algorithms using six different interaction datasets. We parameterized the MCL, Spectral, RNSC and Affinity Propagation algorithms and applied them to six PPI datasets produced experimentally by Yeast 2 Hybrid (Y2H) and Tandem Affinity Purification (TAP) methods. The predicted clusters, so called protein complexes, were then compared and benchmarked with already known complexes stored in published databases.
Conclusions
While results may differ depending on parameterization, the MCL and RNSC algorithms seem to be more promising and more accurate at predicting PPI complexes. Moreover, they predict more complexes than the other reviewed algorithms in absolute numbers. On the other hand, the spectral clustering algorithm achieves the highest valid prediction rate in our experiments. However, it is nearly always outperformed by both RNSC and MCL in terms of geometrical accuracy, and it generates fewer valid clusters than any other reviewed algorithm. This article demonstrates various metrics to evaluate the accuracy of such predictions as they are presented in the text below. Supplementary material can be found at: http://www.bioacademy.gr/bioinformatics/projects/ppireview.htm
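One common way to compute a geometrical accuracy for predicted complexes is the geometric mean of clustering-wise sensitivity and positive predictive value, both derived from a contingency table of known complexes against predicted clusters. The sketch below assumes that definition; whether it matches the exact metric used in the article is not stated in this abstract, and the function name and example data are hypothetical.

```python
from math import sqrt


def geometric_accuracy(complexes, clusters):
    """Return (sensitivity, ppv, accuracy) for predicted clusters vs known complexes.

    Both arguments are lists of sets of protein identifiers.
    """
    # table[i][j] = number of proteins shared by complex i and cluster j
    table = [[len(cx & cl) for cl in clusters] for cx in complexes]

    # Sensitivity: best-matching cluster per complex, weighted by complex size.
    complex_total = sum(len(cx) for cx in complexes)
    sn = sum(max(row) for row in table) / complex_total if complex_total else 0.0

    # PPV: best-matching complex per cluster, weighted by matched cluster members.
    columns = list(zip(*table))
    matched_total = sum(sum(col) for col in columns)
    ppv = sum(max(col) for col in columns) / matched_total if matched_total else 0.0

    return sn, ppv, sqrt(sn * ppv)


# Made-up example: two known complexes and two predicted clusters.
known = [{"A", "B", "C"}, {"D", "E"}]
predicted = [{"A", "B"}, {"C", "D", "E"}]

sn, ppv, acc = geometric_accuracy(known, predicted)
print(f"Sn={sn:.2f} PPV={ppv:.2f} Acc={acc:.2f}")
```

Taking the geometric mean penalizes an algorithm that scores well on only one of the two measures, for example by predicting a single giant cluster (high sensitivity, low PPV).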
doi:10.1186/1756-0500-4-549
PMCID: PMC3267700
PMID: 22185599
Background
Biological processes such as metabolic pathways, gene regulation or protein-protein interactions are often represented as graphs in systems biology. The understanding of such networks, their analysis, and their visualization are today important challenges in life sciences. While a great variety of visualization tools that try to address most of these challenges already exists, only a few of them succeed in bridging the gap between visualization and network analysis.
Findings
Medusa is a powerful tool for visualization and clustering analysis of large-scale biological networks. It is highly interactive and it supports weighted and unweighted multi-edged directed and undirected graphs. It combines a variety of layouts and clustering methods for comprehensive views and advanced data analysis. Its main purpose is to integrate visualization and analysis of heterogeneous data from different sources into a single network.
Conclusions
Medusa provides a concise visual tool, which is helpful for network analysis and interpretation. Medusa is offered both as a standalone application and as an applet written in Java. It can be found at: https://sites.google.com/site/medusa3visualization.
doi:10.1186/1756-0500-4-384
PMCID: PMC3197509
PMID: 21978489
graph; visualization; biological networks; clustering analysis; data integration
Katayama, Toshiaki | Wilkinson, Mark D | Vos, Rutger | Kawashima, Takeshi | Kawashima, Shuichi | Nakao, Mitsuteru | Yamamoto, Yasunori | Chun, Hong-Woo | Yamaguchi, Atsuko | Kawano, Shin | Aerts, Jan | Aoki-Kinoshita, Kiyoko F | Arakawa, Kazuharu | Aranda, Bruno | Bonnal, Raoul JP | Fernández, José M | Fujisawa, Takatomo | Gordon, Paul MK | Goto, Naohisa | Haider, Syed | Harris, Todd | Hatakeyama, Takashi | Ho, Isaac | Itoh, Masumi | Kasprzyk, Arek | Kido, Nobuhiro | Kim, Young-Joo | Kinjo, Akira R | Konishi, Fumikazu | Kovarskaya, Yulia | von Kuster, Greg | Labarga, Alberto | Limviphuvadh, Vachiranee | McCarthy, Luke | Nakamura, Yasukazu | Nam, Yunsun | Nishida, Kozo | Nishimura, Kunihiro | Nishizawa, Tatsuya | Ogishima, Soichi | Oinn, Tom | Okamoto, Shinobu | Okuda, Shujiro | Ono, Keiichiro | Oshita, Kazuki | Park, Keun-Joon | Putnam, Nicholas | Senger, Martin | Severin, Jessica | Shigemoto, Yasumasa | Sugawara, Hideaki | Taylor, James | Trelles, Oswaldo | Yamasaki, Chisato | Yamashita, Riu | Satoh, Noriyuki | Takagi, Toshihisa
Background
The interaction between biological researchers and the bioinformatics tools they use is still hampered by incomplete interoperability between such tools. To ensure interoperability initiatives are effectively deployed, end-user applications need to be aware of, and support, best practices and standards. Here, we report on an initiative in which software developers and genome biologists came together to explore and raise awareness of these issues: BioHackathon 2009.
Results
Developers in attendance came from diverse backgrounds, with experts in Web services, workflow tools, text mining and visualization. Genome biologists provided expertise and exemplar data from the domains of sequence and pathway analysis and glyco-informatics. One goal of the meeting was to evaluate the ability to address real world use cases in these domains using the tools that the developers represented. This resulted in i) a workflow to annotate 100,000 sequences from an invertebrate species; ii) an integrated system for analysis of the transcription factor binding sites (TFBSs) enriched based on differential gene expression data obtained from a microarray experiment; iii) a workflow to enumerate putative physical protein interactions among enzymes in a metabolic pathway using protein structure data; and iv) a workflow to analyze glyco-gene-related diseases by searching for human homologs of glyco-genes in other species, such as fruit flies, and retrieving their phenotype-annotated SNPs.
Conclusions
Beyond deriving prototype solutions for each use-case, a second major purpose of the BioHackathon was to highlight areas of insufficiency. We discuss the issues raised by our exploration of the problem/solution space, concluding that there are still problems with the way Web services are modeled and annotated, including: i) the absence of several useful data or analysis functions in the Web service "space"; ii) the lack of documentation of methods; iii) lack of compliance with the SOAP/WSDL specification among and between various programming-language libraries; and iv) incompatibility between various bioinformatics data formats. Although it was still difficult to solve real world problems posed to the developers by the biological researchers in attendance because of these problems, we note the promise of addressing these issues within a semantic framework.
doi:10.1186/2041-1480-2-4
PMCID: PMC3170566
PMID: 21806842
Understanding complex systems often requires a bottom-up analysis towards a systems biology approach. The need to investigate a system, not only as individual components but as a whole, emerges. This can be done by examining the elementary constituents individually and then how these are connected. The myriad components of a system and their interactions are best characterized as networks and they are mainly represented as graphs where thousands of nodes are connected by thousands of edges. In this article we demonstrate approaches, models and methods from the graph theory universe and we discuss ways in which they can be used to reveal hidden properties and features of a network. This network profiling combined with knowledge extraction will help us to better understand the biological significance of the system.
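As a small illustration of the kind of graph-based profiling this abstract describes, the sketch below ranks the nodes of a toy network by degree (number of connections). It is a minimal example under stated assumptions, not a method taken from the article: the function names `build_graph` and `rank_by_degree` and the node labels are hypothetical, and degree is only one of many possible ranking criteria.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def rank_by_degree(adjacency):
    """Return (node, degree) pairs, highest degree first; ties sorted by name."""
    degrees = {node: len(neighbours) for node, neighbours in adjacency.items()}
    return sorted(degrees.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy interaction network; the labels are placeholders,
# not real genes or proteins.
toy_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]
ranking = rank_by_degree(build_graph(toy_edges))

for node, degree in ranking:
    print(node, degree)
```

In this toy network the most connected node comes first in the ranking; in a real interaction network such hubs are often candidates for closer inspection, although degree alone does not establish biological significance.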
doi:10.1186/1756-0381-4-10
PMCID: PMC3101653
PMID: 21527005
biological network; clustering analysis; graph theory; node ranking
Summary: The Ensembl database makes genomic features available via its Genome Browser. It is also possible to access the underlying data through a Perl API for advanced querying. We have developed a full-featured Ruby API to the Ensembl databases, providing the same functionality as the Perl interface with additional features. A single Ruby API is used to access different releases of the Ensembl databases and is also able to query multi-species databases.
Availability and Implementation: Most functionality of the API is provided using the ActiveRecord pattern. The library depends on introspection to make it release independent. The API is available through the Rubygem system and can be installed with the command gem install ruby-ensembl-api.
Contact: jan.aerts@esat.kuleuven.be
doi:10.1093/bioinformatics/btr050
PMCID: PMC3065687
PMID: 21278190
Kettner, Carsten | Field, Dawn | Sansone, Susanna-Assunta | Taylor, Chris | Aerts, Jan | Binns, Nigel | Blake, Andrew | Britten, Cedrik M. | de Marco, Ario | Fostel, Jennifer | Gaudet, Pascale | González-Beltrán, Alejandra | Hardy, Nigel | Hellemans, Jan | Hermjakob, Henning | Juty, Nick | Leebens-Mack, Jim | Maguire, Eamonn | Neumann, Steffen | Orchard, Sandra | Parkinson, Helen | Piel, William | Ranganathan, Shoba | Rocca-Serra, Philippe | Santarsiero, Annapaola | Shotton, David | Sterk, Peter | Untergasser, Andreas | Whetzel, Patricia L.
This report summarizes the proceedings of the second workshop of the ‘Minimum Information for Biological and Biomedical Investigations’ (MIBBI) consortium held on Dec 1-2, 2010 in Rüdesheim, Germany through the sponsorship of the Beilstein-Institute. MIBBI is an umbrella organization uniting communities developing Minimum Information (MI) checklists to standardize the description of data sets, the workflows by which they were generated and the scientific context for the work. This workshop brought together representatives of more than twenty communities to present the status of their MI checklists and plans for future development. Shared challenges and solutions were identified and the role of MIBBI in MI checklist development was discussed. The meeting featured some thirty presentations, wide-ranging discussions and breakout groups. The top outcomes of the two-day workshop as defined by the participants were: 1) the chance to share best practices and to identify areas of synergy; 2) defining a series of tasks for updating the MIBBI Portal; 3) reemphasizing the need to maintain independent MI checklists for various communities while leveraging common terms and workflow elements contained in multiple checklists; and 4) revision of the concept of the MIBBI Foundry to focus on the creation of a core set of MIBBI modules intended for reuse by individual MI checklist projects while maintaining the integrity of each MI project. Further information about MIBBI and its range of activities can be found at http://mibbi.org/.
doi:10.4056/sigs.147362
PMCID: PMC3035314
PMID: 21304730
Summary: The BioRuby software toolkit contains a comprehensive set of free development tools and libraries for bioinformatics and molecular biology, written in the Ruby programming language. BioRuby has components for sequence analysis, pathway analysis, protein modelling and phylogenetic analysis; it supports many widely used data formats and provides easy access to databases, external programs and public web services, including BLAST, KEGG, GenBank, MEDLINE and GO. BioRuby comes with a tutorial, documentation and an interactive environment, which can be used both in the shell and in the web browser.
Availability: BioRuby is free and open source software, made available under the Ruby license. BioRuby runs on all platforms that support Ruby, including Linux, Mac OS X and Windows. With JRuby, BioRuby also runs on the Java Virtual Machine. The source code is available from http://www.bioruby.org/.
Contact: katayama@bioruby.org
doi:10.1093/bioinformatics/btq475
PMCID: PMC2951089
PMID: 20739307
Katayama, Toshiaki | Arakawa, Kazuharu | Nakao, Mitsuteru | Ono, Keiichiro | Aoki-Kinoshita, Kiyoko F | Yamamoto, Yasunori | Yamaguchi, Atsuko | Kawashima, Shuichi | Chun, Hong-Woo | Aerts, Jan | Aranda, Bruno | Barboza, Lord Hendrix | Bonnal, Raoul JP | Bruskiewich, Richard | Bryne, Jan C | Fernández, José M | Funahashi, Akira | Gordon, Paul MK | Goto, Naohisa | Groscurth, Andreas | Gutteridge, Alex | Holland, Richard | Kano, Yoshinobu | Kawas, Edward A | Kerhornou, Arnaud | Kibukawa, Eri | Kinjo, Akira R | Kuhn, Michael | Lapp, Hilmar | Lehvaslaiho, Heikki | Nakamura, Hiroyuki | Nakamura, Yasukazu | Nishizawa, Tatsuya | Nobata, Chikashi | Noguchi, Tamotsu | Oinn, Thomas M | Okamoto, Shinobu | Owen, Stuart | Pafilis, Evangelos | Pocock, Matthew | Prins, Pjotr | Ranzinger, René | Reisinger, Florian | Salwinski, Lukasz | Schreiber, Mark | Senger, Martin | Shigemoto, Yasumasa | Standley, Daron M | Sugawara, Hideaki | Tashiro, Toshiyuki | Trelles, Oswaldo | Vos, Rutger A | Wilkinson, Mark D | York, William | Zmasek, Christian M | Asai, Kiyoshi | Takagi, Toshihisa
Web services have become a key technology for bioinformatics, since life science databases are globally decentralized and the exponential increase in the amount of available data demands efficient systems that do not require transferring entire databases for every step of an analysis. However, various incompatibilities among database resources and analysis services make it difficult to connect and integrate these into interoperable workflows. To resolve this situation, we invited domain specialists from web service providers, client software developers, Open Bio* projects, the BioMoby project and researchers of emerging areas where a standard exchange data format is not well established, for an intensive collaboration entitled the BioHackathon 2008. The meeting was hosted by the Database Center for Life Science (DBCLS) and Computational Biology Research Center (CBRC) and was held in Tokyo from February 11th to 15th, 2008. In this report we highlight the work accomplished and the common issues that arose from this event, including the standardization of data exchange formats and services in the emerging fields of glycoinformatics, biological interaction networks, text mining, and phyloinformatics. In addition, common shared object development based on BioSQL, as well as technical challenges in large data management, asynchronous services, and security are discussed. Consequently, we improved interoperability of web services in several fields; however, further cooperation among major database centers and continued collaborative efforts between service providers and software developers are still necessary for an effective advance in bioinformatics web service technologies.
doi:10.1186/2041-1480-1-8
PMCID: PMC2939597
PMID: 20727200
Taylor, Chris F | Field, Dawn | Sansone, Susanna-Assunta | Aerts, Jan | Apweiler, Rolf | Ashburner, Michael | Ball, Catherine A | Binz, Pierre-Alain | Bogue, Molly | Booth, Tim | Brazma, Alvis | Brinkman, Ryan R | Clark, Adam Michael | Deutsch, Eric W | Fiehn, Oliver | Fostel, Jennifer | Ghazal, Peter | Gibson, Frank | Gray, Tanya | Grimes, Graeme | Hancock, John M | Hardy, Nigel W | Hermjakob, Henning | Julian, Randall K | Kane, Matthew | Kettner, Carsten | Kinsinger, Christopher | Kolker, Eugene | Kuiper, Martin | Le Novère, Nicolas | Leebens-Mack, Jim | Lewis, Suzanna E | Lord, Phillip | Mallon, Ann-Marie | Marthandan, Nishanth | Masuya, Hiroshi | McNally, Ruth | Mehrle, Alexander | Morrison, Norman | Orchard, Sandra | Quackenbush, John | Reecy, James M | Robertson, Donald G | Rocca-Serra, Philippe | Rodriguez, Henry | Rosenfelder, Heiko | Santoyo-Lopez, Javier | Scheuermann, Richard H | Schober, Daniel | Smith, Barry | Snape, Jason | Stoeckert, Christian J | Tipton, Keith | Sterk, Peter | Untergasser, Andreas | Vandesompele, Jo | Wiemann, Stefan
The Minimum Information for Biological and Biomedical Investigations (MIBBI) project provides a resource for those exploring the range of extant minimum information checklists and fosters coordinated development of such checklists.
doi:10.1038/nbt.1411
PMCID: PMC2771753
PMID: 18688244