PMCC PMCC

Search tips
Search criteria

Advanced
Results 1-16 (16)
 

Clipboard (0)
None

Select a Filter Below

Journals
more »
Year of Publication
1.  Community-driven development for computational biology at Sprints, Hackathons and Codefests 
BMC Bioinformatics  2014;15(Suppl 14):S7.
Background
Computational biology comprises a wide range of technologies and approaches. Multiple technologies can be combined to create more powerful workflows if the individuals contributing the data or providing tools for its interpretation can find mutual understanding and consensus. Much conversation and joint investigation are required in order to identify and implement the best approaches.
Traditionally, scientific conferences feature talks presenting novel technologies or insights, followed up by informal discussions during coffee breaks. In multi-institution collaborations, in order to reach agreement on implementation details or to transfer deeper insights in a technology and practical skills, a representative of one group typically visits the other. However, this does not scale well when the number of technologies or research groups is large.
Conferences have responded to this issue by introducing Birds-of-a-Feather (BoF) sessions, which offer an opportunity for individuals with common interests to intensify their interaction. However, parallel BoF sessions often make it hard for participants to join multiple BoFs and find common ground between the different technologies, and BoFs are generally too short to allow time for participants to program together.
Results
This report summarises our experience with computational biology Codefests, Hackathons and Sprints, which are interactive developer meetings. They are structured to reduce the limitations of traditional scientific meetings described above by strengthening the interaction among peers and letting the participants determine the schedule and topics. These meetings are commonly run as loosely scheduled "unconferences" (self-organized identification of participants and topics for meetings) over at least two days, with early introductory talks to welcome and organize contributors, followed by intensive collaborative coding sessions. We summarise some prominent achievements of those meetings and describe differences in how these are organised, how their audience is addressed, and their outreach to their respective communities.
Conclusions
Hackathons, Codefests and Sprints share a stimulating atmosphere that encourages participants to jointly brainstorm and tackle problems of shared interest in a self-driven proactive environment, as well as providing an opportunity for new participants to get involved in collaborative projects.
doi:10.1186/1471-2105-15-S14-S7
PMCID: PMC4255748  PMID: 25472764
2.  Targeted exome sequencing of suspected mitochondrial disorders 
Neurology  2013;80(19):1762-1770.
Objective:
To evaluate the utility of targeted exome sequencing for the molecular diagnosis of mitochondrial disorders, which exhibit marked phenotypic and genetic heterogeneity.
Methods:
We considered a diverse set of 102 patients with suspected mitochondrial disorders based on clinical, biochemical, and/or molecular findings, and whose disease ranged from mild to severe, with varying age at onset. We sequenced the mitochondrial genome (mtDNA) and the exons of 1,598 nuclear-encoded genes implicated in mitochondrial biology, mitochondrial disease, or monogenic disorders with phenotypic overlap. We prioritized variants likely to underlie disease and established molecular diagnoses in accordance with current clinical genetic guidelines.
Results:
Targeted exome sequencing yielded molecular diagnoses in established disease loci in 22% of cases, including 17 of 18 (94%) with prior molecular diagnoses and 5 of 84 (6%) without. The 5 new diagnoses implicated 2 genes associated with canonical mitochondrial disorders (NDUFV1, POLG2), and 3 genes known to underlie other neurologic disorders (DPYD, KARS, WFS1), underscoring the phenotypic and biochemical overlap with other inborn errors. We prioritized variants in an additional 26 patients, including recessive, X-linked, and mtDNA variants that were enriched 2-fold over background and await further support of pathogenicity. In one case, we modeled patient mutations in yeast to provide evidence that recessive mutations in ATP5A1 can underlie combined respiratory chain deficiency.
Conclusion:
The results demonstrate that targeted exome sequencing is an effective alternative to the sequential testing of mtDNA and individual nuclear genes as part of the investigation of mitochondrial disease. Our study underscores the ongoing challenge of variant interpretation in the clinical setting.
doi:10.1212/WNL.0b013e3182918c40
PMCID: PMC3719425  PMID: 23596069
3.  Comparison of Illumina and 454 Deep Sequencing in Participants Failing Raltegravir-Based Antiretroviral Therapy 
PLoS ONE  2014;9(3):e90485.
Background
The impact of raltegravir-resistant HIV-1 minority variants (MVs) on raltegravir treatment failure is unknown. Illumina sequencing offers greater throughput than 454, but sequence analysis tools for viral sequencing are needed. We evaluated Illumina and 454 for the detection of HIV-1 raltegravir-resistant MVs.
Methods
A5262 was a single-arm study of raltegravir and darunavir/ritonavir in treatment-naïve patients. Pre-treatment plasma was obtained from 5 participants with raltegravir resistance at the time of virologic failure. A control library was created by pooling integrase clones at predefined proportions. Multiplexed sequencing was performed with Illumina and 454 platforms at comparable costs. Illumina sequence analysis was performed with the novel snp-assess tool and 454 sequencing was analyzed with V-Phaser.
Results
Illumina sequencing resulted in significantly higher sequence coverage and a 0.095% limit of detection. Illumina accurately detected all MVs in the control library at ≥0.5% and 7/10 MVs expected at 0.1%. 454 sequencing failed to detect any MVs at 0.1% with 5 false positive calls. For MVs detected in the patient samples by both 454 and Illumina, the correlation in the detected variant frequencies was high (R2 = 0.92, P<0.001). Illumina sequencing detected 2.4-fold greater nucleotide MVs and 2.9-fold greater amino acid MVs compared to 454. The only raltegravir-resistant MV detected was an E138K mutation in one participant by Illumina sequencing, but not by 454.
Conclusions
In participants of A5262 with raltegravir resistance at virologic failure, baseline raltegravir-resistant MVs were rarely detected. At comparable costs to 454 sequencing, Illumina demonstrated greater depth of coverage, increased sensitivity for detecting HIV MVs, and fewer false positive variant calls.
doi:10.1371/journal.pone.0090485
PMCID: PMC3946168  PMID: 24603872
4.  Small RNA pathway genes identified by patterns of phylogenetic conservation and divergence 
Nature  2012;493(7434):694-698.
Genetic and biochemical analyses of RNA interference (RNAi) and microRNA (miRNA) pathways have revealed proteins such as Argonaute/PIWI and Dicer that process and present small RNAs to their targets. Well validated small RNA pathway cofactors, such as the Argonaute/PIWI proteins show distinctive patterns of conservation or divergence in particular animal, plant, fungal, and protist species. We compared 86 divergent eukaryotic genome sequences to discern sets of proteins that show similar phylogenetic profiles with known small RNA cofactors. A large set of additional candidate small RNA cofactors have emerged from functional genomic screens for defects in miRNA- or siRNA-mediated repression in C. elegans and D. melanogaster1,2 and from proteomic analyses of proteins co-purifying with validated small RNA pathway proteins3,4. The phylogenetic profiles of many of these candidate small RNA pathway proteins are similar to those of known small RNA cofactor proteins. We used a Bayesian approach to integrate the phylogenetic profile analysis with predictions from diverse transcriptional coregulation and proteome interaction datasets to assign a probability for each protein for a role in a small RNA pathway. Testing high-confidence candidates from this analysis for defects in RNAi silencing, we found that about half of the predicted small RNA cofactors are required for RNAi silencing. Many of the newly identified small RNA pathway proteins are orthologues of proteins implicated in RNA splicing. In support of a deep connection between the mechanism of RNA splicing and small RNA-mediated gene silencing, the presence of the Argonaute proteins and other small RNA components in the many species analysed strongly correlates with the number of introns in that species.
doi:10.1038/nature11779
PMCID: PMC3762460  PMID: 23364702
5.  GEMINI: Integrative Exploration of Genetic Variation and Genome Annotations 
PLoS Computational Biology  2013;9(7):e1003153.
Modern DNA sequencing technologies enable geneticists to rapidly identify genetic variation among many human genomes. However, isolating the minority of variants underlying disease remains an important, yet formidable challenge for medical genetics. We have developed GEMINI (GEnome MINIng), a flexible software package for exploring all forms of human genetic variation. Unlike existing tools, GEMINI integrates genetic variation with a diverse and adaptable set of genome annotations (e.g., dbSNP, ENCODE, UCSC, ClinVar, KEGG) into a unified database to facilitate interpretation and data exploration. Whereas other methods provide an inflexible set of variant filters or prioritization methods, GEMINI allows researchers to compose complex queries based on sample genotypes, inheritance patterns, and both pre-installed and custom genome annotations. GEMINI also provides methods for ad hoc queries and data exploration, a simple programming interface for custom analyses that leverage the underlying database, and both command line and graphical tools for common analyses. We demonstrate GEMINI's utility for exploring variation in personal genomes and family based genetic studies, and illustrate its ability to scale to studies involving thousands of human samples. GEMINI is designed for reproducibility and flexibility and our goal is to provide researchers with a standard framework for medical genomics.
doi:10.1371/journal.pcbi.1003153
PMCID: PMC3715403  PMID: 23874191
6.  Using Cloud Computing infrastructure with CloudBioLinux, CloudMan and Galaxy 
Cloud computing has revolutionized availability and access to computing and storage resources; making it possible to provision a large computational infrastructure with only a few clicks in a web browser. However, those resources are typically provided in the form of low-level infrastructure components that need to be procured and configured before use. In this protocol, we demonstrate how to utilize cloud computing resources to perform open-ended bioinformatics analyses, with fully automated management of the underlying cloud infrastructure. By combining three projects, CloudBioLinux, CloudMan, and Galaxy into a cohesive unit, we have enabled researchers to gain access to more than 100 preconfigured bioinformatics tools and gigabytes of reference genomes on top of the flexible cloud computing infrastructure. The protocol demonstrates how to setup the available infrastructure and how to use the tools via a graphical desktop interface, a parallel command line interface, and the web-based Galaxy interface.
doi:10.1002/0471250953.bi1109s38
PMCID: PMC3412548  PMID: 22700313
accessible cloud computing; enabling bioinformatics analyses; turnkey computing
7.  The 3rd DBCLS BioHackathon: improving life science data integration with Semantic Web technologies 
Background
BioHackathon 2010 was the third in a series of meetings hosted by the Database Center for Life Sciences (DBCLS) in Tokyo, Japan. The overall goal of the BioHackathon series is to improve the quality and accessibility of life science research data on the Web by bringing together representatives from public databases, analytical tool providers, and cyber-infrastructure researchers to jointly tackle important challenges in the area of in silico biological research.
Results
The theme of BioHackathon 2010 was the 'Semantic Web', and all attendees gathered with the shared goal of producing Semantic Web data from their respective resources, and/or consuming or interacting those data using their tools and interfaces. We discussed on topics including guidelines for designing semantic data and interoperability of resources. We consequently developed tools and clients for analysis and visualization.
Conclusion
We provide a meeting report from BioHackathon 2010, in which we describe the discussions, decisions, and breakthroughs made as we moved towards compliance with Semantic Web technologies - from source provider, through middleware, to the end-consumer.
doi:10.1186/2041-1480-4-6
PMCID: PMC3598643  PMID: 23398680
BioHackathon; Open source; Software; Semantic Web; Databases; Data integration; Data visualization; Web services; Interfaces
8.  CloudMan as a platform for tool, data, and analysis distribution 
BMC Bioinformatics  2012;13:315.
Background
Cloud computing provides an infrastructure that facilitates large scale computational analysis in a scalable, democratized fashion, However, in this context it is difficult to ensure sharing of an analysis environment and associated data in a scalable and precisely reproducible way.
Results
CloudMan (usecloudman.org) enables individual researchers to easily deploy, customize, and share their entire cloud analysis environment, including data, tools, and configurations.
Conclusions
With the enabled customization and sharing of instances, CloudMan can be used as a platform for collaboration. The presented solution improves accessibility of cloud resources, tools, and data to the level of an individual researcher and contributes toward reproducibility and transparency of research solutions.
doi:10.1186/1471-2105-13-315
PMCID: PMC3556322  PMID: 23181507
Cloud computing; Service customization; Reproducibility; Accessibility; Galaxy
9.  Toward interoperable bioscience data 
Nature genetics  2012;44(2):121-126.
To make full use of research data, the bioscience community needs to adopt technologies and reward mechanisms that support interoperability and promote the growth of an open ‘data commoning’ culture. Here we describe the prerequisites for data commoning and present an established and growing ecosystem of solutions using the shared ‘Investigation-Study-Assay’ framework to support that vision.
doi:10.1038/ng.1054
PMCID: PMC3428019  PMID: 22281772
10.  Bio.Phylo: A unified toolkit for processing, analyzing and visualizing phylogenetic trees in Biopython 
BMC Bioinformatics  2012;13:209.
Background
Ongoing innovation in phylogenetics and evolutionary biology has been accompanied by a proliferation of software tools, data formats, analytical techniques and web servers. This brings with it the challenge of integrating phylogenetic and other related biological data found in a wide variety of formats, and underlines the need for reusable software that can read, manipulate and transform this information into the various forms required to build computational pipelines.
Results
We built a Python software library for working with phylogenetic data that is tightly integrated with Biopython, a broad-ranging toolkit for computational biology. Our library, Bio.Phylo, is highly interoperable with existing libraries, tools and standards, and is capable of parsing common file formats for phylogenetic trees, performing basic transformations and manipulations, attaching rich annotations, and visualizing trees. We unified the modules for working with the standard file formats Newick, NEXUS and phyloXML behind a consistent and simple API, providing a common set of functionality independent of the data source.
Conclusions
Bio.Phylo meets a growing need in bioinformatics for working with heterogeneous types of phylogenetic data. By supporting interoperability with multiple file formats and leveraging existing Biopython features, this library simplifies the construction of phylogenetic workflows. We also provide examples of the benefits of building a community around a shared open-source project. Bio.Phylo is included with Biopython, available through the Biopython website, http://biopython.org.
doi:10.1186/1471-2105-13-209
PMCID: PMC3468381  PMID: 22909249
11.  Histone H3R2 symmetric dimethylation and histone H3K4 trimethylation are tightly correlated in eukaryotic genomes 
Cell reports  2012;1(2):83-90.
Summary
The preferential in vitro interaction of the PHD finger of RAG2, a subunit of the V(D)J recombinase, with histone H3 tails simultaneously trimethylated at lysine 4 and symmetrically dimethylated at arginine 2 (H3R2me2sK4me3) predicted the existence of the previously unknown histone modification, H3R2me2s. Here, we report the in vivo identification of H3R2me2s . Consistent with the binding specificity of the RAG2 PHD finger, high levels of H3R2me2sK4me3 are found at antigen receptor gene segments ready for rearrangement. However, this double modification is much more general; it is conserved throughout eukaryotic evolution. In mouse, H3R2me2s is tightly correlated with H3K4me3 at active promoters throughout the genome. Mutational analysis in S. cerevisiae reveals that deposition of H3R2me2s requires the same Set1 complex that deposits H3K4me3. Our work suggests that H3R2me2sK4me3, not simply H3K4me3 alone, is the mark of active promoters, and that factors that recognize H3K4me3 will have their binding modulated by their preference for H3R2me2s.
doi:10.1016/j.celrep.2011.12.008
PMCID: PMC3377377  PMID: 22720264
12.  Cloud BioLinux: pre-configured and on-demand bioinformatics computing for the genomics community 
BMC Bioinformatics  2012;13:42.
Background
A steep drop in the cost of next-generation sequencing during recent years has made the technology affordable to the majority of researchers, but downstream bioinformatic analysis still poses a resource bottleneck for smaller laboratories and institutes that do not have access to substantial computational resources. Sequencing instruments are typically bundled with only the minimal processing and storage capacity required for data capture during sequencing runs. Given the scale of sequence datasets, scientific value cannot be obtained from acquiring a sequencer unless it is accompanied by an equal investment in informatics infrastructure.
Results
Cloud BioLinux is a publicly accessible Virtual Machine (VM) that enables scientists to quickly provision on-demand infrastructures for high-performance bioinformatics computing using cloud platforms. Users have instant access to a range of pre-configured command line and graphical software applications, including a full-featured desktop interface, documentation and over 135 bioinformatics packages for applications including sequence alignment, clustering, assembly, display, editing, and phylogeny. Each tool's functionality is fully described in the documentation directly accessible from the graphical interface of the VM. Besides the Amazon EC2 cloud, we have started instances of Cloud BioLinux on a private Eucalyptus cloud installed at the J. Craig Venter Institute, and demonstrated access to the bioinformatic tools interface through a remote connection to EC2 instances from a local desktop computer. Documentation for using Cloud BioLinux on EC2 is available from our project website, while a Eucalyptus cloud image and VirtualBox Appliance is also publicly available for download and use by researchers with access to private clouds.
Conclusions
Cloud BioLinux provides a platform for developing bioinformatics infrastructures on the cloud. An automated and configurable process builds Virtual Machines, allowing the development of highly customized versions from a shared code base. This shared community toolkit enables application specific analysis platforms on the cloud by minimizing the effort required to prepare and maintain them.
doi:10.1186/1471-2105-13-42
PMCID: PMC3372431  PMID: 22429538
13.  The Stem Cell Discovery Engine: an integrated repository and analysis system for cancer stem cell comparisons 
Nucleic Acids Research  2011;40(Database issue):D984-D991.
Mounting evidence suggests that malignant tumors are initiated and maintained by a subpopulation of cancerous cells with biological properties similar to those of normal stem cells. However, descriptions of stem-like gene and pathway signatures in cancers are inconsistent across experimental systems. Driven by a need to improve our understanding of molecular processes that are common and unique across cancer stem cells (CSCs), we have developed the Stem Cell Discovery Engine (SCDE)—an online database of curated CSC experiments coupled to the Galaxy analytical framework. The SCDE allows users to consistently describe, share and compare CSC data at the gene and pathway level. Our initial focus has been on carefully curating tissue and cancer stem cell-related experiments from blood, intestine and brain to create a high quality resource containing 53 public studies and 1098 assays. The experimental information is captured and stored in the multi-omics Investigation/Study/Assay (ISA-Tab) format and can be queried in the data repository. A linked Galaxy framework provides a comprehensive, flexible environment populated with novel tools for gene list comparisons against molecular signatures in GeneSigDB and MSigDB, curated experiments in the SCDE and pathways in WikiPathways. The SCDE is available at http://discovery.hsci.harvard.edu.
doi:10.1093/nar/gkr1051
PMCID: PMC3245064  PMID: 22121217
14.  Galaxy CloudMan: delivering cloud compute clusters 
BMC Bioinformatics  2010;11(Suppl 12):S4.
Background
Widespread adoption of high-throughput sequencing has greatly increased the scale and sophistication of computational infrastructure needed to perform genomic research. An alternative to building and maintaining local infrastructure is “cloud computing”, which, in principle, offers on demand access to flexible computational infrastructure. However, cloud computing resources are not yet suitable for immediate “as is” use by experimental biologists.
Results
We present a cloud resource management system that makes it possible for individual researchers to compose and control an arbitrarily sized compute cluster on Amazon’s EC2 cloud infrastructure without any informatics requirements. Within this system, an entire suite of biological tools packaged by the NERC Bio-Linux team (http://nebc.nerc.ac.uk/tools/bio-linux) is available for immediate consumption. The provided solution makes it possible, using only a web browser, to create a completely configured compute cluster ready to perform analysis in less than five minutes. Moreover, we provide an automated method for building custom deployments of cloud resources. This approach promotes reproducibility of results and, if desired, allows individuals and labs to add or customize an otherwise available cloud system to better meet their needs.
Conclusions
The expected knowledge and associated effort with deploying a compute cluster in the Amazon EC2 cloud is not trivial. The solution presented in this paper eliminates these barriers, making it possible for researchers to deploy exactly the amount of computing power they need, combined with a wealth of existing analysis software, to handle the ongoing data deluge.
doi:10.1186/1471-2105-11-S12-S4
PMCID: PMC3040530  PMID: 21210983
15.  Pairwise selection assembly for sequence-independent construction of long-length DNA 
Nucleic Acids Research  2010;38(8):2594-2602.
The engineering of biological components has been facilitated by de novo synthesis of gene-length DNA. Biological engineering at the level of pathways and genomes, however, requires a scalable and cost-effective assembly of DNA molecules that are longer than ∼10 kb, and this remains a challenge. Here we present the development of pairwise selection assembly (PSA), a process that involves hierarchical construction of long-length DNA through the use of a standard set of components and operations. In PSA, activation tags at the termini of assembly sub-fragments are reused throughout the assembly process to activate vector-encoded selectable markers. Marker activation enables stringent selection for a correctly assembled product in vivo, often obviating the need for clonal isolation. Importantly, construction via PSA is sequence-independent, and does not require primary sequence modification (e.g. the addition or removal of restriction sites). The utility of PSA is demonstrated in the construction of a completely synthetic 91-kb chromosome arm from Saccharomyces cerevisiae.
doi:10.1093/nar/gkq123
PMCID: PMC2860126  PMID: 20194119
16.  Biopython: freely available Python tools for computational molecular biology and bioinformatics 
Bioinformatics  2009;25(11):1422-1423.
Summary: The Biopython project is a mature open source international collaboration of volunteer developers, providing Python libraries for a wide range of bioinformatics problems. Biopython includes modules for reading and writing different sequence file formats and multiple sequence alignments, dealing with 3D macro molecular structures, interacting with common tools such as BLAST, ClustalW and EMBOSS, accessing key online databases, as well as providing numerical methods for statistical learning.
Availability: Biopython is freely available, with documentation and source code at www.biopython.org under the Biopython license.
Contact: All queries should be directed to the Biopython mailing lists, see www.biopython.org/wiki/_Mailing_listspeter.cock@scri.ac.uk.
doi:10.1093/bioinformatics/btp163
PMCID: PMC2682512  PMID: 19304878

Results 1-16 (16)