Search tips
Search criteria

Results 1-11 (11)

Clipboard (0)

Select a Filter Below

Year of Publication
1.  FIGG: Simulating populations of whole genome sequences for heterogeneous data analyses 
BMC Bioinformatics  2014;15:149.
High-throughput sequencing has become one of the primary tools for investigation of the molecular basis of disease. The increasing use of sequencing in investigations that aim to understand both individuals and populations is challenging our ability to develop analysis tools that scale with the data. This issue is of particular concern in studies that exhibit a wide degree of heterogeneity or deviation from the standard reference genome. The advent of population scale sequencing studies requires analysis tools that are developed and tested against matching quantities of heterogeneous data.
We developed a large-scale whole genome simulation tool, FIGG, which generates large numbers of whole genomes with known sequence characteristics based on direct sampling of experimentally known or theorized variations. For normal variations we used publicly available data to determine the frequency of different mutation classes across the genome. FIGG then uses this information as a background to generate new sequences from a parent sequence with matching frequencies, but different actual mutations. The background can be normal variations, known disease variations, or a theoretical frequency distribution of variations.
In order to enable the creation of large numbers of genomes, FIGG generates simulated sequences from known genomic variation and iteratively mutates each genome separately. The result is multiple whole genome sequences with unique variations that can primarily be used to provide different reference genomes, model heterogeneous populations, and can offer a standard test environment for new analysis algorithms or bioinformatics tools.
PMCID: PMC4039316  PMID: 24885193
Genome sequence; Simulation; Variation frequency; Population
2.  Interfaces to PeptideAtlas: a case study of standard data access systems 
Briefings in Bioinformatics  2011;13(5):615-626.
Access to public data sets is important to the scientific community as a resource to develop new experiments or validate new data. Projects such as the PeptideAtlas, Ensembl and The Cancer Genome Atlas (TCGA) offer both access to public data and a repository to share their own data. Access to these data sets is often provided through a web page form and a web service API. Access technologies based on web protocols (e.g. http) have been in use for over a decade and are widely adopted across the industry for a variety of functions (e.g. search, commercial transactions, and social media). Each architecture adapts these technologies to provide users with tools to access and share data. Both commonly used web service technologies (e.g. REST and SOAP), and custom-built solutions over HTTP are utilized in providing access to research data. Providing multiple access points ensures that the community can access the data in the simplest and most effective manner for their particular needs. This article examines three common access mechanisms for web accessible data: BioMart, caBIG, and Google Data Sources. These are illustrated by implementing each over the PeptideAtlas repository and reviewed for their suitability based on specific usages common to research. BioMart, Google Data Sources, and caBIG are each suitable for certain uses. The tradeoffs made in the development of the technology are dependent on the uses each was designed for (e.g. security versus speed). This means that an understanding of specific requirements and tradeoffs is necessary before selecting the access technology.
PMCID: PMC3431717  PMID: 22941959
BioMart; Google Data Sources; caBIG; data access; proteomics
3.  Integration of biological networks and gene expression data using Cytoscape 
Nature protocols  2007;2(10):2366-2382.
Cytoscape is a free software package for visualizing, modeling and analyzing molecular and genetic interaction networks. This protocol explains how to use Cytoscape to analyze the results of mRNA expression profiling, and other functional genomics and proteomics experiments, in the context of an interaction network obtained for genes of interest. Five major steps are described: (i) obtaining a gene or protein network, (ii) displaying the network using layout algorithms, (iii) integrating with gene expression and other functional attributes, (iv) identifying putative complexes and functional modules and (v) identifying enriched Gene Ontology annotations in the network. These steps provide a broad sample of the types of analyses performed by Cytoscape.
PMCID: PMC3685583  PMID: 17947979
4.  Hydra: a scalable proteomic search engine which utilizes the Hadoop distributed computing framework 
BMC Bioinformatics  2012;13:324.
For shotgun mass spectrometry based proteomics the most computationally expensive step is in matching the spectra against an increasingly large database of sequences and their post-translational modifications with known masses. Each mass spectrometer can generate data at an astonishingly high rate, and the scope of what is searched for is continually increasing. Therefore solutions for improving our ability to perform these searches are needed.
We present a sequence database search engine that is specifically designed to run efficiently on the Hadoop MapReduce distributed computing framework. The search engine implements the K-score algorithm, generating comparable output for the same input files as the original implementation. The scalability of the system is shown, and the architecture required for the development of such distributed processing is discussed.
The software is scalable in its ability to handle a large peptide database, numerous modifications and large numbers of spectra. Performance scales with the number of processors in the cluster, allowing throughput to expand with the available resources.
PMCID: PMC3538679  PMID: 23216909
5.  Methods for visual mining of genomic and proteomic data atlases 
BMC Bioinformatics  2012;13:58.
As the volume, complexity and diversity of the information that scientists work with on a daily basis continues to rise, so too does the requirement for new analytic software. The analytic software must solve the dichotomy that exists between the need to allow for a high level of scientific reasoning, and the requirement to have an intuitive and easy to use tool which does not require specialist, and often arduous, training to use. Information visualization provides a solution to this problem, as it allows for direct manipulation and interaction with diverse and complex data. The challenge addressing bioinformatics researches is how to apply this knowledge to data sets that are continually growing in a field that is rapidly changing.
This paper discusses an approach to the development of visual mining tools capable of supporting the mining of massive data collections used in systems biology research, and also discusses lessons that have been learned providing tools for both local researchers and the wider community. Example tools were developed which are designed to enable the exploration and analyses of both proteomics and genomics based atlases. These atlases represent large repositories of raw and processed experiment data generated to support the identification of biomarkers through mass spectrometry (the PeptideAtlas) and the genomic characterization of cancer (The Cancer Genome Atlas). Specifically the tools are designed to allow for: the visual mining of thousands of mass spectrometry experiments, to assist in designing informed targeted protein assays; and the interactive analysis of hundreds of genomes, to explore the variations across different cancer genomes and cancer types.
The mining of massive repositories of biological data requires the development of new tools and techniques. Visual exploration of the large-scale atlas data sets allows researchers to mine data to find new meaning and make sense at scales from single samples to entire populations. Providing linked task specific views that allow a user to start from points of interest (from diseases to single genes) enables targeted exploration of thousands of spectra and genomes. As the composition of the atlases changes, and our understanding of the biology increase, new tasks will continually arise. It is therefore important to provide the means to make the data available in a suitable manner in as short a time as possible. We have done this through the use of common visualization workflows, into which we rapidly deploy visual tools. These visualizations follow common metaphors where possible to assist users in understanding the displayed data. Rapid development of tools and task specific views allows researchers to mine large-scale data almost as quickly as it is produced. Ultimately these visual tools enable new inferences, new analyses and further refinement of the large scale data being provided in atlases such as PeptideAtlas and The Cancer Genome Atlas.
PMCID: PMC3352268  PMID: 22524279
6.  SAMQA: error classification and validation of high-throughput sequenced read data 
BMC Genomics  2011;12:419.
The advances in high-throughput sequencing technologies and growth in data sizes has highlighted the need for scalable tools to perform quality assurance testing. These tests are necessary to ensure that data is of a minimum necessary standard for use in downstream analysis. In this paper we present the SAMQA tool to rapidly and robustly identify errors in population-scale sequence data.
SAMQA has been used on samples from three separate sets of cancer genome data from The Cancer Genome Atlas (TCGA) project. Using technical standards provided by the SAM specification and biological standards defined by researchers, we have classified errors in these sequence data sets relative to individual reads within a sample. Due to an observed linearithmic speedup through the use of a high-performance computing (HPC) framework for the majority of tasks, poor quality data was identified prior to secondary analysis in significantly less time on the HPC framework than the same data run using alternative parallelization strategies on a single server.
The SAMQA toolset validates a minimum set of data quality standards across whole-genome and exome sequences. It is tuned to run on a high-performance computational framework, enabling QA across hundreds gigabytes of samples regardless of coverage or sample type.
PMCID: PMC3170309  PMID: 21851633
8.  SEQADAPT: an adaptable system for the tracking, storage and analysis of high throughput sequencing experiments 
BMC Bioinformatics  2010;11:377.
High throughput sequencing has become an increasingly important tool for biological research. However, the existing software systems for managing and processing these data have not provided the flexible infrastructure that research requires.
Existing software solutions provide static and well-established algorithms in a restrictive package. However as high throughput sequencing is a rapidly evolving field, such static approaches lack the ability to readily adopt the latest advances and techniques which are often required by researchers. We have used a loosely coupled, service-oriented infrastructure to develop SeqAdapt. This system streamlines data management and allows for rapid integration of novel algorithms. Our approach also allows computational biologists to focus on developing and applying new methods instead of writing boilerplate infrastructure code.
The system is based around the Addama service architecture and is available at our website as a demonstration web application, an installable single download and as a collection of individual customizable services.
PMCID: PMC2916924  PMID: 20630057
9.  Adaptable data management for systems biology investigations 
BMC Bioinformatics  2009;10:79.
Within research each experiment is different, the focus changes and the data is generated from a continually evolving barrage of technologies. There is a continual introduction of new techniques whose usage ranges from in-house protocols through to high-throughput instrumentation. To support these requirements data management systems are needed that can be rapidly built and readily adapted for new usage.
The adaptable data management system discussed is designed to support the seamless mining and analysis of biological experiment data that is commonly used in systems biology (e.g. ChIP-chip, gene expression, proteomics, imaging, flow cytometry). We use different content graphs to represent different views upon the data. These views are designed for different roles: equipment specific views are used to gather instrumentation information; data processing oriented views are provided to enable the rapid development of analysis applications; and research project specific views are used to organize information for individual research experiments. This management system allows for both the rapid introduction of new types of information and the evolution of the knowledge it represents.
Data management is an important aspect of any research enterprise. It is the foundation on which most applications are built, and must be easily extended to serve new functionality for new scientific areas. We have found that adopting a three-tier architecture for data management, built around distributed standardized content repositories, allows us to rapidly develop new applications to support a diverse user community.
PMCID: PMC2670281  PMID: 19265554
10.  Systems biology driven software design for the research enterprise 
BMC Bioinformatics  2008;9:295.
In systems biology, and many other areas of research, there is a need for the interoperability of tools and data sources that were not originally designed to be integrated. Due to the interdisciplinary nature of systems biology, and its association with high throughput experimental platforms, there is an additional need to continually integrate new technologies. As scientists work in isolated groups, integration with other groups is rarely a consideration when building the required software tools.
We illustrate an approach, through the discussion of a purpose built software architecture, which allows disparate groups to reuse tools and access data sources in a common manner. The architecture allows for: the rapid development of distributed applications; interoperability, so it can be used by a wide variety of developers and computational biologists; development using standard tools, so that it is easy to maintain and does not require a large development effort; extensibility, so that new technologies and data types can be incorporated; and non intrusive development, insofar as researchers need not to adhere to a pre-existing object model.
By using a relatively simple integration strategy, based upon a common identity system and dynamically discovered interoperable services, a light-weight software architecture can become the focal point through which scientists can both get access to and analyse the plethora of experimentally derived data.
PMCID: PMC2478690  PMID: 18578887
11.  T1DBase: integration and presentation of complex data for type 1 diabetes research 
Nucleic Acids Research  2006;35(Database issue):D742-D746.
T1DBase () [Smink et al. (2005) Nucleic Acids Res., 33, D544–D549; Burren et al. (2004) Hum. Genomics, 1, 98–109] is a public website and database that supports the type 1 diabetes (T1D) research community. T1DBase provides a consolidated T1D-oriented view of the complex data world that now confronts medical researchers and enables scientists to navigate from information they know to information that is new to them. Overview pages for genes and markers summarize information for these elements. The Gene Dossier summarizes information for a list of genes. GBrowse [Stein et al. (2002) Genome Res., 10, 1599–1610] displays genes and other features in their genomic context, and Cytoscape [Shannon et al. (2003) Genome Res., 13, 2498–2504] shows genes in the context of interacting proteins and genes. The Beta Cell Gene Atlas shows gene expression in β cells, islets, and related cell types and lines, and the Tissue Expression Viewer shows expression across other tissues. The Microarray Viewer shows expression from more than 20 array experiments. The Beta Cell Gene Expression Bank contains manually curated gene and pathway annotations for genes expressed in β cells. T1DMart is a query tool for markers and genotypes. PosterPages are ‘home pages’ about specific topics or datasets. The key challenge, now and in the future, is to provide powerful informatics capabilities to T1D scientists in a form they can use to enhance their research.
PMCID: PMC1781218  PMID: 17169983

Results 1-11 (11)