
Results 1-7 (7)

1.  Shining a Light on Dark Sequencing: Characterising Errors in Ion Torrent PGM Data 
PLoS Computational Biology  2013;9(4):e1003031.
The Ion Torrent Personal Genome Machine (PGM) is a new sequencing platform that substantially differs from other sequencing technologies by measuring pH rather than light to detect polymerisation events. Using re-sequencing datasets, we comprehensively characterise the biases and errors introduced by the PGM at both the base and flow level, across a combination of factors, including chip density, sequencing kit, template species and machine. We found two distinct insertion/deletion (indel) error types that accounted for the majority of errors introduced by the PGM. The main error source was inaccurate flow-calls, which introduced indels at a raw rate of 2.84% (1.38% after quality clipping) using the OneTouch 200 bp kit. Inaccurate flow-calls typically resulted in over-called short-homopolymers and under-called long-homopolymers. Flow-call accuracy decreased with consecutive flow cycles, but we also found significant periodic fluctuations in the flow error-rate, corresponding to specific positions within the flow-cycle pattern. Another less common PGM error, high frequency indel (HFI) errors, are indels that occur at very high frequency in the reads relative to a given base position in the reference genome, but in the majority of instances were not replicated consistently across separate runs. HFI errors occur approximately once every thousand bases in the reference, and correspond to 0.06% of bases in reads. Currently, the PGM does not achieve the accuracy of competing light-based technologies. However, flow-call inaccuracy is systematic and the statistical models of flow-values developed here will enable PGM-specific bioinformatics approaches to be developed, which will account for these errors. HFI errors may prove more challenging to address, especially for polymorphism and amplicon applications, but may be overcome by sequencing the same DNA template across multiple chips.
Author Summary
DNA sequencing is used routinely within biology to reveal the genetic information of living organisms. In recent years, technological advances have led to the availability of high-throughput, low-cost DNA sequencing machines (‘sequencers’). In 2011, Life Sciences released a new sequencer, the Ion Torrent Personal Genome Machine (PGM). This is the first sequencer to measure changes in pH rather than emitted light to register sequencing reactions. Consequently, this unique technology is both cost-effective and advertised to have high accuracy, making it attractive for many laboratories. However, every sequencing technology introduces unique errors and biases into the resulting DNA sequences, and understanding PGM-specific characteristics is crucial to determining suitable applications for this new technology. We comprehensively examine the types of errors and biases in PGM-sequenced data across several experimental variables, including chip density, template kit, template DNA and across two machines. Using statistical approaches, we quantify the influence of experimental variables, as well as DNA sequence-specific effects, and find that the PGM has two types of technology-specific errors. We also find that the accuracy of the PGM is poorer than that of light-based technologies, and we make recommendations for this technology as well as provide statistical models for overcoming PGM sequencing errors.
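The link between inaccurate flow-calls and indels can be illustrated with a minimal sketch of flow-space encoding. The repeating flow order "TACG" and both function names are assumptions made for illustration; the abstract does not state the flow order used by the PGM. Each flow records the length of the homopolymer incorporated at that step, so mis-calling a single flow value lengthens or shortens a homopolymer in the base sequence.

```python
def sequence_to_flowgram(sequence, flow_order="TACG", max_flows=400):
    """Return the ideal homopolymer length seen at each nucleotide flow.

    flow_order is a hypothetical repeating cycle, not taken from the source.
    max_flows bounds the loop if the sequence contains an unexpected symbol.
    """
    flowgram = []
    pos = 0
    while pos < len(sequence) and len(flowgram) < max_flows:
        base = flow_order[len(flowgram) % len(flow_order)]
        run = 0
        # Consume every consecutive copy of the base offered by this flow.
        while pos < len(sequence) and sequence[pos] == base:
            run += 1
            pos += 1
        flowgram.append(run)
    return flowgram


def flowgram_to_sequence(flowgram, flow_order="TACG"):
    """Turn integer flow-calls back into a base sequence."""
    return "".join(
        flow_order[i % len(flow_order)] * call for i, call in enumerate(flowgram)
    )


ideal = sequence_to_flowgram("TTACCCG")       # one flow each for T, A, C, G
under_called = list(ideal)
under_called[2] -= 1                          # C homopolymer called as 2, not 3
read = flowgram_to_sequence(under_called)     # the read now carries a deletion
```

In this sketch an under-call of one flow turns "TTACCCG" into "TTACCG", which an aligner would report as a single-base deletion inside the homopolymer.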
PMCID: PMC3623719  PMID: 23592973
2.  Visualisation in imaging mass spectrometry using the minimum noise fraction transform 
BMC Research Notes  2012;5:419.
Imaging Mass Spectrometry (IMS) provides a means to measure the spatial distribution of biochemical features on the surface of a sectioned tissue sample. IMS datasets are typically huge, and visualisation and subsequent analysis can be challenging. Principal component analysis (PCA) is one popular data reduction technique that has been used, and we propose another: the minimum noise fraction (MNF) transform, which is popular in remote sensing.
The MNF transform is able to extract spatially coherent information from IMS data. The MNF transform is implemented through an R-package which is available together with example data from∼glenn/∖#Software.
In our example, the MNF transform was able to find additional images of interest. The extracted information forms a useful basis for subsequent analyses.
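For orientation, a minimal NumPy sketch of an MNF-style transform follows. It is not the R package described above. The function name, the (rows, cols, channels) array layout and the use of differences between horizontally adjacent pixels as the noise estimate are all assumptions made for illustration. The sketch whitens the data with respect to the total covariance and then orders directions by their estimated noise fraction, smallest first.

```python
import numpy as np


def mnf_transform(cube, n_components=3):
    """Hypothetical MNF sketch for an array of shape (rows, cols, channels).

    Assumes at least two channels. Returns the component images and their
    estimated noise fractions in increasing order.
    """
    cube = np.asarray(cube, dtype=float)
    rows, cols, channels = cube.shape
    X = cube.reshape(-1, channels)
    X = X - X.mean(axis=0)

    # Assumed noise model: neighbouring pixels share the signal, so their
    # difference is mostly noise (with twice the noise variance).
    diffs = (cube[:, 1:, :] - cube[:, :-1, :]).reshape(-1, channels)
    noise_cov = np.cov(diffs, rowvar=False) / 2.0
    data_cov = np.cov(X, rowvar=False)

    # Whiten with respect to the total covariance, dropping null directions.
    evals, evecs = np.linalg.eigh(data_cov)
    keep = evals > 1e-10 * evals.max()
    whitener = evecs[:, keep] / np.sqrt(evals[keep])

    # In whitened space the eigenvalues of the noise covariance are the
    # noise fractions; eigh returns them in ascending order.
    noise_fractions, V = np.linalg.eigh(whitener.T @ noise_cov @ whitener)
    projection = whitener @ V

    components = (X @ projection[:, :n_components]).reshape(rows, cols, -1)
    return components, noise_fractions[:n_components]


# Synthetic example: one spatially smooth pattern mixed into five channels.
rng = np.random.default_rng(0)
pattern = np.sin(np.linspace(0, 3, 20))[:, None] * np.ones((1, 20))
loadings = np.array([1.0, 2.0, 0.0, -1.0, 0.5])
cube = pattern[:, :, None] * loadings + 0.1 * rng.standard_normal((20, 20, 5))
components, fractions = mnf_transform(cube, n_components=2)
```

The first component should carry the spatially coherent pattern, while later components are increasingly dominated by noise.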
PMCID: PMC3441902  PMID: 22871049
Dimension reduction; MALDI imaging mass spectrometry; Image processing
3.  Development and validation of a novel molecular biomarker diagnostic test for the early detection of sepsis 
Critical Care  2011;15(3):R149.
Sepsis is a complex immunological response to infection characterized by early hyper-inflammation followed by severe and protracted immunosuppression, suggesting that a multi-marker approach has the greatest clinical utility for early detection, within a clinical environment focused on Systemic Inflammatory Response Syndrome (SIRS) differentiation. Pre-clinical research using an equine sepsis model identified a panel of gene expression biomarkers that define the early aberrant immune activation. Thus, the primary objective was to apply these gene expression biomarkers to distinguish patients with sepsis from those who had undergone major open surgery and had clinical outcomes consistent with systemic inflammation due to physical trauma and wound healing.
This was a multi-centre, prospective clinical trial conducted across four tertiary critical care settings in Australia. Sepsis patients were recruited if they met the 1992 Consensus Statement criteria and had clinical evidence of systemic infection based on microbiology diagnoses (n = 27). Participants in the post-surgical (PS) group were recruited pre-operatively and blood samples collected within 24 hours following surgery (n = 38). Healthy controls (HC) included hospital staff with no known concurrent illnesses (n = 20). Each participant had at least 5 ml of PAXgene blood collected for leucocyte RNA isolation and gene expression analyses. Affymetrix array and multiplex tandem (MT)-PCR studies were conducted to evaluate transcriptional profiles in circulating white blood cells applying a set of 42 molecular markers that had been identified a priori. A LogitBoost algorithm was used to create a machine learning diagnostic rule to predict sepsis outcomes.
Based on preliminary microarray analyses comparing HC and sepsis groups, a panel of 42 gene expression markers was identified that represented key innate and adaptive immune function, cell cycling, WBC differentiation, extracellular remodelling and immune modulation pathways. Comparisons against GEO data confirmed the definitive separation of the sepsis cohort. Quantitative PCR results suggest the capacity for this test to differentiate severe systemic inflammation from HC is 92%. Area under the receiver operating characteristic (ROC) curve (AUC) findings showed that sepsis prediction within a mixed inflammatory population was between 86 and 92%.
This novel molecular biomarker test has a clinically relevant sensitivity and specificity profile, and has the capacity for early detection of sepsis via the monitoring of critical care patients.
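As a reminder of what an AUC figure measures, the sketch below computes it directly from its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name, scores and labels are invented for illustration and are not data from the study.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (1 = positive)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive correctly ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores for three positive and two negative cases.
example_auc = roc_auc([0.9, 0.8, 0.35, 0.6, 0.2], [1, 1, 1, 0, 0])
```

Five of the six positive-negative pairs are ranked correctly here, so the example AUC is 5/6.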
PMCID: PMC3219023  PMID: 21682927
4.  ChIPseqR: analysis of ChIP-seq experiments 
BMC Bioinformatics  2011;12:39.
The use of high-throughput sequencing in combination with chromatin immunoprecipitation (ChIP-seq) has enabled the study of genome-wide protein binding at high resolution. While the amount of data generated from such experiments is steadily increasing, the methods available for their analysis remain limited. Although several algorithms for the analysis of ChIP-seq data have been published they focus almost exclusively on transcription factor studies and are usually not well suited for the analysis of other types of experiments.
Here we present ChIPseqR, an algorithm for the analysis of nucleosome positioning and histone modification ChIP-seq experiments. The performance of this novel method is studied on short read sequencing data of Arabidopsis thaliana mononucleosomes as well as on simulated data.
ChIPseqR is shown to improve sensitivity and spatial resolution over existing methods while maintaining high specificity. Further analysis of predicted nucleosomes reveals characteristic patterns in nucleosome sequences and placement.
PMCID: PMC3045301  PMID: 21281468
5.  k-link EST clustering: evaluating error introduced by chimeric sequences under different degrees of linkage 
Bioinformatics  2009;25(18):2302-2308.
Motivation: The clustering of expressed sequence tags (ESTs) is a crucial step in many sequence analysis studies that require a high level of redundancy. Chimeric sequences, while uncommon, can make achieving the optimal EST clustering a challenge. Single-linkage algorithms are particularly vulnerable to the effects of chimeras. To avoid chimera-facilitated erroneous merges, researchers using single-linkage algorithms are forced to use stringent sequence–similarity thresholds. Such thresholds reduce the sensitivity of the clustering algorithm.
Results: We introduce the concept of k-link clustering for EST data. We evaluate how clustering error rates vary over a range of linkage thresholds. Using k-link, we show that Type II error decreases in response to increasing the number of shared ESTs (i.e. links) required. We observe a base level of Type II error likely caused by the presence of unmasked low-complexity or repetitive sequence. We find that Type I error increases gradually with increased linkage. To minimize the Type I error introduced by increased linkage requirements, we propose an extension to k-link which modifies the required number of links with respect to the size of clusters being compared.
Availability: The implementation of k-link is written in C++ and is available under the terms of the GNU General Public License (GPL).
Supplementary information: Supplementary data are available at Bioinformatics online.
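The vulnerability of single-linkage clustering to chimeras can be sketched as follows. This is not the k-link implementation, whose exact merge rule is not given in the abstract; the function names and the toy similarity links are assumptions for illustration. A single chimeric link is enough to merge two otherwise unrelated clusters under single linkage, and counting the links between the two groups shows how weakly that merge is supported.

```python
def single_linkage(n_items, links):
    """Cluster items 0..n_items-1; any one link merges two clusters."""
    parent = list(range(n_items))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    clusters = {}
    for item in range(n_items):
        clusters.setdefault(find(item), []).append(item)
    return sorted(clusters.values())


def count_links(group_a, group_b, links):
    """Number of similarity links joining two groups of sequences."""
    a, b = set(group_a), set(group_b)
    return sum(1 for x, y in links if (x in a and y in b) or (x in b and y in a))


# Hypothetical ESTs 0-2 come from one gene and 3-5 from another.
gene_links = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
chimeric_link = [(2, 3)]                      # one chimera bridges the genes

clean = single_linkage(6, gene_links)
merged = single_linkage(6, gene_links + chimeric_link)
support = count_links([0, 1, 2], [3, 4, 5], gene_links + chimeric_link)
```

In this toy case the erroneous merge rests on one link, so a rule that required two or more links between groups would not merge them.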
PMCID: PMC2735666  PMID: 19570806
6.  Parameter estimation for robust HMM analysis of ChIP-chip data 
BMC Bioinformatics  2008;9:343.
Tiling arrays are an important tool for the study of transcriptional activity, protein-DNA interactions and chromatin structure on a genome-wide scale at high resolution. Although hidden Markov models have been used successfully to analyse tiling array data, parameter estimation for these models is typically ad hoc. Especially in the context of ChIP-chip experiments, no standard procedures exist to obtain parameter estimates from the data. Common methods for the calculation of maximum likelihood estimates such as the Baum-Welch algorithm or Viterbi training are rarely applied in the context of tiling array analysis.
Here we develop a hidden Markov model for the analysis of chromatin structure ChIP-chip tiling array data, using t emission distributions to increase robustness towards outliers. Maximum likelihood estimates are used for all model parameters. Two different approaches to parameter estimation are investigated and combined into an efficient procedure.
We illustrate an efficient parameter estimation procedure that can be used for HMM based methods in general and leads to a clear increase in performance when compared to the use of ad hoc estimates. The resulting hidden Markov model outperforms established methods like TileMap in the context of histone modification studies.
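The effect of t emission distributions on robustness can be sketched with a small two-state Viterbi decoder. This is an illustration under assumed parameters, not the authors' model or estimation procedure: the state means, scales, degrees of freedom and transition probabilities are all hypothetical, and the function names are chosen here. Because the t density has heavy tails, a single extreme probe value costs relatively little under either state and is less likely to force a state change.

```python
import math


def student_t_logpdf(x, mean, scale, dof):
    """Log density of a location-scale Student t distribution."""
    z = (x - mean) / scale
    return (
        math.lgamma((dof + 1) / 2)
        - math.lgamma(dof / 2)
        - 0.5 * math.log(dof * math.pi)
        - math.log(scale)
        - (dof + 1) / 2 * math.log1p(z * z / dof)
    )


def viterbi(observations, means, scales, dof, log_init, log_trans):
    """Most probable state path for an HMM with t-distributed emissions."""
    states = range(len(means))
    score = [
        log_init[s] + student_t_logpdf(observations[0], means[s], scales[s], dof)
        for s in states
    ]
    backpointers = []
    for x in observations[1:]:
        new_score, pointers = [], []
        for s in states:
            best_prev = max(states, key=lambda p: score[p] + log_trans[p][s])
            pointers.append(best_prev)
            new_score.append(
                score[best_prev]
                + log_trans[best_prev][s]
                + student_t_logpdf(x, means[s], scales[s], dof)
            )
        score = new_score
        backpointers.append(pointers)

    # Trace the best path backwards from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical parameters: state 0 = background, state 1 = enriched.
means, scales, dof = [0.0, 2.0], [0.5, 0.5], 4
log_init = [math.log(0.5), math.log(0.5)]
log_trans = [[math.log(0.9), math.log(0.1)], [math.log(0.1), math.log(0.9)]]

step_path = viterbi([0, 0, 0, 2, 2, 2], means, scales, dof, log_init, log_trans)
outlier_path = viterbi([0, 0, 9, 0, 0], means, scales, dof, log_init, log_trans)
```

With these assumed values the decoder follows the sustained shift in the first series but keeps the isolated outlier in the second series in the background state.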
PMCID: PMC2536674  PMID: 18706106
7.  From transcriptome to biological function: environmental stress in an ectothermic vertebrate, the coral reef fish Pomacentrus moluccensis 
BMC Genomics  2007;8:358.
Our understanding of the importance of transcriptional regulation for biological function is continuously improving. We still know, however, comparatively little about how environmentally induced stress affects gene expression in vertebrates, and the consistency of transcriptional stress responses to different types of environmental stress. In this study, we used a multi-stressor approach to identify components of a common stress response as well as components unique to different types of environmental stress. We exposed individuals of the coral reef fish Pomacentrus moluccensis to hypoxic, hyposmotic, cold and heat shock and measured the responses of approximately 16,000 genes in liver. We also compared winter and summer responses to heat shock to examine the capacity for such responses to vary with acclimation to different ambient temperatures.
We identified a series of gene functions that were involved in all stress responses examined here, suggesting some common effects of stress on biological function. These common responses were achieved by the regulation of largely independent sets of genes; the responses of individual genes varied greatly across different stress types. In response to heat exposure over five days, a total of 324 gene loci were differentially expressed. Many heat-responsive genes had functions associated with protein turnover, metabolism, and the response to oxidative stress. We were also able to identify groups of co-regulated genes, the genes within which shared similar functions.
This is the first environmental genomic study to measure gene regulation in response to different environmental stressors in a natural population of a warm-adapted ectothermic vertebrate. We have shown that different types of environmental stress induce expression changes in genes with similar gene functions, but that the responses of individual genes vary between stress types. The functions of heat-responsive genes suggest that prolonged heat exposure leads to oxidative stress and protein damage, a challenge of the immune system, and the re-allocation of energy sources. This study hence offers insight into the effects of environmental stress on biological function and sheds light on the expected sensitivity of coral reef fishes to elevated temperatures in the future.
PMCID: PMC2222645  PMID: 17916261
