Results 1-14 (14)

1.  Cpipe: a shared variant detection pipeline designed for diagnostic settings 
Genome Medicine  2015;7(1):68.
The benefits of implementing high throughput sequencing in the clinic are quickly becoming apparent. However, few freely available bioinformatics pipelines have been built from the ground up with clinical genomics in mind. Here we present Cpipe, a pipeline designed specifically for clinical genetic disease diagnostics. Cpipe was developed by the Melbourne Genomics Health Alliance, an Australian initiative to promote common approaches to genomics across healthcare institutions. As such, Cpipe has been designed to provide fast, effective and reproducible analysis, while also being highly flexible and customisable to meet the individual needs of diverse clinical settings. Cpipe is being shared with the clinical sequencing community as an open source project and is available at
Electronic supplementary material
The online version of this article (doi:10.1186/s13073-015-0191-x) contains supplementary material, which is available to authorized users.
PMCID: PMC4515933  PMID: 26217397
2.  High-resolution characterization of sequence signatures due to non-random cleavage of cell-free DNA 
BMC Medical Genomics  2015;8:29.
High-throughput sequencing of cell-free DNA fragments found in human plasma has been used to non-invasively detect fetal aneuploidy, monitor organ transplants and investigate tumor DNA. However, many biological properties of this extracellular genetic material remain unknown. Research that further characterizes circulating DNA could substantially increase its diagnostic value by allowing the application of more sophisticated bioinformatics tools that lead to an improved signal to noise ratio in the sequencing data.
In this study, we investigate various features of cell-free DNA in plasma using deep-sequencing data from two pregnant women (>70X, >50X) and compare them with matched cellular DNA. We utilize a descriptive approach to examine how the biological cleavage of cell-free DNA affects different sequence signatures such as fragment lengths, sequence motifs at fragment ends and the distribution of cleavage sites along the genome.
We show that the size distributions of these cell-free DNA molecules are dependent on their autosomal and mitochondrial origin as well as the genomic location within chromosomes. DNA mapping to particular microsatellites and alpha repeat elements displays unique size signatures. By correlating the mapping locations of these fragments with ENCODE annotation of chromatin organization, we show that cell-free fragments occur in clusters along the genome, localize to nucleosomal arrays and are preferentially cleaved at linker regions. Our work further demonstrates that cell-free autosomal DNA cleavage is sequence dependent. The region spanning up to 10 positions on either side of the DNA cleavage site shows a consistent pattern of preference for specific nucleotides. This sequence motif is present in cleavage sites localized to nucleosomal cores and linker regions but is absent in nucleosome-free mitochondrial DNA.
These background signals in cell-free DNA sequencing data stem from the non-random biological cleavage of these fragments. This sequence structure can be harnessed to improve bioinformatics algorithms, in particular for CNV and structural variant detection. Descriptive measures for cell-free DNA features developed here could also be used in biomarker analysis to monitor the changes that occur during different pathological conditions.
Electronic supplementary material
The online version of this article (doi:10.1186/s12920-015-0107-z) contains supplementary material, which is available to authorized users.
PMCID: PMC4469119  PMID: 26081108
Cell-free DNA; extracellular DNA; biomarkers; fragment lengths; fragmentation motifs; nucleosomes; higher-order chromatin packaging; apoptosis; necrosis
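The position-specific nucleotide preference around cleavage sites described in the abstract above can be illustrated with a minimal sketch. It assumes each cleavage site has already been extracted as a fixed-width reference-sequence window centred on the site. The function name `position_base_frequencies` and the toy windows are hypothetical and are not taken from the study.

```python
from collections import Counter


def position_base_frequencies(windows):
    """Return per-position A/C/G/T frequencies for equal-length sequence windows.

    Each window is assumed (hypothetically) to be the reference sequence
    surrounding one cleavage site, all windows aligned on that site.
    """
    if not windows:
        return []
    width = len(windows[0])
    if any(len(w) != width for w in windows):
        raise ValueError("all windows must have the same length")

    profile = []
    for pos in range(width):
        counts = Counter(w[pos].upper() for w in windows)
        # Ignore ambiguous bases such as N when normalising.
        total = sum(counts[base] for base in "ACGT")
        profile.append(
            {base: (counts[base] / total if total else 0.0) for base in "ACGT"}
        )
    return profile


# Toy windows for illustration only (not data from the study).
windows = ["ACGT", "ACGA", "CCGT", "ACTT"]
profile = position_base_frequencies(windows)
for offset, freqs in enumerate(profile):
    print(offset, freqs)
```

A position whose frequencies depart clearly from the genome-wide base composition would be a candidate for a cleavage preference. A real analysis would also need a background model and far more sites.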
3.  Harnessing Gene Expression Networks to Prioritize Candidate Epileptic Encephalopathy Genes 
PLoS ONE  2014;9(7):e102079.
We apply a novel gene expression network analysis to a cohort of 182 recently reported candidate Epileptic Encephalopathy genes to identify those most likely to be true Epileptic Encephalopathy genes. These candidate genes were identified as having single variants of likely pathogenic significance discovered in a large-scale massively parallel sequencing study. Candidate Epileptic Encephalopathy genes were prioritized according to their co-expression with 29 known Epileptic Encephalopathy genes. We utilized developing brain and adult brain gene expression data from the Allen Human Brain Atlas (AHBA) and compared this to data from Celsius: a large, heterogeneous gene expression data warehouse. We show replicable prioritization results using these three independent gene expression resources, two of which are brain-specific, with small sample size, and the third derived from a heterogeneous collection of tissues with large sample size. Of the nineteen genes that we predicted with the highest likelihood to be true Epileptic Encephalopathy genes, two (GNAO1 and GRIN2B) have recently been independently reported and confirmed. We compare our results to those produced by an established in silico prioritization approach called Endeavour, and finally present gene expression networks for the known and candidate Epileptic Encephalopathy genes. This highlights sub-networks of gene expression, particularly in the network derived from the adult AHBA gene expression dataset. These networks give clues to the likely biological interactions between Epileptic Encephalopathy genes, potentially highlighting underlying mechanisms and avenues for therapeutic targets.
PMCID: PMC4090166  PMID: 25014031
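The abstract above does not spell out the network analysis itself, so the following is only a simplified stand-in for co-expression-based prioritization: each candidate is scored by its mean Pearson correlation with the known genes across samples, and candidates are ranked by that score. The function names and the toy expression vectors are assumptions made for illustration.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector carries no co-expression signal
    return sxy / (sxx * syy) ** 0.5


def prioritize(candidates, known):
    """Rank candidate genes by mean correlation with the known genes.

    candidates, known: dicts mapping gene label -> expression vector,
    with all vectors measured over the same samples (assumed).
    """
    scores = {
        gene: mean(pearson(expr, ref) for ref in known.values())
        for gene, expr in candidates.items()
    }
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores


# Toy expression values with hypothetical gene labels.
known = {"known_1": [2, 4, 6, 8], "known_2": [1, 2, 3, 5]}
candidates = {"cand_A": [1, 2, 3, 4], "cand_B": [4, 3, 2, 1]}
ranking, scores = prioritize(candidates, known)
print(ranking)
```

Running the same ranking on several independent expression resources and comparing the resulting orders is one way to check whether a prioritization is replicable, which is the property the abstract reports.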
4.  Investigating and Correcting Plasma DNA Sequencing Coverage Bias to Enhance Aneuploidy Discovery 
PLoS ONE  2014;9(1):e86993.
Pregnant women carry a mixture of cell-free DNA fragments from self and fetus (non-self) in their circulation. In recent years multiple independent studies have demonstrated the ability to detect fetal trisomies such as trisomy 21, the cause of Down syndrome, by Next-Generation Sequencing of maternal plasma. The current clinical tests based on this approach show very high sensitivity and specificity, although as yet they have not become the standard diagnostic test. Here we describe improvements to the analysis of the sequencing data by reducing GC bias and better handling of the genomic repeats. We show substantial improvements in the sensitivity of the standard trisomy 21 statistical tests, which we measure by artificially reducing read coverage. We also explore the bias stemming from the natural cleavage of plasma DNA by examining DNA motifs and position specific base distributions. We propose a model to correct this fragmentation bias and observe that incorporating this bias does not lead to any further improvements in the detection of fetal trisomy. The improved bias corrections that we demonstrate in this work can be readily adopted into existing fetal trisomy detection protocols and should also lead to improvements in sub-chromosomal copy number variation detection.
PMCID: PMC3906086  PMID: 24489824
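As a rough illustration of the two ideas in the abstract above, the sketch below first applies a generic GC correction to binned read counts and then computes a z-score for a chromosome's read fraction against euploid reference samples. The correction rescales each bin so that every GC stratum shares the same median count. The binning scheme, the function names and all numbers are assumptions, not the correction model proposed in the paper.

```python
from statistics import mean, median, stdev


def gc_correct(counts, gc_fractions, n_strata=10):
    """Rescale bin counts so each GC stratum has the overall median count."""
    overall = median(counts)

    def stratum(gc):
        # Map a GC fraction in [0, 1] to one of n_strata equal-width strata.
        return min(int(gc * n_strata), n_strata - 1)

    by_stratum = {}
    for count, gc in zip(counts, gc_fractions):
        by_stratum.setdefault(stratum(gc), []).append(count)
    stratum_median = {s: median(v) for s, v in by_stratum.items()}

    corrected = []
    for count, gc in zip(counts, gc_fractions):
        m = stratum_median[stratum(gc)]
        corrected.append(count * overall / m if m > 0 else 0.0)
    return corrected


def chromosome_z_score(test_fraction, reference_fractions):
    """Z-score of a sample's chromosome read fraction against reference samples."""
    return (test_fraction - mean(reference_fractions)) / stdev(reference_fractions)


# Toy bins: GC-rich bins are over-represented before correction.
counts = [100, 100, 200, 200]
gc = [0.35, 0.38, 0.55, 0.58]
corrected = gc_correct(counts, gc)

# Toy chromosome 21 read fractions (hypothetical values).
reference = [0.0130, 0.0131, 0.0129, 0.0130]
z = chromosome_z_score(0.0140, reference)
print(corrected, z)
```

In this kind of test a sample is typically flagged when its z-score exceeds a chosen threshold. Reducing coverage bias narrows the spread of the reference fractions, which is one way sensitivity can improve.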
5.  Variable hearing impairment in a DFNB2 family with a novel MYO7A missense mutation 
Clinical genetics  2010;77(6):563-571.
Myosin VIIA mutations have been associated with non-syndromic hearing loss (DFNB2; DFNA11) and Usher syndrome type 1B (USH1B). We report clinical and genetic analyses of a consanguineous Iranian family segregating autosomal recessive non-syndromic hearing loss (ARNSHL). The hearing impairment was mapped to the DFNB2 locus using Affymetrix 50K GeneChips; direct sequencing of the MYO7A gene was completed. The Iranian family (L-1419) was shown to segregate a novel homozygous missense mutation (c.1184G>A) that results in a p.R395H amino acid substitution in the motor domain of the myosin VIIA protein. Since one affected family member had significantly less severe hearing loss, we used a candidate approach to search for a genetic modifier. This novel MYO7A mutation is the first reported to cause DFNB2 in the Iranian population and this DFNB2 family is the first to be associated with a potential modifier. The absence of vestibular and retinal defects, and less severe low frequency hearing loss, is consistent with the phenotype of a recently reported Pakistani DFNB2 family. Thus, we conclude this family has non-syndromic hearing loss (DFNB2) rather than Usher syndrome type 1B (USH1B), providing further evidence that these two diseases represent discrete disorders.
PMCID: PMC2891191  PMID: 20132242
DFNB2; genetic modifier; MYO7A gene; missense mutation; motor domain; myosin VIIA protein; USH1B
6.  The cost of reducing starting RNA quantity for Illumina BeadArrays: A bead-level dilution experiment 
BMC Genomics  2010;11:540.
The demands of microarray expression technologies for quantities of RNA place a limit on the questions they can address. As a consequence, the RNA requirements have reduced over time as technologies have improved. In this paper we investigate the costs of reducing the starting quantity of RNA for the Illumina BeadArray platform. This we do via a dilution data set generated from two reference RNA sources that have become the standard for investigations into microarray and sequencing technologies.
We find that the starting quantity of RNA has an effect on observed intensities despite the fact that the quantity of cRNA being hybridized remains constant. We see a loss of sensitivity when using lower quantities of RNA, but no great rise in the false positive rate. Even with 10 ng of starting RNA, the positive results are reliable although many differentially expressed genes are missed. We see that there is some scope for combining data from samples that have contributed differing quantities of RNA, but note also that sample sizes should increase to compensate for the loss of signal-to-noise when using low quantities of starting RNA.
The BeadArray platform maintains a low false discovery rate even when small amounts of starting RNA are used. In contrast, the sensitivity of the platform drops off noticeably over the same range. Thus, those conducting experiments should not opt for low quantities of starting RNA without consideration of the costs of doing so. The implications for experimental design, and the integration of data from different starting quantities, are complex.
PMCID: PMC3091689  PMID: 20925945
7.  A Bayesian deconvolution strategy for immunoprecipitation-based DNA methylome analysis 
Nature biotechnology  2008;26(7):779-785.
DNA methylation is an indispensable epigenetic modification of mammalian genomes. Consequently there is great interest in strategies for genome-wide/whole-genome DNA methylation analysis, and immunoprecipitation-based methods have proven to be a powerful option. Such methods are rapidly shifting the bottleneck from data generation to data analysis, necessitating the development of better analytical tools. Until now, a major analytical difficulty associated with immunoprecipitation-based DNA methylation profiling has been the inability to estimate absolute methylation levels. Here we report the development of a novel cross-platform algorithm – Bayesian Tool for Methylation Analysis (Batman) – for analyzing Methylated DNA Immunoprecipitation (MeDIP) profiles generated using arrays (MeDIP-chip) or next-generation sequencing (MeDIP-seq). The latter is an approach we have developed to elucidate the first high-resolution whole-genome DNA methylation profile (DNA methylome) of any mammalian genome. MeDIP-seq/MeDIP-chip combined with Batman represent robust, quantitative, and cost-effective functional genomic strategies for elucidating the function of DNA methylation.
PMCID: PMC2644410  PMID: 18612301
8.  Tissue-specific splicing factor gene expression signatures 
Nucleic Acids Research  2008;36(15):4823-4832.
The alternative splicing code that controls and coordinates the transcriptome in complex multicellular organisms remains poorly understood. It has long been argued that regulation of alternative splicing relies on combinatorial interactions between multiple proteins, and that tissue-specific splicing decisions most likely result from differences in the concentration and/or activity of these proteins. However, large-scale data to systematically address this issue have just recently started to become available. Here we show that splicing factor gene expression signatures can be identified that reflect cell type and tissue-specific patterns of alternative splicing. We used a computational approach to analyze microarray-based gene expression profiles of splicing factors from mouse, chimpanzee and human tissues. Our results show that brain and testis, the two tissues with highest levels of alternative splicing events, have the largest number of splicing factor genes that are most highly differentially expressed. We further identified SR protein kinases and small nuclear ribonucleoprotein particle (snRNP) proteins among the splicing factor genes that are most highly differentially expressed in a particular tissue. These results indicate the power of generating signature-based predictions as an initial computational approach into a global view of tissue-specific alternative splicing regulation.
PMCID: PMC2528195  PMID: 18653532
9.  Breaking the waves: improved detection of copy number variation from microarray-based comparative genomic hybridization 
Genome Biology  2007;8(10):R228.
Datasets used for detecting copy number variation (CNV) are shown to be affected by a technical artifact. A novel CNV calling algorithm is presented which removes this artifact and identifies regions of CNV better than existing methods.
Large-scale high throughput studies using microarray technology have established that copy number variation (CNV) throughout the genome is more frequent than previously thought. Such variation is known to play an important role in the presence and development of phenotypes such as HIV-1 infection and Alzheimer's disease. However, methods for analyzing the complex data produced and identifying regions of CNV are still being refined.
We describe the presence of a genome-wide technical artifact, spatial autocorrelation or 'wave', which occurs in a large dataset used to determine the location of CNV across the genome. By removing this artifact we are able to obtain both a more biologically meaningful clustering of the data and an increase in the number of CNVs identified by current calling methods without a major increase in the number of false positives detected. Moreover, removing this artifact is critical for the development of a novel model-based CNV calling algorithm - CNVmix - that uses cross-sample information to identify regions of the genome where CNVs occur. For regions of CNV that are identified by both CNVmix and current methods, we demonstrate that CNVmix is better able to categorize samples into groups that represent copy number gains or losses.
Removing artifactual 'waves' (which appear to be a general feature of array comparative genomic hybridization (aCGH) datasets) and using cross-sample information when identifying CNVs enables more biological information to be extracted from aCGH experiments designed to investigate copy number variation in normal individuals.
PMCID: PMC2246302  PMID: 17961237
10.  Cell Cycle Genes Are the Evolutionarily Conserved Targets of the E2F4 Transcription Factor 
PLoS ONE  2007;2(10):e1061.
Maintaining quiescent cells in G0 phase is achieved in part through the multiprotein subunit complex known as DREAM, and in human cell lines the transcription factor E2F4 directs this complex to its cell cycle targets. We found that E2F4 binds a highly overlapping set of human genes among three diverse primary tissues and an asynchronous cell line, which suggests that tissue-specific binding partners and chromatin structure have minimal influence on E2F4 targeting. To investigate the conservation of these transcription factor binding events, we identified the mouse genes bound by E2f4 in seven primary mouse tissues and a cell line. E2f4 bound a set of mouse genes that was common among mouse tissues, but largely distinct from the genes bound in human. The evolutionarily conserved set of E2F4 bound genes is highly enriched for functionally relevant regulatory interactions important for maintaining cellular quiescence. In contrast, we found minimal mRNA expression perturbations in this core set of E2f4 bound genes in the liver, kidney, and testes of E2f4 null mice. Thus, the regulatory mechanisms maintaining quiescence are robust even to complete loss of conserved transcription factor binding events.
PMCID: PMC2020443  PMID: 17957245
11.  MicroRNA expression profiling of human breast cancer identifies new markers of tumor subtype 
Genome Biology  2007;8(10):R214.
Integrated analysis of miRNA expression and genomic changes in human breast tumors allows the classification of tumor subtypes.
MicroRNAs (miRNAs), a class of short non-coding RNAs found in many plants and animals, often act post-transcriptionally to inhibit gene expression.
Here we report the analysis of miRNA expression in 93 primary human breast tumors, using a bead-based flow cytometric miRNA expression profiling method. Of 309 human miRNAs assayed, we identify 133 miRNAs expressed in human breast and breast tumors. We used mRNA expression profiling to classify the breast tumors as luminal A, luminal B, basal-like, HER2+ and normal-like. A number of miRNAs are differentially expressed between these molecular tumor subtypes and individual miRNAs are associated with clinicopathological factors. Furthermore, we find that miRNAs could classify basal versus luminal tumor subtypes in an independent data set. In some cases, changes in miRNA expression correlate with genomic loss or gain; in others, changes in miRNA expression are likely due to changes in primary transcription and/or miRNA biogenesis. Finally, the expression of DICER1 and AGO2 is correlated with tumor subtype and may explain some of the changes in miRNA expression observed.
This study represents the first integrated analysis of miRNA expression, mRNA expression and genomic changes in human breast cancer and may serve as a basis for functional studies of the role of miRNAs in the etiology of breast cancer. Furthermore, we demonstrate that bead-based flow cytometric miRNA expression profiling might be a suitable platform to classify breast cancer into prognostic molecular subtypes.
PMCID: PMC2246288  PMID: 17922911
12.  High-resolution aCGH and expression profiling identifies a novel genomic subtype of ER negative breast cancer 
Genome Biology  2007;8(10):R215.
High resolution array-CGH and expression profiling identifies a novel genomic subtype of ER negative breast cancer, and provides a genome-wide list of common copy number alterations associated with aberrant expression and poor prognosis.
The characterization of copy number alteration patterns in breast cancer requires high-resolution genome-wide profiling of a large panel of tumor specimens. To date, most genome-wide array comparative genomic hybridization studies have used tumor panels of relatively large tumor size and high Nottingham Prognostic Index (NPI) that are not as representative of breast cancer demographics.
We performed an oligo-array-based high-resolution analysis of copy number alterations in 171 primary breast tumors of relatively small size and low NPI, which was therefore more representative of breast cancer demographics. Hierarchical clustering over the common regions of alteration identified a novel subtype of high-grade estrogen receptor (ER)-negative breast cancer, characterized by a low genomic instability index. We were able to validate the existence of this genomic subtype in one external breast cancer cohort. Using matched array expression data we also identified the genomic regions showing the strongest coordinate expression changes ('hotspots'). We show that several of these hotspots are located in the phosphatome, kinome and chromatinome, and harbor members of the 122-breast cancer CAN-list. Furthermore, we identify frequently amplified hotspots on 8q22.3 (EDD1, WDSOF1), 8q24.11-13 (THRAP6, DCC1, SQLE, SPG8) and 11q14.1 (NDUFC2, ALG8, USP35) associated with significantly worse prognosis. Amplification of any of these regions identified 37 samples with significantly worse overall survival (hazard ratio (HR) = 2.3 (1.3-1.4) p = 0.003) and time to distant metastasis (HR = 2.6 (1.4-5.1) p = 0.004) independently of NPI.
We present strong evidence for the existence of a novel subtype of high-grade ER-negative tumors that is characterized by a low genomic instability index. We also provide a genome-wide list of common copy number alteration regions in breast cancer that show strong coordinate aberrant expression, and further identify novel frequently amplified regions that correlate with poor prognosis. Many of the genes associated with these regions represent likely novel oncogenes or tumor suppressors.
PMCID: PMC2246289  PMID: 17925008
13.  Missing channels in two-colour microarray experiments: Combining single-channel and two-channel data 
BMC Bioinformatics  2007;8:26.
There are mechanisms, notably ozone degradation, that can damage a single channel of two-channel microarray experiments. Resulting analyses therefore often choose between the unacceptable inclusion of poor quality data or the unpalatable exclusion of some (possibly a lot of) good quality data along with the bad. Two such approaches would be a single channel analysis using some of the data from all of the arrays, and an analysis of all of the data, but only from unaffected arrays. In this paper we examine a 'combined' approach to the analysis of such affected experiments that uses all of the unaffected data.
A simulation experiment shows that while a single channel analysis performs relatively well when the majority of arrays are affected, and excluding affected arrays performs relatively well when few arrays are affected (as would be expected in both cases), the combined approach out-performs both. There are benefits to actively estimating the key parameter of the approach, but whether these compensate for the increased computational cost and complexity over just setting that parameter to a fixed value is not clear. Inclusion of ozone-affected data results in poor performance, with a clear spatial effect in the damage being apparent.
There is no need to exclude unaffected data in order to remove those which are damaged. The combined approach discussed here is shown to out-perform more usual approaches, although it seems that if the damage is limited to very few arrays, or extends to very nearly all, then the benefits will be limited. In other circumstances though, large improvements in performance can be achieved by adopting such an approach.
PMCID: PMC1797192  PMID: 17254358
14.  MMASS: an optimized array-based method for assessing CpG island methylation 
Nucleic Acids Research  2006;34(20):e136.
We describe an optimized microarray method for identifying genome-wide CpG island methylation called microarray-based methylation assessment of single samples (MMASS) which directly compares methylated to unmethylated sequences within a single sample. To improve previous methods we used bioinformatic analysis to predict an optimized combination of methylation-sensitive enzymes that had the highest utility for CpG-island probes and different methods to produce unmethylated representations of test DNA for more sensitive detection of differential methylation by hybridization. Subtraction or methylation-dependent digestion with McrBC was used with optimized (MMASS-v2) or previously described (MMASS-v1, MMASS-sub) methylation-sensitive enzyme combinations and compared with a published McrBC method. Comparison was performed using DNA from the cell line HCT116. We show that the distribution of methylation microarray data is inherently skewed and requires exogenous spiked controls for normalization and that analysis of digestion of methylated and unmethylated control sequences together with linear fit models of replicate data showed superior statistical power for the MMASS-v2 method. Comparison with previous methylation data for HCT116 and validation of CpG islands from PXMP4, SFRP2, DCC, RARB and TSEN2 confirmed the accuracy of MMASS-v2 results. The MMASS-v2 method offers improved sensitivity and statistical power for high-throughput microarray identification of differential methylation.
PMCID: PMC1635254  PMID: 17041235
