All cancers are caused by somatic mutations. However, understanding of the biological processes generating these mutations is limited. The catalogue of somatic mutations from a cancer genome bears the signatures of the mutational processes that have been operative. Here, we analysed 4,938,362 mutations from 7,042 cancers and extracted more than 20 distinct mutational signatures. Some are present in many cancer types, notably a signature attributed to the APOBEC family of cytidine deaminases, whereas others are confined to a single class. Certain signatures are associated with age of the patient at cancer diagnosis, known mutagenic exposures or defects in DNA maintenance, but many are of cryptic origin. In addition to these genome-wide mutational signatures, hypermutation localized to small genomic regions, kataegis, is found in many cancer types. The results reveal the diversity of mutational processes underlying the development of cancer with potential implications for understanding of cancer etiology, prevention and therapy.
The Protein Interaction Network Analysis (PINA) platform is a comprehensive web resource, which includes a database of unified protein–protein interaction data integrated from six manually curated public databases, and a set of built-in tools for network construction, filtering, analysis and visualization. The second version of PINA enhances its utility for studies of protein interactions at a network level, by including multiple collections of interaction modules identified by different clustering approaches from the whole network of protein interactions (‘interactome’) for six model organisms. All identified modules are fully annotated by enriched Gene Ontology terms, KEGG pathways, Pfam domains and the chemical and genetic perturbations collection from MSigDB. Moreover, a new tool is provided for module enrichment analysis in addition to a simple query function. The interactome data are also available on the web site for further bioinformatics analysis. PINA is freely accessible at http://cbg.garvan.unsw.edu.au/pina/.
We developed a generalized framework for multiplexed resequencing of targeted regions of the human genome on the Illumina Genome Analyzer using degenerate indexed DNA sequence barcodes ligated to fragmented DNA prior to sequencing. Using this method, the DNA of multiple HapMap individuals was simultaneously sequenced at several ENCODE (ENCyclopedia of DNA Elements) regions. We then evaluated the use of Bayes factors for discovering and genotyping polymorphisms from aligned sequenced reads. If we required that predicted polymorphisms be either previously identified by dbSNP or be visually evident upon reinspection of archived ENCODE traces, we observed a false-positive rate of 11.3% using strict thresholds (Ks>1,000) for predicting variants and 69.6% for lax thresholds (Ks>10). Conversely, false-negative rates ranged from 10.8% to 90.8%, with those at stricter cut-offs occurring at lower coverage (< 10 aligned reads). These results suggest that >90% of genetic variants are discoverable using multiplexed sequencing provided sufficient coverage at the polymorphic base.
We recently reported evidence for an association between the individual variation in normal human episodic memory and a common variant of the KIBRA gene, KIBRA rs17070145 (T-allele). Since memory impairment is a cardinal clinical feature of Alzheimer’s disease (AD), we investigated the possibility of an association between the KIBRA gene and AD using data from neuronal gene expression, brain imaging studies, and genetic association tests. KIBRA was significantly over-expressed and 3 of its 4 known binding partners under-expressed in AD-affected hippocampal, posterior cingulate and temporal cortex regions (p<0.010, corrected) in a study of laser capture microdissected neurons. Using positron emission tomography in a cohort of cognitively normal, late-middle-aged persons genotyped for KIBRA rs17070145, KIBRA T non-carriers exhibited lower glucose metabolism than did carriers in posterior cingulate and precuneus brain regions (P<0.001, uncorrected). Lastly, non-carriers of the KIBRA rs17070145 T-allele had increased risk of late-onset AD in an association study of 702 neuropathologically verified expired subjects (p=0.034; OR=1.29) and in a combined analysis of 1026 additional living and expired subjects (p=0.039; OR=1.26). Our findings suggest that KIBRA is associated with both individual variation in normal episodic memory and predisposition to AD.
genetics; imaging; expression profiling; memory
Next-generation sequencing enables use of whole-genome sequence typing (WGST) as a viable and discriminatory tool for genotyping and molecular epidemiologic analysis. We used WGST to confirm the linkage of a cluster of Coccidioides immitis isolates from 3 patients who received organ transplants from a single donor who later had positive test results for coccidioidomycosis. Isolates from the 3 patients were nearly genetically identical (a total of 3 single-nucleotide polymorphisms identified among them), thereby demonstrating direct descent of the 3 isolates from an original isolate. We used WGST to demonstrate the genotypic relatedness of C. immitis isolates that were also epidemiologically linked. Thus, WGST offers unique benefits to public health for investigation of clusters considered to be linked to a single source.
Fungi; next generation sequencing; Coccidioides; genotyping; molecular epidemiology; whole genome sequence typing; research
Summary: Accurate and complete mapping of short-read sequencing to a reference genome greatly enhances the discovery of biological results and improves statistical predictions. We recently presented RNA-MATE, a pipeline for the recursive mapping of RNA-Seq datasets. With the rapid increase in genome re-sequencing projects, progression of available mapping software and the evolution of file formats, we now present X-MATE, an updated version of RNA-MATE, capable of mapping both RNA-Seq and DNA datasets and with improved performance, output file formats, configuration files, and flexibility in core mapping software.
Availability: Executables, source code, junction libraries, test data and results and the user manual are available from http://grimmond.imb.uq.edu.au/X-MATE/.
Contact: email@example.com; firstname.lastname@example.org
Supplementary information: Supplementary data are available at Bioinformatics Online.
As a first step in analyzing high-throughput data in genome-wide studies, several algorithms are available to identify and prioritize candidate lists for downstream fine-mapping. The prioritized candidates could be differentially expressed genes, aberrations in comparative genomic hybridization studies, or single nucleotide polymorphisms (SNPs) in association studies. Different analysis algorithms are subject to various experimental artifacts and analytical features that lead to different candidate lists. However, little research has been carried out to theoretically quantify the consensus between different candidate lists and to compare the study-specific accuracy of the analytical methods based on a known reference candidate list. Within the context of genome-wide studies, we propose a generic mathematical framework to statistically compare ranked lists of candidates from different algorithms with each other or, if available, with a reference candidate list. To cope with the growing need for intuitive visualization of high-throughput data in genome-wide studies, we describe a complementary customizable visualization tool. As a case study, we demonstrate application of our framework to the comparison and visualization of candidate lists generated in a DNA-pooling based genome-wide association study of CEPH data in the HapMap project, where prior knowledge from individual genotyping can be used to generate a true reference candidate list. The results provide a theoretical basis to compare the accuracy of various methods and to identify redundant methods, thus providing guidance for selecting the most suitable analysis method in genome-wide studies.
genome-wide association studies; candidate lists
Summary: For many genome-wide association (GWA) studies, individually genotyping one million or more SNPs provides a marginal increase in coverage at a substantial cost. Much of the information gained is redundant due to the correlation structure inherent in the human genome. Pooling-based GWA studies could benefit significantly by utilizing this redundancy to reduce noise, improve the accuracy of the observations and increase genomic coverage. We introduce a measure of correlation between individual genotyping and pooling, under the same framework that r² provides a measure of linkage disequilibrium (LD) between pairs of SNPs. We then report a new non-haplotype multimarker multi-loci method that leverages the correlation structure between SNPs in the human genome to increase the efficacy of pooling-based GWA studies. We first give a theoretical framework and derivation of our multimarker method. Next, we evaluate simulations using this multimarker approach in comparison to single marker analysis. Finally, we experimentally evaluate our method using different pools of HapMap individuals on the Illumina 450S Duo, Illumina 550K and Affymetrix 5.0 platforms for a combined total of 1 333 631 SNPs. Our results show that use of multimarker analysis reduces noise specific to pooling-based studies, allows for efficient integration of multiple microarray platforms and provides more accurate measures of significance than single marker analysis. Additionally, this approach can be extended to allow for imputing the association significance for SNPs not directly observed using neighboring SNPs in LD. This multimarker method can now be used to cost-effectively complete pooling-based GWA studies with multiple platforms across over one million SNPs and to impute neighboring SNPs weighted for the loss of information due to pooling.
Supplementary information: Supplementary data are available at Bioinformatics online.
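The r² measure of linkage disequilibrium mentioned in the summary above can be illustrated with a minimal Python sketch. It shows only the standard pairwise r² between two biallelic SNPs computed from phased haplotypes; it is not the pooling correlation measure or the multimarker method that the paper introduces. The function name `ld_r_squared` and the 0/1 allele coding are assumptions made for this illustration.

```python
def ld_r_squared(haplotypes):
    """Standard pairwise LD r^2 for two biallelic SNPs.

    haplotypes: list of (a, b) tuples, one per phased haplotype, where a and b
    are the alleles at SNP 1 and SNP 2, coded 0 or 1 (hypothetical coding).
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("need at least one haplotype")
    p_a = sum(a for a, _ in haplotypes) / n          # frequency of allele 1 at SNP 1
    p_b = sum(b for _, b in haplotypes) / n          # frequency of allele 1 at SNP 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n  # joint frequency
    d = p_ab - p_a * p_b                             # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when either SNP is monomorphic")
    return d * d / denom


# Two SNPs whose alleles always travel together are in complete LD.
perfect = [(1, 1), (1, 1), (0, 0), (0, 0)]
# Two SNPs whose alleles combine freely show no LD.
independent = [(1, 1), (1, 0), (0, 1), (0, 0)]
print(ld_r_squared(perfect), ld_r_squared(independent))
```

High r² between neighboring SNPs is the redundancy the summary refers to: a tightly linked SNP carries much of the same information as its neighbor.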
For late onset Alzheimer's disease (LOAD), the only confirmed genetic association is with the apolipoprotein E (APOE) locus on chromosome 19. Meta-analysis is often employed to sort the true associations from the false positives. LOAD research has the advantage of a continuously updated meta-analysis of candidate gene association studies in the web-based AlzGene database. The top 30 AlzGene loci on May 1st, 2007 were investigated in our whole genome association data set consisting of 1411 LOAD cases and neuropathologically verified controls genotyped at 312,316 SNPs using the Affymetrix 500K Mapping Platform. Of the 30 “top AlzGenes”, 32 SNPs in 24 genes had odds ratios (OR) whose 95% confidence intervals did not include 1. Of these 32 SNPs, six were part of the Affymetrix 500K Mapping panel and another ten had proxies on the Affymetrix array that had >80% power to detect an association with α=0.001. Two of these 16 SNPs showed significant association with LOAD in our sample series. One was rs4420638 at the APOE locus (uncorrected p-value=4.58E-37) and the other was rs4293, located in the angiotensin converting enzyme (ACE) locus (uncorrected p-value=0.014). Since this result was nominally significant, but did not survive multiple testing correction for 16 independent tests, this association at rs4293 was verified in a geographically distinct German cohort (p-value=0.03). We present the results of our ACE replication along with a discussion of the statistical limitations of multiple test corrections in whole genome studies.
Late-onset Alzheimer disease; single nucleotide polymorphism; genome-wide association study; meta-analysis; ACE
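The multiple-testing step in the abstract above can be made concrete with a short Python sketch. The abstract does not name the correction procedure or the family-wise significance level, so the Bonferroni correction and a level of 0.05 are assumptions chosen for illustration. The function names are hypothetical, and the p-values are the two uncorrected values quoted in the abstract.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def survives_bonferroni(p_value, alpha, n_tests):
    """True if an uncorrected p-value stays significant after correction."""
    return p_value < bonferroni_threshold(alpha, n_tests)


ALPHA = 0.05    # assumed family-wise level, not stated in the abstract
N_TESTS = 16    # the 16 SNPs tested, as stated in the abstract

for snp, p in [("rs4420638 (APOE)", 4.58e-37), ("rs4293 (ACE)", 0.014)]:
    verdict = "survives" if survives_bonferroni(p, ALPHA, N_TESTS) else "does not survive"
    print(f"{snp}: p={p:g} {verdict} correction for {N_TESTS} tests")
```

Under these assumptions the per-test threshold is 0.05/16 = 0.003125, so the ACE result at p = 0.014 is nominally significant but falls short after correction, which is consistent with the abstract's reason for seeking replication in an independent cohort.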
Dementia is a common disabling complication in patients with Parkinson's disease (PD). The underlying molecular causes of Parkinson's disease with dementia (PDD) are poorly understood. To identify candidate genes and molecular pathways involved in PDD, we have performed whole genome expression profiling of susceptible cortical neuronal populations. Results show significant differences in expression of 162 genes (P < 0.01) between PD patients who are cognitively normal (PD-CogNL) and controls. In contrast, there were 556 genes (P < 0.01) significantly altered in PDD compared to either healthy controls or to PD-CogNL cases. These results are consistent with increased cortical pathology in PDD relative to PD-CogNL and identify underlying molecular changes associated with the increased pathology of PDD. Lastly, we have identified expression differences in 69 genes in PD cortical neurons that occur before the onset of dementia and that are exacerbated upon the development of dementia, suggesting that they may be relevant presymptomatic contributors to the onset of dementia in PD. These results provide new insights into the cortical molecular changes associated with PDD and provide a highly useful reference database for researchers interested in PDD.
Parkinson's disease; gene expression; mRNA splicing; laser capture microdissection; dementia
The apolipoprotein E (APOE) ε4 allele is the best established genetic risk factor for late-onset Alzheimer’s disease (LOAD). We conducted genome-wide surveys of 502,627 single-nucleotide polymorphisms (SNPs) to characterize and confirm other LOAD susceptibility genes. In ε4 carriers from neuropathologically verified discovery, neuropathologically verified replication, and clinically characterized replication cohorts of 1411 cases and controls, LOAD was associated with six SNPs from the GRB-associated binding protein 2 (GAB2) gene and a common haplotype encompassing the entire GAB2 gene. SNP rs2373115 (p = 9 × 10⁻¹¹) was associated with an odds ratio of 4.06 (confidence interval 2.81–14.69), which interacts with APOE ε4 to further modify risk. GAB2 was overexpressed in pathologically vulnerable neurons; the Gab2 protein was detected in neurons, tangle-bearing neurons, and dystrophic neurites; and interference with GAB2 gene expression increased tau phosphorylation. Our findings suggest that GAB2 modifies LOAD risk in APOE ε4 carriers and influences Alzheimer’s neuropathology.
Somatic mutation calling from next-generation sequencing data remains a challenge due to the difficulties of distinguishing true somatic events from artifacts arising from PCR, sequencing errors or mis-mapping. Tumor cellularity or purity, sub-clonality and copy number changes also confound the identification of true somatic events against a background of germline variants. We have developed a heuristic strategy and software (http://www.qcmg.org/bioinformatics/qsnp/) for somatic mutation calling in samples with low tumor content and we show the superior sensitivity and precision of our approach using a previously sequenced cell line, a series of tumor/normal admixtures, and 3,253 putative somatic SNVs verified on an orthogonal platform.
We use high-density single nucleotide polymorphism (SNP) genotyping microarrays to demonstrate the ability to accurately and robustly determine whether individuals are in a complex genomic DNA mixture. We first develop a theoretical framework for detecting an individual's presence within a mixture, then show, through simulations, the limits associated with our method, and finally demonstrate experimentally the identification of the presence of genomic DNA of specific individuals within a series of highly complex genomic mixtures, including mixtures where an individual contributes less than 0.1% of the total genomic DNA. These findings shift the perceived utility of SNPs for identifying individual trace contributors within a forensics mixture, and suggest future research efforts into assessing the viability of previously sub-optimal DNA sources due to sample contamination. These findings also suggest that composite statistics across cohorts, such as allele frequency or genotype counts, do not mask identity within genome-wide association studies. The implications of these findings are discussed.
In this report we describe a framework for accurately and robustly resolving whether individuals are in a complex genomic DNA mixture using high-density single nucleotide polymorphism (SNP) genotyping microarrays. We develop a theoretical framework for detecting an individual's presence within a mixture, show its limits through simulation, and finally demonstrate experimentally the identification of the presence of genomic DNA of individuals within a series of highly complex genomic mixtures. Our approaches demonstrate straightforward identification of trace amounts (<1%) of DNA from an individual contributor within a complex mixture. We show how probe-intensity analysis of high-density SNP data can be used, even given the experimental noise of a microarray. We discuss the implications of these findings in two fields: forensics and genome-wide association (GWA) genetic studies. Within forensics, resolving whether an individual is contributing trace amounts of genomic DNA to a complex mixture is a tremendous challenge. Within GWA studies, there is a considerable push to make experimental data publicly available so that the data can be combined with other studies. Our findings show that such an approach does not completely conceal identity, since it is straightforward to assess the probability that a person or relative participated in a GWA study.
Francisella tularensis is the causative agent of tularemia, which is a highly lethal disease from nature and potentially from a biological weapon. This species contains four recognized subspecies including the North American endemic F. tularensis subsp. tularensis (type A), whose genetic diversity is correlated with its geographic distribution including a major population subdivision referred to as A.I and A.II. The biological significance of the A.I – A.II genetic differentiation is unknown, though there are suggestive ecological and epidemiological correlations. In order to understand the differentiation at the genomic level, we have determined the complete sequence of an A.II strain (WY96-3418) and compared it to the genome of Schu S4 from the A.I population. We find that this A.II genome is 1,898,476 bp in size with 1,820 genes, 1,303 of which code for proteins. While extensive genomic variation exists between “WY96” and Schu S4, there is only one whole gene difference. This one gene difference is a hypothetical protein of unknown function. In contrast, there are numerous SNPs (3,367), small indels (1,015), IS element differences (7) and large chromosomal rearrangements (31), including both inversions and translocations. The rearrangement borders are frequently associated with IS elements, which would facilitate intragenomic recombination events. The pathogenicity island duplicated regions (DR1 and DR2) are essentially identical in WY96 but vary relative to Schu S4 at 60 nucleotide positions. Other potential virulence-associated genes (231) varied at 559 nucleotide positions, including 357 non-synonymous changes. Molecular clock estimates for the divergence time between A.I and A.II genomes for different chromosomal regions ranged from 866 to 2131 years before present. This paper is the first complete genomic characterization of a member of the A.II clade of Francisella tularensis subsp. tularensis.
High throughput microarray-based single nucleotide polymorphism (SNP) genotyping has revolutionized the way genome-wide linkage scans and association analyses are performed. One of the key features of the array-based GeneChip® Mapping 10K Array from Affymetrix is the automated SNP calling algorithm. The Affymetrix algorithm was trained on a database of ethnically diverse DNA samples to create SNP call zones that are used as static models to make genotype calls for experimental data. We describe here the implementation of clustering algorithms on large training datasets resulting in improved SNP call rates on the 10K GeneChip.
A database of 948 individuals genotyped on the GeneChip® Mapping 10K 2.0 Array was used to identify 822 SNPs that were called consistently less than 75% of the time. These SNPs represent on average 8.25% of the total SNPs on each chromosome with chromosome 19, the most gene-rich chromosome, containing the highest proportion of poor performers (18.7%). To remedy this, we created SNiPer, a new application which uses two clustering algorithms to yield increased call rates and equivalent concordance to Affymetrix called genotypes. We include a training set for these algorithms based on individual genotypes for 705 samples. SNiPer has the capability to be retrained for lab-specific training sets. SNiPer is freely available for download at .
The correct calling of poor performing SNPs may prove to be key in future linkage studies performed on the 10K GeneChip. It would prove particularly invaluable for those diseases that map to chromosome 19, known to contain a high proportion of poorly performing SNPs. Our results illustrate that SNiPer can be used to increase call rates on the 10K GeneChip® without sacrificing accuracy, thereby increasing the amount of valid data generated.
Tumour cellularity, the relative proportion of tumour and normal cells in a sample, affects the sensitivity of mutation detection, copy number analysis, cancer gene expression and methylation profiling. Tumour cellularity is traditionally estimated by pathological review of sectioned specimens; however, this method is both subjective and prone to error due to heterogeneity within lesions and cellularity differences between the sample viewed during pathological review and tissue used for research purposes. In this paper we describe a statistical model to estimate tumour cellularity from SNP array profiles of paired tumour and normal samples using shifts in SNP allele frequency at regions of loss of heterozygosity (LOH) in the tumour. We also provide qpure, a software implementation of the method. Our experiments showed that there is a medium correlation of 0.42 (p-value = 0.0001) between tumour cellularity estimated by qpure and pathology review. Interestingly, there is a high correlation of 0.87 (p-value < 2.2e-16) between cellularity estimates by qpure and deep Ion Torrent sequencing of known somatic KRAS mutations, and a weaker correlation of 0.32 (p-value = 0.004) between Ion Torrent sequencing and pathology review. This suggests that qpure may be a more accurate predictor of tumour cellularity than pathology review. qpure can be downloaded from https://sourceforge.net/projects/qpure/.
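The idea that allele frequency shifts at LOH regions carry information about cellularity can be sketched in Python with a simple two-population mixture. This is a textbook-style simplification, not the statistical model implemented in qpure. It assumes a germline-heterozygous SNP, a single tumour clone, diploid normal cells, and a noise-free B-allele frequency (BAF) for the retained allele. The function name and the two LOH labels are hypothetical.

```python
def cellularity_from_baf(baf, loh_type="copy_neutral"):
    """Estimate tumour cellularity p from the retained-allele frequency at LOH.

    Assumed mixture model at a germline-heterozygous (AB) SNP:
      copy-neutral LOH (tumour BB, normal AB):  baf = (1 + p) / 2  ->  p = 2*baf - 1
      hemizygous deletion (tumour B, normal AB): baf = 1 / (2 - p)  ->  p = 2 - 1/baf
    """
    if not 0.5 <= baf <= 1.0:
        raise ValueError("retained-allele frequency must lie in [0.5, 1.0]")
    if loh_type == "copy_neutral":
        return 2.0 * baf - 1.0
    if loh_type == "hemizygous_deletion":
        return 2.0 - 1.0 / baf
    raise ValueError(f"unknown loh_type: {loh_type!r}")


# The same 50% cellularity yields different allele frequencies depending on
# the kind of LOH, so the copy-number state has to be known or modelled.
print(cellularity_from_baf(0.75, "copy_neutral"))
print(cellularity_from_baf(2 / 3, "hemizygous_deletion"))
```

In this simplified picture a pure normal sample gives a frequency of 0.5 and a pure tumour sample gives 1.0; real array data are noisy, so an estimate would have to pool many SNPs across LOH regions.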
Age-related hearing impairment (ARHI), or presbycusis, is the most prevalent sensory impairment in the elderly. ARHI is a complex disease caused by an interaction between environmental and genetic factors. Here we describe the results of the first whole genome association study for ARHI. The study was performed using 846 cases and 846 controls selected from 3434 individuals collected by eight centers in six European countries. DNA pools for cases and controls were allelotyped on the Affymetrix 500K GeneChip® for each center separately. The 252 top-ranked single nucleotide polymorphisms (SNPs) identified in a non-Finnish European sample group (1332 samples) and the 177 top-ranked SNPs from a Finnish sample group (360 samples) were confirmed using individual genotyping. Subsequently, the 23 most interesting SNPs were individually genotyped in an independent European replication group (138 samples). This resulted in the identification of a highly significant and replicated SNP located in GRM7, the gene encoding metabotropic glutamate receptor type 7. Also in the Finnish sample group, two GRM7 SNPs were significant, albeit in a different region of the gene. As the Finnish are genetically distinct from the rest of the European population, this may be due to allelic heterogeneity. We performed histochemical studies in human and mouse and showed that mGluR7 is expressed in hair cells and in spiral ganglion cells of the inner ear. Together these data indicate that common alleles of GRM7 contribute to an individual's risk of developing ARHI, possibly through a mechanism of altered susceptibility to glutamate excitotoxicity.