DNA methylation (the addition of a methyl group to a cytosine) is an important epigenetic event in mammalian cells because it plays a key role in regulating gene expression. Most previous methylation studies assume that DNA methylation occurs on both positive and negative strands. However, a few studies have reported that in some genes, methylation occurs only on one strand (i.e., hemimethylation) and has clustering patterns. These studies report that hemimethylation occurs on individual genes. It is unclear whether hemimethylation occurs genome-wide and whether there are hemimethylation differences between cancerous and noncancerous cells. To address these questions, we have developed the first-ever pipeline, named hemimethylation pipeline (HMPL), to identify hemimethylation patterns. Using available software together with newly developed Perl and R scripts, HMPL can identify hemimethylation patterns for a single sample and can also compare two different samples.
hemimethylation; NGS (next-generation sequencing); HMPL
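To make the notion of strand-specific methylation concrete, the following is a minimal sketch of how a CpG site could be labelled from per-strand methylation levels and how nearby hemimethylated sites could be grouped into clusters. It is not the HMPL implementation: the function names, the `high`/`low` thresholds, and the `max_gap` distance are all illustrative assumptions.

```python
def classify_cpg(plus_level, minus_level, high=0.8, low=0.2):
    """Label one CpG site from the methylation levels (0..1) on its two strands.

    The thresholds are hypothetical defaults, not values taken from HMPL.
    """
    if plus_level >= high and minus_level <= low:
        return "hemi_plus"       # methylated on the plus strand only
    if minus_level >= high and plus_level <= low:
        return "hemi_minus"      # methylated on the minus strand only
    if plus_level >= high and minus_level >= high:
        return "methylated"
    if plus_level <= low and minus_level <= low:
        return "unmethylated"
    return "intermediate"


def find_clusters(sites, max_gap=50):
    """Group hemimethylated positions that lie within max_gap bp of each other.

    sites is an iterable of (position, label) pairs; clusters with fewer
    than two sites are dropped.
    """
    hemi = sorted(pos for pos, label in sites if label.startswith("hemi"))
    clusters, current = [], []
    for pos in hemi:
        if current and pos - current[-1] > max_gap:
            if len(current) >= 2:
                clusters.append(current)
            current = []
        current.append(pos)
    if len(current) >= 2:
        clusters.append(current)
    return clusters


# Toy data, invented for illustration: position -> (plus level, minus level)
levels = {100: (0.9, 0.1), 120: (0.95, 0.05), 400: (0.1, 0.9), 900: (0.9, 0.9)}
sites = [(pos, classify_cpg(*lv)) for pos, lv in levels.items()]
clusters = find_clusters(sites)
```

Comparing two samples could then amount to running the same labelling on each and contrasting the resulting site labels or clusters.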
The RNA-binding protein FUS/TLS, mutation in which is causative of the fatal motor neuron disease ALS, is demonstrated to directly bind to the U1-snRNP and SMN complexes. ALS-causative mutations in FUS/TLS are shown to abnormally enhance their interaction with SMN and dysregulate its function, including loss of Gems and altered levels of small nuclear RNAs (snRNAs). The same mutants are found to have reduced association with U1-snRNP. Correspondingly, global RNA analysis reveals a mutant-dependent loss of splicing activity, with ALS-linked mutants failing to reverse changes caused by loss of wild-type FUS/TLS. Furthermore, a common FUS/TLS mutant-associated RNA splicing signature is identified in ALS patient fibroblasts. Taken together, these studies establish potentially converging disease mechanisms in ALS and spinal muscular atrophy, with ALS-causative mutants acquiring properties representing both gain (dysregulation of SMN) and loss (reduced RNA processing mediated by U1-snRNP) of function.
Common terms used in genetics with multiple meanings are explained and the terminology used in subsequent chapters is defined. Statistical Human Genetics has existed as a discipline for over a century, and during that time the meanings of many of the terms used have evolved, largely driven by molecular discoveries, to the point that molecular and statistical geneticists often have difficulty understanding each other. It is therefore imperative, now that so much of molecular genetics is becoming an in silico and statistical science, that we have well-defined, common terminology.
Gene; allele; locus; site; genotype; phenotype; dominant; recessive; codominant; additive; phenoset; diallelic; multiallelic; polyallelic; monomorphic; monoallelic; polymorphism; mutation; complex trait; multifactorial; polygenic; monogenic; mixed model; transmission probability; transition probability; epistasis; interaction; pleiotropy; quantitative trait locus; probit; logit; penetrance; transformation; scale of measurement; identity by descent; identity in state; Haplotype; phase; multilocus genotype; allelic association; linkage disequilibrium; gametic phase disequilibrium
The RNA-binding proteins Rbfox1/2/3 regulate alternative splicing in the nervous system, and disruption of Rbfox1 has been implicated in autism. However, comprehensive identification of functional Rbfox targets has been challenging. Here we performed HITS-CLIP for all three Rbfox family members to globally map, at a single-nucleotide resolution, their in vivo RNA interaction sites in the mouse brain. We found that the two guanines in the Rbfox-binding motif UGCAUG are critical for protein-RNA interactions and crosslinking. Using integrative modeling, these interaction sites combined with additional datasets defined 1,059 direct Rbfox target alternative splicing events. Over half of the quantifiable targets show dynamic changes during brain development. Of particular interest are 111 events from 48 candidate autism-susceptibility genes, including syndromic autism genes Shank3, Cacna1c, and Tsc2. Alteration of Rbfox targets in some autistic brains is correlated with down-regulation of all three Rbfox proteins, supporting the potential clinical relevance of the splicing-regulatory network.
Loss of the lariat debranching enzyme Dbr1 is found to repress TDP-43 toxicity. The accumulated intronic lariat RNAs, which are normally degraded after splicing, likely act as decoys to sequester TDP-43 away from binding to and disrupting function of other RNAs.
Many Single Nucleotide Polymorphism (SNP) calling programs have been developed to identify Single Nucleotide Variations (SNVs) in next-generation sequencing (NGS) data. However, low sequencing coverage presents challenges to accurate SNV identification, especially in single-sample data. Moreover, commonly used SNP calling programs usually include several metrics in their output files for each potential SNP. These metrics are highly correlated in complex patterns, making it extremely difficult to select SNPs for further experimental validations.
To explore solutions to the above challenges, we compare the performance of four SNP calling algorithms, SOAPsnp, Atlas-SNP2, SAMtools, and GATK, in a low-coverage single-sample sequencing dataset. Without any post-output filtering, SOAPsnp calls more SNVs than the other programs since it has fewer internal filtering criteria. Atlas-SNP2 has stringent internal filtering criteria; thus it reports the fewest SNVs. The numbers of SNVs called by GATK and SAMtools fall between those of SOAPsnp and Atlas-SNP2. Moreover, we explore the values of key metrics related to SNVs’ quality in each algorithm and use them as post-output filtering criteria to filter out low quality SNVs. Under different coverage cutoff values, we compare the four algorithms and calculate the empirical positive calling rate and sensitivity. Our results show that: 1) the overall agreement of the four calling algorithms is low, especially in non-dbSNPs; 2) the agreement of the four algorithms is similar when using different coverage cutoffs, except that the non-dbSNPs agreement level tends to increase slightly with increasing coverage; 3) SOAPsnp, SAMtools, and GATK have a higher empirical calling rate for dbSNPs compared to non-dbSNPs; and 4) overall, GATK and Atlas-SNP2 have a relatively higher positive calling rate and sensitivity, but GATK calls more SNVs.
Our results show that the agreement between different calling algorithms is relatively low. Thus, more caution should be used in choosing algorithms, setting filtering parameters, and designing validation studies. For reliable SNV calling results, we recommend that users employ more than one algorithm and use metrics related to calling quality and coverage as filtering criteria.
Next generation sequencing; SNP calling; Low-coverage; Single-sample; SOAPsnp; Atlas-SNP2; SAMtools; GATK
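The recommendation above (filter on quality and coverage, then combine more than one caller) can be sketched as follows. This is only an illustration under stated assumptions: the caller names, the toy calls, the cutoffs, and the use of the Jaccard index as the agreement measure are all hypothetical and are not taken from the study.

```python
from collections import Counter
from itertools import combinations


def filter_calls(calls, min_coverage=8, min_quality=30):
    """Return the variant keys that pass both cutoffs.

    calls maps (chrom, pos, alt) -> (coverage, quality); the cutoffs are
    illustrative defaults only.
    """
    return {key for key, (cov, qual) in calls.items()
            if cov >= min_coverage and qual >= min_quality}


def pairwise_agreement(call_sets):
    """Jaccard index (shared calls / all calls) for every pair of callers."""
    agreement = {}
    for a, b in combinations(sorted(call_sets), 2):
        union = call_sets[a] | call_sets[b]
        shared = call_sets[a] & call_sets[b]
        agreement[(a, b)] = len(shared) / len(union) if union else 0.0
    return agreement


def consensus(call_sets, min_callers=2):
    """Variants reported by at least min_callers of the callers."""
    counts = Counter(key for calls in call_sets.values() for key in calls)
    return {key for key, n in counts.items() if n >= min_callers}


# Toy input, invented for illustration.
raw = {
    "caller_a": {("chr1", 100, "T"): (12, 40),
                 ("chr1", 200, "G"): (3, 50),
                 ("chr1", 300, "A"): (10, 35)},
    "caller_b": {("chr1", 100, "T"): (11, 60),
                 ("chr1", 300, "A"): (9, 20),
                 ("chr1", 400, "C"): (15, 45)},
}
filtered = {name: filter_calls(calls) for name, calls in raw.items()}
agreement = pairwise_agreement(filtered)
shared_calls = consensus(filtered, min_callers=2)
```

In practice each caller reports its own quality metrics on its own scale, so a single `min_quality` cutoff shared across callers is a simplification.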
DNA methylation is an epigenetic event that adds a methyl group to the 5’ cytosine. This epigenetic modification can significantly affect gene expression in both normal and diseased cells. Hence, it is important to study methylation signals at the single cytosine site level, which is now possible using the bisulfite conversion technique (i.e., converting unmethylated Cs to Us and then to Ts after PCR amplification) and next generation sequencing (NGS) technologies. Despite the advances of NGS technologies, certain quality issues remain. Some of the more prevalent quality issues involve low per-base sequencing quality at the 3’ end, PCR amplification bias, and bisulfite conversion rates. Therefore, it is important to conduct quality assessment before downstream analysis. To the best of our knowledge, no existing software packages can generally assess the quality of methylation sequencing data generated based on different bisulfite-treated protocols.
To conduct the quality assessment of bisulfite methylation sequencing data, we have developed a pipeline named MethyQA. MethyQA combines currently available open-source software packages with our own custom programs written in Perl and R. The pipeline can provide quality assessment results for tens of millions of reads in under an hour. The novelty of our pipeline lies in its examination of bisulfite conversion rates and of the DNA sequence structure of regions that have different conversion rates or coverage.
MethyQA is a new software package that provides users with a unique insight into the methylation sequencing data they are researching. It allows the users to determine the quality of their data and better prepares them to address the research questions that lie ahead. Due to the speed and efficiency at which MethyQA operates, it will become an important tool for studies dealing with bisulfite methylation sequencing data.
DNA methylation; Next generation sequencing; Alignment; BRAT; Quality assessment
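The bisulfite conversion step described above, and the idea of a conversion rate, can be illustrated with a small sketch. It is not MethyQA's method: it assumes an ungapped, same-strand alignment between a read and its reference, and it assumes the caller supplies positions known to be unmethylated (for example from a spike-in control). Function names are hypothetical.

```python
def bisulfite_convert(seq, methylated_positions=frozenset()):
    """Simulate bisulfite treatment followed by PCR on one strand.

    Unmethylated C is read as T; methylated C (0-based positions given in
    methylated_positions) stays C.
    """
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )


def conversion_rate(reference, read, unmethylated_positions):
    """Fraction of reference Cs at assumed-unmethylated positions read as T.

    Positions where the read shows neither C nor T are ignored. Returns
    NaN when no informative position is available.
    """
    converted = total = 0
    for i in unmethylated_positions:
        if reference[i] != "C":
            continue
        if read[i] == "T":
            converted += 1
            total += 1
        elif read[i] == "C":
            total += 1
    return converted / total if total else float("nan")


reference = "ACGTCCGA"
fully_converted = bisulfite_convert(reference, methylated_positions={1})
partially_converted = "ACGTTCGA"   # the C at position 5 failed to convert
rate_full = conversion_rate(reference, fully_converted, [4, 5])
rate_partial = conversion_rate(reference, partially_converted, [4, 5])
```

A low rate on positions that should be unmethylated suggests incomplete conversion, which would otherwise be misread as methylation.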
The advent of Next-Generation sequencing technologies, which significantly increase the throughput and reduce the cost of large scale sequencing efforts, provides an unprecedented opportunity for discovery of novel gene mutations in human cancers. However, it remains a challenge to apply Next-Generation technologies to DNA extracted from formalin fixed paraffin embedded cancer specimens. We describe here the successful development of a custom DNA capture method using Next-Generation sequencing for detection of 140 driver genes in 5 formalin fixed paraffin embedded human colon cancer samples using an improved extraction process to produce high quality DNA. Isolated DNA was enriched for targeted exons and sequenced using the Illumina Next-Generation platform. An analytical pipeline using 3 software platforms to define single nucleotide variants was used to evaluate the data output. Approximately 250x average coverage was obtained with >96% of target bases having at least 30 sequence reads. Results were then compared to previously performed high throughput Sanger sequencing. Requiring a positive call from all 3 callers to give a positive result, 98% of the verified Sanger sequencing somatic driver gene mutations were identified by our method with a specificity of 90%. 13 insertions and deletions identified by Next-Generation sequencing were confirmed by Sanger sequencing. We also applied this technology to two components of a biphasic colon cancer which had strikingly differing histology. Remarkably, no new driver gene mutation accumulation was identified in the more undifferentiated component. Applying this method to profiling of formalin fixed paraffin embedded colon cancer tissue samples yields equivalent sensitivity and specificity for mutation detection as Sanger sequencing of matched cell lines derived from these cancers.
This method directly enables high throughput comprehensive mutational profiling of colon cancer samples, and is easily extendable to enable targeted sequencing from formalin fixed paraffin embedded material for other tumor types.
next generation sequencing; colon cancer; driver gene mutations
Motivation: Alternative splicing (AS) is a pre-mRNA maturation process leading to the expression of multiple mRNA variants from the same primary transcript. More than 90% of human genes are expressed via AS. Therefore, quantifying the inclusion level of every exon is crucial for generating accurate transcriptomic maps and studying the regulation of AS.
Results: Here we introduce SpliceTrap, a method to quantify exon inclusion levels using paired-end RNA-seq data. Unlike other tools, which focus on full-length transcript isoforms, SpliceTrap approaches the expression-level estimation of each exon as an independent Bayesian inference problem. In addition, SpliceTrap can identify major classes of alternative splicing events under a single cellular condition, without requiring a background set of reads to estimate relative splicing changes. We tested SpliceTrap both by simulation and real data analysis, and compared it to state-of-the-art tools for transcript quantification. SpliceTrap demonstrated improved accuracy, robustness and reliability in quantifying exon-inclusion ratios.
Conclusions: SpliceTrap is a useful tool to study alternative splicing regulation, especially for accurate quantification of local exon-inclusion ratios from RNA-seq data.
Availability and Implementation: SpliceTrap can be implemented online through the CSH Galaxy server http://cancan.cshl.edu/splicetrap and is also available for download and installation at http://rulai.cshl.edu/splicetrap/.
Supplementary Information: Supplementary data are available at Bioinformatics online.
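As a rough illustration of what an exon-inclusion ratio measures, the sketch below computes a naive estimate from junction read counts, with an optional Beta prior to stabilize low-count exons. This is explicitly not SpliceTrap's Bayesian model, which is not described in the abstract; the function name, the use of junction counts, and the Beta(1, 1) default prior are assumptions made for illustration.

```python
def inclusion_ratio(upstream_jn, downstream_jn, skip_jn, prior=(1.0, 1.0)):
    """Naive exon-inclusion ratio from junction-spanning read counts.

    upstream_jn / downstream_jn: reads spanning the two junctions that
    include the exon; skip_jn: reads spanning the exon-skipping junction.
    Inclusion evidence is averaged over its two junctions so that it is
    comparable to the single skipping junction. prior is a Beta(a, b)
    pseudo-count pair; with prior=(0, 0) the total count must be non-zero.
    """
    inclusion = (upstream_jn + downstream_jn) / 2.0
    a, b = prior
    return (inclusion + a) / (inclusion + skip_jn + a + b)


plain = inclusion_ratio(30, 30, 10, prior=(0.0, 0.0))   # no prior
smoothed = inclusion_ratio(30, 30, 10)                   # Beta(1, 1) prior
no_reads = inclusion_ratio(0, 0, 0)                      # falls back to prior mean
```

With no reads at all the estimate falls back to the prior mean, which is one simple reason a Bayesian treatment is attractive for exons with sparse coverage.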
Both splicing factors and microRNAs are important regulatory molecules that play key roles in post-transcriptional gene regulation. By miRNA deep sequencing, we identified 40 miRNAs that are differentially expressed upon ectopic overexpression of the splicing factor SF2/ASF. Here we show that SF2/ASF and one of its upregulated microRNAs (miR-7) can form a negative feedback loop: SF2/ASF promotes miR-7 maturation, and mature miR-7 in turn targets the 3′UTR of SF2/ASF to repress its translation. Enhanced microRNA expression is mediated by direct interaction between SF2/ASF and the primary miR-7 transcript to facilitate Drosha cleavage and is independent of SF2/ASF’s function in splicing. Other miRNAs, including miR-221 and miR-222, may also be regulated by SF2/ASF through a similar mechanism. These results underscore a function of SF2/ASF in pri-miRNA processing and highlight the potential coordination between splicing control and miRNA-mediated gene repression in gene regulatory networks.
Next-generation sequencing technologies generate a significant number of short reads that are utilized to address a variety of biological questions. However, quite often, sequencing reads tend to have low quality at the 3’ end and are generated from the repetitive regions of a genome. It is unclear how different alignment programs perform under these different cases. In order to investigate this question, we use both real data and simulated data with the above issues to evaluate the performance of four commonly used algorithms: SOAP2, Bowtie, BWA, and Novoalign.
The performance of different alignment algorithms is measured in terms of concordance between any pair of aligners (for real sequencing data without known truth) and the accuracy of simulated read alignment.
Our results show that, for sequencing data with reads that have relatively good quality or that have had low quality bases trimmed off, all four alignment programs perform similarly. We have also demonstrated that trimming off low quality ends markedly increases the number of aligned reads and improves the consistency among different aligners as well, especially for low quality data. However, Novoalign is more sensitive to the improvement of data quality. Trimming off low quality ends significantly increases the concordance between Novoalign and other aligners. As for aligning reads from repetitive regions, our simulation data show that reads from repetitive regions tend to be aligned incorrectly, and suppressing reads with multiple hits can improve alignment accuracy.
This study provides a systematic comparison of commonly used alignment algorithms in the context of sequencing data with varying qualities and from repetitive regions. Our approach can be applied to different sequencing data sets generated from different platforms. It can also be utilized to study the performance of other alignment programs.
Next generation sequencing; Alignment; Sequencing quality; SOAP2; Bowtie; BWA; Novoalign
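The two preprocessing ideas discussed above, trimming low-quality 3' ends and suppressing reads with multiple hits, can be sketched in a few lines. The study does not specify its exact trimming rule, so this sketch assumes the simplest one (drop bases from the 3' end until one meets the threshold) and Phred+33 quality encoding; the function names and the `min_q` default are hypothetical.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Trim bases from the 3' end until one has Phred quality >= min_q.

    qual is the FASTQ quality string; offset=33 assumes Phred+33 encoding.
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def unique_hits(alignments):
    """Keep only reads that align to exactly one location.

    alignments maps read_id -> list of (chrom, pos) hits.
    """
    return {read_id: hits[0]
            for read_id, hits in alignments.items()
            if len(hits) == 1}


# 'I' encodes Phred 40 and '#' encodes Phred 2 under Phred+33.
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGT", "IIIII###")

# Toy alignments, invented for illustration.
hits = {"read1": [("chr1", 500)],
        "read2": [("chr1", 900), ("chr7", 1200)]}
kept = unique_hits(hits)
```

Dropping multi-hit reads trades sensitivity for accuracy: reads from repetitive regions are lost, but the remaining placements are less likely to be wrong.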
DNA methylation plays a very important role in the silencing of tumor suppressor genes in various tumor types. In order to gain a genome-wide understanding of how changes in methylation affect tumor growth, the differential methylation hybridization (DMH) protocol has been developed and large amounts of DMH microarray data have been generated. However, it is still unclear how to preprocess this type of microarray data and how different background correction and normalization methods used for two-color gene expression arrays perform for the methylation microarray data. In this paper, we demonstrate our discovery of a set of internal control probes that have log ratios (M) theoretically equal to zero according to this DMH protocol. With the aid of this set of control probes, we propose two LOESS (or LOWESS, locally weighted scatter-plot smoothing) normalization methods that are novel and unique for DMH microarray data. Together with two other approaches (global LOESS and no normalization), this gives four normalization methods to compare. In addition, we compare five different background correction methods.
We study 20 different preprocessing methods, which are the combinations of five background correction methods and four normalization methods. In order to compare these 20 methods, we evaluate their performance in identifying known methylated and un-methylated housekeeping genes based on two statistics. Comparison details are illustrated using breast cancer cell line and ovarian cancer patient methylation microarray data. Our comparison results show that the different background correction methods perform similarly; however, the four normalization methods perform very differently. In particular, all three LOESS normalization methods perform better than no normalization.
It is necessary to do within-array normalization, and the two LOESS normalization methods based on specific DMH internal control probes produce more stable and relatively better results than the global LOESS normalization method.
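For readers unfamiliar with within-array LOESS normalization, the general idea can be written out as below. This is a generic sketch of intensity-dependent normalization anchored on control probes, not a statement of either of the two specific methods proposed in the paper; the channel symbols $R$ and $G$, the control-probe set $\mathcal{C}$, and the fitted curve $\hat{f}$ are notation assumed here for illustration.

```latex
% Log ratio and average log intensity for probe i (two-color array)
M_i = \log_2 \frac{R_i}{G_i}, \qquad
A_i = \tfrac{1}{2} \log_2 \left( R_i \, G_i \right)

% Fit a LOESS curve to the control probes only, whose expected M is zero
\hat{f} = \operatorname{LOESS}\bigl( \{ (A_j, M_j) : j \in \mathcal{C} \} \bigr)

% Subtract the fitted intensity-dependent bias from every probe
M_i^{\text{norm}} = M_i - \hat{f}(A_i)
```

Global LOESS differs only in that the curve is fitted to all probes rather than to the control set, which implicitly assumes most probes are not differentially methylated.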
DNA methylation has been shown to play an important role in the silencing of tumor suppressor genes in various tumor types. In order to have a system-wide understanding of the methylation changes that occur in tumors, we have developed a differential methylation hybridization (DMH) protocol that can simultaneously assay the methylation status of all known CpG islands (CGIs) using microarray technologies. A large percentage of signals obtained from microarrays can be attributed to various measurable and unmeasurable confounding factors unrelated to the biological question at hand. In order to correct the bias due to noise, we first implemented a quantile regression model, with a quantile level equal to 75%, to identify hypermethylated CGIs in an earlier work. As a proof of concept, we applied this model to methylation microarray data generated from breast cancer cell lines. However, we were unsure whether 75% was the best quantile level for identifying hypermethylated CGIs. In this paper, we attempt to determine which quantile level should be used to identify hypermethylated CGIs and their associated genes.
We introduce three statistical measurements to compare the performance of the proposed quantile regression model at different quantile levels (95%, 90%, 85%, 80%, 75%, 70%, 65%, 60%), using known methylated genes and unmethylated housekeeping genes reported in breast cancer cell lines and ovarian cancer patients. Our results show that the quantile levels ranging from 80% to 90% are better at identifying known methylated and unmethylated genes.
In this paper, we propose to use a quantile regression model to identify hypermethylated CGIs by incorporating probe effects to account for noise due to unmeasurable factors. Our model can efficiently identify hypermethylated CGIs in both breast and ovarian cancer data.
SF2/ASF is a prototypical SR protein, with important roles in splicing and other aspects of mRNA metabolism. SFRS1 (SF2/ASF) is a potent proto-oncogene with abnormal expression in many tumors. We found that SF2/ASF negatively autoregulates its expression to maintain homeostatic levels. We characterized six SF2/ASF alternatively spliced mRNA isoforms: the major isoform encodes full-length protein, whereas the others are either retained in the nucleus or degraded by NMD. Unproductive splicing accounts for only part of the autoregulation, which occurs primarily at the translational level. The effect is specific to SF2/ASF and requires RRM2. The ultraconserved 3′UTR is necessary and sufficient for downregulation. SF2/ASF overexpression shifts the distribution of target mRNA towards mono-ribosomes, and translational repression is partly independent of Dicer and a 5′ cap. Thus, multiple post-transcriptional and translational mechanisms are involved in fine-tuning the expression of SF2/ASF.
Several founder mutations leading to increased risk of cancer among Ashkenazi Jewish individuals have been identified, and some estimates of the age of the mutations have been published. A variety of different methods have been used previously to estimate the age of the mutations. Here three datasets containing genotype information near known founder mutations are reanalyzed in order to compare three approaches for estimating the age of a mutation. The methods are: (a) the single marker method used by Risch et al. (1995); (b) the intra-allelic coalescent model known as DMLE; and (c) the Goldgar method proposed in Neuhausen et al. (1996), and modified slightly by our group. The three mutations analyzed were MSH2*1906 G->C, APC*I1307K, and BRCA2*6174delT.
All methods depend on accurate estimates of inter-marker recombination rates. The modified Goldgar method allows for marker mutation as well as recombination, but requires prior estimates of the possible haplotypes carrying the mutation for each individual. It does not incorporate population growth rates. The DMLE method simultaneously estimates the haplotypes with the mutation age, and builds in the population growth rate. The single marker estimates, however, are more sensitive to the recombination rates and are unstable. Mutation age estimates based on DMLE are 16.8 generations for MSH2 (95% credible interval (13, 23)), 106 generations for I1307K (86-129), and 90 generations for 6174delT (71-114).
For recent founder mutations where marker mutations are unlikely to have occurred, both DMLE and the Goldgar method can give good results. Caution is necessary for older mutations, especially if the effective population size may have remained small for a long period of time.
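The sensitivity of single-marker estimates to the recombination rate can be seen from a small sketch. It uses a commonly cited form of the single-marker estimator, in which linkage disequilibrium decays by a factor of (1 - theta) per generation; the abstract itself does not state the formula, so treating this as the exact estimator used in the study is an assumption, and the input values below are invented.

```python
import math


def single_marker_age(p_carrier, p_normal, theta):
    """Single-marker estimate of mutation age in generations.

    p_carrier: frequency of the associated marker allele on chromosomes
               carrying the mutation.
    p_normal:  frequency of the same allele on non-carrier chromosomes.
    theta:     recombination fraction between marker and mutation.

    delta measures the remaining disequilibrium, and the age solves
    delta = (1 - theta) ** g for g.
    """
    if not 0.0 < theta < 1.0:
        raise ValueError("theta must lie strictly between 0 and 1")
    delta = (p_carrier - p_normal) / (1.0 - p_normal)
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must lie in (0, 1] for a finite age")
    return math.log(delta) / math.log(1.0 - theta)


# Invented inputs, for illustration only.
age = single_marker_age(p_carrier=0.9, p_normal=0.2, theta=0.01)
age_half_theta = single_marker_age(p_carrier=0.9, p_normal=0.2, theta=0.005)
```

Because log(1 - theta) is close to -theta for small theta, halving the assumed recombination fraction roughly doubles the estimated age, which is one way to see why the method depends so strongly on accurate recombination rates.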
The interplay between histone modifications and promoter hypermethylation provides a causative explanation for epigenetic gene silencing in cancer. Less is known about the upstream initiators that direct this process. Here, we report that the Cystatin M (CST6) tumor suppressor gene is concurrently down-regulated with other loci in breast epithelial cells co-cultured with cancer-associated fibroblasts (CAFs). Promoter hypermethylation of CST6 is associated with aberrant AKT1 activation in epithelial cells, as well as with disabling of the INPP4B regulator resulting from suppression by CAFs. Repressive chromatin, marked by trimethyl-H3K27 and dimethyl-H3K9, and de novo DNA methylation are established at the promoter. The findings suggest that microenvironmental stimuli are triggers in this epigenetic cascade, leading to the long-term silencing of CST6 in breast tumors. Our present findings implicate a causal mechanism defining how tumor stromal fibroblasts support neoplastic progression by manipulating the epigenome of mammary epithelial cells. The result also highlights the importance of direct cell-cell contact between epithelial cells and the surrounding fibroblasts that confers this epigenetic perturbation. Since a two-way interaction is anticipated, the described co-culture system can be used to determine the effect of epithelial factors on fibroblasts in future studies.
DNA methylation plays an important role in the process of tumorigenesis. Identifying differentially methylated genes or CpG islands (CGIs) associated with genes between two tumor subtypes is thus an important biological question. The methylation status of all CGIs in the whole genome can be assayed with differential methylation hybridization (DMH) microarrays. However, patient samples or cell lines are heterogeneous, so their methylation pattern may be very different. In addition, neighboring probes at each CGI are correlated. How these factors affect the analysis of DMH data is unknown.
We propose a new method for identifying differentially methylated (DM) genes by identifying the associated DM CGI(s). At each CGI, we implement four different mixed effect and generalized least square models to identify DM genes between two groups. We compare these four models with a simple least square regression model to study the impact of incorporating random effects and correlations.
We demonstrate that the inclusion (or exclusion) of random effects and the choice of correlation structures can significantly affect the results of the data analysis. We also assess the false discovery rate of different models using CGIs associated with housekeeping genes.