Regulation of gene expression has been shown to involve not only the binding of transcription factors at target gene promoters but also the characteristics of the histones around which DNA is wrapped. Some histone modifications, for example di-methylation of histone H3 at lysine 4 (H3K4me2), have been shown to occur at promoters and activate target genes. However, no clear pattern has been shown to predict human promoters. This paper proposes a novel quantitative approach to characterize patterns of promoter regions and predict novel and alternative promoters. We utilized high-throughput data generated using chromatin immunoprecipitation followed by massively parallel sequencing (ChIP-seq) on RNA Polymerase II (Pol-II) and H3K4me2. Common patterns of promoter regions are modeled using a mixture model involving double-exponential and uniform distributions. The fitted model was then used to search the entire genome for regions displaying similar patterns in order to find novel and alternative promoters. Regions with high correlations with the common patterns are identified as putative novel promoters. We used the proposed algorithm, RNA-seq data and several transcript databases to find alternative promoters in the MCF7 (normal breast cancer) cell line. We found 7,235 high-confidence regions that display the identified promoter patterns. Of these, 4,167 regions (58%) can be mapped to RefSeq regions. A further 2,444 regions are in a gene body or overlap with transcripts (non-coding RNAs, ESTs, and transcripts that are predicted by RNA-seq data). Some of these may be potential alternative promoters. We also found 193 regions that map to enhancer regions (represented by androgen and estrogen receptor binding sites) and other regulatory regions such as CTCF (CCCTC binding factor) binding sites and CpG islands.
Around 5% (431 regions) of these correlated regions do not overlap with any transcripts or regulatory regions, suggesting that they might be potential new promoters or markers for other annotations that are currently undiscovered.
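The pattern-matching idea in this abstract can be illustrated with a small sketch. It assumes a two-component mixture of a double-exponential (Laplace) density and a uniform density as the template, and a sliding-window Pearson correlation as the similarity score. All parameter values, the bin layout, the threshold and the function names are hypothetical; the abstract does not specify how the model was fitted or how the correlation was computed.

```python
import math

def laplace_uniform_density(x, weight, location, scale, lower, upper):
    """Mixture density: weight * Laplace(location, scale) + (1 - weight) * Uniform(lower, upper)."""
    laplace = math.exp(-abs(x - location) / scale) / (2.0 * scale)
    uniform = 1.0 / (upper - lower) if lower <= x <= upper else 0.0
    return weight * laplace + (1.0 - weight) * uniform

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def scan_for_pattern(signal, template, threshold):
    """Return (start_bin, correlation) for every window whose correlation reaches the threshold."""
    width = len(template)
    hits = []
    for start in range(len(signal) - width + 1):
        r = pearson(signal[start:start + width], template)
        if r >= threshold:
            hits.append((start, r))
    return hits

# Hypothetical template: 21 bins centred on an assumed transcription start site.
template = [laplace_uniform_density(p, 0.8, 0.0, 2.0, -10.0, 10.0) for p in range(-10, 11)]

# Toy binned signal: flat background with one copy of the template starting at bin 30.
signal = [0.01] * 80
for offset, value in enumerate(template):
    signal[30 + offset] += value

hits = scan_for_pattern(signal, template, threshold=0.95)
best_start, best_r = max(hits, key=lambda hit: hit[1])
```

In this toy run the best-scoring window is the one where the template was planted; real ChIP-seq profiles would need binning, normalization and a fitted template before such a scan is meaningful.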
A large number of risk alleles have been identified for multiple sclerosis (MS). However, how genetic variations may affect pathogenesis remains largely unknown for most risk alleles. Through direct sequencing of the CD24 promoter region, we identified a cluster of 7 new single nucleotide polymorphisms (SNPs) in the CD24 promoter. A hypermorphic haplotype consisting of 3 SNPs was identified through association studies comprising 935 controls and 764 MS patients (P = 0.001, odds ratio 1.3). The variant is also associated with more rapid progression of MS (P = 0.016, log rank test). In cells that are heterozygous for the risk allele, chromatin immunoprecipitation revealed that the risk allele specifically binds to the transcription factor SP1, which is selectively required for the hypermorphic promoter activity of the variant. In MS patients, the CD24 transcript levels associate with the SP1-binding variant in a dose-dependent manner (P = 7 × 10⁻⁴). Our data reveal a potential role for SP1-mediated transcriptional regulation in MS pathogenesis.
Multiple sclerosis (MS); SP-binding CD24; promoter; risk alleles; single nucleotide polymorphisms (SNP)
For a diallelic genetic marker locus, tests like the parental-asymmetry test (PAT) are simple and powerful for detecting parent-of-origin effects. However, these approaches are applicable only to qualitative traits and thus are currently not suitable for quantitative traits. In this paper, the authors propose a novel class of PAT-type parent-of-origin effects tests for quantitative traits in families with both parents and an arbitrary number of children, which is denoted by Q-PAT(c) for some constant c. The authors further develop Q-1-PAT(c) for detection of parent-of-origin effects when information is available on only 1 parent in each family. The authors suggest the Q-C-PAT(c) test for combining families with data on both parental genotypes and families with data on only 1 parental genotype. Simulation studies show that the proposed tests control the empirical type I error rates well under the null hypothesis of no parent-of-origin effects. Power comparison also demonstrates that the proposed methods are more powerful than the existing likelihood ratio test. Although normality is commonly assumed in methods for studying quantitative traits, the tests proposed in this paper do not make any assumption about the distribution of the quantitative trait.
genomic imprinting; quantitative trait loci
Summary: Differential Identification using Mixtures Ensemble (DIME) is a package for identification of biologically significant differential binding sites between two conditions using ChIP-seq data. It considers a collection of finite mixture models combined with a false discovery rate (FDR) criterion to find statistically significant regions. This leads to a more reliable assessment of differential binding sites based on a statistical approach. In addition to ChIP-seq, DIME is also applicable to data from other high-throughput platforms.
Availability and implementation: DIME is implemented as an R-package, which is available at http://www.stat.osu.edu/~statgen/SOFTWARE/DIME. It may also be downloaded from http://cran.r-project.org/web/packages/DIME/.
Recently, much attention has been given to elucidating how long-range gene regulation comes into play and how histone modifications and distal transcription factor binding contribute to this mechanism. Androgen receptor (AR), a key regulator of prostate cancer, has been shown to regulate its target genes via distal enhancers, leading to the hypothesis of global long-range gene regulation. However, despite the steady flow of newly generated data, the precise mechanism of AR-mediated long-range gene regulation is still largely unknown. In this study, we carried out an integrated analysis combining several types of high-throughput data, including genome-wide distribution data of H3K4 di-methylation (H3K4me2), CCCTC binding factor (CTCF), AR and FoxA1 cistrome data as well as androgen-regulated gene expression data. We found that a subset of androgen-responsive genes was significantly enriched near AR/H3K4me2 overlapping regions and FoxA1 binding sites within the same CTCF block. Importantly, genes in this class were enriched in cancer-related pathways and were downregulated in clinical metastatic versus localized prostate cancer. Our results suggest a relatively short combinatorial long-range regulation mechanism facilitated by CTCF blocking. Under such a mechanism, H3K4me2, AR and FoxA1 within the same CTCF block combinatorially regulate a subset of distally located androgen-responsive genes involved in prostate carcinogenesis.
Genome-wide association studies are largely based on single-nucleotide polymorphisms and rest on the common disease/common variants (single-nucleotide polymorphisms) hypothesis. However, it has been argued in the last few years and is well accepted now that rare variants are valuable for studying common diseases. Although current genome-wide association studies have successfully discovered many genetic variants that are associated with common diseases, detecting associated rare variants remains a great challenge. Here, we propose two partial least-squares approaches to aggregate the signals of many single-nucleotide polymorphisms (SNPs) within a gene to reveal possible genetic effects related to rare variants. The availability of the 1000 Genomes Project offers us the opportunity to evaluate the effectiveness of these two gene-based approaches. Compared to results from a SNP-based analysis, the proposed methods were able to identify some (rare) SNPs that were missed by the SNP-based analysis.
DNA methylation plays a very important role in the silencing of tumor suppressor genes in various tumor types. In order to gain a genome-wide understanding of how changes in methylation affect tumor growth, the differential methylation hybridization (DMH) protocol has been developed and large amounts of DMH microarray data have been generated. However, it is still unclear how to preprocess this type of microarray data and how the different background correction and normalization methods used for two-color gene expression arrays perform on methylation microarray data. In this paper, we report the discovery of a set of internal control probes that have log ratios (M) theoretically equal to zero according to this DMH protocol. With the aid of this set of control probes, we propose two LOESS (or LOWESS, locally weighted scatter-plot smoothing) normalization methods that are novel and unique to DMH microarray data. Together with two alternatives (global LOESS and no normalization), this gives four normalization methods to compare. In addition, we compare five different background correction methods.
We study 20 different preprocessing methods, which are the combinations of five background correction methods and four normalization methods. In order to compare these 20 methods, we evaluate their performance in identifying known methylated and un-methylated housekeeping genes based on two statistics. Comparison details are illustrated using breast cancer cell line and ovarian cancer patient methylation microarray data. Our comparison results show that the different background correction methods perform similarly; however, the four normalization methods perform very differently. In particular, all three LOESS normalization methods perform better than no normalization.
It is necessary to do within-array normalization, and the two LOESS normalization methods based on specific DMH internal control probes produce more stable and relatively better results than the global LOESS normalization method.
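One plausible reading of control-probe-based normalization is sketched below: a smooth intensity-dependent trend is fitted to the control probes only (whose M values should be zero in theory) and then subtracted from every probe. The abstract does not give the details of the two proposed methods, so this is an illustration under assumptions, not the authors' procedure. The smoother is a single tricube-weighted local linear fit without robustness iterations, the average log intensity A as the covariate is an assumption borrowed from standard two-color array practice, and all names and toy values are hypothetical.

```python
def local_linear_fit(x0, xs, ys, frac=0.5):
    """Tricube-weighted local linear regression of ys on xs, evaluated at x0."""
    k = max(2, int(round(frac * len(xs))))
    distances = sorted(abs(x - x0) for x in xs)
    bandwidth = distances[k - 1]
    weights = []
    for x in xs:
        u = abs(x - x0) / bandwidth if bandwidth > 0 else 1.0
        weights.append((1.0 - u ** 3) ** 3 if u < 1.0 else 0.0)
    total = sum(weights)
    if total == 0.0:
        # Degenerate neighbourhood: fall back to the plain mean of the control values.
        return sum(ys) / len(ys)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return mean_y + slope * (x0 - mean_x)

def normalize_with_controls(probe_a, probe_m, control_a, control_m, frac=0.5):
    """Subtract the trend fitted on control probes (expected M = 0) from every probe's M."""
    return [m - local_linear_fit(a, control_a, control_m, frac)
            for a, m in zip(probe_a, probe_m)]

# Toy data: control probes show a purely technical, intensity-dependent bias.
control_a = [float(a) for a in range(6, 15)]
control_m = [0.5 + 0.1 * a for a in control_a]

# Hypothetical probes carrying the same bias plus a true log-ratio signal.
true_signal = [0.0, 2.0, -1.0, 0.0]
probe_a = [7.2, 9.3, 10.5, 15.0]
probe_m = [0.5 + 0.1 * a + s for a, s in zip(probe_a, true_signal)]

normalized = normalize_with_controls(probe_a, probe_m, control_a, control_m)
```

Fitting on the control probes alone, rather than on all probes as in global LOESS, avoids assuming that most probes are unchanged between the two channels.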
DNA methylation has been shown to play an important role in the silencing of tumor suppressor genes in various tumor types. In order to have a system-wide understanding of the methylation changes that occur in tumors, we have developed a differential methylation hybridization (DMH) protocol that can simultaneously assay the methylation status of all known CpG islands (CGIs) using microarray technologies. A large percentage of signals obtained from microarrays can be attributed to various measurable and unmeasurable confounding factors unrelated to the biological question at hand. In order to correct the bias due to noise, we first implemented a quantile regression model, with a quantile level equal to 75%, to identify hypermethylated CGIs in an earlier work. As a proof of concept, we applied this model to methylation microarray data generated from breast cancer cell lines. However, we were unsure whether 75% was the best quantile level for identifying hypermethylated CGIs. In this paper, we attempt to determine which quantile level should be used to identify hypermethylated CGIs and their associated genes.
We introduce three statistical measurements to compare the performance of the proposed quantile regression model at different quantile levels (95%, 90%, 85%, 80%, 75%, 70%, 65%, 60%), using known methylated genes and unmethylated housekeeping genes reported in breast cancer cell lines and ovarian cancer patients. Our results show that the quantile levels ranging from 80% to 90% are better at identifying known methylated and unmethylated genes.
In this paper, we propose to use a quantile regression model to identify hypermethylated CGIs by incorporating probe effects to account for noise due to unmeasurable factors. Our model can efficiently identify hypermethylated CGIs in both breast and ovarian cancer data.
Genomic imprinting is an important epigenetic factor in complex traits study, which has generally been examined by testing for parent-of-origin effects of alleles. For a diallelic marker locus, the parental-asymmetry test (PAT) based on case-parents trios and its extensions to incomplete nuclear families (1-PAT and C-PAT) are simple and powerful for detecting parent-of-origin effects. However, these methods are suitable only for nuclear families and thus are not amenable to general pedigree data. Use of data from extended pedigrees, if available, may lead to more powerful methods than randomly selecting one two-generation nuclear family from each pedigree. In this study, we extend PAT to accommodate general pedigree data by proposing the pedigree PAT (PPAT) statistic, which uses all informative family trios from pedigrees. To fully utilize pedigrees with some missing genotypes, we further develop the Monte Carlo (MC) PPAT (MCPPAT) statistic based on MC sampling and estimation. Extensive simulations were carried out to evaluate the performance of the proposed methods. Under the assumption that the pedigrees and their associated affection patterns are randomly drawn from a population of pedigrees with at least one affected offspring, we demonstrated that MCPPAT is a valid test for parent-of-origin effects in the presence of association. Further, MCPPAT is much more powerful compared to PAT for trios or even PPAT for all informative family trios from the same pedigrees if there is missing data. Application of the proposed methods to a rheumatoid arthritis dataset further demonstrates the advantage of MCPPAT.
genomic imprinting; Monte Carlo sample; missing genotype; rheumatoid arthritis
Motivation: Antibody-based chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) is a relatively new method for studying the binding patterns of specific protein molecules over the entire genome. ChIP-seq technology allows scientists to obtain more comprehensive results in a shorter time. Here, we present a non-linear normalization algorithm and a mixture modeling method for comparing ChIP-seq data from multiple samples and characterizing genes based on their RNA polymerase II (Pol II) binding patterns.
Results: We apply a two-step non-linear normalization method based on locally weighted regression (LOESS) to compare ChIP-seq data across multiple samples, and we model the difference using an Exponential-NormalK mixture model. The fitted model is used to identify genes associated with differential binding sites based on the local false discovery rate (fdr). These genes are then standardized and hierarchically clustered to characterize their Pol II binding patterns. As a case study, we apply the analysis procedure to compare a normal breast cancer (MCF7) cell line with a tamoxifen-resistant (OHT) cell line. We find enriched regions that are associated with cancer (P < 0.0001). Our findings also imply that there may be a dysregulation of cell cycle and gene expression control pathways in the tamoxifen-resistant cells. These results show that the non-linear normalization method can be used to analyze ChIP-seq data across multiple samples.
Availability: Data are available at http://www.bmi.osu.edu/~khuang/Data/ChIP/RNAPII/
Contact: email@example.com; firstname.lastname@example.org
Supplementary information: Supplementary data are available at Bioinformatics online.
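The local false discovery rate used in this abstract has the generic form fdr(x) = pi0 * f0(x) / f(x), where f0 is the null density and f the full mixture density. The sketch below evaluates it for a deliberately simplified mixture: one zero-centred normal component for non-differential sites and two mirrored, shifted exponential components for differential sites. The component layout, all parameter values and the 0.2 cutoff are assumptions chosen for illustration; the parameters are fixed by hand rather than estimated, and this is not the fitted Exponential-NormalK model from the paper.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def shifted_exponential_pdf(x, rate, shift):
    """Density of an exponential distribution supported on x > shift."""
    return rate * math.exp(-rate * (x - shift)) if x > shift else 0.0

def local_fdr(x, null_weight, null_sd, rate, shift):
    """Local fdr: null part of the mixture density divided by the full mixture density."""
    null_part = null_weight * normal_pdf(x, 0.0, null_sd)
    tail_weight = (1.0 - null_weight) / 2.0  # split equally between both tails
    alt_part = tail_weight * (shifted_exponential_pdf(x, rate, shift)
                              + shifted_exponential_pdf(-x, rate, shift))
    return null_part / (null_part + alt_part)

# Hypothetical normalized differences between two samples.
differences = [-4.2, -0.3, 0.1, 0.8, 1.5, 3.9]
params = dict(null_weight=0.9, null_sd=1.0, rate=1.0, shift=1.0)

# Call a site differential when its local fdr falls below an assumed cutoff of 0.2.
called = [d for d in differences if local_fdr(d, **params) < 0.2]
```

With these toy values only the two extreme differences are called; in practice the mixture weights and component parameters would be estimated from the data, for example by an EM algorithm.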
Whole genome association studies (WGAS) have surged in popularity in recent years as technological advances have made large-scale genotyping more feasible and as new exciting results offer tremendous hope and optimism. The logic of WGAS rests upon the common disease/common variant (CD/CV) hypothesis. Detection of association under the common disease/rare variant (CD/RV) scenario is much harder, and the current practices of WGAS may be underpowered without large enough sample sizes. In this paper, we propose a generalized linear model with regularization (rGLM) approach for detecting disease-haplotype association using unphased single nucleotide polymorphism data that is applicable to both CD/CV and CD/RV scenarios. We borrow a dimension-reduction method from the data mining and statistical learning literature, but use it for the purpose of weeding out haplotypes that are not associated with the disease so that the associated haplotypes, especially those that are rare, can stand out and be accounted for more precisely. By using high-dimensional data analysis techniques, which are frequently employed in microarray analyses, interacting effects among haplotypes in different blocks can be investigated without much concern about the sample size being overwhelmed by the number of haplotype combinations. Our simulation study demonstrates the gain in power for detecting associations with moderate sample sizes. For detecting association under CD/RV, regression-type methods such as that implemented in hapassoc may fail to provide coefficient estimates for rare associated haplotypes, resulting in a loss of power compared to rGLM. Furthermore, our results indicate that rGLM can uncover the associated variants much more frequently than hapassoc can.
whole genome association study; interacting effects between haplotype blocks; dimension reduction; regularization/LASSO; case-control design
Differential Methylation Hybridization (DMH) is a high-throughput DNA methylation screening tool that utilizes methylation-sensitive restriction enzymes to profile methylated fragments by hybridizing them to a CpG island microarray. This array contains probes spanning all the 27,800 islands annotated in the UCSC Genome Browser. Herein we describe a DMH protocol with clearly identified quality control points. In this manner, samples that are unlikely to provide good read-outs for differential methylation profiles between the test and the control samples will be identified and repeated with appropriate modifications. The step-by-step laboratory DMH protocol is described. In addition, we provide descriptions regarding DMH data analysis, including image quantification, background correction, and statistical procedures for both exploratory analysis and more formal inferences. Issues regarding quality control are addressed as well.
DNA methylation; Differential Methylation Hybridization (DMH); CpG islands (CGI); microarray
Microarray technology has made it possible to investigate expression levels, and more recently methylation signatures, of thousands of genes simultaneously, in a biological sample. Since more and more data from different biological systems or technological platforms are being generated at an incredible rate, there is an increasing need to develop statistical methods that are applicable to multiple data types and platforms. Motivated by such a need, a flexible finite mixture model that is applicable to methylation, gene expression, and potentially data from other biological systems, is proposed. Two major thrusts of this approach are to allow for a variable number of components in the mixture to capture non-biological variation and small biases, and to use a robust procedure for parameter estimation and probe classification. The method was applied to the analysis of methylation signatures of three breast cancer cell lines. It was also tested on three sets of expression microarray data to study its power and type I error rates. Comparison with a number of existing methods in the literature yielded very encouraging results; lower type I error rates and comparable/better power were achieved based on the limited study. Furthermore, the method also leads to more biologically interpretable results for the three breast cancer cell lines.
Mixture models; Epigenetics; DNA methylation; gene expression; fdr; weight function
Several reports have been published regarding the use of cyclosporine (CSA) in the treatment of idiopathic thrombotic thrombocytopenic purpura (TTP). We hypothesized that prophylactic CSA therapy may prevent recurrences in patients with a history of multiple relapses of TTP. Nineteen patients with idiopathic TTP were enrolled in prospective studies at Ohio State University between September 2003 and May 2007. Patients achieving remission remained on CSA therapy for 6 months, allowing us to evaluate the efficacy of CSA as prophylactic therapy. CSA was administered orally at a dose of 2–3 mg/kg per day in two divided doses in all patients and continued for a total of 6 months. Long-term clinical follow-up with serial analysis of ADAMTS13 biomarkers during and after CSA therapy was performed to evaluate the efficacy of CSA as a prophylactic therapy. Seventeen of 19 patients (89%) completed 6 months of CSA therapy in continuous remission. Two patients relapsed during therapy with CSA and 7 patients relapsed after discontinuing CSA therapy. Ten patients have maintained a continuous remission for a median of 21 months (range, 5 to 46) after discontinuing CSA. The ADAMTS13 data suggest that CSA resulted in a significant increase in ADAMTS13 activity during therapy. Eight of 9 relapsing patients (89%) had severely deficient ADAMTS13 activity (< 5%), suggesting that this is a significant risk factor for relapse of TTP. These data support the hypothesis that prophylactic CSA improves ADAMTS13 activity and may be effective at preventing relapses in patients at risk for recurrences of TTP.
thrombotic thrombocytopenic purpura; ADAMTS13; cyclosporine; relapse; prophylactic therapy
The interplay between histone modifications and promoter hypermethylation provides a causative explanation for epigenetic gene silencing in cancer. Less is known about the upstream initiators that direct this process. Here, we report that the Cystatin M (CST6) tumor suppressor gene is concurrently down-regulated with other loci in breast epithelial cells co-cultured with cancer-associated fibroblasts (CAFs). Promoter hypermethylation of CST6 is associated with aberrant AKT1 activation in epithelial cells, as well as with the disabled INPP4B regulator resulting from suppression by CAFs. Repressive chromatin, marked by trimethyl-H3K27 and dimethyl-H3K9, and de novo DNA methylation are established at the promoter. The findings suggest that microenvironmental stimuli are triggers in this epigenetic cascade, leading to the long-term silencing of CST6 in breast tumors. Our present findings implicate a causal mechanism defining how tumor stromal fibroblasts support neoplastic progression by manipulating the epigenome of mammary epithelial cells. The result also highlights the importance of direct cell-cell contact between epithelial cells and the surrounding fibroblasts, which confers this epigenetic perturbation. Since this two-way interaction is anticipated, the described co-culture system can be used to determine the effect of epithelial factors on fibroblasts in future studies.
For a diallelic marker locus, the parental-asymmetry test (PAT) based on case-parents trios and its extensions to accommodate incomplete nuclear families (1-PAT and C-PAT) are simple and powerful approaches to test for parent-of-origin effects. However, haplotype analysis is generally regarded as advantageous over single-marker analysis in the genetic study of common complex diseases. This is mainly due to the fact that complex diseases are often associated with multiple markers. As such, HAP-PAT was constructed to test for parent-of-origin effects in the framework of haplotype analysis. However, its applicability is limited due to the need for complete parental information. In this paper, for nuclear families with only one parent and multiple affected children, we develop HAP-1-PAT to test for parent-of-origin effects using multiple tightly linked markers. We further propose HAP-C-PAT to combine data from families with both parents and those with only one parent. We carry out a simulation study to evaluate the validity and power of the test statistics in various settings, including incomplete family rates, marker/disease-locus linkage disequilibrium patterns, and population models. We perform analysis for all possible combinations of the markers being considered. A permutation-based Monte Carlo procedure is devised to determine the significance of the tests; the corrected global p values, which take multiple testing into account, are used for inference. The results show that HAP-1-PAT and HAP-C-PAT work well even under the population stratification demographic model and the assortative mating demographic model. Furthermore, for the disease models considered, there are significant gains in power from haplotype analysis compared to single-marker analysis, and from combined analysis using HAP-C-PAT compared to analysis using HAP-PAT for the complete family data only.
Parent-of-origin effects; Haplotype analysis; Single-marker analysis; Missing parent; Incomplete nuclear family; Complete nuclear family; Multiple testing; Population stratification demographic model; Assortative mating demographic model
The Genetic Analysis Workshop 16 rheumatoid arthritis data include a set of 868 cases and 1194 controls genotyped at 545,080 single-nucleotide polymorphisms (SNPs) from the Illumina 550k chip. We focus on investigating chromosomes 6 and 18, which have 35,574 and 16,450 SNPs, respectively. Association studies, including single-SNP and haplotype-based analyses, were applied to the data on those two chromosomes. Specifically, we applied a generalized linear model with regularization (rGLM) approach for detecting disease-haplotype association using unphased SNP data. A total of 444 and 43 four-SNP tests were found to be significant at the Bonferroni-corrected 5% significance level on chromosomes 6 and 18, respectively.
Both imprinting and maternal effects could lead to parent-of-origin patterns in complex traits of human disorders. Statistical methods that differentiate these two effects and identify them simultaneously by using family-based data from retrospective studies are available. The usual data structures include case-parents triads and nuclear families with multiple affected siblings. We develop a likelihood-based method to detect imprinting and maternal effects simultaneously using data from prospective studies. The proposed method utilizes both affected and unaffected siblings in nuclear families by modeling familial genotypes and offspring's disease status jointly. Maternal effect is usually modeled as a fixed effect under the assumption that maternal variant allele(s) has (have) identical effect on any offspring. However, recent studies report that different people may carry different amounts of substances encoded by the mother's variant allele(s) (called maternal microchimerism), which could result in heterogeneity of maternal effects. The proposed method incorporates the heterogeneity of maternal effects by adding a random component to the logit of the penetrance. Our method was applied to the Framingham Heart Study data in two steps to detect single-nucleotide polymorphisms (SNPs) that may be associated with high blood pressure. In the first step, SNPs that affect susceptibility of high blood pressure through minor allele, genomic imprinting, or maternal effects were identified by using the proposed model without the random effect component. In the second step, we fitted the mixed effect model to the identified SNPs that have significant maternal effect to detect heterogeneity of the maternal effects.
DNA methylation plays an important role in the process of tumorigenesis. Identifying differentially methylated genes or CpG islands (CGIs) associated with genes between two tumor subtypes is thus an important biological question. The methylation status of all CGIs in the whole genome can be assayed with differential methylation hybridization (DMH) microarrays. However, patient samples or cell lines are heterogeneous, so their methylation pattern may be very different. In addition, neighboring probes at each CGI are correlated. How these factors affect the analysis of DMH data is unknown.
We propose a new method for identifying differentially methylated (DM) genes by identifying the associated DM CGI(s). At each CGI, we implement four different mixed effect and generalized least squares models to identify DM genes between two groups. We compare these four models with a simple least squares regression model to study the impact of incorporating random effects and correlations.
We demonstrate that the inclusion (or exclusion) of random effects and the choice of correlation structures can significantly affect the results of the data analysis. We also assess the false discovery rate of different models using CGIs associated with housekeeping genes.
Parent-of-origin effects are important in studying genetic traits. More than 1% of all mammalian genes are believed to show parent-of-origin effects. Some statistical methods may be ineffective or fail to detect linkage or association for a gene with parent-of-origin effects. Based on case-parents trios, the parental-asymmetry test (PAT) is simple and powerful in detecting parent-of-origin effects. However, it is common in practice to collect nuclear families with both parents as well as nuclear families with only one parent. In this paper, when only one parent is available for each family with an arbitrary number of affected children, we first develop a new test statistic, 1-PAT, to test for parent-of-origin effects in the presence of association between an allele at the marker locus under study and a disease gene. Then we extend the PAT to accommodate complete nuclear families each with one or more affected children. Combining families with both parents and families with only one parent, the C-PAT is proposed to detect parent-of-origin effects. The validity of the test statistics is verified by simulation in various scenarios of parameter values. A power study shows that using the additional information from incomplete nuclear families in the analysis greatly improves the power of the tests, compared to that based on only complete nuclear families. Also, by utilizing all affected children in each family, the proposed tests have higher power than when only one affected child from each family is selected. Additional power comparison also demonstrates that the C-PAT is more powerful than a number of other tests for detecting parent-of-origin effects.
Parent-of-origin effects; Genomic imprinting; Missing parent; Incomplete nuclear family; Complete nuclear family; Genotypic relative risk; Population stratification demographic model; Assortative mating demographic model
Linkage disequilibrium (LD) plays a central role in fine mapping of disease genes and, more recently, in characterizing haplotype blocks. Classical LD measures, such as D′ and r², are frequently used to quantify the relationship between two loci. A pairwise “distance” matrix among a set of loci can be constructed using such a measure, and a number of haplotype block detection and tagging single nucleotide polymorphism (SNP) selection algorithms have been devised on the basis of this matrix. Although successful in many applications, the pairwise nature of these measures does not provide a direct characterization of joint linkage disequilibrium among multiple loci. Consequently, applications based on them may lead to loss of important information. In this paper we propose a multilocus LD measure based on generalized mutual information, which is also known as relative entropy or Kullback-Leibler distance. In essence, this measure seeks to quantify the distance between the observed haplotype distribution and the expected distribution assuming linkage equilibrium. We can show that this measure is approximately equal to r² in the special case with two loci. Based on this multilocus LD measure and an entropy measure that characterizes haplotype diversity, we propose a class of stepwise tagging SNP selection algorithms. This represents a unified approach for SNP selection in that it takes into account both the haplotype diversity and linkage disequilibrium objectives. Applications to both simulated and real data demonstrate the utility of the proposed methods for handling a large number of SNPs. The results indicate that multilocus LD patterns can be captured well, and informative and nonredundant SNPs can be selected effectively from a large set of loci.
multilocus linkage disequilibrium; Kullback-Leibler distance; relative information; tagging SNP; haplotype diversity; stepwise selection algorithm
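As a rough illustration of the multilocus measure described in the abstract above, the following sketch computes the Kullback-Leibler divergence between the observed haplotype distribution and the distribution expected under linkage equilibrium (the product of per-locus allele frequencies). The function name `multilocus_ld_kl`, the encoding of haplotypes as strings, and the use of natural logarithms without any normalization are assumptions made for illustration; the exact definition and scaling used in the paper may differ.

```python
import math
from collections import Counter

def multilocus_ld_kl(haplotypes):
    """KL divergence (in nats) between the observed haplotype frequencies
    and the product of per-locus allele frequencies (linkage equilibrium).

    `haplotypes` is a list of equal-length strings, one allele per locus.
    Hypothetical helper for illustration; not the paper's implementation.
    """
    n = len(haplotypes)
    n_loci = len(haplotypes[0])
    # Observed haplotype frequencies.
    hap_freq = {h: c / n for h, c in Counter(haplotypes).items()}
    # Allele counts at each locus, used for the equilibrium expectation.
    allele_counts = [Counter(h[j] for h in haplotypes) for j in range(n_loci)]
    kl = 0.0
    for hap, p in hap_freq.items():
        q = 1.0
        for j, allele in enumerate(hap):
            q *= allele_counts[j][allele] / n
        kl += p * math.log(p / q)
    return kl

# Two loci in perfect equilibrium versus complete disequilibrium.
equilibrium = ["00", "01", "10", "11"]
complete_ld = ["00", "11", "00", "11"]
print(multilocus_ld_kl(equilibrium), multilocus_ld_kl(complete_ld))
```

The value is zero when the haplotype frequencies factor exactly into allele frequencies and grows as the joint distribution departs from that product, which is the behavior the abstract attributes to the proposed measure.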
Non-biological signal (or noise) has been the bane of microarray analysis. Hybridization effects related to probe-sequence composition and DNA dye-probe interactions, as well as other effects inherent to the protocol, have been observed in differential methylation hybridization (DMH) microarray experiments.
We suggest two models to correct for non-biologically relevant probe signal with an overarching focus on probe-sequence composition. The estimated effects are evaluated and the strengths of the models are considered in the context of DMH analyses.
The majority of estimated parameters were statistically significant in all considered models. Model selection for signal correction is based on interpretation of the estimated values and their biological significance.
A new method for constructing confidence intervals for the location of putative genes regulating expression levels (quantitative traits) is proposed. This method is suitable for the "intermediate" fine-mapping step usually performed between the initial whole-genome screening and the follow-up fine-mapping step as a means of reducing the size of the region where the latter is performed. Assuming the existence of a single quantitative trait locus (QTL) in the region/chromosome identified by the genome scan, the method constructs a confidence region for its true position by testing each location in the chromosome to see if it can be the trait locus. We applied our method to the gene expression data from Problem 1 of Genetic Analysis Workshop 15 (GAW15), focusing on 25 genes that have previously been shown to share common regulating factor(s) on chromosome 14. Our results pointed to the same region on chromosome 14 for 13 of the gene expressions studied, not only partially reproducing the results of the previous analysis, but also yielding 95% confidence regions for the regulatory quantitative trait loci. Moreover, we identified three regions, one on each of chromosomes 3, 9, and 13, which potentially harbor additional common QTLs for several of the original gene expressions.
We proposed a confidence interval method for disease gene localization that tests every position on each chromosome of interest for its possibility of being a disease locus and includes those not rejected in the interval. Three test statistics were proposed to perform the tests, including one based on LOD and two generalized likelihood ratio tests with or without model averaging (GLRT/MA and GLRT). For the statistic based on LOD, an integrated procedure was proposed with an adaptive and an importance sampling component. We also proposed asymptotic approaches based on GLRT and GLRT/MA as alternatives that are much more efficient computationally but depend on the reliability of the limiting distributions. Besides its efficiency, the asymptotic procedure based on GLRT/MA also takes model uncertainty into consideration. Applications of these methods to the Genetic Analysis Workshop 15 (GAW15) rheumatoid arthritis data from the French population gave results that successfully captured the well-recognized susceptibility gene HLA*DRB1 within a 99% confidence interval of less than 6 cM with the two asymptotic approaches.
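The test-inversion idea described above (keep every position whose test is not rejected) can be sketched as follows. This is a minimal sketch that assumes a p-value has already been computed at each tested position by whichever statistic is used (LOD-based, GLRT, or GLRT/MA); the function name `confidence_region`, its inputs, and the rule "reject when p < alpha" are illustrative assumptions, not the authors' code.

```python
def confidence_region(positions, p_values, alpha=0.05):
    """Return contiguous runs of positions at which the null hypothesis
    'the trait locus is at this position' is NOT rejected at level alpha.

    positions: map positions (e.g. in cM) of the tested locations.
    p_values:  p-value of the test at each position (assumed given).
    Hypothetical helper for illustration only.
    """
    intervals = []
    start = prev = None
    for pos, p in sorted(zip(positions, p_values)):
        if p >= alpha:
            # Position is retained: open a new run or extend the current one.
            if start is None:
                start = pos
            prev = pos
        elif start is not None:
            # Position is rejected: close the current run.
            intervals.append((start, prev))
            start = None
    if start is not None:
        intervals.append((start, prev))
    return intervals

# Toy example with made-up p-values at six positions.
print(confidence_region([0, 1, 2, 3, 4, 5],
                        [0.01, 0.2, 0.5, 0.03, 0.3, 0.001]))
```

Because positions are retained one at a time, the resulting region need not be a single interval; with alpha = 0.01 the same construction would give a 99% region.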
With state-of-the-art microarray technologies now available for whole genome CpG island (CGI) methylation profiling, there is a need to develop statistical models that are specifically geared toward the analysis of such data. In this article, we propose a Gamma-Normal-Gamma (GNG) mixture model for describing three groups of CGI loci (hypomethylated, undifferentiated, and hypermethylated) from a single methylation microarray. This model was applied to study the methylation signatures of three breast cancer cell lines: MCF7, T47D, and MDAMB361. Biologically interesting and interpretable results are obtained, which highlight the heterogeneous nature of the three cell lines. This supports the need to analyze each microarray slide individually rather than pooling the slides for a single analysis. Comparisons with the fitted densities from the Normal-Uniform (NU) mixture model, proposed in the literature for gene expression analysis, show an improved goodness of fit for the GNG model. Although the GNG model was proposed in the context of single-slide methylation analysis, it can be readily adapted to analyze multi-slide methylation data as well as other types of microarray data.
CpG islands; mixture modeling; methylation/epigenetic signature; microarrays; breast cancer cell lines
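To make the three-component structure of the GNG model above concrete, here is a minimal sketch of such a mixture density: a gamma component reflected onto the negative axis (hypomethylated), a normal component (undifferentiated), and a gamma component on the positive axis (hypermethylated). The parametrization (shape and scale for each gamma, no location shift), the parameter values, and the function names are assumptions made for illustration; the paper's actual parametrization and fitting procedure are not reproduced here.

```python
import math

def gamma_pdf(x, shape, scale):
    """Gamma density with the given shape and scale; zero for x <= 0."""
    if x <= 0:
        return 0.0
    return x ** (shape - 1) * math.exp(-x / scale) / (math.gamma(shape) * scale ** shape)

def normal_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def gng_components(x, weights, hypo, normal, hyper):
    """Weighted component densities at x: (hypo, undifferentiated, hyper)."""
    w_hypo, w_norm, w_hyper = weights
    return [
        w_hypo * gamma_pdf(-x, *hypo),     # reflected gamma on the negative side
        w_norm * normal_pdf(x, *normal),   # central normal component
        w_hyper * gamma_pdf(x, *hyper),    # gamma on the positive side
    ]

def gng_density(x, weights, hypo, normal, hyper):
    """Mixture density at x."""
    return sum(gng_components(x, weights, hypo, normal, hyper))

def component_posteriors(x, weights, hypo, normal, hyper):
    """Posterior probability that a locus with value x belongs to each group."""
    parts = gng_components(x, weights, hypo, normal, hyper)
    total = sum(parts)
    return [p / total for p in parts]

# Illustrative (made-up) parameters: mixing weights, then (shape, scale)
# for each gamma component and (mean, sd) for the normal component.
params = dict(weights=(0.1, 0.8, 0.1), hypo=(2.0, 1.0),
              normal=(0.0, 0.5), hyper=(2.0, 1.0))
print(component_posteriors(3.0, **params))
```

In a sketch like this, loci could be assigned to a group by thresholding the posterior probabilities; estimating the parameters from a slide (for example by an EM algorithm) is a separate step not shown here.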