1.  Characterizing the strand-specific distribution of non-CpG methylation in human pluripotent cells 
Nucleic Acids Research  2013;42(5):3009-3016.
DNA methylation is an important defense and regulatory mechanism. In mammals, most DNA methylation occurs at CpG sites, and asymmetric non-CpG methylation has only been detected at appreciable levels in a few cell types. We are the first to systematically study the strand-specific distribution of non-CpG methylation. With the divide-and-compare strategy, we show that CHG and CHH methylation are not intrinsically different in human embryonic stem cells (ESCs) and induced pluripotent stem cells (iPSCs). We also find that non-CpG methylation is skewed between the two strands in introns, especially at intron boundaries and in highly expressed genes. Controlling for the proximal sequences of non-CpG sites, we show that the skew of non-CpG methylation in introns is mainly guided by sequence skew. By studying subgroups of transposable elements, we also found that non-CpG methylation is distributed in a strand-specific manner in both short interspersed nuclear elements (SINE) and long interspersed nuclear elements (LINE), but not in long terminal repeats (LTR). Finally, we show that on the antisense strand of Alus, a non-CpG site just downstream of the A-box is highly methylated. Together, the divide-and-compare strategy leads us to identify regions with strand-specific distributions of non-CpG methylation in humans.
PMCID: PMC3950701  PMID: 24343027
2.  BS-Seeker2: a versatile aligning pipeline for bisulfite sequencing data 
BMC Genomics  2013;14:774.
DNA methylation is an important epigenetic modification involved in many biological processes. Bisulfite treatment coupled with high-throughput sequencing provides an effective approach for studying genome-wide DNA methylation at base resolution. Libraries such as whole genome bisulfite sequencing (WGBS) and reduced representation bisulfite sequencing (RRBS) are widely used for generating DNA methylomes, demanding efficient and versatile tools for aligning bisulfite sequencing data.
We have developed BS-Seeker2, an updated version of BS Seeker, as a full pipeline for mapping bisulfite sequencing data and generating DNA methylomes. BS-Seeker2 improves mappability over existing aligners by using local alignment. It can also map reads from RRBS libraries by building special indexes with improved efficiency and accuracy. Moreover, BS-Seeker2 provides an additional function for filtering out reads with incomplete bisulfite conversion, which is useful in minimizing the overestimation of DNA methylation levels. We also defined the CGmap and ATCGmap file formats for full representations of DNA methylomes, as part of the outputs of the BS-Seeker2 pipeline together with BAM and WIG files.
Our evaluations on the performance show that BS-Seeker2 works efficiently and accurately for both WGBS data and RRBS data. BS-Seeker2 is freely available at and the Galaxy server.
PMCID: PMC3840619  PMID: 24206606
DNA methylation; Bisulfite sequencing aligner; WGBS; RRBS; BS Seeker; Bisulfite conversion failure; Galaxy toolshed
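The conversion-failure filter described in the abstract above can be illustrated with a minimal sketch. It assumes an ungapped read already aligned to the forward strand of its reference window, and it treats any cytosine outside a CpG context that still reads as C as unconverted. The function names and the threshold of three sites are hypothetical and are not taken from BS-Seeker2.

```python
def unconverted_non_cpg(read: str, ref: str) -> int:
    """Count reference cytosines outside CpG context that still read as C.

    Assumes `read` and `ref` are aligned base-for-base (no gaps) on the
    forward strand. The last aligned base is skipped because its
    downstream context is unknown.
    """
    count = 0
    for i in range(min(len(read), len(ref)) - 1):
        is_non_cpg_c = ref[i] == "C" and ref[i + 1] != "G"
        if is_non_cpg_c and read[i] == "C":
            count += 1
    return count


def passes_conversion_filter(read: str, ref: str, max_unconverted: int = 3) -> bool:
    """Keep a read only if it has at most `max_unconverted` suspicious sites.

    The default threshold is an illustrative assumption.
    """
    return unconverted_non_cpg(read, ref) <= max_unconverted


ref = "ACCTGCGACCA"
converted = "ATTTGCGATTA"    # non-CpG Cs read as T; the CpG C stays methylated
unconverted = "ACCTGCGACCA"  # no conversion at all
print(passes_conversion_filter(converted, ref))
print(passes_conversion_filter(unconverted, ref))
```

Reads that fail such a filter would otherwise contribute spurious methylated calls at non-CpG sites, which is the overestimation the abstract refers to.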
3.  Neural Potential of a Stem Cell Population in the Hair Follicle 
Cell cycle (Georgetown, Tex.)  2007;6(17):2161-2170.
The bulge region of the hair follicle serves as a repository for epithelial stem cells that can regenerate the follicle in each hair growth cycle and contribute to epidermis regeneration upon injury. Here we describe a population of multipotential stem cells in the hair follicle bulge region; these cells can be identified by fluorescence in transgenic nestin-GFP mice. The morphological features of these cells suggest that they maintain close associations with each other and with the surrounding niche. Upon explantation, these cells can give rise to neurosphere-like structures in vitro. When these cells are permitted to differentiate, they produce several cell types, including cells with neuronal, astrocytic, oligodendrocytic, smooth muscle, adipocytic, and other phenotypes. Furthermore, upon implantation into the developing nervous system of chick, these cells generate neuronal cells in vivo. We used transcriptional profiling to assess the relationship between these cells and embryonic and postnatal neural stem cells and to compare them with other stem cell populations of the bulge. Our results show that nestin-expressing cells in the bulge region of the hair follicle have stem cell-like properties, are multipotent, and can effectively generate cells of neural lineage in vitro and in vivo.
PMCID: PMC3789384  PMID: 17873521
stem cells; hair follicle; bulge; neurogenesis; transcriptional profiling
4.  Chromatin state and microRNA determine different gene expression dynamics responsive to TNF stimulation 
Genomics  2012;100(5):297-302.
Gene expression is a dynamic process, and the factors that influence gene expression changes upon external stimulus are not clearly understood. We studied gene expression profiles in human umbilical vein endothelial cells (HUVEC) after the Tumor Necrosis Factor (TNF) stimulus, and found that: the promoters of fast-response up-regulated genes were enriched with several “active” chromatin markers like H3K27ac and H3K4me3, and also preferentially bound by Pol II and c-Myc; the core-promoter regions of slow-response up-regulated genes were frequently occupied by nucleosomes; down-regulated genes were more intensively regulated by microRNAs. Moreover, the Gene Ontology and motif analysis of the promoter regions revealed that gene clusters with different response behaviors had different functions and were regulated by different sets of transcription factors. Our observations suggested that the different gene expression patterns upon external stimulus were regulated by a combination of multi-layer regulators.
PMCID: PMC3771509  PMID: 22824656
TNF; Gene expression profiles; Chromatin; Histone code; MicroRNAs
5.  Novel Foxo1–dependent transcriptional programs control Treg cell function 
Nature  2012;491(7425):554-559.
Regulatory T (Treg) cells, characterized by expression of the transcription factor forkhead box P3 (Foxp3), maintain immune homeostasis by suppressing self-destructive immune responses [1–4]. Foxp3 operates as a late-acting differentiation factor controlling Treg cell homeostasis and function [5], whereas the early Treg-cell-lineage commitment is regulated by the Akt kinase and the forkhead box O (Foxo) family of transcription factors [6–10]. However, whether Foxo proteins act beyond the Treg-cell-commitment stage to control Treg cell homeostasis and function remains largely unexplored. Here we show that Foxo1 is a pivotal regulator of Treg cell function. Treg cells express high amounts of Foxo1 and display reduced T-cell-receptor-induced Akt activation, Foxo1 phosphorylation and Foxo1 nuclear exclusion. Mice with Treg-cell-specific deletion of Foxo1 develop a fatal inflammatory disorder similar in severity to that seen in Foxp3-deficient mice, but without the loss of Treg cells. Genome-wide analysis of Foxo1 binding sites reveals ~300 Foxo1-bound target genes, including the pro-inflammatory cytokine Ifng, that do not seem to be directly regulated by Foxp3. These findings show that the evolutionarily ancient Akt–Foxo1 signalling module controls a novel genetic program indispensable for Treg cell function.
PMCID: PMC3771531  PMID: 23135404
6.  FastDMA: An Infinium HumanMethylation450 Beadchip Analyzer 
PLoS ONE  2013;8(9):e74275.
DNA methylation is vital for many essential biological processes and human diseases. The Illumina Infinium HumanMethylation450 Beadchip is a recently developed platform for studying genome-wide DNA methylation state at more than 480,000 CpG sites and a few CHG sites with high data quality. To analyze the data from this promising platform, we developed FastDMA, which can be used to identify significantly differentially methylated probes. Besides single probe analysis, FastDMA can also do region-based analysis for identifying differentially methylated regions (DMRs). A unified statistical model, analysis of covariance (ANCOVA), is used for all the analyses in FastDMA. We apply FastDMA to three large-scale DNA methylation datasets from The Cancer Genome Atlas (TCGA) and find many differentially methylated genomic sites in different types of cancer. On the testing datasets, FastDMA shows much higher computational efficiency than current tools. FastDMA can benefit the data analyses of large-scale DNA methylation studies with an integrative pipeline and a high computational efficiency. The software is freely available via
PMCID: PMC3764200  PMID: 24040221
7.  The Seventh Asia Pacific Bioinformatics Conference (APBC2009) 
BMC Bioinformatics  2009;10(Suppl 1):S1.
PMCID: PMC2648764  PMID: 19208108
8.  OLego: fast and sensitive mapping of spliced mRNA-Seq reads using small seeds 
Nucleic Acids Research  2013;41(10):5149-5163.
A crucial step in analyzing mRNA-Seq data is to accurately and efficiently map hundreds of millions of reads to the reference genome and exon junctions. Here we present OLego, an algorithm specifically designed for de novo mapping of spliced mRNA-Seq reads. OLego adopts a multiple-seed-and-extend scheme, and does not rely on a separate external aligner. It achieves high sensitivity of junction detection by strategic searches with small seeds (∼14 nt for mammalian genomes). To improve accuracy and resolve ambiguous mapping at junctions, OLego uses a built-in statistical model to score exon junctions by splice-site strength and intron size. Burrows–Wheeler transform is used in multiple steps of the algorithm to efficiently map seeds, locate junctions and identify small exons. OLego is implemented in C++ with fully multithreaded execution, and allows fast processing of large-scale data. We systematically evaluated the performance of OLego in comparison with published tools using both simulated and real data. OLego demonstrated better sensitivity, higher or comparable accuracy and substantially improved speed. OLego also identified hundreds of novel micro-exons (<30 nt) in the mouse transcriptome, many of which are phylogenetically conserved and can be validated experimentally in vivo. OLego is freely available at
PMCID: PMC3664805  PMID: 23571760
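The seed-based junction detection described in the OLego abstract above can be sketched in a few lines. This toy version swaps the Burrows–Wheeler transform for a plain hash index and uses 4-nt seeds on an invented sequence, so it shows only the idea: seeds from one read that land on different alignment diagonals hint at an intron between them. All names and sequences here are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int) -> dict:
    """Map every k-mer in the reference to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_seeds(read: str, index: dict, k: int) -> list:
    """Return (read_offset, reference_position) for non-overlapping seeds."""
    hits = []
    for offset in range(0, len(read) - k + 1, k):
        for pos in index.get(read[offset:offset + k], []):
            hits.append((offset, pos))
    return hits


def diagonals(hits: list) -> list:
    """Distinct values of reference_position - read_offset, sorted."""
    return sorted({pos - offset for offset, pos in hits})


# Toy gene: two exons separated by a 16-nt intron (invented sequences).
exon1, intron, exon2 = "ACGTTGCA", "GTAAGTCCCCTTTTAG", "CATGGATC"
reference = exon1 + intron + exon2
spliced_read = exon1 + exon2

index = build_kmer_index(reference, k=4)
hits = map_seeds(spliced_read, index, k=4)
print(hits)
print(diagonals(hits))
```

In this example the two diagonals differ by 16, the length of the toy intron. A real mapper would then extend the seeds and score the candidate junction, for instance by splice-site strength and intron size as the abstract describes.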
9.  Computational comparison of two mouse draft genomes and the human golden path 
Genome Biology  2002;4(1):R1.
A comparison of the newly completed, publicly available, genome sequence of the mouse with the prior sequence of the mouse from Celera Genomics Inc. and with the human genome provides a consensus view of the mouse and important insights into human gene numbers.
The availability of both mouse and human draft genomes has marked the beginning of a new era of comparative mammalian genomics. The two available mouse genome assemblies, from the public mouse genome sequencing consortium and Celera Genomics, were obtained using different clone libraries and different assembly methods.
We present here a critical comparison of the two latest mouse genome assemblies. The utility of the combined genomes is further demonstrated by comparing them with the human 'golden path' and through a subsequent analysis of a resulting conserved sequence element (CSE) database, which allows us to identify over 6,000 potential novel genes and to derive independent estimates of the number of human protein-coding genes.
The Celera and public mouse assemblies differ in about 10% of the mouse genome. Each assembly has advantages over the other: Celera has higher accuracy in base-pairs and overall higher coverage of the genome; the public assembly, however, has higher sequence quality in some newly finished bacterial artificial chromosome (BAC) clone regions and the data are freely accessible. Perhaps most important, by combining both assemblies, we can get a better annotation of the human genome; in particular, we can obtain the most complete set of CSEs, one third of which are related to known genes and some others are related to other functional genomic regions. More than half the CSEs are of unknown function. From the CSEs, we estimate the total number of human protein-coding genes to be about 40,000. This searchable publicly available online CSEdb will expedite new discoveries through comparative genomics.
PMCID: PMC151282  PMID: 12537546
10.  Cell-type based analysis of microRNA profiles in the mouse brain 
Neuron  2012;73(1):35-48.
MicroRNAs (miRNAs) are implicated in brain development and function, but the underlying mechanisms have been difficult to study, in part due to the cellular heterogeneity in neural circuits. To systematically analyze miRNA expression in neurons, we have established a miRNA tagging and affinity purification (miRAP) method that is targeted to cell types through the Cre-loxP binary system in mice. Our studies of the neocortex and cerebellum reveal the expression of a large fraction of known miRNAs with distinct profiles in glutamatergic and GABAergic neurons, and subtypes of GABAergic neurons. We further detected putative novel miRNAs, tissue or cell type-specific strand selection of miRNAs, and miRNA editing. Our method thus will facilitate a systematic analysis of miRNA expression and regulation in specific neuron types in the context of neuronal development, physiology, plasticity, pathology and disease models, and is generally applicable to other cell types and tissues.
PMCID: PMC3270494  PMID: 22243745
11.  SpliceTrap: a method to quantify alternative splicing under single cellular conditions 
Bioinformatics  2011;27(21):3010-3016.
Motivation: Alternative splicing (AS) is a pre-mRNA maturation process leading to the expression of multiple mRNA variants from the same primary transcript. More than 90% of human genes are expressed via AS. Therefore, quantifying the inclusion level of every exon is crucial for generating accurate transcriptomic maps and studying the regulation of AS.
Results: Here we introduce SpliceTrap, a method to quantify exon inclusion levels using paired-end RNA-seq data. Unlike other tools, which focus on full-length transcript isoforms, SpliceTrap approaches the expression-level estimation of each exon as an independent Bayesian inference problem. In addition, SpliceTrap can identify major classes of alternative splicing events under a single cellular condition, without requiring a background set of reads to estimate relative splicing changes. We tested SpliceTrap both by simulation and real data analysis, and compared it to state-of-the-art tools for transcript quantification. SpliceTrap demonstrated improved accuracy, robustness and reliability in quantifying exon-inclusion ratios.
Conclusions: SpliceTrap is a useful tool to study alternative splicing regulation, especially for accurate quantification of local exon-inclusion ratios from RNA-seq data.
Availability and Implementation: SpliceTrap can be implemented online through the CSH Galaxy server and is also available for download and installation at
Supplementary Information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3198574  PMID: 21896509
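To make the idea of per-exon Bayesian estimation in the SpliceTrap abstract concrete, here is a minimal sketch that is deliberately much simpler than SpliceTrap's own model. It assumes read counts supporting inclusion and exclusion of a single exon are already available and comparable, places a Beta prior on the inclusion ratio, and returns the posterior mean. The function name and the uniform Beta(1, 1) default are assumptions made for illustration.

```python
def inclusion_posterior_mean(inclusion_reads: int,
                             exclusion_reads: int,
                             prior_a: float = 1.0,
                             prior_b: float = 1.0) -> float:
    """Posterior mean of an exon inclusion ratio under a Beta-binomial model.

    With a Beta(prior_a, prior_b) prior and binomially distributed read
    support, the posterior is Beta(prior_a + inclusion, prior_b + exclusion).
    """
    if inclusion_reads < 0 or exclusion_reads < 0:
        raise ValueError("read counts must be non-negative")
    total = inclusion_reads + exclusion_reads + prior_a + prior_b
    return (inclusion_reads + prior_a) / total


print(inclusion_posterior_mean(8, 2))  # well-covered exon
print(inclusion_posterior_mean(0, 0))  # no reads: falls back to the prior mean
```

Because each exon is treated independently, the estimate needs no second condition as a background, which mirrors the single-condition setting the abstract emphasizes.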
12.  Regulatory elements of Caenorhabditis elegans ribosomal protein genes 
BMC Genomics  2012;13:433.
Ribosomal protein genes (RPGs) are essential, tightly regulated, and highly expressed during embryonic development and cell growth. Even though their protein sequences are strongly conserved, their mechanism of regulation is not conserved across yeast, Drosophila, and vertebrates. A recent investigation of genomic sequences conserved across both nematode species and associated with different gene groups indicated the existence of several elements in the upstream regions of C. elegans RPGs, providing a new insight regarding the regulation of these genes in C. elegans.
In this study, we performed an in-depth examination of C. elegans RPG regulation and found nine highly conserved motifs in the upstream regions of C. elegans RPGs using the motif discovery algorithm DME. Four motifs were partially similar to transcription factor binding sites from C. elegans, Drosophila, yeast, and human. One pair of these motifs was found to co-occur in the upstream regions of 250 transcripts including 22 RPGs. The distance between the two motifs displayed a complex frequency pattern that was related to their relative orientation.
We tested the impact of three of these motifs on the expression of rpl-2 using a series of reporter gene constructs and showed that all three motifs are necessary to maintain the high natural expression level of this gene. One of the motifs was similar to the binding site of an orthologue of POP-1, and we showed that RNAi knockdown of pop-1 impacts the expression of rpl-2. We further determined the transcription start site of rpl-2 by 5’ RACE and found that the motifs lie 40–90 bases upstream of the start site. We also found evidence that a noncoding RNA, contained within the outron of rpl-2, is co-transcribed with rpl-2 and cleaved during trans-splicing.
Our results indicate that C. elegans RPGs are regulated by a complex novel series of regulatory elements that is evolutionarily distinct from those of all other species examined up until now.
PMCID: PMC3575287  PMID: 22928635
13.  Bivalent-Like Chromatin Markers Are Predictive for Transcription Start Site Distribution in Human 
PLoS ONE  2012;7(6):e38112.
Deep sequencing of 5′ capped transcripts has revealed a variety of transcription initiation patterns, from narrow, focused promoters to wide, broad promoters. Attempts have already been made to model empirically classified patterns, but virtually no quantitative models for transcription initiation have been reported. Even though both genetic and epigenetic elements have been associated with such patterns, the organization of regulatory elements is largely unknown. Here, linear regression models were derived from a pool of regulatory elements, including genomic DNA features, nucleosome organization, and histone modifications, to predict the distribution of transcription start sites (TSS). Importantly, models including both active and repressive histone modification markers, e.g. H3K4me3 and H4K20me1, were consistently found to be much more predictive than models with only single-type histone modification markers, indicating the possibility of “bivalent-like” epigenetic control of transcription initiation. The nucleosome positions are proposed to be coded in the active component of such bivalent-like histone modification markers. Finally, we demonstrated that models trained on one cell type could successfully predict TSS distribution in other cell types, suggesting that these models may have a broader application range.
PMCID: PMC3387189  PMID: 22768038
14.  The pro-longevity gene FoxO3 is a direct target of the p53 tumor suppressor 
Oncogene  2011;30(29):3207-3221.
FoxO transcription factors play a conserved role in longevity and act as tissue-specific tumor suppressors in mammals. Several nodes of interaction have been identified between FoxO transcription factors and p53, a major tumor suppressor in humans and mice. However, the extent and importance of the functional interaction between FoxO and p53 have not been fully explored. Here, we show that p53 transactivates the expression of FoxO3, one of the four mammalian FoxO genes, in response to DNA damaging agents in both mouse embryonic fibroblasts and in thymocytes. We show that p53 transactivates FoxO3 in cells by binding to a site in the second intron of the FoxO3 gene, a genomic region recently found to be associated with extreme longevity in humans. While FoxO3 is not necessary for p53-dependent cell cycle arrest, FoxO3 appears to modulate p53-dependent apoptosis. We also find that FoxO3 loss does not interact with p53 loss for tumor development in vivo, although the tumor spectrum of p53 deficient mice may be affected by FoxO3 loss. Our findings indicate that FoxO3 is a p53 target gene, and suggest that FoxO3 and p53 are part of a regulatory transcriptional network that may play an important role during aging and cancer.
PMCID: PMC3136551  PMID: 21423206
15.  Major Chromosomal Breakpoint Intervals in Breast Cancer Co-Localize with Differentially Methylated Regions 
Frontiers in Oncology  2012;2:197.
Solid tumors exhibit chromosomal rearrangements resulting in gain or loss of multiple chromosomal loci (copy number variation, or CNV), and translocations that occasionally result in the creation of novel chimeric genes. In the case of breast cancer, although most individual tumors each have a unique CNV landscape, the breakpoints, as measured over large datasets, appear to be non-randomly distributed in the genome. Breakpoints show a significant regional concentration at genomic loci spanning perhaps several megabases. The proximal cause of these breakpoint concentrations is a subject of speculation, but is, as yet, largely unknown. To shed light on this issue, we have performed a bio-statistical analysis on our previously published data for a set of 119 breast tumors and normal controls (Wiedswang et al., 2003), where each sample has both high-resolution CNV and methylation data. The method examined the distribution of closeness of breakpoint regions with differentially methylated regions (DMR), coupled with additional genomic parameters, such as repeat elements and designated “fragile sites” in the reference genome. Through this analysis, we have identified a set of 93 regional loci called breakpoint enriched DMR (BEDMRs) characterized by altered DNA methylation in cancer compared to normal cells that are associated with frequent breakpoint concentrations within a distance of 1 Mb. BEDMR loci are further associated with local hypomethylation (66%), concentrations of the Alu SINE repeats within 3 Mb (35% of the cases), and tend to occur near a number of cancer related genes such as the protocadherins, AKT1, DUB3, GAB2. Furthermore, BEDMRs seem to deregulate members of the histone gene family and chromatin remodeling factors, e.g., JMJD1B, which might affect the chromatin structure and disrupt coordinate signaling and repair.
From this analysis we propose that preference for chromosomal breakpoints is related to genome structure coupled with alterations in DNA methylation and hence, chromatin structure, associated with tumorigenesis.
PMCID: PMC3530719  PMID: 23293768
DNA methylation; copy number variation; Alu repeat element; genome instability; multi-modal analysis; breast cancer
16.  Novel Markov model of induced pluripotency predicts gene expression changes in reprogramming 
BMC Systems Biology  2011;5(Suppl 2):S8.
Somatic cells can be reprogrammed to induced-pluripotent stem cells (iPSCs) by introducing a few reprogramming factors, which challenges the long-held view that cell differentiation is irreversible. However, the mechanism of induced pluripotency is still unknown.
Inspired by the phenomenological reprogramming model of Artyomov et al. (2010), we proposed a novel Markov model, the stepwise reprogramming Markov (SRM) model, with simpler gene regulation rules and explored various properties of the model with Monte Carlo simulation. We calculated the reprogramming rate and showed that it would increase under knockdown of somatic transcription factors or global inhibition of DNA methylation, consistent with real reprogramming experiments. Furthermore, we demonstrated the utility of our model by testing it with real dynamic gene expression data spanning different intermediate stages in the iPS reprogramming process.
The gene expression data at several stages in reprogramming and the reprogramming rate under several typical experimental conditions coincided with our simulation results. The function of reprogramming factors and gene expression change during reprogramming could be explained reasonably well by our model.
This lends further support to our general rules of the gene regulation network in iPSC reprogramming. This model may help uncover the basic mechanism of reprogramming and improve the efficiency of converting somatic cells to iPSCs.
PMCID: PMC3287488  PMID: 22784579
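The abstract above estimates a reprogramming rate by Monte Carlo simulation of a Markov model. The following is a toy sketch of that general technique only, not the SRM model itself: a cell moves through a hypothetical chain of stages, stepping forward with probability `p_forward` and otherwise slipping back one stage, and the rate is the fraction of simulated cells that reach the final stage within a step budget. The number of stages, the step budget and the probabilities are all invented for illustration.

```python
import random


def simulate_once(p_forward: float, n_stages: int, max_steps: int,
                  rng: random.Random) -> bool:
    """Simulate one cell; return True if it reaches the final stage in time."""
    stage = 0
    for _ in range(max_steps):
        if stage == n_stages:
            return True
        if rng.random() < p_forward:
            stage += 1   # progress toward the pluripotent state
        elif stage > 0:
            stage -= 1   # fall back toward the somatic state
    return stage == n_stages


def reprogramming_rate(p_forward: float, n_stages: int = 4,
                       max_steps: int = 50, n_runs: int = 2000,
                       seed: int = 0) -> float:
    """Monte Carlo estimate of the fraction of cells that get reprogrammed."""
    rng = random.Random(seed)
    successes = sum(simulate_once(p_forward, n_stages, max_steps, rng)
                    for _ in range(n_runs))
    return successes / n_runs


# A perturbation that raises the forward probability raises the estimated rate.
print(reprogramming_rate(0.3))
print(reprogramming_rate(0.8))
```

In this picture, an intervention such as knocking down somatic transcription factors would correspond to changing the transition probabilities, and its effect is read off as a change in the estimated rate.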
17.  Identification of Tumor Suppressors and Oncogenes from Genomic and Epigenetic Features in Ovarian Cancer 
PLoS ONE  2011;6(12):e28503.
The identification of genetic and epigenetic alterations from primary tumor cells has become a common method to identify genes critical to the development and progression of cancer. We seek to identify those genetic and epigenetic aberrations that have the most impact on gene function within the tumor. First, we perform a bioinformatic analysis of copy number variation (CNV) and DNA methylation covering the genetic landscape of ovarian cancer tumor cells. We separately examined CNV and DNA methylation for 42 primary serous ovarian cancer samples using MOMA-ROMA assays and 379 tumor samples analyzed by The Cancer Genome Atlas. We have identified 346 genes with significant deletions or amplifications among the tumor samples. Utilizing associated gene expression data, we predict 156 genes with altered copy number and correlated changes in expression. Among these genes, CCNE1, POP4, UQCRB, PHF20L1 and C19orf2 were identified within both data sets. We were specifically interested in copy number variation as our base genomic property in the prediction of tumor suppressors and oncogenes in the altered ovarian tumor. We therefore identify changes in DNA methylation and expression for all amplified and deleted genes. We statistically define tumor suppressor and oncogenic features for these modalities and perform a correlation analysis with expression. We predicted 611 potential oncogene and tumor suppressor candidates by integrating these data types. Genes with a strong correlation for methylation dependent expression changes exhibited at varying copy number aberrations include CDCA8, ATAD2, CDKN2A, RAB25, AURKA, BOP1 and EIF2C3. We provide copy number variation and DNA methylation analysis for over 11,500 individual genes covering the genetic landscape of ovarian cancer tumors. We show the extent of genomic and epigenetic alterations for known tumor suppressors and oncogenes and also use these defined features to identify potential ovarian cancer gene candidates.
PMCID: PMC3234280  PMID: 22174824
18.  Study of FoxA Pioneer Factor at Silent Genes Reveals Rfx-Repressed Enhancer at Cdx2 and a Potential Indicator of Esophageal Adenocarcinoma Development 
PLoS Genetics  2011;7(9):e1002277.
Understanding how silent genes can be competent for activation provides insight into development as well as cellular reprogramming and pathogenesis. We performed genomic location analysis of the pioneer transcription factor FoxA in the adult mouse liver and found that about one-third of the FoxA bound sites are near silent genes, including genes without detectable RNA polymerase II. Virtually all of the FoxA-bound silent sites are within conserved sequences, suggesting possible function. Such sites are enriched in motifs for transcriptional repressors, including for Rfx1 and type II nuclear hormone receptors. We found one such target site at a cryptic “shadow” enhancer 7 kilobases (kb) downstream of the Cdx2 gene, where Rfx1 restricts transcriptional activation by FoxA. The Cdx2 shadow enhancer exhibits a subset of regulatory properties of the upstream Cdx2 promoter region. While Cdx2 is ectopically induced in the early metaplastic condition of Barrett's esophagus, its expression is not necessarily present in progressive Barrett's with dysplasia or adenocarcinoma. By contrast, we find that Rfx1 expression in the esophageal epithelium becomes gradually extinguished during progression to cancer, i.e., expression of Rfx1 decreased markedly in dysplasia and adenocarcinoma. We propose that this decreased expression of Rfx1 could be an indicator of progression from Barrett's esophagus to adenocarcinoma and that similar analyses of other transcription factors bound to silent genes can reveal unanticipated regulatory insights into oncogenic progression and cellular reprogramming.
Author Summary
FoxA transcriptional regulatory proteins are “pioneer factors” that engage silent genes, helping to endow the competence for activation. About a third of the DNA sites we found to be occupied by FoxA in the adult liver are at genes that are silent. Analysis of transcription factor binding motifs near the FoxA sites at silent genes revealed a co-occurrence of motifs for the transcriptional repressors Rfx1 and type II nuclear hormone receptors (NHR-II). Further analysis of one such region downstream of the Cdx2 gene shows that it is a cryptic enhancer, in that it functions poorly unless Rfx1 or NHR-II binding is prevented, in which case FoxA1 promotes enhancer activity. Cdx2 encodes a transcription factor that promotes intestinal differentiation; ectopic expression of Cdx2 in the esophagus can help promote metaplasia and cancer. By screening numerous staged samples of human tissues, we show that Rfx1 expression is extinguished during the progression to esophageal adenocarcinoma and thus may serve as a marker of cancer progression. These studies exemplify how the analysis of pioneer factors bound to silent genes can reveal a basis for the competence of cells to deregulate gene expression and undergo transitions to cancer.
PMCID: PMC3174211  PMID: 21935353
19.  Direct cloning of double-stranded RNAs from RNase protection analysis reveals processing patterns of C/D box snoRNAs and provides evidence for widespread antisense transcript expression 
Nucleic Acids Research  2011;39(22):9720-9730.
We describe a new method that allows cloning of double-stranded RNAs (dsRNAs) that are generated in RNase protection experiments. We demonstrate that the mouse C/D box snoRNA MBII-85 (SNORD116) is processed into at least five shorter RNAs using processing sites near known functional elements of C/D box snoRNAs. Surprisingly, the majority of cloned RNAs from RNase protection experiments were derived from endogenous cellular RNA, indicating widespread antisense expression. The cloned dsRNAs could be mapped to genome areas that show RNA expression on both DNA strands and partially overlapped with experimentally determined argonaute-binding sites. The data suggest a conserved processing pattern for some C/D box snoRNAs and abundant expression of longer, non-coding RNAs in the cell that can potentially form dsRNAs.
PMCID: PMC3239178  PMID: 21880592
21.  Inferring Haplotypes of Copy Number Variations From High-Throughput Data With Uncertainty 
G3: Genes|Genomes|Genetics  2011;1(1):35-42.
Accurate information on haplotypes and diplotypes (haplotype pairs) is required for population-genetic analyses; however, microarrays do not provide data on a haplotype or diplotype at a copy number variation (CNV) locus; they only provide data on the total number of copies over a diplotype or an unphased sequence genotype (e.g., AAB, unlike AB of single nucleotide polymorphism). Moreover, such copy numbers or genotypes are often incorrectly determined when microarray signal intensities derived from different copy numbers or genotypes are not clearly separated due to noise. Here we report an algorithm to infer CNV haplotypes and individuals’ diplotypes at multiple loci from noisy microarray data, utilizing the probability that a signal intensity may be derived from different underlying copy numbers or genotypes. Performing simulation studies based on known diplotypes and an error model obtained from real microarray data, we demonstrate that this probabilistic approach succeeds in accurate inference (error rate: 1–2%) from noisy data, whereas previous deterministic approaches failed (error rate: 12–18%). Applying this algorithm to real microarray data, we estimated haplotype frequencies and diplotypes in 1486 CNV regions for 100 individuals. Our algorithm will facilitate accurate population-genetic analyses and powerful disease association studies of CNVs.
PMCID: PMC3276117  PMID: 22384316
copy number variation; EM algorithm; haplotype inference; phasing
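The probabilistic step described in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes Gaussian noise around one hypothetical mean intensity per copy number, and the function name, cluster means and standard deviation are all invented for illustration. It only shows how a signal intensity can be turned into a probability for each underlying copy number instead of a single hard call.

```python
import math


def copy_number_posterior(intensity, means, sd, priors=None):
    """Return P(copy number | intensity) under an assumed Gaussian noise model.

    `means` maps each candidate copy number to a hypothetical mean intensity.
    """
    if priors is None:
        # Assumption: uniform prior over the candidate copy numbers.
        priors = {c: 1.0 / len(means) for c in means}
    # Unnormalized posterior: prior times Gaussian likelihood (shared sd).
    weights = {
        c: priors[c] * math.exp(-0.5 * ((intensity - mu) / sd) ** 2)
        for c, mu in means.items()
    }
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


# Hypothetical cluster means for total copy numbers 1, 2 and 3.
means = {1: 1.0, 2: 2.0, 3: 3.0}
posterior = copy_number_posterior(2.4, means, sd=0.5)

# A deterministic approach keeps only the most likely copy number ...
hard_call = max(posterior, key=posterior.get)
# ... whereas a probabilistic approach keeps the whole distribution.
print(hard_call, posterior)
```

An intensity that falls between two clusters keeps substantial probability on both copy numbers, and that is the information a downstream haplotype inference step could weigh rather than discard.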
22.  ChIP-Array: combinatory analysis of ChIP-seq/chip and microarray gene expression data to discover direct/indirect targets of a transcription factor 
Nucleic Acids Research  2011;39(Web Server issue):W430-W436.
Chromatin immunoprecipitation (ChIP) coupled with high-throughput techniques (ChIP-X), such as next generation sequencing (ChIP-Seq) and microarray (ChIP–chip), has been successfully used to map active transcription factor binding sites (TFBS) of a transcription factor (TF). The targeted genes can be activated or suppressed by the TF, or can be unresponsive to the TF. Microarray technology has been used to measure the actual expression changes of thousands of genes under the perturbation of a TF, but is unable to determine if the affected genes are direct or indirect targets of the TF. Furthermore, both ChIP-X and microarray methods produce a large number of false positives. Combining microarray expression profiling and ChIP-X data allows more effective TFBS analysis for studying the function of a TF. However, current web servers only provide tools to analyze either ChIP-X or expression data, but not both. Here, we present ChIP-Array, a web server that integrates ChIP-X and expression data from human, mouse, yeast, fruit fly and Arabidopsis. This server will help biologists detect direct and indirect target genes regulated by a TF of interest and will aid in the functional characterization of the TF. ChIP-Array is available online, with free access for academic users.
PMCID: PMC3125757  PMID: 21586587
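The idea of combining binding and expression evidence can be sketched in a few lines. This is a simplified illustration, not the ChIP-Array server's actual procedure: the function name, the category labels and the gene identifiers are hypothetical, and it assumes that the lists of bound genes and of differentially expressed genes have already been computed.

```python
def split_targets(bound_genes, changed_genes):
    """Partition genes by binding evidence and expression response."""
    bound, changed = set(bound_genes), set(changed_genes)
    return {
        # Bound by the TF and changed in expression: candidate direct targets.
        "direct": bound & changed,
        # Changed in expression without binding: candidate indirect targets.
        "indirect_candidates": changed - bound,
        # Bound but with no expression change: unresponsive to the TF.
        "bound_unresponsive": bound - changed,
    }


# Hypothetical gene identifiers, for illustration only.
result = split_targets(
    bound_genes=["GENE_A", "GENE_B", "GENE_C"],
    changed_genes=["GENE_B", "GENE_C", "GENE_D"],
)
print(result)
```

Requiring both kinds of evidence for a direct target is also what reduces the false positives that either data type produces on its own.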
23.  Histone modification profiles are predictive for tissue/cell-type specific expression of both protein-coding and microRNA genes 
BMC Bioinformatics  2011;12:155.
Gene expression is regulated at both the DNA sequence level and through modification of chromatin. However, the effect of chromatin on tissue/cell-type specific gene regulation (TCSR) is largely unknown. In this paper, we present a method to elucidate the relationship between histone modification/variation (HMV) and TCSR.
A classifier for differentiating CD4+ T cell-specific genes from housekeeping genes using HMV data was built. We found HMV in both promoter and gene body regions to be predictive of genes which are targets of TCSR. For example, the histone modification types H3K4me3 and H3K27ac were identified as the most predictive for CpG-related promoters, whereas H3K4me3 and H3K79me3 were the most predictive for nonCpG-related promoters. However, genes targeted by TCSR can be predicted using other types of HMV as well. Such redundancy implies that multiple types of underlying regulatory elements, such as enhancers or intragenic alternative promoters, which can regulate gene expression in a tissue/cell-type specific fashion, may be marked by the HMVs. Finally, we show that the predictive power of HMV for TCSR is not limited to protein-coding genes in CD4+ T cells: using the same classifier, trained on HMV data of protein-coding genes in CD4+ T cells, we successfully predicted TCSR-targeted genes in muscle cells, as well as microRNA genes with expression specific to CD4+ T cells.
We have begun to understand the HMV patterns that guide gene expression in both tissue/cell-type specific and ubiquitous manners.
PMCID: PMC3120700  PMID: 21569556
24.  A highly efficient and effective motif discovery method for ChIP-seq/ChIP-chip data using positional information 
Nucleic Acids Research  2011;40(7):e50.
Identification of DNA motifs from ChIP-seq/ChIP-chip [chromatin immunoprecipitation (ChIP)] data is a powerful method for understanding the transcriptional regulatory network. However, most established methods are designed for small sample sizes and are inefficient for ChIP data. Here we propose a new k-mer occurrence model to reflect the fact that functional DNA k-mers often cluster around ChIP peak summits. With this model, we introduce a new measure to discover functional k-mers. Using simulation, we demonstrate that our method is more robust against noise in ChIP data than available methods. A novel word clustering method is also implemented to group similar k-mers into position weight matrices (PWMs). Our method was applied to a diverse set of ChIP experiments to demonstrate its high sensitivity and specificity. Importantly, our method is much faster than several other methods for large sample sizes. Thus, we have developed an efficient and effective motif discovery method for ChIP experiments.
PMCID: PMC3326300  PMID: 22228832
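The observation that functional k-mers cluster around peak summits can be illustrated with a small sketch. This is not the measure proposed in the paper: it assumes each input sequence is centred on its peak summit, and the function name, window size and scoring ratio are hypothetical choices made for illustration. The score compares how often a k-mer starts near the centre with how often it starts elsewhere.

```python
from collections import Counter


def kmer_center_enrichment(seqs, k, window):
    """Score each k-mer by its frequency near the sequence centre vs. the flanks.

    Assumes every sequence is centred on its ChIP peak summit.
    """
    center, flank = Counter(), Counter()
    center_total = flank_total = 0
    for seq in seqs:
        mid = len(seq) / 2
        for i in range(len(seq) - k + 1):
            kmer_mid = i + k / 2  # midpoint of the k-mer occurrence
            if abs(kmer_mid - mid) <= window:
                center[seq[i:i + k]] += 1
                center_total += 1
            else:
                flank[seq[i:i + k]] += 1
                flank_total += 1
    # Pseudocounts of 1 avoid division by zero for k-mers seen in one region only.
    return {
        kmer: ((center[kmer] + 1) / (center_total + 1))
        / ((flank[kmer] + 1) / (flank_total + 1))
        for kmer in set(center) | set(flank)
    }


# Toy sequences with "GATA" placed at the centre (illustrative data only).
seqs = ["CCCCCCCCGATACCCCCCCC", "TTTTTTTTGATATTTTTTTT"]
scores = kmer_center_enrichment(seqs, k=4, window=3)
print(sorted(scores.items(), key=lambda kv: -kv[1])[:3])
```

A k-mer with a score well above 1 occurs preferentially near the assumed summit; a real analysis would also need a background model and a significance test.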
25.  Multi-stage analysis of gene expression and transcription regulation in C57/B6 mouse liver development 
Genomics  2008;93(3):235-242.
The liver performs a number of functions essential for life. The development of such a complex organ relies on finely regulated gene expression profiles, which change over the course of development and determine the phenotype and function of the liver. We used high-density oligonucleotide microarrays to study gene expression and transcription regulation at 14 time points across C57/B6 mouse liver development: E11.5 (embryonic day 11.5), E12.5, E13.5, E14.5, E15.5, E16.5, E17.5, E18.5, Day0 (the day of birth), Day3, Day7, Day14, Day21, and normal adult liver. With these data, we performed a comprehensive analysis of gene expression patterns, functional preferences and transcriptional regulation during liver development. A group of uncharacterized genes that might be involved in fetal hematopoiesis was detected.
PMCID: PMC2995988  PMID: 19015022
liver development; microarray; gene expression; function; transcriptional regulation
