1.  Integrative genomics and transcriptomics analysis of human embryonic and induced pluripotent stem cells 
BioData Mining  2014;7:32.
Human genomic variations, including single nucleotide polymorphisms (SNPs) and copy number variations (CNVs), are associated with several phenotypic traits varying from mild features to hereditary diseases. Several genome-wide studies have reported genomic variants that correlate with gene expression levels in various tissue and cell types.
We studied human embryonic stem cells (hESCs) and human induced pluripotent stem cells (hiPSCs), measuring SNPs and CNVs with Affymetrix SNP 6 microarrays and expression values with Affymetrix Exon microarrays. We computed the linear relationships between SNPs and expression levels of exons, transcripts and genes, and the associations between gene CNVs and gene expression levels. Further, for a few of the resulting genes, the expression value was associated with both CNVs and SNPs. Our results revealed altogether 217 genes and 584 SNPs whose genomic alterations affect the transcriptome in the same cells. We analyzed the enriched pathways and gene ontologies within these groups of genes, and found that terms related to alternative splicing and development were enriched.
Our results revealed that in human pluripotent stem cells, the expression values of several genes, transcripts and exons were affected by genomic variation.
Electronic supplementary material
The online version of this article (doi:10.1186/s13040-014-0032-2) contains supplementary material, which is available to authorized users.
PMCID: PMC4298950  PMID: 25649046
hESC; hiPSC; Association analysis; SNP; CNV; Gene expression; Exon expression; Transcript expression
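The linear SNP-expression relationship described in this abstract can be illustrated with a small sketch. It assumes an additive coding of genotypes as allele dosages (0, 1 or 2) and an ordinary least-squares fit; the abstract does not state the exact model, and the function name and toy values below are invented for illustration only.

```python
import numpy as np

def snp_expression_fit(dosages, expression):
    """Least-squares fit of expression = intercept + slope * dosage.

    Hypothetical helper: the additive 0/1/2 coding is an assumption,
    not a detail taken from the abstract.
    """
    g = np.asarray(dosages, dtype=float)
    y = np.asarray(expression, dtype=float)
    # Design matrix: an intercept column plus the allele dosage.
    X = np.column_stack([np.ones_like(g), g])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ coef
    # Coefficient of determination as a simple measure of fit.
    r2 = 1.0 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
    return float(coef[0]), float(coef[1]), float(r2)

# Toy data (invented): six cell lines, one SNP, one gene.
dosage = [0, 0, 1, 1, 2, 2]
expr = [5.0, 5.2, 6.1, 5.9, 7.0, 7.2]
intercept, slope, r2 = snp_expression_fit(dosage, expr)
print(f"slope per allele: {slope:.3f}, R^2: {r2:.3f}")
```

A genome-wide analysis would repeat such a fit for many SNP-feature pairs and correct for multiple testing, which this sketch leaves out.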
2.  Th1/Th17 Plasticity Is a Marker of Advanced β Cell Autoimmunity and Impaired Glucose Tolerance in Humans 
Upregulation of IL-17 immunity and detrimental effects of IL-17 on human islets have been implicated in human type 1 diabetes. In animal models, the plasticity of Th1/Th17 cells contributes to the development of autoimmune diabetes. In this study, we demonstrate that the upregulation of the IL-17 pathway and Th1/Th17 plasticity in peripheral blood are markers of advanced β cell autoimmunity and impaired β cell function in human type 1 diabetes. Activated Th17 immunity was observed in the late stage of preclinical diabetes in children with β cell autoimmunity and impaired glucose tolerance, but not in children with early β cell autoimmunity. We found an increased ratio of IFN-γ/IL-17 expression in Th17 cells in children with advanced β cell autoimmunity, which correlated with HbA1c and plasma glucose concentrations in an oral glucose tolerance test, and thus impaired β cell function. Low expression of Helios was seen in Th17 cells, suggesting that Th1/Th17 cells are not converted thymus-derived regulatory T cells. Our results suggest that the development of Th1/Th17 plasticity may serve as a biomarker of disease progression from β cell autoantibody positivity to type 1 diabetes. These data in human type 1 diabetes emphasize the role of Th1/Th17 plasticity as a potential contributor to tissue destruction in autoimmune conditions.
PMCID: PMC4273995  PMID: 25480564
3.  The genome-wide landscape of copy number variations in the MUSGEN study provides evidence for a founder effect in the isolated Finnish population 
European Journal of Human Genetics  2013;21(12):1411-1416.
Here we characterized the genome-wide architecture of copy number variations (CNVs) in 286 healthy, unrelated Finnish individuals belonging to the MUSGEN study, in which the molecular background underlying musical aptitude and related traits is studied. By using the Illumina HumanOmniExpress-12v.1.0 beadchip, we identified 5493 CNVs that were spread across 467 different cytogenetic regions, spanning a total size of 287.83 Mb (∼9.6% of the human genome). Merging the overlapping CNVs across samples resulted in 999 discrete copy number variable regions (CNVRs), of which ∼6.9% were putatively novel. The average number of CNVs per person was 20, whereas the average size of CNV per locus was 52.39 kb. Large CNVs (>1 Mb) were present in 4% of the samples. The proportion of homozygous deletions in this data set (∼12.4%) seemed to be higher when compared with three other populations. Interestingly, several CNVRs were significantly enriched in this sample set, whereas several others were totally depleted. For example, a CNVR at chr2p22.1 intersecting GALM was more common in this population (P=3.3706 × 10^−44) than in African and other European populations. The enriched CNVRs, however, showed no significant association with music-related phenotypes. Moreover, the most common CNV locations in the world's normal population cohorts (6q14.1, 11q11) were overrepresented in this population. Thus, the genome-wide CNV investigation in this Finnish sample set demonstrated features that are characteristic of isolated populations. Novel CNVRs and the functional implications of CNVs revealed in this study elucidate structural variation present in this population isolate, and may also serve as candidate gene loci for music-related traits.
PMCID: PMC3831076  PMID: 23591402
copy number variation; isolated Finnish population; founder effect; MUSGEN; Illumina HumanOmniExpress-12v.1.0 beadchip; PennCNV; QuantiSNP
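The step of merging overlapping CNVs across samples into discrete CNVRs can be sketched as interval merging. The sketch assumes that any positional overlap on the same chromosome joins two calls into one region; the study's actual overlap criterion is not given in the abstract, and the function name and coordinates are hypothetical.

```python
def merge_cnvs(cnvs):
    """Merge overlapping (chrom, start, end) calls into regions.

    Assumption: any overlap on the same chromosome merges two calls.
    """
    regions = []
    # Sorting groups calls by chromosome, then by start coordinate.
    for chrom, start, end in sorted(cnvs):
        if regions and regions[-1][0] == chrom and start <= regions[-1][2]:
            # Overlaps the current region: extend its end if needed.
            regions[-1][2] = max(regions[-1][2], end)
        else:
            # No overlap: open a new region.
            regions.append([chrom, start, end])
    return [tuple(r) for r in regions]

# Toy calls pooled from several samples (coordinates invented).
calls = [
    ("chr2", 150, 300),
    ("chr2", 100, 200),
    ("chr6", 120, 180),
    ("chr2", 400, 500),
]
cnvrs = merge_cnvs(calls)
print(cnvrs)
```

Stricter definitions, such as requiring a minimum reciprocal overlap, would produce more and smaller regions than this permissive rule.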
4.  Expression profiles of long non-coding RNAs located in autoimmune disease-associated regions reveal immune cell-type specificity 
Genome Medicine  2014;6(10):88.
Although genome-wide association studies (GWAS) have identified hundreds of variants associated with a risk for autoimmune and immune-related disorders (AID), our understanding of the disease mechanisms is still limited. In particular, more than 90% of the risk variants lie in non-coding regions, and almost 10% of these map to long non-coding RNA transcripts (lncRNAs). lncRNAs are known to show more cell-type specificity than protein-coding genes.
We aimed to characterize lncRNAs and protein-coding genes located in loci associated with nine AIDs which have been well-defined by Immunochip analysis and by transcriptome analysis across seven populations of peripheral blood leukocytes (granulocytes, monocytes, natural killer (NK) cells, B cells, memory T cells, naive CD4+ and naive CD8+ T cells) and four populations of cord blood-derived T-helper cells (precursor, primary, and polarized (Th1, Th2) T-helper cells).
We show that lncRNAs mapping to loci shared between AID are significantly enriched in immune cell types compared to lncRNAs from the whole genome (α <0.005). We were not able to prioritize single cell types relevant for specific diseases, but we observed five different cell types enriched (α <0.005) in five AID (NK cells for inflammatory bowel disease, juvenile idiopathic arthritis, primary biliary cirrhosis, and psoriasis; memory T and CD8+ T cells in juvenile idiopathic arthritis, primary biliary cirrhosis, psoriasis, and rheumatoid arthritis; Th0 and Th2 cells for inflammatory bowel disease, juvenile idiopathic arthritis, primary biliary cirrhosis, psoriasis, and rheumatoid arthritis). Furthermore, we show that co-expression analyses of lncRNAs and protein-coding genes can predict the signaling pathways in which these AID-associated lncRNAs are involved.
The observed enrichment of lncRNA transcripts in AID loci implies lncRNAs play an important role in AID etiology and suggests that lncRNA genes should be studied in more detail to interpret GWAS findings correctly. The co-expression results strongly support a model in which the lncRNA and protein-coding genes function together in the same pathways.
Electronic supplementary material
The online version of this article (doi:10.1186/s13073-014-0088-0) contains supplementary material, which is available to authorized users.
PMCID: PMC4240855  PMID: 25419237
5.  Quantitative proteomic analysis of signalosome dynamics in primary T cells identifies the CD6 surface receptor as a Lat-independent TCR signaling hub 
Nature immunology  2014;15(4):384-392.
T cell antigen receptor (TCR)-mediated T cell activation requires the interaction of dozens of proteins. We used quantitative mass spectrometry and activated primary CD4+ T cells from mice in which a tag for affinity purification was knocked into several genes to determine the composition and dynamics of multiprotein complexes forming around the kinase Zap70 and the adaptors Lat and SLP-76. Most of the 112 high confidence time-resolved protein interactions we observed were novel. The CD6 surface receptor was found capable of initiating its own signaling pathway by recruiting SLP-76 and Vav1, irrespective of the presence of Lat. Our findings provide a more complete model of TCR signaling in which CD6 constitutes a signaling hub contributing to TCR signal diversification.
PMCID: PMC4037560  PMID: 24584089
6.  Methods for time series analysis of RNA-seq data with application to human Th17 cell differentiation 
Bioinformatics  2014;30(12):i113-i120.
Motivation: Gene expression profiling using RNA-seq is a powerful technique for screening RNA species’ landscapes and their dynamics in an unbiased way. While several advanced methods exist for differential expression analysis of RNA-seq data, proper tools to analyze RNA-seq time-course data have not been proposed.
Results: In this study, we use RNA-seq to measure gene expression during the early human T helper 17 (Th17) cell differentiation and T-cell activation (Th0). To quantify Th17-specific gene expression dynamics, we present a novel statistical methodology, DyNB, for analyzing time-course RNA-seq data. We use non-parametric Gaussian processes to model temporal correlation in gene expression and combine that with negative binomial likelihood for the count data. To account for experiment-specific biases in gene expression dynamics, such as differences in cell differentiation efficiencies, we propose a method to rescale the dynamics between replicated measurements. We develop an MCMC sampling method to make inference of differential expression dynamics between conditions. DyNB identifies several known and novel genes involved in Th17 differentiation. Analysis of differentiation efficiencies revealed consistent patterns in gene expression dynamics between different cultures. We use qRT-PCR to validate differential expression and differentiation efficiencies for selected genes. Comparison of the results with those obtained via traditional timepoint-wise analysis shows that time-course analysis together with time rescaling between cultures identifies differentially expressed genes which would not otherwise be detected.
Availability: An implementation of the proposed computational methods will be available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC4058923  PMID: 24931974
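The negative binomial likelihood mentioned in this abstract can be sketched for a single gene as follows. The mean-dispersion parametrization (variance = mean + dispersion × mean²) and the function names are assumptions made for illustration; the Gaussian process prior over the mean curve, the time rescaling and the MCMC inference of DyNB are not reproduced here.

```python
import math

def nb_logpmf(count, mean, dispersion):
    """Log-probability of a read count under a negative binomial.

    Assumed parametrization: variance = mean + dispersion * mean**2.
    """
    r = 1.0 / dispersion  # "size" parameter
    return (
        math.lgamma(count + r) - math.lgamma(r) - math.lgamma(count + 1)
        + r * math.log(r / (r + mean))
        + count * math.log(mean / (r + mean))
    )

def nb_loglik(counts, means, dispersion):
    """Sum of log-probabilities over time points for one gene."""
    return sum(nb_logpmf(y, m, dispersion) for y, m in zip(counts, means))

# Toy time course (invented): observed counts against two candidate
# mean curves. A curve closer to the data gives a higher log-likelihood.
observed = [10, 25, 60, 110]
rising = [12.0, 30.0, 55.0, 100.0]
flat = [50.0, 50.0, 50.0, 50.0]
print(nb_loglik(observed, rising, 0.1), nb_loglik(observed, flat, 0.1))
```

In a time-course model the means would come from a latent expression curve rather than being fixed by hand as they are here.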
7.  Modulation of TET2 expression and 5-methylcytosine oxidation by the CXXC domain protein IDAX 
Nature  2013;497(7447):122-126.
TET (Ten-Eleven-Translocation) proteins are Fe(II) and α-ketoglutarate-dependent dioxygenases [1-3] that modify the methylation status of DNA by successively oxidizing 5-methylcytosine (5mC) to 5-hydroxymethylcytosine (5hmC), 5-formylcytosine and 5-carboxycytosine [1,3-5], potential intermediates in the active erasure of DNA methylation marks [5,6]. We show here that IDAX/CXXC4, a player in the Wnt signaling pathway [7] that has been implicated in malignant renal cell carcinoma [8] and colonic villous adenoma [9], functions as a negative regulator of TET2 protein expression. IDAX/CXXC4 was originally encoded within an ancestral TET2 gene that underwent a chromosomal gene inversion during evolution, thus separating the TET2 CXXC domain from the catalytic domain. The Idax CXXC domain binds DNA sequences containing unmethylated CpGs, localises to promoters and CpG islands in genomic DNA, and interacts directly with the catalytic domain of Tet2. Unexpectedly, Idax expression resulted in caspase activation and Tet2 protein downregulation, in a manner that depended on DNA-binding through the Idax CXXC domain. Idax depletion prevented Tet2 downregulation in differentiating mouse embryonic stem (ES) cells, and shRNA against IDAX increased TET2 protein expression in the human monocytic cell line U937. Notably, we find that the expression and activity of TET3 are also regulated through its CXXC domain. Taken together, these results establish the separate and linked CXXC domains of TET2 and TET3 respectively as novel regulators of caspase activation and TET enzymatic activity.
PMCID: PMC3643997  PMID: 23563267
8.  Continuous Hypoxic Culturing of Human Embryonic Stem Cells Enhances SSEA-3 and MYC Levels 
PLoS ONE  2013;8(11):e78847.
Low oxygen tension (hypoxia) contributes critically to pluripotency of human embryonic stem cells (hESCs) by preventing spontaneous differentiation and supporting self-renewal. However, it is not well understood how hESCs respond to reduced oxygen availability and which molecular mechanisms maintain pluripotency in these conditions. In this study we characterized the transcriptional and molecular responses of three hESC lines (H9, HS401 and HS360) on short (2 hours), intermediate (24 hours) and prolonged (7 days) exposure to low oxygen conditions (4% O2). In response to prolonged hypoxia the expression of the pluripotency surface marker SSEA-3 was increased. Furthermore, the genome-wide gene-expression analysis revealed that a substantial proportion (12%) of all hypoxia-regulated genes in hESCs were directly linked to the mechanisms controlling pluripotency or differentiation. Moreover, transcription of the MYC oncogene was induced in response to continuous hypoxia. At the protein level MYC was stabilized through phosphorylation already in response to a short hypoxic exposure. Total MYC protein levels remained elevated throughout all the time points studied. Further, MYC protein expression in hypoxia was affected by silencing HIF2α, but not HIF1α. Since MYC has a crucial role in regulating pluripotency we propose that induction of sustained MYC expression in hypoxia contributes to activation of transcriptional programs critical for hESC self-renewal and maintenance of an enhanced pluripotent state.
PMCID: PMC3827269  PMID: 24236059
9.  Evaluating a linear k-mer model for protein-DNA interactions using high-throughput SELEX data 
BMC Bioinformatics  2013;14(Suppl 10):S2.
Transcription factor (TF) binding to DNA can be modeled in a number of different ways. It is highly debated which modeling methods are the best, how the models should be built and what they can be applied to. In this study a linear k-mer model proposed for predicting TF specificity in protein binding microarrays (PBM) is applied to high-throughput SELEX data, and the question of how to choose the most informative k-mers for the binding model is studied. We implemented the standard cross-validation scheme to reduce the number of k-mers in the model and observed that the number of k-mers can often be reduced significantly without a great negative effect on prediction accuracy. We also found that the later SELEX enrichment cycles provide a much better discrimination between bound and unbound sequences, as model prediction accuracies increased for all proteins together with the cycle number. We compared prediction performance of k-mer and position specific weight matrix (PWM) models derived from the same SELEX data. Consistent with previous results on PBM data, performance of the k-mer model was on average 9 percentage points better. For the 15 proteins in the SELEX data set with medium enrichment cycles, classification accuracies were on average 71% and 62% for k-mer and PWMs, respectively. Finally, the k-mer model trained with SELEX data was evaluated on ChIP-seq data, demonstrating substantial improvements for some proteins. For protein GATA1 the model can distinguish between true ChIP-seq peaks and negative peaks. For proteins RFX3 and NFATC1 the performance of the model was no better than chance.
PMCID: PMC3750486  PMID: 24267147
10.  Genome-Wide Copy Number Variation Analysis in Extended Families and Unrelated Individuals Characterized for Musical Aptitude and Creativity in Music 
PLoS ONE  2013;8(2):e56356.
Music perception and practice represent complex cognitive functions of the human brain. Recently, evidence for the molecular genetic background of music-related phenotypes has been obtained. In order to further elucidate the molecular background of musical phenotypes we analyzed genome-wide copy number variations (CNVs) in five extended pedigrees and in 172 unrelated subjects characterized for musical aptitude and creative functions in music. Musical aptitude was defined by a combination of the scores of three music tests (COMB scores): auditory structuring ability and Seashore's tests for pitch and for time. Data on creativity in music (herein composing, improvising and/or arranging music) were surveyed using a web-based questionnaire.
Several CNVRs containing genes that affect neurodevelopment, learning and memory were detected. A deletion at 5q31.1 covering the protocadherin-α gene cluster (Pcdha 1-9) was found co-segregating with low music test scores (COMB) in both sample sets. Pcdha is involved in neural migration, differentiation and synaptogenesis. Creativity in music was found to co-segregate with a duplication covering the glucose mutarotase gene (GALM) at 2p22. GALM has influence on serotonin release and membrane trafficking of the human serotonin transporter. Interestingly, genes related to serotonergic systems have been shown to associate not only with psychiatric disorders but also with creativity and music perception. Both Pcdha and GALM are related to the serotonergic systems influencing cognitive and motor functions, important for music perception and practice. Finally, a 1.3 Mb duplication was identified in a subject with low COMB scores in the region previously linked with absolute pitch (AP) at 8q24. No difference in CNV burden was detected among the high/low music test scores or creative/non-creative groups. In summary, CNVs and genes found in this study are related to cognitive functions. Our result suggests new candidate genes for music perception related traits and supports the previous results from the AP study.
PMCID: PMC3584088  PMID: 23460800
11.  Proviral Integration Site for Moloney Murine Leukemia Virus (PIM) Kinases Promote Human T Helper 1 Cell Differentiation 
The Journal of Biological Chemistry  2012;288(5):3048-3058.
Background: T helper (Th) cell differentiation is a complex process regulated by multiple factors.
Results: PIM kinases promote Th1 differentiation through regulating the expression of genes important for this process.
Conclusion: PIM kinases were identified as new regulators of Th1 cell differentiation.
Significance: This study provides new insights into the mechanisms controlling Th cell differentiation.
The differentiation of human primary T helper 1 (Th1) cells from naïve precursor cells is regulated by a complex, interrelated signaling network. The identification of factors regulating the early steps of Th1 cell polarization can provide important insight into the development of therapeutics for many inflammatory and autoimmune diseases. The serine/threonine-specific proviral integration site for Moloney murine leukemia virus (PIM) kinases PIM1 and PIM2 have been implicated in the cytokine-dependent proliferation and survival of lymphocytes. We have established that the third member of this family, PIM3, is also expressed in human primary Th cells and identified a new function for the entire PIM kinase family in T lymphocytes. Although PIM kinases are expressed more in Th1 than Th2 cells, we demonstrate here that these kinases positively influence Th1 cell differentiation. Our RNA interference results from human primary Th cells also suggest that PIM kinases promote the production of IFNγ, the hallmark cytokine produced by Th1 cells. Consistent with this, they also seem to be important for the up-regulation of the critical Th1-driving factor, T box expressed in T cells (T-BET), and the IL-12/STAT4 signaling pathway during the early Th1 differentiation process. In summary, we have identified PIM kinases as new regulators of human primary Th1 cell differentiation, thus providing new insights into the mechanisms controlling the selective development of human Th cell subsets.
PMCID: PMC3561529  PMID: 23209281
Differentiation; Gene Regulation; Immunology; siRNA; T Cell; Kinase
12.  RNA-Binding Protein L1TD1 Interacts with LIN28 via RNA and is Required for Human Embryonic Stem Cell Self-Renewal and Cancer Cell Proliferation 
Stem cells (Dayton, Ohio)  2012;30(3):452-460.
Human embryonic stem cells (hESC) have a unique capacity to self-renew and differentiate into all the cell types found in the human body. Although the transcriptional regulators of pluripotency are well studied, the role of cytoplasmic regulators is still poorly characterized. Here, we report a new stem cell-specific RNA-binding protein L1TD1 (ECAT11, FLJ10884) required for hESC self-renewal and cancer cell proliferation. Depletion of L1TD1 results in immediate downregulation of OCT4 and NANOG. Furthermore, we demonstrate that OCT4, SOX2, and NANOG all bind to the promoter of L1TD1. Moreover, L1TD1 is highly expressed in seminomas, and depletion of L1TD1 in these cancer cells influences self-renewal and proliferation. We show that L1TD1 colocalizes and interacts with LIN28 via RNA and directly with RNA helicase A (RHA). LIN28 has been reported to regulate translation of OCT4 in complex with RHA. Thus, we hypothesize that L1TD1 is part of the L1TD1-RHA-LIN28 complex that could influence levels of OCT4. Our results strongly suggest that L1TD1 has an important role in the regulation of stemness.
PMCID: PMC3507993  PMID: 22162396
L1TD1; Pluripotent stem cells; Embryonic stem cells; Embryonal carcinoma; Proliferation
13.  An integrative computational systems biology approach identifies differentially regulated dynamic transcriptome signatures which drive the initiation of human T helper cell differentiation 
BMC Genomics  2012;13:572.
A proper balance between different T helper (Th) cell subsets is necessary for normal functioning of the adaptive immune system. Revealing key genes and pathways driving the differentiation to distinct Th cell lineages provides important insight into underlying molecular mechanisms and new opportunities for modulating the immune response. Previous computational methods to quantify and visualize kinetic differential expression data of three or more lineages to identify reciprocally regulated genes have relied on clustering approaches and regression methods which have time as a factor, but have lacked methods which explicitly model temporal behavior.
We studied transcriptional dynamics of human umbilical cord blood T helper cells cultured in the absence and presence of cytokines promoting Th1 or Th2 differentiation. To identify genes that exhibit distinct lineage commitment dynamics and are specific for initiating differentiation to different Th cell subsets, we developed a novel computational methodology (LIGAP) allowing integrative analysis and visualization of multiple lineages over whole time-course profiles. Applying LIGAP to time-course data from multiple Th cell lineages, we identified and experimentally validated several differentially regulated Th cell subset specific genes as well as reciprocally regulated genes. Combining differentially regulated transcriptional profiles with transcription factor binding site and pathway information, we identified previously known and new putative transcriptional mechanisms involved in Th cell subset differentiation. All differentially regulated genes among the lineages together with an implementation of LIGAP are provided as an open-source resource.
The LIGAP method is widely applicable to quantify differential time-course dynamics of many types of datasets and generalizes to any number of conditions. It summarizes all the time-course measurements together with the associated uncertainty for visualization and manual assessment purposes. Here we identified novel human Th subset specific transcripts as well as regulatory mechanisms important for the initiation of the Th cell subset differentiation.
PMCID: PMC3526425  PMID: 23110343
Lineage commitment; Non-parametric analysis; Th cell differentiation; Time-course transcriptomics; Transcription factor binding
14.  A Linear Model for Transcription Factor Binding Affinity Prediction in Protein Binding Microarrays 
PLoS ONE  2011;6(5):e20059.
Protein binding microarrays (PBM) are a high throughput technology used to characterize protein-DNA binding. The arrays measure a protein's affinity toward thousands of double-stranded DNA sequences at once, producing a comprehensive binding specificity catalog. We present a linear model for predicting the binding affinity of a protein toward DNA sequences based on PBM data. Our model represents the measured intensity of an individual probe as a sum of the binding affinity contributions of the probe's subsequences. These subsequences characterize a DNA binding motif and can be used to predict the intensity of protein binding against arbitrary DNA sequences. Our method was the best performer in the Dialogue for Reverse Engineering Assessments and Methods 5 (DREAM5) transcription factor/DNA motif recognition challenge. For the DREAM5 bonus challenge, we also developed an approach for the identification of transcription factors based on their PBM binding profiles. Our approach for TF identification achieved the best performance in the bonus challenge.
PMCID: PMC3102690  PMID: 21637853
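The idea of representing a probe's intensity as a sum of contributions from its subsequences can be sketched with k-mer counts and a least-squares fit. This is a minimal illustration, not the authors' DREAM5 implementation: the choice of k = 2, the absence of any k-mer selection or regularization, and all names and toy data are assumptions.

```python
from itertools import product
import numpy as np

def kmer_counts(seq, k):
    """Count occurrences of every DNA k-mer in seq (fixed ACGT ordering)."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = np.zeros(len(index))
    for i in range(len(seq) - k + 1):
        counts[index[seq[i:i + k]]] += 1
    return counts

def fit_kmer_model(seqs, intensities, k):
    """Least-squares weights so that intensity ~ sum of k-mer weights."""
    X = np.vstack([kmer_counts(s, k) for s in seqs])
    y = np.asarray(intensities, dtype=float)
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)
    return weights

def predict_intensity(seq, weights, k):
    """Predicted intensity: k-mer counts of seq times the fitted weights."""
    return float(kmer_counts(seq, k) @ weights)

# Toy probes (invented). Intensities were generated by hand from a rule
# in which each "GA" contributes 2.0 and each "AT" contributes 1.0.
probes = ["GATA", "AGAT", "CCGA", "TTTT", "ATAT"]
signal = [3.0, 3.0, 2.0, 0.0, 2.0]
w = fit_kmer_model(probes, signal, k=2)
print([round(predict_intensity(p, w, 2), 3) for p in probes])
```

With only five probes the weights are not uniquely determined, so the sketch shows the model form rather than a meaningful motif; real PBM data provide thousands of probes per protein.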
15.  Probabilistic analysis of gene expression measurements from heterogeneous tissues 
Bioinformatics  2010;26(20):2571-2577.
Motivation: Tissue heterogeneity, arising from multiple cell types, is a major confounding factor in experiments that focus on studying cell types, e.g. their expression profiles, in isolation. Although sample heterogeneity can be addressed by manual microdissection prior to conducting experiments, computational treatment of heterogeneous measurements has become a reliable alternative that performs this microdissection in silico. Favoring computation over manual purification has advantages, such as reduced time consumption, the ability to measure responses of multiple cell types simultaneously, keeping samples free of external perturbations and an unaltered yield of molecular content.
Results: We formalize a probabilistic model, DSection, and show with simulations as well as with real microarray data that DSection attains increased modeling accuracy in terms of (i) estimating cell-type proportions of heterogeneous tissue samples, (ii) estimating replication variance and (iii) identifying differential expression across cell types under various experimental conditions. As our reference we use the corresponding linear regression model, which mirrors the performance of the majority of current non-probabilistic modeling approaches.
Availability and Software: All codes are written in Matlab, and are freely available upon request as well as at the project web page∼erkkila2/. Furthermore, a web-application for DSection exists at
PMCID: PMC2951082  PMID: 20631160
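The linear regression reference model mentioned in this abstract can be sketched as follows: each mixed sample is treated as a proportion-weighted sum of cell-type-specific expression profiles, and the profiles are recovered by least squares when the proportions are known. This is not DSection itself, which is a probabilistic model that also treats the proportions as uncertain; the function name and toy values are hypothetical, and the sketch is in Python rather than the Matlab used by the authors.

```python
import numpy as np

def estimate_cell_type_profiles(mixed, proportions):
    """Solve mixed ~ proportions @ profiles for the profiles.

    mixed:       samples x genes matrix of measured expression
    proportions: samples x cell-types matrix, rows summing to 1
    returns:     cell-types x genes matrix of estimated profiles
    """
    profiles, *_ = np.linalg.lstsq(proportions, mixed, rcond=None)
    return profiles

# Toy example (invented): two cell types, three genes, four samples.
true_profiles = np.array([[10.0, 2.0, 5.0],
                          [1.0, 8.0, 5.0]])
props = np.array([[0.8, 0.2],
                  [0.6, 0.4],
                  [0.3, 0.7],
                  [0.1, 0.9]])
mixed_expr = props @ true_profiles  # noise-free mixtures
estimated = estimate_cell_type_profiles(mixed_expr, props)
print(np.round(estimated, 3))
```

With noisy measurements or imprecise proportion estimates the least-squares answer degrades, which is the setting where the abstract reports gains from the probabilistic treatment.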
16.  Reconstruction and Validation of RefRec: A Global Model for the Yeast Molecular Interaction Network 
PLoS ONE  2010;5(5):e10662.
Molecular interaction networks establish all cell biological processes. The networks are under intensive research that is facilitated by new high-throughput measurement techniques for the detection, quantification, and characterization of molecules and their physical interactions. For the common model organism yeast Saccharomyces cerevisiae, public databases store a significant part of the accumulated information and, on the way to better understanding of the cellular processes, there is a need to integrate this information into a consistent reconstruction of the molecular interaction network. This work presents and validates RefRec, the most comprehensive molecular interaction network reconstruction currently available for yeast. The reconstruction integrates protein synthesis pathways, a metabolic network, and a protein-protein interaction network from major biological databases. The core of the reconstruction is based on a reference object approach in which genes, transcripts, and proteins are identified using their primary sequences. This enables their unambiguous identification and non-redundant integration. The obtained total number of different molecular species and their connecting interactions is ∼67,000. In order to demonstrate the capacity of RefRec for functional predictions, it was used for simulating the gene knockout damage propagation in the molecular interaction network in ∼590,000 experimentally validated mutant strains. Based on the simulation results, a statistical classifier was subsequently able to correctly predict the viability of most of the strains. The results also showed that the usage of different types of molecular species in the reconstruction is important for accurate phenotype prediction. In general, the findings demonstrate the benefits of global reconstructions of molecular interaction networks. 
With all the molecular species and their physical interactions explicitly modeled, our reconstruction is able to serve as a valuable resource in additional analyses involving objects from multiple molecular -omes. For that purpose, RefRec is freely available in the Systems Biology Markup Language format.
PMCID: PMC2871048  PMID: 20498836
17.  A data integration framework for prediction of transcription factor targets: a BCL6 case study 
We present a computational framework for predicting targets of transcription factor regulation. The framework is based on the integration of a number of sources of evidence, derived from DNA sequence and gene expression data, using a weighted sum approach. Sources of evidence are prioritized based on a training set, and their relative contributions are then optimized. The performance of the proposed framework is demonstrated in the context of BCL6 target prediction. We show that this framework is able to uncover BCL6 targets reliably when biological prior information is utilized effectively, particularly in the case of sequence analysis. The framework results in a considerable gain in performance over scores in which sequence information was not incorporated. This analysis shows that with assessment of the quality and biological relevance of the data, reliable predictions can be obtained with this computational framework.
PMCID: PMC2771581  PMID: 19348642
network inference; transcription factor binding site prediction; data integration
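The weighted sum integration of evidence sources can be sketched in a few lines. The evidence names, weights and scores below are invented for illustration; in the framework described above the weights are optimized on a training set, a step this sketch does not attempt.

```python
def combined_score(evidence, weights):
    """Weighted sum of per-source evidence scores for one candidate gene."""
    return sum(weights[source] * value for source, value in evidence.items())

def rank_targets(candidates, weights):
    """Return candidate names ordered from highest to lowest combined score."""
    return sorted(candidates,
                  key=lambda name: combined_score(candidates[name], weights),
                  reverse=True)

# Hypothetical evidence sources and weights (not taken from the paper).
source_weights = {"motif": 0.7, "expression": 0.3}
candidate_genes = {
    "geneA": {"motif": 0.9, "expression": 0.4},
    "geneB": {"motif": 0.2, "expression": 0.8},
}
ranking = rank_targets(candidate_genes, source_weights)
print(ranking)
```

Giving the sequence-derived source the larger weight in this toy setup only mirrors the abstract's remark that sequence information contributed a considerable gain; the actual weights are not reported here.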
18.  A protein–protein interaction guided method for competitive transcription factor binding improves target predictions 
Nucleic Acids Research  2009;37(22):e146.
An important milestone in revealing cells' functions is to build a comprehensive understanding of transcriptional regulation processes. These processes are largely regulated by transcription factors (TFs) binding to DNA sites. Several TF binding site (TFBS) prediction methods have been developed, but they usually model binding of a single TF at a time, although a few methods for predicting binding of multiple TFs also exist. In this article, we propose a probabilistic model that predicts binding of several TFs simultaneously. Our method explicitly models the competitive binding between TFs and uses the prior knowledge of existing protein–protein interactions (PPIs), which mimics the situation in the nucleus. Modeling DNA binding for multiple TFs improves the accuracy of binding site prediction remarkably when compared with other programs and with cases where individual binding prediction results of separate TFs have been combined. Traditional TFBS prediction methods usually predict an overwhelming number of false positives. Our competitive binding prediction method overcomes this lack of specificity to a remarkable degree. In addition, previously unpredictable binding sites can be detected with the help of PPIs. Source codes are available at∼harrila/.
PMCID: PMC2794167  PMID: 19786498
19.  A joint finite mixture model for clustering genes from independent Gaussian and beta distributed data 
BMC Bioinformatics  2009;10:165.
Cluster analysis has become a standard computational method for gene function discovery as well as for more general explanatory data analysis. A number of different approaches have been proposed for that purpose, out of which different mixture models provide a principled probabilistic framework. Cluster analysis is now increasingly supplemented with multiple data sources, and these heterogeneous information sources should be used as efficiently as possible.
This paper presents a novel Beta-Gaussian mixture model (BGMM) for clustering genes based on Gaussian distributed and beta distributed data. The proposed BGMM can be viewed as a natural extension of the beta mixture model (BMM) and the Gaussian mixture model (GMM). The proposed BGMM method differs from other mixture model based methods in its integration of two different data types into a single and unified probabilistic modeling framework, which provides a more efficient use of multiple data sources than methods that analyze different data sources separately. Moreover, BGMM provides an exceedingly flexible modeling framework since many data sources can be modeled as Gaussian or beta distributed random variables, and it can also be extended to integrate data that have other parametric distributions as well, which adds even more flexibility to this model-based clustering framework. We developed three types of estimation algorithms for BGMM, the standard expectation maximization (EM) algorithm, an approximated EM and a hybrid EM, and propose to tackle the model selection problem by well-known model selection criteria, for which we test the Akaike information criterion (AIC), a modified AIC (AIC3), the Bayesian information criterion (BIC), and the integrated classification likelihood-BIC (ICL-BIC).
Performance tests with simulated data show that combining two different data sources into a single mixture joint model greatly improves the clustering accuracy compared with either of its two extreme cases, GMM or BMM. Applications with real mouse gene expression data (modeled as Gaussian distribution) and protein-DNA binding probabilities (modeled as beta distribution) also demonstrate that BGMM can yield more biologically reasonable results compared with either of its two extreme cases. One of our applications has found three groups of genes that are likely to be involved in Myd88-dependent Toll-like receptor 3/4 (TLR-3/4) signaling cascades, which might be useful to better understand the TLR-3/4 signal transduction.
PMCID: PMC2717092  PMID: 19480678
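The four model selection criteria named in the abstract above can be written down compactly. The sketch below uses their standard textbook definitions (lower is better); the function names, the `fits` dictionary and its numbers are hypothetical and are not taken from the paper's implementation.

```python
import math


def aic(log_lik, n_params):
    """Akaike information criterion."""
    return -2.0 * log_lik + 2.0 * n_params


def aic3(log_lik, n_params):
    """Modified AIC with a penalty of 3 per free parameter."""
    return -2.0 * log_lik + 3.0 * n_params


def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion."""
    return -2.0 * log_lik + n_params * math.log(n_obs)


def icl_bic(log_lik, n_params, posteriors):
    """BIC plus twice the entropy of the posterior cluster memberships.

    posteriors holds one list of membership probabilities per observation.
    """
    entropy = -sum(p * math.log(p) for row in posteriors for p in row if p > 0)
    return bic(log_lik, n_params, len(posteriors)) + 2.0 * entropy


# Hypothetical fitted models: number of clusters -> (log-likelihood, free parameters).
fits = {2: (-520.0, 9), 3: (-480.0, 14), 4: (-476.0, 19)}
n_obs = 200
best_k = min(fits, key=lambda k: bic(fits[k][0], fits[k][1], n_obs))
print("clusters selected by BIC:", best_k)
```

ICL-BIC additionally penalizes poorly separated clusters, so it coincides with BIC when every observation is assigned to one cluster with probability one.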
20.  Probabilistic Inference of Transcription Factor Binding from Multiple Data Sources 
PLoS ONE  2008;3(3):e1820.
An important problem in molecular biology is to build a complete understanding of transcriptional regulatory processes in the cell. We have developed a flexible, probabilistic framework to predict TF binding from multiple data sources that differs from the standard hypothesis testing (scanning) methods in several ways. Our probabilistic modeling framework estimates the probability of binding and, thus, naturally reflects our degree of belief in binding. Probabilistic modeling also allows for easy and systematic integration of our binding predictions into other probabilistic modeling methods, such as expression-based gene network inference. The method answers the question of whether the whole analyzed promoter has a binding site, but can also be extended to estimate the binding probability at each nucleotide position. Further, we introduce an extension to model combinatorial regulation by several TFs. Most importantly, the proposed methods can make principled probabilistic inference from multiple evidence sources, such as multiple statistical models (motifs) of the TFs, evolutionary conservation, regulatory potential, CpG islands, nucleosome positioning, DNase hypersensitive sites, ChIP-chip binding segments and other (prior) sequence-based biological knowledge. We developed both a likelihood and a Bayesian method, where the latter is implemented with a Markov chain Monte Carlo algorithm. Results on a carefully constructed test set from the mouse genome demonstrate that principled data fusion can significantly improve the performance of TF binding prediction methods. We also applied the probabilistic modeling framework to all promoters in the mouse genome and the results indicate a sparse connectivity between transcriptional regulators and their target promoters. To facilitate analysis of other sequences and additional data, we have developed an on-line web tool, ProbTF, which implements our probabilistic TF binding prediction method using multiple data sources.
Test data set, a web tool, source codes and supplementary data are available at:
PMCID: PMC2268002  PMID: 18364997
21.  Relationships between probabilistic Boolean networks and dynamic Bayesian networks as models of gene regulatory networks 
Signal processing  2006;86(4):814-834.
A significant amount of attention has recently been focused on modeling of gene regulatory networks. Two frequently used large-scale modeling frameworks are Bayesian networks (BNs) and Boolean networks, the latter being a special case of its recent stochastic extension, probabilistic Boolean networks (PBNs). PBNs are a promising model class that generalizes the standard rule-based interactions of Boolean networks into the stochastic setting. Dynamic Bayesian networks (DBNs) are a general and versatile model class that is able to represent complex temporal stochastic processes and has also been proposed as a model for gene regulatory systems. In this paper, we concentrate on these two model classes and demonstrate that PBNs and a certain subclass of DBNs can represent the same joint probability distribution over their common variables. The major benefit of introducing the relationships between the models is that it opens up the possibility of applying the standard tools of DBNs to PBNs and vice versa. Hence, the standard learning tools of DBNs can be applied in the context of PBNs, and the inference methods give a natural way of handling missing values, which are often present in gene expression measurements, in PBNs. Conversely, the tools for controlling the stationary behavior of the networks, tools for projecting networks onto sub-networks, and efficient learning schemes can be used for DBNs. In other words, the introduced relationships between the models extend the collection of analysis tools for both model classes.
PMCID: PMC1847796  PMID: 17415411
Gene regulatory networks; Probabilistic Boolean networks; Dynamic Bayesian networks
22.  Robust regression for periodicity detection in non-uniformly sampled time-course gene expression data 
BMC Bioinformatics  2007;8:233.
In practice many biological time series measurements, including gene microarrays, are conducted at time points that seem to be interesting in the biologist's opinion and not necessarily at fixed time intervals. In many circumstances we are interested in finding targets that are expressed periodically. To tackle the problems of uneven sampling and unknown type of noise in periodicity detection, we propose to use robust regression.
The aim of this paper is to develop a general framework for robust periodicity detection and review and rank different approaches by means of simulations. We also show the results for some real measurement data.
The simulation results clearly show that when the sampling of time series gets more and more uneven, the methods that assume even sampling become unusable. We find that M-estimation provides a good compromise between robustness and computational efficiency.
Since uneven sampling occurs often in biological measurements, the robust methods developed in this paper are expected to have many uses. The regression based formulation of the periodicity detection problem easily adapts to non-uniform sampling. Using robust regression helps to reject inconsistently behaving data points.
The implementations are currently available for Matlab and will be made available for the users of R as well. More information can be found in the web-supplement [1].
PMCID: PMC1934414  PMID: 17605777
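The regression formulation mentioned in the abstract above can be sketched as follows: for a candidate frequency, a sinusoid plus offset is fitted to the (possibly unevenly spaced) time points, and M-estimation is carried out by iteratively reweighted least squares. This is a minimal sketch under stated assumptions, not the authors' Matlab implementation: Huber weights with tuning constant 1.345, a median-based residual scale, and the function names and example data are all choices made here for illustration.

```python
import math
from statistics import median


def solve(a, b):
    """Solve the small linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def huber_sinusoid_fit(times, values, freq, k=1.345, iters=50):
    """Fit y = a*cos(wt) + b*sin(wt) + c with Huber weights; iters=1 is ordinary least squares."""
    w = 2.0 * math.pi * freq
    rows = [[math.cos(w * t), math.sin(w * t), 1.0] for t in times]
    weights = [1.0] * len(times)
    coef = [0.0, 0.0, 0.0]
    for _ in range(iters):
        # Weighted normal equations for the three regression coefficients.
        ata = [[sum(wt * r[i] * r[j] for wt, r in zip(weights, rows)) for j in range(3)]
               for i in range(3)]
        atb = [sum(wt * r[i] * y for wt, r, y in zip(weights, rows, values)) for i in range(3)]
        coef = solve(ata, atb)
        resid = [y - sum(c * x for c, x in zip(coef, r)) for r, y in zip(rows, values)]
        # Robust residual scale from the median absolute residual.
        scale = max(median(abs(e) for e in resid) / 0.6745, 1e-12)
        weights = [1.0 if abs(e) <= k * scale else k * scale / abs(e) for e in resid]
    return coef


# Hypothetical, unevenly sampled series with period 4 and one gross outlier.
times = [0.0, 0.7, 1.1, 2.0, 2.4, 3.3, 4.1, 4.6, 5.9, 6.2,
         7.5, 8.0, 8.8, 9.9, 10.3, 11.6, 12.2, 13.7, 14.1, 15.0]
clean = [2.0 * math.cos(2.0 * math.pi * 0.25 * t) + 0.5 for t in times]
noisy = list(clean)
noisy[5] += 10.0

a, b, c = huber_sinusoid_fit(times, noisy, freq=0.25)
print("robust amplitude:", math.hypot(a, b), "offset:", c)
```

Because the design matrix is built directly from the observed time points, no resampling onto a regular grid is needed; scanning `freq` over a set of candidate frequencies would turn this fit into a periodicity search.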
23.  Robust detection of periodic time series measured from biological systems 
BMC Bioinformatics  2005;6:117.
Periodic phenomena are widespread in biology. The problem of finding periodicity in biological time series can be viewed as a multiple hypothesis testing of the spectral content of a given time series. The exact noise characteristics are unknown in many bioinformatics applications. Furthermore, the observed time series can exhibit other non-idealities, such as outliers, short length and distortion from the original wave form. Hence, the computational methods should preferably be robust against such anomalies in the data.
We propose a general-purpose robust testing procedure for finding periodic sequences in multiple time series data. The proposed method is based on a robust spectral estimator which is incorporated into the hypothesis testing framework using a so-called g-statistic together with correction for multiple testing. This results in a robust testing procedure which is insensitive to heavy contamination by outliers, missing values, short time series and nonlinear distortions, and is completely insensitive to any monotone nonlinear distortion. The performance of the methods is evaluated by performing extensive simulations. In addition, we compare the proposed method with another recent statistical signal detection estimator that uses Fisher's test, based on the Gaussian noise assumption. The results demonstrate that the proposed robust method provides remarkably better robustness properties. Moreover, the performance of the proposed method is preferable also in the standard Gaussian case. We validate the performance of the proposed method on real data on which the method performs very favorably.
As the time series measured from biological systems are usually short and prone to contain different kinds of non-idealities, we are very optimistic about the multitude of possible applications for our proposed robust statistical periodicity detection method.
The presented methods have been implemented in Matlab and in R. Codes are available on request. Supplementary material is available at: .
PMCID: PMC1168888  PMID: 15892890
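For orientation, the classical (non-robust) Fisher g-test that the abstract above uses as a comparison can be sketched in a few lines. This is not the robust spectral estimator proposed in the paper: it assumes evenly sampled data and Gaussian noise, and the function names and example series are hypothetical. In a genome-wide setting the resulting p-values would still need a multiple testing correction.

```python
import cmath
import math


def periodogram(x):
    """Periodogram ordinates at the Fourier frequencies k = 1 .. (n - 1) // 2."""
    n = len(x)
    mean = sum(x) / n
    m = (n - 1) // 2
    return [abs(sum((v - mean) * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(x))) ** 2 / n
            for k in range(1, m + 1)]


def fisher_g(x):
    """g-statistic: largest periodogram ordinate divided by the sum of all ordinates."""
    p = periodogram(x)
    total = sum(p)
    if total == 0:
        raise ValueError("constant series has no spectral content")
    return max(p) / total


def fisher_p_value(g, m):
    """Exact p-value of Fisher's test for m periodogram ordinates under Gaussian white noise."""
    upper = min(m, int(1.0 / g))
    return sum((-1) ** (j - 1) * math.comb(m, j) * (1.0 - j * g) ** (m - 1)
               for j in range(1, upper + 1))


# Hypothetical example: a noise-free sinusoid with three cycles over 24 time points.
series = [math.sin(2.0 * math.pi * 3 * t / 24) for t in range(24)]
g = fisher_g(series)
print("g =", g, "p =", fisher_p_value(g, (len(series) - 1) // 2))
```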
24.  In silico microdissection of microarray data from heterogeneous cell populations 
BMC Bioinformatics  2005;6:54.
Very few analytical approaches have been reported to resolve the variability in microarray measurements stemming from sample heterogeneity. For example, tissue samples used in cancer studies are usually contaminated with the surrounding or infiltrating cell types. This heterogeneity in the sample preparation hinders further statistical analysis, significantly so if different samples contain different proportions of these cell types. Thus, sample heterogeneity can result in the identification of differentially expressed genes that may be unrelated to the biological question being studied. Similarly, irrelevant gene combinations can be discovered in the case of gene expression based classification.
We propose a computational framework for removing the effects of sample heterogeneity by "microdissecting" microarray data in silico. The computational method provides estimates of the expression values of the pure (non-heterogeneous) cell samples. The inversion of the sample heterogeneity can be facilitated by providing accurate estimates of the mixing percentages of different cell types in each measurement. For those cases where no such information is available, we develop an optimization-based method for joint estimation of the mixing percentages and the expression values of the pure cell samples. We also consider the problem of selecting the correct number of cell types.
The efficiency of the proposed methods is illustrated by applying them to carefully controlled cDNA microarray data obtained from heterogeneous samples. The results demonstrate that the methods are capable of reconstructing both the sample and cell type specific expression values from heterogeneous mixtures and that the mixing percentages of different cell types can also be estimated. Furthermore, a general purpose model selection method can be used to select the correct number of cell types.
PMCID: PMC1274251  PMID: 15766384
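The case with known mixing percentages, as described in the abstract above, can be illustrated with a minimal sketch. It assumes a linear mixing model with exactly two cell types, so that each measured value is the fraction-weighted sum of the two pure expression values; the function name and the numbers are hypothetical, and the joint estimation of unknown mixing percentages is not covered.

```python
def unmix_two_types(fractions, observed):
    """Least-squares estimate of pure expression values for two cell types.

    fractions: fraction of cell type A in each sample (type B is 1 - fraction).
    observed:  measured expression of one gene in each sample.
    Assumed model: observed_i = fraction_i * expr_a + (1 - fraction_i) * expr_b.
    """
    saa = sum(f * f for f in fractions)
    sab = sum(f * (1.0 - f) for f in fractions)
    sbb = sum((1.0 - f) ** 2 for f in fractions)
    say = sum(f * y for f, y in zip(fractions, observed))
    sby = sum((1.0 - f) * y for f, y in zip(fractions, observed))
    det = saa * sbb - sab * sab
    if abs(det) < 1e-12:
        raise ValueError("mixing fractions must differ between samples")
    # Closed-form solution of the 2x2 normal equations.
    expr_a = (sbb * say - sab * sby) / det
    expr_b = (saa * sby - sab * say) / det
    return expr_a, expr_b


# Hypothetical example: pure expression 10.0 (type A) and 2.0 (type B).
fractions = [0.2, 0.5, 0.9]
observed = [f * 10.0 + (1.0 - f) * 2.0 for f in fractions]
print(unmix_two_types(fractions, observed))
```

The estimate is only identifiable when the samples differ in composition, which is why identical mixing fractions are rejected.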
