PMCC

1.  STEME: A Robust, Accurate Motif Finder for Large Data Sets 
PLoS ONE  2014;9(3):e90735.
Motif finding is a difficult problem that has been studied for over 20 years. Some older popular motif finders are not suitable for analysis of the large data sets generated by next-generation sequencing. We recently published an efficient approximation (STEME) to the EM algorithm that is at the core of many motif finders such as MEME. This approximation allows the EM algorithm to be applied to large data sets. In this work we describe several efficient extensions to STEME that are based on the MEME algorithm. Together with the original STEME EM approximation, these extensions make STEME a fully-fledged motif finder with similar properties to MEME. We discuss the difficulty of objectively comparing motif finders. We show that STEME performs comparably to existing prominent discriminative motif finders, DREME and Trawler, on 13 sets of transcription factor binding data in mouse ES cells. We demonstrate the ability of STEME to find long degenerate motifs which these discriminative motif finders do not find. As part of our method, we extend an earlier method due to Nagarajan et al. for the efficient calculation of motif E-values. STEME's source code is available under an open source license and STEME is available via a web interface.
doi:10.1371/journal.pone.0090735
PMCID: PMC3953122  PMID: 24625410
2.  Non-Replicating Mycobacterium tuberculosis Elicits a Reduced Infectivity Profile with Corresponding Modifications to the Cell Wall and Extracellular Matrix 
PLoS ONE  2014;9(2):e87329.
A key feature of Mycobacterium tuberculosis is its ability to become dormant in the host. Little is known of the mechanisms by which these bacilli are able to persist in this state. Therefore, the focus of this study was to emulate environmental conditions encountered by M. tuberculosis in the granuloma, and determine the effect of such conditions on the physiology and infectivity of the organism. Non-replicating persistent (NRP) M. tuberculosis was established by the gradual depletion of nutrients in an oxygen-replete and controlled environment. In contrast to rapidly dividing bacilli, NRP bacteria exhibited a distinct phenotype by accumulating an extracellular matrix rich in free mycolate and lipoglycans, with increased arabinosylation. Microarray studies demonstrated a substantial down-regulation of genes involved in energy metabolism in NRP bacteria. Despite this reduction in metabolic activity, cells were still able to infect guinea pigs, but with a delay in the development of disease when compared to exponential phase bacilli. Using these approaches to investigate the interplay between the changing environment of the host and altered physiology of NRP bacteria, this study sheds new light on the conditions that are pertinent to M. tuberculosis dormancy and how this organism could be establishing latent disease.
doi:10.1371/journal.pone.0087329
PMCID: PMC3916317  PMID: 24516549
3.  Adaptive enrichment in biomarker-stratified clinical trial design 
Trials  2013;14(Suppl 1):O11.
doi:10.1186/1745-6215-14-S1-O11
PMCID: PMC3980333
4.  Two novel pathway analysis methods based on a hierarchical model 
Bioinformatics  2013;30(5):690-697.
Motivation: Over the past few years several pathway analysis methods have been proposed for exploring and enhancing the analysis of genome-wide association data. Hierarchical models have been advocated as a way to integrate SNP and pathway effects in the same model, but their computational complexity has prevented them being applied on a genome-wide scale to date.
Methods: We present two novel methods for identifying associated pathways. In the proposed hierarchical model, the SNP effects are analytically integrated out of the analysis, allowing computationally tractable model fitting to genome-wide data. The first method uses Bayes factors for calculating the effect of the pathways, whereas the second method uses a machine learning algorithm and adaptive lasso for finding a sparse solution of associated pathways.
Results: The performance of the proposed methods was explored on both simulated and real data. The results of the simulation study showed that the methods outperformed some well-established association methods: the commonly used Fisher’s method for combining P-values and also the recently published BGSA. The methods were applied to two genome-wide association study datasets that aimed to find the genetic structure of platelet function and body mass index, respectively. The results of the analyses replicated the results of previously published pathway analysis of these phenotypes but also identified novel pathways that are potentially involved.
Availability: An R package is under preparation. In the meantime, the scripts of the methods are available on request from the authors.
Contact: marina.evangelou@cimr.cam.ac.uk
Supplementary Information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btt583
PMCID: PMC3933872  PMID: 24123673
5.  Testing the Utility of an Integrated Analysis of Copy Number and Transcriptomics Datasets for Inferring Gene Regulatory Relationships 
PLoS ONE  2013;8(5):e63780.
Correlation patterns between matched copy number variation and gene expression data in cancer samples enable the inference of causal gene regulatory relationships by exploiting the natural randomization of such systems. The aim of this study was to test and verify experimentally the accuracy of a causal inference approach based on genomic randomization using esophageal cancer samples. Two candidates with strong regulatory effects emerging from our analysis are components of growth factor receptors, and implicated in cancer development, namely ERBB2 and FGFR2. We tested experimentally two ERBB2 and three FGFR2 regulated interactions predicted by the statistical analysis, all of which were confirmed. We also applied the method in a meta-analysis of 10 cancer datasets and tested 15 of the predicted regulatory interactions experimentally. Three additional predicted ERBB2 regulated interactions were confirmed, as well as interactions regulated by ARPC1A and FANCG. Overall, two thirds of experimentally tested predictions were confirmed.
doi:10.1371/journal.pone.0063780
PMCID: PMC3667814  PMID: 23737949
6.  Cell signalling regulates dynamics of Nanog distribution in embryonic stem cell populations 
A population of mouse embryonic stem (ES) cells is characterized by a distribution of Nanog, a gene whose expression is associated with the degree of pluripotency. Cells exhibiting high levels of Nanog maintain a state of pluripotency, while those with low levels are more likely to undergo differentiation. A cell line with a fluorescent tag for Nanog enables measurement of the distribution of Nanog in an ES cell culture in a stationary state or after a perturbation. In order to model the dynamics of the system, we assume that the distribution of Nanog-GFP for single cells shows distinct attractor steady states of Nanog levels, with individual cells moving between these states stochastically. The addition of synthetic inhibitors of signal transduction induces strong shifts in the distribution of Nanog. In particular, the addition of PD03 and Chiron, inhibitors of the ERK and GSK3 signalling pathways respectively, induces a high level of Nanog. In this study, we placed ES cells in different culture conditions, including the above inhibitors, and recorded the change in Nanog-GFP distribution over several days. In order to interpret the measurements of Nanog levels, we propose a new stochastic modelling strategy for the dynamics of the system that does not require detailed knowledge of regulatory or signalling mechanisms, while still capturing both the stochastic and the deterministic components of the dynamics. Despite its relative simplicity, the model provides an insight into key features of the cell population under various conditions, including the level of noise and occupancy and location of attractor steady states, without the need for strong assumptions about the underlying cellular mechanisms. By applying the model to our experimental data, we infer the existence of three stable steady states for Nanog levels, which are the same in all the different conditions of the cell-culture medium. Noise, on the other hand, and the proportion of cells in each steady state are subject to large shifts. Surprisingly, the isolated effects of PD03 and Chiron on noise and dynamics of the system are quite different from their combined effect. Our results show that signalling determines the occupancy of each state, with a particular role for GSK3 in the regulation of the noise across the population.
doi:10.1098/rsif.2012.0525
PMCID: PMC3565777  PMID: 23054952
stochastic system; embryonic stem cell; Nanog regulation
7.  Transcription factor and chromatin features predict genes associated with eQTLs 
Nucleic Acids Research  2012;41(3):1450-1463.
Cell type–specific gene expression in humans involves complex interactions between regulatory factors and DNA at enhancers and promoters. Mapping studies for expression quantitative trait loci (eQTLs), transcription factors (TFs) and chromatin markers have become widely used tools for identifying gene regulatory elements, but prediction of target genes remains a major challenge. Here, we integrate genome-wide data on TF-binding sites, chromatin markers and functional annotations to predict genes associated with human eQTLs. Using the random forest classifier, we found that genomic proximity plus five TF and chromatin features are able to predict >90% of target genes within 1 megabase of eQTLs. Despite being regularly used to map target genes, proximity is not a good indicator of eQTL targets for genes 150 kilobases away; insulators, TF co-occurrence, open chromatin and functional similarities between TFs and genes are better indicators. Using all six features in the classifier achieved an area under the specificity and sensitivity curve of 0.91, compared with at most 0.75 for any single feature. We hope this study will not only provide validation of eQTL-mapping studies, but also provide insight into the molecular mechanisms explaining how genetic variation can influence gene expression.
doi:10.1093/nar/gks1339
PMCID: PMC3561974  PMID: 23275551
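As a rough illustration of the area-under-the-curve figure quoted in the abstract above, the following Python sketch computes an ROC AUC from classifier scores using the pairwise (Mann-Whitney) formulation. The function name `roc_auc` and the toy scores are hypothetical; they are not taken from the paper, which applied a random forest classifier to genomic features.

```python
def roc_auc(scores, labels):
    """Return the probability that a randomly chosen positive example
    scores higher than a randomly chosen negative one (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical scores: label 1 = true eQTL target gene, 0 = non-target.
example_auc = roc_auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
```

The quadratic pairwise loop is chosen for readability; a rank-based version would scale better to genome-wide gene sets.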
8.  Displaying R spatial statistics on Google dynamic maps with web applications created by Rwui 
Background
The R project includes a large variety of packages designed for spatial statistics. Google dynamic maps provide web based access to global maps and satellite imagery. We describe a method for displaying directly the spatial output from an R script on to a Google dynamic map.
Methods
This is achieved by creating a Java based web application which runs the R script and then displays the results on the dynamic map. In order to make this method easy to implement by those unfamiliar with programming Java based web applications, we have added the method to the options available in the R Web User Interface (Rwui) application. Rwui is an established web application for creating web applications for running R scripts. A feature of Rwui is that all the code for the web application being created is generated automatically so that someone with no knowledge of web programming can make a fully functional web application for running an R script in a matter of minutes.
Results
Rwui can now be used to create web applications that will display the results from an R script on a Google dynamic map. Results may be displayed as discrete markers and/or as continuous overlays. In addition, users of the web application may select regions of interest on the dynamic map with mouse clicks and the coordinates of the region of interest will automatically be made available for use by the R script.
Conclusions
This method of displaying R output on dynamic maps is designed to be of use in a number of areas. Firstly it allows statisticians, working in R and developing methods in spatial statistics, to easily visualise the results of applying their methods to real world data. Secondly, it allows researchers who are using R to study health geographics data, to display their results directly onto dynamic maps. Thirdly, by creating a web application for running an R script, a statistician can enable users entirely unfamiliar with R to run R coded statistical analyses of health geographics data. Fourthly, we envisage an educational role for such applications.
doi:10.1186/1476-072X-11-41
PMCID: PMC3548681  PMID: 22998945
9.  Comparison of Methods for Competitive Tests of Pathway Analysis 
PLoS ONE  2012;7(7):e41018.
It has been suggested that pathway analysis can complement single-SNP analysis in exploring genomewide association data. Pathway analysis incorporates the available biological knowledge of genes and SNPs and is expected to improve the chances of revealing the underlying genetic architecture of complex traits. Methods for pathway analysis can be classified as competitive (enrichment) or self-contained (association) according to the hypothesis tested. Although association tests are statistically more powerful than enrichment tests they can be difficult to calibrate because biases in analysis accumulate across multiple SNPs or genes. Furthermore, enrichment tests can be more scientifically relevant than association tests, as they detect pathways with relatively more evidence for association than the remaining genes. Here we show how some well known association tests can be simply adapted to test for enrichment, and compare their performance to some established enrichment tests. We propose versions of the Adaptive Rank Truncated Product (ARTP), Tail Strength Measure and Fisher’s combination of p-values for testing the enrichment null hypothesis. We compare the behaviour of these proposed methods with the established Hypergeometric Test and Gene-Set Enrichment Analysis (GSEA). The results of the simulation study show that the modified version of the ARTP method has generally the best performance across the situations considered. The methods were also applied for finding enriched pathways for body mass index (BMI) and platelet function phenotypes. The pathway analysis of BMI identified the Vasoactive Intestinal Peptide pathway as significantly associated with BMI. This pathway has been previously reported as associated with BMI and the risk of obesity. The ARTP method was the method that identified the largest number of enriched pathways across all tested pathway databases and phenotypes. 
The simulation and data application results agree with previous work on association tests and suggest that the ARTP should be preferred for both enrichment and association testing.
doi:10.1371/journal.pone.0041018
PMCID: PMC3409204  PMID: 22859961
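Two of the tests named in the abstract above have simple textbook forms, sketched here in Python using only the standard library. This shows the standard Fisher combination of p-values and the standard hypergeometric enrichment test, not the modified enrichment versions the authors propose; the function names and argument names are hypothetical.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Fisher's method: X = -2 * sum(ln p_i) follows a chi-square
    distribution with 2k degrees of freedom under the null."""
    k = len(pvalues)
    half = -sum(math.log(p) for p in pvalues)  # equals X / 2
    # Closed-form chi-square survival function for even degrees of freedom.
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))


def hypergeometric_enrichment_pvalue(total_genes, pathway_genes, selected_genes, overlap):
    """P(at least `overlap` pathway genes among `selected_genes` drawn
    without replacement from `total_genes`)."""
    upper = min(pathway_genes, selected_genes)
    numerator = sum(
        math.comb(pathway_genes, i)
        * math.comb(total_genes - pathway_genes, selected_genes - i)
        for i in range(overlap, upper + 1)
    )
    return numerator / math.comb(total_genes, selected_genes)
```

For example, `hypergeometric_enrichment_pvalue(10, 5, 2, 2)` evaluates 10/45, the chance that both of two selected genes fall in a five-gene pathway out of ten genes.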
10.  Transcription factor co-localization patterns affect human cell type-specific gene expression 
BMC Genomics  2012;13:263.
Background
Cellular development requires the precise control of gene expression states. Transcription factors are involved in this regulatory process through their combinatorial binding with DNA. Information about transcription factor binding sites can help determine which combinations of factors work together to regulate a gene, but it is unclear how far the binding data from one cell type can inform about regulation in other cell types.
Results
By integrating data on co-localized transcription factor binding sites in the K562 cell line with expression data across 38 distinct hematopoietic cell types, we developed regression models to describe the relationship between the expression of target genes and the transcription factors that co-localize nearby. With K562 binding sites identifying the predictors, the proportion of expression explained by the models is statistically significant only for monocytic cells (p-value < 0.001), which are closely related to K562. That is, cell type specific binding patterns are crucial for choosing the correct transcription factors for the model. Comparison of predictors obtained from binding sites in the GM12878 cell line with those from K562 shows that the amount of difference between binding patterns is directly related to the quality of the prediction. By identifying individual genes whose expression is predicted accurately by the binding sites, we are able to link transcription factors FOS, TAF1 and YY1 to a sparsely studied gene LRIG2. We also find that the activity of a transcription factor may be different depending on the cell type and the identity of other co-localized factors.
Conclusion
Our approach shows that gene expression can be explained by a modest number of co-localized transcription factors; however, information on cell-type specific binding is crucial for understanding combinatorial gene regulation.
doi:10.1186/1471-2164-13-263
PMCID: PMC3441573  PMID: 22721266
Transcriptional regulation; Gene expression; ChIP-Seq; Regression modeling
11.  STEME: efficient EM to find motifs in large data sets 
Nucleic Acids Research  2011;39(18):e126.
MEME and many other popular motif finders use the expectation–maximization (EM) algorithm to optimize their parameters. Unfortunately, the running time of EM is linear in the length of the input sequences. This can prohibit its application to data sets of the size commonly generated by high-throughput biological techniques. A suffix tree is a data structure that can efficiently index a set of sequences. We describe an algorithm, Suffix Tree EM for Motif Elicitation (STEME), that approximates EM using suffix trees. To the best of our knowledge, this is the first application of suffix trees to EM. We provide an analysis of the expected running time of the algorithm and demonstrate that STEME runs an order of magnitude more quickly than the implementation of EM used by MEME. We give theoretical bounds for the quality of the approximation and show that, in practice, the approximation has a negligible effect on the outcome. We provide an open source implementation of the algorithm that we hope will be used to speed up existing and future motif search algorithms.
doi:10.1093/nar/gkr574
PMCID: PMC3185442  PMID: 21785132
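To make concrete why a plain EM iteration costs time linear in the input length, here is a minimal Python sketch of a naive E-step for a two-component (motif versus background) mixture over every window of a sequence. It is a simplified illustration under assumed names (`site_posteriors`, `pwm`, `background`, `prior`) and does not reproduce MEME's model or STEME's suffix-tree approximation.

```python
def site_posteriors(seq, pwm, background, prior):
    """For each window of len(pwm) bases, return the posterior probability
    that it was generated by the motif rather than the background.

    pwm        -- list of dicts, one per motif column, base -> probability
    background -- dict, base -> probability
    prior      -- prior probability that any given window is a motif site
    """
    width = len(pwm)
    posteriors = []
    # One pass over all windows: the cost grows linearly with len(seq).
    for start in range(len(seq) - width + 1):
        p_motif = prior
        p_background = 1.0 - prior
        for offset in range(width):
            base = seq[start + offset]
            p_motif *= pwm[offset][base]
            p_background *= background[base]
        posteriors.append(p_motif / (p_motif + p_background))
    return posteriors


# Hypothetical three-column motif that prefers "ACG".
def _column(preferred):
    return {b: (0.85 if b == preferred else 0.05) for b in "ACGT"}

toy_pwm = [_column("A"), _column("C"), _column("G")]
uniform = {b: 0.25 for b in "ACGT"}
toy_posteriors = site_posteriors("TTACGTT", toy_pwm, uniform, prior=0.1)
```

In a full EM loop the M-step would re-estimate the motif columns from these posteriors; the suffix-tree idea described in the abstract aims to avoid visiting every window individually.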
12.  Lipid composition and transcriptional response of Mycobacterium tuberculosis grown under iron-limitation in continuous culture: identification of a novel wax ester 
Microbiology (Reading, England)  2007;153(Pt 5):1435-1444.
The low level of available iron in vivo is a major obstacle for microbial pathogens and is a stimulus for the expression of virulence genes. In this study, Mycobacterium tuberculosis H37Rv was grown aerobically in the presence of limited iron availability in chemostat culture to determine the physiological response of the organism to iron-limitation. A previously unidentified wax ester accumulated under iron-limited growth, and changes in the abundance of triacylglycerol and menaquinone were also observed between iron-replete and iron-limited chemostat cultures. DNA microarray analysis revealed differential expression of genes involved in glycerolipid metabolism and isoprenoid quinone biosynthesis, providing some insight into the underlying genetic changes that correlate with cell-wall lipid profiles of M. tuberculosis growing in an iron-limited environment.
doi:10.1099/mic.0.2006/004317-0
PMCID: PMC3123377  PMID: 17464057
13.  Creating web applications for spatial epidemiological analysis and mapping in R using Rwui 
Background
Creating a user friendly web based application which executes an R script allows physicians, epidemiologists, and others unfamiliar with the statistical language to perform powerful statistical analyses easily. The geographic mapping of data is an important tool in spatial epidemiological analysis, and the R project includes many tools for such analyses, but few for visualization. Hence, web applications that run R for epidemiological analysis need to be able to present the results in a geographic format.
Results
Rwui is a web application for creating web based applications for running R scripts. We describe updates to Rwui that enable it to create web applications for R scripts which return the results of the analysis to the web page as geographic maps.
Conclusions
Rwui enables statisticians to create web applications for R scripts without the need to learn web programming. Creating a web application provides users access to an R based analysis without the need to learn R. Recent updates to Rwui have increased its applicability in the field of spatial epidemiological analysis.
doi:10.1186/1751-0473-6-6
PMCID: PMC3077334  PMID: 21457547
14.  Empirical Bayesian models for analysing molecular serotyping microarrays 
BMC Bioinformatics  2011;12:88.
Background
Microarrays offer great potential as a platform for molecular diagnostics, testing clinical samples for the presence of numerous biomarkers in highly multiplexed assays. In this study, applied to infectious diseases, we used data from a microarray designed for molecular serotyping of Streptococcus pneumoniae, which identifies the presence of any one of 91 known pneumococcal serotypes from DNA extracts. This microarray incorporated oligonucleotide probes for all known capsular polysaccharide synthesis genes and required a statistical analysis of the microarray intensity data to determine which serotype, or combination of serotypes, was present within a sample based on the combination of genes detected.
Results
We propose an empirical Bayesian model for calculating the probabilities of combinations of serotypes from the microarray data. The model takes into consideration the dependencies between serotypes, induced by genes they have in common, and by homologous genes which, although not identical, are similar to each other in sequence. For serotypes which are very similar in capsular gene composition, extra probes are included on the microarray, providing additional information which is integrated into the Bayesian model. For each serotype combination with high probability, a second model, a Bayesian random effects model is applied to determine the relative abundance of each serotype.
Conclusions
To assess the accuracy of the proposed analysis we applied our methods to experimental data from samples containing individual serotypes and samples containing combinations of serotypes with known levels of abundance. All but two of the known serotypes of S. pneumoniae that were tested as individual samples could be uniquely determined by the Bayesian model. The model also enabled the presence of combinations of serotypes within samples to be determined. Serotypes with very low abundance within a combination of serotypes can be detected (down to 2% abundance in this study). As well as detecting the presence of serotype combinations, an approximate measure of the percentage abundance of the serotypes within the combination can be obtained.
doi:10.1186/1471-2105-12-88
PMCID: PMC3076268  PMID: 21453458
15.  An integrated machine learning approach for predicting DosR-regulated genes in Mycobacterium tuberculosis 
BMC Systems Biology  2010;4:37.
Background
DosR is an important regulator of the response to stress such as limited oxygen availability in Mycobacterium tuberculosis. Time course gene expression data enable us to dissect this response on the gene regulatory level. The mRNA expression profile of a regulator, however, is not necessarily a direct reflection of its activity. Knowing the transcription factor activity (TFA) can be exploited to predict novel target genes regulated by the same transcription factor. Various approaches have been proposed to reconstruct TFAs from gene expression data. Most of them capture only a first-order approximation to the complex transcriptional processes by assuming linear gene responses and linear dynamics in TFA, or ignore the temporal information in data from such systems.
Results
In this paper, we approach the problem of inferring dynamic hidden TFAs using Gaussian processes (GP). We are able to model dynamic TFAs and to account for both linear and nonlinear gene responses. To test the validity of the proposed approach, we reconstruct the hidden TFA of p53, a tumour suppressor activated by DNA damage, using published time course gene expression data. Our reconstructed TFA is closer to the experimentally determined profile of p53 concentration than that from the original study. We then apply the model to time course gene expression data obtained from chemostat cultures of M. tuberculosis under reduced oxygen availability. After estimation of the TFA of DosR based on a number of known target genes using the GP model, we predict novel DosR-regulated genes: the parameters of the model are interpreted as relevance parameters indicating an existing functional relationship between TFA and gene expression. We further improve the prediction by integrating promoter sequence information in a logistic regression model. Apart from the documented DosR-regulated genes, our prediction yields ten novel genes under direct control of DosR.
Conclusions
Chemostat cultures are an ideal experimental system for controlling noise and variability when monitoring the response of bacterial organisms such as M. tuberculosis to finely controlled changes in culture conditions and available metabolites. Nonlinear hidden TFA dynamics of regulators can be reconstructed remarkably well with Gaussian processes from such data. Moreover, estimated parameters of the GP can be used to assess whether a gene is controlled by the reconstructed TFA or not. It is straightforward to combine these parameters with further information, such as the presence of binding motifs, to increase prediction accuracy.
doi:10.1186/1752-0509-4-37
PMCID: PMC2867773  PMID: 20356371
16.  Statistical model comparison applied to common network motifs 
BMC Systems Biology  2010;4:18.
Background
Network motifs are small modules that show interesting functional and dynamic properties, and are believed to be the building blocks of complex cellular processes. However, the mechanistic details of such modules are often unknown: there is uncertainty about the motif architecture as well as the functional form and parameter values when converted to ordinary differential equations (ODEs). This translates into a number of candidate models being compatible with the system under study. A variety of statistical methods exist for ranking models including maximum likelihood-based and Bayesian methods. Our objective is to show how such methods can be applied in a typical systems biology setting.
Results
We focus on four commonly occurring network motif structures and show that it is possible to differentiate between them using simulated data and any of the model comparison methods tested. We expand one of the motifs, the feed forward (FF) motif, for several possible parameterizations and apply model selection on simulated data. We then use experimental data on three biosynthetic pathways in Escherichia coli to formally assess how current knowledge matches the time series available. Our analysis confirms two of them as FF motifs. Only an expanded set of FF motif parameterisations using time delays is able to fit the third pathway, indicating that the true mechanism might be more complex in this case.
Conclusions
Maximum likelihood as well as Bayesian model comparison methods are suitable for selecting a plausible motif model among a set of candidate models. Our work shows that it is practical to apply model comparison to test ideas about underlying mechanisms of biological pathways in a formal and quantitative way.
doi:10.1186/1752-0509-4-18
PMCID: PMC2855527  PMID: 20199667
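As a small illustration of likelihood-based model ranking of the kind the abstract above describes, the Python sketch below computes the Akaike and Bayesian information criteria from a fitted log-likelihood. Choosing AIC and BIC is an assumption made for illustration; the abstract does not state which criteria the authors used, and the model names and numbers are hypothetical.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


def rank_models(fits, n_obs):
    """fits maps a model name to (maximised log-likelihood, parameter count).
    Returns the model names ordered from best to worst BIC."""
    return sorted(fits, key=lambda name: bic(fits[name][0], fits[name][1], n_obs))


# Hypothetical fits of two candidate motif models to a 50-point time series.
ranking = rank_models({"simple": (-12.0, 2), "complex": (-11.5, 6)}, n_obs=50)
```

Here the more complex model fits slightly better but pays a larger penalty for its extra parameters, so the simpler model ranks first.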
17.  Variable structure motifs for transcription factor binding sites 
BMC Genomics  2010;11:30.
Background
Classically, models of DNA-transcription factor binding sites (TFBSs) have been based on relatively few known instances and have treated them as sites of fixed length using position weight matrices (PWMs). Various extensions to this model have been proposed, most of which take account of dependencies between the bases in the binding sites. However, some transcription factors are known to exhibit some flexibility and bind to DNA in more than one possible physical configuration. In some cases this variation is known to affect the function of binding sites. With the increasing volume of ChIP-seq data available it is now possible to investigate models that incorporate this flexibility. Previous work on variable length models has been constrained by: a focus on specific zinc finger proteins in yeast using restrictive models; a reliance on hand-crafted models for just one transcription factor at a time; and a lack of evaluation on realistically sized data sets.
Results
We re-analysed binding sites from the TRANSFAC database and found motivating examples where our new variable length model provides a better fit. We analysed several ChIP-seq data sets with a novel motif search algorithm and compared the results to one of the best standard PWM finders and a recently developed alternative method for finding motifs of variable structure. All the methods performed comparably in held-out cross validation tests. Known motifs of variable structure were recovered for p53, Stat5a and Stat5b. In addition our method recovered a novel generalised version of an existing PWM for Sp1 that allows for variable length binding. This motif improved classification performance.
Conclusions
We have presented a new gapped PWM model for variable length DNA binding sites that is neither too restrictive nor over-parameterised. Our comparison with existing tools shows that on average it does not have better predictive accuracy than existing methods. However, it does provide more interpretable models of motifs of variable structure that are suitable for follow-up structural studies. To our knowledge, we are the first to apply variable length motif models to eukaryotic ChIP-seq data sets and consequently the first to show their value in this domain. The results include a novel motif for the ubiquitous transcription factor Sp1.
doi:10.1186/1471-2164-11-30
PMCID: PMC2824720  PMID: 20074339
18.  Transcriptional programs: Modelling higher order structure in transcriptional control 
BMC Bioinformatics  2009;10:218.
Background
Transcriptional regulation is an important part of regulatory control in eukaryotes. Even if binding motifs for transcription factors are known, the task of finding binding sites by scanning sequences is plagued by false positives. One way to improve the detection of binding sites from motifs is by taking cooperativity of transcription factor binding into account. We propose a non-parametric probabilistic model, similar to a document topic model, for detecting transcriptional programs, groups of cooperative transcription factors and co-regulated genes. The analysis results in transcriptional programs which generalise both transcriptional modules and TF-target gene incidence matrices and provide a higher-level summary of these structures. The method is independent of prior specification of training sets of genes, for example, via gene expression data. The analysis is based on known binding motifs.
Results
We applied our method to putative regulatory regions of 18,445 Mus musculus genes. We discovered just 68 transcriptional programs that effectively summarised the action of 149 transcription factors on these genes. Several of these programs were significantly enriched for known biological processes and signalling pathways. One transcriptional program has a significant overlap with a reference set of cell cycle specific transcription factors.
Conclusion
Our method is able to pick out higher order structure from noisy sequence analyses. The transcriptional programs it identifies potentially represent common mechanisms of regulatory control across the genome. It simultaneously predicts which genes are co-regulated and which sets of transcription factors cooperate to achieve this co-regulation. The programs we discovered enable biologists to choose new genes and transcription factors to study in specific transcriptional regulatory systems.
doi:10.1186/1471-2105-10-218
PMCID: PMC2725141  PMID: 19607663
19.  A HaemAtlas: characterizing gene expression in differentiated human blood cells 
Blood  2009;113(19):e1-e9.
Hematopoiesis is a carefully controlled process that is regulated by complex networks of transcription factors that are, in part, controlled by signals resulting from ligand binding to cell-surface receptors. To further understand hematopoiesis, we have compared gene expression profiles of human erythroblasts, megakaryocytes, B cells, cytotoxic and helper T cells, natural killer cells, granulocytes, and monocytes using whole genome microarrays. A bioinformatics analysis of these data was performed focusing on transcription factors, immunoglobulin superfamily members, and lineage-specific transcripts. We observed that the number of lineage-specific genes varies by 2 orders of magnitude, ranging from 5 for cytotoxic T cells to 878 for granulocytes. In addition, we have identified novel coexpression patterns for key transcription factors involved in hematopoiesis (eg, GATA3-GFI1 and GATA2-KLF1). This study represents the most comprehensive analysis of gene expression in hematopoietic cells to date and has identified genes that play key roles in lineage commitment and cell function. The data, which are freely accessible, will be invaluable for future studies on hematopoiesis and the role of specific genes and will also aid the understanding of the recent genome-wide association studies.
doi:10.1182/blood-2008-06-162958
PMCID: PMC2680378  PMID: 19228925
20.  Estimating Translational Selection in Eukaryotic Genomes 
Molecular Biology and Evolution  2008;26(2):451-461.
Natural selection on codon usage is a pervasive force that acts on a large variety of prokaryotic and eukaryotic genomes. Despite this, obtaining reliable estimates of selection on codon usage has proved complicated, perhaps because the selection coefficients involved are very small. In this work, a population genetics model is used to measure the strength of selected codon usage bias, S, in 10 eukaryotic genomes. It is shown that the strength of selection is closely linked to expression and that reliable estimates of selection coefficients can only be obtained for genes with very similar expression levels. We compare the strength of selected codon usage for orthologous genes across all 10 genomes classified according to expression categories. Fungal genomes present the largest S values (2.24–2.56), whereas multicellular invertebrate and plant genomes present more moderate values (0.61–1.91). The large mammalian genomes (human and mouse) show low S values (0.22–0.51) for the most highly expressed genes. This might not be evidence for selection in these organisms, as the technique used here to estimate S does not properly account for nucleotide composition heterogeneity along such genomes. The relationship between estimated S values and empirical estimates of population size is presented here for the first time. It is shown, as theoretically expected, that population size plays an important role in the effectiveness of translational selection.
doi:10.1093/molbev/msn272
PMCID: PMC2639113  PMID: 19033257
translational selection; tAI; tRNA; codon usage bias; population size; eukaryotes
21.  Using Temporal Correlation in Factor Analysis for Reconstructing Transcription Factor Activities 
Two-level gene regulatory networks consist of the transcription factors (TFs) in the top level and their regulated genes in the second level. The expression profiles of the regulated genes are the observed high-throughput data given by experiments such as microarrays. Both the activity profiles of the TFs and the connectivity matrix that indicates the regulatory relationships of TFs with their regulated genes are treated as hidden variables. Factor analysis (FA), as well as other methods such as the network component algorithm, has been suggested for reconstructing gene regulatory networks and also for predicting TF activities. These methods have been applied to E. coli and yeast data with the assumption that these datasets consist of independent and identically distributed samples. Thus, the main drawback of these algorithms is that they ignore any time correlation existing within the TF profiles. In this paper, we extend previously studied FA algorithms to include time correlation within the transcription factors. At the same time, we consider connectivity matrices that are sparse in order to capture the sparsity present in gene regulatory networks. The TF activity profiles obtained by this approach are significantly smoother than profiles from previous FA algorithms. The periodicities in profiles from yeast expression data become prominent in our reconstruction. Moreover, the strength of the correlation between time points is estimated and can be used to assess the suitability of the experimental time interval.
doi:10.1155/2008/172840
PMCID: PMC3171388  PMID: 18604288
22.  A Bayesian Change point model for differential gene expression patterns of the DosR regulon of Mycobacterium tuberculosis 
BMC Genomics  2008;9:87.
Background
Low oxygen availability has been shown previously to stimulate M. tuberculosis to establish non-replicative persistence in vitro. The two-component sensor/regulator dosRS is a major mediator in the transcriptional response of M. tuberculosis to hypoxia and controls a regulon of approximately 50 genes that are induced under this condition.
The aim of this study was to determine whether the induction of the entire DosR regulon is triggered as a synchronous event or if induction can unfold as a cascade of events as the differential expression of subsets of genes is stimulated by different oxygen availabilities.
Results
A novel aspect of our work is the use of chemostat cultures of M. tuberculosis which allowed us to control environmental conditions very tightly. We exposed M. tuberculosis to a sudden drop in oxygen availability in chemostat culture and studied the transcriptional response of the organism during the transition from a high oxygen level (10% dissolved oxygen tension or DOT) to a low oxygen level (0.2% DOT) using DNA microarrays. We developed a Bayesian change point analysis method that enabled us to detect subtle shifts in the timing of gene induction. It results in probabilities of a change in gene expression at certain time points. A computational analysis of potential binding sites upstream of the DosR-controlled genes shows how the transcriptional responses of these genes are influenced by the affinity of these binding sites to DosR. Our study also indicates that a subgroup of DosR-controlled genes is regulated indirectly.
Conclusion
The majority of the dosR-dependent genes were up-regulated at 0.2% DOT, which confirms previous findings that these genes are triggered by hypoxic environments. However, our change point analysis also highlights genes which were up-regulated earlier, at levels of about 8% DOT, indicating that they respond to small fluctuations in oxygen availability. Our analysis shows that there are pairs of divergent genes where one gene in the pair is up-regulated before the other, presumably for a flexible response to a constantly changing environment in the host.
doi:10.1186/1471-2164-9-87
PMCID: PMC2275270  PMID: 18294384
23.  Quantification of global transcription patterns in prokaryotes using spotted microarrays 
Genome Biology  2007;8(12):R265.
We describe an analysis, applicable to any spotted microarray dataset produced using genomic DNA as a reference, that quantifies prokaryotic levels of mRNA on a genome-wide scale. Applying this to Mycobacterium tuberculosis, we validate the technique, show a correlation between level of expression and biological importance, define the complement of invariant genes and analyze absolute levels of expression by functional class to develop ways of understanding an organism's biology without comparison to another growth condition.
doi:10.1186/gb-2007-8-12-r265
PMCID: PMC2246267  PMID: 18078514
24.  A comparative study of S/MAR prediction tools 
BMC Bioinformatics  2007;8:71.
Background
S/MARs are regions of the DNA that are attached to the nuclear matrix. These regions are known to substantially affect the expression of genes. The computer prediction of S/MARs is a highly significant task which could contribute to our understanding of chromatin organisation in eukaryotic cells, the number and distribution of boundary elements, and the understanding of gene regulation in eukaryotic cells. However, while a number of S/MAR predictors have been proposed, their accuracy has so far not come under scrutiny.
Results
We have selected S/MARs with sufficient experimental evidence and used these to evaluate existing methods of S/MAR prediction. Our main results are: 1.) all existing methods have little predictive power, 2.) a simple rule based on AT-percentage is generally competitive with other methods, 3.) in practice, the different methods will usually identify different sub-sequences as S/MARs, 4.) more research on the H-Rule would be valuable.
Conclusion
A new insight is needed to design a method which will predict S/MARs well. Our data, including the control data, has been deposited as additional material and this may help later researchers test new predictors.
doi:10.1186/1471-2105-8-71
PMCID: PMC1847452  PMID: 17335576
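The abstract above reports that a simple rule based on AT-percentage is generally competitive with other S/MAR predictors, but it does not state the rule's parameters. The following is a minimal sketch of what such a rule could look like, not the authors' method: it slides a fixed-size window along a sequence and flags windows whose A+T fraction meets a cutoff. The function names and the `window`, `step`, and `threshold` defaults are illustrative assumptions.

```python
def at_fraction(seq):
    """Return the fraction of A/T bases in seq (case-insensitive); 0.0 if empty."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("A") + seq.count("T")) / len(seq)


def at_rich_windows(seq, window=300, step=100, threshold=0.7):
    """Flag sliding windows whose AT fraction is at least `threshold`.

    The default window size, step, and threshold are hypothetical values
    chosen for illustration; the abstract does not specify them.
    Returns a list of (start, end, at_fraction) tuples with end exclusive.
    """
    hits = []
    # Only full-length windows are scored; a sequence shorter than
    # `window` yields no candidates.
    for start in range(0, max(len(seq) - window + 1, 0), step):
        frac = at_fraction(seq[start:start + window])
        if frac >= threshold:
            hits.append((start, start + window, frac))
    return hits
```

Calling `at_rich_windows(sequence)` on a DNA string returns the candidate AT-rich windows; overlapping hits could then be merged into longer candidate regions, and the cutoff would need tuning against experimentally supported S/MARs.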
25.  Factor analysis for gene regulatory networks and transcription factor activity profiles 
BMC Bioinformatics  2007;8:61.
Background
Most existing algorithms for the inference of the structure of gene regulatory networks from gene expression data assume that the activity levels of transcription factors (TFs) are proportional to their mRNA levels. This assumption is invalid for most biological systems. However, one might be able to reconstruct unobserved activity profiles of TFs from the expression profiles of target genes. A simple model is a two-layer network with unobserved TF variables in the first layer and observed gene expression variables in the second layer. TFs are connected to regulated genes by weighted edges. The weights, known as factor loadings, indicate the strength and direction of regulation. Of particular interest are methods that produce sparse networks, networks with few edges, since it is known that most genes are regulated by only a small number of TFs, and most TFs regulate only a small number of genes.
Results
In this paper, we explore the performance of five factor analysis algorithms, Bayesian as well as classical, on problems with biological context using both simulated and real data. Factor analysis (FA) models are used in order to describe a larger number of observed variables by a smaller number of unobserved variables, the factors, whereby all correlation between observed variables is explained by common factors. Bayesian FA methods allow one to infer sparse networks by enforcing sparsity through priors. In contrast, in the classical FA, matrix rotation methods are used to enforce sparsity and thus to increase the interpretability of the inferred factor loadings matrix. However, we also show that Bayesian FA models that do not impose sparsity through the priors can still be used for the reconstruction of a gene regulatory network if applied in conjunction with matrix rotation methods. Finally, we show the added advantage of merging the information derived from all algorithms in order to obtain a combined result.
Conclusion
Most of the algorithms tested are successful in reconstructing the connectivity structure as well as the TF profiles. Moreover, we demonstrate that if the underlying network is sparse it is still possible to reconstruct hidden activity profiles of TFs to some degree without prior connectivity information.
doi:10.1186/1471-2105-8-61
PMCID: PMC1821042  PMID: 17319944
