Correlation patterns between matched copy number variation and gene expression data in cancer samples enable the inference of causal gene regulatory relationships by exploiting the natural randomization of such systems. The aim of this study was to test and verify experimentally the accuracy of a causal inference approach based on genomic randomization using esophageal cancer samples. Two candidates with strong regulatory effects emerging from our analysis are components of growth factor receptors, and implicated in cancer development, namely ERBB2 and FGFR2. We tested experimentally two ERBB2 and three FGFR2 regulated interactions predicted by the statistical analysis, all of which were confirmed. We also applied the method in a meta-analysis of 10 cancer datasets and tested 15 of the predicted regulatory interactions experimentally. Three additional predicted ERBB2 regulated interactions were confirmed, as well as interactions regulated by ARPC1A and FANCG. Overall, two thirds of experimentally tested predictions were confirmed.
A population of mouse embryonic stem (ES) cells is characterized by a distribution of Nanog, a gene whose expression is associated with the degree of pluripotency. Cells exhibiting high levels of Nanog maintain a state of pluripotency, while those with low levels are more likely to undergo differentiation. Using a cell line with a fluorescence tag for Nanog enables measurements of the distribution of Nanog in an ES cell culture in a stationary state or after a perturbation. In order to model the dynamics of the system, we assume that the distribution of Nanog-GFP for single cells shows distinct attractor steady states of Nanog levels, with individual cells moving between these states stochastically. The addition of synthetic inhibitors of signal transduction induces strong shifts in the distribution of Nanog. In particular, the addition of Chiron and PD03, inhibitors for the ERK and GSK3 signalling pathways, induces a high level of Nanog. In this study, we placed ES cells in different culture conditions, including the above inhibitors, and recorded the change in Nanog-GFP distribution over several days. In order to interpret the measurements of Nanog levels, we propose a new stochastic modelling strategy for the dynamics of the system not requiring detailed knowledge of regulatory or signalling mechanisms, while still capturing the stochastic and the deterministic components of the stochastic dynamical system. Despite its relative simplicity, the model provides an insight into key features of the cell population under various conditions, including the level of noise and occupancy and location of attractor steady states, without the need for strong assumptions about the underlying cellular mechanisms. By applying the model to our experimental data, we infer the existence of three stable steady states for Nanog levels, which are the same in all the different conditions of the cell-culture medium. Noise, on the other hand, and the proportion of cells in each steady state are subject to large shifts. Surprisingly, the isolated effects of PD03 and Chiron on noise and dynamics of the system are quite different from their combined effect. Our results show that signalling determines the occupancy of each state, with a particular role for GSK3 in the regulation of the noise across the population.
stochastic system; embryonic stem cell; Nanog regulation
Cell type–specific gene expression in humans involves complex interactions between regulatory factors and DNA at enhancers and promoters. Mapping studies for expression quantitative trait loci (eQTLs), transcription factors (TFs) and chromatin markers have become widely used tools for identifying gene regulatory elements, but prediction of target genes remains a major challenge. Here, we integrate genome-wide data on TF-binding sites, chromatin markers and functional annotations to predict genes associated with human eQTLs. Using the random forest classifier, we found that genomic proximity plus five TF and chromatin features are able to predict >90% of target genes within 1 megabase of eQTLs. Despite being regularly used to map target genes, proximity is not a good indicator of eQTL targets for genes 150 kilobases away, but insulators, TF co-occurrence, open chromatin and functional similarities between TFs and genes are better indicators. Using all six features in the classifier achieved an area under the specificity and sensitivity curve of 0.91, much better compared with at most 0.75 for using any single feature. We hope this study will not only provide validation of eQTL-mapping studies, but also provide insight into the molecular mechanisms explaining how genetic variation can influence gene expression.
The R project includes a large variety of packages designed for spatial statistics. Google dynamic maps provide web based access to global maps and satellite imagery. We describe a method for displaying directly the spatial output from an R script on to a Google dynamic map.
This is achieved by creating a Java based web application which runs the R script and then displays the results on the dynamic map. In order to make this method easy to implement by those unfamiliar with programming Java based web applications, we have added the method to the options available in the R Web User Interface (Rwui) application. Rwui is an established web application for creating web applications for running R scripts. A feature of Rwui is that all the code for the web application being created is generated automatically so that someone with no knowledge of web programming can make a fully functional web application for running an R script in a matter of minutes.
Rwui can now be used to create web applications that will display the results from an R script on a Google dynamic map. Results may be displayed as discrete markers and/or as continuous overlays. In addition, users of the web application may select regions of interest on the dynamic map with mouse clicks and the coordinates of the region of interest will automatically be made available for use by the R script.
This method of displaying R output on dynamic maps is designed to be of use in a number of areas. Firstly it allows statisticians, working in R and developing methods in spatial statistics, to easily visualise the results of applying their methods to real world data. Secondly, it allows researchers who are using R to study health geographics data, to display their results directly onto dynamic maps. Thirdly, by creating a web application for running an R script, a statistician can enable users entirely unfamiliar with R to run R coded statistical analyses of health geographics data. Fourthly, we envisage an educational role for such applications.
It has been suggested that pathway analysis can complement single-SNP analysis in exploring genomewide association data. Pathway analysis incorporates the available biological knowledge of genes and SNPs and is expected to improve the chances of revealing the underlying genetic architecture of complex traits. Methods for pathway analysis can be classified as competitive (enrichment) or self-contained (association) according to the hypothesis tested. Although association tests are statistically more powerful than enrichment tests they can be difficult to calibrate because biases in analysis accumulate across multiple SNPs or genes. Furthermore, enrichment tests can be more scientifically relevant than association tests, as they detect pathways with relatively more evidence for association than the remaining genes. Here we show how some well known association tests can be simply adapted to test for enrichment, and compare their performance to some established enrichment tests. We propose versions of the Adaptive Rank Truncated Product (ARTP), Tail Strength Measure and Fisher’s combination of p-values for testing the enrichment null hypothesis. We compare the behaviour of these proposed methods with the established Hypergeometric Test and Gene-Set Enrichment Analysis (GSEA). The results of the simulation study show that the modified version of the ARTP method has generally the best performance across the situations considered. The methods were also applied for finding enriched pathways for body mass index (BMI) and platelet function phenotypes. The pathway analysis of BMI identified the Vasoactive Intestinal Peptide pathway as significantly associated with BMI. This pathway has been previously reported as associated with BMI and the risk of obesity. The ARTP method was the method that identified the largest number of enriched pathways across all tested pathway databases and phenotypes. The simulation and data application results are in agreement with previous work on association tests and suggests that the ARTP should be preferred for both enrichment and association testing.
Cellular development requires the precise control of gene expression states. Transcription factors are involved in this regulatory process through their combinatorial binding with DNA. Information about transcription factor binding sites can help determine which combinations of factors work together to regulate a gene, but it is unclear how far the binding data from one cell type can inform about regulation in other cell types.
By integrating data on co-localized transcription factor binding sites in the K562 cell line with expression data across 38 distinct hematopoietic cell types, we developed regression models to describe the relationship between the expression of target genes and the transcription factors that co-localize nearby. With K562 binding sites identifying the predictors, the proportion of expression explained by the models is statistically significant only for monocytic cells (p-value< 0.001), which are closely related to K562. That is, cell type specific binding patterns are crucial for choosing the correct transcription factors for the model. Comparison of predictors obtained from binding sites in the GM12878 cell line with those from K562 shows that the amount of difference between binding patterns is directly related to the quality of the prediction. By identifying individual genes whose expression is predicted accurately by the binding sites, we are able to link transcription factors FOS, TAF1 and YY1 to a sparsely studied gene LRIG2. We also find that the activity of a transcription factor may be different depending on the cell type and the identity of other co-localized factors.
Our approach shows that gene expression can be explained by a modest number of co-localized transcription factors, however, information on cell-type specific binding is crucial for understanding combinatorial gene regulation.
Transcriptional regulation; Gene expression; ChIP-Seq; Regression modeling
MEME and many other popular motif finders use the expectation–maximization (EM) algorithm to optimize their parameters. Unfortunately, the running time of EM is linear in the length of the input sequences. This can prohibit its application to data sets of the size commonly generated by high-throughput biological techniques. A suffix tree is a data structure that can efficiently index a set of sequences. We describe an algorithm, Suffix Tree EM for Motif Elicitation (STEME), that approximates EM using suffix trees. To the best of our knowledge, this is the first application of suffix trees to EM. We provide an analysis of the expected running time of the algorithm and demonstrate that STEME runs an order of magnitude more quickly than the implementation of EM used by MEME. We give theoretical bounds for the quality of the approximation and show that, in practice, the approximation has a negligible effect on the outcome. We provide an open source implementation of the algorithm that we hope will be used to speed up existing and future motif search algorithms.
The low level of available iron in vivo is a major obstacle for microbial pathogens and is a stimulus for the expression of virulence genes. In this study, Mycobacterium tuberculosis H37Rv was grown aerobically in the presence of limited iron availability in chemostat culture to determine the physiological response of the organism to iron-limitation. A previously unidentified wax ester accumulated under iron-limited growth, and changes in the abundance of triacylglycerol and menaquinone were also observed between iron-replete and iron-limited chemostat cultures. DNA microarray analysis revealed differential expression of genes involved in glycerolipid metabolism and isoprenoid quinone biosynthesis, providing some insight into the underlying genetic changes that correlate with cell-wall lipid profiles of M. tuberculosis growing in an iron-limited environment.
Creating a user friendly web based application which executes an R script allows physicians, epidemiologists, and others unfamiliar with the statistical language to perform powerful statistical analyses easily. The geographic mapping of data is an important tool in spatial epidemiological analysis, and the R project includes many tools for such analyses, but few for visualization. Hence, web applications that run R for epidemiological analysis need to be able to present the results in a geographic format.
Rwui is a web application for creating web based applications for running R scripts. We describe updates to Rwui that enable it to create web applications for R scripts which return the results of the analysis to the web page as geographic maps.
Rwui enables statisticians to create web applications for R scripts without the need to learn web programming. Creating a web application provides users access to an R based analysis without the need to learn R. Recent updates to Rwui have increased its applicability in the field of spatial epidemiological analysis.
Microarrays offer great potential as a platform for molecular diagnostics, testing clinical samples for the presence of numerous biomarkers in highly multiplexed assays. In this study applied to infectious diseases, data from a microarray designed for molecular serotyping of Streptococcus pneumoniae was used, identifying the presence of any one of 91 known pneumococcal serotypes from DNA extracts. This microarray incorporated oligonucleotide probes for all known capsular polysaccharide synthesis genes and required a statistical analysis of the microarray intensity data to determine which serotype, or combination of serotypes, were present within a sample based on the combination of genes detected.
We propose an empirical Bayesian model for calculating the probabilities of combinations of serotypes from the microarray data. The model takes into consideration the dependencies between serotypes, induced by genes they have in common, and by homologous genes which, although not identical, are similar to each other in sequence. For serotypes which are very similar in capsular gene composition, extra probes are included on the microarray, providing additional information which is integrated into the Bayesian model. For each serotype combination with high probability, a second model, a Bayesian random effects model is applied to determine the relative abundance of each serotype.
To assess the accuracy of the proposed analysis we applied our methods to experimental data from samples containing individual serotypes and samples containing combinations of serotypes with known levels of abundance. All but two of the known serotypes of S. pneumoniae that were tested as individual samples could be uniquely determined by the Bayesian model. The model also enabled the presence of combinations of serotypes within samples to be determined. Serotypes with very low abundance within a combination of serotypes can be detected (down to 2% abundance in this study). As well as detecting the presence of serotype combinations, an approximate measure of the percentage abundance of the serotypes within the combination can be obtained.
DosR is an important regulator of the response to stress such as limited oxygen availability in Mycobacterium tuberculosis. Time course gene expression data enable us to dissect this response on the gene regulatory level. The mRNA expression profile of a regulator, however, is not necessarily a direct reflection of its activity. Knowing the transcription factor activity (TFA) can be exploited to predict novel target genes regulated by the same transcription factor. Various approaches have been proposed to reconstruct TFAs from gene expression data. Most of them capture only a first-order approximation to the complex transcriptional processes by assuming linear gene responses and linear dynamics in TFA, or ignore the temporal information in data from such systems.
In this paper, we approach the problem of inferring dynamic hidden TFAs using Gaussian processes (GP). We are able to model dynamic TFAs and to account for both linear and nonlinear gene responses. To test the validity of the proposed approach, we reconstruct the hidden TFA of p53, a tumour suppressor activated by DNA damage, using published time course gene expression data. Our reconstructed TFA is closer to the experimentally determined profile of p53 concentration than that from the original study. We then apply the model to time course gene expression data obtained from chemostat cultures of M. tuberculosis under reduced oxygen availability. After estimation of the TFA of DosR based on a number of known target genes using the GP model, we predict novel DosR-regulated genes: the parameters of the model are interpreted as relevance parameters indicating an existing functional relationship between TFA and gene expression. We further improve the prediction by integrating promoter sequence information in a logistic regression model. Apart from the documented DosR-regulated genes, our prediction yields ten novel genes under direct control of DosR.
Chemostat cultures are an ideal experimental system for controlling noise and variability when monitoring the response of bacterial organisms such as M. tuberculosis to finely controlled changes in culture conditions and available metabolites. Nonlinear hidden TFA dynamics of regulators can be reconstructed remarkably well with Gaussian processes from such data. Moreover, estimated parameters of the GP can be used to assess whether a gene is controlled by the reconstructed TFA or not. It is straightforward to combine these parameters with further information, such as the presence of binding motifs, to increase prediction accuracy.
Network motifs are small modules that show interesting functional and dynamic properties, and are believed to be the building blocks of complex cellular processes. However, the mechanistic details of such modules are often unknown: there is uncertainty about the motif architecture as well as the functional form and parameter values when converted to ordinary differential equations (ODEs). This translates into a number of candidate models being compatible with the system under study. A variety of statistical methods exist for ranking models including maximum likelihood-based and Bayesian methods. Our objective is to show how such methods can be applied in a typical systems biology setting.
We focus on four commonly occurring network motif structures and show that it is possible to differentiate between them using simulated data and any of the model comparison methods tested. We expand one of the motifs, the feed forward (FF) motif, for several possible parameterizations and apply model selection on simulated data. We then use experimental data on three biosynthetic pathways in Escherichia coli to formally assess how current knowledge matches the time series available. Our analysis confirms two of them as FF motifs. Only an expanded set of FF motif parameterisations using time delays is able to fit the third pathway, indicating that the true mechanism might be more complex in this case.
Maximum likelihood as well as Bayesian model comparison methods are suitable for selecting a plausible motif model among a set of candidate models. Our work shows that it is practical to apply model comparison to test ideas about underlying mechanisms of biological pathways in a formal and quantitative way.
Classically, models of DNA-transcription factor binding sites (TFBSs) have been based on relatively few known instances and have treated them as sites of fixed length using position weight matrices (PWMs). Various extensions to this model have been proposed, most of which take account of dependencies between the bases in the binding sites. However, some transcription factors are known to exhibit some flexibility and bind to DNA in more than one possible physical configuration. In some cases this variation is known to affect the function of binding sites. With the increasing volume of ChIP-seq data available it is now possible to investigate models that incorporate this flexibility. Previous work on variable length models has been constrained by: a focus on specific zinc finger proteins in yeast using restrictive models; a reliance on hand-crafted models for just one transcription factor at a time; and a lack of evaluation on realistically sized data sets.
We re-analysed binding sites from the TRANSFAC database and found motivating examples where our new variable length model provides a better fit. We analysed several ChIP-seq data sets with a novel motif search algorithm and compared the results to one of the best standard PWM finders and a recently developed alternative method for finding motifs of variable structure. All the methods performed comparably in held-out cross validation tests. Known motifs of variable structure were recovered for p53, Stat5a and Stat5b. In addition our method recovered a novel generalised version of an existing PWM for Sp1 that allows for variable length binding. This motif improved classification performance.
We have presented a new gapped PWM model for variable length DNA binding sites that is not too restrictive nor over-parameterised. Our comparison with existing tools shows that on average it does not have better predictive accuracy than existing methods. However, it does provide more interpretable models of motifs of variable structure that are suitable for follow-up structural studies. To our knowledge, we are the first to apply variable length motif models to eukaryotic ChIP-seq data sets and consequently the first to show their value in this domain. The results include a novel motif for the ubiquitous transcription factor Sp1.
Transcriptional regulation is an important part of regulatory control in eukaryotes. Even if binding motifs for transcription factors are known, the task of finding binding sites by scanning sequences is plagued by false positives. One way to improve the detection of binding sites from motifs is by taking cooperativity of transcription factor binding into account. We propose a non-parametric probabilistic model, similar to a document topic model, for detecting transcriptional programs, groups of cooperative transcription factors and co-regulated genes. The analysis results in transcriptional programs which generalise both transcriptional modules and TF-target gene incidence matrices and provide a higher-level summary of these structures. The method is independent of prior specification of training sets of genes, for example, via gene expression data. The analysis is based on known binding motifs.
We applied our method to putative regulatory regions of 18,445 Mus musculus genes. We discovered just 68 transcriptional programs that effectively summarised the action of 149 transcription factors on these genes. Several of these programs were significantly enriched for known biological processes and signalling pathways. One transcriptional program has a significant overlap with a reference set of cell cycle specific transcription factors.
Our method is able to pick out higher order structure from noisy sequence analyses. The transcriptional programs it identifies potentially represent common mechanisms of regulatory control across the genome. It simultaneously predicts which genes are co-regulated and which sets of transcription factors cooperate to achieve this co-regulation. The programs we discovered enable biologists to choose new genes and transcription factors to study in specific transcriptional regulatory systems.
Hematopoiesis is a carefully controlled process that is regulated by complex networks of transcription factors that are, in part, controlled by signals resulting from ligand binding to cell-surface receptors. To further understand hematopoiesis, we have compared gene expression profiles of human erythroblasts, megakaryocytes, B cells, cytotoxic and helper T cells, natural killer cells, granulocytes, and monocytes using whole genome microarrays. A bioinformatics analysis of these data was performed focusing on transcription factors, immunoglobulin superfamily members, and lineage-specific transcripts. We observed that the numbers of lineage-specific genes varies by 2 orders of magnitude, ranging from 5 for cytotoxic T cells to 878 for granulocytes. In addition, we have identified novel coexpression patterns for key transcription factors involved in hematopoiesis (eg, GATA3-GFI1 and GATA2-KLF1). This study represents the most comprehensive analysis of gene expression in hematopoietic cells to date and has identified genes that play key roles in lineage commitment and cell function. The data, which are freely accessible, will be invaluable for future studies on hematopoiesis and the role of specific genes and will also aid the understanding of the recent genome-wide association studies.
Natural selection on codon usage is a pervasive force that acts on a large variety of prokaryotic and eukaryotic genomes. Despite this, obtaining reliable estimates of selection on codon usage has proved complicated, perhaps due to the fact that the selection coefficients involved are very small. In this work, a population genetics model is used to measure the strength of selected codon usage bias, S, in 10 eukaryotic genomes. It is shown that the strength of selection is closely linked to expression and that reliable estimates of selection coefficients can only be obtained for genes with very similar expression levels. We compare the strength of selected codon usage for orthologous genes across all 10 genomes classified according to expression categories. Fungi genomes present the largest S values (2.24–2.56), whereas multicellular invertebrate and plant genomes present more moderate values (0.61–1.91). The large mammalian genomes (human and mouse) show low S values (0.22–0.51) for the most highly expressed genes. This might not be evidence for selection in these organisms as the technique used here to estimate S does not properly account for nucleotide composition heterogeneity along such genomes. The relationship between estimated S values and empirical estimates of population size is presented here for the first time. It is shown, as theoretically expected, that population size has an important role in the operativity of translational selection.
translational selection; tAI; tRNA; codon usage bias; population size; eukaryotes
Two-level gene regulatory networks consist of the transcription factors (TFs) in the top level and their regulated genes in the second level. The expression profiles of the regulated genes are the observed high-throughput data given by experiments such as microarrays. The activity profiles of the TFs are treated as hidden variables as well as the connectivity matrix that indicates the regulatory relationships of TFs with their regulated genes. Factor analysis (FA) as well as other methods, such as the network component algorithm, has been suggested for reconstructing gene regulatory networks and also for predicting TF activities. They have been applied to E. coli and yeast data with the assumption that these datasets consist of identical and independently distributed samples. Thus, the main drawback of these algorithms is that they ignore any time correlation existing within the TF profiles. In this paper, we extend previously studied FA algorithms to include time correlation within the transcription factors. At the same time, we consider connectivity matrices that are sparse in order to capture the existing sparsity present in gene regulatory networks. The TFs activity profiles obtained by this approach are significantly smoother than profiles from previous FA algorithms. The periodicities in profiles from yeast expression data become prominent in our reconstruction. Moreover, the strength of the correlation between time points is estimated and can be used to assess the suitability of the experimental time interval.
Low oxygen availability has been shown previously to stimulate M. tuberculosis to establish non-replicative persistence in vitro. The two component sensor/regulator dosRS is a major mediator in the transcriptional response of M. tuberculosis to hypoxia and controls a regulon of approximately 50 genes that are induced under this condition.
The aim of this study was to determine whether the induction of the entire DosR regulon is triggered as a synchronous event or if induction can unfold as a cascade of events as the differential expression of subsets of genes is stimulated by different oxygen availabilities.
A novel aspect of our work is the use of chemostat cultures of M. tuberculosis which allowed us to control environmental conditions very tightly. We exposed M. tuberculosis to a sudden drop in oxygen availability in chemostat culture and studied the transcriptional response of the organism during the transition from a high oxygen level (10% dissolved oxygen tension or DOT) to a low oxygen level (0.2% DOT) using DNA microarrays. We developed a Bayesian change point analysis method that enabled us to detect subtle shifts in the timing of gene induction. It results in probabilities of a change in gene expression at certain time points. A computational analysis of potential binding sites upstream of the DosR-controlled genes shows how the transcriptional responses of these genes are influenced by the affinity of these binding sites to DosR. Our study also indicates that a subgroup of DosR-controlled genes is regulated indirectly.
The majority of the dosR-dependent genes were up-regulated at 0.2% DOT, which confirms previous findings that these genes are triggered by hypoxic environments. However, our change point analysis also highlights genes which were up-regulated earlier at levels of about 8% DOT indicating that they respond to small fluctuations in oxygen availability. Our analysis shows that there are pairs of divergent genes where one gene in the pair is up-regulated before the other, presumably for a flexible response to a constantly changing environment in the host.
An analysis is described, applicable to any spotted microarray dataset that is produced using genomic DNA as a reference for quantifying prokaryotic levels of mRNA on a genome-wide scale.
We describe an analysis, applicable to any spotted microarray dataset produced using genomic DNA as a reference, that quantifies prokaryotic levels of mRNA on a genome-wide scale. Applying this to Mycobacterium tuberculosis, we validate the technique, show a correlation between level of expression and biological importance, define the complement of invariant genes and analyze absolute levels of expression by functional class to develop ways of understanding an organism's biology without comparison to another growth condition.
S/MARs are regions of the DNA that are attached to the nuclear matrix. These regions are known to affect substantially the expression of genes. The computer prediction of S/MARs is a highly significant task which could contribute to our understanding of chromatin organisation in eukaryotic cells, the number and distribution of boundary elements, and the understanding of gene regulation in eukaryotic cells. However, while a number of S/MAR predictors have been proposed, their accuracy has so far not come under scrutiny.
We have selected S/MARs with sufficient experimental evidence and used these to evaluate existing methods of S/MAR prediction. Our main results are: 1.) all existing methods have little predictive power, 2.) a simple rule based on AT-percentage is generally competitive with other methods, 3.) in practice, the different methods will usually identify different sub-sequences as S/MARs, 4.) more research on the H-Rule would be valuable.
A new insight is needed to design a method which will predict S/MARs well. Our data, including the control data, has been deposited as additional material and this may help later researchers test new predictors.
Most existing algorithms for the inference of the structure of gene regulatory networks from gene expression data assume that the activity levels of transcription factors (TFs) are proportional to their mRNA levels. This assumption is invalid for most biological systems. However, one might be able to reconstruct unobserved activity profiles of TFs from the expression profiles of target genes. A simple model is a two-layer network with unobserved TF variables in the first layer and observed gene expression variables in the second layer. TFs are connected to regulated genes by weighted edges. The weights, known as factor loadings, indicate the strength and direction of regulation. Of particular interest are methods that produce sparse networks, networks with few edges, since it is known that most genes are regulated by only a small number of TFs, and most TFs regulate only a small number of genes.
In this paper, we explore the performance of five factor analysis algorithms, Bayesian as well as classical, on problems with biological context using both simulated and real data. Factor analysis (FA) models are used in order to describe a larger number of observed variables by a smaller number of unobserved variables, the factors, whereby all correlation between observed variables is explained by common factors. Bayesian FA methods allow one to infer sparse networks by enforcing sparsity through priors. In contrast, in the classical FA, matrix rotation methods are used to enforce sparsity and thus to increase the interpretability of the inferred factor loadings matrix. However, we also show that Bayesian FA models that do not impose sparsity through the priors can still be used for the reconstruction of a gene regulatory network if applied in conjunction with matrix rotation methods. Finally, we show the added advantage of merging the information derived from all algorithms in order to obtain a combined result.
Most of the algorithms tested are successful in reconstructing the connectivity structure as well as the TF profiles. Moreover, we demonstrate that if the underlying network is sparse it is still possible to reconstruct hidden activity profiles of TFs to some degree without prior connectivity information.
RNA amplification is necessary for profiling gene expression from small tissue samples. Previous studies have shown that the T7 based amplification techniques are reproducible but may distort the true abundance of targets. However, the consequences of such distortions on the ability to detect biological variation in expression have not been explored sufficiently to define the true extent of usability and limitations of such amplification techniques.
We show that expression ratios are occasionally distorted by amplification using the Affymetrix small sample protocol version 2 due to a disproportional shift in intensity across biological samples. This occurs when a shift in one sample cannot be reflected in the other sample because the intensity would lie outside the dynamic range of the scanner. Interestingly, such distortions most commonly result in smaller ratios with the consequence of reducing the statistical significance of the ratios. This becomes more critical for less pronounced ratios where the evidence for differential expression is not strong. Indeed, statistical analysis by limma suggests that up to 87% of the genes with the largest and therefore most significant ratios (p < 10e-20) in the unamplified group have a p-value below 10e-20 in the amplified group. On the other hand, only 69% of the more moderate ratios (10e-20 < p < 10e-10) in the unamplified group have a p-value below 10e-10 in the amplified group. Our analysis also suggests that, overall, limma shows better overlap of genes found to be significant in the amplified and unamplified groups than the Z-scores statistics.
We conclude that microarray analysis of amplified samples performs best at detecting differences in gene expression, when these are large and when limma statistics are used.
An important step in understanding the regulation of a prokaryotic genome is the generation of its transcription unit map. The current strongest operon predictor depends on the distributions of intergenic distances (IGD) separating adjacent genes within and between operons. Unfortunately, experimental data on these distance distributions are limited to Escherichia coli and Bacillus subtilis. We suggest a new graph algorithmic approach based on comparative genomics to identify clusters of conserved genes independent of IGD and conservation of gene order. As a consequence, distance distributions of operon pairs for any arbitrary prokaryotic genome can be inferred. For E.coli, the algorithm predicts 854 conserved adjacent pairs with a precision of 85%. The IGD distribution for these pairs is virtually identical to the E.coli operon pair distribution. Statistical analysis of the predicted pair IGD distribution allows estimation of a genome-specific operon IGD cut-off, obviating the requirement for a training set in IGD-based operon prediction. We apply the method to a representative set of eight genomes, and show that these genome-specific IGD distributions differ considerably from each other and from the distribution in E.coli.
Translational selection is responsible for the unequal usage of synonymous codons in protein coding genes in a wide variety of organisms. It is one of the most subtle and pervasive forces of molecular evolution, yet, establishing the underlying causes for its idiosyncratic behaviour across living kingdoms has proven elusive to researchers over the past 20 years. In this study, a statistical model for measuring translational selection in any given genome is developed, and the test is applied to 126 fully sequenced genomes, ranging from archaea to eukaryotes. It is shown that tRNA gene redundancy and genome size are interacting forces that ultimately determine the action of translational selection, and that an optimal genome size exists for which this kind of selection is maximal. Accordingly, genome size also presents upper and lower boundaries beyond which selection on codon usage is not possible. We propose a model where the coevolution of genome size and tRNA genes explains the observed patterns in translational selection in all living organisms. This model finally unifies our understanding of codon usage across prokaryotes and eukaryotes. Helicobacter pylori, Saccharomyces cerevisiae and Homo sapiens are codon usage paradigms that can be better understood under the proposed model.
Escherichia coli has long been regarded as a model organism in the study of codon usage bias (CUB). However, most studies in this organism regarding this topic have been computational or, when experimental, restricted to small datasets; particularly poor attention has been given to genes with low CUB. In this work, correspondence analysis on codon usage is used to classify E.coli genes into three groups, and the relationship between them and expression levels from microarray experiments is studied. These groups are: group 1, highly biased genes; group 2, moderately biased genes; and group 3, AT-rich genes with low CUB. It is shown that, surprisingly, there is a negative correlation between codon bias and expression levels for group 3 genes, i.e. genes with extremely low codon adaptation index (CAI) values are highly expressed, while group 2 show the lowest average expression levels and group 1 show the usual expected positive correlation between CAI and expression. This trend is maintained over all functional gene groups, seeming to contradict the E.coli–yeast paradigm on CUB. It is argued that these findings are still compatible with the mutation–selection balance hypothesis of codon usage and that E.coli genes form a dynamic system shaped by these factors.