Results 1-6 (6)

1.  A flexible count data model to fit the wide diversity of expression profiles arising from extensively replicated RNA-seq experiments 
BMC Bioinformatics  2013;14:254.
High-throughput RNA sequencing (RNA-seq) offers unprecedented power to capture the real dynamics of gene expression. Experimental designs with extensive biological replication present a unique opportunity to exploit this feature and distinguish expression profiles with higher resolution. RNA-seq data analysis methods so far have been mostly applied to data sets with few replicates and their default settings try to provide the best performance under this constraint. These methods are based on two well-known count data distributions: the Poisson and the negative binomial. The way to properly calibrate them with large RNA-seq data sets is not trivial for the non-expert bioinformatics user.
Here we show that expression profiles produced by extensively replicated RNA-seq experiments lead to a rich diversity of count data distributions beyond the Poisson and the negative binomial, such as Poisson-Inverse Gaussian or Pólya-Aeppli, which can be captured by a more general family of count data distributions called the Poisson-Tweedie. The flexibility of the Poisson-Tweedie family enables a direct fitting of emerging features of large expression profiles, such as heavy tails or zero-inflation, without the need to alter a single configuration parameter. We provide a software package for R called tweeDEseq implementing a new test for differential expression based on the Poisson-Tweedie family. Using simulations on synthetic and real RNA-seq data we show that tweeDEseq yields P-values that are equally or more accurate than those of competing methods under different configuration parameters. By surveying the tiny fraction of sex-specific gene expression changes in human lymphoblastoid cell lines, we also show that tweeDEseq accurately detects differentially expressed genes in a real large RNA-seq data set with improved performance and reproducibility over the previously compared methodologies. Finally, we compared the results with those obtained from microarrays in order to check for reproducibility.
RNA-seq data with many replicates leads to a handful of count data distributions which can be accurately estimated with the statistical model illustrated in this paper. This method provides a better fit to the underlying biological variability; this may be critical when comparing groups of RNA-seq samples with markedly different count data distributions. The tweeDEseq package forms part of the Bioconductor project and it is available for download at
PMCID: PMC3849762  PMID: 23965047
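The abstract above turns on the observation that replicated counts often depart from the Poisson model. A minimal sketch of one simple diagnostic, the variance-to-mean ratio (index of dispersion), follows. It is not the Poisson-Tweedie test implemented in tweeDEseq; the function name and the read counts are hypothetical and chosen only for illustration.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Return the sample variance divided by the sample mean.

    Values near 1 are what a Poisson model would predict; values well
    above 1 indicate overdispersion (for example heavy tails or excess zeros).
    """
    m = mean(counts)
    if m == 0:
        raise ValueError("mean is zero; the index is undefined")
    return variance(counts) / m


# Hypothetical read counts for one gene across eight replicates.
near_poisson = [6, 14, 10, 7, 13, 10, 8, 12]
heavy_tailed = [0, 0, 3, 1, 0, 45, 2, 0]

print(round(dispersion_index(near_poisson), 2))
print(round(dispersion_index(heavy_tailed), 2))
```

A ratio far above 1, as in the second sample, is the kind of profile for which a more flexible count distribution than the Poisson may be worth considering.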
2.  GSVA: gene set variation analysis for microarray and RNA-Seq data 
BMC Bioinformatics  2013;14:7.
Gene set enrichment (GSE) analysis is a popular framework for condensing information from gene expression profiles into a pathway or signature summary. The strengths of this approach over single gene analysis include noise and dimension reduction, as well as greater biological interpretability. As molecular profiling experiments move beyond simple case-control studies, robust and flexible GSE methodologies are needed that can model pathway activity within highly heterogeneous data sets.
To address this challenge, we introduce Gene Set Variation Analysis (GSVA), a GSE method that estimates variation of pathway activity over a sample population in an unsupervised manner. We demonstrate the robustness of GSVA in a comparison with current state of the art sample-wise enrichment methods. Further, we provide examples of its utility in differential pathway activity and survival analysis. Lastly, we show how GSVA works analogously with data from both microarray and RNA-seq experiments.
GSVA provides increased power to detect subtle pathway activity changes over a sample population in comparison to corresponding methods. While GSE methods are generally regarded as end points of a bioinformatic analysis, GSVA constitutes a starting point to build pathway-centric models of biology. Moreover, GSVA helps meet the current need for GSE methods for RNA-seq data. GSVA is an open source software package for R which forms part of the Bioconductor project and can be downloaded at
PMCID: PMC3618321  PMID: 23323831
3.  Global analysis of alternative splicing regulation by insulin and wingless signaling in Drosophila cells 
Genome Biology  2009;10(1):R11.
A genome-wide analysis of the response to insulin and wingless activation using splicing-sensitive microarrays shows distinct but overlapping programs of transcriptional and posttranscriptional regulation.
Despite the prevalence and biological relevance of both signaling pathways and alternative pre-mRNA splicing, our knowledge of how intracellular signaling impacts on alternative splicing regulation remains fragmentary. We report a genome-wide analysis using splicing-sensitive microarrays of changes in alternative splicing induced by activation of two distinct signaling pathways, insulin and wingless, in Drosophila cells in culture.
Alternative splicing changes induced by insulin affect more than 150 genes and more than 50 genes are regulated by wingless activation. About 40% of the genes showing changes in alternative splicing also show regulation of mRNA levels, suggesting distinct but also significantly overlapping programs of transcriptional and post-transcriptional regulation. Distinct functional sets of genes are regulated by each pathway and, remarkably, a significant overlap is observed between functional categories of genes regulated transcriptionally and at the level of alternative splicing. Functions related to carbohydrate metabolism and cellular signaling are enriched among genes regulated by insulin and wingless, respectively. Computational searches identify pathway-specific sequence motifs enriched near regulated 5' splice sites.
Taken together, our data indicate that signaling cascades trigger pathway-specific and biologically coherent regulatory programs of alternative splicing regulation. They also reveal that alternative splicing can provide a novel molecular mechanism for crosstalk between different signaling pathways.
PMCID: PMC2687788  PMID: 19178699
4.  EGASP: the human ENCODE Genome Annotation Assessment Project 
Genome Biology  2006;7(Suppl 1):S2.
We present the results of EGASP, a community experiment to assess the state-of-the-art in genome annotation within the ENCODE regions, which span 1% of the human genome sequence. The experiment had two major goals: the assessment of the accuracy of computational methods to predict protein coding genes; and the overall assessment of the completeness of the current human genome annotations as represented in the ENCODE regions. For the computational prediction assessment, eighteen groups contributed gene predictions. We evaluated these submissions against each other based on a 'reference set' of annotations generated as part of the GENCODE project. These annotations were not available to the prediction groups prior to the submission deadline, so that their predictions were blind and an external advisory committee could perform a fair assessment.
The best methods had at least one gene transcript correctly predicted for close to 70% of the annotated genes. Nevertheless, accuracy at the multiple-transcript level, which takes alternative splicing into account, reached only approximately 40% to 50%. At the coding nucleotide level, the best programs reached an accuracy of 90% in both sensitivity and specificity. Programs relying on mRNA and protein sequences were the most accurate in reproducing the manually curated annotations. Experimental validation shows that only a very small percentage (3.2%) of the selected 221 computationally predicted exons outside of the existing annotation could be verified.
This is the first such experiment in human DNA, and we have followed the standards established in a similar experiment, GASP1, in Drosophila melanogaster. We believe the results presented here contribute to the value of ongoing large-scale annotation projects and should guide further experimental methods when being scaled up to the entire human genome sequence.
PMCID: PMC1810551  PMID: 16925836
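The nucleotide-level sensitivity and specificity quoted in the EGASP abstract can be illustrated with a short sketch. It assumes the convention common in gene-prediction assessment, in which sensitivity is the fraction of annotated coding nucleotides that are predicted and specificity is the fraction of predicted coding nucleotides that are annotated (called precision in other fields). The function names and coordinates are hypothetical, and this is not the EGASP evaluation code.

```python
def coding_positions(exons):
    """Expand (start, end) exon intervals, 1-based inclusive, into a set of positions."""
    positions = set()
    for start, end in exons:
        positions.update(range(start, end + 1))
    return positions


def nucleotide_accuracy(annotated_exons, predicted_exons):
    """Return (sensitivity, specificity) at the coding nucleotide level.

    Assumed convention: sensitivity = TP / annotated, specificity = TP / predicted.
    """
    annotated = coding_positions(annotated_exons)
    predicted = coding_positions(predicted_exons)
    true_positives = len(annotated & predicted)
    sensitivity = true_positives / len(annotated) if annotated else 0.0
    specificity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Hypothetical exon coordinates on one sequence.
annotated = [(100, 199), (300, 399)]  # 200 annotated coding nucleotides
predicted = [(110, 199), (300, 409)]  # 200 predicted coding nucleotides

sn, sp = nucleotide_accuracy(annotated, predicted)
print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```

Expanding intervals into position sets keeps the sketch short; for whole chromosomes an interval-overlap computation would use far less memory.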
5.  Gene finding in the chicken genome 
BMC Bioinformatics  2005;6:131.
Despite the continuous production of genome sequence for a number of organisms, reliable, comprehensive, and cost-effective gene prediction remains problematic. This is particularly true for genomes for which there is not a large collection of known gene sequences, such as the recently published chicken genome. We used the chicken sequence to test comparative and homology-based gene-finding methods followed by experimental validation as an effective genome annotation method.
We performed experimental evaluation by RT-PCR of three different computational gene finders, Ensembl, SGP2 and TWINSCAN, applied to the chicken genome. We computed a Venn diagram of the three prediction sets and evaluated each of its components. The results showed that de novo comparative methods can identify up to about 700 chicken genes with no previous evidence of expression, and can correctly extend about 40% of homology-based predictions at the 5' end.
De novo comparative gene prediction followed by experimental verification is effective at enhancing the annotation that standard homology-based methods provide for newly sequenced genomes.
PMCID: PMC1174864  PMID: 15924626
6.  Comparative gene finding in chicken indicates that we are closing in on the set of multi-exonic widely expressed human genes 
Nucleic Acids Research  2005;33(6):1935-1939.
The recent availability of the chicken genome sequence poses the question of whether there are human protein-coding genes conserved in chicken that are currently not included in the human gene catalog. Here, we show, using comparative gene finding followed by experimental verification of exon pairs by RT–PCR, that the addition to the multi-exonic subset of this catalog could be as little as 0.2%, suggesting that we may be closing in on the human gene set. Our protocol, however, has two shortcomings: (i) the bioinformatic screening of the predicted genes, applied to filter out false positives, cannot handle intronless genes; and (ii) the experimental verification could fail to identify expression at a specific developmental time. This highlights the importance of developing methods that could provide a reliable estimate of the number of these two types of genes.
PMCID: PMC1074396  PMID: 15809229
