1.  Enhanced transcriptome maps from multiple mouse tissues reveal evolutionary constraint in gene expression 
Nature Communications  2015;6:5903.
Mice have been a long-standing model for human biology and disease. Here we characterize, by RNA sequencing, the transcriptional profiles of a large and heterogeneous collection of mouse tissues, augmenting the mouse transcriptome with thousands of novel transcript candidates. Comparison with transcriptome profiles in human cell lines reveals substantial conservation of transcriptional programmes, and uncovers a distinct class of genes with levels of expression that have been constrained early in vertebrate evolution. This core set of genes captures a substantial fraction of the transcriptional output of mammalian cells, and participates in basic functional and structural housekeeping processes common to all cell types. Perturbation of these constrained genes is associated with significant phenotypes including embryonic lethality and cancer. Evolutionary constraint in gene expression levels is not reflected in the conservation of the genomic sequences, but is associated with conserved epigenetic marking, as well as with a characteristic post-transcriptional regulatory programme, in which sub-cellular localization and alternative splicing play comparatively large roles.
The analysis of mammalian transcriptomes could provide new insights into human biology. Here the authors carry out RNA sequencing in a large collection of mouse tissues and compare these data to human transcriptome profiles, identifying a set of constrained genes that carry out basic cellular functions with remarkably constant expression levels across tissues and species.
PMCID: PMC4308717  PMID: 25582907
2.  The First Myriapod Genome Sequence Reveals Conservative Arthropod Gene Content and Genome Organisation in the Centipede Strigamia maritima 
Chipman, Ariel D. | Ferrier, David E. K. | Brena, Carlo | Qu, Jiaxin | Hughes, Daniel S. T. | Schröder, Reinhard | Torres-Oliva, Montserrat | Znassi, Nadia | Jiang, Huaiyang | Almeida, Francisca C. | Alonso, Claudio R. | Apostolou, Zivkos | Aqrawi, Peshtewani | Arthur, Wallace | Barna, Jennifer C. J. | Blankenburg, Kerstin P. | Brites, Daniela | Capella-Gutiérrez, Salvador | Coyle, Marcus | Dearden, Peter K. | Du Pasquier, Louis | Duncan, Elizabeth J. | Ebert, Dieter | Eibner, Cornelius | Erikson, Galina | Evans, Peter D. | Extavour, Cassandra G. | Francisco, Liezl | Gabaldón, Toni | Gillis, William J. | Goodwin-Horn, Elizabeth A. | Green, Jack E. | Griffiths-Jones, Sam | Grimmelikhuijzen, Cornelis J. P. | Gubbala, Sai | Guigó, Roderic | Han, Yi | Hauser, Frank | Havlak, Paul | Hayden, Luke | Helbing, Sophie | Holder, Michael | Hui, Jerome H. L. | Hunn, Julia P. | Hunnekuhl, Vera S. | Jackson, LaRonda | Javaid, Mehwish | Jhangiani, Shalini N. | Jiggins, Francis M. | Jones, Tamsin E. | Kaiser, Tobias S. | Kalra, Divya | Kenny, Nathan J. | Korchina, Viktoriya | Kovar, Christie L. | Kraus, F. Bernhard | Lapraz, François | Lee, Sandra L. | Lv, Jie | Mandapat, Christigale | Manning, Gerard | Mariotti, Marco | Mata, Robert | Mathew, Tittu | Neumann, Tobias | Newsham, Irene | Ngo, Dinh N. | Ninova, Maria | Okwuonu, Geoffrey | Ongeri, Fiona | Palmer, William J. | Patil, Shobha | Patraquim, Pedro | Pham, Christopher | Pu, Ling-Ling | Putman, Nicholas H. | Rabouille, Catherine | Ramos, Olivia Mendivil | Rhodes, Adelaide C. | Robertson, Helen E. | Robertson, Hugh M. | Ronshaugen, Matthew | Rozas, Julio | Saada, Nehad | Sánchez-Gracia, Alejandro | Scherer, Steven E. | Schurko, Andrew M. | Siggens, Kenneth W. | Simmons, DeNard | Stief, Anna | Stolle, Eckart | Telford, Maximilian J. | Tessmar-Raible, Kristin | Thornton, Rebecca | van der Zee, Maurijn | von Haeseler, Arndt | Williams, James M. | Willis, Judith H. | Wu, Yuanqing | Zou, Xiaoyan | Lawson, Daniel | Muzny, Donna M. 
| Worley, Kim C. | Gibbs, Richard A. | Akam, Michael | Richards, Stephen
PLoS Biology  2014;12(11):e1002005.
Myriapods (e.g., centipedes and millipedes) display a simple homonomous body plan relative to other arthropods. All members of the class are terrestrial, but they attained terrestriality independently of insects. Myriapoda is the only arthropod class not represented by a sequenced genome. We present an analysis of the genome of the centipede Strigamia maritima. It retains a compact genome that has undergone less gene loss and shuffling than previously sequenced arthropods, and many orthologues of genes conserved from the bilaterian ancestor that have been lost in insects. Our analysis locates many genes in conserved macro-synteny contexts, and many small-scale examples of gene clustering. We describe several examples where S. maritima shows different solutions from insects to similar problems. The insect olfactory receptor gene family is absent from S. maritima, and olfaction in air is likely effected by expansion of other receptor gene families. For some genes S. maritima has evolved paralogues to generate coding sequence diversity, where insects use alternate splicing. This is most striking for the Dscam gene, which in Drosophila generates more than 100,000 alternate splice forms, but in S. maritima is encoded by over 100 paralogues. We see an intriguing linkage between the absence of any known photosensory proteins in a blind organism and the additional absence of canonical circadian clock genes. The phylogenetic position of myriapods allows us to identify where in arthropod phylogeny several particular molecular mechanisms and traits emerged. For example, we conclude that juvenile hormone signalling evolved with the emergence of the exoskeleton in the arthropods and that RR-1 containing cuticle proteins evolved in the lineage leading to Mandibulata. We also identify when various gene expansions and losses occurred. The genome of S. maritima offers us a unique glimpse into the ancestral arthropod genome, while also displaying many adaptations to its specific life history.
Author Summary
Arthropods are the most abundant animals on earth. Among them, insects clearly dominate on land, whereas crustaceans hold the title for the most diverse invertebrates in the oceans. Much is known about the biology of these groups, not least because of genomic studies of the fruit fly Drosophila, the water flea Daphnia, and other species used in research. Here we report the first genome sequence from a species belonging to a lineage that has previously received very little attention—the myriapods. Myriapods were among the first arthropods to invade the land over 400 million years ago, and survive today as the herbivorous millipedes and venomous centipedes, one of which—Strigamia maritima—we have sequenced here. We find that the genome of this centipede retains more characteristics of the presumed arthropod ancestor than other sequenced insect genomes. The genome provides access to many aspects of myriapod biology that have not been studied before, suggesting, for example, that they have diversified receptors for smell that are quite different from those used by insects. In addition, it shows specific consequences of the largely subterranean life of this particular species, which seems to have lost the genes for all known light-sensing molecules, even though it still avoids light.
PMCID: PMC4244043  PMID: 25423365
3.  Assessment of transcript reconstruction methods for RNA-seq 
Nature methods  2013;10(12):10.1038/nmeth.2714.
RNA sequencing (RNA-seq) is transforming genome biology, enabling comprehensive transcriptome profiling with unprecedented accuracy and detail. Due to technical limitations of current high-throughput sequencing platforms, transcript identity, structure and expression level must be inferred programmatically from partial sequence reads of fragmented gene products. We evaluated 24 protocol variants of 14 independent computational methods for exon identification, transcript reconstruction and expression level quantification from RNA-seq data. Our results show that most algorithms are able to identify discrete transcript components with high success rates, but that assembly of complete isoform structures poses a major challenge even when all constituent elements are identified. Expression level estimates also varied widely across methods, even when based on similar transcript models. Consequently, the complexity of higher eukaryotic genomes imposes severe limitations in transcript recall and splice product discrimination that are likely to remain limiting factors for the analysis of current-generation RNA-seq data.
PMCID: PMC3851240  PMID: 24185837
4.  Systematic evaluation of spliced alignment programs for RNA-seq data 
Nature methods  2013;10(12):1185-1191.
High-throughput RNA sequencing is an increasingly accessible method for studying gene structure and activity on a genome-wide scale. A critical step in RNA-seq data analysis is the alignment of partial transcript reads to a reference genome sequence. To assess the performance of current mapping software, we invited developers of RNA-seq aligners to process four large human and mouse RNA-seq data sets. In total, we compared 26 mapping protocols based on 11 programs and pipelines and found major performance differences between methods on numerous benchmarks, including alignment yield, basewise accuracy, mismatch and gap placement, exon junction discovery and suitability of alignments for transcript reconstruction. We observed concordant results on real and simulated RNA-seq data, confirming the relevance of the metrics employed. Future developments in RNA-seq alignment methods would benefit from improved placement of multimapped reads, balanced utilization of existing gene annotation and a reduced false discovery rate for splice junctions.
PMCID: PMC4018468  PMID: 24185836
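As an illustration of one benchmark named in the abstract above (exon junction discovery and its false discovery rate), the sketch below scores a set of predicted splice junctions against an annotated reference set. The tuple representation of a junction, the name `junction_metrics`, and the toy coordinates are assumptions made for this example; this is not the evaluation code used in the study.

```python
from typing import NamedTuple, Set, Tuple

# Assumed representation: (chromosome, donor position, acceptor position, strand).
Junction = Tuple[str, int, int, str]


class JunctionMetrics(NamedTuple):
    true_positives: int
    precision: float
    recall: float
    false_discovery_rate: float


def junction_metrics(predicted: Set[Junction], reference: Set[Junction]) -> JunctionMetrics:
    """Compare predicted junctions with a reference set by exact coordinate match."""
    tp = len(predicted & reference)
    # Guard against empty inputs so the function never divides by zero.
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    fdr = 1.0 - precision if predicted else 0.0
    return JunctionMetrics(tp, precision, recall, fdr)


# Toy data: two exact matches, one off-by-ten acceptor, one unannotated junction.
reference = {
    ("chr1", 100, 200, "+"),
    ("chr1", 300, 400, "+"),
    ("chr2", 50, 90, "-"),
}
predicted = {
    ("chr1", 100, 200, "+"),
    ("chr1", 300, 410, "+"),
    ("chr2", 50, 90, "-"),
    ("chr2", 120, 180, "-"),
}
m = junction_metrics(predicted, reference)
print(m)
```

Exact coordinate matching is the strictest possible choice; a real benchmark might also report junctions that are absent from the annotation but supported by other evidence, which this sketch simply counts as false discoveries.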
5.  Genome-wide profiling of the cardiac transcriptome after myocardial infarction identifies novel heart-specific long non-coding RNAs 
European Heart Journal  2014;36(6):353-368.
Heart disease is recognized as a consequence of dysregulation of cardiac gene regulatory networks. Previously unappreciated components of such networks are the long non-coding RNAs (lncRNAs). Their roles in the heart remain to be elucidated. Thus, this study aimed to systematically characterize the cardiac long non-coding transcriptome post-myocardial infarction and to elucidate their potential roles in cardiac homoeostasis.
Methods and results
We annotated the mouse transcriptome after myocardial infarction via RNA sequencing and ab initio transcript reconstruction, and integrated genome-wide approaches to associate specific lncRNAs with developmental processes and physiological parameters. Expression of specific lncRNAs strongly correlated with defined parameters of cardiac dimensions and function. Using chromatin maps to infer lncRNA function, we identified many with potential roles in cardiogenesis and pathological remodelling. The vast majority was associated with active cardiac-specific enhancers. Importantly, oligonucleotide-mediated knockdown implicated novel lncRNAs in controlling expression of key regulatory proteins involved in cardiogenesis. Finally, we identified hundreds of human orthologues and demonstrate that particular candidates were differentially modulated in human heart disease.
These findings reveal hundreds of novel heart-specific lncRNAs with unique regulatory and functional characteristics relevant to maladaptive remodelling, cardiac function and possibly cardiac regeneration. This new class of molecules represents potential therapeutic targets for cardiac disease. Furthermore, their exquisite correlation with cardiac physiology renders them attractive candidate biomarkers to be used in the clinic.
PMCID: PMC4320320  PMID: 24786300
Myocardial infarction; Heart failure; Transcriptome; Long non-coding RNAs; Next-generation sequencing
6.  Transcriptome and genome sequencing uncovers functional variation in humans 
Nature  2013;501(7468):506-511.
Genome sequencing projects are discovering millions of genetic variants in humans, and interpretation of their functional effects is essential for understanding the genetic basis of variation in human traits. Here we report sequencing and deep analysis of mRNA and miRNA from lymphoblastoid cell lines of 462 individuals from the 1000 Genomes Project – the first uniformly processed RNA-seq data from multiple human populations with high-quality genome sequences. We discovered extremely widespread genetic variation affecting regulation of the majority of genes, with transcript structure and expression level variation being equally common but genetically largely independent. Our characterization of causal regulatory variation sheds light on cellular mechanisms of regulatory and loss-of-function variation, and allowed us to infer putative causal variants for dozens of disease-associated loci. Altogether, this study provides a deep understanding of the cellular mechanisms of transcriptome variation and of the landscape of functional variants in the human genome.
PMCID: PMC3918453  PMID: 24037378
7.  ASPic-GeneID: A Lightweight Pipeline for Gene Prediction and Alternative Isoforms Detection 
BioMed Research International  2013;2013:502827.
New genomes are being sequenced at an increasingly rapid rate, far outpacing the rate at which manual gene annotation can be performed. Automated genome annotation is thus necessitated by this growth in genome projects; however, full-fledged annotation systems are usually home-grown and customized to a particular genome. There is thus a renewed need for accurate ab initio gene prediction methods. However, it is apparent that fully ab initio methods fall short of the required level of sensitivity and specificity for a quality annotation. Evidence in the form of expressed sequences gives the single biggest improvement in accuracy when used to inform gene predictions. Here, we present a lightweight pipeline for first-pass gene prediction on newly sequenced genomes. The two main components are ASPic, a program that derives highly accurate, albeit not necessarily complete, EST-based transcript annotations from EST alignments, and GeneID, a standard gene prediction program, which we have modified to take intron annotations as evidence. The introns output by ASPic CDS predictions are given to GeneID to constrain the exon-chaining process and produce predictions consistent with the underlying EST alignments. The pipeline was successfully tested on the entire C. elegans genome and the 44 ENCODE human pilot regions.
PMCID: PMC3838850  PMID: 24308000
8.  SelenoDB 2.0: annotation of selenoprotein genes in animals and their genetic diversity in humans 
Nucleic Acids Research  2013;42(Database issue):D437-D443.
SelenoDB aims to provide high-quality annotations of selenoprotein genes, proteins and SECIS elements. Selenoproteins are proteins that contain the amino acid selenocysteine (Sec) and the first release of the database included annotations for eight species. Since the release of SelenoDB 1.0 many new animal genomes have been sequenced. The annotations of selenoproteins in new genomes usually contain many errors in major databases. For this reason, we have now fully annotated selenoprotein genes in 58 animal genomes. We provide manually curated annotations for human selenoproteins, whereas we use an automatic annotation pipeline to annotate selenoprotein genes in other animal genomes. In addition, we annotate the homologous genes containing cysteine (Cys) instead of Sec. Finally, we have surveyed genetic variation in the annotated genes in humans. We use exon capture and resequencing approaches to identify single-nucleotide polymorphisms in more than 50 human populations around the world. We thus present a detailed view of the genetic divergence of Sec- and Cys-containing genes in animals and their diversity in humans. The addition of these datasets into the second release of the database provides a valuable resource for addressing medical and evolutionary questions in selenium biology.
PMCID: PMC3965025  PMID: 24194593
9.  Topoisomerase II regulates yeast genes with singular chromatin architectures 
Nucleic Acids Research  2013;41(20):9243-9256.
Eukaryotic topoisomerase II (topo II) is the essential decatenase of newly replicated chromosomes and the main relaxase of nucleosomal DNA. Apart from these general tasks, topo II participates in more specialized functions. In mammals, topo IIα interacts with specific RNA polymerases and chromatin-remodeling complexes, whereas topo IIβ regulates developmental genes in conjunction with chromatin remodeling and heterochromatin transitions. Here we show that in budding yeast, topo II regulates the expression of specific gene subsets. To uncover this, we carried out a genomic transcription run-on shortly after the thermal inactivation of topo II. We identified a modest number of genes not involved in the general stress response but strictly dependent on topo II. These genes present distinctive functional and structural traits in comparison with the genome average. Yeast topo II is a positive regulator of genes with well-defined promoter architecture that associates with chromatin remodeling complexes; it is a negative regulator of extremely hypo-acetylated genes with complex promoters and undefined nucleosome positioning, many of which are involved in polyamine transport. These findings indicate that yeast topo II operates on singular chromatin architectures to activate or repress DNA transcription and that this activity produces functional responses to ensure chromatin stability.
PMCID: PMC3814376  PMID: 23935120
10.  Variation in Novel Exons (RACEfrags) of the MECP2 Gene in Rett Syndrome Patients and Controls 
Human mutation  2009;30(9):E866-E879.
The study of transcription using genomic tiling arrays has led to the identification of numerous additional exons. One example is the MECP2 gene on the X chromosome; using 5′ RACE and RT-PCR in human tissues and cell lines, we have found more than 70 novel exons (RACEfrags) connecting to at least one annotated exon. We sequenced all MECP2-connected exons and flanking sequences in 3 groups: 46 patients with Rett syndrome and without mutations in the currently annotated exons of the MECP2 and CDKL5 genes; 32 patients with Rett syndrome and identified mutations in the MECP2 gene; and 100 control individuals from the same geoethnic group. Approximately 13 kb were sequenced per sample (2.4 Mb of DNA resequencing). A total of 75 individuals had novel rare variants (mostly private variants) but no statistically significant difference was found among the 3 groups. These results suggest that variants in the newly discovered exons may not contribute to Rett syndrome. Interestingly, however, there are about twice as many variants in the novel exons as in the flanking sequences (44 vs. 21 for approximately 1.3 Mb sequenced for each class of sequences, p = 0.0025). Thus the evolutionary forces that shape these novel exons may be different from those of neighboring sequences.
PMCID: PMC3708316  PMID: 19562714
MECP2; Rett syndrome; RACEfrags; SNP; rare variants; positive selection
11.  Landscape of transcription in human cells 
Djebali, Sarah | Davis, Carrie A. | Merkel, Angelika | Dobin, Alex | Lassmann, Timo | Mortazavi, Ali M. | Tanzer, Andrea | Lagarde, Julien | Lin, Wei | Schlesinger, Felix | Xue, Chenghai | Marinov, Georgi K. | Khatun, Jainab | Williams, Brian A. | Zaleski, Chris | Rozowsky, Joel | Röder, Maik | Kokocinski, Felix | Abdelhamid, Rehab F. | Alioto, Tyler | Antoshechkin, Igor | Baer, Michael T. | Bar, Nadav S. | Batut, Philippe | Bell, Kimberly | Bell, Ian | Chakrabortty, Sudipto | Chen, Xian | Chrast, Jacqueline | Curado, Joao | Derrien, Thomas | Drenkow, Jorg | Dumais, Erica | Dumais, Jacqueline | Duttagupta, Radha | Falconnet, Emilie | Fastuca, Meagan | Fejes-Toth, Kata | Ferreira, Pedro | Foissac, Sylvain | Fullwood, Melissa J. | Gao, Hui | Gonzalez, David | Gordon, Assaf | Gunawardena, Harsha | Howald, Cedric | Jha, Sonali | Johnson, Rory | Kapranov, Philipp | King, Brandon | Kingswood, Colin | Luo, Oscar J. | Park, Eddie | Persaud, Kimberly | Preall, Jonathan B. | Ribeca, Paolo | Risk, Brian | Robyr, Daniel | Sammeth, Michael | Schaffer, Lorian | See, Lei-Hoon | Shahab, Atif | Skancke, Jorgen | Suzuki, Ana Maria | Takahashi, Hazuki | Tilgner, Hagen | Trout, Diane | Walters, Nathalie | Wang, Huaien | Wrobel, John | Yu, Yanbao | Ruan, Xiaoan | Hayashizaki, Yoshihide | Harrow, Jennifer | Gerstein, Mark | Hubbard, Tim | Reymond, Alexandre | Antonarakis, Stylianos E. | Hannon, Gregory | Giddings, Morgan C. | Ruan, Yijun | Wold, Barbara | Carninci, Piero | Guigó, Roderic | Gingeras, Thomas R.
Nature  2012;489(7414):101-108.
Eukaryotic cells make many types of primary and processed RNAs that are found either in specific sub-cellular compartments or throughout the cells. A complete catalogue of these RNAs is not yet available and their characteristic sub-cellular localizations are also poorly understood. Since RNA represents the direct output of the genetic information encoded by genomes and a significant proportion of a cell’s regulatory capabilities are focused on its synthesis, processing, transport, modifications and translation, the generation of such a catalogue is crucial for understanding genome function. Here we report evidence that three quarters of the human genome is capable of being transcribed, as well as observations about the range and levels of expression, localization, processing fates, regulatory regions and modifications of almost all currently annotated and thousands of previously unannotated RNAs. These observations taken together prompt a redefinition of the concept of a gene.
PMCID: PMC3684276  PMID: 22955620
12.  Unravelling the hidden DNA structural/physical code provides novel insights on promoter location 
Nucleic Acids Research  2013;41(15):7220-7230.
Although protein recognition of DNA motifs in promoter regions has been traditionally considered as a critical regulatory element in transcription, the location of promoters, and in particular transcription start sites (TSSs), still remains a challenge. Here we perform a comprehensive analysis of putative core promoter sequences relative to non-annotated predicted TSSs along the human genome, which were defined by distinct DNA physical properties implemented in our ProStar computational algorithm. A representative sampling of predicted regions was subjected to extensive experimental validation and analyses. Interestingly, the vast majority proved to be transcriptionally active despite the lack of specific sequence motifs, indicating that physical signaling is indeed able to detect promoter activity beyond conventional TSS prediction methods. Furthermore, highly active regions displayed typical chromatin features associated with promoters of housekeeping genes. Our results make it possible to redefine the promoter signatures and analyze the diversity, evolutionary conservation and dynamic regulation of human core promoters at large scale. Moreover, the present study strongly supports the hypothesis of an ancient regulatory mechanism encoded by the intrinsic physical properties of the DNA that may contribute to the complexity of transcription regulation in the human genome.
PMCID: PMC3753636  PMID: 23761436
13.  Transcriptome analyses of primitively eusocial wasps reveal novel insights into the evolution of sociality and the origin of alternative phenotypes 
Genome Biology  2013;14(2):R20.
Understanding how alternative phenotypes arise from the same genome is a major challenge in modern biology. Eusociality in insects requires the evolution of two alternative phenotypes - workers, who sacrifice personal reproduction, and queens, who realize that reproduction. Extensive work on honeybees and ants has revealed the molecular basis of derived queen and worker phenotypes in highly eusocial lineages, but we lack equivalent deep-level analyses of wasps and of primitively eusocial species, the latter of which can reveal how phenotypic decoupling first occurs in the early stages of eusocial evolution.
We sequenced 20 Gbp of transcriptomes derived from brains of different behavioral castes of the primitively eusocial tropical paper wasp Polistes canadensis. Surprisingly, 75% of the 2,442 genes differentially expressed between phenotypes were novel, having no significant homology with described sequences. Moreover, 90% of these novel genes were significantly upregulated in workers relative to queens. Differential expression of novel genes in the early stages of sociality may be important in facilitating the evolution of worker behavioral complexity in eusocial evolution. We also found surprisingly low correlation in the identity and direction of expression of differentially expressed genes across similar phenotypes in different social lineages, supporting the idea that social evolution in different lineages requires substantial de novo rewiring of molecular pathways.
These genomic resources for aculeate wasps and first transcriptome-wide insights into the origin of castes bring us closer to a more general understanding of eusocial evolution and how phenotypic diversity arises from the same genome.
PMCID: PMC4053794  PMID: 23442883
14.  Grape RNA-Seq analysis pipeline environment 
Bioinformatics  2013;29(5):614-621.
Motivation: The avalanche of data arriving since the development of NGS technologies has prompted the need for developing fast, accurate and easily automated bioinformatic tools capable of dealing with massive datasets. Among the most productive applications of NGS technologies is the sequencing of cellular RNA, known as RNA-Seq. Although RNA-Seq provides a dynamic range similar or superior to that of microarrays at similar or lower cost, the lack of standard and user-friendly pipelines is a bottleneck preventing RNA-Seq from becoming the standard for transcriptome analysis.
Results: In this work we present a pipeline for processing and analyzing RNA-Seq data that we have named Grape (Grape RNA-Seq Analysis Pipeline Environment). Grape supports raw sequencing reads produced by a variety of technologies, either in FASTA or FASTQ format, or as prealigned reads in SAM/BAM format. A minimal Grape configuration consists of the file location of the raw sequencing reads, the genome of the species and the corresponding gene and transcript annotation.
Grape first runs a set of quality control steps, and then aligns the reads to the genome, a step that is omitted for prealigned read formats. Grape next estimates gene and transcript expression levels, calculates exon inclusion levels and identifies novel transcripts.
Grape can be run on a single computer or in parallel on a computer cluster. It is distributed with specific mapping and quantification tools, but given its modular design, any tool supporting popular data interchange formats can be integrated.
Availability: Grape can be obtained from the Bioinformatics and Genomics website.
PMCID: PMC3582270  PMID: 23329413
15.  Intron-centric estimation of alternative splicing from RNA-seq data 
Bioinformatics  2012;29(2):273-274.
Motivation: Novel technologies brought in unprecedented amounts of high-throughput sequencing data along with great challenges in their analysis and interpretation. The percent-spliced-in (PSI, Ψ) metric estimates the incidence of single-exon–skipping events and can be computed directly by counting reads that align to known or predicted splice junctions. However, the majority of human splicing events are more complex than single-exon skipping.
Results: In this short report, we present a framework that generalizes the Ψ metric to arbitrary classes of splicing events. We change the view from exon centric to intron centric and split the value of Ψ into two indices that measure the rate of splicing at the 5′ and 3′ ends of the intron, respectively. The advantage of having two separate indices is that they deconvolute two distinct elementary acts of the splicing reaction. The completeness of splicing index is decomposed in a similar way. This framework is implemented as bam2ssj, a BAM-file–processing pipeline for strand-specific counting of reads that align to splice junctions or overlap with splice sites. It can be used as a consistent protocol for quantifying splice junctions from RNA-seq data because no such standard procedure currently exists.
Availability: The C code of bam2ssj is open source.
PMCID: PMC3546801  PMID: 23172860
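To make the exon-centric versus intron-centric distinction in the abstract above concrete, the sketch below computes both kinds of index from split-read counts. It is only one plausible reading of the abstract: the exon-centric formula is a commonly used convention, the two intron-centric ratios (reads for an intron divided by all reads sharing its donor, or its acceptor) are an assumption, and the names `exon_psi`, `intron_indices`, `psi5` and `psi3` are hypothetical. None of this is the bam2ssj code.

```python
from collections import defaultdict
from typing import Dict, Tuple

# Assumed input: split-read counts keyed by (donor, acceptor) coordinates
# for junctions on one strand of one chromosome.
JunctionCounts = Dict[Tuple[int, int], int]


def exon_psi(upstream_incl: int, downstream_incl: int, skip: int) -> float:
    """Exon-centric PSI for a single exon-skipping event.

    The two inclusion junctions are averaged against the one skipping
    junction, which simplifies to (a + b) / (a + b + 2c).
    """
    inclusion = upstream_incl + downstream_incl
    total = inclusion + 2 * skip
    return inclusion / total if total else float("nan")


def intron_indices(counts: JunctionCounts) -> Dict[Tuple[int, int], Tuple[float, float]]:
    """Return (psi5, psi3) per intron.

    psi5: share of reads leaving the intron's donor that use this acceptor.
    psi3: share of reads entering the intron's acceptor that use this donor.
    """
    donor_total: Dict[int, int] = defaultdict(int)
    acceptor_total: Dict[int, int] = defaultdict(int)
    for (donor, acceptor), n in counts.items():
        donor_total[donor] += n
        acceptor_total[acceptor] += n

    result = {}
    for (donor, acceptor), n in counts.items():
        d, a = donor_total[donor], acceptor_total[acceptor]
        psi5 = n / d if d else float("nan")
        psi3 = n / a if a else float("nan")
        result[(donor, acceptor)] = (psi5, psi3)
    return result


# Toy exon-skipping event: exon 1 ends at 100, the cassette exon spans
# 200-250, exon 3 starts at 400.
counts = {(100, 200): 30, (250, 400): 30, (100, 400): 10}
psi = exon_psi(counts[(100, 200)], counts[(250, 400)], counts[(100, 400)])
indices = intron_indices(counts)
print(psi, indices)
```

Because the intron-centric ratios need only per-junction counts, they stay defined for events with several competing donors or acceptors, where a single exon-centric value does not apply.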
16.  Modelling and simulating generic RNA-Seq experiments with the flux simulator 
Nucleic Acids Research  2012;40(20):10073-10083.
High-throughput sequencing of cDNA libraries constructed from cellular RNA complements (RNA-Seq) naturally provides a digital quantitative measurement for every expressed RNA molecule. Nature, impact and mutual interference of biases in different experimental setups are, however, still poorly understood, mostly due to the lack of data from intermediate protocol steps. We analysed multiple RNA-Seq experiments, involving different sample preparation protocols and sequencing platforms: we broke them down into their common, and currently indispensable, technical components (reverse transcription, fragmentation, adapter ligation, PCR amplification, gel segregation and sequencing), investigating how such different steps influence abundance and distribution of the sequenced reads. For each of those steps, we developed universally applicable models, which can be parameterised by empirical attributes of any experimental protocol. Our models are implemented in a computer simulation pipeline called the Flux Simulator, and we show that read distributions generated by different combinations of these models closely reproduce the evidence obtained from the corresponding experimental setups. We further demonstrate that our in silico RNA-Seq provides insights about hidden precursors that determine the final configuration of reads along gene bodies; enhancing or compensatory effects that explain apparently controversial observations can be observed. Moreover, our simulations identify hitherto unreported sources of systematic bias from RNA hydrolysis, a fragmentation technique currently employed by most RNA-Seq protocols.
PMCID: PMC3488205  PMID: 22962361
17.  Modeling gene expression using chromatin features in various cellular contexts 
Genome Biology  2012;13(9):R53.
Previous work has demonstrated that chromatin feature levels correlate with gene expression. The ENCODE project enables us to further explore this relationship using an unprecedented volume of data. Expression levels from more than 100,000 promoters were measured using a variety of high-throughput techniques applied to RNA extracted by different protocols from different cellular compartments of several human cell lines. ENCODE also generated the genome-wide mapping of eleven histone marks, one histone variant, and DNase I hypersensitivity sites in seven cell lines.
We built a novel quantitative model to study the relationship between chromatin features and expression levels. Our study not only confirms that the general relationships found in previous studies hold across various cell lines, but also makes new suggestions about the relationship between chromatin features and gene expression levels. We found that expression status and expression levels can be predicted by different groups of chromatin features, both with high accuracy. We also found that expression levels measured by CAGE are better predicted than by RNA-PET or RNA-Seq, and different categories of chromatin features are the most predictive of expression for different RNA measurement methods. Additionally, PolyA+ RNA is overall more predictable than PolyA- RNA among different cell compartments, and PolyA+ cytosolic RNA measured with RNA-Seq is more predictable than PolyA+ nuclear RNA, while the opposite is true for PolyA- RNA.
Our study provides new insights into transcriptional regulation by analyzing chromatin features in different cellular contexts.
PMCID: PMC3491397  PMID: 22950368
18.  Whole-genome sequencing identifies recurrent mutations in chronic lymphocytic leukaemia 
Nature  2011;475(7354):101-105.
Chronic lymphocytic leukaemia (CLL), the most frequent leukaemia in adults in Western countries, is a heterogeneous disease with variable clinical presentation and evolution [1,2]. Two major molecular subtypes can be distinguished, characterized respectively by a high or low number of somatic hypermutations in the variable region of immunoglobulin genes [3,4]. The molecular changes leading to the pathogenesis of the disease are still poorly understood. Here we performed whole-genome sequencing of four cases of CLL and identified 46 somatic mutations that potentially affect gene function. Further analysis of these mutations in 363 patients with CLL identified four genes that are recurrently mutated: notch 1 (NOTCH1), exportin 1 (XPO1), myeloid differentiation primary response gene 88 (MYD88) and kelch-like 6 (KLHL6). Mutations in MYD88 and KLHL6 are predominant in cases of CLL with mutated immunoglobulin genes, whereas NOTCH1 and XPO1 mutations are mainly detected in patients with unmutated immunoglobulins. The patterns of somatic mutation, supported by functional and clinical analyses, strongly indicate that the recurrent NOTCH1, MYD88 and XPO1 mutations are oncogenic changes that contribute to the clinical evolution of the disease. To our knowledge, this is the first comprehensive analysis of CLL combining whole-genome sequencing with clinical characteristics and clinical outcomes. It highlights the usefulness of this approach for the identification of clinically relevant mutations in cancer.
PMCID: PMC3322590  PMID: 21642962
19.  Fast Computation and Applications of Genome Mappability 
PLoS ONE  2012;7(1):e30377.
We present a fast mapping-based algorithm to compute the mappability of each region of a reference genome up to a specified number of mismatches. Knowing the mappability of a genome is crucial for the interpretation of massively parallel sequencing experiments. We investigate the properties of the mappability of eukaryotic DNA/RNA both as a whole and at the level of the gene family, providing for various organisms tracks which allow the mappability information to be visually explored. In addition, we show that mappability varies greatly between species and gene classes. Finally, we suggest several practical applications where mappability can be used to refine the analysis of high-throughput sequencing data (SNP calling, gene expression quantification and paired-end experiments). This work highlights mappability as an important concept which deserves to be taken into full account, in particular when massively parallel sequencing technologies are employed. The GEM mappability program belongs to the GEM (GEnome Multitool) suite of programs, which can be freely downloaded for any use from its website (
PMCID: PMC3261895  PMID: 22276185
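As a rough illustration of the mappability concept described in the abstract above, the following Python sketch scores each start position of a sequence by the reciprocal of the number of times its k-mer occurs. This is an assumption-laden toy, not the GEM algorithm: it counts exact matches only (zero mismatches), considers the forward strand only, and takes "1 / occurrence count" as the definition of mappability, which is a common convention assumed here. The function name `kmer_mappability` is hypothetical.

```python
from collections import Counter


def kmer_mappability(genome: str, k: int) -> list[float]:
    """Score each k-mer start position by 1 / (exact occurrences of that k-mer).

    Assumptions (not taken from the abstract): exact matching only,
    forward strand only, and mappability defined as the reciprocal
    of the occurrence count.
    """
    if k <= 0 or k > len(genome):
        return []
    # Enumerate every k-mer in the sequence, in positional order.
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    # Count how often each distinct k-mer occurs across the whole sequence.
    counts = Counter(kmers)
    # A unique k-mer scores 1.0; a k-mer seen n times scores 1/n.
    return [1.0 / counts[kmer] for kmer in kmers]


# Toy input: "ACGT" occurs twice, every other 4-mer occurs once.
scores = kmer_mappability("ACGTACGTTT", 4)
```

Positions whose score is below 1.0 correspond to repeated k-mers, where a read of length k could not be placed unambiguously; a realistic tool would also have to tolerate mismatches and scale to whole genomes, which this sketch does not attempt.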
20.  The Long Non-Coding RNAs: A New (P)layer in the “Dark Matter” 
Frontiers in Genetics  2012;2:107.
The transcriptome of a cell is represented by a myriad of different RNA molecules with and without protein-coding capacities. In recent years, advances in sequencing technologies have allowed researchers to more fully appreciate the complexity of whole transcriptomes, showing that the vast majority of the genome is transcribed, producing a diverse population of non-protein coding RNAs (ncRNAs). Thus, the biological significance of ncRNAs has been largely underestimated. Amongst these multiple classes of ncRNAs, the long non-coding RNAs (lncRNAs) are apparently the most numerous and functionally diverse. A small but growing number of lncRNAs have been experimentally studied, and a view is emerging that these are key regulators of epigenetic gene regulation in mammalian cells. LncRNAs have already been implicated in human diseases such as cancer and neurodegeneration, highlighting the importance of this emergent field. In this article, we review the catalogs of annotated lncRNAs and the latest advances in our understanding of lncRNAs.
PMCID: PMC3266617  PMID: 22303401
non-coding RNAs; regulation; long non-coding RNA; epigenetics
21.  Evidence for Transcript Networks Composed of Chimeric RNAs in Human Cells 
PLoS ONE  2012;7(1):e28213.
The classic organization of a gene structure has followed the Jacob and Monod bacterial gene model proposed more than 50 years ago. Since then, empirical determinations of the complexity of the transcriptomes found in organisms from yeast to human have blurred the definition and physical boundaries of genes. Using multiple analysis approaches we have characterized individual gene boundaries mapping on human chromosomes 21 and 22. Analyses of the locations of the 5′ and 3′ transcriptional termini of 492 protein coding genes revealed that for 85% of these genes the boundaries extend beyond the current annotated termini, most often connecting with exons of transcripts from other well annotated genes. The biological and evolutionary importance of these chimeric transcripts is underscored by (1) the non-random interconnections of genes involved, (2) the greater phylogenetic depth of the genes involved in many chimeric interactions, (3) the coordination of the expression of connected genes and (4) the close in vivo and three dimensional proximity of the genomic regions being transcribed and contributing to parts of the chimeric RNAs. The non-random nature of the connection of the genes involved suggests that chimeric transcripts should not be studied in isolation, but together, as an RNA network.
PMCID: PMC3251577  PMID: 22238572
22.  Genome-wide CTCF distribution in vertebrates defines equivalent sites that can aid in the identification of disease-associated genes 
Many genomic alterations associated with human diseases localize in non-coding regulatory elements located far from the promoters they regulate, making the association of non-coding mutations or risk-associated variants with target genes challenging. The range of action of a given set of enhancers is thought to be defined by insulator elements bound by CTCF. Here, we analyzed the genomic distribution of CTCF in various human, mouse and chicken cell types, demonstrating the existence of evolutionarily conserved CTCF-bound sites beyond mammals. These sites preferentially flank transcription factor-encoding genes, often associated with human diseases, and function as enhancer blockers in vivo, suggesting that they act as evolutionarily invariant gene boundaries. We then applied this concept to predict and functionally demonstrate that the polymorphic variants associated with multiple sclerosis located within the EVI5 gene are actually impinging on the adjacent gene GFI1.
PMCID: PMC3196567  PMID: 21602820
23.  Interplay between BRCA1 and RHAMM Regulates Epithelial Apicobasal Polarization and May Influence Risk of Breast Cancer 
Maxwell, Christopher A. | Benítez, Javier | Gómez-Baldó, Laia | Osorio, Ana | Bonifaci, Núria | Fernández-Ramires, Ricardo | Costes, Sylvain V. | Guinó, Elisabet | Chen, Helen | Evans, Gareth J. R. | Mohan, Pooja | Català, Isabel | Petit, Anna | Aguilar, Helena | Villanueva, Alberto | Aytes, Alvaro | Serra-Musach, Jordi | Rennert, Gad | Lejbkowicz, Flavio | Peterlongo, Paolo | Manoukian, Siranoush | Peissel, Bernard | Ripamonti, Carla B. | Bonanni, Bernardo | Viel, Alessandra | Allavena, Anna | Bernard, Loris | Radice, Paolo | Friedman, Eitan | Kaufman, Bella | Laitman, Yael | Dubrovsky, Maya | Milgrom, Roni | Jakubowska, Anna | Cybulski, Cezary | Gorski, Bohdan | Jaworska, Katarzyna | Durda, Katarzyna | Sukiennicki, Grzegorz | Lubiński, Jan | Shugart, Yin Yao | Domchek, Susan M. | Letrero, Richard | Weber, Barbara L. | Hogervorst, Frans B. L. | Rookus, Matti A. | Collee, J. Margriet | Devilee, Peter | Ligtenberg, Marjolijn J. | van der Luijt, Rob B. | Aalfs, Cora M. | Waisfisz, Quinten | Wijnen, Juul | van Roozendaal, Cornelis E. P. | Easton, Douglas F. | Peock, Susan | Cook, Margaret | Oliver, Clare | Frost, Debra | Harrington, Patricia | Evans, D. Gareth | Lalloo, Fiona | Eeles, Rosalind | Izatt, Louise | Chu, Carol | Eccles, Diana | Douglas, Fiona | Brewer, Carole | Nevanlinna, Heli | Heikkinen, Tuomas | Couch, Fergus J. | Lindor, Noralane M. | Wang, Xianshu | Godwin, Andrew K. | Caligo, Maria A. | Lombardi, Grazia | Loman, Niklas | Karlsson, Per | Ehrencrona, Hans | von Wachenfeldt, Anna | Bjork Barkardottir, Rosa | Hamann, Ute | Rashid, Muhammad U. | Lasa, Adriana | Caldés, Trinidad | Andrés, Raquel | Schmitt, Michael | Assmann, Volker | Stevens, Kristen | Offit, Kenneth | Curado, João | Tilgner, Hagen | Guigó, Roderic | Aiza, Gemma | Brunet, Joan | Castellsagué, Joan | Martrat, Griselda | Urruticoechea, Ander | Blanco, Ignacio | Tihomirova, Laima | Goldgar, David E. | Buys, Saundra | John, Esther M. | Miron, Alexander | Southey, Melissa | Daly, Mary B. 
| Schmutzler, Rita K. | Wappenschmidt, Barbara | Meindl, Alfons | Arnold, Norbert | Deissler, Helmut | Varon-Mateeva, Raymonda | Sutter, Christian | Niederacher, Dieter | Imyamitov, Evgeny | Sinilnikova, Olga M. | Stoppa-Lyonne, Dominique | Mazoyer, Sylvie | Verny-Pierre, Carole | Castera, Laurent | de Pauw, Antoine | Bignon, Yves-Jean | Uhrhammer, Nancy | Peyrat, Jean-Philippe | Vennin, Philippe | Fert Ferrer, Sandra | Collonge-Rame, Marie-Agnès | Mortemousque, Isabelle | Spurdle, Amanda B. | Beesley, Jonathan | Chen, Xiaoqing | Healey, Sue | Barcellos-Hoff, Mary Helen | Vidal, Marc | Gruber, Stephen B. | Lázaro, Conxi | Capellá, Gabriel | McGuffog, Lesley | Nathanson, Katherine L. | Antoniou, Antonis C. | Chenevix-Trench, Georgia | Fleisch, Markus C. | Moreno, Víctor | Pujana, Miguel Angel
PLoS Biology  2011;9(11):e1001199.
Genetic analysis identifies the HMMR gene as a modifier of the breast cancer risk associated with BRCA1 gene mutation, while cell biological analysis of the protein product suggests a function in regulating development of the mammary gland.
Differentiated mammary epithelium shows apicobasal polarity, and loss of tissue organization is an early hallmark of breast carcinogenesis. In BRCA1 mutation carriers, accumulation of stem and progenitor cells in normal breast tissue and increased risk of developing tumors of basal-like type suggest that BRCA1 regulates stem/progenitor cell proliferation and differentiation. However, the function of BRCA1 in this process and its link to carcinogenesis remain unknown. Here we depict a molecular mechanism involving BRCA1 and RHAMM that regulates apicobasal polarity and, when perturbed, may increase risk of breast cancer. Starting from complementary genetic analyses across families and populations, we identified common genetic variation at the low-penetrance susceptibility HMMR locus (encoding RHAMM) that modifies breast cancer risk among BRCA1, but probably not BRCA2, mutation carriers: n = 7,584, weighted hazard ratio (wHR) = 1.09 (95% CI 1.02–1.16), p_trend = 0.017; and n = 3,965, wHR = 1.04 (95% CI 0.94–1.16), p_trend = 0.43; respectively. Subsequently, studies of MCF10A apicobasal polarization revealed a central role for BRCA1 and RHAMM, together with AURKA and TPX2, in essential reorganization of microtubules. Mechanistically, reorganization is facilitated by BRCA1 and impaired by AURKA, which is regulated by negative feedback involving RHAMM and TPX2. Taken together, our data provide fundamental insight into apicobasal polarization through BRCA1 function, which may explain the expanded cell subsets and characteristic tumor type accompanying BRCA1 mutation, while also linking this process to sporadic breast cancer through perturbation of HMMR/RHAMM.
Author Summary
Mutations in two genes that were initially identified as predisposing carriers to early-onset breast cancer, BRCA1 and BRCA2, cause similar perturbations in cellular responses to DNA damage but predispose carriers to distinct tumor types. Thus, the two genes may trigger different carcinogenic processes. We have used genetic analyses of affected families to uncover additional genetic variation that is linked to the risk of developing cancer for carriers of BRCA1 mutations. This variation falls within a centrosomal gene, named HMMR. The protein product of HMMR, which is called RHAMM, works in concert with BRCA1 to regulate the structure of normal breast cells as they grow and become polarized. This polarization process depends upon a balance between the activities of BRCA1 and the Aurora kinase A, with the kinase opposing BRCA1 function and promoting growth. Our findings provide new insights into the mechanism through which BRCA1 may promote commitment of initially bipotent mammary cells towards the luminal lineage, and how loss of this function may predispose cells to become breast tumors of a basal-like type.
PMCID: PMC3217025  PMID: 22110403
24.  The Origins, Evolution, and Functional Potential of Alternative Splicing in Vertebrates 
Molecular Biology and Evolution  2011;28(10):2949-2959.
Alternative splicing (AS) has the potential to greatly expand the functional repertoire of mammalian transcriptomes. However, few variant transcripts have been characterized functionally, making it difficult to assess the contribution of AS to the generation of phenotypic complexity and to study the evolution of splicing patterns. We have compared the AS of 309 protein-coding genes in the human ENCODE pilot regions against their mouse orthologs in unprecedented detail, utilizing traditional transcriptomic and RNAseq data. The conservation status of every transcript has been investigated, and each functionally categorized as coding (separated into coding sequence [CDS] or nonsense-mediated decay [NMD] linked) or noncoding. In total, 36.7% of human and 19.3% of mouse coding transcripts are species specific, and we observe a 3.6-fold excess of human NMD transcripts compared with mouse; in contrast to previous studies, the majority of species-specific AS is unlinked to transposable elements. We observe one conserved CDS variant and one conserved NMD variant per 2.3 and 11.4 genes, respectively. Subsequently, we identify and characterize equivalent AS patterns for 22.9% of these CDS or NMD-linked events in nonmammalian vertebrate genomes, and our data indicate that functional NMD-linked AS is more widespread and ancient than previously thought. Furthermore, although we observe an association between conserved AS and elevated sequence conservation, as previously reported, we emphasize that 30% of conserved AS exons display sequence conservation below the average score for constitutive exons. In conclusion, we demonstrate the value of detailed comparative annotation in generating a comprehensive set of AS transcripts, increasing our understanding of AS evolution in vertebrates. Our data support a model whereby the acquisition of functional AS has occurred throughout vertebrate evolution and is considered alongside amino acid change as a key mechanism in gene evolution.
PMCID: PMC3176834  PMID: 21551269
alternative splicing; nonsense-mediated decay; vertebrate evolution; RBM39
25.  Structural constraints revealed in consistent nucleosome positions in the genome of S. cerevisiae 
Recent advances in the field of high-throughput genomics have rendered possible the performance of genome-scale studies to define the nucleosomal landscapes of eukaryote genomes. Such analyses are aimed towards providing a better understanding of the process of nucleosome positioning, for which several models have been suggested. Nevertheless, questions regarding the sequence constraints of nucleosomal DNA and how they may have been shaped through evolution remain open. In this paper, we analyze in detail different experimental nucleosome datasets with the aim of providing a hypothesis for the emergence of nucleosome-forming sequences.
We compared the complete sets of nucleosome positions for the budding yeast (Saccharomyces cerevisiae) as defined in the output of two independent experiments with the use of two different experimental techniques. We found that < 10% of the experimentally defined nucleosome positions were consistently positioned in both datasets. This subset of well-positioned nucleosomes, when compared with the bulk, was shown to have particular properties at both sequence and structural levels. Consistently positioned nucleosomes were also shown to occur preferentially in pairs of dinucleosomes, and to be surprisingly less conserved compared with their adjacent nucleosome-free linkers.
Our findings may be combined into a hypothesis for the emergence of a weak nucleosome-positioning code. According to this hypothesis, consistent nucleosomes may be partly guided by nearby nucleosome-free regions through statistical positioning. Once established, a set of well-positioned consistent nucleosomes may impose secondary constraints that further shape the structure of the underlying DNA. We were able to capture these constraints through the application of a recently introduced structural property that is related to the symmetry of DNA curvature. Furthermore, we found that both consistently positioned nucleosomes and their adjacent nucleosome-free regions show an increased tendency towards conservation of this structural feature.
PMCID: PMC2994855  PMID: 21073701
