1.  Medial HOXA genes demarcate haematopoietic stem cell fate during human development 
Nature cell biology  2016;18(6):595-606.
Pluripotent stem cells (PSC) may provide a potential source of haematopoietic stem/progenitor cells (HSPCs) for transplantation; however, unknown molecular barriers prevent the self-renewal of PSC-HSPCs. Using two-step differentiation, human embryonic stem cells (hESCs) differentiated in vitro into multipotent haematopoietic cells that had CD34+CD38−/loCD90+CD45+GPI-80+ foetal liver (FL) HSC immunophenotype, but displayed poor expansion potential and engraftment ability. Transcriptome analysis of immunophenotypic hESC-HSPCs revealed that, despite their molecular resemblance to FL-HSPCs, medial HOXA genes remained suppressed. Knockdown of HOXA7 disrupted FL-HSPC function and caused transcriptome dysregulation that resembled hESC-derived progenitors. Overexpression of medial HOXA genes prolonged FL-HSPC maintenance but was insufficient to confer self-renewal to hESC-HSPCs. Stimulation of retinoic acid signalling during endothelial-to-haematopoietic transition induced the HOXA cluster and other HSC/definitive haemogenic endothelium genes, and prolonged HSPC maintenance in culture. Thus, retinoic acid signalling-induced medial HOXA gene expression marks the establishment of the definitive HSC fate and controls HSC identity and function.
PMCID: PMC4981340  PMID: 27183470
2.  An expanded evaluation of protein function prediction methods shows an improvement in accuracy 
Jiang, Yuxiang | Oron, Tal Ronnen | Clark, Wyatt T. | Bankapur, Asma R. | D’Andrea, Daniel | Lepore, Rosalba | Funk, Christopher S. | Kahanda, Indika | Verspoor, Karin M. | Ben-Hur, Asa | Koo, Da Chen Emily | Penfold-Brown, Duncan | Shasha, Dennis | Youngs, Noah | Bonneau, Richard | Lin, Alexandra | Sahraeian, Sayed M. E. | Martelli, Pier Luigi | Profiti, Giuseppe | Casadio, Rita | Cao, Renzhi | Zhong, Zhaolong | Cheng, Jianlin | Altenhoff, Adrian | Skunca, Nives | Dessimoz, Christophe | Dogan, Tunca | Hakala, Kai | Kaewphan, Suwisa | Mehryary, Farrokh | Salakoski, Tapio | Ginter, Filip | Fang, Hai | Smithers, Ben | Oates, Matt | Gough, Julian | Törönen, Petri | Koskinen, Patrik | Holm, Liisa | Chen, Ching-Tai | Hsu, Wen-Lian | Bryson, Kevin | Cozzetto, Domenico | Minneci, Federico | Jones, David T. | Chapman, Samuel | BKC, Dukka | Khan, Ishita K. | Kihara, Daisuke | Ofer, Dan | Rappoport, Nadav | Stern, Amos | Cibrian-Uhalte, Elena | Denny, Paul | Foulger, Rebecca E. | Hieta, Reija | Legge, Duncan | Lovering, Ruth C. | Magrane, Michele | Melidoni, Anna N. | Mutowo-Meullenet, Prudence | Pichler, Klemens | Shypitsyna, Aleksandra | Li, Biao | Zakeri, Pooya | ElShal, Sarah | Tranchevent, Léon-Charles | Das, Sayoni | Dawson, Natalie L. | Lee, David | Lees, Jonathan G. | Sillitoe, Ian | Bhat, Prajwal | Nepusz, Tamás | Romero, Alfonso E. | Sasidharan, Rajkumar | Yang, Haixuan | Paccanaro, Alberto | Gillis, Jesse | Sedeño-Cortés, Adriana E. | Pavlidis, Paul | Feng, Shou | Cejuela, Juan M. | Goldberg, Tatyana | Hamp, Tobias | Richter, Lothar | Salamov, Asaf | Gabaldon, Toni | Marcet-Houben, Marina | Supek, Fran | Gong, Qingtian | Ning, Wei | Zhou, Yuanpeng | Tian, Weidong | Falda, Marco | Fontana, Paolo | Lavezzo, Enrico | Toppo, Stefano | Ferrari, Carlo | Giollo, Manuel | Piovesan, Damiano | Tosatto, Silvio C.E. | del Pozo, Angela | Fernández, José M. | Maietta, Paolo | Valencia, Alfonso | Tress, Michael L. 
| Benso, Alfredo | Di Carlo, Stefano | Politano, Gianfranco | Savino, Alessandro | Rehman, Hafeez Ur | Re, Matteo | Mesiti, Marco | Valentini, Giorgio | Bargsten, Joachim W. | van Dijk, Aalt D. J. | Gemovic, Branislava | Glisic, Sanja | Perovic, Vladmir | Veljkovic, Veljko | Veljkovic, Nevena | Almeida-e-Silva, Danillo C. | Vencio, Ricardo Z. N. | Sharan, Malvika | Vogel, Jörg | Kansakar, Lakesh | Zhang, Shanshan | Vucetic, Slobodan | Wang, Zheng | Sternberg, Michael J. E. | Wass, Mark N. | Huntley, Rachael P. | Martin, Maria J. | O’Donovan, Claire | Robinson, Peter N. | Moreau, Yves | Tramontano, Anna | Babbitt, Patricia C. | Brenner, Steven E. | Linial, Michal | Orengo, Christine A. | Rost, Burkhard | Greene, Casey S. | Mooney, Sean D. | Friedberg, Iddo | Radivojac, Predrag
Genome Biology  2016;17(1):184.
A major bottleneck in our understanding of the molecular underpinnings of life is the assignment of function to proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, assessing methods for protein function prediction and tracking progress in the field remain challenging.
We conducted the second critical assessment of functional annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. We evaluated 126 methods from 56 research groups for their ability to predict biological functions using Gene Ontology and gene-disease associations using Human Phenotype Ontology on a set of 3681 proteins from 18 species. CAFA2 featured expanded analysis compared with CAFA1, with regard to data set size, variety, and assessment metrics. To review progress in the field, the analysis compared the best methods from CAFA1 to those of CAFA2.
The top-performing methods in CAFA2 outperformed those from CAFA1. This increased accuracy can be attributed to a combination of the growing number of experimental annotations and improved methods for function prediction. The assessment also revealed that the definition of top-performing algorithms is ontology specific, that different performance metrics can be used to probe the nature of accurate predictions, and the relative diversity of predictions in the biological process and human phenotype ontologies. While there was methodological improvement between CAFA1 and CAFA2, the interpretation of results and usefulness of individual methods remain context-dependent.
Electronic supplementary material
The online version of this article (doi:10.1186/s13059-016-1037-6) contains supplementary material, which is available to authorized users.
PMCID: PMC5015320  PMID: 27604469
Protein function prediction; Disease gene prioritization
3.  Scl binds to primed enhancers in mesoderm to regulate hematopoietic and cardiac fate divergence 
The EMBO Journal  2015;34(6):759-777.
Scl/Tal1 confers hemogenic competence and prevents ectopic cardiomyogenesis in embryonic endothelium by unknown mechanisms. We discovered that Scl binds to hematopoietic and cardiac enhancers that become epigenetically primed in multipotent cardiovascular mesoderm, to regulate the divergence of hematopoietic and cardiac lineages. Scl does not act as a pioneer factor but rather exploits a pre-established epigenetic landscape. As the blood lineage emerges, Scl binding and active epigenetic modifications are sustained in hematopoietic enhancers, whereas cardiac enhancers are decommissioned by removal of active epigenetic marks. Our data suggest that, rather than recruiting corepressors to enhancers, Scl prevents ectopic cardiogenesis by occupying enhancers that cardiac factors, such as Gata4 and Hand1, use for gene activation. Although hematopoietic Gata factors bind with Scl to both activated and repressed genes, they are dispensable for cardiac repression, but necessary for activating genes that enable hematopoietic stem/progenitor cell development. These results suggest that a unique subset of enhancers in lineage-specific genes that are accessible for regulators of opposing fates during the time of the fate decision provide a platform where the divergence of mutually exclusive fates is orchestrated.
PMCID: PMC4369313  PMID: 25564442
cardiac specification; enhancer; hematopoiesis; mesoderm diversification; transcriptional regulation
5.  c-Met-dependent multipotent labyrinth trophoblast progenitors establish placental exchange interface 
Developmental cell  2013;27(4):373-386.
The placenta provides the interface for gas and nutrient exchange between the mother and the fetus. Despite its critical function in sustaining pregnancy, the stem/progenitor cell hierarchy and molecular mechanisms responsible for the development of the placental exchange interface are poorly understood. We identified an Epcamhi labyrinth trophoblast progenitor (LaTP) in mouse placenta that at a clonal level generates all labyrinth trophoblast subtypes, syncytiotrophoblasts I and II and sinusoidal trophoblast giant cells. Moreover, we discovered that Hgf/c-Met signaling is required for sustaining proliferation of LaTP during midgestation. Loss of trophoblast c-Met also disrupted terminal differentiation and polarization of syncytiotrophoblasts, leading to intrauterine fetal growth restriction, fetal liver hypocellularity and demise. Identification of this c-Met-dependent multipotent labyrinth trophoblast progenitor provides a landmark in the poorly defined placental stem/progenitor cell hierarchy and may help understand pregnancy complications caused by a defective placental exchange.
PMCID: PMC3950757  PMID: 24286824
6.  Hemogenic endocardium contributes to transient definitive hematopoiesis 
Nature communications  2013;4:1564.
Hematopoietic cells arise from spatiotemporally restricted domains in the developing embryo. Although studies of non-mammalian animal and in vitro embryonic stem cell models suggest a close relationship among cardiac, endocardial, and hematopoietic lineages, it remains unknown whether the mammalian heart tube serves as a hemogenic organ akin to the dorsal aorta. Here we examine the hemogenic activity of the developing endocardium. Mouse heart explants generate myeloid and erythroid colonies in the absence of circulation. Hemogenic activity arises from a subset of endocardial cells in the outflow cushion and atria earlier than in the aorta-gonad-mesonephros region, and is transient and definitive in nature. Interestingly, key cardiac transcription factors, Nkx2-5 and Isl1, are expressed in and required for the hemogenic population of the endocardium. Together, these data suggest that a subset of endocardial/endothelial cells expressing cardiac markers serve as a de novo source for transient definitive hematopoietic progenitors.
PMCID: PMC3612528  PMID: 23463007
7.  Scl represses cardiomyogenesis in prospective hemogenic endothelium and endocardium 
Cell  2012;150(3):590-605.
Endothelium in embryonic hematopoietic tissues generates hematopoietic stem/progenitor cells; however, it is unknown how its unique potential is specified. We show that transcription factor Scl/Tal1 is essential for both establishing the hematopoietic transcriptional program in hemogenic endothelium and preventing its misspecification to a cardiomyogenic fate. Scl−/− embryos activated a cardiac transcriptional program in yolk sac endothelium, leading to the emergence of CD31+Pdgfrα+ cardiogenic precursors that generated spontaneously beating cardiomyocytes. Ectopic cardiogenesis was also observed in Scl−/− hearts, where the disorganized endocardium precociously differentiated into cardiomyocytes. Induction of mosaic deletion of Scl in Sclfl/fl Rosa26Cre-ERT2 embryos revealed a cell-intrinsic, temporal requirement for Scl to prevent cardiomyogenesis from endothelium. Scl−/− endothelium also upregulated the expression of Wnt antagonists, which promoted rapid cardiomyocyte differentiation of ectopic cardiogenic cells. These results reveal unexpected plasticity in embryonic endothelium such that loss of a single master regulator can induce ectopic cardiomyogenesis from endothelial cells.
PMCID: PMC3624753  PMID: 22863011
8.  Lymphoid Priming in Human Bone Marrow Begins Prior to CD10 Expression with Up-Regulation of L-selectin 
Nature immunology  2012;13(10):963-971.
The expression of CD10 has long been used to define human lymphoid commitment. We report a unique lymphoid-primed population in human bone marrow that was generated from hematopoietic stem cells (HSCs) before the onset of CD10 expression and B cell commitment. This subset was identified by high expression of the homing molecule L-selectin (CD62L). CD10−CD62Lhi progenitors possessed full lymphoid and monocytic potential, but lacked erythroid potential. Gene expression profiling placed the CD10−CD62Lhi population at an intermediate stage of differentiation between HSCs and lineage-negative (Lin−) CD34+CD10+ progenitors. L-selectin was expressed on immature thymocytes and its ligands were expressed at the cortico-medullary junction, suggesting a possible role in thymic homing. These studies identify the earliest stage of lymphoid priming in human bone marrow.
PMCID: PMC3448017  PMID: 22941246
9.  Expansion on Stromal Cells Preserves the Undifferentiated State of Human Hematopoietic Stem Cells Despite Compromised Reconstitution Ability 
PLoS ONE  2013;8(1):e53912.
Lack of HLA-matched hematopoietic stem cells (HSC) limits the number of patients with life-threatening blood disorders that can be treated by HSC transplantation. So far, insufficient understanding of the regulatory mechanisms governing human HSC has precluded the development of effective protocols for culturing HSC for therapeutic use and molecular studies. We defined a culture system using OP9M2 mesenchymal stem cell (MSC) stroma that protects human hematopoietic stem/progenitor cells (HSPC) from differentiation and apoptosis. In addition, it facilitates a dramatic expansion of multipotent progenitors that retain the immunophenotype (CD34+CD38−CD90+) characteristic of human HSPC and proliferative potential over several weeks in culture. In contrast, transplantable HSC could be maintained, but not significantly expanded, during 2-week culture. Temporal analysis of the transcriptome of the ex vivo expanded CD34+CD38−CD90+ cells documented remarkable stability of most transcriptional regulators known to govern the undifferentiated HSC state. Nevertheless, it revealed dynamic fluctuations in transcriptional programs that associate with HSC behavior and may compromise HSC function, such as dysregulation of PBX1 regulated genetic networks. This culture system serves now as a platform for modeling human multilineage hematopoietic stem/progenitor cell hierarchy and studying the complex regulation of HSC identity and function required for successful ex vivo expansion of transplantable HSC.
PMCID: PMC3547050  PMID: 23342037
10.  GFam: a platform for automatic annotation of gene families 
Nucleic Acids Research  2012;40(19):e152.
We have developed GFam, a platform for automatic annotation of gene/protein families. GFam provides a framework for genome initiatives and model organism resources to build domain-based families and derive meaningful functional labels, and offers a seamless approach to propagate functional annotation across periodic genome updates. GFam is a hybrid approach that uses a greedy algorithm to chain component domains from InterPro annotation provided by its 12 member resources, followed by a sequence-based connected component analysis of un-annotated sequence regions, to derive a consensus domain architecture for each sequence and subsequently generate families based on common architectures. Our integrated approach increases sequence coverage by 7.2 percentage points and residue coverage by 14.6 percentage points relative to the best single-constituent database within InterPro for the proteome of Arabidopsis. The true power of GFam lies in maximizing annotation provided by the different InterPro data sources that offer resource-specific coverage for different regions of a sequence. GFam’s capability to capture higher sequence and residue coverage can be useful for genome annotation, comparative genomics and functional studies. GFam is general-purpose software and can be used for any collection of protein sequences. The software is open source and can be obtained from
PMCID: PMC3479161  PMID: 22790981
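As a rough illustration of the greedy domain-chaining idea and the residue coverage measure mentioned in the GFam abstract above, the sketch below keeps non-overlapping domain hits for one sequence, longest first, and reports the fraction of residues they cover. The `Hit` tuple, the longest-first ordering and the example coordinates are assumptions made for illustration; they are not taken from the GFam software.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """A hypothetical domain hit on one sequence (1-based, inclusive coordinates)."""
    name: str
    start: int
    end: int


def greedy_chain(hits: List[Hit]) -> List[Hit]:
    """Greedily keep the longest hits that do not overlap any hit already kept."""
    chosen: List[Hit] = []
    for hit in sorted(hits, key=lambda h: h.end - h.start, reverse=True):
        # A hit is compatible if it lies entirely before or after every kept hit.
        if all(hit.end < c.start or hit.start > c.end for c in chosen):
            chosen.append(hit)
    return sorted(chosen, key=lambda h: h.start)


def residue_coverage(chain: List[Hit], seq_len: int) -> float:
    """Fraction of residues covered by a non-overlapping chain of hits."""
    covered = sum(h.end - h.start + 1 for h in chain)
    return covered / seq_len


# Made-up example: hit B overlaps the longer hit A and is therefore dropped.
hits = [Hit("A", 1, 100), Hit("B", 80, 120), Hit("C", 130, 200)]
chain = greedy_chain(hits)
print([h.name for h in chain], residue_coverage(chain, 250))
```

Longest-first is only one possible greedy criterion; a score- or E-value-based ordering would fit the same loop.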
11.  The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools 
Nucleic Acids Research  2011;40(Database issue):D1202-D1210.
The Arabidopsis Information Resource (TAIR) is a genome database for Arabidopsis thaliana, an important reference organism for many fundamental aspects of biology as well as basic and applied plant biology research. TAIR serves as a central access point for Arabidopsis data, annotates gene function and expression patterns using controlled vocabulary terms, and maintains and updates the A. thaliana genome assembly and annotation. TAIR also provides researchers with an extensive set of visualization and analysis tools. Recent developments include several new genome releases (TAIR8, TAIR9 and TAIR10) in which the A. thaliana assembly was updated, pseudogenes and transposon genes were re-annotated, and new data from proteomics and next-generation transcriptome sequencing were incorporated into gene models and splice variants. Other highlights include progress on functional annotation of the genome and the release of several new tools including Textpresso for Arabidopsis, which provides the capability to carry out full text searches on a large body of research literature.
PMCID: PMC3245047  PMID: 22140109
12.  Integrative Analysis of the Caenorhabditis elegans Genome by the modENCODE Project 
Gerstein, Mark B. | Lu, Zhi John | Van Nostrand, Eric L. | Cheng, Chao | Arshinoff, Bradley I. | Liu, Tao | Yip, Kevin Y. | Robilotto, Rebecca | Rechtsteiner, Andreas | Ikegami, Kohta | Alves, Pedro | Chateigner, Aurelien | Perry, Marc | Morris, Mitzi | Auerbach, Raymond K. | Feng, Xin | Leng, Jing | Vielle, Anne | Niu, Wei | Rhrissorrakrai, Kahn | Agarwal, Ashish | Alexander, Roger P. | Barber, Galt | Brdlik, Cathleen M. | Brennan, Jennifer | Brouillet, Jeremy Jean | Carr, Adrian | Cheung, Ming-Sin | Clawson, Hiram | Contrino, Sergio | Dannenberg, Luke O. | Dernburg, Abby F. | Desai, Arshad | Dick, Lindsay | Dosé, Andréa C. | Du, Jiang | Egelhofer, Thea | Ercan, Sevinc | Euskirchen, Ghia | Ewing, Brent | Feingold, Elise A. | Gassmann, Reto | Good, Peter J. | Green, Phil | Gullier, Francois | Gutwein, Michelle | Guyer, Mark S. | Habegger, Lukas | Han, Ting | Henikoff, Jorja G. | Henz, Stefan R. | Hinrichs, Angie | Holster, Heather | Hyman, Tony | Iniguez, A. Leo | Janette, Judith | Jensen, Morten | Kato, Masaomi | Kent, W. James | Kephart, Ellen | Khivansara, Vishal | Khurana, Ekta | Kim, John K. | Kolasinska-Zwierz, Paulina | Lai, Eric C. | Latorre, Isabel | Leahey, Amber | Lewis, Suzanna | Lloyd, Paul | Lochovsky, Lucas | Lowdon, Rebecca F. | Lubling, Yaniv | Lyne, Rachel | MacCoss, Michael | Mackowiak, Sebastian D. | Mangone, Marco | McKay, Sheldon | Mecenas, Desirea | Merrihew, Gennifer | Miller, David M. | Muroyama, Andrew | Murray, John I. | Ooi, Siew-Loon | Pham, Hoang | Phippen, Taryn | Preston, Elicia A. | Rajewsky, Nikolaus | Rätsch, Gunnar | Rosenbaum, Heidi | Rozowsky, Joel | Rutherford, Kim | Ruzanov, Peter | Sarov, Mihail | Sasidharan, Rajkumar | Sboner, Andrea | Scheid, Paul | Segal, Eran | Shin, Hyunjin | Shou, Chong | Slack, Frank J. | Slightam, Cindie | Smith, Richard | Spencer, William C. | Stinson, E. O. | Taing, Scott | Takasaki, Teruaki | Vafeados, Dionne | Voronina, Ksenia | Wang, Guilin | Washington, Nicole L. | Whittle, Christina M. 
| Wu, Beijing | Yan, Koon-Kiu | Zeller, Georg | Zha, Zheng | Zhong, Mei | Zhou, Xingliang | Ahringer, Julie | Strome, Susan | Gunsalus, Kristin C. | Micklem, Gos | Liu, X. Shirley | Reinke, Valerie | Kim, Stuart K. | Hillier, LaDeana W. | Henikoff, Steven | Piano, Fabio | Snyder, Michael | Stein, Lincoln | Lieb, Jason D. | Waterston, Robert H.
Science (New York, N.Y.)  2010;330(6012):1775-1787.
We systematically generated large-scale data sets to improve genome annotation for the nematode Caenorhabditis elegans, a key model organism. These data sets include transcriptome profiling across a developmental time course, genome-wide identification of transcription factor–binding sites, and maps of chromatin organization. From this, we created more complete and accurate gene models, including alternative splice forms and candidate noncoding RNAs. We constructed hierarchical networks of transcription factor–binding and microRNA interactions and discovered chromosomal locations bound by an unusually large number of transcription factors. Different patterns of chromatin composition and histone modification were revealed between chromosome arms and centers, with similarly prominent differences between autosomes and the X chromosome. Integrating data types, we built statistical models relating chromatin, transcription factor binding, and gene expression. Overall, our analyses ascribed putative functions to most of the conserved genome.
PMCID: PMC3142569  PMID: 21177976
13.  Comparison and calibration of transcriptome data from RNA-Seq and tiling arrays 
BMC Genomics  2010;11:383.
Tiling arrays have been the tool of choice for probing an organism's transcriptome without prior assumptions about the transcribed regions, but RNA-Seq is becoming a viable alternative as the costs of sequencing continue to decrease. Understanding the relative merits of these technologies will help researchers select the appropriate technology for their needs.
Here, we compare these two platforms using a matched sample of poly(A)-enriched RNA isolated from the second larval stage of C. elegans. We find that the raw signals from these two technologies are reasonably well correlated but that RNA-Seq outperforms tiling arrays in several respects, notably in exon boundary detection and dynamic range of expression. By exploring the accuracy of sequencing as a function of depth of coverage, we found that about 4 million reads are required to match the sensitivity of two tiling array replicates. The effects of cross-hybridization were analyzed using a "nearest neighbor" classifier applied to array probes; we describe a method for determining potential "black list" regions whose signals are unreliable. Finally, we propose a strategy for using RNA-Seq data as a gold standard set to calibrate tiling array data. All tiling array and RNA-Seq data sets have been submitted to the modENCODE Data Coordinating Center.
Tiling arrays effectively detect transcript expression levels at a low cost for many species while RNA-Seq provides greater accuracy in several regards. Researchers will need to carefully select the technology appropriate to the biological investigations they are undertaking. It will also be important to reconsider a comparison such as ours as sequencing technologies continue to evolve.
PMCID: PMC3091629  PMID: 20565764
14.  SCPS: a fast implementation of a spectral method for detecting protein families on a genome-wide scale 
BMC Bioinformatics  2010;11:120.
An important problem in genomics is the automatic inference of groups of homologous proteins from pairwise sequence similarities. Several approaches have been proposed for this task which are "local" in the sense that they assign a protein to a cluster based only on the distances between that protein and the other proteins in the set. It was shown recently that global methods such as spectral clustering have better performance on a wide variety of datasets. However, currently available implementations of spectral clustering methods mostly consist of a few loosely coupled Matlab scripts that assume a fair amount of familiarity with Matlab programming and hence they are inaccessible for large parts of the research community.
SCPS (Spectral Clustering of Protein Sequences) is an efficient and user-friendly implementation of a spectral method for inferring protein families. The method uses only pairwise sequence similarities, and is therefore practical when only sequence information is available. SCPS was tested on difficult sets of proteins whose relationships were extracted from the SCOP database, and its results were extensively compared with those obtained using other popular protein clustering algorithms such as TribeMCL, hierarchical clustering and connected component analysis. We show that SCPS is able to identify many of the family/superfamily relationships correctly and that the quality of the obtained clusters as indicated by their F-scores is consistently better than all the other methods we compared it with. We also demonstrate the scalability of SCPS by clustering the entire SCOP database (14,183 sequences) and the complete genome of the yeast Saccharomyces cerevisiae (6,690 sequences).
Besides the spectral method, SCPS also implements connected component analysis and hierarchical clustering, it integrates TribeMCL, it provides different cluster quality tools, it can extract human-readable protein descriptions using GI numbers from NCBI, it interfaces with external tools such as BLAST and Cytoscape, and it can produce publication-quality graphical representations of the clusters obtained, thus constituting a comprehensive and effective tool for practical research in computational biology. Source code and precompiled executables for Windows, Linux and Mac OS X are freely available at
PMCID: PMC2841596  PMID: 20214776
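The SCPS abstract above lists connected component analysis among the clustering methods that work from pairwise sequence similarities alone. A minimal sketch of that baseline (not of the spectral method itself) is given below, using a union-find structure; the function name, the similarity threshold and the toy scores are assumptions for illustration and do not reflect the SCPS implementation.

```python
from typing import Dict, Iterable, List, Set, Tuple


def connected_components(
    proteins: Iterable[str],
    similarities: Iterable[Tuple[str, str, float]],
    threshold: float,
) -> List[Set[str]]:
    """Group proteins linked by a pairwise similarity >= threshold (union-find)."""
    parent: Dict[str, str] = {p: p for p in proteins}

    def find(x: str) -> str:
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in similarities:
        if score >= threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    groups: Dict[str, Set[str]] = {}
    for p in parent:
        groups.setdefault(find(p), set()).add(p)
    return list(groups.values())


# Toy data: the weak p2-p3 link falls below the threshold, leaving two clusters.
proteins = ["p1", "p2", "p3", "p4"]
sims = [("p1", "p2", 0.9), ("p2", "p3", 0.2), ("p3", "p4", 0.8)]
clusters = connected_components(proteins, sims, threshold=0.5)
print(sorted(sorted(c) for c in clusters))
```

Because a single above-threshold link is enough to merge two groups, this approach is sensitive to the choice of threshold.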
15.  An approach to compare genome tiling microarray and MPSS sequencing data for transcript mapping 
BMC Research Notes  2009;2:211.
We are correcting the abstract of our published article ([1]). The sentence that starts "We observe that 4.5% of MPSS tags...." was not scientifically complete in the original abstract, having only two of the four numbers required to describe a comparison of two technologies in two different organisms. The abstract below more accurately describes our findings, as documented in Figure 1 of the manuscript.
PMCID: PMC2770075
16.  An approach to comparing tiling array and high throughput sequencing technologies for genomic transcript mapping 
BMC Research Notes  2009;2:150.
There are two main technologies for transcriptome profiling, namely, tiling microarrays and high-throughput sequencing. Recently there has been a tremendous amount of excitement about the latter because of the advent of next-generation sequencing technologies and their promise. Consequently, the question of the moment is how these two technologies compare. Here we attempt to develop an approach to do a fair comparison of transcripts identified from tiling microarray and MPSS sequencing data.
This comparison is a challenging task because the sequencing data is discrete while the tiling array data is continuous. We use the published rice and Arabidopsis datasets, which provide the currently best-matched sets of arrays and sequencing experiments using a slightly earlier generation of sequencing, the MPSS tag sequencing technology. After scoring the arrays consistently in both organisms, a first-pass comparison reveals a surprisingly small overlap in transcripts of 22% and 66% in rice and Arabidopsis, respectively. However, when we do the analysis in detail, we find that this is an underestimate. In particular, when we map the probe intensities onto the sequencing tags and then look at their intensity distribution, we see that they are very similar to exons. Furthermore, restricting our comparison to only protein-coding gene loci revealed a very good overlap between the two technologies.
Our approach to compare genome tiling microarray and MPSS sequencing data suggests that there is actually a reasonable overlap in transcripts identified by the two technologies. This overlap is distorted by the scoring and thresholding in the tiling array scoring procedure.
PMCID: PMC2764720  PMID: 19630981
17.  Domain Insertions in Protein Structures 
Journal of molecular biology  2004;338(4):633-641.
Domains are the structural, functional or evolutionary units of proteins. Proteins can comprise a single domain or a combination of domains. In multi-domain proteins, the domains almost always occur end-to-end, i.e., one domain follows the C-terminal end of another domain. However, there are exceptions to this common pattern, where multi-domain proteins are formed by insertion of one domain (insert) into another domain (parent). Here, we provide a quantitative description of known insertions in the Protein Data Bank (PDB). We found that 9% of domain combinations observed in non-redundant PDB are insertions. Although 90% of all insertions involve only one insert, proteins can clearly have multiple (nested, two-domain and three-domain) inserts. We also observed correlations between the structure and function of a domain and its tendency to be found as a parent or an insert. There is a bias in insert position towards the C terminus of parents. We observed that the atomic distance between the N and C terminus of an insert is significantly smaller when compared to the N-to-C distance in a parent context or a single domain context. Insertions are found always to occur in loop regions of parent domains. Our observations regarding the relationship between domain insertions and the structure, function and evolution of proteins have implications for protein engineering.
PMCID: PMC2665287  PMID: 15099733
domain insertion; inserted domain; discontinuous domains; non-contiguous domains; protein engineering
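The distinction drawn in the abstract above between end-to-end domain arrangements and insertions can be expressed as a simple test on residue ranges: an insert lies in the gap between two segments of a discontinuous parent domain. The sketch below assumes each domain is given as a list of inclusive (start, end) residue segments; this representation, the function name and the example coordinates are illustrative assumptions, not the procedure used in the paper.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start, end) residue numbers, inclusive


def is_insertion(parent: List[Segment], insert: List[Segment]) -> bool:
    """True if the insert lies entirely in a gap between two consecutive parent segments."""
    ordered = sorted(parent)
    ins_start = min(start for start, _ in insert)
    ins_end = max(end for _, end in insert)
    # Examine each gap between neighbouring parent segments.
    for (_, left_end), (right_start, _) in zip(ordered, ordered[1:]):
        if left_end < ins_start and ins_end < right_start:
            return True
    return False


# Discontinuous parent (residues 1-50 and 120-200) with an insert at 55-115.
print(is_insertion([(1, 50), (120, 200)], [(55, 115)]))
# End-to-end arrangement: the second domain simply follows the first.
print(is_insertion([(1, 50)], [(51, 120)]))
```

A contiguous parent has no internal gap, so any domain that follows it is reported as an end-to-end arrangement rather than an insertion.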
18.  Transmembrane Protein Oxygen Content and Compartmentalization of Cells 
PLoS ONE  2008;3(7):e2726.
A recent report explored the oxygen content of transmembrane proteins over macroevolutionary time scales, in which the authors observed a correlation between the geological time of appearance of compartmentalized cells and atmospheric oxygen concentration. The authors predicted, characterized and correlated the differences in the structure and composition of transmembrane proteins from the three kingdoms of life with atmospheric oxygen concentrations on a geological timescale. They hypothesized that transmembrane proteins in ancient taxa were selectively excluding oxygen and that, as this constraint relaxed over time with the increase in the levels of atmospheric oxygen, the size and number of communication-related transmembrane proteins increased. In summary, they concluded that compartmentalized and non-compartmentalized cells can be distinguished by how oxygen is partitioned at the proteome level. They derived this conclusion from an analysis of 19 taxa. We extended their analysis to a larger sample of taxa comprising 309 eubacterial, 34 archaeal, and 30 eukaryotic complete proteomes and observed that one cannot absolutely separate the two groups of cells based on the partition of oxygen in their membrane proteins. In addition, the origin of compartmentalized cells is likely to have been driven by an innovation that happened 2700 million years ago in the membrane composition of cells, which led to the evolution of endocytosis and exocytosis, rather than by the rise in concentration of atmospheric oxygen.
PMCID: PMC2443287  PMID: 18628944
20.  Global Identification and Characterization of Transcriptionally Active Regions in the Rice Genome 
PLoS ONE  2007;2(3):e294.
Genome tiling microarray studies have consistently documented rich transcriptional activity beyond the annotated genes. However, systematic characterization and transcriptional profiling of the putative novel transcripts on the genome scale are still lacking. We report here the identification of 25,352 and 27,744 transcriptionally active regions (TARs) not encoded by annotated exons in the rice (Oryza sativa) subspecies japonica and indica, respectively. The non-exonic TARs account for approximately two thirds of the total TARs detected by tiling arrays and represent transcripts likely conserved between japonica and indica. Transcription of 21,018 (83%) japonica non-exonic TARs was verified through expression profiling in 10 tissue types using a re-array in which annotated genes and TARs were each represented by five independent probes. Subsequent analyses indicate that about 80% of the japonica TARs that were not assigned to annotated exons can be assigned to various putatively functional or structural elements of the rice genome, including splice variants, uncharacterized portions of incompletely annotated genes, antisense transcripts, duplicated gene fragments, and potential non-coding RNAs. These results provide a systematic characterization of non-exonic transcripts in rice and thus expand the current view of the complexity and dynamics of the rice transcriptome.
PMCID: PMC1808428  PMID: 17372628
21.  DomIns: a web resource for domain insertions in known protein structures 
Nucleic Acids Research  2004;32(Database issue):D193-D195.
Proteins can be formed by single or multiple domains. The process of recombination at the molecular level has generated a wide variety of multi-domain proteins with specific domain organization to cater to the functional requirements of an organism. The functional and structural costs of inserting a domain into another means that multi-domain proteins are usually formed by covalently linking the N-terminus of one domain to the C-terminus of the preceding domain. While this is true in a large proportion of multi-domain proteins, we find a significant fraction of proteins that are the result of domain insertion. The inserted domain breaks the sequence contiguity of the domain into which it is inserted leading to a novel domain organization. This web resource aims to document domain insertions in known protein structures that are classified in the SCOP database. The web server can be accessed from
PMCID: PMC308781  PMID: 14681392

