1.  Pancreatic Inflammation Redirects Acinar to Beta Cell Reprogramming 
Cell reports  2016;17(8):2028-2041.
Using a transgenic mouse model to express MafA, Pdx1, and Neurog3 (3TF) in a pancreatic acinar cell- and doxycycline-dependent manner, we discovered that the outcome of transcription factor-mediated acinar to β-like cellular reprogramming is dependent on both the magnitude of 3TF expression and on reprogramming-induced inflammation. Overly robust 3TF expression causes acinar cell necrosis resulting in marked inflammation and acinar-to-ductal metaplasia. Generation of new β-like cells requires limiting reprogramming-induced inflammation, either by reducing 3TF expression or by eliminating macrophages. The new β-like cells were able to reverse streptozotocin-induced diabetes 6 days after inducing 3TF expression but failed to sustain their function after removal of the reprogramming factors.
PMCID: PMC5131369  PMID: 27851966
2.  OBIB-a novel ontology for biobanking 
Biobanking necessitates extensive integration of data to allow data analysis and specimen sharing. Ontologies have been demonstrated to be a promising approach in fostering better semantic integration of biobank-related data. Hitherto no ontology provided the coverage needed to capture a broad spectrum of biobank user scenarios.
Based on the principles laid out by the Open Biological and Biomedical Ontologies Foundry, two biobanking ontologies have been developed. These two ontologies were merged using a modular approach consistent with the initial development principles. The merging was facilitated by the fact that both ontologies use the same Upper Ontology and re-use classes from a similar set of pre-existing ontologies.
Based on the two previous ontologies, the Ontology for Biobanking was created. Because the two source ontologies did not overlap, the coverage of the resulting ontology is significantly larger than that of either source ontology. The ontology is successfully used in managing biobank information of the Penn Medicine BioBank.
Sharing development principles and Upper Ontologies facilitates subsequent merging of ontologies to achieve a broader coverage.
PMCID: PMC4855778  PMID: 27148435
Ontologies; Biobanking; Biorepository; Terminology
3.  The Ontology for Biomedical Investigations 
PLoS ONE  2016;11(4):e0154556.
The Ontology for Biomedical Investigations (OBI) is an ontology that provides terms with precisely defined meanings to describe all aspects of how investigations in the biological and medical domains are conducted. OBI re-uses ontologies that provide a representation of biomedical knowledge from the Open Biological and Biomedical Ontologies (OBO) project and adds the ability to describe how this knowledge was derived. Here we describe the state of OBI and several applications that are using it, such as adding semantic expressivity to existing databases, building data entry forms, and enabling interoperability between knowledge resources. OBI covers all phases of the investigation process, such as planning, execution and reporting. It represents information and material entities that participate in these processes, as well as roles and functions. Prior to OBI, it was not possible to use a single internally consistent resource that could be applied to multiple types of experiments for these applications. OBI has made this possible by creating terms for entities involved in biological and medical investigations and by importing parts of other biomedical ontologies such as GO, Chemical Entities of Biological Interest (ChEBI) and Phenotype Attribute and Trait Ontology (PATO) without altering their meaning. OBI is being used in a wide range of projects covering genomics, multi-omics, immunology, and catalogs of services. OBI has also spawned other ontologies (Information Artifact Ontology) and methods for importing parts of ontologies (Minimum information to reference an external ontology term (MIREOT)). The OBI project is an open cross-disciplinary collaborative effort, encompassing multiple research communities from around the globe. To date, OBI has created 2366 classes and 40 relations along with textual and formal definitions. The OBI Consortium maintains a web resource providing details on the people, policies, and issues being addressed in association with OBI.
The current release of OBI is available at
PMCID: PMC4851331  PMID: 27128319
4.  Integrated Regional Cardiac Hemodynamic Imaging and RNA Sequencing Reveal Corresponding Heterogeneity of Ventricular Wall Shear Stress and Endocardial Transcriptome 
Unlike arteries, in which regionally distinct hemodynamics are associated with phenotypic heterogeneity, the relationships between endocardial endothelial cell phenotype and intraventricular flow remain largely unexplored. We investigated regional differences in left ventricular wall shear stress and their association with endocardial endothelial cell gene expression.
Methods and Results
Local wall shear stress was calculated from 4‐dimensional flow magnetic resonance imaging in 3 distinct regions of human (n=8) and pig (n=5) left ventricle: base, adjacent to the outflow tract; midventricle; and apex. In both species, wall shear stress values were significantly lower in the apex and midventricle relative to the base; oscillatory shear index was elevated in the apex. RNA sequencing of the endocardial endothelial cell transcriptome in pig left ventricle (n=8) at a false discovery rate ≤10% identified 1051 genes differentially expressed between the base and the apex and 327 between the base and the midventricle; no differentially expressed genes were detected at this false discovery rate between the apex and the midventricle. Enrichment analyses identified apical upregulation of genes associated with translation initiation including mammalian target of rapamycin, and eukaryotic initiation factor 2 signaling. Genes of mitochondrial dysfunction and oxidative phosphorylation were also consistently upregulated in the left ventricular apex, as were tissue factor pathway inhibitor (mean 50‐fold) and prostacyclin synthase (5‐fold)—genes prominently associated with antithrombotic protection.
We report the first spatiotemporal measurements of wall shear stress within the left ventricle and linked regional hemodynamics to heterogeneity in ventricular endothelial gene expression, most notably to translation initiation and anticoagulation properties in the left ventricular apex, in which oscillatory shear index is increased and wall shear stress is decreased.
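The abstract above does not state how oscillatory shear index (OSI) was computed. As a minimal sketch, assuming the commonly used definition OSI = 0.5 × (1 − |∫τ dt| / ∫|τ| dt) for a wall shear stress vector τ sampled at uniform time steps, the calculation might look like the following. The function name, the rectangle-rule integration, and the sample values are illustrative assumptions, not the study's processing pipeline.

```python
import math


def oscillatory_shear_index(wss_vectors, dt=1.0):
    """Return OSI for uniformly sampled wall shear stress (WSS) vectors.

    Assumed definition: OSI = 0.5 * (1 - |integral of tau| / integral of |tau|).
    OSI is 0 for purely unidirectional shear and 0.5 for fully reversing shear.
    """
    n_dims = len(wss_vectors[0])
    # Magnitude of the time-integrated WSS vector (rectangle rule).
    integrated = [sum(v[i] for v in wss_vectors) * dt for i in range(n_dims)]
    mag_of_integral = math.sqrt(sum(c * c for c in integrated))
    # Time integral of the WSS magnitude.
    integral_of_mag = sum(math.sqrt(sum(c * c for c in v)) for v in wss_vectors) * dt
    if integral_of_mag == 0.0:
        return 0.0  # no shear at all; treat as non-oscillatory
    return 0.5 * (1.0 - mag_of_integral / integral_of_mag)


# Hypothetical samples: shear that reverses direction for one of four time steps.
samples = [(1.0, 0.0), (1.0, 0.0), (1.0, 0.0), (-1.0, 0.0)]
print(oscillatory_shear_index(samples))
```

Under this definition, a region with low time-averaged shear and frequent direction reversal, as reported for the apex, would score closer to 0.5 than a region with steady unidirectional shear.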
PMCID: PMC4859290  PMID: 27091183
4‐dimensional flow magnetic resonance imaging; endocardium; gene expression; hemodynamics; ventricle; Vascular Biology; Magnetic Resonance Imaging (MRI); Functional Genomics
5.  A Framework for Global Collaborative Data Management for Malaria Research 
Data generated during the course of research activities carried out by the International Centers of Excellence for Malaria Research (ICEMR) is heterogeneous, large, and multi-scaled. The complexity of federated and global data operations and the diverse uses planned for the data pose tremendous challenges and opportunities for collaborative research. In this article, we present the foundational principles for data management across the ICEMR Program, the logistics associated with multiple aspects of the data life cycle, and describe a pilot centralized web information system created in PlasmoDB to query a subset of this data. The paradigm proposed as a solution for the data operations in the ICEMR Program is widely applicable to large, multifaceted research projects, and could be reproduced in other contexts that require sophisticated data management.
PMCID: PMC4574270  PMID: 26259944
6.  Emerging topic: flow-related epigenetic regulation of endothelial phenotype through DNA methylation 
Vascular pharmacology  2014;62(2):88-93.
Atherosclerosis is a multi-focal disease; it is associated with arterial curvatures, asymmetries and branches/bifurcations where non-uniform arterial geometry generates patterns of blood flow that are considerably more complex than elsewhere, and are collectively referred to as disturbed flow. Such regions are predisposed to atherosclerosis and are the sites of ‘athero-susceptible’ endothelial cells that express regionally different cell phenotypes than endothelium in nearby athero-protected locations. The regulatory hierarchy of endothelial function includes control at the epigenetic level. MicroRNAs and histone modifications are established epigenetic regulators that respond to disturbed flow. However, very recent reports have linked transcriptional regulation by DNA methylation to endothelial gene expression in disturbed flow in vivo and in vitro. We outline these in the context of site-specific atherosusceptibility mediated by local hemodynamics.
PMCID: PMC4116435  PMID: 24874278
Endothelial gene expression; Hemodynamic disturbed flow; Differential methylation region; Methylome; Atherosclerosis; KLF4; HOX genes
7.  Arterial endothelial methylome: differential DNA methylation in athero-susceptible disturbed flow regions in vivo 
BMC Genomics  2015;16(1):506.
Atherosclerosis is a heterogeneously distributed disease of arteries in which the endothelium plays an important central role. Spatial transcriptome profiling of endothelium in pre-lesional arteries has demonstrated differential phenotypes primed for athero-susceptibility at hemodynamic sites associated with disturbed blood flow. DNA methylation is a powerful epigenetic regulator of endothelial transcription recently associated with flow characteristics. We investigated differential DNA methylation in flow region-specific aortic endothelial cells in vivo in adult domestic male and female swine.
Genome-wide DNA methylation was profiled in endothelial cells (EC) isolated from two robust locations of differing patho-susceptibility: an athero-susceptible site located at the inner curvature of the aortic arch (AA) and an athero-protected region in the descending thoracic (DT) aorta. Complete methylated DNA immunoprecipitation sequencing (MeDIP-seq) identified over 5500 endothelial differentially methylated regions (DMRs). DMR density was significantly enriched in exons and 5’UTR sequences of annotated genes, 60 of which are linked to cardiovascular disease. The set of DMR-associated genes was enriched in transcriptional regulation, pattern specification HOX loci, oxidative stress and the ER stress adaptive pathway, all categories linked to athero-susceptible endothelium. Examination of the relationship between DMR and mRNA in HOXA genes demonstrated a significant inverse relationship between CpG island promoter methylation and gene expression. Methylation-specific PCR (MSP) confirmed differential CpG methylation of HOXA genes, the ER stress gene ATF4, inflammatory regulator microRNA-10a and ARHGAP25 that encodes a negative regulator of Rho GTPases involved in cytoskeleton remodeling. Gender-specific DMRs associated with ciliogenesis that may be linked to defects in cilia development were also identified in AA DMRs.
An endothelial methylome analysis identifies epigenetic DMR characteristics associated with transcriptional regulation in regions of atherosusceptibility in swine aorta in vivo. The data represent the first methylome blueprint for spatio-temporal analyses of lesion susceptibility predisposing to endothelial dysfunction in complex flow environments in vivo.
Electronic supplementary material
The online version of this article (doi:10.1186/s12864-015-1656-4) contains supplementary material, which is available to authorized users.
PMCID: PMC4492093  PMID: 26148682
Endothelium; DNA Methylation; Epigenetics; Hemodynamics; Disturbed Flow; HOX Genes; Atherosclerosis; Endothelial Gene Transcription
8.  Ontodog: a web-based ontology community view generation tool 
Bioinformatics  2014;30(9):1340-1342.
Summary: Biomedical ontologies are often very large and complex. Only a subset of the ontology may be needed for a specified application or community. For ontology end users, it is desirable to have community-based labels rather than the labels generated by ontology developers. Ontodog is a web-based system that can generate an ontology subset based on Excel input, and support generation of an ontology community view, which is defined as the whole or a subset of the source ontology with user-specified annotations including user-preferred labels. Ontodog allows users to easily generate community views with minimal ontology knowledge and no programming skills or installation required. Currently >100 ontologies including all OBO Foundry ontologies are available to generate the views based on user needs. We demonstrate the application of Ontodog for the generation of community views using the Ontology for Biomedical Investigations as the source ontology.
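The idea of a community view (a subset of a source ontology carrying user-preferred labels) can be sketched in a few lines. This is a toy illustration, not Ontodog's implementation: the `community_view` function, the `EX:` identifiers, the plain-dictionary input (in place of the Excel input the abstract mentions), and the choice to retain all ancestors of each selected term are assumptions made for the example.

```python
def community_view(parents, selected, preferred_labels, default_labels):
    """Return {term_id: label} for the selected terms plus all their ancestors.

    parents          -- term_id -> list of parent term_ids (hypothetical hierarchy)
    selected         -- term_ids the community wants in its view
    preferred_labels -- community-specific labels that override the defaults
    default_labels   -- labels assigned by the ontology developers
    """
    keep = set()
    stack = list(selected)
    while stack:
        term = stack.pop()
        if term in keep:
            continue
        keep.add(term)
        # Walk upward so the subset stays connected to the root.
        stack.extend(parents.get(term, []))
    return {t: preferred_labels.get(t, default_labels[t]) for t in sorted(keep)}


# Hypothetical four-term source ontology.
parents = {
    "EX:planned_process": [],
    "EX:assay": ["EX:planned_process"],
    "EX:sequencing_assay": ["EX:assay"],
    "EX:imaging_assay": ["EX:assay"],
}
default_labels = {
    "EX:planned_process": "planned process",
    "EX:assay": "assay",
    "EX:sequencing_assay": "sequencing assay",
    "EX:imaging_assay": "imaging assay",
}
view = community_view(
    parents,
    selected=["EX:sequencing_assay"],
    preferred_labels={"EX:sequencing_assay": "sequencing run"},
    default_labels=default_labels,
)
print(view)
```

The resulting view drops the unselected sibling term, keeps the ancestors needed for context, and shows the community's label in place of the developer-assigned one.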
PMCID: PMC3998133  PMID: 24413522
9.  Minimum Information Specification For In Situ Hybridization and Immunohistochemistry Experiments (MISFISHIE) 
Nature biotechnology  2008;26(3):305-312.
One purpose of the biomedical literature is to report results in sufficient detail so that the methods of data collection and analysis can be independently replicated and verified. Here we present for consideration a minimum information specification for gene expression localization experiments, called the “Minimum Information Specification For In Situ Hybridization and Immunohistochemistry Experiments (MISFISHIE)”. It is modelled after the MIAME (Minimum Information About a Microarray Experiment) specification for microarray experiments. Data specifications like MIAME and MISFISHIE specify the information content without dictating a format for encoding that information. The MISFISHIE specification describes six types of information that should be provided for each experiment: Experimental Design, Biomaterials and Treatments, Reporters, Staining, Imaging Data, and Image Characterizations. This specification has benefited the consortium within which it was initially developed and is expected to benefit the wider research community. We welcome feedback from the scientific community to help improve our proposal.
PMCID: PMC4367930  PMID: 18327244
10.  EuPathDB: The Eukaryotic Pathogen database 
Nucleic Acids Research  2012;41(Database issue):D684-D691.
EuPathDB resources include 11 databases supporting eukaryotic pathogen genomic and functional genomic data, isolate data and phylogenomics. EuPathDB resources are built using the same infrastructure and provide a sophisticated search strategy system enabling complex interrogations of underlying data. Recent advances in EuPathDB resources include the design and implementation of a new data loading workflow, a new database supporting Piroplasmida (i.e. Babesia and Theileria), the addition of large amounts of new data and data types and the incorporation of new analysis tools. New data include genome sequences and annotation, strand-specific RNA-seq data, splice junction predictions (based on RNA-seq), phosphoproteomic data, high-throughput phenotyping data, single nucleotide polymorphism data based on high-throughput sequencing (HTS) and expression quantitative trait loci data. New analysis tools enable users to search for DNA motifs and define genes based on their genomic colocation, view results from searches graphically (i.e. genes mapped to chromosomes or isolates displayed on a map) and analyze data from columns in result tables (word cloud and histogram summaries of column content). The manuscript herein describes updates to EuPathDB since the previous report published in NAR in 2010.
PMCID: PMC3531183  PMID: 23175615
11.  AmoebaDB and MicrosporidiaDB: functional genomic resources for Amoebozoa and Microsporidia species 
Nucleic Acids Research  2010;39(Database issue):D612-D619.
AmoebaDB and MicrosporidiaDB are new functional genomic databases serving the amoebozoa and microsporidia research communities, respectively. AmoebaDB contains the genomes of three Entamoeba species (E. dispar, E. invadens and E. histolytica) and microarray expression data for E. histolytica. MicrosporidiaDB contains the genomes of Encephalitozoon cuniculi, E. intestinalis and E. bieneusi. The databases belong to the National Institute of Allergy and Infectious Diseases (NIAID) funded EuPathDB Bioinformatics Resource Center family of integrated databases and assume the same architectural and graphical design as other EuPathDB resources such as PlasmoDB and TriTrypDB. Importantly, they utilize the graphical strategy builder that affords a database user the ability to ask complex multi-data-type questions with relative ease and versatility. Genomic scale data can be queried based on BLAST searches, annotation keywords and gene ID searches, GO terms, sequence motifs, protein characteristics, phylogenetic relationships and functional data such as transcript (microarray and EST evidence) and protein expression data. Search strategies can be saved within a user’s profile for future retrieval and may also be shared with other researchers using a unique strategy web address.
PMCID: PMC3013638  PMID: 20974635
12.  Data Standards for Omics Data: The Basis of Data Sharing and Reuse 
To facilitate sharing of Omics data, many groups of scientists have been working to establish the relevant data standards. The main components of data sharing standards are experiment description standards, data exchange standards, terminology standards, and experiment execution standards. Here we provide a survey of existing and emerging standards that are intended to assist the free and open exchange of large-format data.
PMCID: PMC4152841  PMID: 21370078
Data sharing; Data exchange; Data standards; MGED; MIAME; Ontology; Data format; Microarray; Proteomics; Metabolomics
13.  CLO: The cell line ontology 
Cell lines have been widely used in biomedical research. The community-based Cell Line Ontology (CLO) is a member of the OBO Foundry library that covers the domain of cell lines. Since its publication two years ago, significant updates have been made, including new groups joining the CLO consortium, new cell line cells, upper level alignment with the Cell Ontology (CL) and the Ontology for Biomedical Investigations (OBI), and logical extensions.
Construction and content
Collaboration among the CLO, CL, and OBI has established consensus definitions of cell line-specific terms such as ‘cell line’, ‘cell line cell’, ‘cell line culturing’, and ‘mortal’ vs. ‘immortal cell line cell’. A cell line is a genetically stable cultured cell population that contains individual cell line cells. The hierarchical structure of the CLO is built based on the hierarchy of the in vivo cell types defined in CL and tissue types (from which cell line cells are derived) defined in the UBERON cross-species anatomy ontology. The new hierarchical structure makes it easier to browse, query, and perform automated classification. We have recently added classes representing more than 2,000 cell line cells from the RIKEN BRC Cell Bank to CLO. Overall, the CLO now contains ~38,000 classes of specific cell line cells derived from over 200 in vivo cell types from various organisms.
Utility and discussion
The CLO has been applied to different biomedical research studies. Example case studies include annotation and analysis of EBI ArrayExpress data, bioassays, and host-vaccine/pathogen interaction. CLO’s utility goes beyond a catalogue of cell line types. The alignment of the CLO with related ontologies combined with the use of ontological reasoners will support sophisticated inferencing to advance translational informatics development.
PMCID: PMC4387853  PMID: 25852852
Cell line; Cell line cell; Immortal cell line cell; Mortal cell line cell; Cell line cell culturing; Anatomy
14.  Insm1 promotes endocrine cell differentiation by modulating the expression of a network of genes that includes Neurog3 and Ripply3 
Development (Cambridge, England)  2014;141(15):2939-2949.
Insulinoma associated 1 (Insm1) plays an important role in regulating the development of cells in the central and peripheral nervous systems, olfactory epithelium and endocrine pancreas. To better define the role of Insm1 in pancreatic endocrine cell development we generated mice with an Insm1GFPCre reporter allele and used them to study Insm1-expressing and null populations. Endocrine progenitor cells lacking Insm1 were less differentiated and exhibited broad defects in hormone production, cell proliferation and cell migration. Embryos lacking Insm1 contained greater amounts of a non-coding Neurog3 mRNA splice variant and had fewer Neurog3/Insm1 co-expressing progenitor cells, suggesting that Insm1 positively regulates Neurog3. Moreover, endocrine progenitor cells that express either high or low levels of Pdx1, and thus may be biased towards the formation of specific cell lineages, exhibited cell type-specific differences in the genes regulated by Insm1. Analysis of the function of Ripply3, an Insm1-regulated gene enriched in the Pdx1-high cell population, revealed that it negatively regulates the proliferation of early endocrine cells. Taken together, these findings indicate that in developing pancreatic endocrine cells Insm1 promotes the transition from a ductal progenitor to a committed endocrine cell by repressing a progenitor cell program and activating genes essential for RNA splicing, cell migration, controlled cellular proliferation, vasculogenesis, extracellular matrix and hormone secretion.
PMCID: PMC4197673  PMID: 25053427
Pancreas development; Endocrine progenitor cells; Gene expression; Transcription factors; Mouse
15.  EuPathDB: a portal to eukaryotic pathogen databases 
Nucleic Acids Research  2009;38(Database issue):D415-D419.
EuPathDB (formerly ApiDB) is an integrated database covering the eukaryotic pathogens of the genera Cryptosporidium, Giardia, Leishmania, Neospora, Plasmodium, Toxoplasma, Trichomonas and Trypanosoma. While each of these groups is supported by a taxon-specific database built upon the same infrastructure, the EuPathDB portal offers an entry point to all these resources, and the opportunity to leverage orthology for searches across genera. The most recent release of EuPathDB includes updates and changes affecting data content, infrastructure and the user interface, improving data access and enhancing the user experience. EuPathDB currently supports more than 80 searches and the recently-implemented ‘search strategy’ system enables users to construct complex multi-step searches via a graphical interface. Search results are dynamically displayed as the strategy is constructed or modified, and can be downloaded, saved, revised, or shared with other database users.
PMCID: PMC2808945  PMID: 19914931
16.  Standardized Metadata for Human Pathogen/Vector Genomic Sequences 
PLoS ONE  2014;9(6):e99979.
High throughput sequencing has accelerated the determination of genome sequences for thousands of human infectious disease pathogens and dozens of their vectors. The scale and scope of these data are enabling genotype-phenotype association studies to identify genetic determinants of pathogen virulence and drug/insecticide resistance, and phylogenetic studies to track the origin and spread of disease outbreaks. To maximize the utility of genomic sequences for these purposes, it is essential that metadata about the pathogen/vector isolate characteristics be collected and made available in organized, clear, and consistent formats. Here we report the development of the GSCID/BRC Project and Sample Application Standard, developed by representatives of the Genome Sequencing Centers for Infectious Diseases (GSCIDs), the Bioinformatics Resource Centers (BRCs) for Infectious Diseases, and the U.S. National Institute of Allergy and Infectious Diseases (NIAID), part of the National Institutes of Health (NIH), informed by interactions with numerous collaborating scientists. It includes mapping to terms from other data standards initiatives, including the Genomic Standards Consortium’s minimal information (MIxS) and NCBI’s BioSample/BioProjects checklists and the Ontology for Biomedical Investigations (OBI). The standard includes data fields about characteristics of the organism or environmental source of the specimen, spatial-temporal information about the specimen isolation event, phenotypic characteristics of the pathogen/vector isolated, and project leadership and support. By modeling metadata fields into an ontology-based semantic framework and reusing existing ontologies and minimum information checklists, the application standard can be extended to support additional project-specific data fields and integrated with other data represented with comparable standards. 
The use of this metadata standard by all ongoing and future GSCID sequencing projects will provide a consistent representation of these data in the BRC resources and other repositories that leverage these data, allowing investigators to identify relevant genomic sequences and perform comparative genomics analyses that are both statistically meaningful and biologically relevant.
PMCID: PMC4061050  PMID: 24936976
17.  PlasmoDB: a functional genomic database for malaria parasites 
Nucleic Acids Research  2008;37(Database issue):D539-D543.
PlasmoDB is a functional genomic database for Plasmodium spp. that provides a resource for data analysis and visualization on a gene-by-gene or genome-wide scale. PlasmoDB belongs to a family of genomic resources that are housed under the EuPathDB Bioinformatics Resource Center (BRC) umbrella. The latest release, PlasmoDB 5.5, contains numerous new data types from several broad categories: annotated genomes, evidence of transcription, proteomics evidence, protein function evidence, population biology and evolution. Data in PlasmoDB can be queried by selecting the data of interest from a query grid or drop down menus. Various results can then be combined with each other on the query history page. Search results can be downloaded with associated functional data and registered users can store their query history for future retrieval or analysis.
PMCID: PMC2686598  PMID: 18957442
18.  GiardiaDB and TrichDB: integrated genomic resources for the eukaryotic protist pathogens Giardia lamblia and Trichomonas vaginalis 
Nucleic Acids Research  2008;37(Database issue):D526-D530.
GiardiaDB and TrichDB house the genome databases for Giardia lamblia and Trichomonas vaginalis, respectively, and represent the latest additions to the EuPathDB family of functional genomic databases. GiardiaDB and TrichDB employ the same framework as other EuPathDB sites (CryptoDB, PlasmoDB and ToxoDB), supporting fully integrated and searchable databases. Genomic-scale data available via these resources may be queried based on BLAST searches, annotation keywords and gene ID searches, GO terms, sequence motifs and other protein characteristics. Functional queries may also be formulated, based on transcript and protein expression data from a variety of platforms. Phylogenetic relationships may also be interrogated. The ability to combine the results from independent queries, and to store queries and query results for future use facilitates complex, genome-wide mining of functional genomic data.
PMCID: PMC2686445  PMID: 18824479
19.  ApiDB: integrated resources for the apicomplexan bioinformatics resource center 
Nucleic Acids Research  2006;35(Database issue):D427-D430.
ApiDB represents a unified entry point for the NIH-funded Apicomplexan Bioinformatics Resource Center (BRC) that integrates numerous database resources and multiple data types. The phylum Apicomplexa comprises numerous veterinary and medically important parasitic protozoa including human pathogenic species of the genera Cryptosporidium, Plasmodium and Toxoplasma. ApiDB serves not only as a database in its own right, but as a single web-based point of entry that unifies access to three major existing individual organism databases and integrates these databases with data available from additional sources. Through the ApiDB site, users may pose queries and search all available apicomplexan data and tools, or they may visit individual component organism databases.
PMCID: PMC1669770  PMID: 17098930
20.  Discovery Approaches to UPR in Athero-Susceptible Endothelium In Vivo 
Methods in enzymology  2011;489:10.1016/B978-0-12-385116-1.00007-8.
The endothelium is a monolayer of cells that lines the entire inner surface of the cardiovascular and lymphatic circulations where it controls normal physiological functions through both systemic and local regulation. Endothelial phenotypes are heterogeneous, dynamic and malleable, properties that in large- and medium-sized arteries lead to a central role in the development of focal and regional atherosclerosis. The endothelial phenotype in athero-susceptible sites is different from that in nearby athero-resistant regions. Understanding the in vivo gene, protein, and metabolic expression profiles of susceptible endothelium is, therefore, an important spatiotemporal challenge in atherosclerosis research. Recent studies have demonstrated that endoplasmic reticulum (ER) stress and the UPR are characteristics of susceptible endothelium. Here, we outline global genomic profiling, pathway analyses, and gene connectivity approaches to the identification of UPR and associated pathways as discrete markers of athero-susceptibility in arterial endothelium.
PMCID: PMC3833809  PMID: 21266227
21.  Dual lineage-specific expression of Sox17 during mouse embryogenesis 
Stem cells (Dayton, Ohio)  2012;30(10):2297-2308.
Sox17 is essential for both endoderm development and fetal hematopoietic stem cell (HSC) maintenance. While endoderm-derived organs are well known to originate from Sox17-expressing cells it is less certain whether fetal HSCs also originate from Sox17-expressing cells. By generating a Sox17GFPCre allele and using it to assess the fate of Sox17-expressing cells during embryogenesis we confirmed that both endodermal and a part of definitive hematopoietic cells are derived from Sox17-positive cells. Prior to E9.5 the expression of Sox17 is restricted to the endoderm lineage. However, at E9.5 Sox17 is expressed in the endothelial cells (ECs) at the para-aortic splanchnopleural (P-Sp) region that contribute to the formation of HSCs at a later stage. The identification of two distinct progenitor cell populations that express Sox17 at E9.5 was confirmed using FACS together with RNA-Seq to determine the gene expression profiles of the two cell populations. Interestingly, this analysis revealed differences in the RNA processing of the Sox17 mRNA during embryogenesis. Taken together, these results indicate that Sox17 is expressed in progenitor cells derived from two different germ layers, further demonstrating the complex expression pattern of this gene and suggesting caution when using Sox17 as a lineage-specific marker.
PMCID: PMC3448801  PMID: 22865702
22.  Stat and interferon genes identified by network analysis differentially regulate primitive and definitive erythropoiesis 
BMC Systems Biology  2013;7:38.
Hematopoietic ontogeny is characterized by overlapping waves of primitive, fetal definitive, and adult definitive erythroid lineages. Our aim is to identify differences in the transcriptional control of these distinct erythroid cell maturation pathways by inferring and analyzing gene-interaction networks from lineage-specific expression datasets. Inferred networks are strongly connected and do not fit a scale-free model, making it difficult to identify essential regulators using the hub-essentiality standard.
We employed a semi-supervised machine learning approach to integrate measures of network topology with expression data to score gene essentiality. The algorithm was trained and tested on the adult and fetal definitive erythroid lineages. When applied to the primitive erythroid lineage, 144 high scoring transcription factors were found to be differentially expressed between the primitive and adult definitive erythroid lineages, including all expressed STAT-family members. Differential responses of primitive and definitive erythroblasts to a Stat3 inhibitor and IFNγ in vitro supported the results of the computational analysis. Further investigation of the original expression data revealed a striking signature of Stat1-related genes in the adult definitive erythroid network. Among the potential pathways known to utilize Stat1, interferon (IFN) signaling-related genes were expressed almost exclusively within the adult definitive erythroid network.
In vitro results support the computational prediction that differential regulation and downstream effectors of STAT signaling are key factors that distinguish the transcriptional control of primitive and definitive erythroid cell maturation.
PMCID: PMC3668222  PMID: 23675896
Primitive erythropoiesis; Definitive erythropoiesis; Stat1; Stat3; IFN-signaling; Gene-regulatory networks; Co-expression network inference
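As a rough illustration of the hub-based reasoning mentioned in the abstract above, the following sketch computes node degrees from an undirected edge list and fits log10 p(k) against log10 k by least squares. This is one simple, admittedly crude way to ask whether a degree distribution looks scale-free. The function names and toy edge lists are hypothetical, and the abstract does not state how the authors assessed the fit, so this should not be read as their method.

```python
import math
from collections import Counter


def degrees(edges):
    """Return {node: degree} for an undirected edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)


def scale_free_fit(deg):
    """Least-squares fit of log10 p(k) against log10 k.

    Returns (slope, r_squared). A power-law degree distribution gives a
    negative slope and r_squared near 1. When every node has the same
    degree there is nothing to fit, and (0.0, 0.0) is returned.
    """
    counts = Counter(deg.values())          # degree k -> number of nodes
    n = sum(counts.values())
    xs = [math.log10(k) for k in counts]
    ys = [math.log10(c / n) for c in counts.values()]
    if len(xs) < 2:
        return 0.0, 0.0
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0, 0.0
    return sxy / sxx, (sxy * sxy) / (sxx * syy)


# Toy networks (hypothetical): one hub-dominated, one densely connected.
star = [("hub", g) for g in ("a", "b", "c", "d")]
dense = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d")]

print(scale_free_fit(degrees(star)))
print(scale_free_fit(degrees(dense)))
```

In the densely connected toy network every node has the same degree, so ranking genes by degree cannot single out a hub. This is the kind of situation in which a purely hub-based criterion is uninformative.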
23.  Comparative analysis of RNA-Seq alignment algorithms and the RNA-Seq unified mapper (RUM) 
Bioinformatics  2011;27(18):2518-2528.
Motivation: A critical task in high-throughput sequencing is aligning millions of short reads to a reference genome. Alignment is especially complicated for RNA sequencing (RNA-Seq) because of RNA splicing. A number of RNA-Seq algorithms are available and claim to align reads with high accuracy and efficiency while detecting splice junctions. RNA-Seq data are discrete in nature; therefore, with reasonable gene models and comparative metrics, RNA-Seq data can be simulated to sufficient accuracy to enable meaningful benchmarking of alignment algorithms. A rigorous comparison of all viable published RNA-Seq algorithms has not previously been performed.
Results: We developed an RNA-Seq simulator that models the main impediments to RNA alignment, including alternative splicing, insertions, deletions, substitutions, sequencing errors and intron signal. We used this simulator to measure the accuracy and robustness of available algorithms at the base and junction levels. Additionally, we used reverse transcription–polymerase chain reaction (RT–PCR) and Sanger sequencing to validate the ability of the algorithms to detect novel transcript features such as novel exons and alternative splicing in RNA-Seq data from mouse retina. A pipeline based on BLAT was developed to explore the performance of established tools for this problem, and to compare it to the recently developed methods. This pipeline, the RNA-Seq Unified Mapper (RUM), performs comparably to the best current aligners and provides an advantageous combination of accuracy, speed and usability.
Availability: The RUM pipeline is distributed via the Amazon Cloud and for computing clusters using the Sun Grid Engine (
Supplementary Information: The RNA-Seq sequence reads described in the article are deposited at GEO, accession GSE26248.
PMCID: PMC3167048  PMID: 21775302
24.  Using OrthoMCL to assign proteins to OrthoMCL-DB groups or to cluster proteomes into new ortholog groups 
OrthoMCL is an algorithm for grouping proteins into ortholog groups based on their sequence similarity. OrthoMCL-DB is a public database that allows users to browse and view ortholog groups that were pre-computed using the OrthoMCL algorithm. Version 4 of this database contained 116,536 ortholog groups clustered from 1,270,853 proteins obtained from 88 eukaryotic genomes, 16 archaeal genomes and 34 bacterial genomes. Future versions of OrthoMCL-DB will include more proteomes as more genomes are sequenced. Here, we describe how you can group your proteins of interest into ortholog clusters using two different means provided by the OrthoMCL system. The OrthoMCL-DB website has a tool for uploading and grouping a set of protein sequences, typically representing a proteome. This method maps the uploaded proteins to existing groups in OrthoMCL-DB. Alternatively, if you have proteins from a set of genomes that need to be grouped, you can download, install and run the standalone OrthoMCL software.
PMCID: PMC3196566  PMID: 21901743
OrthoMCL; ortholog groups; paralog; proteome; Markov clustering; reciprocal best hits; MCL
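The "reciprocal best hits" keyword above names a general idea that can be sketched briefly: two proteins from different proteomes are paired when each is the other's highest-scoring match. The example below is a simplified illustration with hypothetical protein identifiers and scores. It is not the OrthoMCL implementation, which, as the keywords indicate, also relies on Markov clustering (MCL) rather than on raw best-hit pairs alone.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    `hits` is an iterable of (query, subject, score) tuples, for example
    parsed from an all-against-all similarity search (assumed input format).
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical similarity scores between proteome A and proteome B.
a_vs_b = [("A1", "B1", 200), ("A1", "B2", 50), ("A2", "B1", 180), ("A2", "B2", 90)]
b_vs_a = [("B1", "A1", 210), ("B2", "A2", 95), ("B2", "A1", 40)]

print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

Here A2's best hit is B1, but B1's best hit is A1, so only the A1/B1 pair is reciprocal.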
25.  Coronary Artery Endothelial Transcriptome In Vivo: Identification of Endoplasmic Reticulum Stress and Enhanced ROS by Gene Connectivity Network Analysis 
Endothelial function is central to the localization of atherosclerosis. The in vivo endothelial phenotypic footprints of arterial bed identity and site-specific athero-susceptibility are addressed.
Methods and Results
98 endothelial cell samples from 13 discrete coronary and non-coronary arterial regions of varying susceptibilities to atherosclerosis were isolated from 76 normal swine. Transcript profiles were analyzed to determine the steady state in vivo endothelial phenotypes. An unsupervised systems biology approach utilizing weighted gene co-expression networks determined highly correlated endothelial genes. Connectivity network analysis identified 19 gene modules, 12 of which showed significant association with circulatory bed classification. Differential expression of 1,300 genes between coronary and non-coronary artery endothelium suggested distinct coronary endothelial phenotypes, with highest significance expressed in gene modules enriched for biological functions related to endoplasmic reticulum (ER) stress and unfolded protein binding, regulation of transcription and translation, and redox homeostasis. Furthermore, within coronary arteries, comparison of endothelial transcript profiles of susceptible proximal regions to protected distal regions suggested the presence of ER stress conditions in susceptible sites. Accumulation of reactive oxygen species (ROS) throughout coronary endothelium was greater than in non-coronary endothelium, consistent with coronary artery ER stress and the lower endothelial expression of anti-oxidant genes in coronary arteries.
Gene connectivity analyses discriminated between coronary and non-coronary endothelial transcript profiles and identified differential transcript levels associated with increased ER and oxidative stress in coronary arteries, consistent with enhanced susceptibility to atherosclerosis.
PMCID: PMC3116084  PMID: 21493819
weighted gene co-expression networks; microarray; endoplasmic reticulum stress; unfolded protein response; reactive oxygen species
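For readers unfamiliar with weighted gene co-expression networks, the sketch below shows the standard unsigned soft-threshold construction: the adjacency between two genes is the absolute Pearson correlation of their expression profiles raised to a power beta, and a gene's connectivity is the sum of its adjacencies. The value beta=6, the gene names, and the expression values are assumptions for illustration only. The abstract above does not report the parameters the authors used.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (assumes non-constant input)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def soft_adjacency(profiles, beta=6):
    """Unsigned weighted adjacency: a_ij = |cor(x_i, x_j)| ** beta for i != j."""
    genes = list(profiles)
    adj = {g: {} for g in genes}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            a = abs(pearson(profiles[g], profiles[h])) ** beta
            adj[g][h] = a
            adj[h][g] = a
    return adj


def connectivity(adj):
    """Connectivity k_i: sum of a gene's adjacencies to all other genes."""
    return {g: sum(neighbors.values()) for g, neighbors in adj.items()}


# Hypothetical expression profiles across four samples.
profiles = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8],   # perfectly correlated with g1
    "g3": [4, 3, 2, 1],   # perfectly anti-correlated with g1
    "g4": [1, 3, 2, 4],   # moderately correlated with g1
}

adj = soft_adjacency(profiles)
print(connectivity(adj))
```

Raising the correlation to a power keeps all gene pairs in the network while down-weighting weak correlations, which is what distinguishes a weighted network from one built with a hard correlation cutoff.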

