Motivation: A critical task in high-throughput sequencing is aligning millions of short reads to a reference genome. Alignment is especially complicated for RNA sequencing (RNA-Seq) because of RNA splicing. A number of RNA-Seq algorithms are available, and claim to align reads with high accuracy and efficiency while detecting splice junctions. RNA-Seq data are discrete in nature; therefore, with reasonable gene models and comparative metrics RNA-Seq data can be simulated to sufficient accuracy to enable meaningful benchmarking of alignment algorithms. The exercise to rigorously compare all viable published RNA-Seq algorithms has not been performed previously.
Results: We developed an RNA-Seq simulator that models the main impediments to RNA alignment, including alternative splicing, insertions, deletions, substitutions, sequencing errors and intron signal. We used this simulator to measure the accuracy and robustness of available algorithms at the base and junction levels. Additionally, we used reverse transcription–polymerase chain reaction (RT–PCR) and Sanger sequencing to validate the ability of the algorithms to detect novel transcript features such as novel exons and alternative splicing in RNA-Seq data from mouse retina. A pipeline based on BLAT was developed to explore the performance of established tools for this problem, and to compare it to the recently developed methods. This pipeline, the RNA-Seq Unified Mapper (RUM), performs comparably to the best current aligners and provides an advantageous combination of accuracy, speed and usability.
Availability: The RUM pipeline is distributed via the Amazon Cloud and for computing clusters using the Sun Grid Engine (http://cbil.upenn.edu/RUM).
Contact: email@example.com; firstname.lastname@example.org
Supplementary Information: The RNA-Seq sequence reads described in the article are deposited at GEO, accession GSE26248.
OrthoMCL is an algorithm for grouping proteins into ortholog groups based on their sequence similarity. OrthoMCL-DB is a public database that allows users to browse and view ortholog groups that were pre-computed using the OrthoMCL algorithm. Version 4 of this database contained 116,536 ortholog groups clustered from 1,270,853 proteins obtained from 88 eukaryotic genomes, 16 archaeal genomes and 34 bacterial genomes. Future versions of OrthoMCL-DB will include more proteomes as more genomes are sequenced. Here, we describe how you can group your proteins of interest into ortholog clusters using two different means provided by the OrthoMCL system. The OrthoMCL-DB website has a tool for uploading and grouping a set of protein sequences, typically representing a proteome. This method maps the uploaded proteins to existing groups in OrthoMCL-DB. Alternatively, if you have proteins from a set of genomes that need to be grouped, you can download, install and run the standalone OrthoMCL software.
OrthoMCL; ortholog groups; paralog; proteome; Markov clustering; reciprocal best hits; MCL
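The reciprocal best hit idea named in the keywords can be illustrated with a minimal sketch. This is not the OrthoMCL software itself: the function names, the score dictionaries and the example protein identifiers are hypothetical, and the score normalization and Markov clustering steps that OrthoMCL applies are left out.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: dict mapping (query, subject) -> similarity score (higher is better).
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical similarity scores between two small proteomes.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 50, ("a2", "b1"): 120}
b_vs_a = {("b1", "a1"): 190, ("b2", "a1"): 45}
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

Here a2 also hits b1, but b1's best hit is a1, so only the pair (a1, b1) is reciprocal.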
Endothelial function is central to the localization of atherosclerosis. The in vivo endothelial phenotypic footprints of arterial bed identity and site-specific athero-susceptibility are addressed.
Methods and Results
98 endothelial cell samples from 13 discrete coronary and non-coronary arterial regions of varying susceptibilities to atherosclerosis were isolated from 76 normal swine. Transcript profiles were analyzed to determine the steady state in vivo endothelial phenotypes. An unsupervised systems biology approach utilizing weighted gene co-expression networks determined highly correlated endothelial genes. Connectivity network analysis identified 19 gene modules, 12 of which showed significant association with circulatory bed classification. Differential expression of 1,300 genes between coronary and non-coronary artery endothelium suggested distinct coronary endothelial phenotypes with highest significance expressed in gene modules enriched for biological functions related to endoplasmic reticulum (ER) stress and unfolded protein binding, regulation of transcription and translation, and redox homeostasis. Furthermore, within coronary arteries comparison of endothelial transcript profiles of susceptible proximal regions to protected distal regions suggested the presence of ER stress conditions in susceptible sites. Accumulation of reactive oxygen species (ROS) throughout coronary endothelium was greater than in non-coronary endothelium consistent with coronary artery ER stress and the lower endothelial expression of anti-oxidant genes in coronary arteries.
Gene connectivity analyses discriminated between coronary and non-coronary endothelial transcript profiles and identified differential transcript levels associated with increased ER and oxidative stress in coronary arteries, consistent with enhanced susceptibility to atherosclerosis.
weighted gene co-expression networks; microarray; endoplasmic reticulum stress; unfolded protein response; reactive oxygen species
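As a rough illustration of how a weighted gene co-expression network can be built, the sketch below raises the absolute Pearson correlation between two expression profiles to a soft-thresholding power. The power beta = 6, the unsigned form of the adjacency, the function names and the toy profiles are all assumptions made for illustration; the abstract does not state which settings were used in the study.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def adjacency(profiles, beta=6):
    """Unsigned weighted adjacency |r|**beta for every unordered gene pair."""
    genes = sorted(profiles)
    return {
        (g, h): abs(pearson(profiles[g], profiles[h])) ** beta
        for i, g in enumerate(genes)
        for h in genes[i + 1:]
    }


def connectivity(adj, gene):
    """Sum of adjacency weights over all pairs that include the gene."""
    return sum(weight for pair, weight in adj.items() if gene in pair)


# Hypothetical expression profiles across four samples.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
}
adj = adjacency(profiles)
```

Raising the correlation to a power keeps strong correlations close to 1 while pushing weak ones toward 0, so the network is weighted rather than cut at a hard threshold.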
Genome sequencing of many eukaryotic pathogens and the volume of data available on public resources have created a clear requirement for a consistent vocabulary to describe the range of developmental forms of parasites. Consistent labeling of experimental data and external data, in databases and the literature, is essential for integration, cross database comparison, and knowledge discovery. The primary objective of this work was to develop a dynamic and controlled vocabulary that can be used for various parasites. The paper describes the Ontology for Parasite Lifecycle (OPL) and discusses its application in parasite research.
The OPL is based on the Basic Formal Ontology (BFO) and follows the rules set by the OBO Foundry consortium. The first version of the OPL models complex life cycle stage details of a range of parasites, such as Trypanosoma sp., Leishmania sp., Plasmodium sp., and Schistosoma sp. In addition, the ontology models necessary contextual details, such as host information, vector information, and anatomical locations. OPL is primarily designed to serve as a reference ontology for parasite life cycle stages that can be used for database annotation purposes and in the lab for data integration or information retrieval, as exemplified in the application section below.
OPL is freely available at http://purl.obolibrary.org/obo/opl.owl and has been submitted to the BioPortal site of NCBO and to the OBO Foundry. We believe that database and phenotype annotations using OPL will help run fundamental queries on databases to know more about gene functions and to find intervention targets for various parasites. The OPL is under continuous development and new parasites and/or terms are being added.
The ever-increasing scale of biological data sets, particularly those arising in the context of high-throughput technologies, requires the development of rich data exploration tools. In this article, we present AnnotCompute, an information discovery platform for repositories of functional genomics experiments such as ArrayExpress. Our system leverages semantic annotations of functional genomics experiments with controlled vocabulary and ontology terms, such as those from the MGED Ontology, to compute conceptual dissimilarities between pairs of experiments. These dissimilarities are then used to support two types of exploratory analysis—clustering and query-by-example. We show that our proposed dissimilarity measures correspond to a user's intuition about conceptual dissimilarity, and can be used to support effective query-by-example. We also evaluate the quality of clustering based on these measures. While AnnotCompute can support a richer data exploration experience, its effectiveness is limited in some cases, due to the quality of available annotations. Nonetheless, tools such as AnnotCompute may provide an incentive for richer annotations of experiments. Code is available for download at http://www.cbil.upenn.edu/downloads/AnnotCompute.
Database URL: http://www.cbil.upenn.edu/annotCompute/
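To make the notion of an annotation-based dissimilarity concrete, the sketch below uses the Jaccard dissimilarity between the sets of ontology terms attached to two experiments and ranks experiments against a query example. The Jaccard measure is a stand-in chosen for illustration, not necessarily one of the measures AnnotCompute uses, and the function names, experiment identifiers and terms are hypothetical.

```python
def jaccard_dissimilarity(terms_a, terms_b):
    """1 - |A & B| / |A | B|; two empty annotation sets count as identical."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def query_by_example(example_id, annotations, top_k=3):
    """Return the ids of the top_k experiments least dissimilar to the example."""
    query = annotations[example_id]
    ranked = sorted(
        (jaccard_dissimilarity(query, terms), exp_id)
        for exp_id, terms in annotations.items()
        if exp_id != example_id
    )
    return [exp_id for _, exp_id in ranked[:top_k]]


# Hypothetical experiments annotated with controlled-vocabulary terms.
annotations = {
    "E1": {"mouse", "liver", "time_course"},
    "E2": {"mouse", "liver", "dose_response"},
    "E3": {"human", "kidney"},
}
neighbours = query_by_example("E1", annotations)
```

The same pairwise dissimilarities could be passed to any standard clustering routine, which is the second type of exploratory analysis the abstract mentions.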
Web sites associated with the Eukaryotic Pathogen Bioinformatics Resource Center (EuPathDB.org) have recently introduced a graphical user interface, the Strategies WDK, intended to make advanced searching and set and interval operations easy and accessible to all users. With a design guided by usability studies, the system helps motivate researchers to perform dynamic computational experiments and explore relationships across data sets. For example, PlasmoDB users seeking novel therapeutic targets may wish to locate putative enzymes that distinguish pathogens from their hosts, and that are expressed during appropriate developmental stages. When a researcher runs one of the approximately 100 searches available on the site, the search is presented as a first step in a strategy. The strategy is extended by running additional searches, which are combined with set operators (union, intersect or minus), or genomic interval operators (overlap, contains). A graphical display uses Venn diagrams to make the strategy’s flow obvious. The interface facilitates interactive adjustment of the component searches with changes propagating forward through the strategy. Users may save their strategies, creating protocols that can be shared with colleagues. The strategy system has now been deployed on all EuPathDB databases, and successfully deployed by other projects. The Strategies WDK uses a configurable MVC architecture that is compatible with most genomics and biological warehouse databases, and is available for download at code.google.com/p/strategies-wdk.
Database URL: www.eupathdb.org
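The way a strategy combines search results with set operators can be sketched in a few lines. This is only an illustration of the union, intersect and minus operators described above, not the Strategies WDK API; the function name, the operator table and the gene identifiers are hypothetical, and the genomic interval operators are not covered.

```python
# Hypothetical operator table mirroring the set operators named in the text.
OPERATORS = {
    "union": lambda a, b: a | b,
    "intersect": lambda a, b: a & b,
    "minus": lambda a, b: a - b,
}


def run_strategy(first_result, steps):
    """Fold (operator, result_set) steps onto the first search result, left to right."""
    current = set(first_result)
    for operator, result in steps:
        current = OPERATORS[operator](current, set(result))
    return current


# Hypothetical search results, loosely following the PlasmoDB example:
# putative enzymes, minus genes with host orthologs, intersected with
# genes expressed at the developmental stage of interest.
enzymes = {"g1", "g2", "g3", "g4"}
host_orthologs = {"g2"}
stage_expressed = {"g1", "g3", "g5"}
targets = run_strategy(
    enzymes,
    [("minus", host_orthologs), ("intersect", stage_expressed)],
)
```

Because each step only depends on the result of the previous one, changing an early search and re-running the fold propagates the change forward through the whole strategy.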
Endothelial function and dysfunction are central to the focal origin and regional development of atherosclerosis; however, an in vivo endothelial phenotypic footprint of susceptibility to atherosclerosis preceding pathological change remains elusive.
To conduct a comparative multi-site genomics study of arterial endothelial phenotype in athero-susceptible and athero-protected regions.
Methods and Results
Transcript profiles of freshly isolated endothelial cells from 7 discrete arterial regions in normal swine were analyzed to determine the steady state in vivo endothelial phenotypes in regions of varying susceptibilities to atherosclerosis. The most abundant common feature of the endothelium of all athero-susceptible regions was the upregulation of genes associated with endoplasmic reticulum (ER) stress. The unfolded protein response (UPR) pathway, induced by ER stress, was therefore investigated in detail in endothelium of the athero-susceptible aortic arch and was found to be partially activated. ER transmembrane signal transducers IRE1α and ATF6α and their downstream effectors, but not PERK, were activated concomitant with a higher transcript expression of protein folding enzymes and chaperones, indicative of ER stress in vivo.
The findings demonstrate the prevalence of chronic endothelial ER stress and activated UPR in vivo at athero-susceptible arterial sites. We propose that chronic localized biological stress is linked to spatial susceptibility of the endothelium to the initiation of atherosclerosis.
hemodynamics; DNA microarrays; gene expression
Summary: Computational methods in molecular biology will increasingly depend on standards-based annotations that describe biological experiments in an unambiguous manner. Annotare is a software tool that enables biologists to easily annotate their high-throughput experiments, biomaterials and data in a standards-compliant way that facilitates meaningful search and analysis.
Availability and Implementation: Annotare is available from http://code.google.com/p/annotare/ under the terms of the open-source MIT License (http://www.opensource.org/licenses/mit-license.php). It has been tested on both Mac and Windows.
Pancreatic β cells, organized in the islets of Langerhans, sense glucose and secrete appropriate amounts of insulin. We have studied the roles of LKB1, a conserved kinase implicated in the control of cell polarity and energy metabolism, in adult β cells. LKB1-deficient β cells show a dramatic increase in insulin secretion in vivo. Histologically, LKB1-deficient β cells have striking alterations in the localization of the nucleus and cilia relative to blood vessels, suggesting a shift from hepatocyte-like to columnar polarity. Additionally, LKB1 deficiency causes a 65% increase in β cell volume. We show that distinct targets of LKB1 mediate these effects. LKB1 controls β cell size, but not polarity, via the mTOR pathway. Conversely, the precise position of the β cell nucleus, but not cell size, is controlled by the LKB1 target Par1b. Insulin secretion and content are restricted by LKB1, at least in part, via AMPK. These results expose a molecular mechanism, orchestrated by LKB1, for the coordinated maintenance of β cell size, form, and function.
The apicomplexans are a diverse phylum of parasites causing an assortment of diseases including malaria in a wide variety of animals and lymphoproliferation in cattle. Little is known about how these varied parasites regulate their transcriptional regulons. Even less is known about how regulon systems, consisting of transcription factors and target genes together with their associated biological process, evolve in these diverse parasites.
In order to obtain insights into the differences in transcriptional regulation between these parasites we compared the orthology profiles of putative malaria transcription factors across species and examined the enrichment patterns of four binding sites across eleven apicomplexans.
About three-fifths of the factors are broadly conserved in several phylogenetic orders of sequenced apicomplexans. This observation suggests the existence of regulons whose regulation is conserved across this ancient phylum. Transcription factors not broadly conserved across the phylum are possibly involved in regulon systems that have diverged between species. Examining binding site enrichment patterns in light of transcription factor conservation patterns suggests a second mode via which regulon systems may diverge - rewiring of existing transcription factors and their associated binding sites in specific ways. Integrating binding sites with transcription factor conservation patterns also facilitated prediction of putative regulators for one of the binding sites.
Even though transcription factors are underrepresented in apicomplexans, the distribution of these factors and their associated regulons reflect common and family-specific transcriptional regulatory processes.
The development of the Functional Genomics Investigation Ontology (FuGO) is a collaborative, international effort that will provide a resource for annotating functional genomics investigations, including the study design, protocols and instrumentation used, the data generated and the types of analysis performed on the data. FuGO will contain both terms that are universal to all functional genomics investigations and those that are domain specific. In this way, the ontology will serve as the “semantic glue” to provide a common understanding of data from across these disparate data sources. In addition, FuGO will reference out to existing mature ontologies to avoid the need to duplicate these resources, and will do so in such a way as to enable their ease of use in annotation. This project is in the early stages of development; the paper will describe efforts to initiate the project, the scope and organization of the project, the work accomplished to date, and the challenges encountered, as well as future plans.
Malaria-causing Plasmodium species exhibit marked differences including host choice and preference for invading particular cell types. The genetic bases of phenotypic differences between parasites can be understood, in part, by investigating constraints on gene expression and genic sequences, both coding and regulatory.
We investigated the evolutionary constraints on sequence and expression of parasitic genes by applying comparative genomics approaches to 6 Plasmodium genomes and 2 genome-wide expression studies. We found that the coding regions of Plasmodium transcription factor and sexual development genes are relatively less constrained, as are those of genes encoding CCCH zinc fingers and invasion proteins, which all play important roles in these parasites. Transcription factors and genes with stage-restricted expression have conserved upstream regions and so do several gene classes critical to the parasite's lifestyle, namely, ion transport, invasion, chromatin assembly and CCCH zinc fingers. Additionally, a cross-species comparison of expression patterns revealed that Plasmodium-specific genes exhibit significant expression divergence.
Overall, constraints on Plasmodium's protein coding regions confirm observations from other eukaryotes in that transcription factors are under relatively lower constraint. Proteins relevant to the parasite's unique lifestyle also have lower constraint on their coding regions. Greater conservation between Plasmodium species in terms of promoter motifs suggests tight regulatory control of lifestyle genes. However, an interspecies divergence in expression patterns of these genes suggests that either expression is controlled via genomic or epigenomic features not encoded in the proximal promoter sequence, or alternatively, the combinatorial interactions between motifs confer species-specific expression patterns.
ToxoDB (http://ToxoDB.org) is a genome and functional genomic database for the protozoan parasite Toxoplasma gondii. It incorporates the sequence and annotation of the T. gondii ME49 strain, as well as genome sequences for the GT1, VEG and RH (Chr Ia, Chr Ib) strains. Sequence information is integrated with various other genomic-scale data, including community annotation, ESTs, gene expression and proteomics data. ToxoDB has matured significantly since its initial release. Here we outline the numerous updates with respect to the data and increased functionality available on the website.
Serum response factor (SRF) binds a 1,216-fold degenerate cis element known as the CArG box. CArG boxes are found primarily in muscle and growth factor-associated genes, though the full spectrum of functional CArG elements in the genome (the CArGome) has yet to be defined. Here we describe a genome-wide screen to further define the functional mammalian CArGome. A computational approach involving comparative genomic analyses of human and mouse orthologous genes uncovered over 100 hypothetical SRF-dependent genes, including 10 previously identified SRF targets, harboring a conserved CArG element within 4,000 base pairs (bp) of the annotated transcription start site (TSS). We PCR-cloned 89 hypothetical SRF targets and subjected each of them to at least two of several validations, including luciferase reporter, gel shift, chromatin immunoprecipitation, and mRNA expression following RNAi knockdown of SRF; 60/89 (67%) of the targets were validated. Interestingly, 26 of the validated SRF target genes encode cytoskeletal/contractile or adhesion proteins. RNAi knockdown of SRF diminishes expression of several SRF-dependent cytoskeletal genes and elicits an attending perturbation in the cytoarchitecture of both human and rodent cells. These data illustrate the power of integrating existing algorithms to interrogate the genome in a relatively unbiased fashion for cis-regulatory element discovery. In this manner, we have further expanded the mammalian CArGome with the discovery of an array of cyto-contractile genes that coordinate normal cytoskeletal homeostasis. We suggest one function of SRF is that of an ancient master regulator of the actin cytoskeleton.
Two Supplemental Tables are included with this manuscript. Dr. Robert J. Schwartz provided unpublished data that we refer to in Discussion and Dr. Bradford C. Berk provided us with human umbilical vein endothelial cells for studies detailed in Figure 5.
serum response factor; comparative genomics; regulatory element; RNAi; genome
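The figure of 1,216 CArG box variants can be reproduced by brute-force enumeration under two assumptions that the abstract does not state explicitly: that the consensus element is the 10-mer CC(A/T)6GG, and that a sequence counts as a CArG box if it deviates from that consensus at no more than one position. The function name below is hypothetical.

```python
from itertools import product


def carg_variants(max_mismatches=1):
    """Enumerate 10-mers within max_mismatches of the assumed consensus CC(A/T)6GG."""
    # Allowed bases per position: C, C, six positions of A or T, then G, G.
    allowed = [{"C"}, {"C"}] + [{"A", "T"}] * 6 + [{"G"}, {"G"}]
    variants = set()
    for seq in product("ACGT", repeat=10):
        mismatches = sum(base not in ok for base, ok in zip(seq, allowed))
        if mismatches <= max_mismatches:
            variants.add("".join(seq))
    return variants


perfect = carg_variants(max_mismatches=0)
degenerate = carg_variants(max_mismatches=1)
```

Under these assumptions there are 64 perfect matches (2 choices at each of the six A/T positions), 768 sequences with one mismatch in the flanking CC or GG (4 positions × 3 alternative bases × 64), and 384 with one G or C in the A/T core (6 positions × 2 alternative bases × 32), which sums to 1,216.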
Genomic aberrations recurrent in a particular cancer type can be important prognostic markers for tumor progression. Typically in early tumorigenesis, cells incur a breakdown of the DNA replication machinery that results in an accumulation of genomic aberrations in the form of duplications, deletions, translocations, and other genomic alterations. Microarray methods allow for finer mapping of these aberrations than has previously been possible; however, data processing and analysis methods have not taken full advantage of this higher resolution. Attention has primarily been given to analysis on the single sample level, where multiple adjacent probes are necessarily used as replicates for the local region containing their target sequences. However, regions of concordant aberration can be short enough to be detected by only one, or very few, array elements. We describe a method called Multiple Sample Analysis for assessing the significance of concordant genomic aberrations across multiple experiments that does not require a priori definition of aberration calls for each sample. When multiple samples represent a class, our method exploits the replication across samples to detect concordant aberrations at much higher resolution than current single sample approaches allow. Additionally, this method provides a meaningful approach to addressing population-based questions such as determining important regions for a cancer subtype of interest or determining regions of copy number variation in a population. Multiple Sample Analysis also provides single sample aberration calls in the locations of significant concordance, producing high-resolution calls for each sample within concordant regions.
The approach is demonstrated on a dataset representing a challenging but important resource: breast tumors that have been formalin-fixed, paraffin-embedded, archived, and subsequently UV-laser capture microdissected and hybridized to two-channel BAC arrays using an amplification protocol. We demonstrate the accurate detection on simulated data, and on real datasets involving known regions of aberration within subtypes of breast cancer at a resolution consistent with that of the array. Similarly, we apply our method to previously published datasets, including a 250K SNP array, and verify known results as well as detect novel regions of concordant aberration. The algorithm has been fully implemented and tested and is freely available as a Java application at http://www.cbil.upenn.edu/MSA.
Cancer is a genetic disease caused by genomic mutations that confer an increased ability to proliferate and survive in a specific environment. It is now known that many regions of genomic DNA are deleted or amplified in specific cancer types. These aberrations are believed to occur randomly in the genome. If these aberrations overlap more than would be expected by chance across individual occurrences of the cancer this suggests a selective pressure on this aberration. These conserved aberrations likely represent regions that are important for the development, progression, and survival of a specific cancer type in its environment. We present a method for identifying these conserved aberrations within a class of samples. The applications for this method include accurate high resolution mapping of aberrations characteristic of cancer subtypes as well as other genetic diseases and determination of conserved copy number variations in the population. With the use of high resolution microarray methods we have profiled different tumor types. We have been able to create high resolution profiles of conserved aberrations in specific cancer types. These conserved aberrations are prime targets for cancer therapies and many of these regions have already been used to develop effective cancer therapeutics.
COGRIM, an implementation that integrates gene expression, ChIP binding and transcription factor motif data, is described and applied to both unicellular and mammalian organisms.
We present a Bayesian hierarchical model and Gibbs Sampling implementation that integrates gene expression, ChIP binding, and transcription factor motif data in a principled and robust fashion. COGRIM was applied to both unicellular and mammalian organisms under different scenarios of available data. In these applications, we demonstrate the ability to predict gene-transcription factor interactions with reduced numbers of false-positive findings and to make predictions beyond what is obtained when single types of data are considered.
ApiDB represents a unified entry point for the NIH-funded Apicomplexan Bioinformatics Resource Center (BRC) that integrates numerous database resources and multiple data types. The phylum Apicomplexa comprises numerous veterinary and medically important parasitic protozoa including human pathogenic species of the genera Cryptosporidium, Plasmodium and Toxoplasma. ApiDB serves not only as a database in its own right, but as a single web-based point of entry that unifies access to three major existing individual organism databases (including CryptoDB.org), and integrates these databases with data available from additional sources. Through the ApiDB site, users may pose queries and search all available apicomplexan data and tools, or they may visit individual component organism databases.
EPConDB is a public web site that supports research in diabetes, pancreatic development and beta-cell function by providing information about genes expressed in cells of the pancreas. EPConDB displays expression profiles for individual genes and information about transcripts, promoter elements and transcription factor binding sites. Gene expression results are obtained from studies examining tissue expression, pancreatic development and growth, differentiation of insulin-producing cells, islet or beta-cell injury, and genetic models of impaired beta-cell function. The expression datasets are derived using different microarray platforms, including the BCBC PancChips and Affymetrix gene expression arrays. Other datasets include semi-quantitative RT–PCR and MPSS expression studies. For selected microarray studies, lists of differentially expressed genes, derived from PaGE analysis, are displayed on the site. EPConDB provides database queries and tools to examine the relationship between a gene, its transcriptional regulation, protein function and expression in pancreatic tissues.
The OrthoMCL database houses ortholog group predictions for 55 species, including 16 bacterial and 4 archaeal genomes representing phylogenetically diverse lineages, and most currently available complete eukaryotic genomes: 24 unikonts (12 animals, 9 fungi, microsporidium, Dictyostelium, Entamoeba), 4 plants/algae and 7 apicomplexan parasites. OrthoMCL software was used to cluster proteins based on sequence similarity, using an all-against-all BLAST search of each species' proteome, followed by normalization of inter-species differences, and Markov clustering. A total of 511 797 proteins (81.6% of the total dataset) were clustered into 70 388 ortholog groups. The ortholog database may be queried based on protein or group accession numbers, keyword descriptions or BLAST similarity. Ortholog groups exhibiting specific phyletic patterns may also be identified, using either a graphical interface or a text-based Phyletic Pattern Expression grammar. Information for ortholog groups includes the phyletic profile, the list of member proteins and a multiple sequence alignment, a statistical summary and graphical view of similarities, and a graphical representation of domain architecture. OrthoMCL software, the entire FASTA dataset employed and clustering results are available for download. OrthoMCL-DB provides a centralized warehouse for orthology prediction among multiple species, and will be updated and expanded as additional genome sequence data become available.
A genome-wide analysis of promoters was carried out in the context of gene expression patterns in tissue surveys using human microarray and EST-based expression data. The study revealed that most genes show statistically significant tissue-dependent variations of expression level and identified components of promoters that distinguish tissue-specific from ubiquitous genes.
The regulatory mechanisms underlying tissue specificity are a crucial part of the development and maintenance of multicellular organisms. A genome-wide analysis of promoters in the context of gene-expression patterns in tissue surveys provides a means of identifying the general principles for these mechanisms.
We introduce a definition of tissue specificity based on Shannon entropy to rank human genes according to their overall tissue specificity and by their specificity to particular tissues. We apply our definition to microarray-based and expressed sequence tag (EST)-based expression data for human genes and use similar data for mouse genes to validate our results. We show that most genes show statistically significant tissue-dependent variations in expression level. We find that the most tissue-specific genes typically have a TATA box, no CpG island, and often code for extracellular proteins. As expected, CpG islands are found in most of the least tissue-specific genes, which often code for proteins located in the nucleus or mitochondrion. The class of genes with no CpG island or TATA box are the most common mid-specificity genes and commonly code for proteins located in a membrane. Sp1 was found to be a weak indicator of less-specific expression. YY1 binding sites, either as initiators or as downstream sites, were strongly associated with the least-specific genes.
We have begun to understand the components of promoters that distinguish tissue-specific from ubiquitous genes, to identify associations that can predict the broad class of gene expression from sequence data alone.
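An entropy-based ranking of tissue specificity can be sketched as follows. This is a minimal illustration that assumes a gene's expression levels are first normalized into a distribution across tissues and that the Shannon entropy of this distribution, in bits, is used as the overall specificity score; the function names, gene names and expression values are hypothetical, and the per-tissue specificity score mentioned in the abstract is not reproduced here.

```python
import math


def tissue_entropy(expression):
    """Shannon entropy (bits) of a gene's relative expression across tissues.

    expression: dict mapping tissue name -> non-negative expression level.
    Low entropy means expression is concentrated in few tissues.
    """
    total = sum(expression.values())
    if total <= 0:
        raise ValueError("gene has no detectable expression")
    probs = [level / total for level in expression.values() if level > 0]
    return -sum(p * math.log2(p) for p in probs)


def rank_by_specificity(genes):
    """Return gene names ordered from most tissue-specific (lowest entropy)."""
    return sorted(genes, key=lambda name: tissue_entropy(genes[name]))


# Hypothetical expression levels in four tissues.
genes = {
    "ubiquitous": {"liver": 5.0, "brain": 5.0, "heart": 5.0, "lung": 5.0},
    "specific": {"liver": 20.0, "brain": 0.0, "heart": 0.0, "lung": 0.0},
    "intermediate": {"liver": 10.0, "brain": 10.0, "heart": 0.0, "lung": 0.0},
}
ranking = rank_by_specificity(genes)
```

With N tissues the entropy ranges from 0 bits for a gene expressed in a single tissue to log2(N) bits for uniform expression, so here the three hypothetical genes score 0, 1 and 2 bits.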
A great deal of data in functional genomics studies needs to be annotated with low-resolution anatomical terms. For example, gene expression assays based on manually dissected samples (microarray, SAGE, etc.) need high-level anatomical terms to describe sample origin. First-pass annotation in high-throughput assays (e.g. large-scale in situ gene expression screens or phenotype screens) and bibliographic applications, such as selection of keywords, would also benefit from a minimum set of standard anatomical terms. Although only simple terms are required, the researcher faces serious practical problems of inconsistency and confusion, given the different aims and the range of complexity of existing anatomy ontologies. A Standards and Ontologies for Functional Genomics (SOFG) group therefore initiated discussions between several of the major anatomical ontologies for higher vertebrates. As we report here, one result of these discussions is a simple, accessible, controlled vocabulary of gross anatomical terms, the SOFG Anatomy Entry List (SAEL).
The SAEL is available from http://www.sofg.org and is intended as a resource for biologists, curators, bioinformaticians and developers of software supporting functional genomics. It can be used directly for annotation in the contexts described above. Importantly, each term is linked to the corresponding term in each of the major anatomy ontologies. Where the simple list does not provide enough detail or sophistication, therefore, the researcher can use the SAEL to choose the appropriate ontology and move directly to the relevant term as an entry point. The SAEL links will also be used to support computational access to the respective ontologies.
ApiEST-DB (http://www.cbil.upenn.edu/paradbs-servlet/) provides integrated access to publicly available EST data from protozoan parasites in the phylum Apicomplexa. The database currently incorporates a total of nearly 100 000 ESTs from several parasite species of clinical and/or veterinary interest, including Eimeria tenella, Neospora caninum, Plasmodium falciparum, Sarcocystis neurona and Toxoplasma gondii. To facilitate analysis of these data, EST sequences were clustered and assembled to form consensus sequences for each organism, and these assemblies were then subjected to automated annotation via similarity searches against protein and domain databases. The underlying relational database infrastructure, Genomics Unified Schema (GUS), enables complex biologically based queries, facilitating validation of gene models, identification of alternative splicing, detection of single nucleotide polymorphisms, identification of stage-specific genes and recognition of phylogenetically conserved and phylogenetically restricted sequences.
EuPathDB (http://eupathdb.org) resources include 11 databases supporting eukaryotic pathogen genomic and functional genomic data, isolate data and phylogenomics. EuPathDB resources are built using the same infrastructure and provide a sophisticated search strategy system enabling complex interrogations of underlying data. Recent advances in EuPathDB resources include the design and implementation of a new data loading workflow, a new database supporting Piroplasmida (i.e. Babesia and Theileria), the addition of large amounts of new data and data types and the incorporation of new analysis tools. New data include genome sequences and annotation, strand-specific RNA-seq data, splice junction predictions (based on RNA-seq), phosphoproteomic data, high-throughput phenotyping data, single nucleotide polymorphism data based on high-throughput sequencing (HTS) and expression quantitative trait loci data. New analysis tools enable users to search for DNA motifs and define genes based on their genomic colocation, view results from searches graphically (i.e. genes mapped to chromosomes or isolates displayed on a map) and analyze data from columns in result tables (word cloud and histogram summaries of column content). The manuscript herein describes updates to EuPathDB since the previous report published in NAR in 2010.
The Microarray Gene Expression Data (MGED) society was formed with an initial focus on experiments involving microarray technology. Despite the diversity of applications, there are common concepts used and a common need to capture experimental information in a standardized manner. In building the MGED ontology, it was recognized that it would be impractical to cover all the different types of experiments on all the different types of organisms by listing and defining all the types of organisms and their properties. Our solution was to create a framework for describing microarray experiments with an initial focus on the biological sample and its manipulation. For concepts that are common for many species, we could provide a manageable listing of controlled terms. For concepts that are species-specific or whose values cannot be readily listed, we created an ‘OntologyEntry’ concept that referenced an external resource. The MGED ontology is a work in progress that needs additional instances and particularly needs constraints to be added. The ontology currently covers the experimental sample and design, and we have begun capturing aspects of the microarrays themselves as well. The primary application of the ontology will be to develop forms for entering information into databases, and consequently allowing queries, taking advantage of the structure provided by the ontology. The application of an ontology of experimental conditions extends beyond microarray experiments and, as the scope of MGED includes other aspects of functional genomics, so too will the MGED ontology.