With the effort of the International Phenotyping Consortium (IMPC) to produce thousands
of strains with conditional potential gathering steam, there is growing recognition that it must be
supported by a rich toolbox of cre driver strains. The approaches to build cre strains have evolved
in both sophistication and reliability, replacing first generation strains with tools that can
target individual cell populations with incredible precision and specificity. The modest set of cre
drivers generated by individual labs over the past 15+ years is now growing rapidly, thanks
to a number of large-scale projects to produce new cre strains for the community. The power of this
growing resource, however, depends upon the proper deep characterization of strain function, as even
the best designed strain can display a variety of undesirable features that must be considered in
experimental design. This must be coupled with the parallel development of informatics tools to
provide functional data to the user, and facilitated access to the strains through public
repositories. We will discuss the current progress on all of these fronts and the challenges that
remain to ensure the scientific community can capitalize on the tremendous number of mouse resources
at their disposal.
The ultimate goal of studying model organisms is to translate what is learned into useful knowledge about normal human biology and disease to facilitate treatment and early screening for diseases. Recent advances in genomic technologies allow for rapid generation of models with a range of targeted genotypes as well as their characterization by high-throughput phenotyping. As an abundance of phenotype data become available, only systematic analysis will facilitate valid conclusions to be drawn from these data and transferred to human diseases. Owing to the volume of data, automated methods are preferable, allowing for a reliable analysis of the data and providing evidence about possible gene–disease associations.
Here, we propose Phenotype comparisons for DIsease Genes and Models (PhenoDigm), as an automated method to provide evidence about gene–disease associations by analysing phenotype information. PhenoDigm integrates data from a variety of model organisms and, at the same time, uses several intermediate scoring methods to identify only strongly data-supported gene candidates for human genetic diseases. We show results of an automated evaluation as well as selected manually assessed examples that support the validity of PhenoDigm. Furthermore, we provide guidance on how to browse the data with PhenoDigm’s web interface and illustrate its usefulness in supporting research.
Mouse phenotype data represents a valuable resource for the identification of disease-associated genes, especially where the molecular basis is unknown and there is no clue to the candidate gene’s function, pathway involvement or expression pattern. However, until recently these data have not been systematically used due to difficulties in mapping between clinical features observed in humans and mouse phenotype annotations. Here, we describe a semantic approach to solve this problem and demonstrate highly significant recall of known disease-gene associations and orthology relationships. A web application (MouseFinder; www.mousemodels.org) has been developed to allow users to search the results of our whole-phenome comparison of human and mouse. We demonstrate its use in identifying ARTN as a strong candidate gene within the 1p34.1-p32 mapped locus for a hereditary form of ptosis.
phenotype; candidate disease genes; model organism; mouse
Mammalian genes are regulated through the action of multiple regulatory elements, often distributed across large regions. The mechanisms that control the integration of these diverse inputs into specific gene expression patterns are still poorly understood. New approaches enabling the dissection of these mechanisms in vivo are needed.
Here, we describe TRACER (http://tracerdatabase.embl.de), a resource that centralizes information from a large on-going functional exploration of the mouse genome with different transposon-associated regulatory sensors. Hundreds of insertions have been mapped to specific genomic positions, and their corresponding regulatory potential has been documented by analysis of the expression of the reporter sensor gene in mouse embryos. The data can be easily accessed and provides information on the regulatory activities present in a large number of genomic regions, notably in gene-poor intervals that have been associated with human diseases.
TRACER data enables comparisons with the expression pattern of neighbouring genes, activity of surrounding regulatory elements or with other genomic features, revealing the underlying regulatory architecture of these loci. TRACER mouse lines can also be requested for in vivo transposition and chromosomal engineering, to analyse further regions of interest.
Gene regulation and expression; Genome organisation; Regulatory landscapes; Chromosomal engineering; Mouse models of human structural variation
Numerous disease syndromes are associated with regions of copy number variation (CNV) in the human genome and, in most cases, the pathogenicity of the CNV is thought to be related to altered dosage of the genes contained within the affected segment. However, establishing the contribution of individual genes to the overall pathogenicity of CNV syndromes is difficult and often relies on the identification of potential candidates through manual searches of the literature and online resources. We describe here the development of a computational framework to comprehensively search phenotypic information from model organisms and single-gene human hereditary disorders, and thus speed the interpretation of the complex phenotypes of CNV disorders. There are currently more than 5000 human genes about which nothing is known phenotypically but for which detailed phenotypic information for the mouse and/or zebrafish orthologs is available. Here, we present an ontology-based approach to identify similarities between human disease manifestations and the mutational phenotypes in characterized model organism genes; this approach can therefore be used even in cases where there is little or no information about the function of the human genes. We applied this algorithm to detect candidate genes for 27 recurrent CNV disorders and identified 802 gene-phenotype associations, approximately half of which involved genes that were previously reported to be associated with individual phenotypic features and half of which were novel candidates. A total of 431 associations were made solely on the basis of model organism phenotype data. Additionally, we observed a striking, statistically significant tendency for individual disease phenotypes to be associated with multiple genes located within a single CNV region, a phenomenon that we denote as pheno-clustering. Many of the clusters also display statistically significant similarities in protein function or vicinity within the protein-protein interaction network. Our results provide a basis for understanding previously un-interpretable genotype-phenotype correlations in pathogenic CNVs and for mobilizing the large amount of model organism phenotype data to provide insights into human genetic disorders.
Phenotype analyses, e.g. investigating metabolic processes, tissue formation, or organism behavior, are an important element of most biological and medical research activities. Biomedical researchers are making increased use of ontological standards and methods to capture the results of such analyses, with one focus being the comparison and analysis of phenotype information between species.
We have generated a cross-species phenotype ontology for human, mouse and zebra fish that contains zebrafish phenotypes. We also provide up-to-date annotation data connecting human genes to phenotype classes from the generated ontology. We have included the data generation pipeline into our continuous integration system ensuring stable and up-to-date releases.
This article describes the data generation process and is intended to help interested researchers access both the phenotype annotation data and the associated cross-species phenotype ontology. The resource described here can be used in sophisticated semantic similarity and gene set enrichment analyses for phenotype data across species. The stable releases of this resource can be obtained from
The CreZOO (http://www.crezoo.org/) is the European virtual repository of Cre and other targeted conditional driver strains. These mice serve as tools for researchers to selectively ‘switch off’ gene expression in mouse models to examine gene function and disease pathology. CreZOO aims to capture and disseminate extant and new information on these Cre driver strains, such as genetic background and availability information, and details pertaining promoter, allele, inducibility and expression patterns, which are also presented. All transgenic strains carry detailed information according to MGI's official nomenclature, whereas their availability [e.g. live mice, cryopreserved embryos, sperm and embryonic stem (ES) cells] is clearly indicated with links to European and International databases and repositories (EMMA, MGI/IMSR, MMRRC, etc) and laboratories where the particular mouse strain is available together with the respective IDs. Each promoter/gene includes IDs and direct links to MGI, Entrez Gene, Ensembl, OMIM and RGD databases depending on their species origin, whereas allele information is presented with MGI IDs and active hyperlinks to redirect the user to the respective page in a new tab. The tissue/cell (special) and developmental (temporal) specificity expression patterns are clearly presented, whereas handling and genotyping details (in the form of documents or hyperlinks) together with all relevant publications are clearly presented with PMID(s) and direct PubMed links. CreZOO's design offers a user-friendly query interface and provides instant access to the list of conditional driver strains, promoters and inducibility details. Database access is free of charge and there are no registration requirements for data querying. CreZOO is being developed in the context of the CREATE consortium (http://www.creline.org/), a core of major European and international mouse database holders and research groups involved in conditional mutagenesis.
http://www.crezoo.org/; alternative URL: http://www.e-mouse.org/
In this article, we describe the use of the BioMart data management system to provide integrated access to International Knockout Mouse Consortium (IKMC) data and other related mouse resources. The IKMC is currently mutating all mouse protein-coding genes in embryonic stem (ES) cells using gene targeting and gene trapping approaches. The BioMart portal allows researchers to identify and obtain IKMC knockout vectors, ES cells and mice for genes of interest. Gene annotation, expression, phenotype and disease data is also integrated from external BioMarts, allowing selection of IKMC products by a wide variety of criteria. These products are invaluable for researchers involved in the elucidation of gene function and the role of individual genes in human disease. Here, we describe these datasets in more detail and illustrate the functionality of the portal using several examples.
Database URL: http://www.knockoutmouse.org/mart
BioMart Central Portal is a first of its kind, community-driven effort to provide unified access to dozens of biological databases spanning genomics, proteomics, model organisms, cancer data, ontology information and more. Anybody can contribute an independently maintained resource to the Central Portal, allowing it to be exposed to and shared with the research community, and linking it with the other resources in the portal. Users can take advantage of the common interface to quickly utilize different sources without learning a new system for each. The system also simplifies cross-database searches that might otherwise require several complicated steps. Several integrated tools streamline common tasks, such as converting between ID formats and retrieving sequences. The combination of a wide variety of databases, an easy-to-use interface, robust programmatic access and the array of tools make Central Portal a one-stop shop for biological data querying. Here, we describe the structure of Central Portal and show example queries to demonstrate its capabilities.
Database URL: http://central.biomart.org.
The present article proposes the adoption of a community-defined, uniform, generic description of the core attributes of biological databases, BioDBCore. The goals of these attributes are to provide a general overview of the database landscape, to encourage consistency and interoperability between resources; and to promote the use of semantic and syntactic standards. BioDBCore will make it easier for users to evaluate the scope and relevance of available resources. This new resource will increase the collective impact of the information present in biological databases.
The present article proposes the adoption of a community-defined, uniform, generic description of the core attributes of biological databases, BioDBCore. The goals of these attributes are to provide a general overview of the database landscape, to encourage consistency and interoperability between resources and to promote the use of semantic and syntactic standards. BioDBCore will make it easier for users to evaluate the scope and relevance of available resources. This new resource will increase the collective impact of the information present in biological databases.
The International Knockout Mouse Consortium (IKMC) aims to mutate all protein-coding genes in the mouse using a combination of gene targeting and gene trapping in mouse embryonic stem (ES) cells and to make the generated resources readily available to the research community. The IKMC database and web portal (www.knockoutmouse.org) serves as the central public web site for IKMC data and facilitates the coordination and prioritization of work within the consortium. Researchers can access up-to-date information on IKMC knockout vectors, ES cells and mice for specific genes, and follow links to the respective repositories from which corresponding IKMC products can be ordered. Researchers can also use the web site to nominate genes for targeting, or to indicate that targeting of a gene should receive high priority. The IKMC database provides data to, and features extensive interconnections with, other community databases.
The recent explosion of biological data and the concomitant proliferation of distributed databases make it challenging for biologists and bioinformaticians to discover the best data resources for their needs, and the most efficient way to access and use them. Despite a rapid acceleration in uptake of syntactic and semantic standards for interoperability, it is still difficult for users to find which databases support the standards and interfaces that they need. To solve these problems, several groups are developing registries of databases that capture key metadata describing the biological scope, utility, accessibility, ease-of-use and existence of web services allowing interoperability between resources. Here, we describe some of these initiatives including a novel formalism, the Database Description Framework, for describing database operations and functionality and encouraging good database practise. We expect such approaches will result in improved discovery, uptake and utilization of data resources.
Database URL: http://www.casimir.org.uk/casimir_ddf
The laboratory mouse has become the organism of choice for discovering gene function and unravelling pathogenetic mechanisms of human diseases through the application of various functional genomic approaches. The resulting deluge of data has led to the deployment of numerous online resources and the concomitant need for formalized experimental descriptions, data standardization, database interoperability and integration, a need that has yet to be met. We present here the Mouse Resource Browser (MRB), a database of mouse databases that indexes 217 publicly available mouse resources under 22 categories and uses a standardised database description framework (the CASIMIR DDF) to provide information on their controlled vocabularies (ontologies and minimum information standards), and technical information on programmatic access and data availability. Focusing on interoperability and integration, MRB offers automatic generation of downloadable and re-distributable SOAP application-programming interfaces for resources that provide direct database access. MRB aims to provide useful information to both bench scientists, who can easily navigate and find all mouse related resources in one place, and bioinformaticians, who will be provided with interoperable resources containing data which can be mined and integrated.
Database URL: http://bioit.fleming.gr/mrb
The maturing field of genomics is rapidly increasing the number of sequenced genomes and producing more information from those previously sequenced. Much of this additional information is variation data derived from sampling multiple individuals of a given species with the goal of discovering new variants and characterising the population frequencies of the variants that are already known. These data have immense value for many studies, including those designed to understand evolution and connect genotype to phenotype. Maximising the utility of the data requires that it be stored in an accessible manner that facilitates the integration of variation data with other genome resources such as gene annotation and comparative genomics.
The Ensembl project provides comprehensive and integrated variation resources for a wide variety of chordate genomes. This paper provides a detailed description of the sources of data and the methods for creating the Ensembl variation databases. It also explores the utility of the information by explaining the range of query options available, from using interactive web displays, to online data mining tools and connecting directly to the data servers programmatically. It gives a good overview of the variation resources and future plans for expanding the variation data within Ensembl.
Variation data is an important key to understanding the functional and phenotypic differences between individuals. The development of new sequencing and genotyping technologies is greatly increasing the amount of variation data known for almost all genomes. The Ensembl variation resources are integrated into the Ensembl genome browser and provide a comprehensive way to access this data in the context of a widely used genome bioinformatics system. All Ensembl data is freely available at http://www.ensembl.org and from the public MySQL database server at ensembldb.ensembl.org.
XGAP, a software platform for the integration and analysis of genotype and phenotype data.
We present an extensible software model for the genotype and phenotype community, XGAP. Readers can download a standard XGAP (http://www.xgap.org) or auto-generate a custom version using MOLGENIS with programming interfaces to R-software and web-services or user interfaces for biologists. XGAP has simple load formats for any type of genotype, epigenotype, transcript, protein, metabolite or other phenotype data. Current functionality includes tools ranging from eQTL analysis in mouse to genome-wide association studies in humans.
The integration of information present in many disparate biological databases represents a major challenge in biomedical research. To define the problems and needs, and to explore strategies for database integration in mouse functional genomics, we consulted the biologist user community and implemented solutions to two user-defined use-cases.
We organised workshops, meetings and used a questionnaire to identify the needs of biologist database users in mouse functional genomics. As a result, two use-cases were developed that can be used to drive future designs or extensions of mouse databases. Here, we present the use-cases and describe some initial computational solutions for them. The application for the gene-centric use-case, "MUSIG-Gen" starts from a list of gene names and collects a wide range of data types from several distributed databases in a "shopping cart"-like manner. The iterative user-driven approach is a response to strongly articulated requests from users, especially those without computational biology backgrounds. The application for the phenotype-centric use-case, "MUSIG-Phen", is based on a similar concept and starting from phenotype descriptions retrieves information for associated genes.
The use-cases created, and their prototype software implementations should help to better define biologists' needs for database integration and may serve as a starting point for future bioinformatics solutions aimed at end-user biologists.
Ensembl (http://www.ensembl.org) integrates genomic information for a comprehensive set of chordate genomes with a particular focus on resources for human, mouse, rat, zebrafish and other high-value sequenced genomes. We provide complete gene annotations for all supported species in addition to specific resources that target genome variation, function and evolution. Ensembl data is accessible in a variety of formats including via our genome browser, API and BioMart. This year marks the tenth anniversary of Ensembl and in that time the project has grown with advances in genome technology. As of release 56 (September 2009), Ensembl supports 51 species including marmoset, pig, zebra finch, lizard, gorilla and wallaby, which were added in the past year. Major additions and improvements to Ensembl since our previous report include the incorporation of the human GRCh37 assembly, enhanced visualisation and data-mining options for the Ensembl regulatory features and continued development of our software infrastructure.
Following the technological advances that have enabled genome-wide analysis in most model organisms over the last decade, there has been unprecedented growth in genomic and post-genomic science with concomitant generation of an exponentially increasing volume of data and material resources. As a result, numerous repositories have been created to store and archive data, organisms and material, which are of substantial value to the whole community. Sustained access, facilitating re-use of these resources, is essential, not only for validation, but for re-analysis, testing of new hypotheses and developing new technologies/platforms. A common challenge for most data resources and biological repositories today is finding financial support for maintenance and development to best serve the scientific community. In this study we examine the problems that currently confront the data and resource infrastructure underlying the biomedical sciences. We discuss the financial sustainability issues and potential business models that could be adopted by biological resources and consider long term preservation issues within the context of mouse functional genomics efforts in Europe.
The laboratory mouse is the premier animal model for studying human disease and thousands of mutants have been identified or produced, most recently through gene-specific mutagenesis approaches. High throughput strategies by the International Knockout Mouse Consortium (IKMC) are producing mutants for all protein coding genes. Generating a knock-out line involves huge monetary and time costs so capture of both the data describing each mutant alongside archiving of the line for distribution to future researchers is critical. The European Mouse Mutant Archive (EMMA) is a leading international network infrastructure for archiving and worldwide provision of mouse mutant strains. It operates in collaboration with the other members of the Federation of International Mouse Resources (FIMRe), EMMA being the European component. Additionally EMMA is one of four repositories involved in the IKMC, and therefore the current figure of 1700 archived lines will rise markedly. The EMMA database gathers and curates extensive data on each line and presents it through a user-friendly website. A BioMart interface allows advanced searching including integrated querying with other resources e.g. Ensembl. Other resources are able to display EMMA data by accessing our Distributed Annotation System server. EMMA database access is publicly available at http://www.emmanet.org.
BioMart Central Portal (www.biomart.org) offers a one-stop shop solution to access a wide array of biological databases. These include major biomolecular sequence, pathway and annotation databases such as Ensembl, Uniprot, Reactome, HGNC, Wormbase and PRIDE; for a complete list, visit, http://www.biomart.org/biomart/martview. Moreover, the web server features seamless data federation making cross querying of these data sources in a user friendly and unified way. The web server not only provides access through a web interface (MartView), it also supports programmatic access through a Perl API as well as RESTful and SOAP oriented web services. The website is free and open to all users and there is no login requirement.
Biologists need to perform complex queries, often across a variety of databases. Typically, each data resource provides an advanced query interface, each of which must be learnt by the biologist before they can begin to query them. Frequently, more than one data source is required and for high-throughput analysis, cutting and pasting results between websites is certainly very time consuming. Therefore, many groups rely on local bioinformatics support to process queries by accessing the resource's programmatic interfaces if they exist. This is not an efficient solution in terms of cost and time. Instead, it would be better if the biologist only had to learn one generic interface. BioMart provides such a solution.
BioMart enables scientists to perform advanced querying of biological data sources through a single web interface. The power of the system comes from integrated querying of data sources regardless of their geographical locations. Once these queries have been defined, they may be automated with its "scripting at the click of a button" functionality. BioMart's capabilities are extended by integration with several widely used software packages such as BioConductor, DAS, Galaxy, Cytoscape, Taverna. In this paper, we describe all aspects of BioMart from a user's perspective and demonstrate how it can be used to solve real biological use cases such as SNP selection for candidate gene screening or annotation of microarray results.
BioMart is an easy to use, generic and scalable system and therefore, has become an integral part of large data resources including Ensembl, UniProt, HapMap, Wormbase, Gramene, Dictybase, PRIDE, MSD and Reactome. BioMart is freely accessible to use at .
Improved techniques for defining disease-gene location and evaluating the biological candidacy of regional transcripts will hasten disease-gene discovery.
Improved techniques for defining disease-gene location and evaluating the biological candidacy of regional transcripts will hasten disease-gene discovery.
The ZNF198-FGFR1 fusion gene arises as a result of the t(8;13)(p11;q12) in the 8p11 myeloproliferative syndrome. To determine the transforming properties of this chimeric protein we transfected ZNF198-FGFR1 into the interleukin (IL)-3 dependent cell line Ba/F3. Growth factor independent subclones were obtained in which ZNF198-FGFR1, STAT1, and STAT5 were constitutively tyrosine phosphorylated, as determined by immunoprecipitation and Western blot analysis. To test the hypothesis that constitutive activation of ZNF198-FGFR1 tyrosine kinase activity is a result of self-association of the fusion protein, we in vitro transcribed and translated ZNF198-FGFR1 and a derivative construct, ZNF198-FGFR1 ΔC-myc, in which the C-terminal FGFR1 epitope was replaced by a c-myc tag. As expected, an anti-FGFR1 antibody immunoprecipitated ZNF198-FGFR1 but not ZNF198-FGFR1 ΔC-myc. However when both products were translated together, both were coimmunoprecipitated by anti-FGFR1 antisera. Similar results were obtained by using an anti-myc antibody and demonstrated a physical interaction between the two proteins. Analysis of COS-7 cells transfected with ZNF198-FGFR1 demonstrated that the fusion gene, in contrast to normal FGFR1, is located in the cytoplasm. We conclude that ZNF198-FGFR1 is a cytoplasmic protein that self-associates and has constitutive transformation activity. These data suggest that ZNF198-FGFR1 plays a primary role in the pathogenesis of the t(8;13) myeloproliferative syndrome and is the first report to implicate STAT proteins in FGFR1-mediated signaling.
ZNF198; FGFR1; t(8;13); tyrosine kinase