A significant challenge of in-vivo studies is the identification of phenotypes with a method that is robust and reliable. The challenge arises from practical issues that lead to experimental designs which are not ideal. Breeding issues, particularly in the presence of fertility or fecundity problems, frequently lead to data being collected in multiple batches. This problem is acute in high throughput phenotyping programs. In addition, in a high throughput environment operational issues lead to controls not being measured on the same day as knockouts. We highlight how traditional methods, such as a Student’s t-test or a 2-way ANOVA, give flawed results in these situations and should not be used. We explore the use of mixed models using worked examples from the Sanger Mouse Genome Project, focusing on Dual-Energy X-Ray Absorptiometry data for the analysis of mouse knockout data, and compare them to a reference range approach. We show that mixed model analysis is more sensitive and less prone to artefacts, allowing the discovery of subtle quantitative phenotypes essential for correlating a gene’s function to human disease. We demonstrate how a mixed model approach has the additional advantage of being able to include covariates, such as body weight, to separate the effect of genotype from these covariates. This is a particular issue in knockout studies, where body weight is a common phenotype, and accounting for it will enhance the precision of assigning phenotypes and the subsequent selection of lines for secondary phenotyping. The use of mixed models with in-vivo studies has value not only in improving the quality and sensitivity of the data analysis but also ethically, as a method suitable for small batches reduces the breeding burden of a colony. This will reduce the use of animals, increase throughput, and decrease cost whilst improving the quality and depth of knowledge gained.
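One common way to write a mixed model of this kind is sketched below. It is an illustration under stated assumptions, not the exact specification used in the study: genotype and body weight enter as fixed effects and batch enters as a random intercept. All symbols are chosen here for illustration.

```latex
% Assumed notation: y_{ij} is the phenotype of animal i measured in batch j.
y_{ij} = \beta_0 + \beta_1\,\mathrm{genotype}_{ij} + \beta_2\,\mathrm{weight}_{ij} + b_j + \varepsilon_{ij},
\qquad b_j \sim \mathcal{N}(0, \sigma_b^2), \quad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

Under this formulation a genotype effect corresponds to a test of \beta_1 = 0, with batch-to-batch variation absorbed by b_j rather than being confounded with genotype.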
The genes involved in conferring susceptibility to anxiety remain obscure. We developed a new method to identify genes at quantitative trait loci (QTLs) in a population of heterogeneous stock mice descended from known progenitor strains. QTLs were partitioned into intervals that can be summarized by a single phylogenetic tree among progenitors and intervals tested for consistency with alleles influencing anxiety at each QTL. By searching for common Gene Ontology functions in candidate genes positioned within those intervals, we identified actin depolymerizing factors (ADFs), including cofilin-1 (Cfl1), as genes involved in regulating anxiety in mice. There was no enrichment for function in the totality of genes under each QTL, indicating the importance of phylogenetic filtering. We confirmed experimentally that forebrain-specific inactivation of Cfl1 decreased anxiety in knockout mice. Our results indicate that similarity of function of mammalian genes can be used to recognize key genetic regulators of anxiety and potentially of other emotional behaviours.
Thousands of small effect loci are believed to contribute to behavioural variation in mammals. Their abundance and small size frustrate gene identification and make it difficult to know which among them are central to the responsible biological mechanisms. Using imputed genome sequences from 2,000 outbred mice and by testing for an enrichment of functional annotations, we identify 167 candidate genes involved in anxiety. Unexpectedly, annotations implicate actin depolymerizing factors (ADFs), including cofilin-1 (Cfl1), as being involved with the expression of anxiety phenotypes in mice. We confirmed that forebrain-specific inactivation of Cfl1 decreased anxiety in knockout mice.
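The abstract does not say which statistical test was used for annotation enrichment. A frequently used choice is the one-sided hypergeometric test, sketched here as an assumption; the function name and the numbers in the usage line are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_genome, n_annotated, n_candidates, n_hits):
    """Upper-tail hypergeometric probability P(X >= n_hits).

    n_genome:     total number of genes considered (assumed background set)
    n_annotated:  genes in the background carrying the annotation
    n_candidates: genes in the candidate list
    n_hits:       candidate genes carrying the annotation
    """
    upper = min(n_annotated, n_candidates)
    total = comb(n_genome, n_candidates)
    tail = sum(
        comb(n_annotated, k) * comb(n_genome - n_annotated, n_candidates - k)
        for k in range(n_hits, upper + 1)
    )
    return tail / total


# Hypothetical counts, for illustration only.
p = hypergeom_enrichment_p(n_genome=20000, n_annotated=100, n_candidates=167, n_hits=5)
```

A small p suggests the annotation is over-represented among the candidates relative to the background; in practice a correction for testing many annotations would also be needed.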
Structural variation is widespread in mammalian genomes [1,2] and is an important cause of disease [3], but just how abundant and important structural variants (SVs) are in shaping phenotypic variation remains unclear [4,5]. Without knowing how many SVs there are, and how they arise, it is difficult to discover what they do. Combining experimental with automated analyses, we identified 0.71M SVs at 0.28M sites in the genomes of thirteen classical and four wild-derived inbred mouse strains. The majority of SVs are less than 1 kilobase in size and 98% are deletions or insertions. The breakpoints of 0.16M SVs were mapped to base pair resolution allowing us to infer that insertion of retrotransposons causes more than half of SVs. Yet, despite their prevalence, SVs are less likely than other sequence variants to cause gene-expression or quantitative phenotypic variation. We identified 24 SVs that disrupt coding exons, acting as rare variants of large effect on gene function. One third of the genes so affected have immunological functions.
The Collaborative Cross (CC) is a panel of recombinant inbred lines derived from eight genetically diverse laboratory inbred strains. Recently, the genetic architecture of the CC population was reported based on the genotype of a single male per line, and other publications reported incompletely inbred CC mice that have been used to map a variety of traits. The three breeding sites, in the US, Israel, and Australia, are actively collaborating to accelerate the inbreeding process through marker-assisted inbreeding and to expedite community access of CC lines deemed to have reached defined thresholds of inbreeding. Plans are now being developed to provide access to this novel genetic reference population through distribution centers. Here we provide a description of the distribution efforts by the University of North Carolina Systems Genetics Core, Tel Aviv University, Israel and the University of Western Australia.
We report genome sequences of 17 inbred strains of laboratory mice and identify almost ten times more variants than previously known. We use these genomes to explore the phylogenetic history of the laboratory mouse and to examine the functional consequences of allele-specific variation on transcript abundance, revealing that at least 12% of transcripts show a significant tissue-specific expression bias. By identifying candidate functional variants at 718 quantitative trait loci we show that the molecular nature of functional variants and their position relative to genes vary according to the effect size of the locus. These sequences provide a starting point for a new era in the functional analysis of a key model organism.
Multicellular organisms can be regenerated from totipotent differentiated somatic cell or nuclear founders [1–3]. Organisms regenerated from clonally related isogenic founders might a priori have been expected to be phenotypically invariant. However, clonal regenerant animals display variant phenotypes caused by defective epigenetic reprogramming of gene expression, and clonal regenerant plants exhibit poorly understood heritable phenotypic (“somaclonal”) variation [4–7]. Here we show that somaclonal variation in regenerant Arabidopsis lineages is associated with genome-wide elevation in DNA sequence mutation rate. We also show that regenerant mutations comprise a distinctive molecular spectrum of base substitutions, insertions, and deletions that probably results from decreased DNA repair fidelity. Finally, we show that while regenerant base substitutions are a likely major genetic cause of the somaclonal variation of regenerant Arabidopsis lineages, transposon movement is unlikely to contribute substantially to that variation. We conclude that the phenotypic variation of regenerant plants, unlike that of regenerant animals, is substantially due to DNA sequence mutation.
► Regenerant Arabidopsis lineages display heritable phenotypic variation
► Regenerant Arabidopsis lineages display elevated genome-wide DNA sequence mutation
► Regenerant DNA sequence mutations comprise a distinct molecular spectrum
► Regenerant base substitution mutations confer heritable phenotypic variation
During a meeting of the SYSGENET working group ‘Bioinformatics’, currently available software tools and databases for systems genetics in mice were reviewed and the needs for future developments discussed. The group evaluated interoperability and performed initial feasibility studies. To aid future compatibility of software and exchange of already developed software modules, a strong recommendation was made by the group to integrate HAPPY and R/qtl analysis toolboxes, GeneNetwork and XGAP database platforms, and TIQS and xQTL processing platforms. R should be used as the principal computer language for QTL data analysis in all platforms and a ‘cloud’ should be used for software dissemination to the community. Furthermore, the working group recommended that all data models and software source code should be made visible in public repositories to allow a coordinated effort on the use of common data structures and file formats.
QTL mapping; database; mouse; systems genetics
The onset of flowering is an important adaptive trait in plants. The small ephemeral species Arabidopsis thaliana grows under a wide range of temperature and day-length conditions across much of the Northern hemisphere, and a number of flowering-time loci that vary between different accessions have been identified before. However, only a few studies have addressed the species-wide genetic architecture of flowering-time control. We have taken advantage of a set of 18 distinct accessions that represent much of the common genetic diversity of A. thaliana and mapped quantitative trait loci (QTL) for flowering time in 17 F2 populations derived from these parents. We found that the majority of flowering-time QTL cluster in as few as five genomic regions, which include the locations of the entire FLC/MAF clade of transcription factor genes. By comparing effects across shared parents, we conclude that in several cases there might be an allelic series caused by rare alleles. While this finding parallels results obtained for maize, in contrast to maize much of the variation in flowering time in A. thaliana appears to be due to large-effect alleles.
A large correlation between variation in T cell subsets and hippocampal neurogenesis suggests that the immune system has an unexpectedly large influence on the brain.
Neurogenesis continues through the adult life of mice in the subgranular zone of the dentate gyrus in the hippocampus, but its function remains unclear. Measuring cellular proliferation in the hippocampus of 719 outbred heterogeneous stock mice revealed a highly significant correlation with the proportions of CD8+ versus CD4+ T lymphocyte subsets. This correlation reflected shared genetic loci, with the exception of the H-2Ea locus that had a dominant influence on T cell subsets but no impact on neurogenesis. Analysis of knockouts and repopulation of TCRα-deficient mice by subsets of T cells confirmed the influence of T cells on adult neurogenesis, indicating that CD4+ T cells or subpopulations thereof mediate the effect. Our results reveal an organismal impact, broader than hitherto suspected, of the natural genetic variation that controls T cell development and homeostasis.
In adult mice new neurons are produced in the hippocampus, where they are thought to influence learning, memory, and emotional regulation. The mechanisms and functions of this neurogenesis, however, remain unclear. Here we report that in different strains of mice, variation in cellular proliferation in the hippocampus (an index of neurogenesis) correlates with variation in the ratio of CD4+ to CD8+ T cells (an immunology phenotype). We also show that T cells can influence neurogenesis (but that neurogenesis does not influence T cells) by analyzing knockouts, depleting mice of T cells, and repopulating alymphoid animals. The strong genetic correlation between T cells and cellular proliferation in the hippocampus contrasts with the weak, often non-significant, correlation with behavioral phenotypes. Importantly, these findings suggest that modulation of the functions of the hippocampus to influence behavior is not the primary role of neurogenesis.
Array comparative genomic hybridization (aCGH) to detect copy number variants (CNVs) in mammalian genomes has led to a growing awareness of the potential importance of this category of sequence variation as a cause of phenotypic variation. Yet there are large discrepancies between studies, so that the extent of the genome affected by CNVs is unknown. We combined molecular and aCGH analyses of CNVs in inbred mouse strains to investigate this question.
Using a 2.1 million probe array we identified 1,477 deletions and 499 gains in 7 inbred mouse strains. Molecular characterization indicated that approximately one third of the CNVs detected by the array were false positives and we estimate the false negative rate to be more than 50%. We show that low concordance between studies is largely due to the molecular nature of CNVs, many of which consist of a series of smaller deletions and gains interspersed by regions where the DNA copy number is normal.
Our results indicate that CNVs detected by arrays may be the coincidental co-localization of smaller CNVs, whose presence is more likely to perturb an aCGH hybridization profile than the effect of an isolated, small, copy number alteration. Our findings help explain the hitherto unexplored discrepancies between array-based studies of copy number variation in the mouse genome.
Genome-wide association studies using commercially available outbred mice can detect genes involved in phenotypes of biomedical interest. Useful populations need high-frequency alleles to ensure high power to detect quantitative trait loci (QTLs), low linkage disequilibrium between markers to obtain accurate mapping resolution, and an absence of population structure to prevent false positive associations. We surveyed 66 colonies for inbreeding, genetic diversity, and linkage disequilibrium, and we demonstrate that some have haplotype blocks of less than 100 Kb, enabling gene-level mapping resolution. The same alleles contribute to variation in different colonies, so that when mapping progress stalls in one, another can be used in its stead. Colonies are genetically diverse: 45% of the total genetic variation is attributable to differences between colonies. However, quantitative differences in allele frequencies, rather than the existence of private alleles, are responsible for these population differences. The colonies derive from a limited pool of ancestral haplotypes resembling those found in inbred strains: over 95% of sequence variants segregating in outbred populations are found in inbred strains. Consequently it is possible to impute the sequence of any mouse from a dense SNP map combined with inbred strain sequence data, which opens up the possibility of cataloguing and testing all variants for association, a situation that has so far eluded studies in completely outbred populations. We demonstrate the colonies' potential by identifying a deletion in the promoter of H2-Ea as the molecular change that strongly contributes to setting the ratio of CD4+ and CD8+ lymphocytes.
We show that commercially available mice are a resource for detecting single genes by genome-wide association. We surveyed 66 populations and identified those with properties conducive to high-resolution mapping. Importantly, we show that the same alleles contribute to variation in different colonies, so that when mapping progress stalls in one colony, another can be used in its stead. As a proof of principle, we detect the same QTL in different colonies influencing CD4+/CD8+ ratios and refine this mapping to the gene level. We show that a deletion in the promoter of H2-Ea is the molecular change that strongly contributes to setting the ratio of CD4+ and CD8+ lymphocytes. Our results make it possible for geneticists to make informed choices on the use of colonies for genome-wide association studies of complex traits in mice.
The 1001 Genomes project for Arabidopsis thaliana could provide an enormous boost for plant research for a modest financial investment.
We advocate here a 1001 Genomes project for Arabidopsis thaliana, the workhorse of plant genetics, which will provide an enormous boost for plant research with a modest financial investment.
A number of tools for the examination of linkage disequilibrium (LD) patterns between nearby alleles exist, but none are available for quickly and easily investigating LD at longer ranges (>500 kb). We have developed a web-based query tool (GLIDERS: Genome-wide LInkage DisEquilibrium Repository and Search engine) that enables the retrieval of pairwise associations with r² ≥ 0.3 across the human genome for any SNP genotyped within HapMap phase 2 and 3, regardless of distance between the markers.
GLIDERS is an easy-to-use web tool that only requires the user to enter the rs numbers of the SNPs for which they want to retrieve genome-wide LD (both nearby and long-range). The intuitive web interface handles manual entry of SNP IDs and also allows users to upload files of SNP IDs. The user can limit the resulting inter-SNP associations with easy-to-use menu options. These include MAF limit (5-45%), distance limits between SNPs (minimum and maximum), r² (0.3 to 1), HapMap population sample (CEU, YRI and JPT+CHB combined) and HapMap build/release. All resulting genome-wide inter-SNP associations are displayed on a single output page, which has a link to a downloadable tab-delimited text file.
GLIDERS is a quick and easy way to retrieve genome-wide inter-SNP associations and to explore LD patterns for any number of SNPs of interest. GLIDERS can be useful in identifying SNPs with long-range LD. This can highlight mis-mapping or other potential association signal localisation problems.
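For readers unfamiliar with the pairwise association measure that GLIDERS reports, the sketch below computes the standard squared-correlation LD statistic from phased two-SNP haplotypes and applies filters of the kind the menu options describe. It is not GLIDERS code: the function names, the 0/1 allele coding and the default thresholds are assumptions made for illustration.

```python
def ld_r2(haplotypes):
    """Squared correlation between two biallelic SNPs.

    haplotypes: list of (a, b) pairs, one per phased chromosome,
                with alleles coded 0/1 at SNP A and SNP B (assumed coding).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n           # frequency of allele 1 at A
    p_b = sum(b for _, b in haplotypes) / n           # frequency of allele 1 at B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")                           # undefined for a monomorphic SNP
    d = p_ab - p_a * p_b                              # LD coefficient D
    return d * d / denom


def passes_filters(r2, distance_bp, maf,
                   min_r2=0.3, min_distance_bp=500_000, min_maf=0.05):
    """Illustrative filter: keep pairs that are strong, distant and common enough."""
    return r2 >= min_r2 and distance_bp >= min_distance_bp and maf >= min_maf


perfect = ld_r2([(0, 0), (1, 1), (0, 0), (1, 1)])      # alleles always co-occur
independent = ld_r2([(0, 0), (0, 1), (1, 0), (1, 1)])  # no association
```

A pair in strong LD despite a large physical separation is the kind of signal that may point to mis-mapping, as noted above.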
The mammalian epidermis is a continually renewing structure that provides the interface between the organism and an innately hostile environment. The keratinocyte is its principal cell. Keratinocyte proteins form a physical epithelial barrier, protect against microbial damage, and prepare immune responses to danger. Epithelial immunity is disordered in many common diseases and disordered epithelial differentiation underlies many cancers. In order to identify the genes that mediate epithelial development we used a tissue model of the skin derived from primary human keratinocytes. We measured global gene expression in triplicate at five times over the ten days that the keratinocytes took to fully differentiate. We identified 1282 gene transcripts that significantly changed during differentiation (false discovery rate <0.01%). We robustly grouped these transcripts by K-means clustering into modules with distinct temporal expression patterns, shared regulatory motifs, and biological functions. We found a striking cluster of late expressed genes that form the structural and innate immune defences of the epithelial barrier. Gene Ontology analyses showed that undifferentiated keratinocytes were characterised by genes for motility and the adaptive immune response. We systematically identified calcium-binding genes, which may operate with the epidermal calcium gradient to control keratinocyte division during skin repair. The results provide multiple novel insights into keratinocyte biology, in particular providing a comprehensive list of known and previously unrecognised major components of the epidermal barrier. The findings provide a reference for subsequent understanding of how the barrier functions in health and disease.
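As a minimal sketch of the clustering step, the code below runs plain Lloyd-style K-means on short temporal expression profiles (one value per time point). It does not reproduce the robustness procedure mentioned in the abstract, and the toy profiles, function name and parameters are assumptions for illustration.

```python
import random


def kmeans(profiles, k, n_iter=50, seed=0):
    """Plain K-means on equal-length numeric profiles.

    Returns (labels, centroids). Initial centroids are k profiles
    drawn at random with a fixed seed so the run is repeatable.
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(profiles, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in profiles
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break                      # assignments stable: converged
        labels = new_labels
    return labels, centroids


# Toy profiles over five time points: two early-expressed, two late-expressed.
toy = [
    [5.0, 4.0, 3.0, 2.0, 1.0],
    [5.1, 4.0, 3.0, 2.0, 1.0],
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.0, 2.0, 3.0, 4.0, 5.1],
]
labels, centroids = kmeans(toy, k=2)
```

In a real analysis the profiles would typically be standardised per gene first, so that clusters reflect the shape of the time course rather than absolute expression level.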
This review deals with the pharmacological properties of an alkylated monosaccharide mimetic, N-butyldeoxynojirimycin (NB-DNJ). This compound is of pharmacogenetic interest because one of its biological effects in mice – impairment of spermatogenesis, leading to male infertility – depends greatly on the genetic background of the animal. In susceptible mice, administration of NB-DNJ perturbs the formation of an organelle, the acrosome, in early post-meiotic male germ cells. In all recipient mice, irrespective of reproductive phenotype, NB-DNJ has a similar biochemical effect: inhibition of the glucosylceramidase β-glucosidase 2 and subsequent elevation of glucosylceramide, a glycosphingolipid. The questions that we now need to address are: how can glucosylceramide specifically affect early acrosome formation, and why is this contingent on genetic factors? Here we discuss relevant aspects of reproductive biology, the metabolism and cell biology of sphingolipids, and complex trait analysis; we also present a speculative model that takes our observations into account.
acrosome; glucosylceramide; glycosphingolipid; imino sugar; semen parameters; sperm morphology; spermatid; spermatogenesis
Identifying natural allelic variation that underlies quantitative trait variation remains a fundamental problem in genetics. Most studies have employed either simple synthetic populations with restricted allelic variation or performed association mapping on a sample of naturally occurring haplotypes. Both of these approaches have some limitations, therefore alternative resources for the genetic dissection of complex traits continue to be sought. Here we describe one such alternative, the Multiparent Advanced Generation Inter-Cross (MAGIC). This approach is expected to improve the precision with which QTL can be mapped, improving the outlook for QTL cloning. Here, we present the first panel of MAGIC lines developed: a set of 527 recombinant inbred lines (RILs) descended from a heterogeneous stock of 19 intermated accessions of the plant Arabidopsis thaliana. These lines and the 19 founders were genotyped with 1,260 single nucleotide polymorphisms and phenotyped for development-related traits. Analytical methods were developed to fine-map quantitative trait loci (QTL) in the MAGIC lines by reconstructing the genome of each line as a mosaic of the founders. We show by simulation that QTL explaining 10% of the phenotypic variance will be detected in most situations with an average mapping error of about 300 kb, and that if the number of lines were doubled the mapping error would be under 200 kb. We also show how the power to detect a QTL and the mapping accuracy vary, depending on QTL location. We demonstrate the utility of this new mapping population by mapping several known QTL with high precision and by finding novel QTL for germination data and bolting time. Our results provide strong support for similar ongoing efforts to produce MAGIC lines in other organisms.
Most traits of economic and evolutionary interest vary quantitatively and have multiple genes affecting their expression. Dissecting the genetic basis of such traits is crucial for the improvement of crops and management of diseases. Here, we develop a new resource to identify genes underlying such quantitative traits in Arabidopsis thaliana, a genetic model organism in plants. We show that using a large population of inbred lines derived from intercrossing 19 parents, we can localize the genes underlying quantitative traits better than with existing methods. Using these lines, we were able to replicate the identification of previously known genes that affect developmental traits in A. thaliana and identify some new ones. This paper also presents all the biological and computational material necessary for the scientific community to use these lines in their own research. Our results suggest that the use of lines derived from a multiparent advanced generation inter-cross (MAGIC lines) should be very useful in other organisms.
I survey the state of the art in complex trait analysis, including the use of new experimental and computational technologies and resources becoming available, and the challenges facing us. I also discuss how the prospects of rodent model systems compare with association mapping in humans.
complex traits; genetic mapping; association; human genetics; mouse genetics; quantitative trait locus
High-resolution genetic maps are required for mapping complex traits and for the study of recombination. We report the highest density genetic map yet created for any organism, except humans. Using more than 10,000 single nucleotide polymorphisms evenly spaced across the mouse genome, we have constructed genetic maps for both outbred and inbred mice, and separately for males and females. Recombination rates are highly correlated in outbred and inbred mice, but show relatively low correlation between males and females. Differences between male and female recombination maps and the sequence features associated with recombination are strikingly similar to those observed in humans. Genetic maps are available from http://gscan.well.ox.ac.uk/#genetic_map and as supporting information to this publication.
A high-density SNP map based on outbred and inbred mice with male and female separation suggests a high degree of homology between mouse and human recombination.
Large-scale genetic mapping projects require data management systems that can handle complex phenotypes and detect and correct high-throughput genotyping errors, yet are easy to use.
We have developed an Integrated Genotyping System (IGS) to meet this need. IGS securely stores, edits and analyses genotype and phenotype data. It stores information about DNA samples, plates, primers, markers and genotypes generated by a genotyping laboratory. Data are structured so that statistical genetic analysis of both case-control and pedigree data is straightforward.
IGS can model complex phenotypes and contain genotypes from whole genome association studies. The database makes it possible to integrate genetic analysis with data curation. The IGS web site contains further information.
Comparative genome hybridization (CGH) to DNA microarrays (array CGH) is a technique capable of detecting deletions and duplications in genomes at high resolution. However, array CGH studies of the human genome noting false negative and false positive results using large insert clones as probes have raised important concerns regarding the suitability of this approach for clinical diagnostic applications. Here, we adapt the Smith–Waterman dynamic-programming algorithm to provide a sensitive and robust analytic approach (SW-ARRAY) for detecting copy-number changes in array CGH data. In a blind series of hybridizations to arrays consisting of the entire tiling path for the terminal 2 Mb of human chromosome 16p, the method identified all monosomies between 267 and 1567 kb with a high degree of statistical significance and accurately located the boundaries of deletions in the range 267–1052 kb. The approach is unique in offering both a nonparametric segmentation procedure and a nonparametric test of significance. It is scalable and well-suited to high resolution whole genome array CGH studies that use array probes derived from large insert clones as well as PCR products and oligonucleotides.
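The core idea of a Smith–Waterman style scan on one-dimensional array data can be sketched as follows: each probe's log ratio is turned into a score that is positive when it supports a deletion, and the recursion S_i = max(0, S_{i-1} + s_i) finds the highest-scoring contiguous run of probes. This is a simplified illustration, not the published SW-ARRAY implementation. The fixed threshold, the permutation scheme and all function names are assumptions.

```python
import random


def deletion_scores(log_ratios, threshold):
    """Score is positive when a log ratio falls below -threshold (assumed scoring)."""
    return [-x - threshold for x in log_ratios]


def best_island(scores):
    """Highest-scoring contiguous segment as (score, start, end), end exclusive."""
    best = (0.0, 0, 0)
    running, start = 0.0, 0
    for i, s in enumerate(scores):
        if running == 0.0:
            start = i                          # a new candidate island begins here
        running = max(0.0, running + s)        # Smith-Waterman recursion
        if running > best[0]:
            best = (running, start, i + 1)
    return best


def permutation_p_value(scores, n_perm=1000, seed=0):
    """Nonparametric significance: how often shuffled scores do at least as well."""
    rng = random.Random(seed)
    observed = best_island(scores)[0]
    shuffled = list(scores)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if best_island(shuffled)[0] >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy log2 ratios with an apparent single-copy loss at probes 2-4.
ratios = [0.0, 0.1, -0.9, -1.1, -1.0, 0.05, -0.1]
scores = deletion_scores(ratios, threshold=0.5)
island = best_island(scores)
p_value = permutation_p_value(scores)
```

Because both the segmentation and the significance test rely only on the ordering and shuffling of observed scores, neither step assumes a particular noise distribution.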
We present a general high-throughput approach to accurately quantify DNA–protein interactions, which can facilitate the identification of functional genetic polymorphisms. The method tested here on two structurally distinct transcription factors (TFs), NF-κB and OCT-1, comprises three steps: (i) optimized selection of DNA variants to be tested experimentally, which we show is superior to selecting variants at random; (ii) a quantitative protein–DNA binding assay using microarray and surface plasmon resonance technologies; (iii) prediction of binding affinity for all DNA variants in the consensus space using a statistical model based on principal coordinates analysis. For the protein–DNA binding assay, we identified a polyacrylamide/ester glass activation chemistry which formed exclusive covalent bonds with 5′-amino-modified DNA duplexes and hindered non-specific electrostatic attachment of DNA. Full accessibility of the DNA duplexes attached to polyacrylamide-modified slides was confirmed by the high degree of data correlation with the electromobility shift assay (correlation coefficient 93%). This approach offers the potential for high-throughput determination of TF binding profiles and predicting the effects of single nucleotide polymorphisms on TF binding affinity. New DNA binding data for OCT-1 are presented.
To understand the causal basis of TNF associations with disease, it is necessary to understand the haplotypic structure of this locus. The TNF locus in Gambian and Malawian human samples is haplotypically diverse and has a rich history of intragenic recombination. As a consequence, a large proportion of TNF single nucleotide polymorphisms (SNPs) must be typed to detect a disease-modifying SNP at this locus. The most informative subset of SNPs to genotype differs between the two populations.
To understand the causal basis of TNF associations with disease, it is necessary to understand the haplotypic structure of this locus. We genotyped 12 single-nucleotide polymorphisms (SNPs) distributed over 4.3 kilobases in 296 healthy, unrelated Gambian and Malawian adults. We generated 592 high-quality haplotypes by integrating family- and population-based reconstruction methods.
We found 32 different haplotypes, of which 13 were shared between the two populations. Both populations were haplotypically diverse (gene diversity = 0.80, Gambia; 0.85, Malawi) and significantly differentiated (p < 10⁻⁵ by exact test). More than a quarter of marker pairs showed evidence of intragenic recombination (29% Gambia; 27% Malawi). We applied two new methods of analyzing haplotypic data: association efficiency analysis (AEA), which describes the ability of each SNP to detect every other SNP in a case-control scenario; and the entropy maximization method (EMM), which selects the subset of SNPs that most effectively dissects the underlying haplotypic structure. AEA revealed that many SNPs in TNF are poor markers of each other. The EMM showed that 8 of 12 SNPs (Gambia) and 7 of 12 SNPs (Malawi) are required to describe 95% of the haplotypic diversity.
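Two of the quantities above can be illustrated with a short sketch. Gene diversity is computed here with Nei's unbiased estimator, which is an assumption because the abstract does not name the estimator. The subset selection is a simple greedy search on Shannon entropy that stands in for the idea behind the EMM and is not the published procedure. The haplotype strings and function names are hypothetical.

```python
from collections import Counter
from math import log2


def gene_diversity(haplotypes):
    """Nei's unbiased estimator: n/(n-1) * (1 - sum of squared haplotype frequencies)."""
    n = len(haplotypes)
    freqs = [count / n for count in Counter(haplotypes).values()]
    return n / (n - 1) * (1 - sum(f * f for f in freqs))


def haplotype_entropy(haplotypes, snp_indices):
    """Shannon entropy (bits) of haplotypes restricted to the given SNP positions."""
    n = len(haplotypes)
    counts = Counter(tuple(h[i] for i in snp_indices) for h in haplotypes)
    return -sum(c / n * log2(c / n) for c in counts.values())


def greedy_snp_subset(haplotypes, fraction=0.95):
    """Add SNPs one at a time, each time taking the one that raises entropy most,
    until the subset captures the requested fraction of the full entropy."""
    n_snps = len(haplotypes[0])
    full = haplotype_entropy(haplotypes, range(n_snps))
    chosen = []
    while haplotype_entropy(haplotypes, chosen) < fraction * full:
        remaining = [i for i in range(n_snps) if i not in chosen]
        best = max(remaining,
                   key=lambda i: haplotype_entropy(haplotypes, chosen + [i]))
        chosen.append(best)
    return chosen


# Hypothetical two-SNP haplotypes, one string per chromosome.
haps = ["00", "00", "01", "11"]
diversity = gene_diversity(haps)
subset = greedy_snp_subset(haps)
```

A greedy search is not guaranteed to find the smallest sufficient subset; with only 12 SNPs an exhaustive search over subsets would also be feasible.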
The TNF locus in the Gambian and Malawian samples is haplotypically diverse and has a rich history of intragenic recombination. As a consequence, a large proportion of TNF SNPs must be typed to detect a disease-modifying SNP at this locus. The most informative subset of SNPs to genotype differs between the two populations.