Although there has been much success in identifying genetic variants associated with common diseases using genome-wide association studies (GWAS)1, it has been difficult to demonstrate which variants are causal and what role they play in disease. Moreover, the modest contribution these variants make to disease risk has raised questions regarding their medical relevance2. We have investigated a single nucleotide polymorphism (SNP) in the TNFRSF1A gene, which encodes TNF receptor 1 (TNFR1). This SNP was discovered through GWAS to be associated with multiple sclerosis (MS)3,4, but not with other autoimmune conditions such as rheumatoid arthritis (RA)5, psoriasis6 and Crohn’s disease7. By analyzing MS GWAS3,4 data in conjunction with the 1000 Genomes Project data8 we provide genetic evidence that strongly implicates this SNP, rs1800693, as the causal variant in the TNFRSF1A region. We further substantiate this through functional studies showing that the MS risk allele directs expression of a novel, soluble form of TNFR1 that can block TNF. Importantly, TNF blocking drugs can promote onset or exacerbation of MS9-11, but they have proven highly efficacious in the treatment of autoimmune diseases for which there is no association with rs1800693. This indicates that the clinical experience with these drugs parallels the disease association of rs1800693, and that the MS-associated TNFR1 variant mimics the effect of TNF blocking drugs. Hence, our study demonstrates that clinical practice can be informed by comparing GWAS across common autoimmune diseases and by investigating the functional consequences of the disease-associated genetic variation.
Isolated populations are emerging as a powerful study design in the search for low-frequency and rare variant associations with complex phenotypes. Here we genotype 2,296 samples from two isolated Greek populations, the Pomak villages (HELIC-Pomak) in the North of Greece and the Mylopotamos villages (HELIC-MANOLIS) in Crete. We compare their genomic characteristics to the general Greek population and establish them as genetic isolates. In the MANOLIS cohort, we observe an enrichment of missense variants among the variants that have drifted up in frequency by more than fivefold. In the Pomak cohort, we find novel associations at variants on chr11p15.4 showing large allele frequency increases (from 0.2% in the general Greek population to 4.6% in the isolate) with haematological traits, for example, with mean corpuscular volume (rs7116019, P = 2.3 × 10⁻²⁶). We replicate this association in a second set of Pomak samples (combined P = 2.0 × 10⁻³⁶). We demonstrate significant power gains in detecting medical trait associations.
Isolated populations can increase power to detect low frequency and rare risk variants associated with complex phenotypes. Here, the authors identify variants associated with haematological traits in two isolated Greek populations that would be difficult to detect in the general population, due to their low frequency.
Large whole-genome sequencing projects have provided access to much rare variation in human populations, which is highly informative about population structure and recent demography. Here, we show how the age of rare variants can be estimated from patterns of haplotype sharing and how these ages can be related to historical relationships between populations. We investigate the distribution of the age of variants occurring exactly twice (f2 variants) in a worldwide sample sequenced by the 1000 Genomes Project, revealing enormous variation across populations. The median age of haplotypes carrying f2 variants is 50 to 160 generations across populations within Europe or Asia, and 170 to 320 generations within Africa. Haplotypes shared between continents are much older, with median ages for haplotypes shared between Europe and Asia ranging from 320 to 670 generations. The distribution of the ages of haplotypes is informative about their demography, revealing recent bottlenecks, ancient splits, and more modern connections between populations. We see the effect of selection in the observation that functional variants are significantly younger than nonfunctional variants of the same frequency. This approach is relatively insensitive to mutation rate and complements other nonparametric methods for demographic inference.
In this paper we describe a method for estimating the age of rare genetic variants. These ages are highly informative about the extent and dates of connections between populations. Variants in closely related populations generally arose more recently than variants of the same frequency in more diverged populations. Therefore, comparing the ages of variants shared across different populations allows us to infer the dates of demographic events like population splits and bottlenecks. We also see that rare functional variants shared within populations tend to have more recent origins than nonfunctional variants, which is consistent with the effects of natural selection.
Using the ImmunoChip custom genotyping array, we analysed 14,498 multiple sclerosis subjects and 24,091 healthy controls for 161,311 autosomal variants and identified 135 potentially associated regions (p-value < 1.0 × 10⁻⁴). In a replication phase, we combined these data with previous genome-wide association study (GWAS) data from an independent 14,802 multiple sclerosis subjects and 26,703 healthy controls. In these 80,094 individuals of European ancestry we identified 48 new susceptibility variants (p-value < 5.0 × 10⁻⁸), three of which were found after conditioning on previously identified variants. Thus, there are now 110 established multiple sclerosis risk variants in 103 discrete loci outside of the Major Histocompatibility Complex. With high resolution Bayesian fine-mapping, we identified five regions where one variant accounted for more than 50% of the posterior probability of association. This study enhances the catalogue of multiple sclerosis risk variants and illustrates the value of fine-mapping in the resolution of GWAS signals.
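The statement that one variant can account for more than 50% of the posterior probability of association can be illustrated with a minimal sketch. It assumes, purely for illustration, exactly one causal variant per region and equal prior probability for every variant; the function names and the log10 Bayes factors below are hypothetical and are not taken from the study.

```python
def posterior_probabilities(log10_bayes_factors):
    """Per-variant posterior probability of being the causal variant.

    Assumption (not from the abstract): exactly one causal variant in the
    region and a flat prior across variants, so each posterior is the
    variant's Bayes factor divided by the sum of all Bayes factors.
    """
    # Subtract the maximum before exponentiating to avoid overflow.
    m = max(log10_bayes_factors)
    weights = [10 ** (x - m) for x in log10_bayes_factors]
    total = sum(weights)
    return [w / total for w in weights]


def credible_set(posteriors, coverage=0.95):
    """Smallest set of variant indices whose posteriors sum to >= coverage."""
    order = sorted(range(len(posteriors)), key=lambda i: posteriors[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += posteriors[i]
        if cumulative >= coverage:
            break
    return chosen


# Made-up log10 Bayes factors for four variants in one hypothetical region.
example_log10_bf = [6.0, 5.0, 3.0, 2.0]
example_pp = posterior_probabilities(example_log10_bf)
example_set = credible_set(example_pp)
```

In this toy region the first variant carries most of the posterior mass, which is the situation the abstract describes for five regions.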
PRDM9 directs human meiotic crossover hotspots to intergenic sequence motifs, whereas budding yeast hotspots overlap low nucleosome density regions in gene promoters. To investigate hotspots in plants, which lack PRDM9, we used coalescent analysis of Arabidopsis genetic variation. Crossovers increase towards gene promoters and terminators, and hotspots are associated with active chromatin modifications, including H2A.Z, histone H3K4me3, low nucleosome density and low DNA methylation. Hotspot-enriched A-rich and CTT-repeat DNA motifs occur upstream and downstream of transcriptional start respectively. Crossovers are asymmetric around promoters and highest over CTT-motifs and H2A.Z-nucleosomes. Pollen-typing, segregation and cytogenetic analysis show decreased crossovers in the arp6 H2A.Z deposition mutant, at multiple scales. During meiosis H2A.Z and DMC1/RAD51 recombinases form overlapping chromosomal foci. As arp6 reduces DMC1/RAD51 foci, H2A.Z may promote formation or processing of meiotic DNA double-strand breaks. We propose that gene chromatin ancestrally designates hotspots within eukaryotes and PRDM9 is a derived state within vertebrates.
In severe early-onset epilepsy, precise clinical and molecular genetic diagnosis is complex, as many metabolic and electro-physiological processes have been implicated in disease causation. The clinical phenotypes share many features such as complex seizure types and developmental delay. Molecular diagnosis has historically been confined to sequential testing of candidate genes known to be associated with specific sub-phenotypes, but the diagnostic yield of this approach can be low. We conducted whole-genome sequencing (WGS) on six patients with severe early-onset epilepsy who had previously been refractory to molecular diagnosis, and their parents. Four of these patients had a clinical diagnosis of Ohtahara Syndrome (OS) and two patients had severe non-syndromic early-onset epilepsy (NSEOE). In two OS cases, we found de novo non-synonymous mutations in the genes KCNQ2 and SCN2A. In a third OS case, WGS revealed paternal isodisomy for chromosome 9, leading to identification of the causal homozygous missense variant in KCNT1, which produced a substantial increase in potassium channel current. The fourth OS patient had a recessive mutation in PIGQ that led to exon skipping and defective glycophosphatidyl inositol biosynthesis. The two patients with NSEOE had likely pathogenic de novo mutations in CBL and CSNK1G1, respectively. Mutations in these genes were not found among 500 additional individuals with epilepsy. This work reveals two novel genes for OS, KCNT1 and PIGQ. It also uncovers unexpected genetic mechanisms and emphasizes the power of WGS as a clinical tool for making molecular diagnoses, particularly for highly heterogeneous disorders.
Summary: We have developed a software package, Cortex, designed for the analysis of genetic variation by de novo assembly of multiple samples. This allows direct comparison of samples without using a reference genome as intermediate and incorporates discovery and genotyping of single-nucleotide polymorphisms, indels and larger events in a single framework. We introduce pipelines which simplify the analysis of microbial samples and increase discovery power; these also enable the construction of a graph of known sequence and variation in a species, against which new samples can be compared rapidly. We demonstrate the ease-of-use and power by reproducing the results of studies using both long and short reads.
Availability: http://cortexassembler.sourceforge.net (GPLv3 license).
Supplementary information: Supplementary data are available at Bioinformatics online.
Although present in both humans and chimpanzees, recombination hotspots, at which meiotic cross-over events cluster, differ markedly in their genomic location between the species. We report that a 13-bp sequence motif previously associated with the activity of 40% of human hotspots does not function in chimpanzee, and is being removed by self-destructive drive in the human lineage. Multiple lines of evidence suggest that the rapidly evolving zinc-finger protein PRDM9 binds to this motif and that sequence changes in the protein may be responsible for hotspot differences between species. The involvement of PRDM9, which causes Histone H3 Lysine 4 trimethylation, implicates a common mechanism for recombination hotspots in eukaryotes but raises questions about what forces have driven such rapid change.
The rates of escape and reversion in response to selection pressure arising from the host immune system, notably the cytotoxic T-lymphocyte (CTL) response, are key factors determining the evolution of HIV. Existing methods for estimating these parameters from cross-sectional population data using ordinary differential equations (ODEs) ignore information about the genealogy of sampled HIV sequences, which has the potential to cause systematic bias and overestimate certainty. Here, we describe an integrated approach, validated through extensive simulations, which combines genealogical inference and epidemiological modelling to estimate rates of CTL escape and reversion in HIV epitopes. We show that there is substantial uncertainty about rates of viral escape and reversion from cross-sectional data, which arises from the inherent stochasticity in the evolutionary process. By application to empirical data, we find that point estimates of rates from a previously published ODE model and the integrated approach presented here are often similar, but can also differ several-fold depending on the structure of the genealogy. The model-based approach we apply provides a framework for the statistical analysis and hypothesis testing of escape and reversion in population data and highlights the need for longitudinal and denser cross-sectional sampling to enable accurate estimation of these key parameters.
phylodynamics; HIV; escape; genealogy; peeling; cytotoxic T-lymphocyte
Adaptor protein-2 (AP2), a central component of clathrin-coated vesicles (CCVs), is pivotal in clathrin-mediated endocytosis, which internalises plasma membrane constituents such as G protein-coupled receptors (GPCRs)1-3. AP2, a heterotetramer of alpha, beta, mu and sigma subunits, links clathrin to vesicle membranes and binds to tyrosine-based and dileucine-based motifs of membrane-associated cargo proteins1,4. Here, we show that AP2 sigma subunit (AP2S1) missense mutations, which all involved the Arg15 residue (Arg15Cys, Arg15His and Arg15Leu) that forms key contacts with dileucine-based motifs of CCV cargo proteins4, result in familial hypocalciuric hypercalcemia type 3 (FHH3), an extracellular-calcium homeostasis disorder affecting parathyroids, kidneys and bone5-7. These AP2S1 mutations occurred in >20% of FHH patients without calcium-sensing GPCR (CaSR) mutations, which cause FHH1 (refs. 8-12). AP2S1 mutations decreased the sensitivity of CaSR-expressing cells to extracellular calcium and reduced CaSR endocytosis, likely through a loss of interaction with a C-terminus CaSR dileucine-based motif whose disruption also decreased intracellular signalling. Thus, our results reveal a new role for AP2 in extracellular-calcium homeostasis.
Efforts to identify the genetic basis of human adaptations from polymorphism data have sought footprints of “classic selective sweeps”. Yet it remains unknown whether this form of natural selection was common in our evolution. We examined the evidence for classic sweeps in resequencing data from 179 human genomes. As expected under a recurrent sweep model, diversity levels decrease near exons and conserved non-coding regions. In contrast to expectation, however, the trough in diversity around human-specific amino acid substitutions is no more pronounced than around synonymous substitutions. Moreover, relative to the genome background, amino acid and putative regulatory sites are not significantly enriched for alleles that are highly differentiated between populations. These findings indicate that classic sweeps were not a dominant mode of adaptation over the past ~250,000 years.
The var genes of the human malaria parasite Plasmodium falciparum are highly polymorphic loci coding for the erythrocyte membrane proteins 1 (PfEMP1), which are responsible for the cytoadherence of P. falciparum infected red blood cells to the human vasculature. Cytoadhesion, coupled with differential expression of var genes, contributes to virulence and allows the parasite to establish chronic infections by evading detection by the host’s immune system. Although studying genetic diversity is a major focus of recent work on the var genes, little is known about the gene family's origin and evolutionary history.
Using a novel hidden Markov model-based approach and var sequences assembled from additional isolates and species, we are able to reveal elements of both the early evolution of the var genes and recent diversifying events. We compare sequences of the var gene DBLα domains from divergent isolates of P. falciparum (3D7 and HB3), and a closely-related species, Plasmodium reichenowi. We find that the gene family is equally large in P. reichenowi and P. falciparum, with a minimum of 51 var genes in the P. reichenowi genome (compared to 61 in 3D7 and a minimum of 48 in HB3). In addition, we are able to define large, continuous blocks of homologous sequence among P. falciparum and P. reichenowi var gene DBLα domains. These results reveal that the contemporary structure of the var gene family was present before the divergence of P. falciparum and P. reichenowi, estimated to be between 2.5 and 6 million years ago. We also reveal that recombination has played an important and traceable role in both the establishment and the maintenance of diversity in the sequences.
Despite the remarkable diversity and rapid evolution found in these loci within and among P. falciparum populations, the basic structure of these domains and the gene family is surprisingly old and stable. Revealing a common structure as well as conserved sequence between two species also has implications for developing new primate-parasite models for studying the pathology and immunology of falciparum malaria, and for studying the population genetics of var genes and associated virulence phenotypes.
Non-allelic homologous recombination; Hidden Markov-model; var genes; Malaria; PfEMP1; Gene family evolution; Balancing selection
Inferring the nature and magnitude of selection is an important problem in many biological contexts. Typically when estimating a selection coefficient for an allele, it is assumed that samples are drawn from a panmictic population and that selection acts uniformly across the population. However, these assumptions are rarely satisfied. Natural populations are almost always structured, and selective pressures are likely to act differentially. Inference about selection ought therefore to take account of structure. We do this by considering evolution in a simple lattice model of spatial population structure. We develop a hidden Markov model based maximum-likelihood approach for estimating the selection coefficient in a single population from time series data of allele frequencies. We then develop an approximate extension of this to the structured case to provide a joint estimate of migration rate and spatially varying selection coefficients. We illustrate our method using classical data sets of moth pigmentation morph frequencies, but it has wide applications in settings ranging from ecology to human evolution.
selection; spatial structure; population structure; allele frequencies; time series
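The single-population estimator described above can be sketched as a small hidden Markov model. This is a toy illustration under stated assumptions, not the authors' implementation: the hidden state is the allele count in a Wright-Fisher population of assumed size `pop_size`, selection is genic (haploid) with coefficient `s`, the initial count has a flat prior, and the maximum is found by grid search. All names and the example counts are hypothetical, and the spatially structured extension is not covered.

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def transition_matrix(pop_size, s):
    """Wright-Fisher transitions between allele counts under genic selection."""
    rows = []
    for i in range(pop_size + 1):
        p = i / pop_size
        p_sel = p * (1 + s) / (1 + s * p)  # frequency after selection (needs s > -1)
        rows.append([binom_pmf(j, pop_size, p_sel) for j in range(pop_size + 1)])
    return rows


def log_likelihood(s, samples, pop_size):
    """Forward algorithm; samples is a list of (generation, count, sample_size)."""
    trans = transition_matrix(pop_size, s)
    n_states = pop_size + 1
    alpha = [1.0 / n_states] * n_states  # flat prior on the initial allele count
    loglik, gen = 0.0, samples[0][0]
    for t, k, n in samples:
        # Advance the hidden allele count one generation at a time.
        for _ in range(t - gen):
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n_states))
                for j in range(n_states)
            ]
        gen = t
        # Weight each hidden state by the probability of the observed sample.
        alpha = [a * binom_pmf(k, n, i / pop_size) for i, a in enumerate(alpha)]
        norm = sum(alpha)
        loglik += math.log(norm)
        alpha = [a / norm for a in alpha]
    return loglik


def estimate_s(samples, pop_size, grid):
    """Grid-search maximum-likelihood estimate of the selection coefficient."""
    return max(grid, key=lambda s: log_likelihood(s, samples, pop_size))


# Made-up time series: (generation, allele count, sample size).
example_samples = [(0, 2, 20), (5, 6, 20), (10, 12, 20), (15, 17, 20)]
s_grid = [i / 10 for i in range(-5, 6)]
s_hat = estimate_s(example_samples, pop_size=30, grid=s_grid)
```

The grid search keeps the sketch short; a numerical optimiser could be substituted, and a realistic analysis would need a much larger population size and a finer grid.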
Statistical imputation of classical HLA alleles in case-control studies has become established as a valuable tool for identifying and fine-mapping signals of disease association in the MHC. Imputation into diverse populations has, however, remained challenging, mainly because of the additional haplotypic heterogeneity introduced by combining reference panels of different sources. We present an HLA type imputation model, HLA*IMP:02, designed to operate on a multi-population reference panel. HLA*IMP:02 is based on a graphical representation of haplotype structure. We present a probabilistic algorithm to build such models for the HLA region, accommodating genotyping error, haplotypic heterogeneity and the need for maximum accuracy at the HLA loci, generalizing the work of Browning and Browning (2007) and Ron et al. (1998). HLA*IMP:02 achieves an average 4-digit imputation accuracy on diverse European panels of 97% (call rate 97%). On non-European samples, 2-digit performance is over 90% for most loci and ethnicities where data are available. HLA*IMP:02 supports imputation of HLA-DPB1 and HLA-DRB3-5, is highly tolerant of missing data in the imputation panel and works on standard genotype data from popular genotyping chips. It is publicly available in source code and as a user-friendly web service framework.
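Accuracy and call rate are reported together because an imputation call is typically made only when its posterior probability clears a threshold. The sketch below shows one common way to compute the two quantities; the threshold of 0.7, the function name and the example alleles are assumptions for illustration and are not taken from HLA*IMP:02.

```python
def call_rate_and_accuracy(calls, truth, threshold=0.7):
    """Summarise imputation performance.

    calls: list of (imputed_allele, posterior_probability) pairs.
    truth: list of true alleles, in the same order.
    A call counts only if its posterior reaches the (assumed) threshold;
    accuracy is measured among the calls that were made.
    """
    called = [(a, t) for (a, p), t in zip(calls, truth) if p >= threshold]
    call_rate = len(called) / len(calls)
    if not called:
        return call_rate, float("nan")
    accuracy = sum(a == t for a, t in called) / len(called)
    return call_rate, accuracy


# Made-up imputed alleles with posterior probabilities, and made-up truth.
example_calls = [("A*02:01", 0.99), ("A*01:01", 0.95), ("A*03:01", 0.60), ("A*24:02", 0.80)]
example_truth = ["A*02:01", "A*01:01", "A*03:01", "A*11:01"]
example_rate, example_accuracy = call_rate_and_accuracy(example_calls, example_truth)
```

Raising the threshold usually lowers the call rate and raises accuracy, which is why the two figures need to be read as a pair.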
The human leukocyte antigen (HLA) proteins influence how pathogens and components of body cells are presented to immune cells. It has long been known that they are highly variable and that this variation is associated with differential risk for autoimmune and infectious diseases. Variant frequencies differ substantially between and even within continents. Determining HLA genotypes is thus an important part of many studies to understand the genetic basis of disease risk. However, conventional methods for HLA typing (e.g. targeted sequencing, hybridisation, amplification) are typically laborious and expensive. We have developed a method for inferring an individual's HLA genotype based on evaluating genetic information from nearby variable sites that are more easily assayed, which aims to integrate heterogeneous data. We introduce two key innovations: we allow for single HLA types to appear on heterogeneous backgrounds of genetic information and we take into account the possibility of genotyping error, which is common within the HLA region. We show that the method is well-suited to deal with multi-population datasets: it enables integrated HLA type inference for individuals of differing ancestry and ethnicity. It will therefore prove useful particularly in international collaborations to better understand disease risks, where samples are drawn from multiple countries.
To study the evolution of recombination rates in apes, we developed methodology to construct a fine-scale genetic map from high throughput sequence data from ten Western chimpanzees, Pan troglodytes verus. Compared to the human genetic map, broad-scale recombination rates tend to be conserved, but with exceptions, particularly in regions of chromosomal rearrangements and around the site of ancestral fusion in human chromosome 2. At fine scales, chimpanzee recombination is dominated by hotspots, which show no overlap with humans even though rates are similarly elevated around CpG islands and decreased within genes. The hotspot-specifying protein PRDM9 shows extensive variation among Western chimpanzees and there is little evidence that any sequence motifs are enriched in hotspots. The contrasting locations of hotspots provide a natural experiment, which demonstrates the impact of recombination on base composition.
Well-powered genome-wide association studies, now possible through advances in technology and large-scale collaborative projects, promise to reveal the contribution of rare variants to complex traits and disease. However, while population structure is a known confounder of association studies, it is unknown whether methods developed to control stratification are equally effective for rare variants. Here we demonstrate that rare variants can show a systematically different and typically stronger stratification than common variants, and that this is not necessarily corrected by existing methods. We show that the same process leads to inflation for load-based tests and can obscure signals at truly associated variants. We show that populations can display spatial structure in rare variants even when FST is low, but that allele-frequency dependent metrics of allele sharing can reveal localized stratification. These results underscore the importance of collecting and integrating spatial information in the genetic analysis of complex traits.
Detecting genetic variants that are highly divergent from a reference sequence remains a major challenge in genome sequencing. We introduce de novo assembly algorithms using colored de Bruijn graphs for detecting and genotyping simple and complex genetic variants in an individual or population. We provide an efficient software implementation, Cortex, the first de novo assembler capable of assembling multiple eukaryote genomes simultaneously. Four applications of Cortex are presented. First, we detect and validate both simple and complex structural variation in a high coverage human genome. Second, we identify over 3Mb of novel sequence in pooled low-coverage population sequence data from the 1000 Genomes Project. Third, we show how population information from 10 chimpanzees enables accurate variant calls without a reference sequence. Finally, we estimate classical HLA genotypes at HLA-B, the most variable gene in the human genome.
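The idea of a colored de Bruijn graph can be illustrated with a toy sketch: every k-mer is a node, and each node records which samples (colors) contain it. The code below is not Cortex and makes simplifying assumptions that are labeled here and in the comments: error-free sequences, no reverse complements, and edges implied by k-1 overlaps. All function names and sequences are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every k-mer of seq in order."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_colored_graph(samples, k):
    """Map each k-mer to the set of sample names (colors) that contain it."""
    colors = defaultdict(set)
    for name, reads in samples.items():
        for read in reads:
            for kmer in kmers(read, k):
                colors[kmer].add(name)
    return colors


def successors(colors, kmer):
    """k-mers present in the graph that overlap kmer by k-1 bases."""
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in colors]


def branch_points(colors):
    """k-mers with more than one successor: candidate starts of variant bubbles."""
    return sorted(km for km in colors if len(successors(colors, km)) > 1)


def private_kmers(colors, name):
    """k-mers seen only in the named sample."""
    return sorted(km for km, c in colors.items() if c == {name})


# Two made-up samples that differ by a single base (A versus T at position 5).
example_samples = {"ref": ["GATTACAGC"], "alt": ["GATTTCAGC"]}
graph = build_colored_graph(example_samples, k=4)
bubble_starts = branch_points(graph)
alt_only = private_kmers(graph, "alt")
```

In this toy graph the single-base difference opens a bubble after the shared k-mer `GATT`, and the two branches carry different colors, which is the signal a variant caller working on such a graph would look for.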
Motivation: Genetic variation at classical HLA alleles influences many phenotypes, including susceptibility to autoimmune disease, resistance to pathogens and the risk of adverse drug reactions. However, classical HLA typing methods are often prohibitively expensive for large-scale studies. We previously described a method for imputing classical alleles from linked SNP genotype data. Here, we present a modification of the original algorithm implemented in a freely available software suite that combines local data preparation and QC with probabilistic imputation through a remote server.
Results: We introduce two modifications to the original algorithm. First, we present a novel SNP selection function that leads to pronounced increases (up by 40% in some scenarios) in call rate. Second, we develop a parallelized model building algorithm that allows us to process a reference set of over 2500 individuals. In a validation experiment, we show that our framework produces highly accurate HLA type imputations at class I and class II loci for independent datasets: at call rates of 95–99%, imputation accuracy is between 92% and 98% at the four-digit level and over 97% at the two-digit level. We demonstrate utility of the method through analysis of a genome-wide association study for psoriasis where there is a known classical HLA risk allele (HLA-C*06:02). We show that the imputed allele shows stronger association with disease than any single SNP within the region. The imputation framework, HLA*IMP, provides a powerful tool for dissecting the architecture of genetic risk within the HLA.
Availability: HLA*IMP, implemented in C++ and Perl, is available from http://oxfordhla.well.ox.ac.uk and is free for academic use.
Supplementary information: Supplementary data are available at Bioinformatics online.
An important paradigm in evolutionary genetics is that of a delicate balance between genetic variants that favorably boost host control of infection but which may unfavorably increase susceptibility to autoimmune disease. Here, we investigated whether patients with psoriasis, a common immune-mediated disease of the skin, are enriched for genetic variants that limit the ability of HIV-1 virus to replicate after infection. We analyzed the HLA class I and class II alleles of 1,727 Caucasian psoriasis cases and 3,581 controls and found that psoriasis patients are significantly more likely than controls to have gene variants that are protective against HIV-1 disease. This includes several HLA class I alleles associated with HIV-1 control; amino acid residues at HLA-B positions 67, 70, and 97 that mediate HIV-1 peptide binding; and the deletion polymorphism rs67384697 associated with high surface expression of HLA-C. We also found that the compound genotype KIR3DS1 plus HLA-B Bw4-80I, which respectively encode a natural killer cell activating receptor and its putative ligand, significantly increased psoriasis susceptibility. This compound genotype has also been associated with delay of progression to AIDS. Together, our results suggest that genetic variants that contribute to anti-viral immunity may predispose to the development of psoriasis.
Individuals with autoimmune disease generally demonstrate excessive immune system activation, leading to inflammation and damage of specific target organs. However, in some cases the detrimental effects of an overactive immune system might be counterbalanced by a beneficial effect in protecting against certain infections. In this study, we investigated whether patients with psoriasis, a common autoimmune disease of the skin, harbor genetic variants that are associated with an enhanced ability to limit replication of the HIV-1 virus. We profiled the HLA (human leukocyte antigen) immune genes located on chromosome 6 in 1,727 Caucasian psoriasis cases and 3,581 healthy controls and found that psoriasis patients are significantly more likely than controls to have gene variants that are protective against HIV-1 disease. We found that this enrichment for HIV-1 protective variants was unique to psoriasis and largely absent in patients with other autoimmune or inflammatory diseases such as rheumatoid arthritis, Crohn's disease, type 1 diabetes, type 2 diabetes, and coronary artery disease. Our results suggest the possibility that the excessive skin inflammation in psoriasis may be associated with activation of anti-viral immune pathways that were important to human ancestors who encountered viruses similar to HIV-1.
Multiple sclerosis (OMIM 126200) is a common disease of the central nervous system in which the interplay between inflammatory and neurodegenerative processes typically results in intermittent neurological disturbance followed by progressive accumulation of disability.1 Epidemiological studies have shown that genetic factors are primarily responsible for the substantially increased frequency of the disease seen in the relatives of affected individuals;2,3 and systematic attempts to identify linkage in multiplex families have confirmed that variation within the Major Histocompatibility Complex (MHC) exerts the greatest individual effect on risk.4 Modestly powered Genome-Wide Association Studies (GWAS)5-10 have enabled more than 20 additional risk loci to be identified and have shown that multiple variants exerting modest individual effects play a key role in disease susceptibility.11 Most of the genetic architecture underlying susceptibility to the disease remains to be defined and is anticipated to require the analysis of sample sizes that are beyond the numbers currently available to individual research groups. In a collaborative GWAS involving 9772 cases of European descent collected by 23 research groups working in 15 different countries, we have replicated almost all of the previously suggested associations and identified at least a further 29 novel susceptibility loci. Within the MHC we have refined the identity of the DRB1 risk alleles and confirmed that variation in the HLA-A gene underlies the independent protective effect attributable to the Class I region. Immunologically relevant genes are significantly over-represented amongst those mapping close to the identified loci and particularly implicate T helper cell differentiation in the pathogenesis of multiple sclerosis.
multiple sclerosis; GWAS; genetics
Genomic structural variants (SVs) are abundant in humans, differing from other variation classes in extent, origin, and functional impact. Despite progress in SV characterization, the nucleotide resolution architecture of most SVs remains unknown. We constructed a map of unbalanced SVs (i.e., copy number variants) based on whole genome DNA sequencing data from 185 human genomes, integrating evidence from complementary SV discovery approaches with extensive experimental validations. Our map encompassed 22,025 deletions and 6,000 additional SVs, including insertions and tandem duplications. Most SVs (53%) were mapped to nucleotide resolution, which facilitated analyzing their origin and functional impact. We examined numerous whole and partial gene deletions with a genotyping approach and observed a depletion of gene disruptions amongst high frequency deletions. Furthermore, we observed differences in the size spectra of SVs originating from distinct formation mechanisms, and constructed a map of SV hotspots formed by common mechanisms. Our analytical framework and SV map serve as a resource for sequencing-based association studies.
Salmonella enterica is a bacterial pathogen that causes enteric fever and gastroenteritis in humans and animals. Although its population structure was long described as clonal, based on high linkage disequilibrium between loci typed by enzyme electrophoresis, recent examination of gene sequences has revealed that recombination plays an important evolutionary role. We sequenced around 10% of the core genome of 114 isolates of enterica using a resequencing microarray. Application of two different analysis methods (Structure and ClonalFrame) to our genomic data allowed us to define five clear lineages within S. enterica subspecies enterica, one of which is five times older than the other four and two thirds of the age of the whole subspecies. We show that some of these lineages display more evidence of recombination than others. We also demonstrate that some level of sexual isolation exists between the lineages, so that recombination has occurred predominantly between members of the same lineage. This pattern of recombination is compatible with expectations from the previously described ecological structuring of the enterica population as well as mechanistic barriers to recombination observed in laboratory experiments. In spite of their relatively low level of genetic differentiation, these lineages might therefore represent incipient species.
Salmonella enterica is a species of bacteria that causes severe diseases in humans and animals. We sequenced about a tenth of the genome from a broadly sampled collection of S. enterica. By comparing these genetic sequences, we were able to partially reconstruct the ancestry of this sample. We identified five lineages within S. enterica, one of which is almost as old as the common ancestor of our sample. We also found evidence for frequent homologous recombination in the ancestry of S. enterica, where fragments of genes from one individual bacterium are acquired by a distinct individual. These recombination events make the ancestry harder to reconstruct in its entirety, but also contain interesting information. We found in particular that recombination had happened more often between strains belonging to the same lineage than across lineage boundaries. This observation is compatible with the lineages of S. enterica becoming progressively isolated from each other, which could lead to their gradual splintering into new species.
Recombination between homologous, but non-allelic, stretches of DNA such as gene families, segmental duplications and repeat elements is an important source of mutation. In humans, recent studies have identified short DNA motifs that both determine the location of 40 per cent of meiotic cross-over hotspots and are significantly enriched at the breakpoints of recurrent non-allelic homologous recombination (NAHR) syndromes. Unexpectedly, the most highly penetrant form of the motif occurs on the background of an inactive repeat element family (THE1 elements) and the motif also has strong recombinogenic activity on currently active element families including Alu and LINE2 elements. Analysis of genetic variation among members of these repeat families indicates an important role for NAHR in their evolution. Given the potential for double-strand breaks within repeat DNA to cause pathological rearrangement, the association between repeats and hotspots is surprising. Here I consider possible explanations for why selection acting against NAHR has not eliminated hotspots from repeat DNA, including mechanistic constraints, possible benefits to repeat DNA from recruiting hotspots, and rapid evolution of the recombination machinery. I suggest that rapid evolution of hotspot motifs may, surprisingly, tend to favour sequences present in repeat DNA and outline the data required to differentiate between hypotheses.
recombination hotspot; mutation; repeat element
Principal components analysis (PCA) is a statistical method commonly used in population genetics to identify structure in the distribution of genetic variation across geographical location and ethnic background. However, while the method is often used to inform about historical demographic processes, little is known about the relationship between fundamental demographic parameters and the projection of samples onto the primary axes. Here I show that for SNP data the projection of samples onto the principal components can be obtained directly from considering the average coalescent times between pairs of haploid genomes. The result provides a framework for interpreting PCA projections in terms of underlying processes, including migration, geographical isolation, and admixture. I also demonstrate a link between PCA and Wright's FST and show that SNP ascertainment has a largely simple and predictable effect on the projection of samples. Using examples from human genetics, I discuss the application of these results to empirical data and the implications for inference.
Genetic variation in natural populations typically demonstrates structure arising from diverse processes including geographical isolation, founder events, migration, and admixture. One technique commonly used to uncover such structure is principal components analysis, which identifies the primary axes of variation in data and projects the samples onto these axes in a graphically appealing and intuitive manner. However, as the method is non-parametric, it can be hard to relate PCA to underlying process. Here, I show that the underlying genealogical history of the samples can be related directly to the PC projection. The result is useful because it is straightforward to predict the effects of different demographic processes on the sample genealogy. However, the result also reveals the limitations of PCA, in that multiple processes can give the same projections, it is strongly influenced by uneven sampling, and it discards important information in the spatial structure of genetic variation along chromosomes.
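The link between PC projections and pairwise relationships among samples can be illustrated with a small numerical sketch. It is not the derivation used in the paper; it only demonstrates the classical double-centering identity, under which the centered sample-by-sample covariance (Gram) matrix that PCA decomposes can be rebuilt from pairwise differences between haplotypes alone. Under standard neutral mutation models the expected number of pairwise differences depends on pairwise coalescence times, which is the intuition behind the result described above. The toy haplotypes and the function names are assumptions made for illustration.

```python
# Toy data (hypothetical): 4 haploid samples x 5 biallelic SNPs, coded 0/1.
haplotypes = [
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 1, 1],
]

def centered_gram(X):
    """Gram matrix of samples after subtracting each SNP's mean across samples."""
    n, m = len(X), len(X[0])
    means = [sum(X[i][s] for i in range(n)) / n for s in range(m)]
    C = [[X[i][s] - means[s] for s in range(m)] for i in range(n)]
    return [[sum(C[i][s] * C[j][s] for s in range(m)) for j in range(n)]
            for i in range(n)]

def gram_from_pairwise_differences(X):
    """Rebuild the same matrix using only squared pairwise distances.

    For 0/1 haplotypes the squared distance is the number of pairwise
    differences. Double centering: B_ij = -1/2 (D_ij - r_i - r_j + g),
    where r_i is the mean of row i of D and g is the grand mean of D.
    """
    n, m = len(X), len(X[0])
    D = [[sum((X[i][s] - X[j][s]) ** 2 for s in range(m)) for j in range(n)]
         for i in range(n)]
    row_mean = [sum(D[i]) / n for i in range(n)]
    grand_mean = sum(row_mean) / n
    return [[-0.5 * (D[i][j] - row_mean[i] - row_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]

gram_direct = centered_gram(haplotypes)
gram_from_dist = gram_from_pairwise_differences(haplotypes)

# The two matrices agree up to floating-point error, so their eigenvectors
# (the sample projections onto the PCs) are determined by pairwise differences.
max_abs_diff = max(abs(gram_direct[i][j] - gram_from_dist[i][j])
                   for i in range(4) for j in range(4))
```

Because the eigenvectors of this matrix give the sample coordinates on each principal component, any two demographic histories that produce the same pairwise differences yield the same projection, which is consistent with the limitation noted above that multiple processes can give the same projections.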
The proteins encoded by the classical HLA class I and class II genes in the major histocompatibility complex (MHC) are highly polymorphic and play an essential role in self/non-self immune recognition. HLA variation is a crucial determinant of transplant rejection and susceptibility to a large number of infectious and autoimmune diseases1. Yet identification of causal variants is problematic due to linkage disequilibrium (LD) that extends across multiple HLA and non-HLA genes in the MHC2,3. We therefore set out to characterize the LD patterns between the highly polymorphic HLA genes and background variation by typing the classical HLA genes and >7,500 common single nucleotide polymorphisms (SNPs) and deletion/insertion polymorphisms (DIPs) across four population samples. The analysis provides informative tag SNPs that capture some of the variation in the MHC region and that could be used in initial disease association studies, and provides new insight into the evolutionary dynamics and ancestral origins of the HLA loci and their haplotypes.
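How well a candidate tag SNP captures a given HLA allele is commonly summarised by the pairwise LD statistic r². The following is a minimal sketch of that calculation, assuming phased haplotypes in which each position holds 1 if the haplotype carries the allele of interest and 0 otherwise. The function name and the toy vectors are hypothetical, and nothing here is taken from the study's own pipeline.

```python
def r_squared(hap_a, hap_b):
    """Pairwise LD (r^2) between two biallelic markers on phased haplotypes.

    hap_a, hap_b: equal-length lists of 0/1 allele indicators, one entry
    per haplotype (e.g. tag SNP allele and presence of one HLA allele).
    """
    if len(hap_a) != len(hap_b) or not hap_a:
        raise ValueError("haplotype vectors must be non-empty and equal length")
    n = len(hap_a)
    p_a = sum(hap_a) / n                                  # allele frequency at marker A
    p_b = sum(hap_b) / n                                  # allele frequency at marker B
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n   # joint haplotype frequency
    d = p_ab - p_a * p_b                                  # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a marker is monomorphic")
    return d * d / denom

# Toy haplotypes (hypothetical): a perfect tag and an uninformative marker.
tag_snp     = [0, 0, 1, 1]
hla_allele  = [0, 0, 1, 1]
unlinked    = [0, 1, 0, 1]

r2_perfect = r_squared(tag_snp, hla_allele)
r2_none = r_squared(tag_snp, unlinked)
```

An r² close to 1 means the SNP is a near-perfect proxy for the allele, while a value near 0 means it carries little information about it. For multi-allelic HLA genes the indicator is computed separately for each allele of interest.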