Recent efforts have attempted to describe the population structure of the common chimpanzee, focusing on four subspecies: Pan troglodytes verus, P. t. ellioti, P. t. troglodytes, and P. t. schweinfurthii. However, few studies have examined how natural selection has shaped their response to pathogens and their reproduction. Whey acidic protein (WAP) four-disulfide core domain (WFDC) genes and neighboring semenogelin (SEMG) genes encode proteins with combined roles in immunity and fertility. They display a strikingly high rate of amino acid replacement (dN/dS), indicative of adaptive pressures during primate evolution. In human populations, three signals of selection at the WFDC locus were described, possibly influencing the proteolytic profile and antimicrobial activities of the male reproductive tract. To evaluate the patterns of genomic variation and selection at the WFDC locus in chimpanzees, we sequenced 17 WFDC genes and 47 autosomal pseudogenes in 68 chimpanzees (15 P. t. troglodytes, 22 P. t. verus, and 31 P. t. ellioti). We found a clear differentiation of P. t. verus and estimated the divergence of the P. t. troglodytes and P. t. ellioti subspecies at 0.173 Myr; further, at the WFDC locus we identified a signature of strong selective constraints common to the three subspecies in WFDC6, a recent paralog of the epididymal protease inhibitor EPPIN. Overall, chimpanzees and humans do not display similar footprints of selection across the WFDC locus, possibly due to different selective pressures between the two species related to immune response and reproductive biology.
WFDC; natural selection; chimpanzees; serine protease inhibitor; reproduction; innate immunity
In 2007, the National Institutes of Health (NIH) introduced the Genome-Wide Association Studies (GWAS) Policy and the database of Genotypes and Phenotypes (dbGaP) to facilitate “controlled” access to GWAS data based on participants’ informed consent. dbGaP has provided 2,221 investigators access to 304 studies, resulting in 924 publications and significant scientific advances. Following this success, the 2014 Genomic Data Sharing Policy will extend the GWAS Policy to additional data types.
Genome-wide association studies, DNA sequencing studies, and other genomic studies are finding an increasing number of genetic variants associated with clinical phenotypes that may be useful in developing diagnostic, preventive, and treatment strategies for individual patients. However, few common variants have been integrated into routine clinical practice. The reasons for this are several, but two of the most significant are limited evidence about the clinical implications of the variants and a lack of a comprehensive knowledge base that captures genetic variants, their phenotypic associations, and other pertinent phenotypic information that is openly accessible to clinical groups attempting to interpret sequencing data. As the field of medicine begins to incorporate genome-scale analysis into clinical care, approaches need to be developed for collecting and characterizing data on the clinical implications of variants, developing consensus on their actionability, and making this information available for clinical use. The National Human Genome Research Institute (NHGRI) and the Wellcome Trust thus convened a workshop to consider the processes and resources needed to: 1) identify clinically valid genetic variants; 2) decide whether they are actionable and what the action should be; and 3) provide this information for clinical use. This commentary outlines the key discussion points and recommendations from the workshop.
genomic medicine; clinical actionability; database; electronic health records (EHR); pharmacogenomics; DNA sequencing
Biomedical research has generated and will continue to generate large amounts of data (termed ‘big data’) in many formats and at all levels. Consequently, there is an increasing need to better understand and mine the data to further knowledge and foster new discovery. The National Institutes of Health (NIH) has launched the Big Data to Knowledge (BD2K) initiative to maximize the use of biomedical big data. BD2K seeks to better define how to extract value from the data, both for the individual investigator and the overall research community, create the analytic tools needed to enhance utility of the data, provide the next generation of trained personnel, and develop data science concepts and tools that can be made available to all stakeholders.
biomedical big data; BD2K; NIH
The whey acidic protein (WAP) four-disulfide core domain (WFDC) locus located on human chromosome 20q13 spans 19 genes with WAP and/or Kunitz domains. These genes participate in antimicrobial, immune, and tissue homoeostasis activities. Neighboring SEMG genes encode seminal proteins Semenogelin 1 and 2 (SEMG1 and SEMG2). WFDC and SEMG genes have a strikingly high rate of amino acid replacement (dN/dS), indicative of responses to adaptive pressures during vertebrate evolution. To better understand the selection pressures acting on WFDC genes in human populations, we resequenced 18 genes and 54 noncoding segments in 71 European (CEU), African (YRI), and Asian (CHB + JPT) individuals. Overall, we identified 484 single-nucleotide polymorphisms (SNPs), including 65 coding variants (of which 49 are nonsynonymous differences). Using classic neutrality tests, we confirmed the signature of short-term balancing selection on WFDC8 in Europeans and a signature of positive selection spanning genes PI3, SEMG1, SEMG2, and SLPI. Associated with the latter signal, we identified an unusually homogeneous-derived 100-kb haplotype with a frequency of 88% in Asian populations. A putative candidate variant targeted by selection is Thr56Ser in SEMG1, which may alter the proteolytic profile of SEMG1 and antimicrobial activities of semen. All the well-characterized genes residing in the WFDC locus encode proteins that appear to have a role in immunity and/or fertility, two processes that are often associated with adaptive evolution. This study provides further evidence that the WFDC and SEMG loci have been under strong adaptive pressure within the short timescale of modern humans.
WFDC; semenogelins; natural selection; innate immunity; serine protease inhibitors; reproduction
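The abstract above refers to "classic neutrality tests" without naming them. As an illustration only, the following minimal sketch computes Tajima's D, one widely used statistic of this kind; choosing it here is an assumption, not a statement of the authors' method. The function name `tajimas_d` and the input layout (a list of derived-allele counts, one per site) are hypothetical, and the example counts are invented for demonstration.

```python
import math


def tajimas_d(n, derived_counts):
    """Tajima's D for n sampled chromosomes.

    derived_counts holds, for each site, how many of the n chromosomes
    carry the derived allele. Sites with count 0 or n are not segregating
    and are ignored. (Hypothetical helper, for illustration only.)
    """
    if n < 4:
        raise ValueError("need at least 4 chromosomes for a non-zero variance")
    seg = [k for k in derived_counts if 0 < k < n]
    s = len(seg)
    if s == 0:
        return math.nan  # D is undefined without segregating sites

    # Mean number of pairwise differences (pi), summed over sites.
    pi = sum(2 * k * (n - k) for k in seg) / (n * (n - 1))

    # Constants from Tajima (1989).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    # Difference between the two estimators of theta, scaled by its
    # estimated standard deviation.
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))


if __name__ == "__main__":
    # Invented data: 10 chromosomes, 16 segregating sites, mostly singletons.
    print(round(tajimas_d(10, [1] * 14 + [4, 5]), 3))
```

An excess of rare variants pushes D below zero, a pattern often discussed in connection with positive selection or population expansion. An excess of intermediate-frequency variants pushes D above zero, which is one pattern associated with balancing selection. Demography can produce the same patterns, so a value of D on its own does not establish selection.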
Massively-parallel cDNA sequencing (RNA-Seq) is a new technique that holds great promise for cardiovascular genomics. Here, we used RNA-Seq to study the transcriptomes of matched coronary artery disease cases and controls in the ClinSeq® study, using cell lines as tissue surrogates.
Lymphoblastoid cell lines (LCLs) from 16 cases and controls representing phenotypic extremes for coronary calcification were cultured and analyzed using RNA-Seq. All cell lines were then independently re-cultured and, along with another set of 16 independent cases and controls, profiled with Affymetrix microarrays to perform a technical validation of the RNA-Seq results. Statistically significant changes (p < 0.05) were detected in 186 transcripts, many of which are expressed at extremely low levels (5–10 copies/cell), which we confirmed through a separate spike-in control RNA-Seq experiment. Next, by fitting a linear model to exon-level RNA-Seq read counts, we detected signals of alternative splicing in 18 transcripts. Finally, we used the RNA-Seq data to identify differential expression (p < 0.0001) in eight previously unannotated regions that may represent novel transcripts. Overall, differentially expressed genes showed strong enrichment (p = 0.0002) for prior association with cardiovascular disease. At the network level, we found evidence for perturbation in pathways involving both cardiovascular system development and function as well as lipid metabolism.
We present a pilot study for transcriptome involvement in coronary artery calcification and demonstrate how RNA-Seq analyses using LCLs as a tissue surrogate may yield fruitful results in a clinical sequencing project. In addition to canonical gene expression, we present candidate variants from alternative splicing and novel transcript detection, which have been unexplored in the context of this disease.
Coronary artery calcification; RNA-Seq; Lymphoblastoid cell lines; Transcriptome profiling
We previously reported a human-specific gene conversion of SIGLEC11 by an adjacent paralogous pseudogene (SIGLEC16P), generating a uniquely human form of the Siglec-11 protein, which is expressed in the human brain. Here, we show that Siglec-11 is expressed exclusively in microglia in all human brains studied, a finding of potential relevance to brain evolution, as microglia modulate neuronal survival, and Siglec-11 recruits SHP-1, a tyrosine phosphatase that modulates microglial biology. Following the recent finding of a functional SIGLEC16 allele in human populations, further analysis of the human SIGLEC11 and SIGLEC16/P sequences revealed an unusual series of gene conversion events between two loci. Two tandem and likely simultaneous gene conversions occurred from SIGLEC16P to SIGLEC11 with a potentially deleterious intervening short segment happening to be excluded. One of the conversion events also changed the 5′ untranslated sequence, altering predicted transcription factor binding sites. Both of the gene conversions have been dated to ∼1–1.2 Ma, after the emergence of the genus Homo, but prior to the emergence of the common ancestor of Denisovans and modern humans about 800,000 years ago, thus suggesting involvement in later stages of hominin brain evolution. In keeping with this, recombinant soluble Siglec-11 binds ligands in the human brain. We also address a second-round more recent gene conversion from SIGLEC11 to SIGLEC16, with the latter showing an allele frequency of ∼0.1–0.3 in a worldwide population study. Initial pseudogenization of SIGLEC16 was estimated to occur at least 3 Ma, which thus preceded the gene conversion of SIGLEC11 by SIGLEC16P. As gene conversion usually disrupts the converted gene, the fact that ORFs of hSIGLEC11 and hSIGLEC16 have been maintained after an unusual series of very complex gene conversion events suggests that these events may have been subject to hominin-specific selection forces.
pseudogene; gene conversion; human evolution; human brain; microglia
Summary: VarSifter is a graphical software tool for desktop computers that allows investigators of varying computational skills to easily and quickly sort, filter, and sift through sequence variation data. A variety of filters and a custom query framework allow filtering based on any combination of sample and annotation information. By simplifying visualization and analyses of exome-scale sequence variation data, this program will help bring the power and promise of massively-parallel DNA sequencing to a broader group of researchers.
Availability and Implementation: VarSifter is written in Java, and is freely available in source and binary versions, along with a User Guide, at http://research.nhgri.nih.gov/software/VarSifter/.
Supplementary Information: Additional figures and methods available online at the journal's website.
The kallikrein (KLK) gene family comprises the largest uninterrupted locus of serine proteases in the human genome and represents a notable case for studying the evolutionary fate of duplicated genes. In primates, a recent duplication event gave rise to KLK2 and KLK3, both encoding essential proteins for the cascade of seminal plasma liquefaction. We reconstructed the evolutionary history of KLK2 and KLK3 by comparative analysis of the orthologous sequences from 22 primate species, calculated dN/dS ratios, and addressed the hypothesis of coevolution with their substrates, the semenogelins (SEMG1 and SEMG2). Our findings support the placement of the KLK2–KLK3 duplication in the Catarrhini ancestor and unveil the frequent loss of KLK2 throughout primate evolution by different genomic mechanisms, including unequal crossing-over, deletions, and pseudogenization. We provide evidence for adaptive evolution of KLK3 toward an expanded enzymatic spectrum, with an effect on the hydrolysis of semen coagulum. Furthermore, we found associations between mating system, the number of SEMG repeat units, and the number of functional KLK2 and KLK3 genes, suggesting complex evolutionary dynamics shaped by reproductive biology.
serine proteases; adaptive evolution; mating system; semen coagulation; semenogelins
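The dN/dS ratios mentioned above rest on a distinction between nucleotide changes that alter the encoded amino acid (nonsynonymous) and those that do not (synonymous). The sketch below shows only that first counting step for two aligned coding sequences, using the standard genetic code. It is not a dN/dS estimator: a real analysis also normalizes by the number of synonymous and nonsynonymous sites and corrects for multiple hits, and the abstract does not say which method was used. The function names and the example sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_codon_differences(seq1, seq2):
    """Return (synonymous, nonsynonymous) counts of differing codons.

    Both inputs must be aligned, in-frame, gap-free coding sequences of
    equal length. Each differing codon is counted once, whatever the
    number of nucleotide differences it contains (a simplification).
    """
    if len(seq1) != len(seq2) or len(seq1) % 3 != 0:
        raise ValueError("sequences must be equal length and a multiple of 3")
    syn = nonsyn = 0
    for i in range(0, len(seq1), 3):
        c1, c2 = seq1[i:i + 3].upper(), seq2[i:i + 3].upper()
        if c1 == c2:
            continue
        if CODON_TABLE[c1] == CODON_TABLE[c2]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn


if __name__ == "__main__":
    # Invented example: ACT->AGT changes Thr to Ser; CTT->CTC keeps Leu.
    print(count_codon_differences("ATGACTCTT", "ATGAGTCTC"))
```

Under neutrality the normalized nonsynonymous and synonymous rates are expected to be similar. A ratio well above 1 is commonly read as evidence of positive selection, and a ratio well below 1 as evidence of constraint.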
Reports of frequent loss of heterozygosity (LOH) of markers on human chromosome 7q in malignant myeloid disorders as well as breast, prostate, ovarian, colon, head and neck, gastric, pancreatic, and renal cell carcinomas suggest the presence of a tumor suppressor gene (TSG). Functional assays have demonstrated that the introduction of an intact copy of human chromosome 7 (hchr7) can restore senescence to immortalized human fibroblast cell lines having LOH of markers within 7q31-q32 and can inhibit the tumorigenic phenotype of a murine squamous cell carcinoma cell line. To facilitate the cloning of the putative TSG, we have constructed a high-resolution physical map of this region of hchr7, specifically that encompassing the markers D7S522 and D7S677 within 7q31.1-q31.2. By using a lower resolution yeast artificial chromosome-based map as a starting framework, we established complete clone coverage of the implicated critical region in bacterial-artificial chromosomes (BACs) and P1-derived artificial chromosomes (PACs). The resulting BAC/PAC-based contig map has provided suitable clones for the systematic sequencing of the entire interval. In addition, we have already identified 29 clusters of overlapping expressed-sequence tags (ESTs) and 4 known genes contained within these clones. Together, the physical map reported here coupled with the evolving sequence and gene maps should hasten the identification of the putative TSG residing within this region of hchr7.
tumor suppressor gene; human chromosome 7; physical mapping; expressed-sequence tag; bacterial artificial chromosome
Although it is recognized that many common complex diseases are a result of multiple genetic and environmental risk factors, studies of gene-environment interaction remain a challenge and have had limited success to date. Given the current state of the science, NIH sought input on ways to accelerate investigations of gene-environment interplay in health and disease by inviting experts from a variety of disciplines to give advice about the future direction of gene-environment interaction studies. Participants in the NIH Gene-Environment Interplay Workshop agreed that there is a need for continued emphasis on studies of the interplay between genetic and environmental factors in disease and that studies need to be designed around a multifaceted approach to reflect differences in diseases, exposure attributes, and pertinent stages of human development. The participants indicated that both targeted and agnostic approaches have strengths and weaknesses for evaluating main effects of genetic and environmental factors and their interactions. The unique perspectives represented at the workshop allowed the exploration of diverse study designs and analytical strategies, and conveyed the need for an interdisciplinary approach, including data sharing and data harmonization, to fully explore gene-environment interactions. Participants also emphasized the continued need for high-quality measures of environmental exposures and new genomic technologies in ongoing and new studies.
gene-environment interaction; epidemiology; study design; genetics; environment
Comparison of related genomes has emerged as a powerful lens for genome interpretation. Here, we report the sequencing and comparative analysis of 29 eutherian genomes. We confirm that at least 5.5% of the human genome has undergone purifying selection, and report constrained elements covering ~4.2% of the genome. We use evolutionary signatures and comparison with experimental datasets to suggest candidate functions for ~60% of constrained bases. These elements reveal a small number of new coding exons, candidate stop codon readthrough events, and over 10,000 regions of overlapping synonymous constraint within protein-coding exons. We find 220 candidate RNA structural families, and nearly a million elements overlapping potential promoter, enhancer and insulator regions. We report specific amino acid residues that have undergone positive selection, 280,000 non-coding elements exapted from mobile elements, and ~1,000 primate- and human-accelerated elements. Overlap with disease-associated variants suggests our findings will be relevant for studies of human biology and health.
Many software tools for comparative analysis of genomic sequence data have been released in recent decades. Despite this, it remains challenging to determine evolutionary relationships in gene clusters due to their complex histories involving duplications, deletions, inversions, and conversions. One concept describing these relationships is orthology. Orthologs derive from a common ancestor by speciation, in contrast to paralogs, which derive from duplication. Discriminating orthologs from paralogs is a necessary step in most multispecies sequence analyses, but doing so accurately is impeded by the occurrence of gene conversion events. We propose a refined method of orthology assignment based on two paradigms for interpreting its definition: by genomic context or by sequence content. X-orthology (based on context) traces orthology resulting from speciation and duplication only, while N-orthology (based on content) includes the influence of conversion events. We developed a computational method for automatically mapping both types of orthology on a per-nucleotide basis in gene cluster regions studied by comparative sequencing, and we make this mapping accessible by visualizing the output. All of these steps are incorporated into our newly extended CHAP 2 package. We evaluate our method using both simulated data and real gene clusters (including the well-characterized α-globin and β-globin clusters). We also illustrate use of CHAP 2 by analyzing four more loci: CCL (chemokine ligand), IFN (interferon), CYP2abf (part of cytochrome P450 family 2), and KIR (killer cell immunoglobulin-like receptors). These new methods facilitate and extend our understanding of evolution at these and other loci by adding automated accurate evolutionary inference to the biologist's toolkit. The CHAP 2 package is freely available from http://www.bx.psu.edu/miller_lab.
gene clusters; orthology; conversion; evolutionary inference; KIR
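The basic definition used above (orthologs trace back to a speciation, paralogs to a duplication) can be made concrete with a small labeled gene tree. The sketch below classifies a pair of genes by the event at their most recent common ancestor. It covers only this textbook definition and does not model conversion events or the X-orthology and N-orthology distinction introduced in the abstract, and it is not taken from the CHAP 2 package. The tree encoding, the function names, and the gene labels are all hypothetical.

```python
def leaves(node):
    """Set of gene names under a node; a leaf is a plain string."""
    if isinstance(node, str):
        return {node}
    _event, left, right = node
    return leaves(left) | leaves(right)


def relation(tree, gene_a, gene_b):
    """Classify two distinct genes as 'ortholog' or 'paralog'.

    Internal nodes are tuples (event, left, right) where event is either
    'speciation' or 'duplication'. The answer is determined by the event
    at the most recent common ancestor of the two genes.
    """
    if isinstance(tree, str):
        raise ValueError("genes must be two distinct leaves of the tree")
    event, left, right = tree
    for child in (left, right):
        found = leaves(child)
        if gene_a in found and gene_b in found:
            # Both genes sit in the same subtree: descend further.
            return relation(child, gene_a, gene_b)
    everything = leaves(left) | leaves(right)
    if gene_a not in everything or gene_b not in everything:
        raise ValueError("gene not found in tree")
    return "ortholog" if event == "speciation" else "paralog"


if __name__ == "__main__":
    # Invented tree: one ancient duplication, then a speciation in each copy.
    tree = (
        "duplication",
        ("speciation", "human_A", "mouse_A"),
        ("speciation", "human_B", "mouse_B"),
    )
    print(relation(tree, "human_A", "mouse_A"))
    print(relation(tree, "human_A", "mouse_B"))
```

In this toy tree, `human_A` and `mouse_A` are orthologs, while `human_A` and `mouse_B` are paralogs even though they come from different species. A conversion event that overwrites part of one copy with the other would blur exactly this picture, which is the difficulty the abstract describes.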
Charcot-Marie-Tooth disease type 2D (CMT2D) is a dominantly inherited peripheral neuropathy caused by missense mutations in the glycyl-tRNA synthetase gene (GARS). In addition to GARS, mutations in three other tRNA synthetase genes cause similar neuropathies, although the underlying mechanisms are not fully understood. To address this, we generated transgenic mice that ubiquitously over-express wild-type GARS and crossed them to two dominant mouse models of CMT2D to distinguish loss-of-function and gain-of-function mechanisms. Over-expression of wild-type GARS does not improve the neuropathy phenotype in heterozygous Gars mutant mice, as determined by histological, functional, and behavioral tests. Transgenic GARS is able to rescue a pathological point mutation as a homozygote or in complementation tests with a Gars null allele, demonstrating the functionality of the transgene and revealing a recessive loss-of-function component of the point mutation. Missense mutations as transgene-rescued homozygotes or compound heterozygotes have a more severe neuropathy than heterozygotes, indicating that increased dosage of the disease-causing alleles results in a more severe neurological phenotype, even in the presence of a wild-type transgene. We conclude that, although missense mutations of Gars may cause some loss of function, the dominant neuropathy phenotype observed in mice is caused by a dose-dependent gain of function that is not mitigated by over-expression of functional wild-type protein.
Mutations in the glycyl-tRNA synthetase gene (GARS) cause Charcot-Marie-Tooth disease type 2D, a disease characterized by neuronal axon loss in the arms and legs, resulting in weakness and sensory problems. The GARS protein is essential for protein synthesis in every cell, and it has been difficult to determine whether the mutations result in disease because they impair this function or whether GARS somehow becomes toxic when it is mutated. We generated mice that overexpress normal GARS and mated these to two different mouse models of the disease to determine whether a restoration of normal function could prevent the disease. These crosses demonstrated that the mutant forms of GARS are toxic, and this toxic effect increases as the amount of mutant protein increases. Furthermore, this toxicity cannot be reduced or prevented by providing additional normal GARS. Therefore, these results suggest that, for most patients, therapies need to specifically target the mutant form of GARS or the toxic function.
Ciliary dysfunction leads to a broad range of overlapping phenotypes, collectively termed ciliopathies. This grouping is underscored by genetic overlap, where causal genes can also contribute modifying alleles to clinically distinct disorders. Here we show that mutations in TTC21B/IFT139, encoding a retrograde intraflagellar transport (IFT) protein, cause both isolated nephronophthisis (NPHP) and syndromic Jeune Asphyxiating Thoracic Dystrophy (JATD). Moreover, although systematic medical resequencing of a large, clinically diverse ciliopathy cohort and matched controls showed a similar frequency of rare changes, in vivo and in vitro evaluations unmasked a significant enrichment of pathogenic alleles in cases, suggesting that TTC21B contributes pathogenic alleles to ∼5% of ciliopathy patients. Our data illustrate how genetic lesions can both be causally associated with diverse ciliopathies and interact in trans with other disease-causing genes, and highlight how saturated resequencing followed by functional analysis of all variants informs the genetic architecture of disorders.
Gene conversion events are often overlooked in analyses of genome evolution. In a conversion event, an interval of DNA sequence (not necessarily containing a gene) overwrites a highly similar sequence. The event creates relationships among genomic intervals that can confound attempts to identify orthologs and to transfer functional annotation between genomes. Here we examine 1,616,329 paralogous pairs of mouse genomic intervals, and detect conversion events in about 7.5% of them. Properties of the putative gene conversions are analyzed, such as the lengths of the paralogous pairs and the spacing between their sources and targets. Our approach is illustrated using conversion events in primate CCL gene clusters. Source code for our program is included in the 3SEQ_2D package, which is freely available at www.bx.psu.edu/miller_lab/.
algorithms; computational molecular biology; evolution
Gene clusters containing multiple similar genomic regions in close proximity are of great interest for biomedical studies because of their associations with inherited diseases. However, such regions are difficult to analyze due to their structural complexity and their complicated evolutionary histories, reflecting a variety of large-scale mutational events. In particular, conversion events can mislead inferences about the relationships among these regions, as traced by traditional methods such as construction of phylogenetic trees or multi-species alignments.
To correct the distorted information generated by such methods, we have developed an automated pipeline called CHAP (Cluster History Analysis Package) for detecting conversion events. We used this pipeline to analyze the conversion events that affected two well-studied gene clusters (α-globin and β-globin) and three gene clusters for which comparative sequence data were generated from seven primate species: CCL (chemokine ligand), IFN (interferon), and CYP2abf (part of cytochrome P450 family 2). CHAP is freely available at http://www.bx.psu.edu/miller_lab.
These studies reveal the value of characterizing conversion events in the context of studying gene clusters in complex genomes.
Rapid evolution is a hallmark of centromeric DNA in eukaryotic genomes. Yet, the centromere itself has a conserved functional role that is mediated by the kinetochore protein complex. To broaden our understanding about both the DNA and proteins that interact at the functional centromere, we sought to gain a detailed view of the evolutionary events that have shaped the primate kinetochore. Specifically, we performed comparative mapping and sequencing of the genomic regions encompassing the genes encoding three foundation kinetochore proteins: Centromere Proteins A, B, and C (CENP-A, CENP-B, and CENP-C). A histone H3 variant, CENP-A provides the foundation of the centromere-specific nucleosome. Comparative sequence analyses of the CENP-A gene in 14 primate species revealed encoded amino-acid residues within both the histone-fold domain and the N-terminal tail that are under strong positive selection. Similar comparative analyses of CENP-C, another foundation protein essential for centromere function, identified amino-acid residues throughout the protein under positive selection in the primate lineage, including several in the centromere localization and DNA-binding regions. Perhaps surprisingly, the gene encoding CENP-B, a kinetochore protein that binds specifically to alpha-satellite DNA, was not found to be associated with signatures of positive selection. These findings point to important and distinct evolutionary forces operating on the DNA and proteins of the primate centromere.
kinetochore; selection; evolution; centromere
Patients with cystic fibrosis (CF) manifest a multisystem disease due to deleterious mutations in each gene encoding the cystic fibrosis transmembrane conductance regulator (CFTR). However, the role of dysfunctional CFTR is uncertain in individuals with mild forms of CF (i.e., pancreatic sufficiency) and a mutation in only one CFTR gene.
Eleven pancreatic sufficient (PS) CF patients with only one CFTR mutation identified after mutation screening (three patients), mutation scanning (four patients) or DNA sequencing (four patients) were studied. Bi-directional sequencing of the coding region of CFTR was performed in patients who had mutation screening or scanning. If a second CFTR mutation was not identified, CFTR mRNA transcripts from nasal epithelial cells were analysed to determine if any PS-CF patients harboured a second CFTR mutation that altered RNA expression.
Sequencing of the coding regions of CFTR identified a second deleterious mutation in five of the seven patients who previously had mutation screening or mutation scanning. Five of the remaining six patients with only one deleterious mutation identified in the coding region of one CFTR gene had a pathologic reduction in the amount of RNA transcribed from their other CFTR gene (8.4–16% of wild type).
These results show that sequencing of the coding region of CFTR followed by analysis of CFTR transcription could be a useful diagnostic approach to confirm that patients with mild forms of CF harbour deleterious alterations in both CFTR genes.
In Gaucher disease (GD), the inherited deficiency of glucocerebrosidase results in the accumulation of glucocerebroside within lysosomes. Although almost 300 mutations in the glucocerebrosidase gene (GBA) have been identified, the ability to predict phenotype from genotype is quite limited. In this study, we sought to examine potential GBA transcriptional regulatory elements for variants that contribute to phenotypic diversity. Specifically, we generated the genomic sequence for the orthologous genomic region (~39.4 kb) encompassing GBA in eight non-human mammals. Computational comparisons of the resulting sequences, using human sequence as the reference, allowed the identification of multi-species conserved sequences (MCSs). Further analyses predicted the presence of two putative clusters of transcriptional regulatory elements upstream and downstream of GBA, containing five and three transcription factor-binding sites (TFBSs), respectively. A firefly luciferase (Fluc) reporter construct containing sequence flanking the GBA gene was used to test the functional consequences of altering these conserved sequences. The predicted TFBSs were individually altered by targeted mutagenesis, resulting in enhanced Fluc expression for one site and decreased expression for the seven other sites. Gel-shift assays confirmed the loss of nuclear-protein binding for several of the mutated constructs. These conserved non-coding sequences flanking GBA could play a role in the transcriptional regulation of the gene, contributing to the complexity underlying the phenotypic diversity seen in GD.
Multi-species sequence comparisons; glucocerebrosidase; Gaucher disease; transcriptional regulation; luciferase assays
Mutations in the Otopetrin 1 gene (Otop1) in mice and fish produce an unusual bilateral vestibular pathology that involves the absence of otoconia without hearing impairment. The encoded protein, Otop1, is the only functionally characterized member of the Otopetrin Domain Protein (ODP) family; the extended sequence and structural preservation of ODP proteins in metazoans suggest a conserved functional role. Here, we use the tools of sequence- and cytogenetic-based comparative genomics to study the Otop1 and the Otop2-Otop3 genes and to establish their genomic context in 25 vertebrates. We extend our evolutionary study to include the gene mutated in Usher syndrome (USH) subtype 1G (Ush1g), both because of the head-to-tail clustering of Ush1g with Otop2 and because Otop1 and Ush1g mutations result in inner ear phenotypes.
We established that OTOP1 is the boundary gene of an inversion polymorphism on human chromosome 4p16 that originated in the common human-chimpanzee lineage more than 6 million years ago. Other lineage-specific evolutionary events included a three-fold expansion of the Otop genes in Xenopus tropicalis and of Ush1g in teleost fish. The tight physical linkage between Otop2 and Ush1g is conserved in all vertebrates. To further understand the functional organization of the Ush1g-Otop2 locus, we deduced a putative map of binding sites for CCCTC-binding factor (CTCF), a mammalian insulator transcription factor, from genome-wide chromatin immunoprecipitation-sequencing (ChIP-seq) data in mouse and human embryonic stem (ES) cells combined with detection of CTCF-binding motifs.
The results presented here clarify the evolutionary history of the vertebrate Otop and Ush1g families, and establish a framework for studying the possible interaction(s) of Ush1g and Otop in developmental pathways.
Myelin protein zero (MPZ) is a critical structural component of myelin in the peripheral nervous system. The MPZ gene is regulated, in part, by the transcription factors SOX10 and EGR2. Mutations in MPZ, SOX10, and EGR2 have been implicated in demyelinating peripheral neuropathies, suggesting that components of this transcriptional network are candidates for harboring disease-causing mutations (or otherwise functional variants) that affect MPZ expression.
We utilized a combination of multi-species sequence comparisons, transcription factor-binding site predictions, targeted human DNA re-sequencing, and in vitro and in vivo enhancer assays to study human non-coding MPZ variants.
Our efforts revealed a variant within the first intron of MPZ that resides within a previously described SOX10 binding site, is associated with decreased enhancer activity, and alters binding of nuclear proteins. Additionally, the genomic segment harboring this variant directs tissue-relevant reporter gene expression in zebrafish.
This is the first reported MPZ variant within a cis-acting transcriptional regulatory element. While we were unable to implicate this variant in disease onset, our data suggest that similar non-coding sequences should be screened for mutations in patients with neurological disease. Furthermore, our multi-faceted approach for examining the functional significance of non-coding variants can be readily generalized to study other loci important for myelin structure and function.
Balancing selection is potentially an important biological force for maintaining advantageous genetic diversity in populations, including variation that is responsible for long-term adaptation to the environment. By serving as a means to maintain genetic variation, it may be particularly relevant to maintaining phenotypic variation in natural populations. Nevertheless, its prevalence and specific targets in the human genome remain largely unknown. We have analyzed the patterns of diversity and divergence of 13,400 genes in two human populations using an unbiased single-nucleotide polymorphism data set, a genome-wide approach, and a method that incorporates demography in neutrality tests. We identified an unbiased catalog of genes with signatures of long-term balancing selection, which includes immunity genes as well as genes encoding keratins and membrane channels; the catalog also shows enrichment in functional categories involved in cellular structure. Patterns are mostly concordant in the two populations, with a small fraction of genes showing population-specific signatures of selection. Power considerations indicate that our findings represent a subset of all targets in the genome, suggesting that although balancing selection may not have an obvious impact on a large proportion of human genes, it is a key force affecting the evolution of a number of genes in humans.
overdominance; frequency-dependent selection; heterosis; human evolution; population genetics; human diversity
A remarkable characteristic of the human major histocompatibility complex (MHC) is its extreme genetic diversity, which is maintained by balancing selection. In fact, the MHC remains one of the best-known examples of natural selection in humans, with well-established genetic signatures and biological mechanisms for the action of selection. Here, we present genetic and functional evidence that another gene with a fundamental role in MHC class I presentation, endoplasmic reticulum aminopeptidase 2 (ERAP2), has also evolved under balancing selection and contains a variant that affects antigen presentation. Specifically, genetic analyses of six human populations revealed strong and consistent signatures of balancing selection affecting ERAP2. This selection maintains two highly differentiated haplotypes (Haplotype A and Haplotype B), with frequencies of 0.44 and 0.56, respectively. We found that ERAP2 expressed from Haplotype B undergoes differential splicing and encodes a truncated protein, leading to nonsense-mediated decay of the mRNA. To investigate the consequences of ERAP2 deficiency on MHC presentation, we correlated surface MHC class I expression with ERAP2 genotypes in primary lymphocytes. Haplotype B homozygotes had lower levels of MHC class I expressed on the surface of B cells, suggesting that naturally occurring ERAP2 deficiency affects MHC presentation and immune response. Interestingly, an ERAP2 paralog, endoplasmic reticulum aminopeptidase 1 (ERAP1), also shows genetic signatures of balancing selection. Together, our findings link the genetic signatures of selection with an effect on splicing and a cellular phenotype. Although the precise selective pressure that maintains polymorphism is unknown, the demonstrated differences between the ERAP2 splice forms provide important insights into the potential mechanism for the action of selection.
It has long been known that the extremely high levels of genetic diversity present in the major histocompatibility complex (MHC) are due to balancing selection, a type of natural selection that maintains advantageous genetic diversity in populations. The MHC encodes molecules required for a type of antigen presentation that mediates detection of infected and cancerous cells by the immune system; the genetic diversity of the MHC thus ensures an adequate response to the wide variety of pathogens that humans encounter. Here, we show that other genes involved in the same antigen-presentation pathway are also subject to balancing selection in humans. Specifically, we show that balancing selection acts to maintain two forms of the endoplasmic reticulum aminopeptidase 2 gene (ERAP2), which encodes a protein also involved in antigen presentation. Although the two ERAP2 forms are present at similar frequencies (close to 0.5), they are associated with differences in the levels of MHC molecules on the surface of immune cells. In summary, our findings show that natural selection maintains variants of ERAP2 that affect immune surveillance; they also establish ERAP2 as one of the few examples of balancing selection in humans where the selected variant, its functional consequences, and its influence on interpersonal diversity are known.
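As a rough illustration of what the reported ERAP2 haplotype frequencies (0.44 for Haplotype A, 0.56 for Haplotype B) would imply at the genotype level, the sketch below computes expected genotype proportions. It assumes Hardy-Weinberg proportions, which is an assumption made here for illustration and not a claim from the study; genotype frequencies at a locus under balancing selection may deviate from these values. The function name is hypothetical.

```python
def expected_genotype_frequencies(freq_a: float, freq_b: float) -> dict:
    """Expected genotype proportions for two haplotypes.

    Assumes Hardy-Weinberg proportions (illustrative assumption only;
    not stated in the source text).
    """
    if abs(freq_a + freq_b - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    return {
        "AA": freq_a ** 2,          # Haplotype A homozygotes
        "AB": 2 * freq_a * freq_b,  # heterozygotes
        "BB": freq_b ** 2,          # Haplotype B homozygotes
    }


# Frequencies of Haplotype A and Haplotype B as reported in the text.
freqs = expected_genotype_frequencies(0.44, 0.56)
for genotype, value in freqs.items():
    print(f"{genotype}: {value:.4f}")
```

Under this assumption, roughly half of individuals would be heterozygous and about 31% would be Haplotype B homozygotes, the group reported above to show lower surface MHC class I levels on B cells.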