Sickle cell disease (SCD) is a congenital blood disease, affecting predominantly children from sub-Saharan Africa, but also populations world-wide. Although the causal mutation of SCD is known, the sources of clinical variability of SCD remain poorly understood, with only a few highly heritable traits associated with SCD having been identified. Phenotypic heterogeneity in the clinical expression of SCD is problematic for follow-up (FU), management, and treatment of patients. Here we used the joint analysis of gene expression and whole genome genotyping data to identify the genetic regulatory effects contributing to gene expression variation among groups of patients exhibiting clinical variability, as well as unaffected siblings, in Benin, West Africa. We characterized and replicated patterns of whole blood gene expression variation within and between SCD patients at entry to clinic, as well as in follow-up programs. We present a global map of genes involved in the disease through analysis of whole blood sampled from the cohort. Genome-wide association mapping of gene expression revealed 390 peak genome-wide significant expression SNPs (eSNPs) and 6 significant eSNP-by-clinical status interaction effects. The strong modulation of the transcriptome implicates pathways affecting core circulating cell functions and shows how genotypic regulatory variation likely contributes to the clinical variation observed in SCD.
sickle cell disease; genomics; transcriptome; eSNP mapping; gene-by-environment interactions
Whole-exome or gene targeted resequencing in hundreds to thousands of individuals has shown that the majority of genetic variants are at low frequency in human populations. Rare variants are enriched for functional mutations and are expected to explain an important fraction of the genetic etiology of human disease, therefore having a potential medical interest. In this work, we analyze the whole-exome sequences of French-Canadian individuals, a founder population with a unique demographic history that includes an original population bottleneck less than 20 generations ago, followed by a demographic explosion, and the whole exomes of French individuals sampled from France. We show that in less than 20 generations of genetic isolation from the French population, the genetic pool of French-Canadians shows reduced levels of diversity, higher homozygosity, and an excess of rare variants with low variant sharing with Europeans. Furthermore, the French-Canadian population contains a larger proportion of putatively damaging functional variants, which could partially explain the increased incidence of genetic disease in the province. Our results highlight the impact of population demography on genetic fitness and the contribution of rare variants to the human genetic variation landscape, emphasizing the need for deep cataloguing of genetic variants by resequencing worldwide human populations in order to truly assess disease risk.
Recent resequencing of the whole genome or the coding part of the genome (the exome) in thousands of individuals has described a large excess of low frequency variants in humans, probably arising as a consequence of recent rapid growth in human population sizes. Most rare variants are private to specific populations and are enriched for functional mutations, thus potentially having some medical relevance. In this study, we analyze whole-exome sequences from over a hundred individuals from the French-Canadian population, which was founded less than 400 years ago by about 8,500 French settlers who colonized the province between the 17th and 18th centuries. We show that in a remarkably short period of time this population has accumulated substantial differences, including an excess of rare, functional and potentially damaging variants, when compared to the original European population. Our results show the effects of population history on genetic variation that may have an impact on genetic fitness and disease, and have implications in the design of genetic studies, highlighting the importance of extending deep resequencing to worldwide human populations.
Recent advances in high-throughput DNA sequencing technologies and associated statistical analyses have enabled in-depth analysis of whole-genome sequences. As this technology is applied to a growing number of individual human genomes, entire families are now being sequenced. Information contained within the pedigree of a sequenced family can be leveraged when inferring the donors' genotypes. The presence of a de novo mutation within the pedigree is indicated by a violation of Mendelian inheritance laws. Here, we present a method for probabilistically inferring genotypes across a pedigree using high-throughput sequencing data and producing the posterior probability of de novo mutation at each genomic site examined. This framework can be used to disentangle the effects of germline and somatic mutational processes and to simultaneously estimate the effect of sequencing error and the initial genetic variation in the population from which the founders of the pedigree arise. This approach is examined in detail through simulations and areas for method improvement are noted. By applying this method to data from members of a well-defined nuclear family with accurate pedigree information, the stage is set to make the most direct estimates of the human mutation rate to date.
de novo mutations; pedigree; short-read data; mutation rates; trio model
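The idea that a de novo mutation shows up as a violation of Mendelian inheritance can be illustrated with a minimal sketch. It assumes hard genotype calls at a single biallelic site in a mother-father-child trio, which is a simplification: the method described above works probabilistically from sequencing reads and reports a posterior probability instead. The function name and the genotype encoding (two-character strings) are hypothetical choices made for illustration only.

```python
from itertools import product


def is_mendelian_consistent(mother: str, father: str, child: str) -> bool:
    """Return True if the child genotype can be formed by taking one
    allele from the mother and one from the father.

    Genotypes are unordered two-character strings such as "AG".
    This is a hard-call check for illustration, not a probabilistic model.
    """
    # Every genotype obtainable by pairing one maternal with one paternal allele.
    possible = {tuple(sorted(pair)) for pair in product(mother, father)}
    return tuple(sorted(child)) in possible


# A child carrying an allele seen in neither parent is a candidate
# de novo mutation (or a genotyping error, which hard calls cannot rule out).
candidate = not is_mendelian_consistent("AA", "AA", "AG")
```

A hard-call check like this cannot separate true de novo mutations from sequencing error, which is the motivation for a probabilistic treatment.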
Regions of the genome that are under evolutionary constraint across multiple species have previously been used to identify functional sequences in the human genome. Furthermore, it is known that there is an inverse relationship between evolutionary constraint and the allele frequency of a mutation segregating in human populations, implying a direct relationship between interspecies divergence and fitness in humans. Here we utilise this relationship to test differences in the accumulation of putatively deleterious mutations both between populations and on the individual level.
Using whole genome and exome sequencing data from Phase 1 of the 1000 Genomes Project for 1,092 individuals from 14 worldwide populations, we show that minor allele frequency (MAF) varies as a function of constraint around both coding regions and non-coding sites genome-wide, implying that negative, rather than positive, selection primarily drives the distribution of alleles among individuals via background selection. We find a strong relationship between effective population size and the depth of depression in MAF around the most conserved genes, suggesting that populations with smaller effective size are carrying more deleterious mutations, which also translates into higher genetic load when considering the number of putatively deleterious alleles segregating within each population. Finally, given the extreme richness of the data, we are now able to classify individual genomes by the accumulation of mutations at functional sites using high coverage 1000 Genomes data. Using this approach we detect differences between ‘healthy’ individuals within populations for the distributions of putatively deleterious rare alleles they are carrying.
These findings demonstrate the extent of background selection in the human genome and highlight the role of population history in shaping patterns of diversity between human individuals. Furthermore, we provide a framework for the utility of personal genomic data for the study of genetic fitness and diseases.
Gene-environment interactions have long been recognized as a fundamental concept in evolutionary, quantitative, and medical genetics. In the genomics era, study of how environment and genome interact to shape gene expression variation is relevant to understanding the genetic architecture of complex phenotypes. While genetic analysis of gene expression variation has focused on main effects, little is known about the extent of interaction effects implicating regulatory variants and their consequences on transcriptional variation. Here we survey the current state of the concept of transcriptional gene-environment interactions and discuss its utility for mapping disease phenotypes in light of the insights gained from genome-wide association studies of gene expression.
eQTL; eSNP; gene-environment interactions; transcriptome
The var genes of the human malaria parasite Plasmodium falciparum are highly polymorphic loci coding for the erythrocyte membrane proteins 1 (PfEMP1), which are responsible for the cytoadherence of P. falciparum infected red blood cells to the human vasculature. Cytoadhesion, coupled with differential expression of var genes, contributes to virulence and allows the parasite to establish chronic infections by evading detection from the host’s immune system. Although studying genetic diversity is a major focus of recent work on the var genes, little is known about the gene family's origin and evolutionary history.
Using a novel hidden Markov model-based approach and var sequences assembled from additional isolates and species, we are able to reveal elements of both the early evolution of the var genes as well as recent diversifying events. We compare sequences of the var gene DBLα domains from divergent isolates of P. falciparum (3D7 and HB3), and a closely-related species, Plasmodium reichenowi. We find that the gene family is equally large in P. reichenowi and P. falciparum, with a minimum of 51 var genes in the P. reichenowi genome (compared to 61 in 3D7 and a minimum of 48 in HB3). In addition, we are able to define large, continuous blocks of homologous sequence among P. falciparum and P. reichenowi var gene DBLα domains. These results reveal that the contemporary structure of the var gene family was present before the divergence of P. falciparum and P. reichenowi, estimated to be between 2.5 and 6 million years ago. We also reveal that recombination has played an important and traceable role in both the establishment, and the maintenance, of diversity in the sequences.
Despite the remarkable diversity and rapid evolution found in these loci within and among P. falciparum populations, the basic structure of these domains and the gene family is surprisingly old and stable. Revealing a common structure as well as conserved sequence among two species also has implications for developing new primate-parasite models for studying the pathology and immunology of falciparum malaria, and for studying the population genetics of var genes and associated virulence phenotypes.
Non-allelic homologous recombination; Hidden Markov-model; var genes; Malaria; PfEMP1; Gene family evolution; Balancing selection
Whole genome sequencing studies are essential to obtain a comprehensive understanding of the vast pattern of human genomic variations. Here we report the results of a high-coverage whole genome sequencing study for 44 unrelated healthy Caucasian adults, each sequenced to over 50-fold coverage (averaging 65.8×). We identified approximately 11 million single nucleotide polymorphisms (SNPs), 2.8 million short insertions and deletions, and over 500,000 block substitutions. We showed that, although previous studies, including the 1000 Genomes Project Phase 1 study, have catalogued the vast majority of common SNPs, many of the low-frequency and rare variants remain undiscovered. For instance, approximately 1.4 million SNPs and 1.3 million short indels that we found were novel to both the dbSNP and the 1000 Genomes Project Phase 1 data sets, the majority of which (∼96%) have a minor allele frequency less than 5%. On average, each individual genome carried ∼3.3 million SNPs and ∼492,000 indels/block substitutions, including approximately 179 variants that were predicted to cause loss of function of the gene products. Moreover, each individual genome carried an average of 44 such loss-of-function variants in a homozygous state, which would completely “knock out” the corresponding genes. Across all the 44 genomes, a total of 182 genes were “knocked-out” in at least one individual genome, among which 46 genes were “knocked out” in over 30% of our samples, suggesting that a number of genes are commonly “knocked-out” in general populations. Gene ontology analysis suggested that these commonly “knocked-out” genes are enriched in biological processes related to antigen processing and immune response. Our results contribute towards a comprehensive characterization of human genomic variation, especially for less-common and rare variants, and provide an invaluable resource for future genetic studies of human variation and diseases.
Next-generation sequencing technologies can now be used to directly measure heritable de novo DNA sequence mutations in humans. However, these techniques have not been used to examine environmental factors that induce such mutations and their associated diseases. To address this issue, a working group on environmentally induced germline mutation analysis (ENIGMA) met in October 2011 to propose the necessary foundational studies, which include sequencing of parent–offspring trios from highly exposed human populations, and controlled dose–response experiments in animals. These studies will establish background levels of variability in germline mutation rates and identify environmental agents that influence these rates and heritable disease. Guidance for the types of exposures to examine comes from rodent studies that have identified agents such as cancer chemotherapeutic drugs, ionizing radiation, cigarette smoke, and air pollution as germ-cell mutagens. Research is urgently needed to establish the health consequences of parental exposures on subsequent generations.
Germ cell; Heritable mutation; Next generation sequencing; Copy number variants
We present a statistical framework for estimation and application of sample allele frequency spectra from New-Generation Sequencing (NGS) data. In this method, we first estimate the allele frequency spectrum using maximum likelihood. In contrast to previous methods, the likelihood function is calculated using a dynamic programming algorithm and numerically optimized using analytical derivatives. We then use a Bayesian method for estimating the sample allele frequency at a single site, and show how the method can be used for genotype calling and SNP calling. We also show how the method can be extended to various other cases, including cases with deviations from Hardy-Weinberg equilibrium. We evaluate the statistical properties of the methods using simulations and by application to a real data set.
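A dynamic programming computation of this kind can be sketched as follows. This is an illustrative reconstruction, not the authors' code, and it rests on stated assumptions: diploid individuals, a biallelic site, and per-individual genotype likelihoods supplied as triples (L0, L1, L2) for 0, 1 or 2 copies of the derived allele. The function names and input layout are hypothetical. The recursion adds one individual at a time and tracks the likelihood of the data for every possible total count of derived alleles, so the cost grows quadratically with sample size instead of exponentially.

```python
from math import comb


def site_allele_frequency_likelihoods(genotype_likelihoods):
    """Likelihood of the read data at one site for each possible number k
    of derived alleles among 2n sampled chromosomes (k = 0..2n).

    genotype_likelihoods: list of (L0, L1, L2) tuples, one per diploid
    individual (an assumed input layout).
    """
    h = [1.0]  # with zero individuals, only k = 0 is possible
    for l0, l1, l2 in genotype_likelihoods:
        new = [0.0] * (len(h) + 2)
        for k, value in enumerate(h):
            new[k] += l0 * value            # individual carries 0 derived alleles
            new[k + 1] += 2.0 * l1 * value  # heterozygote: two chromosome orderings
            new[k + 2] += l2 * value        # individual carries 2 derived alleles
        h = new
    chromosomes = 2 * len(genotype_likelihoods)
    # Normalise by the number of ways to place k derived alleles on 2n chromosomes.
    return [value / comb(chromosomes, k) for k, value in enumerate(h)]


def posterior_allele_count(site_likelihoods, prior):
    """Bayesian posterior over the derived allele count at one site,
    given a prior such as an estimated allele frequency spectrum."""
    weighted = [p * l for p, l in zip(prior, site_likelihoods)]
    total = sum(weighted)
    return [w / total for w in weighted]
```

In practice the prior passed to posterior_allele_count would be the spectrum estimated by maximum likelihood across sites; a flat prior is enough to try the functions out.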
J.B.S. Haldane proposed in 1947 that the male germline may be more mutagenic than the female [1]. Diverse studies have supported Haldane’s contention of a higher average mutation rate in the male germline in a variety of mammals, including humans (e.g. [2,3]). Here we present the first direct comparative analysis of male and female germline mutation rates from complete genome sequences of two parent-offspring trios. Through extensive validation, we identified 49 and 35 germline de novo mutations (DNMs) in two trio offspring, as well as 1,586 non-germline DNMs arising either somatically or in the cell-lines from which DNA was derived. Most strikingly, in one family we observed that 92% of germline DNMs were from the paternal germline, while, in complete contrast, in the other family 64% of DNMs were from the maternal germline. These observations reveal considerable variation in mutation rates within and between families.
RNA editing is an important cellular process by which the nucleotides in a mature RNA transcript are altered to cause them to differ from the corresponding DNA sequence. While this process yields essential transcripts in humans and other organisms, it is believed to occur at a relatively small number of loci. The rarity of RNA editing has been challenged by a recent comparison of human RNA and DNA sequence data from 27 individuals, which revealed that over 10,000 human exonic sites appear to exhibit RNA-DNA differences (RDDs). Many of these differences could not have been caused by either of the two previously known human RNA editing mechanisms—ADAR-mediated A→G substitutions or APOBEC1-mediated C→U switches—suggesting that a previously unknown mechanism of RNA editing may be active in humans. Here, we reanalyze these data and demonstrate that genomic sequences exist in these same individuals or in the human genome that match the majority of RDDs. Our results suggest that the majority of these RDD events were observed due to accurate transcription of sequences paralogous to the apparently edited gene but differing at the edited site. In light of our results it seems prudent to conclude that if indeed an unknown mechanism is causing RDD events in humans, such events occur at a much lower frequency than originally proposed.
Thirty-two common variants associated with body mass index (BMI) have been identified in genome-wide association studies, explaining ∼1.45% of BMI variation in general population cohorts. We performed a genome-wide association study in a sample of young adults enriched for extremely overweight individuals. We aimed to identify new loci associated with BMI and to ascertain whether using an extreme sampling design would identify the variants known to be associated with BMI in general populations.
From two large Danish cohorts we selected all extremely overweight young men and women (n = 2,633), and equal numbers of population-based controls (n = 2,740, drawn randomly from the same populations as the extremes, representing ∼212,000 individuals). We followed up novel (at the time of the study) association signals (p<0.001) from the discovery cohort in a genome-wide study of 5,846 Europeans, before attempting to replicate the most strongly associated 28 SNPs in an independent sample of Danish individuals (n = 20,917) and a population-based cohort of 15-year-old British adolescents (n = 2,418). Our discovery analysis identified SNPs at three loci known to be associated with BMI with genome-wide confidence (p<5×10⁻⁸; FTO, MC4R and FAIM2). We also found strong evidence of association at the known TMEM18, GNPDA2, SEC16B, TFAP2B, SH2B1 and KCTD15 loci (p<0.001), and nominal association (p<0.05) at a further 8 loci known to be associated with BMI. However, meta-analyses of our discovery and replication cohorts identified no novel associations.
Our results indicate that the detectable genetic variation associated with extreme overweight is very similar to that previously found for general BMI. This suggests that population-based study designs with enriched sampling of individuals with the extreme phenotype may be an efficient method for identifying common variants that influence quantitative traits and a valid alternative to genotyping all individuals in large population-based studies, which may require tens of thousands of subjects to achieve similar power.
Over the past decade, attempts to explain the unusual size and prevalence of low-complexity regions (LCRs) in the proteins of the human malaria parasite Plasmodium falciparum have used both neutral and adaptive models. This past research has offered conflicting explanations for LCR characteristics and their role in, and influence on, the evolution of genome structure. Here we show that P. falciparum LCRs (PfLCRs) are not a single phenomenon, but rather consist of at least three distinct types of sequence, and this heterogeneity is the source of the conflict in the literature. Using molecular and population genetics, we show that these families of PfLCRs are evolving by different mechanisms. One of these families, named here the HighGC family, is of particular interest because these LCRs act as recombination hotspots, both in genes under positive selection for high levels of diversity which can be created by recombination (antigens) and those likely to be evolving neutrally or under negative selection (metabolic enzymes). We discuss how the discovery of these distinct species of PfLCRs helps to resolve previous contradictory studies on LCRs in malaria and contributes to our understanding of the evolution of the parasite's unusual genome.
Plasmodium falciparum; low-complexity regions; repeat sequences; genome evolution; recombination
In humans, chromosome-number abnormalities have been associated with altered recombination and increased maternal age. Therefore, age-related effects on recombination are of major importance, especially in relation to the mechanisms involved in human trisomies. Here, we examine the relationship between maternal age and recombination rate in humans. We localized crossovers at high resolution by using over 600,000 markers genotyped in a panel of 69 French-Canadian pedigrees, revealing recombination events in 195 maternal meioses. Overall, we observed the general patterns of variation in fine-scale recombination rates previously reported in humans. However, we make the first observation of a significant decrease in recombination rates with advancing maternal age in humans, likely driven by chromosome-specific effects. The effect appears to be localized in the middle section of chromosomal arms and near subtelomeric regions. We postulate that, for some chromosomes, protection against non-disjunction provided by recombination becomes less efficient with advancing maternal age, which can be partly responsible for the higher rates of aneuploidy in older women. We propose a model that reconciles our findings with reported associations between maternal age and recombination in cases of trisomies.
Aging is a genetically and environmentally modulated process. One particular manifestation of aging in humans is the age-related changes that affect the female reproductive system. It is well established that chromosome-number abnormalities in offspring occur more frequently as maternal age advances, but the meiotic mechanisms involved remain unclear. Meiotic recombination has been associated with maternal age in different species but contrasting effects of maternal age on recombination rates have been reported among mammals. In this study, we found a decrease of recombination rates with increasing maternal age in a French-Canadian cohort, with the most pronounced decline possibly occurring before 32 years of age. We observed chromosome-specific age effects, and in older women recombination frequencies are notably reduced in the middle portion of chromosomal arms and near subtelomeric regions. No paternal age effect on recombination was found, highlighting differences in patterns of variation among sexes. Many studies have shown significant inter-individual variation in genome-wide recombination rates, and our results point to an additional, intra-individual source of variation in recombination rates among transmissions from the same mother.
Pathogens have represented an important selective force during the adaptation of modern human populations to changing social and other environmental conditions. The evolution of the immune system has therefore been influenced by these pressures. Genomic scans have revealed that the immune system is one of the functions enriched with genes under adaptive selection.
Here, we describe how the innate immune system has responded to these challenges, through the analysis of resequencing data for 132 innate immunity genes in two human populations. Results are interpreted in the context of the functional and interaction networks defined by these genes. Nucleotide diversity is lower in the adaptors and modulators functional classes, and is negatively correlated with the centrality of the proteins within the interaction network. We also produced a list of candidate genes under positive or balancing selection in each population detected by neutrality tests and showed that some functional classes are preferential targets for selection.
We found evidence that the role of each gene in the network conditions its capacity to evolve, or evolvability: genes at the core of the network are more constrained, while adaptation mostly occurred at particular positions at the network edges. Interestingly, the functional classes containing most of the genes with signatures of balancing selection are involved in autoinflammatory and autoimmune diseases, suggesting a counterbalance between the beneficial and deleterious effects of the immune response.
The human malaria parasite Plasmodium falciparum survives pressures from the host immune system and antimalarial drugs by modifying its genome. Genetic recombination and nucleotide substitution are the two major mechanisms that the parasite employs to generate genome diversity. A better understanding of these mechanisms may provide important information for studying parasite evolution, immune evasion and drug resistance.
Here, we used a high-density tiling array to estimate the genetic recombination rate among 32 progeny of a P. falciparum genetic cross (7G8 × GB4). We detected 638 recombination events and constructed a high-resolution genetic map. Comparing genetic and physical maps, we obtained an overall recombination rate of 9.6 kb per centimorgan and identified 54 candidate recombination hotspots. Similar to centromeres in other organisms, the sequences of P. falciparum centromeres are found in chromosome regions largely devoid of recombination activity. Motifs enriched in hotspots were also identified, including a 12-bp G/C-rich motif with 3-bp periodicity that may interact with a protein containing 11 predicted zinc finger arrays.
These results show that the P. falciparum genome has a high recombination rate, although it also follows the overall rule of meiosis in eukaryotes with an average of approximately one crossover per chromosome per meiosis. GC-rich repetitive motifs identified in the hotspot sequences may play a role in the high recombination rate observed. The lack of recombination activity in centromeric regions is consistent with the observations of reduced recombination near the centromeres of other organisms.
Recombination varies greatly among species, as illustrated by the poor conservation of the recombination landscape between humans and chimpanzees. Thus, shorter evolutionary time frames are needed to understand the evolution of recombination. Here, we analyze its recent evolution in humans. We calculated the recombination rates between adjacent pairs of 636,933 common single-nucleotide polymorphism loci in 28 worldwide human populations and analyzed them in relation to genetic distances between populations. We found a strong and highly significant correlation between similarity in the recombination rates corrected for effective population size and genetic differentiation between populations. This correlation is observed at the genome-wide level, but also for each chromosome and when genetic distances and recombination similarities are calculated independently from different parts of the genome. Moreover, and more relevant, this relationship is robustly maintained when considering presence/absence of recombination hotspots. Simulations show that this correlation cannot be explained by biases in the inference of recombination rates caused by haplotype sharing among similar populations. This result indicates a rapid pace of evolution of recombination, within the time span of differentiation of modern humans.
One of the major virulence factors of the malaria-causing parasite is the Plasmodium falciparum-encoded erythrocyte membrane protein 1 (PfEMP1). It is translocated to the membrane of infected erythrocytes and expressed from approximately 60 var genes in a mutually exclusive manner. Switching of var genes allows the parasite to alter functional and antigenic properties of infected erythrocytes, to escape the immune defense and to establish chronic infections. We have developed an efficient method for isolating var genes from telomeric and other genome locations by adapting transformation-associated recombination (TAR) cloning; the isolated genes can then be analyzed and sequenced. For this purpose, three plasmids each containing a homologous sequence representing the upstream regions of the group A, B, and C var genes and a sequence homologous to the conserved acidic terminal segment (ATS) of var genes were generated. Co-transfection with P. falciparum strain ITG2F6 genomic DNA in yeast cells yielded 200 TAR clones. The relative frequencies of clones from each group were not biased. Clones were screened by PCR, as well as Southern blotting, which revealed clones missed by PCR due to sequence mismatches with the primers. Selected clones were transformed into E. coli and further analyzed by RFLP and end sequencing. Physical analysis of 36 clones revealed 27 distinct types potentially representing 50% of the var gene repertoire. Three clones were selected for sequencing and assembled into single var gene-containing contigs. This study demonstrates that it is possible to rapidly obtain the repertoire of var genes from P. falciparum within a single set of cloning experiments. This technique can be applied to individual isolates, which will provide a detailed picture of the diversity of var genes in the field. This is a powerful tool to overcome the obstacles to cloning and assembly of multi-gene families by simultaneously cloning each member.
Deep resequencing of functional regions in human genomes is key to identifying potentially causal rare variants for complex disorders. Here, we present the results from a large-sample resequencing (n = 285 patients) study of candidate genes coupled with population genetics and statistical methods to identify rare variants associated with Autism Spectrum Disorder and Schizophrenia. Three genes, MAP1A, GRIN2B, and CACNA1F, were consistently identified by different methods as having significant excess of rare missense mutations in either one or both disease cohorts. In a broader context, we also found that the overall site frequency spectrum of variation in these cases is best explained by population models of both selection and complex demography rather than neutral models or models accounting for complex demography alone. Mutations in the three disease-associated genes explained much of the difference in the overall site frequency spectrum among the cases versus controls. This study demonstrates that genes associated with complex disorders can be mapped using resequencing and analytical methods with sample sizes far smaller than those required by genome-wide association studies. Additionally, our findings support the hypothesis that rare mutations account for a proportion of the phenotypic variance of these complex disorders.
It is widely accepted that genetic factors play important roles in the etiology of neurological diseases. However, the nature of the underlying genetic variation remains unclear. Critical questions in the field of human genetics relate to the frequency and size effects of genetic variants associated with disease. For instance, the common disease–common variant model is based on the idea that sets of common variants explain a significant fraction of the variance found in common disease phenotypes. On the other hand, rare variants may have strong effects and therefore largely contribute to disease phenotypes. Due to their high penetrance and reduced fitness, such variants are maintained in the population at low frequencies, thus limiting their detection in genome-wide association studies. Here, we use a resequencing approach on a cohort of 285 Autism Spectrum Disorder and Schizophrenia patients and preformed several analyses, enhanced with population genetic approaches, to identify variants associated with both diseases. Our results demonstrate an excess of rare variants in these disease cohorts and identify genes with negative (deleterious) selection coefficients, suggesting an accumulation of variants of detrimental effects. Our results present further evidence for rare variants explaining a component of the genetic etiology of autism and schizophrenia.
Identification of somatic mutations in cancer is a major goal for understanding and monitoring the events related to cancer initiation and progression. High resolution melting (HRM) curve analysis represents a fast, post-PCR high-throughput method for scanning somatic sequence alterations in target genes. The aim of this study was to assess the sensitivity and specificity of HRM analysis for tumor mutation screening in a range of tumor samples, which included 216 frozen pediatric small rounded blue-cell tumors as well as 180 paraffin-embedded tumors from breast, endometrial and ovarian cancers (60 of each). HRM analysis was performed in exons of the following candidate genes known to harbor established, commonly observed mutations: PIK3CA, ERBB2, KRAS, TP53, EGFR, BRAF, GATA3, and FGFR3. Bi-directional sequencing analysis was used to determine the accuracy of the HRM analysis. For the 39 mutations observed in frozen samples, the sensitivity and specificity of HRM analysis were 97% and 87%, respectively. There were 67 mutations/variants in the paraffin-embedded samples, and the sensitivity and specificity of the HRM analysis were 88% and 80%, respectively. Paraffin-embedded samples require a higher quantity of purified DNA for high performance. In summary, HRM analysis is a promising moderate-throughput screening test for mutations among known candidate genomic regions. Although the overall accuracy appears to be better in frozen specimens, somatic alterations were detected in DNA extracted from paraffin-embedded samples.
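The sensitivity and specificity figures above follow the standard definitions relative to a reference call set (here, bi-directional sequencing). The sketch below shows the arithmetic only; the counts are hypothetical and chosen for illustration, since the abstract does not report the underlying confusion matrix.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts.

    tp: reference-positive amplicons flagged by the screen
    fn: reference-positive amplicons missed by the screen
    tn: reference-negative amplicons correctly passed
    fp: reference-negative amplicons flagged by the screen
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative reference call")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical counts, NOT the study's data: they only illustrate the formulas.
sens, spec = sensitivity_specificity(tp=38, fn=1, tn=87, fp=13)
print(f"sensitivity = {sens:.1%}, specificity = {spec:.1%}")
```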
Anopheles funestus is one of the primary vectors of human malaria, which causes a million deaths each year in sub-Saharan Africa. Few scientific resources are available to facilitate studies of this mosquito species and relatively little is known about its basic biology and evolution, making development and implementation of novel disease control efforts more difficult. The An. funestus genome has not been sequenced, so in order to facilitate genome-scale experimental biology, we have sequenced the adult female transcriptome of An. funestus from a newly founded colony in Burkina Faso, West Africa, using the Illumina GAIIx next generation sequencing platform.
We assembled short Illumina reads de novo using a novel approach involving iterative de novo assemblies and “target-based” contig clustering. We then selected a conservative set of 15,527 contigs through comparisons to four Dipteran transcriptomes as well as multiple functional and conserved protein domain databases. Comparison to the Anopheles gambiae immune system identified 339 contigs as putative immune genes, thus identifying a large portion of the immune system that can form the basis for subsequent studies of this important malaria vector. We identified 5,434 1∶1 orthologues between An. funestus and An. gambiae and found that, among these 1∶1 orthologues, the protein sequences of those with putative immune function were significantly more diverged than those of the transcriptome as a whole. Short read alignments to the contig set revealed almost 367,000 genetic polymorphisms segregating in the An. funestus colony and demonstrated the utility of the assembled transcriptome for use in RNA-seq based measurements of gene expression.
We developed a pipeline that makes de novo transcriptome sequencing possible in virtually any organism at a very reasonable cost ($6,300 in sequencing costs in our case). We anticipate that our approach could be used to develop genomic resources in a diversity of systems for which full genome sequence is currently unavailable. Our An. funestus contig set and analytical results provide a valuable resource for future studies in this non-model, but epidemiologically critical, vector insect.
Plasmodium falciparum entered the Peruvian Amazon in 1994, sparking an epidemic between 1995 and 1998. Since 2000, there has been sustained low P. falciparum transmission. The Malaria Immunology and Genetics in the Amazon project has longitudinally followed members of the community of Zungarococha (N = 1,945, 4 villages) with active household and health center-based visits each year since 2003. We examined parasite population structure and traced the parasite genetic diversity temporally and spatially. We genotyped infections over 5 years (2003–2007) using 14 microsatellite (MS) markers scattered across ten different chromosomes. Despite low transmission, there was considerable genetic diversity, which we compared with other geographic regions. We detected 182 different haplotypes from 302 parasites in 217 infections. Structure v2.2 identified five clusters (subpopulations) of phylogenetically related clones. To consider genetic diversity on a more detailed level, we defined haplotype families (hapfams) by grouping haplotypes that differed at three or fewer loci. We identified 34 different hapfams. The Fst statistic and heterozygosity analysis showed that the five clusters were maintained in each village throughout this time. A minimum spanning network (MSN), stratified by the year of detection, showed that haplotypes within hapfams had allele differences and that haplotypes within a cluster definition were more separated in the later years (2006–2007). We modeled hapfam detection and loss, accounting for sample size and stochastic fluctuations in frequencies over time. Principal component analysis of genetic variation revealed patterns of genetic structure with time rather than village. The population structure, genetic diversity, and appearance/disappearance of the different haplotypes from 2003 to 2007 provide a genome-wide “real-time” perspective of P. falciparum parasites in a low transmission region.
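The hapfam definition above (haplotypes differing at three or fewer loci) can be sketched as a simple grouping step. The abstract does not state how families are formed when haplotype A is close to B and B is close to C but A is not close to C; the sketch below assumes single-linkage grouping, which is an assumption and not necessarily the authors' procedure. The function names and the toy haplotypes are hypothetical.

```python
from itertools import combinations


def locus_differences(h1, h2):
    """Number of loci at which two equal-length haplotypes differ."""
    return sum(a != b for a, b in zip(h1, h2))


def group_hapfams(haplotypes, max_diff=3):
    """Group haplotypes into families by single linkage (assumed rule).

    Two haplotypes are linked when they differ at max_diff or fewer loci;
    families are the connected components of that link graph.
    Returns a sorted list of families, each a list of haplotype indices.
    """
    parent = list(range(len(haplotypes)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(haplotypes)), 2):
        if locus_differences(haplotypes[i], haplotypes[j]) <= max_diff:
            parent[find(i)] = find(j)

    families = {}
    for i in range(len(haplotypes)):
        families.setdefault(find(i), []).append(i)
    return sorted(families.values())


# Hypothetical 14-locus haplotypes (allele codes are arbitrary integers).
toy_haplotypes = [
    (1,) * 14,                    # index 0
    (1,) * 12 + (2, 2),           # index 1: 2 differences from index 0
    (1,) * 10 + (2, 2, 2, 2),     # index 2: 4 from index 0, 2 from index 1
    (5,) * 14,                    # index 3: differs everywhere
]
toy_families = group_hapfams(toy_haplotypes)
```

Under single linkage, indices 0, 1 and 2 fall into one family because index 1 bridges the other two; a stricter rule (for example, requiring every pair within a family to differ at three or fewer loci) would split them.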
malaria; genetic diversity; immunity; low transmission; Peru; microsatellite
Antimalarial drugs impose strong pressure on Plasmodium falciparum parasites and leave signatures of selection in the parasite genome [1,2]. Searching for signals of selection may lead to genes encoding drug or immune targets [3]. The lack of high-throughput genotyping methods, inadequate knowledge of parasite population history, and time-consuming adaptation of parasites to in vitro culture have hampered genome-wide association studies (GWAS) of parasite traits. Here we report genotyping of DNA from 189 culture-adapted P. falciparum parasites using a custom-built array with thousands of single nucleotide polymorphisms (SNPs). Population structure, variation in recombination rate, and loci under recent positive selection were detected. Parasite half-maximal inhibitory concentrations (IC50) for seven antimalarial drugs were obtained and used in GWAS to identify genes associated with drug responses. The SNP array and genome-wide parameters provide valuable tools and information for new advances in P. falciparum genetics.
malaria; single nucleotide polymorphism (SNP); genome-wide association study; recombination; drug resistance; population structure
Copy number variations (CNVs) are important causal genetic variations for human disease; however, the lack of a statistical model has impeded the systematic testing of CNVs associated with disease in large-scale cohorts.
Here, we developed a novel integrated strategy to test CNV association in genome-wide case-control studies. We converted the single-nucleotide polymorphism (SNP) signal to copy number states using a well-trained hidden Markov model. We mapped the susceptible CNV loci through SNP site-specific testing to cope with the physiological complexity of CNVs. We also ensured the credibility of the associated CNVs through further window-based CNV-pattern clustering. Genome-wide data for seven diseases were used to test our strategy and, in total, we identified 36 new susceptibility loci associated with CNVs for the seven diseases: 5 with bipolar disorder, 4 with coronary artery disease, 1 with Crohn's disease, 7 with hypertension, 9 with rheumatoid arthritis, 7 with type 1 diabetes and 3 with type 2 diabetes. Fifteen of these identified loci were validated through genotype association and physiological function reported in previous studies, which provides further confidence in our results. Notably, the genes associated with bipolar disorder converged on phosphoinositide/calcium signaling, a pathway well known to be affected in bipolar disorder, which further supports the view that CNVs have an impact on bipolar disorder.
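The step of converting SNP signal to copy number states with a hidden Markov model can be illustrated with a toy Viterbi decoder. This is a minimal sketch under stated assumptions, not the trained model used in the study: the three states, the Gaussian emission means and standard deviation, the transition probability, and the intensity-like input values are all hypothetical.

```python
import math

# Hypothetical model parameters (assumptions for illustration only).
STATES = ("loss", "normal", "gain")
MEANS = {"loss": -0.6, "normal": 0.0, "gain": 0.4}  # mean signal per state
SD = 0.2      # shared emission standard deviation
STAY = 0.98   # probability of keeping the same state at the next SNP


def log_emission(x, state):
    """Log density of a Gaussian emission for the given state."""
    z = (x - MEANS[state]) / SD
    return -0.5 * z * z - math.log(SD * math.sqrt(2 * math.pi))


def log_transition(prev, cur):
    """Log transition probability; switches are split evenly among other states."""
    if prev == cur:
        return math.log(STAY)
    return math.log((1 - STAY) / (len(STATES) - 1))


def viterbi(signal):
    """Most likely state path for a sequence of per-SNP signal values."""
    if not signal:
        return []
    # Uniform prior over states at the first SNP.
    score = {s: math.log(1 / len(STATES)) + log_emission(signal[0], s) for s in STATES}
    back = []
    for x in signal[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + log_transition(p, cur))
            new_score[cur] = (
                score[best_prev] + log_transition(best_prev, cur) + log_emission(x, cur)
            )
            pointers[cur] = best_prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical signal with a four-SNP dip in the middle.
toy_signal = [0.0, 0.05, -0.02, -0.6, -0.65, -0.55, -0.6, 0.01, 0.0]
toy_path = viterbi(toy_signal)
```

With these parameters the four-SNP dip is decoded as a run of "loss" flanked by "normal" calls; the high stay probability is what keeps an isolated noisy SNP from being called as a separate CNV.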
Our results demonstrated the effectiveness and robustness of our CNV-association analysis and provided an alternative avenue for discovering new associated loci of human diseases.