Signatures of natural selection occur throughout the human genome and can be detected at the sequence level. We have re-sequenced ABCE1, a host candidate gene essential for HIV-1 capsid assembly, in European- (n=23) and African-descent (Yoruban; n=24) reference populations for genetic variation discovery. We identified an excess of rare genetic variation in Yoruban samples, and the resulting Tajima’s D was low (−2.27). The trend of excess rare variation persisted in flanking candidate genes ANAPC10 and OTUD4, suggesting that this pattern of positive selection can be detected across the 184.5kb examined on chromosome 4. Because of ABCE1’s role in HIV-1 replication, we re-sequenced the candidate gene in three small cohorts of HIV-1-infected or resistant individuals. We were able to confirm the excess of rare genetic variation among HIV-1 positive African-American individuals (n=53; Tajima’s D = −2.34). These results highlight the potential importance of ABCE1’s role in infectious diseases such as HIV-1.
ABCE1; African-Americans; single nucleotide polymorphisms; HIV-1
Currently, there is very limited knowledge about the genes involved in normal pigmentation variation in East Asian populations. We carried out a genome-wide scan of signatures of positive selection using the 1000 Genomes Phase I dataset, in order to identify pigmentation genes showing putative signatures of selective sweeps in East Asia. We applied a broad range of methods to detect signatures of selection including: 1) Tests designed to identify deviations of the Site Frequency Spectrum (SFS) from neutral expectations (Tajima’s D, Fay and Wu’s H and Fu and Li’s D* and F*), 2) Tests focused on the identification of high-frequency haplotypes with extended linkage disequilibrium (iHS and Rsb) and 3) Tests based on genetic differentiation between populations (LSBL). Based on the results obtained from a genome wide analysis of 25 kb windows, we constructed an empirical distribution for each statistic across all windows, and identified pigmentation genes that are outliers in the distribution.
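As a rough illustration of the first class of tests, the sketch below computes Tajima's D for a small set of aligned sequences using the standard constants from Tajima (1989). The function name `tajimas_d` and the toy alignments are assumptions made for illustration; they are not the pipeline used in the study, which worked on 25 kb windows of the 1000 Genomes data.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(seqs):
    """Tajima's D for equal-length aligned sequences (hypothetical helper)."""
    n = len(seqs)
    if n < 4:
        raise ValueError("need at least 4 sequences")
    length = len(seqs[0])
    # S: number of segregating (polymorphic) sites
    S = sum(1 for i in range(length) if len({s[i] for s in seqs}) > 1)
    if S == 0:
        raise ValueError("no segregating sites; D is undefined")
    # pi: mean number of pairwise differences
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    # Constants from Tajima (1989)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Difference between the two estimators of theta, scaled by its standard deviation
    return (pi - S / a1) / sqrt(e1 * S + e2 * S * (S - 1))


# Toy alignments: singletons only (excess of rare variants) vs. two balanced haplotypes
rare = ["AAAA", "TAAA", "ATAA", "AATA"]
balanced = ["AAA", "AAA", "TTT", "TTT"]
print(tajimas_d(rare), tajimas_d(balanced))
```

An excess of rare variants pulls D below zero, while intermediate-frequency variants push it above zero, which is the contrast the SFS-based tests exploit.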
Our tests identified twenty genes that are relevant for pigmentation biology. Of these, eight genes (ATRN, EDAR, KLHL7, MITF, OCA2, TH, TMEM33 and TRPM1) were extreme outliers (top 0.1% of the empirical distribution) for at least one statistic, and twelve genes (ADAM17, BNC2, CTSD, DCT, EGFR, LYST, MC1R, MLPH, OPRM1, PDIA6, PMEL (SILV) and TYRP1) were in the top 1% of the empirical distribution for at least one statistic. Additionally, eight of these genes (BNC2, EGFR, LYST, MC1R, OCA2, OPRM1, PMEL (SILV) and TYRP1) have been associated with pigmentary traits in association studies.
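The outlier approach described above can be sketched as a simple ranking of windows by a statistic. The helper `empirical_outliers`, the window labels, and the scores below are hypothetical. The sketch also assumes that larger values are more extreme; for statistics where the lower tail is of interest, the values could be negated first.

```python
def empirical_outliers(scores, fraction):
    """Return the window names in the top `fraction` of the empirical distribution.

    `scores` maps a window name to the value of one statistic (hypothetical input).
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Rank windows from most to least extreme
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Keep at least one window; the small epsilon guards against float rounding
    k = max(1, int(len(ranked) * fraction + 1e-9))
    return ranked[:k]


# Toy genome of 1000 windows with made-up scores
scores = {f"w{i}": float(i) for i in range(1000)}
extreme = empirical_outliers(scores, 0.001)  # top 0.1%
top_one_percent = empirical_outliers(scores, 0.01)  # top 1%
print(extreme, len(top_one_percent))
```

Because the thresholds come from the genome-wide distribution itself, this flags unusual windows without requiring an explicit demographic model, at the cost of not providing a formal p-value.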
We identified a number of putative pigmentation genes showing extremely unusual patterns of genetic variation in East Asia. Most of these genes are outliers for different tests and/or different populations, and have already been described in previous scans for positive selection, providing strong support to the hypothesis that recent selective sweeps left a signature in these regions. However, it will be necessary to carry out association and functional studies to demonstrate the implication of these genes in normal pigmentation variation.
Nucleotide polymorphism at 12 nuclear loci was studied in Scots pine populations across an environmental gradient in Scotland, to evaluate the impacts of demographic history and selection on genetic diversity. At eight loci, diversity patterns were compared between Scottish and continental European populations. At these loci, a similar level of diversity (θsil=∼0.01) was found in Scottish vs mainland European populations, contrary to expectations for recent colonization; however, less rapid decay of linkage disequilibrium was observed in the former (ρ=0.0086±0.0009, ρ=0.0245±0.0022, respectively). Scottish populations also showed a deficit of rare nucleotide variants (multi-locus Tajima's D=0.316 vs D=−0.379) and differed significantly from mainland populations in allelic frequency and/or haplotype structure at several loci. Within Scotland, western populations showed slightly reduced nucleotide diversity (πtot=0.0068) compared with those from the south and east (0.0079 and 0.0083, respectively) and about three times higher recombination to diversity ratio (ρ/θ=0.71 vs 0.15 and 0.18, respectively). By comparison with results from coalescent simulations, the observed allelic frequency spectrum in the western populations was compatible with a relatively recent bottleneck (0.00175 × 4Ne generations) that reduced the population to about 2% of the present size. However, heterogeneity in the allelic frequency distribution among geographical regions in Scotland suggests that subsequent admixture of populations with different demographic histories may also have played a role.
adaptation; bottleneck; nucleotide diversity; population differentiation; linkage disequilibrium; recolonization
Genome-wide patterns of diversity and selection are critical measures for understanding how evolution has shaped the genome. Yet, these population genomic estimates are available for only a limited number of model organisms. Here we focus on the population genomics of the pea aphid (Acyrthosiphon pisum). The pea aphid is an emerging model system that exhibits a range of intriguing biological traits not present in classic model systems. We performed low-coverage genome resequencing of 21 clonal pea aphid lines collected from alfalfa host plants in North America to characterize genome-wide patterns of diversity and selection. We observed an excess of low-frequency polymorphisms throughout coding and noncoding DNA, which we suggest is the result of a founding event and subsequent population expansion in North America. Most gene regions showed lower levels of Tajima’s D than synonymous sites, suggesting that the majority of the genome is not evolving neutrally but rather exhibits significant constraint. Furthermore, we used the pea aphid’s unique manner of X-chromosome inheritance to assign genomic scaffolds to either autosomes or the X chromosome. Comparing autosomal vs. X-linked sequence variation, we discovered that autosomal genes show an excess of low frequency variants indicating that purifying selection acts more efficiently on the X chromosome. Overall, our results provide a critical first step in characterizing the genetic diversity and evolutionary pressures on an aphid genome.
pea aphid; Acyrthosiphon; population genomics; sex chromosome; selection
Most species have at least some level of genetic structure. Recent simulation studies have shown that it is important to consider population structure when sampling individuals to infer past population history. The relevance of the results of these computer simulations for empirical studies, however, remains unclear. In the present study, we use DNA sequence datasets collected from two closely related species with very different histories, the selfing species Capsella rubella and its outcrossing relative C. grandiflora, to assess the impact of different sampling strategies on summary statistics and the inference of historical demography. Sampling strategy did not strongly influence the mean values of Tajima’s D in either species, but it had some impact on the variance. The general conclusions about demographic history were comparable across sampling schemes even when resampled data were analyzed with approximate Bayesian computation (ABC). We used simulations to explore the effects of sampling scheme under different demographic models. We conclude that when sequences from modest numbers of loci (<60) are analyzed, the sampling strategy is generally of limited importance. The same is true under intermediate or high levels of gene flow (4Nm > 2–10) in models in which global expansion is combined with either local expansion or hierarchical population structure. Although we observe a less severe effect of sampling than predicted under some earlier simulation models, our results should not be seen as an encouragement to neglect this issue. In general, a good coverage of the natural range, both within and between populations, will be needed to obtain a reliable reconstruction of a species’s demographic history, and in fact, the effect of sampling scheme on polymorphism patterns may itself provide important information about demographic history.
population structure; Tajima’s D; frequency spectrum; Capsella
The Sonoda–Tajima Cell Collection includes cell samples obtained from a range of ethnic minority groups across the world but in particular from South America. The collection is made all the more valuable by the fact that some of these ethnic populations have since died out, and thus it will be impossible to prepare a similar cell collection again. The collection was donated to our institute, a public cell bank in Japan, by Drs Sonoda and Tajima to make it available to researchers throughout the world. The original cell collection was composed of cryopreserved peripheral blood samples that would obviously have been rapidly exhausted if used directly. We, therefore, immortalized some samples with the Epstein–Barr virus and established B-lymphoblastoid cell lines (B-LCLs). As there is continuing controversy over whether the B-LCL genome is stably maintained, we performed an array comparative genomic hybridization (CGH) analysis to confirm the genomic stability of the cell lines. The array CGH analysis of the B-LCL lines and their parental B cells demonstrated that genomic stability was maintained in the long-term cell cultures. The B-LCLs of the Sonoda–Tajima Collection will therefore be made available to interested scientists around the world. At present, 512 B-LCLs have been developed, and we are willing to increase the number if there is sufficient demand.
Amerind; minority group; B-LCL; array CGH
Next-generation sequencing technologies now make it possible to genotype and measure hundreds of thousands of rare genetic variations in individuals across the genome. Characterization of high-density genetic variation facilitates control of population genetic structure on a finer scale before large-scale genotyping in disease genetics studies. Population structure is a well-known, prevalent, and important factor in common variant genetic studies, but its relevance in rare variants is unclear. We perform an extensive population structure analysis using common and rare functional variants from the Genetic Analysis Workshop 17 mini-exome sequence. The analysis based on common functional variants required 388 principal components to account for 90% of the variation in population structure. However, an analysis based on rare variants required 532 significant principal components to account for similar levels of variation. Using rare variants, we detected fine-scale substructure beyond the population structure identified using common functional variants. Our results show that the level of population structure embedded in rare variant data is different from the level embedded in common variant data and that correcting for population structure is only as good as the level one wishes to correct.
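The comparison of how many principal components are needed to reach 90% of the variation can be illustrated with a small helper that works on eigenvalues. The function name and the toy eigenvalue spectra are assumptions for illustration, not the workshop data or the authors' code.

```python
def components_for_variance(eigenvalues, target=0.90):
    """Smallest number of leading components whose eigenvalues reach `target` of the total."""
    total = sum(eigenvalues)
    running = 0.0
    for k, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        running += value
        if running / total >= target:
            return k
    return len(eigenvalues)


# Toy spectra: a few dominant axes vs. a flat spectrum
steep = [5, 3, 1, 1]
flat = [1] * 10
print(components_for_variance(steep), components_for_variance(flat))
```

A flatter eigenvalue spectrum needs more components to reach the same threshold, which is one way to read the larger component count reported for rare variants.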
Next-generation sequencing technologies are rapidly changing the field of genetic epidemiology and enabling exploration of the full allele frequency spectrum underlying complex diseases. Although sequencing technologies have shifted our focus toward rare genetic variants, statistical methods traditionally used in genetic association studies are inadequate for estimating effects of low minor allele frequency variants. For our study, we use the Genetic Analysis Workshop 17 data from 697 unrelated individuals (genotypes for 24,487 autosomal variants from 3,205 genes). We apply a Bayesian hierarchical mixture model to identify genes associated with a simulated binary phenotype using a transformed genotype design matrix weighted by allele frequencies. A Metropolis-Hastings algorithm is used to jointly sample each indicator variable and additive genetic effect pair from its conditional posterior distribution, and remaining parameters are sampled by Gibbs sampling. This method identified 58 genes with a posterior probability greater than 0.8 for being associated with the phenotype. One of these 58 genes, PIK3C2B, was correctly identified as being associated with affected status based on the simulation process. This project demonstrates the utility of Bayesian hierarchical mixture models using a transformed genotype matrix to detect genes containing rare and common variants associated with a binary phenotype.
The data set simulated for Genetic Analysis Workshop 17 was designed to mimic a subset of data that might be produced in a full exome screen for a complex disorder and related risk factors in order to permit workshop participants to investigate issues of study design and statistical genetic analysis. Real sequence data from the 1000 Genomes Project formed the basis for simulating a common disease trait with a prevalence of 30% and three related quantitative risk factors in a sample of 697 unrelated individuals and a second sample of 697 individuals in large, extended pedigrees. Called genotypes for 24,487 autosomal markers assigned to 3,205 genes and simulated affection status, quantitative traits, age, sex, pedigree relationships, and cigarette smoking were provided to workshop participants. The simulating model included both common and rare variants with minor allele frequencies ranging from 0.07% to 25.8% and a wide range of effect sizes for these variants. Genotype-smoking interaction effects were included for variants in one gene. Functional variants were concentrated in genes selected from specific biological pathways and were selected on the basis of the predicted deleteriousness of the coding change. For each sample, unrelated individuals and family, 200 replicates of the phenotypes were simulated.
Intron 5 of the LMBR1 gene is the cis-acting regulatory module for the sonic hedgehog (SHH) gene. Mutations in this non-coding region are associated with preaxial polydactyly and may play crucial roles in the evolution of the limb and skeletal system.
We sequenced a region of LMBR1 intron 5 in an East Asian human population and found a significant deviation of Tajima's D from neutrality when human population growth was taken into account. Data from HapMap also demonstrated extended linkage disequilibrium in the region in East Asian and European populations, and a significantly low degree of genetic differentiation among human populations.
We propose that intron 5 of LMBR1 was presumably subject to balancing selection during the evolution of modern humans.
The site frequency spectrum of mutations (SFS) and linkage disequilibrium (LD) are the two major sources of information in population genetics studies. In this study we focus on the levels of LD and the SFS and on the effect of sample size on summary statistics in 10 Scandinavian populations of Norway spruce. We found that previous estimates of a low level of LD were highly influenced by both sampling strategy and the fact that data from multiple loci were analyzed jointly. Estimates of LD were in fact heterogeneous across loci and increased within individual populations compared with the estimate from the total data. The variation in levels of LD among populations most likely reflects different demographic histories, although we were unable to detect population structure by using standard approaches. As in previous studies, we also found that the SFS-based test Tajima’s D was highly sensitive to sample size, revealing that care should be taken to draw strong conclusions from this test when sample size is small. In conclusion, the results from this study are in line with recent studies in other conifers that have revealed a more complex and variable pattern of LD than earlier studies suggested and with studies in trees and humans that suggest that Tajima’s D is sensitive to sample size. This has large consequences for the design of future association and population genetic studies in Norway spruce.
linkage disequilibrium; conifer; recombination; Tajima’s D; resampling
Both common variants and rare variants are involved in the etiology of most complex diseases in humans. Developments in sequencing technology have led to the identification of a high density of rare variant single-nucleotide polymorphisms (SNPs) on the genome, each of which affects at most 1% of the population. Genotypes derived from these SNPs allow one to study the involvement of rare variants in common human disorders. Here, we propose an association screening approach that treats genes as units of analysis. SNPs within a gene are used to create partitions of individuals, and inverse-probability weighting is used to overweight genotypic differences observed on rare variants. Association between a phenotypic trait and the constructed partition is then evaluated. We consider three association tests (one-way ANOVA, chi-square test, and the partition retention method) and compare these strategies using the simulated data from the Genetic Analysis Workshop 17. Several genes that contain causal SNPs were identified by the proposed method as top genes.
Taenia saginata is the most common human Taenia in Thailand. Based on cox1 sequences, 73 isolates from four localities in the north and northeast were differentiated into 14 haplotypes, with 11 variable sites and a haplotype diversity of 0.683. Among the 14 haplotypes, haplotype A was the most frequent (52.1%), followed by haplotype B (21.9%). A clustering diagram of Thai and GenBank sequences indicated mixed phylogeny among localities. By MJ analysis, haplotype clustering relationships showed a paired-star-like network, with two main cores surrounded by minor haplotypes. Tajima's D values were significantly negative in the T. saginata world population, suggesting population expansion. Significant Fu's Fs values in the Thai population, as well as the world population, also indicate that the population is expanding and may be hitchhiking as part of a selective sweep. Haplotype B and its dispersion were found only in populations from Thailand. Haplotype B may evolve and ultimately become an ancestor of future populations in Thailand. Haplotype A seems to be a dispersion haplotype, not just in Thailand but worldwide. High intraspecific genetic divergence was found in T. saginata, in contrast to its sister species, T. asiatica; among 30 T. asiatica samples from seven countries, only 2 haplotypes were revealed and haplotype diversity was 0.067. This extremely low intraspecific variation suggests that T. asiatica could be an endangered species.
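For readers unfamiliar with haplotype diversity, the sketch below uses Nei's unbiased estimator, h = n/(n−1) × (1 − Σ p_i²). The abstract does not state which estimator was used or the per-haplotype counts, so both the formula choice and the 29:1 split below are assumptions; that split is simply one that is consistent with the reported value of 0.067 for 30 samples and 2 haplotypes.

```python
def haplotype_diversity(counts):
    """Nei's unbiased haplotype diversity from a list of per-haplotype sample counts."""
    n = sum(counts)
    if n < 2:
        raise ValueError("need at least 2 sampled sequences")
    # Sum of squared haplotype frequencies
    homozygosity = sum((c / n) ** 2 for c in counts)
    return n / (n - 1) * (1 - homozygosity)


# Hypothetical split of 30 samples into 2 haplotypes (29 and 1)
print(round(haplotype_diversity([29, 1]), 3))  # 0.067
```

With one haplotype carried by nearly every sample, the estimator stays close to zero, whereas many haplotypes at moderate frequencies push it toward one.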
Pinna nobilis is the largest endemic Mediterranean marine bivalve. During past centuries, various human activities have promoted the regression of its populations. As a consequence of stringent standards of protection, demographic expansions are currently reported in many sites. The aim of this study was to provide the first large broad-scale insight into the genetic variability of P. nobilis in the area that encompasses the western Mediterranean, Ionian Sea, and Adriatic Sea marine ecoregions. To accomplish this objective twenty-five populations from this area were surveyed using two mitochondrial DNA markers (COI and 16S). Our dataset was then merged with those obtained in other studies for the Aegean and Tunisian populations (eastern Mediterranean), and statistical analyses (Bayesian model-based clustering, median-joining network, AMOVA, mismatch distribution, Tajima’s and Fu’s neutrality tests and Bayesian skyline plots) were performed. The results revealed genetic divergence among three distinguishable areas: (1) western Mediterranean and Ionian Sea; (2) Adriatic Sea; and (3) Aegean Sea and Tunisian coastal areas. From a conservational point of view, populations from the three genetically divergent groups found may be considered as different management units.
Identifying rare variants that are responsible for complex disease has been promoted by advances in sequencing technologies. However, statistical methods that can handle the vast amount of data generated and that can interpret the complicated relationship between disease and these variants have lagged. We apply a zero-inflated Poisson regression model to take into account the excess of zeros caused by the extremely low frequency of the 24,487 exonic variants in the Genetic Analysis Workshop 17 data. We grouped the 697 subjects in the data set as Europeans, Asians, and Africans based on principal components analysis and found the total number of rare variants per gene for each individual. We then analyzed these collapsed variants based on the assumption that rare variants are enriched in a group of people affected by a disease compared to a group of unaffected people. We also tested the hypothesis with quantitative traits Q1, Q2, and Q4. Analyses performed on the combined 697 individuals and on each ethnic group yielded different results. For the combined population analysis, we found that UGT1A1, which was not part of the simulation model, was associated with disease liability and that FLT1, which was a causal locus in the simulation model, was associated with Q1. Of the causal loci in the simulation models, FLT1 and KDR were associated with Q1 and VNN1 was correlated with Q2. No significant genes were associated with Q4. These results show the feasibility and capability of our new statistical model to detect multiple rare variants influencing disease risk.
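The zero-inflated Poisson distribution underlying such a regression can be written down compactly: with probability π₀ a count is a structural zero, and otherwise it follows a Poisson distribution with mean λ. The sketch below shows only this probability mass function, not the regression fit itself; the function name and parameter values are illustrative assumptions.

```python
from math import exp, factorial


def zip_pmf(k, lam, pi0):
    """P(X = k) under a zero-inflated Poisson with mean `lam` and extra-zero weight `pi0`."""
    poisson = exp(-lam) * lam ** k / factorial(k)
    if k == 0:
        # Zeros come from the structural-zero component or from the Poisson itself
        return pi0 + (1 - pi0) * poisson
    return (1 - pi0) * poisson


# Illustrative values: low mean count with 30% structural zeros
lam, pi0 = 0.5, 0.3
print(zip_pmf(0, lam, pi0), exp(-lam))
```

The first printed value is larger than the plain Poisson probability of zero, which is how the model absorbs the excess of zero counts produced by very rare variants.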
A comparative population genetics study revealed high levels of nucleotide polymorphism and intermediate-frequency alleles in an arcC gene of Staphylococcus epidermidis, but not in a homologous gene of the more aggressive human pathogen, Staphylococcus aureus. Further investigation showed that the arcC genes used in the multilocus sequence typing schemes of these two species were paralogs. Phylogenetic analyses of arcC-containing loci, including the arginine catabolic mobile element, from both species, suggested that these loci had an eventful history involving gene duplications, rearrangements, deletions, and horizontal transfers. The peak signatures in the polymorphic S. epidermidis locus were traced to an arcD-like gene adjacent to arcC; these signatures consisted of unusually elevated Tajima’s D and π/K ratios, which were robust to assumptions about recombination and species divergence time and among the most elevated in the S. epidermidis genome. Amino acid polymorphisms, including one that differed in polarity and hydropathy, were located in the peak signatures and defined two allelic lineages. Recombination events were detected between these allelic lineages and potential donors and recipients of S. epidermidis were identified in each case. By comparison, the orthologous gene of S. aureus showed no unusual signatures. The ArcD-like protein belonged to the unknown ion transporter 3 family and appeared to be unrelated to ArcD from the arginine deiminase pathway. These studies report the first comparative population genetics results for staphylococci and the first statistical evidence for a candidate target of balancing selection in S. epidermidis.
Staphylococcus aureus; Staphylococcus epidermidis; Population genetics; Balancing selection; Genetic hitchhiking; Approximate Bayesian computation
The mesic habitats of eastern Australia harbour a highly diverse fauna. We examined the impact of climatic oscillations and recognised biogeographic barriers on the evolutionary history of the delicate skink (Lampropholis delicata), a species that occurs in moist habitats throughout eastern Australia. The delicate skink is a common and widespread species whose distribution spans 26° of latitude and nine major biogeographic barriers in eastern Australia. Sequence data were obtained from four mitochondrial genes (ND2, ND4, 12SrRNA, 16SrRNA) for 238 individuals from 120 populations across the entire native distribution of the species. The evolutionary history and diversification of the delicate skink was investigated using a range of phylogenetic (Maximum Likelihood, Bayesian) and phylogeographic analyses (genetic diversity, ΦST, AMOVA, Tajima's D, Fu's F statistic).
Nine geographically structured, genetically divergent clades were identified within the delicate skink. The main clades diverged during the late Miocene-Pliocene, coinciding with the decline and fragmentation of rainforest and other wet forest habitats in eastern Australia. Most of the phylogeographic breaks within the delicate skink were concordant with dry habitat or high elevation barriers, including several recognised biogeographic barriers in eastern Australia (Burdekin Gap, St Lawrence Gap, McPherson Range, Hunter Valley, southern New South Wales). Genetically divergent populations were also located in high elevation topographic isolates inland from the main range of L. delicata (Kroombit Tops, Blackdown Tablelands, Coolah Tops). The species colonised South Australia from southern New South Wales via an inland route, possibly along the Murray River system. There is evidence for recent expansion of the species range across eastern Victoria and into Tasmania, via the Bassian Isthmus, during the late Pleistocene.
The delicate skink is a single widespread, but genetically variable, species. This study provides the first detailed phylogeographic investigation of a widespread species whose distribution spans virtually all of the major biogeographic barriers in eastern Australia.
New high-throughput sequencing technologies have brought forth opportunities for unbiased analysis of thousands of rare genomic variants in genome-wide association studies of complex diseases. Because it is hard to detect single rare variants with appreciable effect sizes at the population level, existing methods mostly aggregate effects of multiple markers by collapsing the rare variants in genes (or genomic regions). We hypothesize that a higher level of aggregation can further improve association signal strength. Using the Genetic Analysis Workshop 17 simulated data, we test a two-step strategy that first applies a collapsing method in a gene-level analysis and then aggregates the gene-level test results by performing an enrichment analysis in gene sets. We find that the gene set approach which combines signals across multiple genes outperforms testing individual genes separately and that the power of the gene set enrichment test is further improved by proper adjustment of statistics to account for gene-wise differences.
Balancing selection is common on many defense genes, but it has rarely been reported for immune effector proteins such as antimicrobial peptides (AMPs). We describe genetic diversity at a brevinin-1 AMP locus in three species of leopard frogs (Rana pipiens, Rana blairi, and Rana palustris). Several highly divergent allelic lineages are segregating at this locus. That this unusual pattern results from balancing selection is demonstrated by multiple lines of evidence, including a ratio of nonsynonymous/synonymous polymorphism significantly higher than 1, the ZnS test, incongruence between the number of segregating sites and haplotype diversity, and significant Tajima's D values. Our data are more consistent with a model of fluctuating selection in which alleles change frequencies over time than with a model of stable balancing selection such as overdominance. Evidence for fluctuating selection includes skewed allele frequencies, low levels of synonymous variation, nonneutral values of Tajima's D within allelic lineages, an inverse relationship between the frequency of an allelic lineage and its degree of polymorphism, and divergent allele frequencies among populations. AMP loci could be important sites of adaptive genetic diversity, with consequences for host–pathogen coevolution and the ability of species to resist disease epidemics.
Rana pipiens; Rana blairi; Rana palustris; antimicrobial peptide; balancing selection; fluctuating selection
The cat (Felis silvestris catus) shows significant variation in pelage, morphological, and behavioral phenotypes amongst its over 40 domesticated breeds. The majority of the breed specific phenotypic presentations originated through artificial selection, especially on desired novel phenotypic characteristics that arose only a few hundred years ago. Variations in coat texture and color of hair often delineate breeds amongst domestic animals. Although the genetic basis of several feline coat colors and hair lengths is characterized, less is known about the genes influencing variation in coat growth and texture, especially rexoid (curly-coated) types. Cornish Rex is a cat breed defined by a fixed recessive curly coat trait. Genome-wide analyses for selection (di, Tajima’s D and nucleotide diversity) were performed in the Cornish Rex breed and in 11 phenotypically diverse breeds and two random bred populations. Approximately 63K SNPs were used in the analysis that aimed to localize the locus controlling the rexoid hair texture. A region with a strong signature of recent selective sweep was identified in the Cornish Rex breed on chromosome A1, as well as a consensus block of homozygosity that spans approximately 3 Mb. Inspection of the region for candidate genes led to the identification of the lysophosphatidic acid receptor 6 (LPAR6). A 4 bp deletion in exon 5, c.250_253_delTTTG, which induces a premature stop codon in the receptor, was identified via Sanger sequencing. The mutation is fixed in Cornish Rex, absent in all straight haired cats analyzed, and is also segregating in the German Rex breed. LPAR6 encodes a G protein-coupled receptor essential for maintaining the structural integrity of the hair shaft and has mutations resulting in a woolly hair phenotype in humans.
The phenomenon of synthetic association raises the possibility that common variant genetic markers may be coupled with functional rare variants sufficiently often to allow the rare variants to be tagged by the common ones. Using human exome sequence data from the 1000 Genomes Project, two investigative teams in Group 12 of Genetic Analysis Workshop 17 found that stochastic coupling between rare and common variants does occur, although perhaps not sufficiently often that we can expect common variant signals to reflect synthetic association; other teams considered methods for detecting association using both rare and common variants. Common themes were that synthetic association is more apparent in population strata (ancestral or familial) and that careful selection of the unit of analysis (gene, gene network, or other genomic subset) is likely to be crucial to the discovery of rare variants that contribute to risk of disease.
synthetic association; rare variants; association; identity by state
The Genetic Analysis Workshop 17 data we used comprise 697 unrelated individuals genotyped at 24,487 single-nucleotide polymorphisms (SNPs) from a mini-exome scan, using real sequence data for 3,205 genes annotated by the 1000 Genomes Project and simulated phenotypes. We studied 200 sets of simulated phenotypes of trait Q2. An important feature of this data set is that most SNPs are rare, with 87% of the SNPs having a minor allele frequency less than 0.05. For rare SNP detection, in this study we performed a least absolute shrinkage and selection operator (LASSO) regression and F tests at the gene level and calculated the generalized degrees of freedom to avoid any selection bias. For comparison, we also carried out linear regression and the collapsing method, which sums the rare SNPs, modified for a quantitative trait and with two different allele frequency thresholds. The aim of this paper is to evaluate these four approaches in this mini-exome data and compare their performance in terms of power and false positive rates. In most situations the LASSO approach is more powerful than linear regression and collapsing methods. We also note the difficulty in determining the optimal threshold for the collapsing method and the significant role that linkage disequilibrium plays in detecting rare causal SNPs. If a rare causal SNP is in strong linkage disequilibrium with a common marker in the same gene, power will be much improved.
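The collapsing method mentioned above, which sums rare SNPs within a gene, can be sketched as follows. The sketch assumes genotypes are coded as minor-allele counts (0, 1, or 2) for one gene; the function name, the default threshold of 0.05, and the toy matrix are illustrative assumptions rather than the study's settings.

```python
def collapse_rare(genotypes, maf_threshold=0.05):
    """Per-individual burden: sum of minor-allele counts over SNPs with MAF below the threshold.

    `genotypes` is a list of rows (individuals), each a list of 0/1/2 minor-allele counts.
    """
    n = len(genotypes)
    n_snps = len(genotypes[0])
    # Minor allele frequency per SNP, estimated from the sample itself
    maf = [sum(row[j] for row in genotypes) / (2 * n) for j in range(n_snps)]
    rare = [j for j in range(n_snps) if maf[j] < maf_threshold]
    return [sum(row[j] for j in rare) for row in genotypes]


# Toy gene: SNP 0 is common (MAF 0.25); SNPs 1 and 2 are each carried by one individual
genotypes = [
    [1 if i < 10 else 0, 1 if i == 0 else 0, 1 if i == 15 else 0]
    for i in range(20)
]
burden = collapse_rare(genotypes)
print(burden)
```

The resulting burden score can then serve as a single predictor for a quantitative trait such as Q2, and rerunning with a different `maf_threshold` shows how sensitive the score is to that choice.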
A region of approximately one megabase of human Chromosome 12 shows extensive linkage disequilibrium in Utah residents with ancestry from northern and western Europe. This strikingly large linkage disequilibrium block was analyzed with statistical and experimental methods to determine whether natural selection could be implicated in shaping the current genome structure. Extended Haplotype Homozygosity and Relative Extended Haplotype Homozygosity analyses of this region mapped the core of the most strongly conserved haplotype to exon 1 of the Spinocerebellar ataxia type 2 gene (SCA2). Direct DNA sequencing of this region of the SCA2 gene revealed a significant association between a pre-expanded allele [(CAG)8CAA(CAG)4CAA(CAG)8] of CAG repeats within exon 1 and the selected haplotype of the SCA2 gene. A significantly negative Tajima's D value (−2.20, p < 0.01) at this site was consistent with selection on the CAG repeat. This region was also investigated in the three other populations, none of which showed signs of selection. These results suggest that recent positive selection of the pre-expansion SCA2 CAG repeat has occurred in Utah residents with European ancestry.
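For readers unfamiliar with the statistic, the sketch below computes Tajima's D from a small haplotype matrix using the standard formula (Tajima, 1989). It assumes phased, biallelic data coded 0/1 with one row per chromosome. The function name and the toy inputs are hypothetical, and this is not the pipeline used in the study.

```python
from math import sqrt


def tajimas_d(haplotypes):
    """Tajima's D for a list of equal-length 0/1 haplotypes (one per chromosome)."""
    n = len(haplotypes)
    if n < 4:
        raise ValueError("the variance term is zero for fewer than 4 chromosomes")
    n_sites = len(haplotypes[0])
    allele_counts = [sum(h[j] for h in haplotypes) for j in range(n_sites)]
    segregating = [k for k in allele_counts if 0 < k < n]
    s = len(segregating)
    if s == 0:
        raise ValueError("no segregating sites")

    # Mean number of pairwise differences (pi), summed over sites.
    pairs = n * (n - 1) / 2
    pi = sum(k * (n - k) for k in segregating) / pairs

    # Constants from Tajima (1989).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


# Toy inputs (hypothetical): 10 chromosomes, 5 sites.
# Every variant is a singleton: an excess of rare alleles.
singletons = [[1 if i == j else 0 for j in range(5)] for i in range(10)]
# Every variant sits at 50% frequency: an excess of intermediate alleles.
balanced = [[1] * 5 if i < 5 else [0] * 5 for i in range(10)]

d_rare = tajimas_d(singletons)
d_balanced = tajimas_d(balanced)
```

An excess of rare variants pushes D below zero, as in the singleton example, while an excess of intermediate-frequency variants pushes it above zero.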
Natural selection ultimately acts on the genetic variants existing among human populations, and the selective force leaves “footprints” behind in the human genome. In this study, Yu et al. identified an extremely large region on Chromosome 12 that is under positive selection in Utah residents with European ancestry by characterizing the correlation patterns of genomic variants. Further analyses of this interval suggested that selection centered on one of the many forms of the Spinocerebellar ataxia type 2 (SCA2) gene. The selected form was then shown to be associated with one short version of the disease-causing CAG repeat in the SCA2 gene. These results suggest that the CAG repeat was positively selected. An abnormally long run of CAGs can cause SCA2, a neurodegenerative disease that severely impairs body movement. The authors showed how they unraveled natural selection acting on the SCA2 gene. Their findings might lead to the discovery of the biological functions of this gene and its CAG repeat. This kind of study has the potential to facilitate the discovery of common disease genes.
The common disease/rare variant hypothesis predicts that rare variants with large effects will have a strong impact on corresponding phenotypes. Therefore it is assumed that rare functional variants are enriched in the extremes of the phenotype distribution. In this analysis of the Genetic Analysis Workshop 17 data set, my aim is to detect genes with rare variants that are associated with quantitative traits using two general approaches: analyzing the association with the complete distribution of values by means of linear regression, and using statistical tests based on the tails of the distribution (bottom 10% of values versus top 10%). Three methods are used for this extreme phenotype approach: Fisher’s exact test, the weighted-sum method, and the beta method. Rare variants were collapsed at the gene level. Linear regression including all values provided the highest power to detect rare variants. Of the three methods used in the extreme phenotype approach, the beta method performed best. Furthermore, the sample was enriched in this approach by adding samples with extreme phenotype values. Doubling the sample size in this way, which corresponds to only 40% of the sample size of the original continuous trait, yielded power comparable to or even higher than that of linear regression. If samples are selected primarily for sequencing, enriching the sample with individuals who have extreme values of the phenotype of interest, rather than sampling from the general population, yields higher power to detect rare variants than analyzing a population-based sample of equivalent size.
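The extreme phenotype comparison with Fisher’s exact test can be sketched as follows. This is a minimal illustration that assumes each individual has already been collapsed to a 0/1 carrier indicator for rare variants in one gene. The function names, the toy data and the two-sided definition of the p-value are assumptions, and the weighted-sum and beta methods are not shown.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x carriers in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


def extreme_tail_test(trait, carrier, tail=0.10):
    """Compare carrier counts in the top and bottom `tail` fraction of the trait."""
    order = sorted(range(len(trait)), key=lambda i: trait[i])
    k = max(1, round(len(trait) * tail))
    bottom, top = order[:k], order[-k:]
    top_carriers = sum(carrier[i] for i in top)
    bottom_carriers = sum(carrier[i] for i in bottom)
    return fisher_exact_two_sided(
        top_carriers, k - top_carriers, bottom_carriers, k - bottom_carriers
    )


# Toy data (hypothetical): 40 individuals; only the 4 highest trait values carry a rare variant.
trait = list(range(40))
carrier = [0] * 36 + [1] * 4
p_value = extreme_tail_test(trait, carrier)
```

With only four individuals per tail, even a perfect split between the tails cannot give a very small p-value, which illustrates why adding further extreme samples matters for power.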
Both family- and population-based samples are used to identify genetic variants associated with phenotypes. Each strategy has demonstrated advantages, but their ability to identify rare variants and genes containing rare variants is unclear. To compare these two study designs in the identification of rare causal variants, we applied various methods to the population- and family-based data simulated for Genetic Analysis Workshop 17, with knowledge of the simulated model. Our results suggest that different variants can be identified by different study designs. Family-based and population-based study designs can be complementary in the identification of rare causal variants, and both should be considered in future studies.