Local adaptation towards divergent ecological conditions often results in genetic differentiation and adaptive phenotypic divergence. To illuminate the ecological distinctiveness of the schizothoracine fish, we studied a Gymnocypris species complex consisting of three morphs distributed across four bodies of water (the Yellow River, Lake Qinghai, the Ganzi River and Lake Keluke) in the Northeast Tibetan Plateau. We used a combination of mitochondrial (16S rRNA and Cyt b) and nuclear (RAG-2) genetic sequences to investigate the phylogeography of these morphs based on a sample of 277 specimens. Analysis of gill rakers allowed for mapping of phenotypic trajectories along the phylogeny. The phylogenetic and morphological analyses showed that the three sparsely rakered morphs were present at two extremes of the phylogenetic tree: the Yellow River morphs were located at the basal phylogenetic split, and the Lake Keluke and Ganzi River morphs at the peak, with the densely rakered Lake Qinghai morphs located between these two extremes. Age estimation further indicated that the sparsely rakered morphs constituted the oldest and youngest lineages, whereas the densely rakered morph was assigned to an intermediate-age lineage. These results are most compatible with the process of evolutionary convergence or reversal. Disruptive natural selection due to divergent habitats and dietary preferences is likely the driving force behind the formation of new morphs, and the similarities between their phenotypes may be attributable to the similarities between their forms of niche tracking associated with food acquisition. This study provides the first genetic evidence for the occurrence of convergence or reversal in the schizothoracine fish of the Tibetan Plateau at small temporal scales.
We investigate the consequences of adopting the criteria used by the state of California, as described by Myers et al. (2011), for conducting familial searches. We carried out a simulation study of randomly generated profiles of related and unrelated individuals with 13-locus CODIS genotypes and YFiler® Y-chromosome haplotypes, on which the Myers protocol for relative identification was carried out. For Y-chromosome sharing first degree relatives, the Myers protocol has a high probability () of identifying their relationship. For unrelated individuals, there is a low probability that an unrelated person in the database will be identified as a first-degree relative. For more distant Y-haplotype sharing relatives (half-siblings, first cousins, half-first cousins or second cousins) there is a substantial probability that the more distant relative will be incorrectly identified as a first-degree relative. For example, there is a probability that a first cousin will be identified as a full sibling, with the probability depending on the population background. Although the California familial search policy is likely to identify a first degree relative if his profile is in the database, and it poses little risk of falsely identifying an unrelated individual in a database as a first-degree relative, there is a substantial risk of falsely identifying a more distant Y-haplotype sharing relative in the database as a first-degree relative, with the consequence that their immediate family may become the target for further investigation. This risk falls disproportionately on those ethnic groups that are currently overrepresented in state and federal databases.
Many coalescent-based methods aiming to infer the demographic history of populations assume a single, isolated and panmictic population (i.e. a Wright-Fisher model). While this assumption may be reasonable under many conditions, several recent studies have shown that the results can be misleading when it is violated. Among the most widely applied demographic inference methods are Bayesian skyline plots (BSPs), which are used across a range of biological fields. Violations of the panmixia assumption are to be expected in many biological systems, but the consequences for skyline plot inferences have so far not been addressed and quantified. We simulated DNA sequence data under a variety of scenarios involving structured populations with variable levels of gene flow and analysed them using BSPs as implemented in the software package BEAST. Results revealed that BSPs can show false signals of population decline under biologically plausible combinations of population structure and sampling strategy, suggesting that the interpretation of several previous studies may need to be re-evaluated. We found that a balanced sampling strategy whereby samples are distributed across several populations provides the best scheme for inferring demographic change over a typical time scale. Analyses of data from a structured African buffalo population demonstrate how BSP results can be strengthened by simulations. We recommend that sample selection should be carefully considered in relation to population structure prior to BSP analyses, and that alternative scenarios should be evaluated when interpreting signals of population size change.
The triplet distance is a distance measure that compares two rooted trees on the same set of leaves by enumerating all subsets of three leaves and counting how often the induced topologies of the trees are equal or different. We present an algorithm that computes the triplet distance between two rooted binary trees in time O(n log² n). The algorithm is related to an algorithm for computing the quartet distance between two unrooted binary trees in time O(n log n). While the quartet distance algorithm has a very severe overhead in the asymptotic time complexity that makes it impractical compared to O(n²) time algorithms, we show through experiments that the triplet distance algorithm can be implemented to give a competitive wall-clock running time.
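The definition above can be illustrated with a brute-force sketch. This is not the O(n log² n) algorithm of the abstract: it simply enumerates all triples, so it runs in cubic time or worse. The nested-tuple tree encoding and all function names are assumptions made for illustration.

```python
from itertools import combinations


def leaf_paths(tree, prefix=()):
    """Map each leaf label to its root-to-leaf path of child indices.

    Assumed encoding: an internal node is a tuple of subtrees,
    a leaf is any non-tuple label (e.g. a string).
    """
    if not isinstance(tree, tuple):
        return {tree: prefix}
    paths = {}
    for index, child in enumerate(tree):
        paths.update(leaf_paths(child, prefix + (index,)))
    return paths


def lca_depth(path_a, path_b):
    """Depth of the lowest common ancestor = length of the shared path prefix."""
    depth = 0
    for step_a, step_b in zip(path_a, path_b):
        if step_a != step_b:
            break
        depth += 1
    return depth


def triplet_topology(paths, a, b, c):
    """Return the pair of leaves grouped together in the induced triplet.

    The grouped pair is the one with the deepest common ancestor.
    Returns None if the triplet is unresolved (possible in non-binary trees).
    """
    depths = {
        frozenset((a, b)): lca_depth(paths[a], paths[b]),
        frozenset((a, c)): lca_depth(paths[a], paths[c]),
        frozenset((b, c)): lca_depth(paths[b], paths[c]),
    }
    deepest = max(depths.values())
    winners = [pair for pair, depth in depths.items() if depth == deepest]
    return winners[0] if len(winners) == 1 else None


def triplet_distance(tree_1, tree_2):
    """Count the leaf triples whose induced topologies differ between the trees."""
    paths_1, paths_2 = leaf_paths(tree_1), leaf_paths(tree_2)
    if set(paths_1) != set(paths_2):
        raise ValueError("both trees must have the same leaf set")
    return sum(
        triplet_topology(paths_1, a, b, c) != triplet_topology(paths_2, a, b, c)
        for a, b, c in combinations(sorted(paths_1), 3)
    )
```

For example, `triplet_distance(((("a", "b"), "c"), "d"), ((("a", "c"), "b"), "d"))` compares two four-leaf trees that disagree only on the triple {a, b, c}. A naive version like this is mainly useful as a reference for checking a faster implementation on small inputs.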
Two African apes are the closest living relatives of humans: the chimpanzee (Pan troglodytes) and the bonobo (Pan paniscus). Although they are similar in many respects, bonobos and chimpanzees differ strikingly in key social and sexual behaviours [1–4], and for some of these traits they show more similarity with humans than with each other. Here we report the sequencing and assembly of the bonobo genome to study its evolutionary relationship with the chimpanzee and human genomes. We find that more than three per cent of the human genome is more closely related to either the bonobo or the chimpanzee genome than these are to each other. These regions allow various aspects of the ancestry of the two ape species to be reconstructed. In addition, many of the regions that overlap genes may eventually help us understand the genetic basis of phenotypes that humans share with one of the two apes to the exclusion of the other.
We present a hidden Markov model (HMM) for inferring gradual isolation between two populations during speciation, modelled as a time interval with restricted gene flow. The HMM describes the history of adjacent nucleotides in two genomic sequences, such that the nucleotides can be separated by recombination, can migrate between populations, or can coalesce at variable time points, all dependent on the parameters of the model, which are the effective population sizes, splitting times, recombination rate, and migration rate. We show by extensive simulations that the HMM can accurately infer all parameters except the recombination rate, which is biased downwards. Inference is robust to variation in the mutation rate and the recombination rate over the sequence and also robust to unknown phase of genomes unless they are very closely related. We provide a test for whether divergence is gradual or instantaneous, and we apply the model to three key divergence processes in great apes: (a) the bonobo and common chimpanzee, (b) the eastern and western gorilla, and (c) the Sumatran and Bornean orang-utan. We find that the bonobo and chimpanzee appear to have undergone a clear split, whereas the divergence processes of the gorilla and orang-utan species occurred over several hundred thousand years with gene flow stopping quite recently. We also apply the model to the Homo/Pan speciation event and find that the most likely scenario involves an extended period of gene flow during speciation.
Next-generation sequencing technology has enabled the generation of whole-genome data for many closely related species. For population genetic inference we have sequenced many loci, but only in a few individuals. We present a new method that allows inference of the divergence process based on two closely related genomes, modelled as gradual isolation in an isolation with migration model. This allows estimation of the initial time of restricted gene flow, the cessation of gene flow, as well as the population sizes, migration rates, and recombination rates. We show by simulations that the parameter estimation is accurate with genome-wide data and use the model to disentangle the divergence processes among three sets of closely related great ape species: bonobo/chimpanzee, eastern/western gorillas, and Sumatran/Bornean orang-utans. We find allopatric speciation for bonobo and chimpanzee and non-allopatric speciation for the gorillas and orang-utans. We also consider the split between humans and chimpanzees/bonobos and find evidence for non-allopatric speciation, similar to that within gorillas and orang-utans.
Studies of the apportionment of human genetic variation have long established that most human variation is within population groups and that the additional variation between population groups is small but greatest when comparing different continental populations. These studies often used Wright’s FST, which apportions the standardized variance in allele frequencies within and between population groups. Because local adaptations increase population differentiation, high FST values may be found at closely linked loci under selection and used to identify genes undergoing directional or heterotic selection. We re-examined these processes using HapMap data. We analyzed 3 million SNPs on 602 samples from eight worldwide populations and a consensus subset of 1 million SNPs found in all populations. We identified four major features of the data: First, a hierarchical FST analysis showed that only a small fraction (12%) of the total genetic variation is distributed between continental populations and an even smaller fraction (1%) is found between intra-continental populations. Second, the global FST distribution closely follows an exponential distribution. Third, although the overall FST distribution is similarly shaped (inverse J), FST distributions vary markedly by allele frequency when divided into non-overlapping groups by allele frequency range. Because the mean allele frequency is a crude indicator of allele age, these distributions mark the time-dependent change in genetic differentiation. Finally, the change in mean FST of these groups is linear in allele frequency. These results suggest that investigating the extremes of the FST distribution for each allele frequency group is more efficient for detecting selection. Consequently, we demonstrate that such extreme SNPs are more clustered along the chromosomes than expected from linkage disequilibrium for each allele frequency group. These genomic regions are therefore likely candidates for natural selection.
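As a rough illustration of FST as a standardized variance in allele frequencies, the sketch below computes Wright's FST for a single biallelic SNP as Var(p) / (p̄(1 − p̄)). It assumes equally weighted subpopulations and applies no sample-size correction, so it is not the hierarchical analysis used in the study; the function name and input format are hypothetical.

```python
def wright_fst(subpop_freqs):
    """Wright's FST for one biallelic locus.

    subpop_freqs: allele frequency of the same allele in each subpopulation.
    Assumes equal subpopulation weights and no sampling correction.
    Returns NaN when the locus is monomorphic overall (FST is undefined).
    """
    n = len(subpop_freqs)
    if n < 2:
        raise ValueError("need at least two subpopulations")
    if any(not 0.0 <= p <= 1.0 for p in subpop_freqs):
        raise ValueError("allele frequencies must lie in [0, 1]")

    mean_p = sum(subpop_freqs) / n
    total_heterozygosity_half = mean_p * (1.0 - mean_p)  # p_bar * (1 - p_bar)
    if total_heterozygosity_half == 0.0:
        return float("nan")

    # Population variance of allele frequencies among subpopulations.
    var_p = sum((p - mean_p) ** 2 for p in subpop_freqs) / n
    return var_p / total_heterozygosity_half
```

With this definition, identical frequencies in all subpopulations give 0 and fixation of alternative alleles in two subpopulations gives 1. Binning SNPs by mean allele frequency before examining the FST distribution, as described above, would then amount to grouping on `mean_p`.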
Inferring the structure of populations has many applications for genetic research. In addition to providing information for evolutionary studies, it can be used to account for the bias induced by population stratification in association studies. To this end, many algorithms have been proposed to cluster individuals into genetically homogeneous sub-populations. The parametric algorithms, such as Structure, are very popular but their underlying complexity and their high computational cost led to the development of faster parametric alternatives such as Admixture. Alternatives to these methods are the non-parametric approaches. In this category, AWclust has proven efficient but fails to properly identify population structure for complex datasets. We present in this article a new clustering algorithm called Spectral Hierarchical clustering for the Inference of Population Structure (SHIPS), based on a divisive hierarchical clustering strategy, allowing a progressive investigation of population structure. This method takes genetic data as input to cluster individuals into homogeneous sub-populations and, with the use of the gap statistic, estimates the optimal number of such sub-populations. SHIPS was applied to a set of simulated discrete and admixed datasets and to real SNP datasets from the HapMap and Pan-Asian SNP consortia. The programs Structure, Admixture, AWclust and PCAclust were also investigated in a comparison study. SHIPS and the parametric approach Structure were the most accurate when applied to simulated datasets both in terms of individual assignments and estimation of the correct number of clusters. The analysis of the results on the real datasets highlighted that the clusterings of SHIPS were the most consistent with the population labels or with those produced by the Admixture program.
The performance of SHIPS when applied to SNP data, along with its relatively low computational cost and its ease of use, makes this method a promising solution for inferring fine-scale genetic patterns.
The fluctuation of population size has not been well addressed in previous studies of the theoretical expectation of linkage disequilibrium (LD). In this study, an improved theoretical prediction of LD decay was derived to account for the effects of changes in effective population sizes. The equation was used to estimate effective population size (Ne) assuming a constant Ne and LD at equilibrium, and these Ne estimates implied the past changes of Ne for a certain number of generations until equilibrium, which differed based on recombination rate. As the influence of recent population history on the Ne estimates is larger than that of older population history, recent changes in population size can be inferred more accurately than older changes. The theoretical predictions based on this improved expression showed close agreement with the simulated values. When applied to human genome data, the method yielded a detailed recent history of human populations. The inferred past population history of each population showed good correspondence with historical studies. Specifically, four populations (three of African ancestry and one of Mexican ancestry) showed population growth that was significantly less than that of other populations, and two populations originating from China showed prominent exponential growth. During the examination of overall LD decay in the human genome, a signal of selection pressure on the gephyrin gene on chromosome 14 was observed in all populations.
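The idea of estimating Ne from LD under a constant-size, equilibrium assumption can be sketched with the classical approximation E[r²] ≈ 1 / (1 + 4·Ne·c), where c is the recombination distance in Morgans. This is the standard textbook expression (Sved, 1971), used here as a stand-in; it is not the improved expression derived in the study, and the function names are hypothetical.

```python
def expected_r2(ne, c):
    """Expected LD (r^2) at equilibrium under the classical approximation.

    ne: effective population size (assumed constant).
    c:  recombination distance between the two loci, in Morgans.
    """
    if ne <= 0 or c < 0:
        raise ValueError("ne must be positive and c non-negative")
    return 1.0 / (1.0 + 4.0 * ne * c)


def ne_from_r2(r2, c):
    """Invert the approximation: Ne = (1 / r^2 - 1) / (4 c)."""
    if not 0.0 < r2 <= 1.0:
        raise ValueError("r2 must lie in (0, 1]")
    if c <= 0:
        raise ValueError("c must be positive to estimate Ne")
    return (1.0 / r2 - 1.0) / (4.0 * c)
```

For instance, `expected_r2(1000, 0.001)` evaluates to 0.2, and `ne_from_r2(0.2, 0.001)` recovers a value of about 1000. Applying the inversion separately to SNP pairs binned by recombination distance gives one Ne estimate per bin, which is the kind of distance-dependent estimate the abstract interprets in terms of past population size.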
Gorillas are humans’ closest living relatives after chimpanzees, and are of comparable importance for the study of human origins and evolution. Here we present the assembly and analysis of a genome sequence for the western lowland gorilla, and compare the whole genomes of all extant great ape genera. We propose a synthesis of genetic and fossil evidence consistent with placing the human-chimpanzee and human-chimpanzee-gorilla speciation events at approximately 6 and 10 million years ago (Mya). In 30% of the genome, gorilla is closer to human or chimpanzee than the latter are to each other; this is rarer around coding genes, indicating pervasive selection throughout great ape evolution, and has functional consequences in gene expression. A comparison of protein coding genes reveals approximately 500 genes showing accelerated evolution on each of the gorilla, human and chimpanzee lineages, and evidence for parallel acceleration, particularly of genes involved in hearing. We also compare the western and eastern gorilla species, estimating an average sequence divergence time 1.75 million years ago, but with evidence for more recent genetic exchange and a population bottleneck in the eastern species. The use of the genome sequence in these and future analyses will promote a deeper understanding of great ape biology and evolution.
An accurate estimate of the divergence time among Native American populations is important for understanding the initial entry and early dispersion of human beings in the New World. Current methods for estimating the genetic divergence time of populations could seriously depart from a linear relationship with the true divergence time when multiple populations differ in size and have undergone significant expansion. Here, to address this problem, we propose a novel measure to estimate the genetic divergence time of populations. Computer simulation revealed that the new measure maintained an excellent linear correlation with the population divergence time in complicated multi-population scenarios with population expansion. Utilizing the new measure and microsatellite data of 21 Native American populations, we investigated the genetic divergences of the Native American populations. The results indicated that genetic divergences between North American populations are greater than those between Central and South American populations. None of the divergences, however, were large enough to constitute convincing evidence supporting the two-wave or multi-wave migration model for the initial entry of human beings into America. The genetic affinity of the Native American populations was further explored using Neighbor-Net and the genetic divergences suggested that these populations could be categorized into four genetic groups living in four different ecological zones. The divergence of the population groups suggests that the early dispersion of human beings in America was a multi-step procedure. Further, the divergences suggest the rapid dispersion of Native Americans in Central and South America after a long standstill period in North America.
Haplotype phasing represents an essential step in studying the association of genomic polymorphisms with complex genetic diseases, and in determining targets for drug design. In recent years, huge amounts of genotype data have been produced by rapidly evolving high-throughput sequencing technologies, and the data volume challenges the community to develop more efficient haplotype phasing algorithms, in terms of both running time and overall accuracy. 2SNP is one of the fastest haplotype phasing algorithms, with error rates comparably low to those of other algorithms. The most time-consuming step of 2SNP is the construction of a maximum spanning tree (MST) among all the heterozygous SNP pairs. We simplified this step by replacing the MST with the initial haplotypes of adjacent heterozygous SNP pairs. The multi-SNP haplotypes were estimated within a sliding window along the chromosomes. Comparative studies on four genotype datasets of different scales suggest that our algorithm, WinHAP, outperforms 2SNP and most of the other haplotype phasing algorithms in terms of both running speed and overall accuracy. To facilitate the application of WinHAP to more practical biological datasets, we have released the software for free at: http://staff.ustc.edu.cn/~xuyun/winhap/index.htm.
GDF5 is a member of the bone morphogenetic protein (BMP) gene family, and plays an important role in the development of the skeletal system. Variants of the gene are associated with osteoarthritis and height in some human populations. Here, we resequenced the gene in individuals from four geographically separated human populations, and found that the evolution of the promoter region deviated from neutral expectations, with the sequence evolution driven by positive selection in the East Asian population, especially the haplotypes carrying the derived alleles of 5′ UTR SNPs rs143384 and rs143383. The derived alleles of rs143384 and rs143383, which are associated with a risk of osteoarthritis and decreased height, have high frequencies in non-Africans and show strong extended haplotype homozygosity and high population differentiation in East Asians. We conclude that positive selection has driven the rapid evolution of the two variants of the human GDF5 gene associated with osteoarthritis risk and decreased height, which supports the suggestion that the reduction in body size during the terminal Pleistocene and Holocene period might have been an adaptive process influenced by genetic factors.
It has been suggested that pathway analysis can complement single-SNP analysis in exploring genomewide association data. Pathway analysis incorporates the available biological knowledge of genes and SNPs and is expected to improve the chances of revealing the underlying genetic architecture of complex traits. Methods for pathway analysis can be classified as competitive (enrichment) or self-contained (association) according to the hypothesis tested. Although association tests are statistically more powerful than enrichment tests they can be difficult to calibrate because biases in analysis accumulate across multiple SNPs or genes. Furthermore, enrichment tests can be more scientifically relevant than association tests, as they detect pathways with relatively more evidence for association than the remaining genes. Here we show how some well known association tests can be simply adapted to test for enrichment, and compare their performance to some established enrichment tests. We propose versions of the Adaptive Rank Truncated Product (ARTP), Tail Strength Measure and Fisher’s combination of p-values for testing the enrichment null hypothesis. We compare the behaviour of these proposed methods with the established Hypergeometric Test and Gene-Set Enrichment Analysis (GSEA). The results of the simulation study show that the modified version of the ARTP method has generally the best performance across the situations considered. The methods were also applied for finding enriched pathways for body mass index (BMI) and platelet function phenotypes. The pathway analysis of BMI identified the Vasoactive Intestinal Peptide pathway as significantly associated with BMI. This pathway has been previously reported as associated with BMI and the risk of obesity. The ARTP method was the method that identified the largest number of enriched pathways across all tested pathway databases and phenotypes. 
The simulation and data application results are in agreement with previous work on association tests and suggest that the ARTP should be preferred for both enrichment and association testing.
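Of the combination methods named above, Fisher's combination of p-values is the simplest to write down. The sketch below implements only the classical self-contained (association) form, in which −2 Σ ln pᵢ follows a chi-square distribution with 2k degrees of freedom under the null when the k p-values are independent. The enrichment adaptation proposed in the study is not reproduced here, and the function name is hypothetical.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Fisher's method for combining independent p-values.

    Returns (statistic, combined_p), where statistic = -2 * sum(ln p_i)
    and combined_p is the upper tail of a chi-square with 2k degrees of freedom.
    """
    k = len(pvalues)
    if k == 0:
        raise ValueError("need at least one p-value")
    if any(not 0.0 < p <= 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")

    statistic = -2.0 * sum(math.log(p) for p in pvalues)
    half = statistic / 2.0

    # For even degrees of freedom 2k the chi-square survival function has the
    # closed form exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    term = 1.0
    series = 1.0
    for i in range(1, k):
        term *= half / i
        series += term
    return statistic, math.exp(-half) * series
```

In a pathway setting, `pvalues` would be the SNP-level or gene-level p-values within one pathway. Correlation among SNPs (for example through LD) violates the independence assumption, which is one reason such tests are usually calibrated by permutation in practice.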
Non-human primates have emerged as an important resource for the study of human disease and evolution. The characterization of genomic variation between and within non-human primate species could advance the development of genetically defined non-human primate disease models. However, non-human primate specific reagents that would expedite such research, such as exon-capture tools, are lacking. We evaluated the efficiency of using a human exome capture design for the selective enrichment of exonic regions of non-human primates. We compared the exon sequence recovery in nine chimpanzees, two crab-eating macaques and eight Japanese macaques. Over 91% of the target regions were captured in the non-human primate samples, although the specificity of the capture decreased as evolutionary divergence from humans increased. Both intra-specific and inter-specific DNA variants were identified; Sanger-based resequencing validated 85.4% of 41 randomly selected SNPs. Among the short indels identified, a majority (54.6%–77.3%) of the variants resulted in a change of 3 base pairs, consistent with expectations for a selection against frame shift mutations. Taken together, these findings indicate that use of a human design exon-capture array can provide efficient enrichment of non-human primate gene regions. Accordingly, use of the human exon-capture methods provides an attractive, cost-effective approach for the comparative analysis of non-human primate genomes, including gene-based DNA variant discovery.
Accurately modeling LD in simulations is essential to correctly evaluate new and existing association methods. At present, there has been minimal research comparing the quality of existing gene region simulation methods to produce LD structures similar to an existing gene region. Here we compare the ability of three approaches to accurately simulate the LD within a gene region: HapSim (2005), Hapgen (2009), and a minor extension to simple haplotype resampling.
In order to observe the variation and bias for each method, we compare the simulated pairwise LD measures and minor allele frequencies to the original HapMap data in an extensive simulation study. When possible, we also evaluate the effects of changing parameters.
HapSim produces samples of haplotypes with lower LD, on average, compared to the original haplotype set while both our resampling method and Hapgen do not introduce this bias. The variation introduced across the replicates by our resampling method is quite small and may not provide enough sampling variability to make a generalizable simulation study.
We recommend using Hapgen to simulate replicate haplotypes from a gene region. Hapgen produces moderate sampling variation between the replicates while retaining the overall unique LD structure of the gene region.
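The comparison described above rests on pairwise LD computed from haplotypes. As a minimal sketch, the code below computes r² between two SNPs from phased haplotypes coded 0/1 and draws a replicate sample by plain resampling with replacement. The resampling function is the simple baseline only, not the authors' extension, and neither HapSim nor Hapgen is reproduced; the function names and data layout are assumptions.

```python
import random


def ld_r2(haplotypes, i, j):
    """Pairwise LD (r^2) between SNP columns i and j.

    haplotypes: sequence of haplotypes, each a sequence of 0/1 alleles.
    Returns NaN if either SNP is monomorphic in the sample.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("need at least one haplotype")
    p_a = sum(h[i] for h in haplotypes) / n          # frequency of allele 1 at SNP i
    p_b = sum(h[j] for h in haplotypes) / n          # frequency of allele 1 at SNP j
    p_ab = sum(h[i] * h[j] for h in haplotypes) / n  # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b                             # LD coefficient D
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denom == 0.0:
        return float("nan")
    return d * d / denom


def resample_haplotypes(haplotypes, n_out, seed=None):
    """Draw n_out haplotypes with replacement (simple bootstrap replicate)."""
    rng = random.Random(seed)
    return rng.choices(list(haplotypes), k=n_out)
```

Comparing `ld_r2` on the original haplotypes with the same quantity on each simulated replicate gives the kind of bias and variability summary the study reports. Because plain resampling only reuses observed haplotypes, replicates tend to stay close to the original, in line with the low sampling variability noted above.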
The PRDM9 locus in mammals has increasingly attracted research attention due to its role in mediating chromosomal recombination and possible involvement in hybrid sterility and hence speciation processes. The aim of this study was to characterize sequence variation at the PRDM9 locus in a sample of our closest living relatives, the chimpanzees and bonobos.
PRDM9 contains a highly variable and repetitive zinc finger array. We amplified this domain using long-range PCR and determined the DNA sequences using conventional Sanger sequencing. From 17 chimpanzees representing three subspecies and five bonobos we obtained a total of 12 alleles differing at the nucleotide level. Based on a data set consisting of our data and recently published Pan PRDM9 sequences, we found that at the subspecies level, diversity levels did not differ among chimpanzee subspecies or between chimpanzee subspecies and bonobos. In contrast, the sample of chimpanzees harbors significantly more diversity at PRDM9 than samples of humans. Pan PRDM9 shows signs of rapid evolution including no alleles or ZnFs in common with humans as well as signals of positive selection in the residues responsible for DNA binding.
Conclusions and Significance
The high number of alleles specific to the genus Pan, signs of positive selection in the DNA binding residues, and reported lack of conservation of recombination hotspots between chimpanzees and humans suggest that PRDM9 could be active in hotspot recruitment in the genus Pan. Chimpanzees and bonobos are considered separate species and do not have overlapping ranges in the wild, making the presence of shared alleles at the amino acid level between the chimpanzee and bonobo species interesting in view of the hypothesis that PRDM9 plays a universal role in interspecific hybrid sterility.
We examined the phylogenetic history of Linaria with special emphasis on the Mediterranean sect. Supinae (44 species). We revealed extensive highly supported incongruence among two nuclear (ITS, AGT1) and two plastid regions (rpl32-trnLUAG, trnS-trnG). Coalescent simulations, a hybrid detection test and species tree inference in *BEAST revealed that incomplete lineage sorting and hybridization may both be responsible for the incongruent pattern observed. Additionally, we present a multilabelled *BEAST species tree as an alternative approach that allows the possibility of observing multiple placements in the species tree for the same taxa. That permitted the incorporation of processes such as hybridization within the tree while not violating the assumptions of the *BEAST model. This methodology is presented as a functional tool to disclose the evolutionary history of species complexes that have experienced both hybridization and incomplete lineage sorting. The drastic climatic events that have occurred in the Mediterranean since the late Miocene, including the Quaternary-type climatic oscillations, may have made both processes highly recurrent in the Mediterranean flora.
Dense genotype data can be used to detect chromosome fragments inherited from a common ancestor in apparently unrelated individuals. A disease-causing mutation inherited from a common founder may thus be detected by searching for a common haplotype signature in a sample population of patients. We present here FounderTracker, a computational method for the genome-wide detection of founder mutations in cancer using dense tumor SNP profiles. Our method is based on two assumptions. First, the wild-type allele frequently undergoes loss of heterozygosity (LOH) in the tumors of germline mutation carriers. Second, the overlap between the ancestral chromosome fragments inherited from a common founder will define a minimal haplotype conserved in each patient carrying the founder mutation. Our approach thus relies on the detection of haplotypes with significant identity by descent (IBD) sharing within recurrent regions of LOH to highlight genomic loci likely to harbor a founder mutation. We validated this approach by analyzing two real cancer data sets in which we successfully identified founder mutations of well-characterized tumor suppressor genes. We then used simulated data to evaluate the ability of our method to detect IBD tracts as a function of their size and frequency. We show that FounderTracker can detect haplotypes of low prevalence with high power and specificity, significantly outperforming existing methods. FounderTracker is thus a powerful tool for discovering unknown founder mutations that may explain part of the “missing” heritability in cancer. This method is freely available and can be used online at the FounderTracker website.
Recent advances in automated assessment of basic vocabulary lists allow the construction of linguistic phylogenies useful for tracing dynamics of human population expansions, reconstructing ancestral cultures, and modeling transition rates of cultural traits over time.
Here we investigate the Tupi expansion, a widely-dispersed language family in lowland South America, with a distance-based phylogeny based on 40-word vocabulary lists from 48 languages. We coded 11 cultural traits across the diverse Tupi family including traditional warfare patterns, post-marital residence, corporate structure, community size, paternity beliefs, sibling terminology, presence of canoes, tattooing, shamanism, men's houses, and lip plugs.
The linguistic phylogeny supports a Tupi homeland in west-central Brazil with subsequent major expansions across much of lowland South America. Consistently, ancestral reconstructions of cultural traits over the linguistic phylogeny suggest that social complexity has tended to decline through time, most notably in the independent emergence of several nomadic hunter-gatherer societies. Estimated rates of cultural change across the Tupi expansion are on the order of only a few changes per 10,000 years, in accord with previous cultural phylogenetic results in other language families around the world, and indicate a conservative nature to much of human culture.
Genomic imprinting is an important epigenetic phenomenon, which on the phenotypic level can be detected by the difference between the two heterozygote classes of a gene. Imprinted genes are important in both the development of the placenta and the embryo, and we hypothesized that imprinted genes might be involved in female fertility traits. We therefore performed an association study for imprinted genes related to female fertility traits in two commercial pig populations. For this purpose, 309 SNPs in fifteen evolutionarily conserved imprinted regions were genotyped on 689 and 1050 pigs from the two pig populations. A single SNP association study was used to detect additive, dominant and imprinting effects related to four reproduction traits: the total number of piglets born, the number of piglets born alive, the total weight of the piglets born and the total weight of the piglets born alive. Several SNPs showed significant () additive and dominant effects and one SNP showed a significant imprinting effect. The SNP with a significant imprinting effect is closely linked to DIO3, a gene involved in thyroid metabolism. The imprinting effect of this SNP explained approximately 1.6% of the phenotypic variance, which corresponded to approximately 15.5% of the additive genetic variance. In the other population, the imprinting effect of this QTL was not significant (), but the effect was similar to that in the first population. The results of this study indicate a possible association between the imprinted gene DIO3 and female fertility traits in pigs.
Although complex diseases and traits are thought to have a multifactorial genetic basis, the common methods in genome-wide association analyses test each variant for association independently of the others. This computational simplification may reduce the power to identify variants with small effect sizes, and it requires correcting for multiple hypothesis tests with complex relationships. However, advances in computational methods and increases in computational resources are enabling the computation of models that adhere more closely to the theory of multifactorial inheritance. Here, a Bayesian variable selection and model averaging approach is formulated for searching for additive and dominant genetic effects. The approach simultaneously considers all available variants for inclusion as predictors in a linear genotype-phenotype mapping and averages over the uncertainty in the variable selection. This leads to naturally interpretable summary quantities on the significance of the variants and their contribution to the genetic basis of the studied trait. We first characterize the behavior of the approach in simulations. The results indicate a gain in causal variant identification performance when additive and dominant variation are simulated, with a negligible loss of power in the purely additive case. An application to the analysis of high- and low-density lipoprotein cholesterol levels in a dataset of 3895 Finns is then presented, demonstrating the feasibility of the approach at the current scale of single-nucleotide polymorphism data. We describe a Markov chain Monte Carlo algorithm for the computation and give suggestions on the specification of prior parameters using commonly available prior information. Open-source software implementing the method is available at http://www.lce.hut.fi/research/mm/bmagwa/ and https://github.com/to-mi/.
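One summary quantity commonly reported in Bayesian variable selection is the posterior inclusion probability of each variant, estimated as the fraction of MCMC samples in which the variant is included in the model. The abstract does not name the summaries the authors use, so the sketch below is an assumption rather than a description of their software. The function name `inclusion_probabilities` and the 0/1 indicator layout are hypothetical.

```python
def inclusion_probabilities(samples):
    """Estimate posterior inclusion probabilities from MCMC output.

    samples: list of equal-length lists of 0/1 indicators, one list per
    MCMC sample and one indicator per variant (assumed layout).
    """
    n_samples = len(samples)
    if n_samples == 0:
        raise ValueError("at least one MCMC sample is required")
    n_variants = len(samples[0])
    # Average each variant's inclusion indicator over all samples.
    return [
        sum(sample[j] for sample in samples) / n_samples
        for j in range(n_variants)
    ]
```

Averaging over samples in this way is what lets the uncertainty in the variable selection carry through to the per-variant summary, rather than conditioning on a single selected model.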
Models of protein evolution currently come in two flavors: generalist and specialist. Generalist models (e.g. PAM, JTT, WAG) adopt a one-size-fits-all approach, where a single model is estimated from a number of different protein alignments. Specialist models (e.g. mtREV, rtREV, HIVbetween) can be estimated when a large quantity of data are available for a single organism or gene, and are intended for use on that organism or gene only. Unsurprisingly, specialist models outperform generalist models, but in most instances there simply are not enough data available to estimate them. We propose a method for estimating alignment-specific models of protein evolution in which the complexity of the model is adapted to suit the richness of the data. Our method uses non-negative matrix factorization (NNMF) to learn a set of basis matrices from a general dataset containing a large number of alignments of different proteins, thus capturing the dimensions of important variation. It then learns a set of weights that are specific to the organism or gene of interest and for which only a smaller dataset is available. Thus the alignment-specific model is obtained as a weighted sum of the basis matrices. Having been constrained to vary along only as many dimensions as the data justify, the model has far fewer parameters than would be required to estimate a specialist model. We show that our NNMF procedure produces models that outperform existing methods on all but one of 50 test alignments. The basis matrices we obtain confirm the expectation that amino acid properties tend to be conserved, and allow us to quantify, on specific alignments, how the strength of conservation varies across different properties. We also apply our new models to phylogeny inference and show that the resulting phylogenies are different from, and have improved likelihood over, those inferred under standard models.
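The final step described above, building an alignment-specific model as a weighted sum of basis matrices, can be sketched as follows. This shows only the combination step, assuming the basis matrices and non-negative weights have already been learned. It does not implement the NNMF fitting itself, and the function name `combine_basis` and the toy 2×2 matrices are hypothetical.

```python
def combine_basis(weights, bases):
    """Return the weighted sum of square basis matrices.

    weights: non-negative floats, one per basis matrix.
    bases: list of square matrices (lists of lists) of equal size.
    """
    if len(weights) != len(bases):
        raise ValueError("need exactly one weight per basis matrix")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    n = len(bases[0])
    # Entry (i, j) of the model is the weighted sum of entry (i, j) of each basis.
    return [
        [sum(w * basis[i][j] for w, basis in zip(weights, bases)) for j in range(n)]
        for i in range(n)
    ]
```

Because the model is fully determined by the weights, the number of free parameters equals the number of basis matrices rather than the number of entries in a full amino acid exchange matrix.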
Genome-Wide Association Studies are powerful tools to detect genetic variants associated with diseases. Their results have, however, been questioned, in part because of the bias induced by population stratification. This is a consequence of systematic differences in allele frequencies due to differences in sample ancestries, which can lead to either false positive or false negative findings. Many strategies are available to account for stratification, but their performance differs according to, for instance, the type of population structure, the minor allele frequency of the disease susceptibility locus, the degree of sampling imbalance, or the sample size. We focus on the type of population structure and compare the most commonly used methods for dealing with stratification: Genomic Control, Principal Component based methods such as those implemented in Eigenstrat, adjusted Regressions, and Meta-Analysis strategies. Our assessment of the methods is based on a large simulation study involving several scenarios corresponding to many types of population structure. We considered both false positive rate and power to determine which methods perform best. Our analysis showed that if there is no population structure, none of the tests introduced bias or decreased power, except for the Meta-Analyses. When the population is stratified, adjusted Logistic Regressions and Eigenstrat are the best solutions to account for stratification, even though only the Logistic Regressions are able to consistently maintain correct false positive rates. This study provides more details about these methods. Their advantages and limitations in different stratification scenarios are highlighted in order to propose practical guidelines for accounting for population stratification in Genome-Wide Association Studies.
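Of the methods compared, Genomic Control is the simplest to illustrate. In its standard form, the inflation factor λ is the median of the observed 1-df chi-square association statistics divided by the median of the chi-square distribution with one degree of freedom (about 0.4549), and each statistic is then divided by λ. The sketch below follows that standard form and is not necessarily the exact variant used in the study. The function names are hypothetical, and flooring λ at 1 is a common convention assumed here.

```python
import statistics

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(chi2_stats):
    """Return the inflation factor: observed median over expected median."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


def genomic_control(chi2_stats):
    """Deflate 1-df chi-square statistics by the inflation factor.

    The factor is floored at 1 so that statistics are never inflated
    (an assumed convention, not taken from the abstract).
    """
    lam = max(genomic_inflation(chi2_stats), 1.0)
    return [stat / lam for stat in chi2_stats]
```

Because a single factor is applied to every marker, this correction cannot adapt to markers whose allele frequencies differ more or less strongly between subpopulations.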
Imprinting is an epigenetic phenomenon in which the same alleles are transcribed unequally and thus contribute differently to a trait depending on their parent of origin. This mechanism has been found to affect a variety of human disorders. Although various methods for testing parent-of-origin effects have been proposed in linkage analysis settings, only a few are available for association analysis, and they are usually restricted to small families and particular study designs. In this study, we develop a powerful maximum likelihood test to evaluate the parent-of-origin effects of SNPs on quantitative phenotypes in general family studies. Our method incorporates haplotype distribution to take advantage of inter-marker LD information in genome-wide association studies (GWAS). It also accommodates the missing genotypes that often occur in genetic studies. Our simulation studies with various minor allele frequencies, LD structures, family sizes, and missing-data schemes have uniformly shown that the new method significantly improves the power to detect imprinted genes compared with the method that uses only the SNP at the testing locus. Our simulations suggest that the most efficient strategy for investigating parent-of-origin effects is to recruit one parent and as many offspring as possible under practical constraints. As a demonstration, we applied our method to a dataset from the Genetics of Lipid Lowering Drugs and Diet Network (GOLDN) to test the parent-of-origin effects of the SNPs within the PPARGC1A, MTP and FABP2 genes on diabetes-related phenotypes, and found that several SNPs in the MTP gene show parent-of-origin effects on insulin and glucose levels.