Bioinformatics and functional screens identified a group of Family A-type DNA Polymerase (polA) genes encoded by viruses inhabiting circumneutral and alkaline hot springs in Yellowstone National Park and the US Great Basin. The proteins encoded by these viral polA genes (PolAs) shared no significant sequence similarity with any known viral proteins but were remarkably similar to PolAs encoded by two of three families of the bacterial phylum Aquificae and by several apicoplast-targeted PolA-like proteins found in the eukaryotic phylum Apicomplexa, which includes the obligate parasites Plasmodium, Babesia, and Toxoplasma. The viral gene products share signature elements previously associated only with Aquificae and Apicomplexa PolA-like proteins and were similar to proteins encoded by prophage elements of a variety of otherwise unrelated Bacteria, each of which additionally encoded a prototypical bacterial PolA. Unique among known viral DNA polymerases, the viral PolA proteins of this study share with the Apicomplexa proteins large amino-terminal domains with putative helicase/primase elements but low primary sequence similarity. The genomic context and distribution, phylogeny, and biochemistry of these PolA proteins suggest that thermophilic viruses transferred polA genes to the Apicomplexa, likely through secondary endosymbiosis of a virus-infected proto-apicoplast, and to the common ancestor of two of three Aquificae families, where they displaced the orthologous cellular polA gene. On the basis of biochemical activity, gene structure, and sequence similarity, we speculate that the xenologous viral-type polA genes may have functions associated with diversity-generating recombination in both Bacteria and Apicomplexa.
viral metagenomics; horizontal gene transfer; replication; DNA polymerase; Apicomplexa; Aquificae
For more than two decades, positively selected amino acid sites have been inferred almost exclusively by showing that the number of nonsynonymous substitutions per nonsynonymous site (dn) is greater than that of synonymous substitutions per synonymous site (ds). However, virtually none of these statistical results have been experimentally tested, so they remain hypotheses. To perform such experimental tests, we must connect genotype and phenotype and relate the phenotypic changes to the environmental and behavioral changes of the organism. The genotype–phenotype relationship can be established only by synthesizing and manipulating “proper” ancestral phenotypes, whereas the actual functions of adaptive mutations can be learned by studying their chemical roles in phenotypic changes.
phenotypic adaptation; molecular adaptation; visual pigments; ancestral phenotypes
Mutation rate is one of the most fundamental parameters in genetics and evolutionary biology because it has major impacts on the incidence of disease, the amount of genetic variation, and the rate and trajectory of evolution. Based on estimates of synonymous nucleotide diversity in Escherichia coli, a recent study claimed that the per-nucleotide mutation rate in a gene decreases with the rise of its expression level or the intensity of purifying selection and that this trend reflects adaptive risk management. Here, we demonstrate that this argument is theoretically untenable, especially in the absence of mechanisms that simultaneously tune the mutabilities of multiple genes with similar fractions of deleterious mutations. Analyzing published genome sequences of E. coli mutation accumulation lines, we show that mutation rates are actually higher in more highly expressed genes, similar to previous genome-wide observations in Salmonella typhimurium, Saccharomyces cerevisiae, and the human germline. These general patterns likely arise from transcription-associated mutagenesis that exceeds transcription-coupled repair.
mutation rate; expression level; natural selection; E. coli
Noncoding genetic variation is known to significantly influence gene expression levels in a growing number of specific cases; however, the patterns of genome-wide noncoding variation present within populations, the evolutionary forces acting on noncoding variants, and the relative effects of regulatory polymorphisms on transcript abundance are not well characterized. Here, we address these questions by analyzing patterns of regulatory variation in motifs for 177 DNA binding proteins in 37 strains of Saccharomyces cerevisiae. Across S. cerevisiae strains, we found considerable polymorphism in regulatory motifs (mean π = 0.005) as well as diversity in regulatory motifs (mean 0.91 motif differences per regulatory region). Population genetics analyses reveal that motifs are under purifying selection, and there is considerable heterogeneity in the magnitude of selection across different motifs. Finally, we obtained RNA-Seq data in 22 strains and identified 49 polymorphic DNA sequence motifs in 30 distinct genes that are significantly associated with transcriptional differences between strains. In 22 of these genes, there was a single polymorphic motif associated with expression in the upstream region. Our results provide comprehensive insights into the evolutionary trajectory of regulatory variation in yeast and the characteristics of a compendium of regulatory alleles.
adaptive evolution; evolution; regulatory variation; yeast
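The nucleotide diversity statistic (π) quoted in the abstract above can be illustrated with a minimal Python sketch that computes the mean number of pairwise differences per site across aligned sequences. The function name `nucleotide_diversity` and the toy alleles are illustrative assumptions, not data or code from the study, and gaps or ambiguous bases are not handled.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Mean pairwise differences per site for equal-length aligned sequences."""
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if length == 0 or any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned, non-empty, and equal in length")
    pairs = list(combinations(seqs, 2))
    # Count mismatching positions over every unordered pair of sequences.
    total_diffs = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs)
    return total_diffs / (len(pairs) * length)


# Hypothetical alleles of one 4-bp motif in three strains.
toy_motif_alleles = ["ACGT", "ACGA", "ACGT"]
pi = nucleotide_diversity(toy_motif_alleles)
```

Averaging this quantity over many motif instances would give a mean π comparable in kind, though not in value, to the figure reported above.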
Ursine bears are a mammalian subfamily that comprises six morphologically and ecologically distinct extant species. Previous phylogenetic analyses of concatenated nuclear genes could not resolve all relationships among bears, and appeared to conflict with the mitochondrial phylogeny. Evolutionary processes such as incomplete lineage sorting and introgression can cause gene tree discordance and complicate phylogenetic inferences, but are not accounted for in phylogenetic analyses of concatenated data. We generated a high-resolution data set of autosomal introns from several individuals per species and of Y-chromosomal markers. Incorporating intraspecific variability in coalescence-based phylogenetic and gene flow estimation approaches, we traced the genealogical history of individual alleles. Considerable heterogeneity among nuclear loci and discordance between nuclear and mitochondrial phylogenies were found. A species tree with divergence time estimates indicated that ursine bears diversified within less than 2 My. Consistent with a complex branching order within a clade of Asian bear species, we identified unidirectional gene flow from Asian black into sloth bears. Moreover, gene flow detected from brown into American black bears can explain the conflicting placement of the American black bear in mitochondrial and nuclear phylogenies. These results highlight that both incomplete lineage sorting and introgression are prominent evolutionary forces even on time scales up to several million years. Complex evolutionary patterns are not adequately captured by strictly bifurcating models, and can only be fully understood when analyzing multiple independently inherited loci in a coalescence framework. Phylogenetic incongruence among gene trees hence needs to be recognized as a biologically meaningful signal.
species tree; introgressive hybridization; Ursidae; phylogenetic network; coalescence; multi-locus analyses
Computational predictions have become indispensable for evaluating the disease-related impact of nonsynonymous single-nucleotide variants discovered in exome sequencing. Many such methods have their roots in molecular evolution, as they use information derived from multiple sequence alignments. We show that the performance of current methods (e.g., PolyPhen-2 and SIFT) is improved significantly by optimizing their statistical models on evolutionarily balanced training data, where equal numbers of positive and negative controls within each evolutionary conservation class are used. Evolutionary balancing significantly reduces the false-positive rates for variants observed at highly conserved sites and false-negative rates for variants observed at fast evolving sites. Use of these improved methods enables more accurate forecasting when concordant diagnosis from multiple methods is regarded as a more reliable indicator of the prediction. Applied to a large exome variation data set, we find that the current methods produce concordant predictions for less than half of the population variants. These advances are implemented in a web resource for use in practical applications (www.mypeg.info, last accessed March 13, 2013).
evolutionary medicine; nonsynonymous single nucleotide variant; computational prediction
Analysis of genetic differences (gene presence/absence and nucleotide polymorphisms) among strains of a bacterial species is crucial to understanding molecular mechanisms of bacterial pathogenesis and selecting targets for novel antibacterial therapeutics. However, lack of genome-wide association studies on large and epidemiologically well-defined strain collections from the same species makes it difficult to identify the genes under positive selection and define adaptive polymorphisms in those genes. To address this need and to overcome existing limitations, we propose to create a “microbial variome”, a species-specific resource database of genomic variations based on molecular evolutionary analysis. Here, we present prototype variome databases of Escherichia coli and Salmonella enterica subspecies enterica (http://depts.washington.edu/sokurel/variome, last accessed March 26, 2013). The prototypes currently include the point mutations data of core protein-coding genes from completely sequenced genomes of 22 E. coli and 17 S. enterica strains. These publicly available databases allow for single- and multiple-field sorting, filtering, and searching of the gene variability data and the potential adaptive significance. Such resource databases would immensely help experimental research, clinical diagnostics, epidemiology, and environmental control of human pathogens.
microbial variome; adaptive evolution; nucleotide polymorphisms; database
A key goal in molecular evolution is to extract mechanistic insights from signatures of selection. A case study is codon usage, where despite many recent advances and hypotheses, two longstanding problems remain: the relative contribution of selection and mutation in determining codon frequencies and the relative contribution of translational speed and accuracy to selection. The relevant targets of selection—the rate of translation and of mistranslation of a codon per unit time in the cell—can only be related to mechanistic properties of the translational apparatus if the number of transcripts per cell is known, requiring use of gene expression measurements. Perhaps surprisingly, different gene-expression data sets yield markedly different estimates of selection. We show that this is largely due to measurement noise, notably due to differences between studies rather than instrument error or biological variability. We develop an analytical framework that explicitly models noise in expression in the context of the population-genetic model. Estimates of mutation and selection strength in budding yeast produced by this method are robust to the expression data set used and are substantially higher than estimates using a noise-blind approach. We introduce per-gene selection estimates that correlate well with previous scoring systems, such as the codon adaptation index, while now carrying an evolutionary interpretation. On average, selection for codon usage in budding yeast is weak, yet our estimates show that genes range from virtually unselected to average per-codon selection coefficients above the inverse population size. Our analytical framework may be generally useful for distinguishing biological signals from measurement noise in other applications that depend upon measurements of gene expression.
selection; codon usage; gene expression; noise
All modern approaches to molecular phylogenetics require a quantitative model for how genes evolve. Unfortunately, existing evolutionary models do not realistically represent the site-heterogeneous selection that governs actual sequence change. Attempts to remedy this problem have involved augmenting these models with a burgeoning number of free parameters. Here, I demonstrate an alternative: Experimental determination of a parameter-free evolutionary model via mutagenesis, functional selection, and deep sequencing. Using this strategy, I create an evolutionary model for influenza nucleoprotein that describes the gene phylogeny far better than existing models with dozens or even hundreds of free parameters. Emerging high-throughput experimental strategies such as the one employed here provide fundamentally new information that has the potential to transform the sensitivity of phylogenetic and genetic analyses.
phylogenetics; codon model; substitution model; influenza; nucleoprotein; deep mutational scanning
It has long been believed that the male-specific region of the human Y chromosome (MSY) is genetically independent from the X chromosome. This idea has been recently dismissed due to the discovery that X–Y gametologous gene conversion may occur. However, the pervasiveness of this molecular process in the evolution of sex chromosomes has yet to be exhaustively analyzed. In this study, we explored how pervasive X–Y gene conversion has been during the evolution of the youngest stratum of the human sex chromosomes. By comparing about 0.5 Mb of human–chimpanzee gametologous sequences, we identified 19 regions in which extensive gene conversion has occurred. From our analysis, two major features of these emerged: 1) Several of them are evolutionarily conserved between the two species and 2) almost all of the 19 hotspots overlap with regions where X–Y crossing-over has been previously reported to be involved in sex reversal. Furthermore, in order to explore the dynamics of X–Y gametologous conversion in recent human evolution, we resequenced these 19 hotspots in 68 widely divergent Y haplogroups and used publicly available single nucleotide polymorphism data for the X chromosome. We found that at least ten hotspots are still active in humans. Hence, the results of the interspecific analysis are consistent with the hypothesis of widespread reticulate evolution within gametologous sequences in the differentiation of hominini sex chromosomes. In turn, intraspecific analysis demonstrates that X–Y gene conversion may modulate human sex-chromosome-sequence evolution to a greater extent than previously thought.
X–Y gene conversion; sex chromosome evolution; human Y chromosome; recombination hotspots
Complete mitochondrial genomes have been shown to be reliable markers for phylogeny reconstruction among diverse animal groups. However, the relative difficulty and high cost associated with obtaining de novo full mitogenomes have frequently led to conspicuously low taxon sampling in ensuing studies. Here, we report the successful use of an economical and accessible method for assembling complete or near-complete mitogenomes through shot-gun next-generation sequencing of a single library made from pooled total DNA extracts of numerous target species. To avoid the use of separate indexed libraries for each specimen, and an associated increase in cost, we incorporate standard polymerase chain reaction-based “bait” sequences to identify the assembled mitogenomes. The method was applied to study the higher level phylogenetic relationships in the weevils (Coleoptera: Curculionoidea), producing 92 newly assembled mitogenomes obtained in a single Illumina MiSeq run. The analysis supported a separate origin of wood-boring behavior by the subfamilies Scolytinae, Platypodinae, and Cossoninae. This finding contradicts morphological hypotheses proposing a close relationship between the first two of these but is congruent with previous molecular studies, reinforcing the utility of mitogenomes in phylogeny reconstruction. Our methodology provides a technically simple procedure for generating densely sampled trees from whole mitogenomes and is widely applicable to groups of animals for which bait sequences are the only required prior genome knowledge.
next-generation sequencing; genomics; MiSeq; mitochondria; phylogenetics; wood-boring
Walter Fitch; molecular biology and evolution
Archaeplastida (=Kingdom Plantae) are primary plastid-bearing organisms that evolved via the endosymbiotic association of a heterotrophic eukaryote host cell and a cyanobacterial endosymbiont approximately 1,400 Ma. Here, we present analyses of cyanobacterial and plastid genomes that show strongly conflicting phylogenies based on 75 plastid (or nuclear plastid-targeted) protein-coding genes and their direct translations to proteins. The conflict between genes and proteins is largely robust to the use of sophisticated data- and tree-heterogeneous composition models. However, by using nucleotide ambiguity codes to eliminate synonymous substitutions due to codon-degeneracy, we identify a composition bias, and dependent codon-usage bias, resulting from synonymous substitutions at all third codon positions and first codon positions of leucine and arginine, as the main cause for the conflicting phylogenetic signals. We argue that the protein-coding gene data analyses are likely misleading due to artifacts induced by convergent composition biases at first codon positions of leucine and arginine and at all third codon positions. Our analyses corroborate previous studies based on gene sequence analysis that suggest Cyanobacteria evolved by the early paraphyletic splitting of Gloeobacter and a specific Synechococcus strain (JA33Ab), with all other remaining cyanobacterial groups, including both unicellular and filamentous species, forming the sister-group to the Archaeplastida lineage. In addition, our analyses using better-fitting models suggest (but without statistically strong support) an early divergence of Glaucophyta within Archaeplastida, with the Rhodophyta (red algae), and Viridiplantae (green algae and land plants) forming a separate lineage.
origin of plastids; phylogeny; Cyanobacteria; Archaeplastida
Phylogenetic networks can model reticulate evolutionary events such as hybridization, recombination, and horizontal gene transfer. However, reconstructing such networks is not trivial. Popular character-based methods are computationally inefficient, whereas distance-based methods cannot guarantee reconstruction accuracy because pairwise genetic distances only reflect partial information about a reticulate phylogeny. To balance accuracy and computational efficiency, here we introduce a quartet-based method to construct a phylogenetic network from a multiple sequence alignment. Unlike distances that only reflect the relationship between a pair of taxa, quartets contain information on the relationships among four taxa; these quartets provide adequate capacity to infer a more accurate phylogenetic network. In applications to simulated and biological data sets, we demonstrate that this novel method is robust and effective in reconstructing reticulate evolutionary events and that it has the potential to infer more accurate phylogenetic distances than other conventional phylogenetic network construction methods such as Neighbor-Joining, Neighbor-Net, and Split Decomposition. This method can be used in constructing phylogenetic networks ranging from simple evolutionary histories involving a few reticulate events to complex evolutionary histories involving a large number of reticulate events. The method is implemented in a software package called “Quartet-Net,” available at http://sysbio.cvm.msstate.edu/QuartetNet/.
phylogenetic network; split network; quartet; 2-weakly compatible; consistency
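To make concrete the idea that a quartet carries information about four taxa at once, the following Python sketch counts, for one set of four aligned sequences, how many alignment columns support each of the three possible quartet splits. This is a simple site-pattern count chosen for illustration, not the Quartet-Net algorithm itself; the function name `quartet_support` and the toy alignment are assumptions.

```python
def quartet_support(alignment, a, b, c, d):
    """Count alignment columns supporting each of the three splits of {a, b, c, d}."""
    splits = {
        ((a, b), (c, d)): 0,
        ((a, c), (b, d)): 0,
        ((a, d), (b, c)): 0,
    }
    for col in zip(alignment[a], alignment[b], alignment[c], alignment[d]):
        state = dict(zip((a, b, c, d), col))
        for (p, q), (r, s) in splits:
            # A column supports pq|rs when each pair agrees internally
            # and the two pairs differ from each other.
            if state[p] == state[q] and state[r] == state[s] and state[p] != state[r]:
                splits[((p, q), (r, s))] += 1
    return splits


# Hypothetical four-taxon alignment.
toy = {"w": "AACA", "x": "AACG", "y": "GGTA", "z": "GGTG"}
support = quartet_support(toy, "w", "x", "y", "z")
best_split = max(support, key=support.get)
```

In a tree-like history one split dominates; substantial support for two conflicting splits in the same quartet is the kind of signal a network method can represent.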
Across eukaryotes, Hsp70-based chaperone machineries display an underlying unity in their sequence, structure, and biochemical mechanism of action, while working in a myriad of cellular processes. In good part, this extraordinary functional versatility is derived from the ability of a single Hsp70 to interact with an array of J-protein cochaperones to form a functional chaperone network. Among J-proteins, the DnaJ-type is the most prevalent, being present in all three kingdoms and in several different compartments of eukaryotic cells. However, because these ancient DnaJ-type proteins diverged at the base of the eukaryotic phylogeny, little is understood about the evolutionary basis of their diversification and thus the functional expansion of the chaperone network. Here, we report results of evolutionary and experimental analyses of two more recent members of the cytosolic DnaJ family of Saccharomyces cerevisiae, Xdj1 and Apj1, which emerged by sequential duplications of the ancient YDJ1 in Ascomycota. Sequence comparison and molecular modeling revealed that both Xdj1 and Apj1 maintained a domain organization similar to that of multifunctional Ydj1. However, despite these similarities, both Xdj1 and Apj1 evolved highly specialized functions. Xdj1 plays a unique role in the translocation of proteins from the cytosol into mitochondria. Apj1’s specialized role is related to degradation of sumoylated proteins. Together these data provide the first clear example of cochaperone duplicates that evolved specialized functions, allowing expansion of the chaperone functional network, while maintaining the overall structural organization of their parental gene.
J-protein; Hsp70; gene duplication; yeast; Hsp40; divergent evolution
Model-based analyses of natural selection often categorize sites into a relatively small number of site classes. Forcing each site to belong to one of these classes places unrealistic constraints on the distribution of selection parameters, which can result in misleading inference due to model misspecification. We present an approximate hierarchical Bayesian method using a Markov chain Monte Carlo (MCMC) routine that ensures robustness against model misspecification by averaging over a large number of predefined site classes. This leaves the distribution of selection parameters essentially unconstrained, and also allows sites experiencing positive and purifying selection to be identified orders of magnitude faster than by existing methods. We demonstrate that popular random effects likelihood methods can produce misleading results when sites assigned to the same site class experience different levels of positive or purifying selection—an unavoidable scenario when using a small number of site classes. Our Fast Unconstrained Bayesian AppRoximation (FUBAR) is unaffected by this problem, while achieving higher power than existing unconstrained (fixed effects likelihood) methods. The speed advantage of FUBAR allows us to analyze larger data sets than other methods: We illustrate this on a large influenza hemagglutinin data set (3,142 sequences). FUBAR is available as a batch file within the latest HyPhy distribution (http://www.hyphy.org), as well as on the Datamonkey web server (http://www.datamonkey.org/).
evolutionary model; coding sequence evolution; approximate Bayesian inference; parallel algorithms
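The idea of averaging over many predefined site classes, described in the abstract above, can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the FUBAR implementation: each class is assumed to be a pair of synonymous and nonsynonymous rates (alpha, beta), and the class weights and per-site conditional likelihoods are assumed to have been computed elsewhere. The function name `prob_positive_selection` is hypothetical.

```python
def prob_positive_selection(grid, weights, site_likelihoods):
    """Posterior probability that beta > alpha at one site, averaged over grid classes.

    grid             -- list of (alpha, beta) rate pairs, one per site class
    weights          -- prior weight of each class (assumed already estimated)
    site_likelihoods -- likelihood of the site's data under each class
    """
    if not (len(grid) == len(weights) == len(site_likelihoods)):
        raise ValueError("grid, weights, and site_likelihoods must have equal length")
    joint = [w * lik for w, lik in zip(weights, site_likelihoods)]
    total = sum(joint)
    if total <= 0:
        raise ValueError("site has zero likelihood under every class")
    # Sum the posterior mass of classes in which beta exceeds alpha.
    positive = sum(j for (alpha, beta), j in zip(grid, joint) if beta > alpha)
    return positive / total


# Hypothetical three-class grid; a realistic grid would contain many more classes.
toy_grid = [(1.0, 0.5), (1.0, 1.0), (1.0, 2.0)]
toy_weights = [0.5, 0.3, 0.2]
toy_site_likelihoods = [0.1, 0.2, 0.7]
p_positive = prob_positive_selection(toy_grid, toy_weights, toy_site_likelihoods)
```

Because each site's posterior is a weighted sum over a fixed grid, no site is forced into a single class, which is the property the abstract describes as leaving the distribution of selection parameters essentially unconstrained.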
Complete genome sequences contain valuable information about natural selection, but this information is difficult to access for short, widely scattered noncoding elements such as transcription factor binding sites or small noncoding RNAs. Here, we introduce a new computational method, called Inference of Natural Selection from Interspersed Genomically coHerent elemenTs (INSIGHT), for measuring the influence of natural selection on such elements. INSIGHT uses a generative probabilistic model to contrast patterns of polymorphism and divergence in the elements of interest with those in flanking neutral sites, pooling weak information from many short elements in a manner that accounts for variation among loci in mutation rates and coalescent times. The method is able to disentangle the contributions of weak negative, strong negative, and positive selection based on their distinct effects on patterns of polymorphism and divergence. It obtains information about divergence from multiple outgroup genomes using a general statistical phylogenetic approach. The INSIGHT model is efficiently fitted to genome-wide data using an approximate expectation maximization algorithm. Using simulations, we show that the method can accurately estimate the parameters of interest even in complex demographic scenarios, and that it significantly improves on methods based on summary statistics describing polymorphism and divergence. To demonstrate the usefulness of INSIGHT, we apply it to several classes of human noncoding RNAs and to GATA2-binding sites in the human genome.
molecular evolution; population genetics; noncoding DNA; regulatory sequences; probabilistic graphical models
The molecular bases for the evolution of male–female sexual dimorphism can be studied in volvocine algae because they encompass the entire range of reproductive morphologies from isogamy to oogamy. In 1978, Charlesworth suggested the model of a gamete size gene becoming linked to the sex-determining or mating type locus (MT) as a mechanism for the evolution of anisogamy. Here, we carried out the first comprehensive study of a candidate MT-linked oogamy gene, MAT3/RB, across the volvocine lineage. We found that evolution of anisogamy/oogamy predates the extremely high male–female divergence of MAT3 that characterizes the Volvox carteri lineage. These data demonstrate very little sex-linked sequence divergence of MAT3 between the two sexes in other volvocine groups, though linkage between MAT3 and the mating locus appears to be conserved. These data implicate genetic determinants other than or in addition to MAT3 in the evolution of anisogamy in volvocine algae.
gender-based divergence; gene conversion; male–female sexual dimorphism; MAT3/RB; mating type locus; volvocine algae
Analysis of genomic segments shared identical-by-descent (IBD) between individuals is fundamental to many genetic applications, from demographic inference to estimating the heritability of diseases, but IBD detection accuracy in nonsimulated data is largely unknown. In principle, it can be evaluated using known pedigrees, as IBD segments are by definition inherited without recombination down a family tree. We extracted 25,432 genotyped European individuals containing 2,952 father–mother–child trios from the 23andMe, Inc. data set. We then used GERMLINE, a widely used IBD detection method, to detect IBD segments within this cohort. Exploiting known familial relationships, we identified a false-positive rate over 67% for 2–4 centiMorgan (cM) segments, in sharp contrast with accuracies reported in simulated data at these sizes. Nearly all false positives arose from the allowance of haplotype switch errors when detecting IBD, a necessity for retrieving long (>6 cM) segments in the presence of imperfect phasing. We introduce HaploScore, a novel, computationally efficient metric that scores IBD segments proportional to the number of switch errors they contain. Applying HaploScore filtering to the IBD data at a precision of 0.8 produced a 13-fold increase in recall when compared with length-based filtering. We replicate the false IBD findings and demonstrate the generalizability of HaploScore to alternative data sources using an independent cohort of 555 European individuals from the 1000 Genomes project. HaploScore can improve the accuracy of segments reported by any IBD detection method, provided that estimates of the genotyping error rate and switch error rate are available.
identity by descent; haplotypes; population genetics; computational tools
The evolution of ants is marked by remarkable adaptations that allowed the development of very complex social systems. To identify how ant-specific adaptations are associated with patterns of molecular evolution, we searched for signs of positive selection on amino-acid changes in proteins. We identified 24 functional categories of genes which were enriched for positively selected genes in the ant lineage. We also reanalyzed genome-wide data sets in bees and flies with the same methodology to check whether positive selection was specific to ants or also present in other insects. Notably, genes implicated in immunity were enriched for positively selected genes in the three lineages, ruling out the hypothesis that the evolution of hygienic behaviors in social insects caused a major relaxation of selective pressure on immune genes. Our scan also indicated that genes implicated in neurogenesis and olfaction started to undergo increased positive selection before the evolution of sociality in Hymenoptera. Finally, the comparison between these three lineages allowed us to pinpoint molecular evolution patterns that were specific to the ant lineage. In particular, there was ant-specific recurrent positive selection on genes with mitochondrial functions, suggesting that mitochondrial activity was improved during the evolution of this lineage. This might have been an important step toward the evolution of extreme lifespan that is a hallmark of ants.
comparative genomics; sociality; dN/dS; aging; lifespan; immunity; neurogenesis; olfactory receptors; metabolism; Hymenoptera; bees; Drosophila
The plant hormone auxin is a conserved regulator of development which has been implicated in the generation of morphological novelty. PIN-FORMED1 (PIN) auxin efflux carriers are central to auxin function by regulating its distribution. PIN family members have divergent structures and cellular localizations, but the origin and evolutionary significance of this variation are unresolved. To characterize PIN family evolution, we have undertaken phylogenetic and structural analyses with a massive increase in taxon sampling over previous studies. Our phylogeny shows that following the divergence of the bryophyte and lycophyte lineages, two deep duplication events gave rise to three distinct lineages of PIN proteins in euphyllophytes. Subsequent independent radiations within each of these lineages were taxonomically asymmetric, giving rise to at least 21 clades of PIN proteins, of which 15 are revealed here for the first time. Although most PIN protein clades share a conserved canonical structure with a modular central loop domain, a small number of noncanonical clades dispersed across the phylogeny have highly divergent protein structure. We propose that PIN proteins underwent sub- and neofunctionalization with substantial modification to protein structure throughout plant evolution. Our results have important implications for plant evolution as they suggest that structurally divergent PIN proteins that arose in paralogous radiations contributed to the convergent evolution of organ systems in different land plant lineages.
auxin; auxin transport; PIN protein; plant evolution; phylogeny; protein structure
The mountain pine beetle (MPB; Dendroctonus ponderosae Hopkins), a major pine forest pest native to western North America, has extended its range north and eastward during an ongoing outbreak. Determining how the MPB has expanded its range to breach putative barriers, whether physical (nonforested prairie and high elevation of the Rocky Mountains) or climatic (extreme continental climate where temperatures can be below −40 °C), may contribute to our general understanding of range changes as well as management of the current epidemic. Here, we use a panel of 1,536 single nucleotide polymorphisms (SNPs) to assess population genetic structure, connectivity, and signals of selection within this MPB range expansion. Biallelic SNPs in MPB from southwestern Canada revealed higher genetic differentiation and lower genetic connectivity than in the northern part of its range. A total of 208 unique SNPs were identified using different outlier detection tests, of which 32 returned annotations for products with putative functions in cholesterol synthesis, actin filament contraction, and membrane transport. We suggest that MPB has been able to spread beyond its previous range by adjusting its cellular and metabolic functions, with genome scale differentiation enabling populations to better withstand cooler climates and facilitate longer dispersal distances. Our study is the first to assess landscape-wide selective adaptation in an insect. We have shown that interrogation of genomic resources can identify shifts in genetic diversity and putative adaptive signals in this forest pest species.
structure; connectivity; dispersal; population genetics; outlier detection
The nonsynonymous/synonymous rate ratio (ω = dN/dS) is an important measure of the mode and strength of natural selection acting on nonsynonymous mutations in protein-coding genes. The simplest such analysis is the estimation of the dN/dS ratio using two sequences. Both heuristic counting methods and the maximum-likelihood (ML) method based on a codon substitution model are widely used for such analysis. However, these methods do not have nice statistical properties, as the estimates can be zero or infinity in some data sets, so that their means and variances are infinite. In large genome-scale comparisons, such extreme estimates (either 0 or ∞) of ω and sequence distance (t) are common. Here, we implement a Bayesian method to estimate ω and t in pairwise sequence comparisons. Using a combination of computer simulation and real data analysis, we show that the Bayesian estimates have better statistical properties than the ML estimates, because the prior on ω and t shrinks the posterior of those parameters away from extreme values. We also calculate the posterior probability for ω > 1 as a Bayesian alternative to the likelihood ratio test. The new method is computationally efficient and may be useful for genome-scale comparisons of protein-coding gene sequences.
nonsynonymous/synonymous rate ratio; evolutionary distance; Bayesian estimation; pairwise comparisons; protein-coding sequences
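The contrast drawn in the abstract above, between a ratio estimate that can be infinite and a Bayesian estimate that a prior pulls away from extreme values, can be illustrated with a deliberately simplified Python sketch. It is not the method of the paper: instead of a codon substitution model it assumes Poisson-distributed counts of synonymous and nonsynonymous differences (ignoring multiple hits), gamma priors with arbitrary shape and rate 1.1, and a truncated grid over ω and the synonymous distance. All function names and the example counts are hypothetical.

```python
import math


def log_poisson(k, mean):
    """Log Poisson probability of k events given a positive mean."""
    return k * math.log(mean) - mean - math.lgamma(k + 1)


def log_gamma_density(x, shape, rate):
    """Log density of a gamma(shape, rate) distribution at x > 0."""
    return (shape * math.log(rate) - math.lgamma(shape)
            + (shape - 1) * math.log(x) - rate * x)


def ratio_omega(n_diff, s_diff, n_sites, s_sites):
    """Ratio of per-site proportions; infinite when no synonymous change is seen."""
    dn, ds = n_diff / n_sites, s_diff / s_sites
    if ds == 0:
        return math.inf if dn > 0 else math.nan
    return dn / ds


def posterior_omega(n_diff, s_diff, n_sites, s_sites,
                    grid=200, omega_max=10.0, d_max=3.0, shape=1.1, rate=1.1):
    """Grid approximation returning (posterior mean of omega, P(omega > 1)).

    The grid truncates omega at omega_max and the synonymous distance at d_max.
    """
    omegas = [(i + 0.5) * omega_max / grid for i in range(grid)]
    dists = [(j + 0.5) * d_max / grid for j in range(grid)]
    log_post = []
    for w in omegas:
        for d in dists:
            lp = (log_poisson(s_diff, s_sites * d)
                  + log_poisson(n_diff, n_sites * w * d)
                  + log_gamma_density(w, shape, rate)
                  + log_gamma_density(d, shape, rate))
            log_post.append((w, lp))
    peak = max(lp for _, lp in log_post)  # subtract the peak to avoid underflow
    weights = [(w, math.exp(lp - peak)) for w, lp in log_post]
    total = sum(p for _, p in weights)
    mean = sum(w * p for w, p in weights) / total
    p_gt_1 = sum(p for w, p in weights if w > 1.0) / total
    return mean, p_gt_1


# Hypothetical pair: 5 nonsynonymous and 0 synonymous differences.
extreme = ratio_omega(5, 0, 300, 100)
post_mean, post_p = posterior_omega(5, 0, 300, 100)
```

With zero synonymous differences the ratio estimate is infinite, whereas the grid posterior yields a finite mean and a probability for ω > 1 that can be reported directly.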
Although many computer programs can perform population genetics calculations, they are typically limited in the analyses and data input formats they offer; few applications can process the large data sets produced by whole-genome resequencing projects. Furthermore, there is no coherent framework for the easy integration of new statistics into existing pipelines, hindering the development and application of new population genetics and genomics approaches. Here, we present PopGenome, a population genomics package for the R software environment (a de facto standard for statistical analyses). PopGenome can efficiently process genome-scale data as well as large sets of individual loci. It reads DNA alignments and single-nucleotide polymorphism (SNP) data sets in most common formats, including those used by the HapMap, 1000 human genomes, and 1001 Arabidopsis genomes projects. PopGenome also reads associated annotation files in GFF format, enabling users to easily define regions or classify SNPs based on their annotation; all analyses can also be applied to sliding windows. PopGenome offers a wide range of diverse population genetics analyses, including neutrality tests as well as statistics for population differentiation, linkage disequilibrium, and recombination. PopGenome is linked to Hudson’s MS and Ewing’s MSMS programs to assess statistical significance based on coalescent simulations. PopGenome’s integration in R facilitates effortless and reproducible downstream analyses as well as the production of publication-quality graphics. Developers can easily incorporate new analysis methods into the PopGenome framework. PopGenome and R are freely available from CRAN (http://cran.r-project.org/) for all major operating systems under the GNU General Public License.
population genomics; software; single-nucleotide polymorphisms
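As an illustration of the kind of neutrality test mentioned in the abstract above, the following Python sketch computes Tajima's D for a small alignment. It is a stand-alone teaching example, not PopGenome code and not its R interface; the function name `tajimas_d` and the toy alignment are assumptions, and gaps or missing data are not handled.

```python
import math
from itertools import combinations


def tajimas_d(seqs):
    """Tajima's D for equal-length aligned sequences (needs at least four)."""
    n = len(seqs)
    if n < 4:
        raise ValueError("need at least four sequences")
    if any(len(s) != len(seqs[0]) for s in seqs):
        raise ValueError("sequences must be aligned and equal in length")
    # S: number of segregating (polymorphic) columns.
    seg_sites = sum(1 for col in zip(*seqs) if len(set(col)) > 1)
    if seg_sites == 0:
        return math.nan  # D is undefined without polymorphism
    # pi: mean number of pairwise differences over the whole alignment.
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(x != y for x, y in zip(a, b)) for a, b in pairs) / len(pairs)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / math.sqrt(variance)


# Hypothetical alignment in which every polymorphism is a singleton.
toy_alignment = ["AAAAAA", "TAAAAA", "ATAAAA", "AATAAA", "AAATAA"]
d_stat = tajimas_d(toy_alignment)
```

An excess of rare variants, as in this toy alignment, gives a negative D; applying the same function to successive slices of a long alignment would mimic a sliding-window analysis.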