When a novel mutation appears in the genome of an organism, it may have one of three effects on the fitness (w = 1 + s) of its carrier. The mutation may be deleterious (s < 0), reducing fitness through reduced fertility or survival rate. It may be neutral (s ≈ 0), that is, having such a small effect on fitness that the fate of the mutant is mostly determined by random drift. Or the mutation may be advantageous (s > 0), increasing the fitness of its carrier by increasing its fertility or survival in its environment. The frequency distribution of the different types of mutants and their associated selection coefficients (s, also known as fitness effects) is a key issue in population genetics (Bustamante 2005; Eyre-Walker and Keightley 2007). The ultimate fate of a mutation, whether it will become fixed or lost in a population, depends on the strength of selection and on the effect of random drift due to finite population size. In fact, the fitness effect s and the population size N are so closely linked that the distribution is normally expressed in terms of the population-scaled selection coefficient S = 2Ns.
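
The interplay between selection and drift can be illustrated numerically. The following is a minimal sketch, assuming the standard weak-selection diffusion approximation in which a new mutant with scaled coefficient S fixes at a rate S / (1 − exp(−S)) relative to a neutral mutant; this functional form and the function name `relative_fixation_rate` are assumptions introduced here for illustration and are not stated in the text above.

```python
import math

def relative_fixation_rate(S: float) -> float:
    """Fixation rate of a new mutant relative to a neutral one.

    Assumes the weak-selection approximation S / (1 - exp(-S)),
    where S is the population-scaled selection coefficient.
    """
    if abs(S) < 1e-8:
        return 1.0   # neutral limit of S / (1 - exp(-S))
    if S < -700.0:
        return 0.0   # avoids overflow; the true value is vanishingly small
    return S / -math.expm1(-S)

# Deleterious, neutral, and advantageous examples (illustrative values only)
for S in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"S = {S:+5.1f}  relative fixation rate = {relative_fixation_rate(S):.4g}")
```

Under this approximation, mutations with |S| well below 1 behave almost neutrally, while strongly deleterious mutations are very unlikely to fix.
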
Kimura (1968, 1983), in his neutral theory of molecular evolution, proposed that the dominant fraction (p−) of all novel mutations would be highly deleterious, with a minority fraction (p0 = 1 − p−) being neutral. When organisms colonize a new habitat or are subject to environmental change, the opportunity for adaptive evolution would arise, and a fraction (p+ = 1 − p0 − p−) of novel mutations would be advantageous. The magnitudes of these fractions for a protein-coding gene would depend on the protein in question; functionally important or structurally constrained proteins (such as the histones) would be characterized by a very large fraction of deleterious mutations (p− ≫ p0), while structurally less constrained proteins (such as the fibrinopeptides) would have a larger fraction of neutral mutations (p0 > p−). Extensions to Kimura’s theory have been made, including consideration of the contribution of nearly neutral mutations to the evolutionary process (Ohta 1973, 1992; Kimura 1983). Under this extension, there is a spectrum of nearly neutral mutations ranging from slightly deleterious to slightly advantageous, with the neutrality of a given change dependent on the population size; evolutionary trajectories consist of a balance between slightly deleterious and slightly advantageous substitutions. Others have argued that, even under more typical conditions, adaptive substitutions would be frequent, the greater probability of fixation compensating for their relative rarity among mutations (Gillespie 1994).
Akashi (1999) considered that under a neutral model the distribution of S among novel mutations could be bimodal, with the modes centered around highly deleterious and neutral mutations. During adaptive episodes, the distribution would have three modes, with a small additional mode centered around advantageous mutations. Because deleterious mutations have a vanishingly small probability of becoming fixed in a population, most substitutions (i.e., fixed mutations) would be neutral. In this case, the distribution of S among substitutions would be unimodal and centered around neutral mutations. During an adaptive episode, natural selection would drive many positively selected mutations quickly to fixation. In this case, the distribution of substitutions would be bimodal, with modes centered around nearly neutral and advantageous substitutions.
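
The step from the distribution among mutations to the distribution among substitutions can be sketched as a reweighting by fixation probability. The example below is a rough illustration only: the three mutation classes, their fractions, and their S values are hypothetical, and the relative fixation rate S / (1 − exp(−S)) is an assumed weak-selection approximation rather than something given in the text.

```python
import math

def relative_fixation_rate(S: float) -> float:
    """Assumed weak-selection approximation: S / (1 - exp(-S))."""
    if abs(S) < 1e-8:
        return 1.0
    if S < -700.0:
        return 0.0
    return S / -math.expm1(-S)

# Hypothetical trimodal distribution among novel mutations:
# class name -> (fraction of mutations, representative S)
mutations = {
    "deleterious":  (0.70, -20.0),
    "neutral":      (0.29,   0.0),
    "advantageous": (0.01,  10.0),
}

# Weight each class by its relative fixation rate, then renormalize
weights = {k: p * relative_fixation_rate(S) for k, (p, S) in mutations.items()}
total = sum(weights.values())
substitutions = {k: w / total for k, w in weights.items()}

for k in mutations:
    print(f"{k:12s} mutations: {mutations[k][0]:.2f}  substitutions: {substitutions[k]:.4f}")
```

With these made-up numbers, the deleterious mode essentially disappears among substitutions, while the small advantageous class accounts for roughly a quarter of them, which is the qualitative pattern described above.
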
While the effect of mutations can be studied experimentally, these studies are difficult to perform on higher organisms and too insensitive to observe any but the largest fitness effects (Eyre-Walker and Keightley 2007). Because of these limitations, alternative approaches have been developed that estimate the distribution of fitness effects from biological sequence data. Much of the work on estimation of the distribution of S from DNA sequence data has been carried out at the population level (e.g., Sawyer and Hartl 1992; Bustamante et al. 2002). These methods usually work with allele data from different individuals within a population, and the level of polymorphism within the population and the number of fixed differences with an outgroup species are used to estimate the distribution. These methods look at the evolutionary process over relatively short periods of time and thus normally use approximate mutation models such as the infinite alleles model (Kimura 1969, 1983, p. 43). More recently, phylogenetic methods that look at the evolutionary process over longer periods of time have been used to estimate the distribution of selection coefficients (Nielsen and Yang 2003; Yang and Nielsen 2008; Rodrigue et al. 2010). Although these use more realistic mutation models than the population-based methods, they ignore polymorphism and assume that all the observed differences among species are fixed. These two approaches sometimes result in different conclusions; population-based methods can yield an extremely large fraction of adaptive changes (Fay et al. 2001), especially in Drosophila (Sawyer et al. 2003, 2007), while phylogenetic methods often result in more modest estimates of p+ (Nielsen and Yang 2003; Rodrigue et al. 2010). Similarly, population methods find the distribution of slightly deleterious mutations falling off leptokurtically, that is, more rapidly than exponentially (such as in a gamma distribution with α < 1) (Eyre-Walker et al. 2006), while evolutionary models often yield a more rounded distribution (α > 1) (Nielsen and Yang 2003; Rodrigue et al. 2010). It is not clear whether these differences reflect the different methodologies and the approximations that they make or the details of the particular organisms under study. Worryingly, the evolutionary models fail to yield the substantial proportion of lethal mutations (Nielsen and Yang 2003; Rodrigue et al. 2010) that would be expected on the basis of mutation experiments (Wloch et al. 2001; Sanjuan et al. 2004; Hietpas et al. 2011) and that has been obtained in population-based studies (Piganeau and Eyre-Walker 2003; Yampolsky et al. 2005; Eyre-Walker et al. 2006).
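
The contrast between the two gamma shapes can be seen by evaluating the density directly. This is a small sketch, assuming the gamma distribution is placed on the magnitude |S| of deleterious effects with shape α and a scale β that is arbitrarily set to 1 here; the specific shape values 0.3 and 2.0 are illustrative and are not taken from the cited studies.

```python
import math

def gamma_pdf(x: float, alpha: float, beta: float = 1.0) -> float:
    """Density of a gamma distribution with shape alpha and scale beta, for x > 0."""
    return x ** (alpha - 1.0) * math.exp(-x / beta) / (math.gamma(alpha) * beta ** alpha)

xs = (0.01, 0.1, 0.5, 1.0, 2.0, 4.0)   # illustrative values of |S|
for alpha in (0.3, 2.0):               # alpha < 1 versus alpha > 1
    row = "  ".join(f"{gamma_pdf(x, alpha):8.3f}" for x in xs)
    print(f"alpha = {alpha}: {row}")
```

For α < 1 the density is highest next to zero and decreases monotonically, whereas for α > 1 it rises from zero to an interior mode at (α − 1)β before declining, giving the more rounded shape.
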
One of the difficulties in estimating the distribution of selection coefficients is the complex nature of the selective constraints, even within a single protein, representing a range of functional, structural, and physiological requirements. Certain locations, such as those involved in protein function, may be invariant, while other locations may have a wide latitude in the amino acids compatible with that position. It is not only the magnitude of the selective constraints that varies from one location to another; one position may be constrained to hydrophobic residues, another constrained to residues that can take part in hydrogen bonding interactions, and a third may require a certain degree of flexibility. The types of substitutions that can occur can be substantially different, even among locations that are changing at similar rates. Different approaches have addressed this issue to various degrees. For instance, Nielsen and Yang (2003) allowed the overall rate of substitution to vary from one location to another but assumed that this rate variation would affect all possible substitutions equally; that is, slowly varying locations were as unrestricted in the amino acids they could accept as rapidly varying locations. Thorne et al. (2007) relaxed the standard assumption of independent sites, considering the selective constraints imposed by the need to maintain a stable well-defined structure; these constraints were estimated using protein structure prediction algorithms, even though those algorithms were designed for a quite different problem. Rodrigue et al. (2010) adapted a mixture-model approach that grouped locations under similar selective constraints and developed more specific models for characterizing these different types of locations; each individual location was then represented by a mixture of these models (Koshi and Goldstein 1998). The available data determined the number of components in the mixture that could be justified.
The most specific characterization of the substitution process was developed by Halpern and Bruno (1998), who proposed a sitewise phylogenetic model where evolution at each amino acid residue in a protein is characterized by a location-specific set of fitnesses and by the nucleotide-level mutation pattern. Although Halpern and Bruno demonstrated its utility for the estimation of evolutionary distances, use of the model has been limited, as the number of adjustable parameters required more data and computational resources than have previously been available. Here we explore the use of this model in the estimation of the distribution of S. We are interested in assessing how the assumption of site-specific fitnesses may affect estimates of the shape of the distribution of S among novel mutations and substitutions. We apply a modified version of their model to a data set of 12 mitochondrial proteins in 244 mammalian species. We also apply this model to a data set of a polymerase protein from 401 influenza viruses isolated from avian and human hosts. As the human viruses are the product of a host shift event from an avian host (Taubenberger et al. 2005), this allows us to investigate the distribution of selection coefficients during a well-defined adaptive episode.
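
To make the idea of location-specific fitnesses concrete, the sketch below shows one way a site-specific substitution rate could be assembled from a mutation rate and a fixation term. It is a simplified illustration under stated assumptions, not the model as specified by Halpern and Bruno or as modified here: it works at the amino acid level with a single uniform mutation rate instead of a nucleotide-level mutation pattern, it assumes the scaled coefficient for a change from residue i to residue j is the difference of their scaled fitnesses, and it assumes the relative fixation rate S / (1 − exp(−S)). The fitness values and all names are hypothetical.

```python
import math

def relative_fixation_rate(S: float) -> float:
    """Assumed weak-selection approximation: S / (1 - exp(-S))."""
    if abs(S) < 1e-8:
        return 1.0
    if S < -700.0:
        return 0.0
    return S / -math.expm1(-S)

# Hypothetical scaled fitnesses at one site that prefers hydrophobic residues
site_fitness = {"L": 0.0, "I": -0.5, "V": -1.0, "D": -12.0}

def substitution_rate(i: str, j: str, fitness: dict, mutation_rate: float = 1.0) -> float:
    """Rate of substitution i -> j: mutation rate times relative fixation rate."""
    S_ij = fitness[j] - fitness[i]   # assumed scaled selection coefficient
    return mutation_rate * relative_fixation_rate(S_ij)

for i, j in [("L", "I"), ("I", "L"), ("L", "D"), ("D", "L")]:
    print(f"{i} -> {j}: S = {site_fitness[j] - site_fitness[i]:+6.1f}  "
          f"rate = {substitution_rate(i, j, site_fitness):.4g}")
```

In this sketch, changes among the similar residues L, I, and V are nearly neutral in both directions, whereas changes to D are strongly disfavored and changes away from D are strongly favored, so the implied distribution of S depends on which residue currently occupies the site.
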