Nearly every computational and statistical method used for comparative gene sequence analysis employ a stochastic model for estimating rates of evolutionary change, either explicitly or implicitly.
A priori knowledge about physical or chemical properties of nucleotide or amino acid residues can be used to define mechanistic models of substitutions. For example, the popular HKY85
[1] nucleotide substitution model estimates two separate substitution rates: one for
transitions-substitutions between chemically similar purines (adenine and cytosine) or pyrimidines (guanine and thymine)-and one for transversions (all other substitutions). Universal evolutionary constraints form another basis for mechanistic model derivation. Codon models of Muse-Gaut
[2] and Goldman-Yang
[3] distinguished amino-acid altering (non-synonymous) and silent (synonymous) substitutions and have formed the basis of a popular and successful suite of methods for the analysis of selective pressures on coding sequences.
Existing literature on probabilistic models for protein sequences is extensive and spans several decades. One of the first of such models was the PAM (point accepted mutation) matrix
[4]. A PAM matrix is derived from the inferred substitutions along a phylogenetic tree relating homologous sequences, by estimating the probability that any given amino acid residue in a protein will be replaced by any other residue after a pre-specified evolutionary interval. Other models based on observed sequence variability in large alignments of homologous protein sequences, such as the BLOSUM family
[5], have proven popular and successful. Karlin and Ghandour
[6] and George, Barker, and Hunt
[7] proposed methods of weighting differences based on chemical, functional, charge and structural properties of amino acids and computing replacement probabilities based on the similarity of the involved residues. Doolittle's group proposed substitution matrices based on amino acid structural similarities combined with the ease of genetic interchange
[8], while Stanfel added information pertaining to biochemical properties to inform the probability of amino acid interchangeability
[9]. More recently, a generalized index of exchangeability based on a meta-analysis of empirical data has been suggested as a means of estimating the tolerability of particular amino acid exchanges
[10]. A similar method based on pairwise amino-acid differences between homologous genes led to the derivation of the ambitiously named Universal Evolutionary Index
[11]. A more statistically robust method for model inference incorporates phylogenetic likelihood
[12], and infers substitution rates from a seed alignment, e.g., from mitochondrial sequences
[13] or a sample from several protein families
[14].
‘Generalist’ models that describe substitution patterns amalgamated from multiple genes and organisms may describe a particular organism or gene poorly. To date, there have been only a few ‘specialist’ models targeted to a particular gene
[15], or genomic region
[13]. In this manuscript we lay out a maximum likelihood framework and an accessible software implementation for estimating an organism/gene specific evolutionary model and alignment scoring matrix, describe common techniques for validating the model and infer a model from a large collection of HIV-1 sequences.
Reliable estimation of substitution rates from short sequences (e.g. 10Kb viral genomes) requires a substantial degree of sequence diversity, which may require millions of years to accumulate in vertebrates or plants. However, rapidly evolving retroviruses, with mutation rates of up to ~10
6 greater than that of vertebrates
[16],
[17], accumulate similar levels of divergence in a matter of years and are abundantly represented in public databases. Dimmic et al.
[15] used a maximum likelihood procedure to estimate an amino acid substitution rate matrix for specific application to reverse transcriptase, a key retroviral polymerase protein that transcribes viral RNA into DNA suitable for integration into the host genome. However, we found that this model fitted HIV-1 data poorly, probably because HIV adopts organism-specific substitution biases, different from other retroviruses.
To improve the predictive accuracy of substitution matrices for HIV protein evolution, we estimated two stochastic models from multiple representative HIV-1 sequence alignments using maximum likelihood. The first model was derived from HIV data sampled from within individual patients (within individual, HIV within, HIV-W
m). The second model was estimated from alignments where every sequence represented a population consensus from a patient (between individual, HIV between, HIV-B
m). At first glance, one might question the need for two separate models, since within-patient evolution could simply be a shorter timescale version of the between-patient evolution. One argument against this intuitive deduction is that most of the substitutions generated in a given individual are selected against during or following transmission and therefore do not persist at the level of host populations
[18]–
[20], resulting in potentially discordant substitution patterns. For example, substitutions which enable the virus to escape the cellular immune response in a given host can be rapidly generated and fixed
[21]. However, many of these substitutions carry a fitness cost in terms of lower replicative capacity and are not likely to persist upon transmission to an individual whose immune system does not target the same genetic region of the virus, obviating the need for a fitness-lowering substitution there
[22]. Indeed, if there were no added benefit in considering two models, one would see similar fits to both within and between-host viral samples with both models. Our findings strongly argue against this scenario (see ‘Results’), showing that substitution patterns shaped by within-and between-host selective regimes are detectably different.