We model individual learning of frequency distributions by assuming that our learners use Bayesian inference, a rational procedure for belief updating that explicitly represents the expectations of learners (Robert 1997). Assume that a learner is exposed to *N* occurrences of a linguistic form, such as a sound, word or grammatical construction, partitioned over *K* different variants. Let the vectors **x** = (*x*_{1}, …, *x*_{K}) and *θ* = (*θ*_{1}, …, *θ*_{K}) denote the observed frequencies and the estimated probabilities of the *K* variants, respectively. The learner's expectations are expressed in a prior probability distribution, *p*(*θ*). After seeing the data **x**, the learner assigns posterior probabilities *p*(*θ*|**x**) specified by Bayes' rule,

*p*(*θ*|**x**) = *p*(**x**|*θ*) *p*(*θ*) / ∫ *p*(**x**|*θ*) *p*(*θ*) d*θ*,

where *p*(**x**|*θ*) is the *likelihood*, indicating the probability of observing the frequencies **x** from the distribution *θ*. We take the likelihood to be the probability of obtaining the frequencies in **x** via *N* draws from a multinomial distribution with parameters *θ*, each draw being a statistically independent event. The learner estimates the parameter *θ* from a sample of *N* tokens produced by a speaker before producing any utterances himself. The posterior combines the learner's expectations, represented by the prior, with the evidence about the underlying distribution provided by the observed frequencies.
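To make this update concrete, here is a minimal sketch in Python of Bayes' rule with a multinomial likelihood, evaluated on a grid for *K* = 2. The data vector, the grid resolution and the uniform prior are all hypothetical choices for illustration, not part of the model specification above:

```python
import math

def multinomial_likelihood(x, theta):
    """p(x | theta): probability of the frequency vector x under
    N = sum(x) independent draws from a multinomial with parameters theta."""
    n = sum(x)
    coef = math.factorial(n)
    for x_k in x:
        coef //= math.factorial(x_k)
    prob = float(coef)
    for x_k, theta_k in zip(x, theta):
        prob *= theta_k ** x_k
    return prob

# Hypothetical data: N = 10 tokens of a form with K = 2 variants,
# observed 7 and 3 times.
x = (7, 3)

# Bayes' rule on a grid, with a uniform prior p(theta) for illustration:
# p(theta | x) is proportional to p(x | theta) * p(theta).
grid = [i / 100 for i in range(1, 100)]
unnormalized = [multinomial_likelihood(x, (t, 1 - t)) for t in grid]
evidence = sum(unnormalized)            # approximates the integral p(x)
posterior = [u / evidence for u in unnormalized]

# Under a flat prior the posterior peaks at the empirical proportion 7/10.
mode = grid[posterior.index(max(posterior))]
```

With more data the posterior concentrates more tightly around the empirical proportions; an informative prior would shift and reshape it, as discussed next.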

We select the prior distribution to be neutral between linguistic variants, with no variant being favoured *a priori* over the others. This assumption differs from other computational models that emphasize selection or directed mutation at the level of linguistic variants, as discussed above. However, being neutral between variants is not enough to specify a prior distribution: learners can also differ in their expectations about the amount of probabilistic variation in a language. For example, learners facing unpredictable variation may either reproduce this variability accurately or collapse it towards more deterministic rules, a process referred to as *regularization* (Hudson & Newport 2005; Reali & Griffiths 2009). A way to capture these expectations, while maintaining neutrality between variants, is to assume that the prior is a *K*-dimensional Dirichlet distribution, a multivariate generalization of the Beta distribution and a standard prior in Bayesian statistics (Bernardo & Smith 1994). In the context of language, Dirichlet priors have recently been used in models of iterated learning (Kirby *et al.* 2007) and language acquisition (Goldwater *et al.* 2009; Reali & Griffiths 2009).

More formally, we assume that the prior *p*(*θ*) is a symmetric *K*-dimensional Dirichlet distribution with parameters *α*/*K*, giving

*p*(*θ*) = (*Γ*(*α*) / *Γ*(*α*/*K*)^{K}) ∏_{k=1}^{K} *θ*_{k}^{(*α*/*K*)−1},

where *Γ*(·) is the generalized factorial function. By using a distribution that is symmetric we maintain neutrality between different variants. When *K* = 2, the prior reduces to a Beta distribution, denoted Beta(*α*/2, *α*/2). The use of the same parameter, *α*/*K*, for all variants ensures that the prior does not favour one variant over the others, with the mean of the prior distribution being the uniform distribution over variants for all values of *α* and *K*. However, the value of *α*/*K* determines the expectations that learners have about probabilistic variation. When *α*/*K* < 1, the learner tends to assign high probability to one of the *K* competing variants. This situation reflects a tendency to regularize languages, with probabilistic variation being reduced towards more deterministic rules. When *α*/*K* > 1, the learner tends to weight all competing variants equally, producing distributions closer to the uniform distribution over all variants (see *b* for examples). Thus, despite the apparent complexity of the formula, the Dirichlet prior captures a wide range of biases that are intuitive from a psychological perspective.
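The effect of *α*/*K* can be illustrated by sampling from the prior itself. The following sketch draws symmetric Dirichlet samples via normalized Gamma draws; the particular parameter values (0.1 and 10) and number of draws are arbitrary, chosen only to contrast the two regimes:

```python
import random

random.seed(0)

def sample_dirichlet(alpha_over_k, k):
    """One draw from a symmetric K-dimensional Dirichlet distribution with
    all parameters equal to alpha/K, via normalized Gamma(alpha/K, 1) draws."""
    g = [random.gammavariate(alpha_over_k, 1.0) for _ in range(k)]
    total = sum(g)
    return [v / total for v in g]

def mean_max_prob(alpha_over_k, k=2, draws=2000):
    """Average probability of the single most likely variant under the prior."""
    return sum(max(sample_dirichlet(alpha_over_k, k))
               for _ in range(draws)) / draws

# alpha/K < 1: samples are near-deterministic, one variant dominating.
sparse = mean_max_prob(0.1)
# alpha/K > 1: samples lie close to the uniform distribution over variants.
uniformish = mean_max_prob(10.0)
```

For *K* = 2 the `sparse` average lies close to 1 (almost all prior mass on one variant per draw), while `uniformish` lies close to 0.5, matching the regularizing and variation-preserving expectations described above.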

Some intuitions for the consequences of using different priors can be obtained by considering how they affect the predictions that learners would make about probability distributions. Under the model defined above, the probability that a learner assigns to the next observation being variant *k* after seeing *x*_{k} instances of that variant from a total of *N* is (*x*_{k} + *α*/*K*)/(*N* + *α*) (see the electronic supplementary material for details). This formula captures two aspects of the learners' behaviour. First, the probability that the learner assigns to a variant is approximately proportional to its frequency *x*_{k}. This means that individual variants get strengthened by use. Second, the parameter *α*/*K* acts like a number of additional observations of each variant. The largest effect of these additional observations will be when there are no actual observations, with *x*_{k} = 0. In this case, a learner expecting a more deterministic language (with *α*/*K* small) will assign a very small probability to the unobserved variant, while a learner expecting probabilistic variation (with *α*/*K* large) will assign it a much higher probability. The prior thus expresses the willingness of learners to consider unobserved variants part of the language.
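The predictive formula (*x*_{k} + *α*/*K*)/(*N* + *α*) is simple enough to sketch directly; the counts and *α* values below are hypothetical:

```python
def predictive_prob(x_k, n, alpha, k):
    """Probability the learner assigns to the next observation being a given
    variant, after x_k observations of it out of n total, under a symmetric
    Dirichlet prior with parameters alpha/K: (x_k + alpha/K) / (n + alpha)."""
    return (x_k + alpha / k) / (n + alpha)

# Hypothetical case: K = 2 variants, N = 10 tokens, all of variant 1.
# A regularizing learner (alpha/K = 0.1) nearly rules out the unseen variant;
# a learner expecting variation (alpha/K = 10) still gives it real mass.
p_unseen_regularizer = predictive_prob(0, 10, alpha=0.2, k=2)
p_unseen_variation = predictive_prob(0, 10, alpha=20.0, k=2)

# Predictive probabilities over all K variants sum to one.
total = predictive_prob(10, 10, alpha=0.2, k=2) + p_unseen_regularizer
```

Here the regularizing learner assigns the unseen variant under 1 per cent probability, while the variation-expecting learner assigns it a third, illustrating how *α*/*K* acts as a pseudo-count of additional observations.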

This model can be extended to cover learning a distribution over an unbounded set, such as the vocabulary of a language. In this case, word production can be viewed intuitively as a *cache* model: each word in the language is either retrieved from a cache or generated anew. Using an infinite-dimensional analogue of the Dirichlet prior (see the electronic supplementary material for details), the probability of a variant that occurred with frequency *x*_{k} is *x*_{k}/(*N* + *α*), while the probability of a completely new variant is *α*/(*N* + *α*). The parameter *α* thus controls the tendency to produce new variants, as before. There is also a two-parameter generalization of the infinite-dimensional Dirichlet model, which gives a variant that occurred with frequency *x*_{k} probability (*x*_{k} − *δ*)/(*N* + *α*), while the probability of a completely new variant is (*δ* *K*_{+} + *α*)/(*N* + *α*), where *δ* ∈ (0, 1) is a second parameter allowing *K*_{+}, the number of variants for which *x*_{k} > 0, to influence the probability of producing a new variant (see the electronic supplementary material for details).
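The two-parameter cache model can be sketched as a sequential sampler: at each step an existing variant is reused in proportion to its discounted frequency, and otherwise a new variant enters the cache. Setting the discount to zero recovers the one-parameter case; the parameter values below are arbitrary:

```python
import random

random.seed(1)

def generate_counts(n, alpha, delta):
    """Generate n tokens from the two-parameter cache model.

    With i tokens produced so far, an existing variant k is reused with
    probability (x_k - delta)/(i + alpha), and a brand-new variant is
    created with the remaining mass (delta * K_plus + alpha)/(i + alpha),
    where K_plus is the number of distinct variants used so far.
    Returns the frequency x_k of each variant."""
    counts = []
    for i in range(n):
        r = random.random() * (i + alpha)
        acc = 0.0
        choice = None
        for k, x_k in enumerate(counts):
            acc += x_k - delta
            if r < acc:
                choice = k
                break
        if choice is None:          # leftover mass: produce a new variant
            choice = len(counts)
            counts.append(0)
        counts[choice] += 1
    return counts

counts = generate_counts(1000, alpha=2.0, delta=0.5)
```

The resulting frequency distribution is heavy-tailed: a few variants accumulate most of the tokens while new, rare variants keep appearing, which is the qualitative behaviour expected of an open vocabulary.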