Verbal working memory (WM) tasks typically involve the language production architecture for recall; however, language production processes have had a minimal role in theorizing about WM. A framework for understanding verbal WM results is presented here. In this framework, domain-specific mechanisms for serial ordering in verbal WM are provided by the language production architecture, in which positional, lexical, and phonological similarity constraints are highly similar to those identified in the WM literature. These behavioral similarities are paralleled in computational modeling of serial ordering in both fields. The role of long-term learning in serial ordering performance is emphasized, in contrast to some models of verbal WM. Classic WM findings are discussed in terms of the language production architecture. The integration of principles from both fields illuminates the maintenance and ordering mechanisms for verbal information.
Nearly 30 years ago, Andrew Ellis (1980) observed that errors on tests of verbal working memory (WM) paralleled those that occur naturally in speech production. Despite this significant observation, the majority of memory and language research since that time has focused on relations between verbal WM and language comprehension and acquisition rather than on the relation between verbal WM and language production (Baddeley, Eldridge, & Lewis, 1981; Caplan & Waters, 1999; Daneman & Carpenter, 1980, 1983; Engle, Cantor, & Carullo, 1992; Gathercole & Baddeley, 1990; Just & Carpenter, 1992). The relative inattention to language production is striking given the production demands of typical verbal WM tasks: the maintenance and sequential output of verbal information. In this review, we examine the relation between verbal WM and language production processes in light of new behavioral and theoretical advances since Ellis’s initial observations.
Verbal WM refers to the temporary maintenance and manipulation of verbal information (Baddeley, 1986). In exploring the production–WM relation, our review emphasizes domain-specific (i.e., verbal) maintenance processes in WM rather than the domain-general, attentional processes that are hypothesized to oversee processing across different domains (e.g., verbal, visual; Baddeley, 1986; Cowan, 1995). Obviously, language production processes must be involved during output in spoken recall, but the hypothesis that we explore here is that the production system is crucial to maintenance as well. We suggest that the domain-specific mechanism underlying the maintenance of serial order in verbal WM is achieved by the language production architecture rather than by a system specifically dedicated to short-term maintenance.
Language production planning naturally involves the maintenance and ordering of linguistic information. This information ranges over multiple levels, including messages (several different points that the speaker plans to make), words within phrases, phrases within sentences, and articulatory gestures for executing the utterance. All of the processes that enable fluent production are potential sources of serial order maintenance in WM tasks. Our particular focus here is the stage of language production termed phonological encoding, the process by which a word is specified as a sequence of phonemes for the purposes of articulation, serving as a midpoint between word selection and articulation (Garrett, 1975). This level is a logical starting place to explore the WM–production relation because so many WM studies have investigated the production of unrelated words or nonwords, absent the complex messages that typify language production in conversation.
The article is divided into three sections. First, we introduce key concepts, theory, and terminology from the WM and production domains. Second, we explore similarities in experimental findings and computational approaches to serial ordering in both fields. These similarities include the means by which serial ordering is achieved, constraints on serial ordering processes, the susceptibility of ordering to phonological similarity, and long-term learning effects associated with lexical and sublexical frequency. We hypothesize that these similarities are not coincidental but reflect the fact that the same mechanisms responsible for serial ordering in language production underlie these processes in verbal WM. Thus, serial order maintenance processes in verbal WM reflect the activity of the language production system, whose behavior is shaped by a lifetime of experience, rather than relying on memory stores specifically dedicated to short-term maintenance (e.g., the phonological store; Baddeley, 1986). In the final section, we consider outstanding questions and limitations of this hypothesis and how future research might address them.
Investigations of verbal WM have yielded a number of robust phenomena concerning behavioral performance in verbal WM tasks. These phenomena are central to any theory of verbal WM because they provide information about the nature of the processing architecture underlying maintenance of verbal material and because the patterns of performance have stood the test of replication. Thus, models of verbal WM must account for these phenomena (Baddeley, 1986; Botvinick & Plaut, 2006; Henson, 1998). The core phenomena consist of effects of phonological similarity (Baddeley, 1966; Conrad, 1964, 1965; Conrad & Hull, 1964; Crowder, 1976; Fallon, Groves, & Tehan, 1999; Wickelgren, 1965a, 1965b), word length (Baddeley, Thomson, & Buchanan, 1975), concurrent articulation (Baddeley, Lewis, & Vallar, 1984; Henson, Hartley, Burgess, Hitch, & Flude, 2003; Larsen & Baddeley, 2003; Levy, 1971; Longoni, Richardson, & Aiello, 1993; D. J. Murray, 1968), irrelevant sound (Colle & Welsh, 1976; Jones & Macken, 1993; Jones, Macken, & Nicholls, 2004; Macken & Jones, 1995; Salame & Baddeley, 1982, 1986), serial position (Baddeley, 1986), and presentation modality (Craik, 1969; Crowder & Morton, 1969). The behavioral results corresponding to each of these are summarized in Table 1.
These core phenomena have motivated a number of theoretical and computational models of verbal WM. Most prominent among them is the multicomponent model (see Figure 1; Baddeley, 1986; Baddeley & Hitch, 1974). The model consists of a central executive that controls two specialized short-term buffers that maintain information. The visuospatial sketchpad maintains visual information, whereas verbal information, our focus here, is maintained by the phonological loop. Key assumptions of the multicomponent model are that (a) it is composed of buffers that are specifically dedicated to the short-term maintenance of information, (b) information passively decays with time, and (c) representations in the phonological store are explicitly phonological in nature. The phonological similarity effect and the irrelevant speech effect are used to justify the phonological nature of the store, with phonological similarity effects emerging from competition among items in the store (Baddeley, 1986). Word length effects are viewed as evidence for the decay of information in the phonological store. Items that take longer to articulate will not be refreshed in the phonological store as frequently as those that are more quickly articulated; thus, they will decay more, leading to poorer recall (Baddeley et al., 1975).
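The decay-and-rehearsal logic behind the word length account can be made concrete with a toy calculation. The sketch below is not an implementation of the multicomponent model; the exponential decay function, the decay rate, and the articulation durations are illustrative assumptions chosen only to show why slower articulation implies weaker traces at the moment of refresh.

```python
import math

def rehearsal_cycle_time(durations_s):
    """Time (in seconds) needed to rehearse every item in the list once."""
    return sum(durations_s)

def strength_before_refresh(durations_s, decay_rate=0.5):
    """Trace strength just before an item is rehearsed again.

    Assumes (hypothetically) that a trace is reset to 1.0 when rehearsed
    and then decays exponentially until the next rehearsal, one full
    cycle later.
    """
    return math.exp(-decay_rate * rehearsal_cycle_time(durations_s))

# Hypothetical articulation times for five short and five long words.
short_words = [0.3] * 5
long_words = [0.7] * 5

short_strength = strength_before_refresh(short_words)
long_strength = strength_before_refresh(long_words)
print(f"short words: {short_strength:.3f}, long words: {long_strength:.3f}")
```

Under these assumptions the only difference between the two lists is the length of the rehearsal cycle, which is the sense in which the word length effect is taken as evidence for time-based decay.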
Serial position effects are explained in a similar fashion. Primacy effects occur because early items are rehearsed more than are later items; hence their activations are higher than those of items in the middle. The last items in a list suffer less decay due to their recent presentation (Baddeley, 1986). Finally, the complicated interactions of phonological similarity and word length effects with concurrent articulation are explained in terms of what types of input (auditory or visual) have access to the phonological store. Verbal materials presented auditorily have direct access to the store, whereas visually presented material must be converted into a phonological code (Baddeley, 1986; D. J. Murray, 1967). The privileged access that auditory information has to the phonological store underlies the effect of presentation modality (Crowder & Morton, 1969). Additionally, concurrent articulation abolishes the phonological similarity effect for items presented visually, as concurrent articulation is thought to prevent the conversion of a visual code to a phonological one (Baddeley et al., 1984).
The account offered by the multicomponent model is not without its critics; many of the criticisms apply broadly to other accounts of verbal WM as well. For instance, results that have been attributed to passive decay may be understood as emerging from interference among items in memory (Cowan, Wood, Nugent, & Treisman, 1997; Nairne, 1990). Additionally, the phonological store and articulatory rehearsal processes may reflect more general auditory and motor processing (Jones, Hughes, & Macken, 2006; Jones & Macken, 1993; Macken & Jones, 1995; Reisberg, Rappaport, & O’Shaughnessy, 1984). For the purposes of the current review, the most significant challenge to WM accounts is Crowder’s (1993) suggestion that there is no need to posit stores specifically dedicated to short-term retention (see MacDonald & Christiansen, 2002; Postle, 2006, for a similar conclusion). His arguments target the interpretation of the recency effect in serial position curves as evidence for stores specifically dedicated to short-term maintenance, akin to Crowder and Morton’s (1969) precategorical acoustic store. Other researchers have found that recency effects can remain after long periods of time (i.e., long-term recency; Aldridge & Farrell, 1977), thus calling into question the interpretation of recency as evidence for short-term storage. Crowder (1993) has argued for a unitary memory system with a common underlying process (e.g., temporal distinctiveness; Glenberg & Swanson, 1986). On this view, differences between short- and long-term memory are due to general processes acting over different time scales. Another source of evidence against specialized short-term stores that is consistent with Crowder’s (1993) criticism, and one that will be explored in more detail below, is that long-term linguistic knowledge affects verbal WM performance (Hulme, Maughan, & Brown, 1991; Roodenrys, Hulme, Lethbridge, Hinton, & Nimmo, 2002; Jefferies, Frankish, & Lambon Ralph, 2006a, 2006b).
Proponents of specialized short-term stores have addressed long-term effects by adding secondary, long-term mechanisms that interact with short-term representations. The recent addition of an episodic buffer to the multicomponent model (Baddeley, 2000, 2003) represents one example. However, the most prominent example of a secondary mechanism that supplements the specialized storage systems comes in trace redintegration accounts (e.g., Hulme et al., 1991; Hulme, Roodenrys, Schweickert, & Brown, 1997). According to these accounts, information that has decayed in WM must be restored (redintegrated) before it can be output. Long-term knowledge influences performance during the process of redintegration. Degraded traces in WM are compared with long-term memory representations in a late-stage process (e.g., prior to output), and this comparison “cleans up” the information that has degraded.
An alternative to trace redintegration is found in accounts that explicitly incorporate long-term representations. A prominent example is Cowan’s (1988, 1995) embedded process model, in which WM is simply temporarily activated long-term memory under the focus of attentional control, akin to Norman and Shallice’s (1986) supervisory attentional system and to Baddeley’s (1986) central executive. According to this account, memory is a unitary system (Crowder, 1993). Long-term representations are temporarily activated when individuals are exposed to items (e.g., visual, verbal), rendering the information accessible. These activated representations support performance on explicit recall tasks such as word or digit span only when they occupy the focus of attention. Attentional focus is theorized to be capacity limited, thus explaining Miller’s (1956) notion of a short-term memory capacity for about seven chunks of information (e.g., words), although Cowan (2001) himself posits that this capacity is closer to four chunks.
Cowan’s (1988, 1995) approach provides a general framework for conceptualizing WM as emergent from long-term processing under the focus of attention. It does not, however, offer a detailed mechanistic account of the core WM phenomena described above or serial ordering processing in the verbal domain more generally. For example, it does not explain the nature of the long-term representations that might subserve verbal WM and how they encode, maintain, and output a sequence of phonological representations. In the following section, we move toward a specification of these components and introduce the processes serving language production (for a more thorough review see Bock, 1996; Dell, Burger, & Svec, 1997), which we hypothesize to be the long-term, domain-specific system underlying the serial ordering behavior described above.
In its early days, the field of language production relied heavily on analyses of naturally occurring speech errors (Fromkin, 1971; Garrett, 1975; Nooteboom, 1969; Shattuck-Hufnagel, 1979). The results revealed that word-level speech errors (e.g., exchanges of words) and sound-level errors (e.g., exchanges of phonemes) had very different distributions and character. Thus, Garrett (1975) suggested separate levels for processing grammatical (functional) and serial (positional) information. Words are selected and specified within a grammatical structure and then their phonological form is determined in a process termed phonological encoding, a step that necessarily precedes articulation.
The modal view is that production processes can be divided into message-level (i.e., semantic), grammatical (i.e., functional and positional), phonological, and articulatory encoding processes (Bock, 1986; Levelt, 1989). Timing studies have been central in investigating how these levels interact, with the picture–word interference paradigm (Schriefers, Meyer, & Levelt, 1990) serving as an important example. In its most general form, the picture–word interference paradigm requires that an individual produce the name of some pictured object or action while ignoring an auditorily or visually presented word. The time course of encoding various representations (e.g., semantic, phonological) can be examined by varying the onset asynchrony between the interfering word and the picture to be named and the nature of the overlap between the word and the picture name (semantic overlap, phonological overlap, and others). Data from these and other studies have supported the view that semantic activation (in some models, activation of semantic features) precedes grammatical encoding (choosing the grammatical structure, such as active or passive sentence structure) and lexical selection (choosing words to fit activated semantics, as in choosing mug or cup for a hot-liquid container), which in turn precedes phonological encoding of the selected word. Research from neuropsychology has confirmed these separable levels of representation. For example, the dissociation between agrammatism and nonagrammatic aphasia suggests that functional (e.g., what grammatical role a word plays) and positional (e.g., where a word is located in a sentence) levels of processing are largely separate, although there is evidence that these levels interact (Dell & Sullivan, 2004). 
Additionally, phonological encoding and articulatory planning are differentiated on the basis of a dissociation in error patterns that is observed among individuals with fluent and nonfluent aphasia (Romani, Olson, Semenza, & Grana, 2002). Language production involves interactions among all levels of representation, and all are likely involved to some extent in verbal WM tasks. Investigation of the phonological encoding process in particular has revealed a number of experimental findings that are paralleled in verbal WM performance. We describe this stage of production below.
Analyses of speech errors have revealed a number of replicable phenomena that are central to phonological encoding and to our claim that phonological encoding is the domain-specific mechanism subserving maintenance in verbal WM. For instance, erroneous utterances are subject to constraints in the distance over which elements move within an utterance and across syllable positions (Fromkin, 1971; MacKay, 1970; Nooteboom, 1969; Shattuck-Hufnagel, 1979). Additionally, speech errors are more likely to occur under conditions of phonological similarity than dissimilarity (Shattuck-Hufnagel, 1979). Finally, speech errors are more likely to occur if the resulting utterance is a word than if it is a nonword (Baars, Motley, & MacKay, 1975).
In order to account for these constraints, language production researchers have emphasized interactions across levels of representation (e.g., words, syllables, and phonemes). For instance, in Dell’s (1986) interactive activation model of word production, activation of words feeds forward in the network and activates a syllabic frame specifying the order of phonemes (e.g., consonant-vowel-consonant, as in the word CAT). These syllabic frames then activate phonological representations for each individual speech sound in a language, activating them over time. Critically, each level of representation also feeds back to prior levels; thus, phonological representations feed back to the syllabic frames, and the syllabic frames feed back to the lexical level. Lexical constraints emerge from the fact that the model has a stored, word-level representation that serves as input. Ordering constraints on errors emerge from the activation of the model over time and the representation of structure within syllabic frames. Finally, phonological overlap effects occur due to feedback from phonological representations to the lexical level, where the activation of a phonological form leads to further activation of the desired word (e.g., CAT) and also to phonologically related words (e.g., HAT, BAT, CAP). These phonologically related words then feed back to the phonological representation, sometimes resulting in the selection of inappropriate phonemes.
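The feedforward and feedback dynamics described above can be illustrated with a very small spreading-activation network. This is a minimal sketch, not Dell's (1986) model: the five-word lexicon, the use of slot-tagged phoneme nodes in place of separate syllabic frames, and the `rate` and `decay` parameters are all assumptions made for illustration.

```python
# Toy lexicon: each word is an (onset, vowel, coda) triple.
LEXICON = {
    "cat": ("k", "a", "t"),
    "hat": ("h", "a", "t"),
    "bat": ("b", "a", "t"),
    "cap": ("k", "a", "p"),
    "dog": ("d", "o", "g"),
}
SLOTS = ("onset", "vowel", "coda")

def word_nodes(word):
    """Slot-tagged phoneme nodes for a word, e.g. ('onset', 'k')."""
    return list(zip(SLOTS, LEXICON[word]))

def spread(target, steps=5, rate=0.1, decay=0.5):
    """Run a few cycles of word -> phoneme feedforward and phoneme -> word feedback."""
    words = {w: 0.0 for w in LEXICON}
    words[target] = 1.0  # the intended word receives external input once
    phonemes = {node: 0.0 for w in LEXICON for node in word_nodes(w)}
    for _ in range(steps):
        # Feedforward: each phoneme node sums activation from words containing it.
        new_phonemes = {
            node: act * (1 - decay)
            + rate * sum(words[w] for w in LEXICON if node in word_nodes(w))
            for node, act in phonemes.items()
        }
        # Feedback: each word sums activation from its own phoneme nodes.
        new_words = {
            w: act * (1 - decay) + rate * sum(phonemes[n] for n in word_nodes(w))
            for w, act in words.items()
        }
        phonemes, words = new_phonemes, new_words
    return words, phonemes

words, phonemes = spread("cat")
for w in sorted(words, key=words.get, reverse=True):
    print(f"{w}: {words[w]:.4f}")
```

In this sketch the phonological neighbors of CAT (HAT, BAT, CAP) become active only through feedback, and they in turn pass activation to phonemes that are not part of the target, such as the onset /h/. An unrelated word such as DOG receives no activation at all, which is the sense in which feedback produces phonological overlap effects.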
This brief review shows that one key insight about the serial ordering of verbal information in language production is that serial ordering results from interactions across multiple levels of representation over time, that is to say, as a result of recurrent connectivity. Thus, serial ordering is governed at a phonological level and at a lexical–semantic level. In the following section, we bring together behavioral research in language production and verbal WM by reviewing common findings and approaches with respect to the serial ordering of verbal information.
A common error in serial tasks is the production (or recall) of a correct element but in an incorrect serial position. In both WM tasks and natural speech production, these order errors are not random; they exhibit clear tendencies (or positional constraints) that are informative about the nature of the underlying processes.
An important positional constraint in WM research is the position within a recall list over which items are likely to transpose. When an item is recalled in the incorrect serial position in a list, the erroneous item is likely to move only one or two list positions earlier or later than its correct location (Haberlandt, Thomas, Lawrence, & Krohn, 2005). This strong bias toward nearby positions is paralleled in language production, in which misordered sublexical units, such as phonemes, overwhelmingly appear within one or two words of their intended position (Shattuck-Hufnagel, 1979). WM researchers (e.g., Henson, 1998; Burgess & Hitch, 1992, 1999) have invoked memory-specific mechanisms to account for positional constraints, but they may instead emerge as a natural consequence of utterance planning.
The astute reader will note that the positional constraints described above occur at the item level (position in a list) in WM tasks, whereas they occur at the sublexical level in speech. Word-ordering errors in natural speech production do occur, but they are far less common than errors at the sublexical level. Within the sublexical level, the syllable position constraint occurs at the level of individual phonemes or phoneme clusters, in which planning segments overwhelmingly move to only a small set of possible locations (Fromkin, 1971; MacKay, 1970; Nooteboom, 1969). Importantly for our purposes, segments tend to exchange with other segments that share the same position in an utterance. For example, syllable onsets (the consonant or consonants preceding the vowel in a syllable) exchange with onsets, and codas (the consonant or consonants following the vowel in a syllable) exchange with codas, but onsets rarely exchange with codas. The consistency of this latter constraint suggests that there may be a representation of serial position that is at least partially independent of content (Dell, 1986; Shattuck-Hufnagel, 1979). Alternatively, these positional constraints may reflect long-term learning about which speech sounds are likely to follow which others (i.e., the phonotactics of the language; Vitevitch, Luce, Charles-Luce, & Kemmerer, 1997) and how syllable structure shapes utterance planning and articulation (Dell, Juliano, & Govindjee, 1993).
The best known positional constraints in memory recall are primacy and recency. Primacy refers to the superior recall of items occurring at the beginning of a list; recency is the superior recall of the last few items in a list. These properties of memory performance are not specific to WM. As early as the turn of the 20th century, researchers such as Hermann Ebbinghaus noted that memory performance in free recall tasks produced characteristic “learning” curves as a function of the position in which items were encountered (Ebbinghaus, Ruger, & Bussenius, 1913). Although these learning curves were originally applied to phenomena in long-term memory, similar serial position curves with their characteristic U-shaped pattern have been present in nearly every study of verbal WM as well (Baddeley, 1986; Conrad & Hull, 1964; Levy, 1971). Interestingly, these listwise serial position effects occur not only in accuracy and for entire lists but also for word and interword durations during list recall (Haberlandt, Lawrence, Krohn, Bower, & Thomas, 2005). Similar serial position effects have been observed for the syllables comprising the production of individual nonwords in isolation (Gupta, Lipinski, Abbs, & Lin, 2005). As noted previously, primacy has been attributed to increased rehearsal of early items, and recency has been attributed to the function of a short-term store (e.g., Crowder & Morton, 1969). Critical to other serial position effects discussed below, however, the existence of long-term recency effects suggests that recency may reflect something akin to temporal distinctiveness (Glenberg & Swanson, 1986) rather than a short-term store. Primacy effects may also be described as an “edge effect” (Botvinick & Plaut, 2006, p. 213) in that this position is particularly distinct given that no items precede it. 
The distinctiveness account of serial position offered for list position effects might serve as a point of convergence with the sublexical, syllable position constraint discussed above. Onset and offset syllable positions are edges of the syllable; thus, exchanges within a syllable may be unlikely, whereas exchanges in these positions across syllables may be more likely. Some of the computational accounts of positional constraints discussed below include this relation.
There are two general approaches to modeling the serial ordering phenomena described above. One approach explicitly dissociates the representation of serial order from the representation of content, whereas the other approach does not. The rationale underlying the former approach is based on classic dissociations, such as those between item and order memory and those between structure and content of linguistic representation (Garrett, 1975). Frame-based approaches (e.g., Dell, 1986; Shattuck-Hufnagel, 1979), models using external context signals (e.g., Brown, Preece, & Hulme, 2000; Burgess & Hitch, 1992; Vousden, Brown, & Harley, 2000), and hybrid architectures (Gupta & MacWhinney, 1997; Hartley & Houghton, 1996) use this dissociation, whereas recurrent (e.g., Botvinick & Plaut, 2006; Dell et al., 1993) and ordinal (e.g., Page & Norris, 1998) models do not. These architectural choices have profound effects on the treatment of serial ordering.
List position effects (primacy/recency and positional transpositions) have been addressed in several different ways. Accounts that explicitly code the start and end of a sequence (e.g., Henson, 1998) make primacy and recency inherent to the context signal itself. Burgess and Hitch (1992) take a classic perspective from the WM literature in positing that primacy emerges from increased rehearsal of the first list positions, whereas recency reflects diminished decay of final list positions. Finally, in the oscillator-based associative recall (OSCAR) model (Brown et al., 2000; Vousden et al., 2000), primacy effects emerge simultaneously from output interference and the assumption that item and context associations weaken over time, whereas recency reflects an increased contextual distinctiveness of list final positions (i.e., temporal distinctiveness; Glenberg & Swanson, 1986).
Distinctiveness or edge effect accounts of serial position effects represent a point of convergence between some models that explicitly dissociate structure and content and those that do not. Although primacy is an inherent part of the activation signal in the ordinal model discussed here (e.g., Page & Norris, 1998), both primacy and recency effects emerge from the interaction of the primacy signal with a noisy activation function for each of the items at the time of recall. This activation noise leads to item transpositions at recall. The first and last positions have fewer positions over which they can transpose; thus, their recall (on average) is superior. Finally, primacy and recency in recurrent architectures such as Botvinick and Plaut’s (2006) model emerge from the representation of items in the recurrent, hidden layer. These representations are more distinct for the first and last list positions relative to the middle ones, given that they are not preceded and followed by other items.
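The way a primacy gradient plus activation noise yields both primacy and recency can be checked with a short simulation. The following is a rough sketch in the spirit of an ordinal account, not a reimplementation of Page and Norris (1998): the linear gradient, the Gaussian noise, the parameter values, and the rule of removing each recalled item from further competition are all simplifying assumptions.

```python
import random

def recall_once(list_length, step, noise_sd, rng):
    """Recall one list: repeatedly pick the most active remaining item."""
    # Primacy gradient: earlier items start with higher activation.
    base = [(list_length - i) * step for i in range(list_length)]
    remaining = list(range(list_length))
    output = []
    for _ in range(list_length):
        # Fresh noise is added to every remaining item at each recall attempt.
        noisy = {i: base[i] + rng.gauss(0.0, noise_sd) for i in remaining}
        winner = max(noisy, key=noisy.get)
        output.append(winner)
        remaining.remove(winner)  # a recalled item no longer competes
    return output

def serial_position_accuracy(list_length=6, step=1.0, noise_sd=1.0,
                             trials=5000, seed=0):
    """Proportion of trials on which each position holds the correct item."""
    rng = random.Random(seed)
    correct = [0] * list_length
    for _ in range(trials):
        for pos, item in enumerate(recall_once(list_length, step, noise_sd, rng)):
            if item == pos:
                correct[pos] += 1
    return [c / trials for c in correct]

accuracy = serial_position_accuracy()
print([round(a, 2) for a in accuracy])
```

No position-specific mechanism is built into this sketch. The advantage for the first and last positions arises only because items at the edges have neighbors on one side rather than two, so there are fewer ways for them to be displaced.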
Just as primacy and recency can be explained as edge effects, item transposition gradients can be explained in a similar manner. Items close to each other in a list or utterance are likely to have more similar contextual representations than would those that are farther apart, whether the contextual representation is external or intrinsic to the items. As a result, items close to each other are more likely to transpose than are those that are farther apart. It appears that one of the overarching principles of serial ordering is this notion of contextual distinctiveness, of which temporal distinctiveness is just one example, a principle that is successfully captured in Nairne’s (1990) feature model.
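The link between contextual similarity and transposition gradients can be sketched numerically. The example below loosely follows the idea of an external context signal built from slowly changing oscillators. The oscillator periods, the cosine similarity measure, and the exponential choice rule with its `sharpness` parameter are assumptions for illustration and are not taken from any of the models cited above.

```python
import math

def context_vector(position, periods=(16.0, 32.0, 64.0)):
    """Context at a list position: sine and cosine of several slow oscillators."""
    vec = []
    for period in periods:
        phase = 2 * math.pi * position / period
        vec.extend((math.sin(phase), math.cos(phase)))
    return vec

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def confusion_probabilities(target_pos, list_length=6, sharpness=10.0):
    """Probability of retrieving each position's item when cueing target_pos.

    Retrieval probability grows with the similarity between the cue
    context and each position's stored context (a hypothetical choice rule).
    """
    cue = context_vector(target_pos)
    weights = [
        math.exp(sharpness * cosine(cue, context_vector(p)))
        for p in range(list_length)
    ]
    total = sum(weights)
    return [w / total for w in weights]

probs = confusion_probabilities(target_pos=2)
print([round(p, 3) for p in probs])
```

Because the context changes gradually, the probability of retrieving the wrong item falls off with distance from the cued position, which is the transposition gradient described in the text.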
In addition to explaining the listwise effects above, contextual distinctiveness can also explain sublexical serial ordering effects, such as the syllable position effect. Frame-based approaches (e.g., Dell, 1986; Hartley & Houghton, 1996) explicitly code for syllable position within phonologically specified frames, and in so doing, they necessarily impose hard constraints on the contextual distinctiveness of different syllable positions. As these frames interact with the content filling them in time, phoneme-based transpositions occur only within given positions of the syllable. Alternatively, the contextual distinctiveness for syllable position provided by the OSCAR model (Vousden et al., 2000) emerges from an oscillatory context signal that is temporarily associated with phonemes (i.e., the content). Similar syllable positions (i.e., onset, vowel, offset for a consonant-vowel-consonant syllable) have similar context signals; thus, they are more likely to transpose with each other than with other syllable positions. In both of these instances, however, syllable position is inherent to the mechanism used for serial ordering, leaving open the question of how syllable position is acquired in the first place.
An intriguing possibility and an alternative to these accounts is that syllable position may reflect distributional properties of the content being ordered. For instance, Dell et al. (1993) trained a simple recurrent network on patterns of phonotactically legal monosyllabic words. After training, the model exhibited normal patterns of human speech errors in that it abided by the syllable position constraint. The syllable structure emerged as a function of training with a specific set of words. There were many more words that contained a similar vowel-consonant (the “rhyme”) structure than those with similar consonant-vowel structure; hence, the vowel-consonant structure tended to cohere. Additionally, vowels rarely substituted with consonants, in that their output representations were very different. Finally, onset and coda consonants were unlikely to exchange, in that their representations in the hidden layer were different as a result of differing context signals. Similar adherence to the phonotactics of the language was observed in Gupta and Tisdale’s (2008) recurrent model of nonword repetition. Although these authors used syllabic frames to code syllable position constraints, the use of a vocabulary that reflected the frequency distribution of English led to the model’s learning the phonotactics of the language as a whole, so that when it was tested, the model exhibited better nonword repetition for stimuli with high relative to low phonotactic frequencies.
Interestingly, the same types of distributional properties that govern the combination of individual sounds can also influence listwise memory performance in serial recall. Botvinick and Bylsma (2005) had subjects recall lists of pseudowords formed according to an artificial grammar. After participants had extensive experience with these lists, their errors reflected the distributional properties of the grammar, a result that was subsequently simulated in Botvinick and Plaut’s (2006) simple recurrent network of serial recall. These studies and simulations suggest that contextual distinctiveness, whether at the level of individual items or whole lists, may result from learning the patterns that occur in natural language, rather than from a hard-coded property inherent to the mechanism for serial ordering.
In sum, both list recall and natural production exhibit ordering errors that are severely constrained in their distribution. These similarities are suggestive, but they take on more importance in understanding phonological similarity effects in WM.
The phonological similarity effect is a hallmark finding in WM: Serial recall of lists composed of phonologically similar items is worse than serial recall of lists composed of dissimilar items. Phonological similarity among items is often defined in terms of items that share a common rhyme (e.g., B, D, C, G, which share the /iy/ rhyme) or those that share a set of phonological features (e.g., sonorance, place of articulation). This phenomenon was initially observed using letters as stimuli (Conrad, 1964, 1965; Conrad & Hull, 1964; Wickelgren, 1965a, 1965b) and was later extended to words (Baddeley, 1966) and nonwords (Crowder, 1976). Critically, phonological similarity selectively impairs memory for the order in which items appeared and not memory for the items themselves (Fallon et al., 1999). When individuals try to remember lists of phonologically similar items, they are likely to exchange items with other items in the list; however, their memory for which items appeared is unaffected. On those occasions in which individuals make errors that include items not on the current list (an extralist intrusion), the errors tend to be phonologically related to the item that they did not recall correctly (Wickelgren, 1965a). Finally, in lists composed of mixed overlapping and nonoverlapping items, decrements in recall are observed for only the overlapping items, producing a characteristic sawtooth pattern of recall performance (Baddeley, 1966).
Phonological similarity effects also exist in language production, in which the likelihood of committing a speech error greatly increases if the utterance contains phonological similarity. The most common type of speech error under conditions of phonological similarity is an onset exchange (Shattuck-Hufnagel, 1979), in which the consonant(s) before the vowel in two syllables exchange their serial positions, as in saying the tongue twister She sells seashells … as She shells sea sells. Errors of this sort appear to reflect the operation of several factors that influence speech production, including the phonological similarity effect (here the repeated vowels /i/ and /ɛ/ and the similarity of the onsets /s/ and /ʃ/) and the syllable position constraint described above.
These production patterns offer a different perspective on recall of lists in WM tasks. WM researchers have traditionally interpreted errors as occurring over entire items (e.g., misordering the memoranda C and P in a list such as G, C, B, P, D; Baddeley, 1986; Conrad, 1964), whereas researchers from the language production tradition have concluded that most errors occur at a unit smaller than an item. This production research raises a question about the units over which errors occur under conditions of phonological similarity in verbal WM tasks. The fact that WM researchers have attributed an item-level source to these errors may be an artifact of early research that used letters as stimuli. Phonological similarity typically occurs in the rhyme unit (vowel plus any following consonants), and the items are distinguished by syllable onset consonant(s), as in C, B, P, etc. Research in production has shown that over 80% of speakers’ contextual errors (e.g., exchanges) involve the onset consonants (Shattuck-Hufnagel, 1987). Thus, errors in serial recall may not be due to item missequencing but instead to a misordering of phonemes. If items C and P are exchanged in recall, the error could result from misordering the onsets /s/ (from the letter C) and /p/ (from P) in the utterance, so that the items /siy/ and /piy/ are incorrectly produced as /piy/ and /siy/. Evidence supporting this interpretation comes from Ellis (1980), who systematically investigated the relation between constraints on the errors in verbal WM and language production. He analyzed the types and positions of errors under conditions of phonological similarity and concluded that the relative frequency and positions over which errors occur are equivalent in language production and verbal WM tasks. 
More recent research confirms these early findings in showing that the nature and distribution of errors in the production (i.e., reading aloud) and recall of tongue-twister stimuli are the same (Acheson & MacDonald, 2008). Ellis (1980) formalized his results in his error equivalence hypothesis stating that both production and verbal WM access a “response output buffer,” which in less WM-laden terms might be viewed as the phonological encoding process itself. Similar conclusions were reached by Haberlandt, Thomas, et al. (2005), who showed that the transposition gradients (i.e., the likelihood that two sounds switch places) in verbal WM tasks mirror those in language production.
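The argument that an apparent item exchange may be an onset exchange can be made concrete with a minimal sketch. The representation below (each letter name as an onset plus a rhyme, in a simplified transcription) and the function names are illustrative assumptions, not part of any cited model. For a rhyming list, exchanging two onsets yields exactly the same output as exchanging two whole items, so the two error sources cannot be told apart from the response alone.

```python
# Hypothetical illustration: each item is an (onset, rhyme) pair.
# Transcriptions are simplified and chosen only for this example.
LETTERS = {
    "G": ("jh", "iy"),
    "C": ("s", "iy"),
    "B": ("b", "iy"),
    "P": ("p", "iy"),
    "D": ("d", "iy"),
}


def exchange_onsets(syllables, i, j):
    """Swap only the onsets of the syllables at positions i and j."""
    out = list(syllables)
    (onset_i, rhyme_i), (onset_j, rhyme_j) = out[i], out[j]
    out[i], out[j] = (onset_j, rhyme_i), (onset_i, rhyme_j)
    return out


def exchange_items(syllables, i, j):
    """Swap the whole items at positions i and j."""
    out = list(syllables)
    out[i], out[j] = out[j], out[i]
    return out


# Rhyming list G, C, B, P, D: exchanging the onsets of C and P
# is indistinguishable from exchanging the items C and P.
rhyming = [LETTERS[x] for x in ["G", "C", "B", "P", "D"]]
print(exchange_onsets(rhyming, 1, 3) == exchange_items(rhyming, 1, 3))

# Non-rhyming items: the two error types now produce different outputs.
non_rhyming = [("k", "aet"), ("d", "aog")]
print(exchange_onsets(non_rhyming, 0, 1) == exchange_items(non_rhyming, 0, 1))
```

The contrast between the two lists is the point of the sketch: only when items differ in their rhymes does the output reveal whether the misordered unit was a phoneme or an item.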
Three general, computational accounts have addressed phonological similarity effects: as a late-stage process, as a result of interactions among levels of representation, and as a result of contextual similarity. The late-stage process account stems from the assumption that item and order information are represented separately. Phonological similarity effects emerge at a late stage when a secondary mechanism specifies a phonological form for each of the to-be-recalled items (e.g., Henson, 1999; Page & Norris, 1998). When items have similar phonological forms, the likelihood of one being erroneously selected for another increases.
The interaction account posits that phonological similarity effects emerge from the interaction of phonological and nonphonological representations. For instance, in Burgess and Hitch’s (1992, 1999) phonological loop model, phonological similarity effects are due to feedback from the output to item layers; similar representations at the output layer lead to the activation of similar representations at the item layer. If the activation of incorrect items becomes strong enough, these items ultimately win a competition for representation in a competitive queuing layer (Houghton, 1990), and they will be output in the incorrect position. Dell’s (1986, 1988) interactive activation models of speech production offer a similar, albeit more linguistically motivated, account, as serial ordering errors in different syllable positions emerge from the interaction between lexical and phonological representations. Activation of a word activates a set of phonemes at each position, which in turn feed activation back to the lexical network, activating the original lexical node as well as lexical nodes that share currently activated phonemes in the word-form network. For instance, activation of the word HAT will spread to the phonemes /h/ /æ/ /t/ at each position; these will feed back to the lexical level and activate HAT and also CAT, RAT, HAS, etc. Errors occur when this interaction leads to greater activation for an incorrect phoneme than for a correct one for a given position. Phonological similarity constraints emerge because target words with similar sounding lexical entries feed back to multiple lexical entries whose phonemes compete with those of the target word at each position.
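The feedback dynamic described for interactive activation can be sketched in a few lines. This is a toy illustration under stated assumptions: the five-word lexicon, the weights `down` and `up`, the absence of decay, and the function name `spread` are all hypothetical and do not reproduce the parameters or architecture of Dell's (1986, 1988) models.

```python
# Toy lexicon (hypothetical): each word maps to position-specific phonemes.
LEXICON = {
    "HAT": ("h", "ae", "t"),
    "CAT": ("k", "ae", "t"),
    "RAT": ("r", "ae", "t"),
    "HAS": ("h", "ae", "z"),
    "DOG": ("d", "ao", "g"),
}


def spread(target, steps=2, down=0.5, up=0.5):
    """Alternate top-down spreading and bottom-up feedback for `steps` cycles.

    `down` and `up` are illustrative connection weights (assumptions).
    Returns the final word activations and the phoneme activations
    computed on the last top-down pass.
    """
    words = {w: 0.0 for w in LEXICON}
    words[target] = 1.0
    phonemes = {}
    for _ in range(steps):
        # Top-down: each word activates its phoneme at each position.
        phonemes = {}
        for word, act in words.items():
            for pos, ph in enumerate(LEXICON[word]):
                key = (pos, ph)
                phonemes[key] = phonemes.get(key, 0.0) + down * act
        # Bottom-up: phonemes feed activation back to every word containing them.
        for word in LEXICON:
            feedback = sum(
                up * phonemes.get((pos, ph), 0.0)
                for pos, ph in enumerate(LEXICON[word])
            )
            words[word] += feedback
    return words, phonemes


words, phonemes = spread("HAT")
# Neighbors such as CAT gain activation through shared phonemes, so their
# onsets become competitors of the target onset at the same position.
print(words["CAT"] > words["DOG"])
print(phonemes[(0, "h")] > phonemes[(0, "k")] > 0.0)
```

In this sketch the target phonemes remain the most active, but competitors such as /k/ in onset position receive nonzero activation solely because CAT shares phonemes with HAT; adding noise to such a system is what would allow an incorrect phoneme to win occasionally.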
The final account of phonological similarity effects involves contextual similarity (e.g., Nairne, 1990). Positional context and phonological context influence the likelihood of an ordering error due to phonological similarity. For instance, Botvinick and Plaut (2006) modeled phonological similarity effects by adding common, distributed representations at input and output. Errors due to this similarity, which emerged at the hidden layer, came from two sources: input and context representations. At the level of the hidden layer, items that shared similar input structures and similar positions in the list (i.e., adjacent items) shared similar contextual representations in the hidden and recurrent layers. Given that the item-position representations in this model were relative, similarity in these codes led to an increased probability of items changing places with each other. Similarly, phonological similarity in the OSCAR model (Brown et al., 2000; Vousden et al., 2000) emerges from similarity within phoneme feature vectors and in the time-varying context signals. Confusions occur because the feature matrices of similar phonemes are more similar than are those of dissimilar phonemes and because the time-varying context signal for the same position is more similar than for different positions. When there is similarity in both features and context, the model may accidentally make a speech ordering error, such as a substitution of onset phonemes. Computational accounts of phonological similarity effects represent another point of convergence between WM and language production in that a combination of phonological overlap and representation of serial position influences the contextual distinctiveness of the material that is produced/remembered.
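The joint contribution of feature similarity and context similarity can be illustrated with a small sketch. It is only loosely inspired by the description of OSCAR above: the binary feature vectors, the two-oscillator context signal, and the choice to multiply the two similarities are all assumptions made for illustration, not the published model.

```python
import math

# Hypothetical binary features:
# (voiced, fricative, stop, alveolar, postalveolar, labial)
FEATURES = {
    "s": (0, 1, 0, 1, 0, 0),
    "sh": (0, 1, 0, 0, 1, 0),
    "b": (1, 0, 1, 0, 0, 1),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def context(syllable, slot, slots_per_syllable=3):
    """Assumed context signal: a slow oscillator that drifts across syllables
    plus a fast oscillator that repeats for each within-syllable slot."""
    slow = 0.5 * syllable
    fast = 2 * math.pi * slot / slots_per_syllable
    return (math.cos(slow), math.sin(slow), math.cos(fast), math.sin(fast))


def confusability(ph_a, pos_a, ph_b, pos_b):
    """Illustrative index: feature similarity times context similarity."""
    feature_sim = cosine(FEATURES[ph_a], FEATURES[ph_b])
    context_sim = cosine(context(*pos_a), context(*pos_b))
    return feature_sim * context_sim


# Positions are (syllable index, slot index); slot 0 is the onset, slot 2 the coda.
same_slot = confusability("s", (0, 0), "sh", (1, 0))   # similar phonemes, both onsets
other_slot = confusability("s", (0, 0), "sh", (1, 2))  # similar phonemes, onset vs coda
dissimilar = confusability("s", (0, 0), "b", (1, 0))   # dissimilar phonemes, both onsets
print(same_slot > other_slot > dissimilar)
```

Under these assumptions, the highest confusability arises when two phonemes share features and occupy the same slot in different syllables, which is the configuration in which onset errors are most often reported.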
The lexical status of an item influences both verbal WM and language production processes. For instance, in verbal WM tasks, words are easier to recall than are nonwords (Hulme et al., 1991). Errors in language production parallel this result, in that the majority of errors are real words, with the likelihood of producing an error increasing if the potential error forms a word (Baars et al., 1975). This constraint often appears in what is a particularly humorous speech error, the spoonerism, named after William Archibald Spooner, who often unintentionally exchanged speech sounds (e.g., You’ve hissed my mystery lectures, instead of the intended utterance You’ve missed my history lectures; MacKay, 1970). Beyond the fact that words have semantic content whereas nonwords do not, a critical difference between them is that words contain a combination of sounds with which speakers have experience, whereas nonwords do not. The effects of lexicality on both production and WM seem to stem from the fact that words reflect previous learning (although see Hartsuiker, Corley, & Martensen, 2005), thus providing support for the hypothesis that WM performance is governed by activated long-term representations.
Strong evidence for the role of long-term learning in verbal WM and language production concerns lexical frequency effects. In verbal WM tasks, high-frequency words are easier to recall than are low-frequency words (Roodenrys et al., 2002). In language production, high-frequency words are less prone to error than are low-frequency words in both normal (Dell, 1990) and aphasic (Schwartz, Wilshire, Gagnon, & Polansky, 2004) speech errors, and they are also produced more quickly (Levelt et al., 1991). These results are striking evidence for a long-term component to performance, as frequency effects necessarily reflect repeated exposure to events in the world.
One finding that supports a role for lexical–semantic representation in verbal WM performance is that concrete words are easier to recall than are abstract words (Walker & Hulme, 1999). Beyond this effect, lexical–semantic factors can influence the proportion of phonological errors in mixed lists of words and nonwords (Jefferies et al., 2006a, 2006b). Lists that contain a high ratio of words to nonwords are less likely to suffer phoneme movement errors than are those with high ratios of nonwords to words, suggesting that semantic representations bind phonological elements (see Patterson, Graham, & Hodges, 1994, for more details on the semantic binding hypothesis). Similar factors influence the likelihood of speech and memory errors in aphasic patients. For instance, imageable words are less prone to phonological errors than are nonimageable words among patients with deep dysphasia (Martin, Saffran, & Dell, 1996). Additionally, patients suffering from progressive fluent anomic aphasia produce phonological ordering errors in both serial recall and single word repetition for words that they cannot produce in a picture-naming task; such errors are substantially lessened for items that they can name from pictures (Knott, Patterson, & Hodges, 2000). These findings point to a central role for the interaction between lexical–semantic and phonological levels of representation in determining the serial order of phonological elements that are remembered/produced.
In addition to properties associated with whole words, a number of sublexical factors influence the serial ordering of verbal information in language production and verbal WM. One of these—that speakers’ errors are shaped by syllable position constraints, reflecting their knowledge about syllable structure in their language—has already been discussed. The other sublexical effects also reflect knowledge of the phonological patterns of the language.
Phonotactic frequency refers to the frequency with which sounds are combined in a language (Vitevitch et al., 1997). Words with high phonotactic frequency, such as bell, are composed of very common sounds and sound combinations, whereas other words, such as watch, have low phonotactic frequency. In verbal recall, nonwords with high phonotactic frequency are easier to recall than are those with low phonotactic frequency (Gathercole, Frankish, Pickering, & Peaker, 1999). A related phenomenon is the bigram frequency effect (Baddeley, 1964), whereby strings of letters are well recalled if adjacent items within the string have a high likelihood of occurring in English. In language production, nonwords with high-frequency phonotactics are produced more quickly than are those with low-frequency phonotactics (Vitevitch et al., 1997). Furthermore, when individuals make speech errors, they overwhelmingly abide by the phonotactics of the language (Boomer & Laver, 1968; Fromkin, 1971; cf. Stemberger, 1983). These results suggest that individuals use implicit knowledge of how sounds combine in their language—knowledge acquired over time—to constrain the ultimate order with which sounds are produced.
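A rough sense of how phonotactic (or bigram) frequency can be quantified is given by the sketch below. It is a simplified illustration, not the measure used in the cited studies: the toy corpus, the use of letters as stand-ins for phonemes, and the averaging of relative bigram frequencies are all assumptions.

```python
from collections import Counter


def bigram_counts(corpus):
    """Count adjacent segment pairs across all items in a corpus."""
    counts = Counter()
    for item in corpus:
        for pair in zip(item, item[1:]):
            counts[pair] += 1
    return counts


def phonotactic_score(item, counts):
    """Mean relative frequency of the item's adjacent segment pairs.

    Pairs never seen in the corpus contribute zero (no smoothing).
    """
    total = sum(counts.values())
    pairs = list(zip(item, item[1:]))
    if not pairs or total == 0:
        return 0.0
    return sum(counts[pair] / total for pair in pairs) / len(pairs)


# Toy corpus (hypothetical); letters stand in for phonemes.
corpus = ["bell", "belt", "bend", "tell", "sell", "well", "melt"]
counts = bigram_counts(corpus)

# A nonword built from common transitions scores higher than one
# built from rare or unattested transitions.
print(phonotactic_score("nell", counts) > phonotactic_score("tsew", counts))
```

A score of this kind depends entirely on accumulated exposure to the corpus, which is the sense in which phonotactic effects are taken here to reflect long-term learning.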
Related to the effects of phonotactic frequency are effects of phonological neighborhood density, a measure of the number of words that sound like any given word in a language. Two words are phonological neighbors if they differ by only one phoneme (Luce, Pisoni, & Goldinger, 1990). In the case of verbal WM, words that come from dense phonological neighborhoods are recalled better than are those from sparse neighborhoods (Roodenrys et al., 2002). In language production, pictures are named more quickly if the name is from a high-density neighborhood relative to a low-density one (Vitevitch, 2002). Corpus analyses have revealed that speech errors are more likely to be from low- rather than high-density neighborhoods (Vitevitch, 1997). These results have been confirmed in tasks that experimentally elicit speech errors (e.g., tongue twisters); low-density words are more prone to such errors than are high-density words (Vitevitch, 2002). Similar to the effects of frequency described above, these results suggest that exposure to the sounds within a language and to the words to which they belong influences the serial ordering processes in verbal WM and language production.
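Neighborhood density, as defined above, can be computed directly from a phonemically transcribed lexicon. The sketch below assumes that "differ by only one phoneme" covers a single substitution, addition, or deletion; the toy lexicon and its simplified transcriptions are hypothetical.

```python
def is_neighbor(a, b):
    """True if phoneme tuples a and b differ by exactly one substitution,
    addition, or deletion (an assumed operationalization)."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        # Same length: exactly one position may differ.
        return sum(x != y for x, y in zip(a, b)) == 1
    short, long_ = (a, b) if len(a) < len(b) else (b, a)
    # Different length: removing one phoneme from the longer form
    # must yield the shorter form.
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def density(word, lexicon):
    """Number of phonological neighbors of `word` in `lexicon`."""
    target = lexicon[word]
    return sum(
        is_neighbor(target, phonemes)
        for other, phonemes in lexicon.items()
        if other != word
    )


# Toy lexicon with simplified transcriptions (hypothetical).
LEXICON = {
    "HAT": ("h", "ae", "t"),
    "CAT": ("k", "ae", "t"),
    "RAT": ("r", "ae", "t"),
    "HAS": ("h", "ae", "z"),
    "AT": ("ae", "t"),
    "HATS": ("h", "ae", "t", "s"),
    "DOG": ("d", "ao", "g"),
}

print(density("HAT", LEXICON))  # 5: CAT, RAT, HAS, AT, HATS
print(density("DOG", LEXICON))  # 0
```

In this toy lexicon HAT sits in a dense neighborhood and DOG in a sparse one, the contrast that the recall and naming studies cited above manipulate.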
Long-term effects of linguistic knowledge (e.g., lexicality, frequency, phonological neighborhood density) are arguably the most difficult set of findings for WM models to accommodate. For models that posit specialized short-term representations, one solution is to incorporate principles of trace redintegration. For instance, Page and Norris (1998) modeled effects of word frequency by varying the threshold for omitting an item at recall. In this approach, effects of long-term representation occur at output; this is consistent with some of the error-monitoring approaches offered in the language production domain to account for lexicality bias in speech errors (e.g., Levelt, Roelofs, & Meyer, 1999; Shattuck-Hufnagel, 1979). Although we have argued that the lexicality bias reflects long-term learning in the language production architecture, there is some evidence to suggest that the bias may reflect an error-monitoring strategy at output. Hartsuiker et al. (2005) demonstrated that the lexicality bias in speech errors depends on the context in which errors occur. When individuals produce pure lists of nonwords, errors are nonwords. However, when stimuli are mixed lists of words and nonwords, errors are likely to be words, replicating the classic lexicality bias in the speech error literature (Baars et al., 1975).
Although the output comparison/monitoring accounts described above have had some success in accounting for phonological similarity effects, we believe that a more parsimonious approach is one that does not explicitly dissociate long- and short-term representations and does not require a comparison process at the time of output. To date, the only class of models that can incorporate long-term learning in the course of maintaining and producing sequential output are recurrent architectures (e.g., Botvinick & Plaut, 2006; Dell et al., 1993; Gupta & Tisdale, 2008; Plaut & Kello, 1999; although see Burgess & Hitch, 2006).
Recurrent architectures, like all parallel distributed processing models, encode long-term information in the weights connecting the different layers of the architecture. Frequency effects in these models emerge directly as a result of the material to which the model is exposed; frequent items, or frequent patterns (i.e., higher phonotactic frequencies), are robustly represented in the connection weights. Lexicality bias in these models reflects the fact that the models tend to output learned information. Thus, if a model is trained on words, it will tend to produce words. A similar account might be offered for the effects of phonological neighborhood density on production, in which items from a dense phonological neighborhood have similar phonological representations in the model; the common phonological elements are reinforced each time an item from the phonological neighborhood is encountered. As a result, training with words from dense phonological neighborhoods relative to sparse ones results in robust learning of the phonological elements that define the neighborhood (e.g., a rhyme unit), and the words will be produced efficiently. This account is paralleled in models of reading in which regularity affects reading efficiency by means of the mapping from orthography to phonology (e.g., Harm & Seidenberg, 1999).
To this point, we have reviewed a number of similar patterns in spoken production and list recall, including constraints on the positions over which errors occur, phonological similarity effects, and the influence of linguistic knowledge from long-term memory. Within WM research, these results have been hypothesized to stem from memory-internal processes such as decay over time; interference within a specialized, short-term store; or processes involving comparison to traces in long-term memory. Within language production, these same results have been hypothesized to stem from production-internal mechanisms such as recurrent interactions across different levels of linguistic representation (e.g., words, syllables and phonemes). In many production approaches, these interactions are subject to long-term learning and result in an increased propensity to commit speech errors under conditions of phonological similarity. In some cases, it may be possible to merge these common behavioral results into a unified framework in which key WM effects emerge from maintenance and ordering mechanisms within language production.
A network architecture related to Plaut and Kello’s (1999) model of single word production (see Figure 2) provides a useful starting point from which to pursue a unified architecture. It naturally takes into account long-term learning in the weights between different levels of representation. Furthermore, it incorporates the levels of linguistic representation (i.e., semantic, phonological, lexical, articulatory, etc.) that we have emphasized as important in observing patterns of serial ordering performance across WM and production tasks. Finally, the recurrent connectivity between these different layers has been used successfully in the past to model serial ordering in both language production (Dell et al., 1993) and verbal WM (Botvinick & Plaut, 2006; Gupta & Tisdale, 2008). We prefer this modeling approach to one that explicitly dissociates long- and short-term representation on the grounds of parsimony; if the same behaviors can be captured by a unitary framework, then a dual representation of the same information is unnecessary. This being said, Plaut and Kello’s (1999) model was not designed to accommodate multiword utterances; thus, this is an area that must be developed before the model can address serial recall in lists. Nonetheless, we believe that the general architecture serves as a useful framework for the following sections, in which we discuss outstanding questions and predictions that emerge from reconceptualizing verbal WM maintenance as we have done in this article.
A justifiable criticism of our claims is that qualitatively similar behavioral performance across verbal WM and language production tasks does not mean that both are subserved by the same system. In fact, the existence of short-term memory patients, who show selective memory impairment despite seemingly normal language performance, suggests that the systems are separable (R. C. Martin, Lesch, & Bartha, 1999; Vallar & Papagno, 1995; Warrington & Shallice, 1969).
Our response to this criticism is twofold. First, with regard to neuropsychological dissociations, we believe claims about the existence of pure WM deficits deserve additional scrutiny. Few patients have been identified with this type of disorder (see Allport, 1984), and in a number of instances, careful testing of their linguistic abilities suggests that their language performance is not entirely normal (e.g., Caplan & Waters, 1990; N. Martin & Saffran, 1997; N. Martin et al., 1996). Second, deficits in production processes should lead to deficits in verbal WM, and there is a substantial amount of evidence to support this claim (Knott et al., 2000; N. Martin & Saffran, 1997, 1999). Interestingly, a computational model of single word production was able to model patterns of aphasic speech errors in word repetition previously attributed to impairments in verbal WM by varying the amount of time that passed between input and output and the rate with which information decayed (N. Martin et al., 1996). This pattern of performance led these researchers to conclude that “auditory WM performance depends on storage capacities intrinsic to the language processing system” (N. Martin et al., 1996, p. 83); similar conclusions were reached by Allport (1984).
The argument that impairments in production lead to impairments in WM is again one of association. Thus, a more direct test of the functional dependence of the two systems would involve disrupting the production system while people perform verbal WM tasks. This is one of the future research directions that we suggest below, and therefore we leave further discussion to that section.
The types of serial ordering errors observed under conditions of phonological similarity are fundamentally the same in language production and verbal WM (Acheson & MacDonald, 2008; Page, Cumming, Madge, & Norris, 2007). Furthermore, we have argued that many of the serial ordering errors involving phonological similarity seem to be attributable to errors in the phonological encoding process of speech production. For example, exchanges at the item level, such as switching C and B in a recall list, may be speech errors at the subitem level. Other types of errors (e.g., omissions and additions), however, seem better ascribed to a higher level of production planning, namely at the levels of lexical selection and utterance/sentence planning during grammatical encoding. In line with this perspective, Page et al. (2007) have recently suggested that the “phonological loop” may in fact be a lexical-level production plan.
Although many errors point to a phonological encoding origin, others are not so easily accommodated at this level. Omissions and additions of entire words/items are likely to occur at a stage in production planning that precedes phonological encoding. Thus, other levels of production planning, including the lexical–semantic and articulatory levels, will need to be incorporated to fully account for serial ordering performance. For example, Dell (1986) suggested that omission errors might be failures to maintain sufficient lexical activation prior to phonological encoding. Further examinations of item-ordering errors in natural speech production may be informative in better understanding serial order errors in list recall. Word-level errors in sentence production (e.g., Wrote a mother to my letter rather than wrote a letter to my mother) also obey constraints that likely reflect the underlying architecture. Word exchange errors like the one in this example tend to be between items of the same grammatical category, in this case the two nouns letter and mother (Garrett, 1975). This result is typically described in accounts of grammatical encoding as stemming from the exchanged words’ identical grammatical category and errors in a process that inserts selected words into a sentence frame (e.g., the nouns mother and letter are inserted into noun slots in a syntactic structure; Garrett, 1975). However, these constraints could also stem from the same contextual distinctiveness factors that have been shown to modulate serial order in recall (e.g., Glenberg & Swanson, 1986) and subitem speech errors (e.g., Dell et al., 1993; Vousden et al., 2000). 
That is, two words in the same grammatical category (such as nouns) will tend to have some semantic similarity and tend to occur in highly similar contextual environments, such as following a determiner in a noun phrase. Moreover, Dell and Reich (1981) showed that word exchange errors increase with the phonological similarity of the items (e.g., letter and mother have a fair amount of phonological overlap), providing another example of how contextual distinctiveness could modulate serial order processes. These results suggest that there may be important parallels between serial recall and production at the sentence level, in addition to the phonological encoding level. One of the levels at which models of verbal WM might inform those in language production is by considering the mechanisms by which multiword utterances are planned in serial recall (e.g., via primacy or contextual signal).
When people produce speech errors, they are often (though not always) detected and corrected by the speaker, presumably using some form of self-monitoring (Levelt et al., 1999; Postma, 2000; Shattuck-Hufnagel, 1979). If ordering errors in verbal WM represent production errors, why are people seemingly incapable of self-monitoring?
There are several responses to this question. The first is that, to our knowledge, there has been no systematic investigation of the extent to which people correct themselves during serial recall tasks. Anecdotally, we have observed self-correction in a number of studies conducted in our lab, but we have not examined it systematically. Thus, one direction for future research would be to investigate self-correction in serial recall, perhaps explicitly instructing participants to correct themselves if they believe that they have produced an erroneous utterance. An alternative would be to have participants give confidence ratings of their serial recall performance after each trial. Such ratings have been used successfully in the memory recognition literature to dissociate recollection from familiarity (Rugg & Yonelinas, 2003). In the case of serial recall performance, confidence ratings might indicate the extent to which participants are aware of serial ordering errors and thus indicate their potential to self-monitor. Second, there is a relatively simple answer as to why people would be more likely to self-monitor in language production than in list recall, namely, the former usually contains an intended message whereas the latter does not. The presence of a message greatly enhances perception-based monitoring for output errors (Postma, 2000). Third, the nature of planning for the recall of a sequence of letters, digits, etc., is severely impoverished relative to normal speech production; thus, another source of information for self-correction is lacking. Finally, self-correction in WM tasks may create interference for items not yet recalled. Participants may be aware of an error but make a strategic decision to focus on recall of subsequent items rather than risk creating further errors by making a correction. 
Thus, whether people self-correct or not in the case of verbal WM (a potential future investigation) would not necessarily conflict with our present stance that serial ordering errors in verbal WM emerge from errors in the language production system.
Although we have suggested that many of the key behavioral findings in verbal WM performance have their source in language production, other important WM results also must be accommodated. In the following section, we discuss the core phenomena not yet addressed (word length, presentation modality, irrelevant sound, and concurrent articulation effects; see Table 1) in light of a language production account that is based largely on Plaut and Kello’s (1999) production architecture. In many cases, these accounts are speculative; thus, we suggest future research that might serve to test the hypotheses described here.
The most widely accepted account of word length effects is that they result from decaying memory representations. Long words relative to short ones take longer to say; thus, they are not “refreshed” in memory as quickly, eventually leading to decay beyond retrieval (Baddeley et al., 1975). While this decay-based account is not incompatible with a language production explanation of word length effects (production models often incorporate a decay parameter), other approaches are also possible. Some recent recurrent computational models have shown that increased length leads to decreased contextual discriminability among to-be-remembered elements in both whole lists (Botvinick & Plaut, 2006) and individual nonwords (Gupta & Tisdale, 2008). Representation of elements within the hidden layer responsible for maintenance becomes more similar as the number of elements increases (Botvinick & Plaut, 2006). Thus, word length effects may be at least partially attributed to distinctiveness and/or interference rather than to decay. This account has been used by some researchers to explain why interitem durations and pauses exhibit serial position effects in spoken recall (Haberlandt, Lawrence, et al., 2005).
If this distinctiveness account of word length is correct, then manipulating the distinctiveness of whole items or elements within items (e.g., syllables) should influence the ease with which individuals can remember and produce them. There are many possibilities for increasing distinctiveness. One is to manipulate the timing with which elements are presented. Research into irregularly timed list presentation reveals that temporally isolated elements are recalled better than are those that are grouped together (Farrell, 2008). Another potential manipulation is the phonological distinctiveness of the elements. This would involve manipulating the transitional probability (i.e., phonotactic frequency) between syllables within a nonword or between items within a list. Although previous research has shown poorer memory for low relative to high phonotactic frequency nonwords (Gathercole, Frankish, et al., 1999), it is possible that in a probed recognition task, memory for low-frequency transitions embedded within a high-frequency-transition nonword/list may be more distinct and therefore more recognizable. These and other studies might be used to address whether word length effects are at least partially attributable to decreased distinctiveness of speech elements represented within the language production architecture.
The superior memory for material presented auditorily relative to that presented visually has been used to argue that auditory information has immediate access to a “phonological store,” whereas visual information must be recoded via articulation (Baddeley, 1986). An alternative to this account is that the modality effect is actually a learning effect. Specifically, adults, even very literate ones, have substantially more experience in mapping from acoustics to meaning or acoustics to articulation than they do in mapping from orthography (i.e., the written form of a word) to meaning or acoustics. Although the Plaut and Kello (1999) model that we have used here to illustrate these principles lacks a representation of orthography, an analogy can be made from those models that do contain orthographic representations. In connectionist models of reading, for instance, mapping from orthography to meaning can occur via two routes (Harm & Seidenberg, 1999). In typical development, the orthography → phonology → meaning pathway is learned first and is strengthened even as the orthography → meaning pathway is acquired (Van Orden, Pennington, & Stone, 1990). Thus, effects of presentation modality may reflect differences in mapping from an auditory representation to meaning or articulation relative to a visual one rather than being some special property of phonological stores in verbal WM.
A test of this hypothesis could take the form of a study in which presentation modality and lexical frequency are parametrically manipulated while controlling for the phonotactics of the material. Individuals have more exposure to high- versus low-frequency words; thus, the mapping from acoustic forms to meaning should be stronger than the mapping from visual forms to meaning. The difference in representation for high- versus low-frequency words should be smaller than the equivalent mapping from orthography to semantics because the acoustic to meaning mapping is so overlearned. Thus, we would anticipate that the difference in memory for high- versus low-frequency words should be smaller for auditory relative to visual presentation when visual to phonological recoding is minimized (e.g., under concurrent articulation). A similar manipulation could be done for words with regular (e.g., hint, lint) and irregular (e.g., pint, plough) mappings from orthography to phonology while again controlling for the phonotactics of the speech sounds. Here we would predict virtually no difference in the memory for these items when presented auditorily but large differences when the same items are presented visually, with regular words being remembered better than would irregular words. In both of these potential experiments, effects of presentation modality should emerge from differences in learning the mapping between acoustics, orthography, semantics, and articulation rather than from privileged access to phonological representations.
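The first prediction above amounts to a modality × frequency interaction. A minimal sketch of how that contrast could be computed from cell means is given below; the function names and the recall proportions are hypothetical placeholders chosen only to show the arithmetic, not results from any study.

```python
def frequency_effect(cell_means, modality):
    """High- minus low-frequency recall for one presentation modality."""
    return cell_means[(modality, "high")] - cell_means[(modality, "low")]


def interaction_contrast(cell_means):
    """Visual frequency effect minus auditory frequency effect.

    The learning account sketched in the text predicts a positive value:
    a larger frequency effect for visual than for auditory presentation.
    """
    return (frequency_effect(cell_means, "visual")
            - frequency_effect(cell_means, "auditory"))


# Placeholder proportions correct (invented for illustration only).
placeholder_means = {
    ("auditory", "high"): 0.80,
    ("auditory", "low"): 0.75,
    ("visual", "high"): 0.70,
    ("visual", "low"): 0.50,
}

contrast = interaction_contrast(placeholder_means)
print(contrast > 0)  # True would be in the predicted direction
```

In an actual experiment the contrast would of course be tested inferentially (e.g., as the interaction term in a factorial analysis) rather than read off the cell means.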
A slightly more complicated variant of this account is also based on learning and consideration of serial recall as a dual task in which participants are simultaneously processing linguistic input (spoken or written items) and planning an utterance. Conversational turn taking frequently requires an individual to process speech input (someone else speaking) while engaging in production planning for uttering a response when it is his or her turn to talk. Thus, people have enormous practice encoding another’s speech while planning their own utterances, in a way that is not too dissimilar from hearing an acoustically presented memory list while developing the production plan for recall of the list. By contrast, a visually presented list for recall requires participants to read while planning an utterance. This task is one for which speakers have virtually no practice, with the exception of reading aloud, a relatively rare act that does not quite duplicate the task demands of serial recall. Obviously, people who are presented with written items in a memory task do turn the visual information into a phonological code, but this code may not be as rich as the acoustic signal from spoken input. Moreover, there are some suggestions that production processes may be implicated in developing the phonological code for written input (McCutchen, Dibble, & Blount, 1994; Shankweiler, Liberman, Mark, Fowler, & Fischer, 1979), potentially creating interference from simultaneously using phonological encoding processes for production planning and for reading. Pursuing these alternatives will require additional investigation of the demands involved in simultaneous encoding and production planning as well as the potential role of phonological encoding mechanisms in reading. 
One possible research direction would be to manipulate people’s experience with simultaneous reading and utterance planning (e.g., give people practice reading aloud and examine whether this experience affects serial recall performance with visually presented lists).
Concurrent articulation abolishes the phonological similarity effect for visually but not auditorily presented items (Baddeley et al., 1984). Similar to the account for effects of presentation modality, this effect has been attributed to the privileged access that acoustic information has to a phonological store. Visual information, in contrast, must be recoded (Baddeley, 1986). In other words, concurrent articulation blocks the mapping from orthography to phonology. This view is consistent with the possibility that phonological encoding (the production system) has some role in reading, but there are other possible interpretations of this result.
One possibility is that at least part of this effect may be attributable to adding uninformative noise to a system whose mappings are more or less robust. In the Plaut and Kello (1999) model, the representation of a word is a mapping among semantics, acoustics, and phonology. In order to accommodate visual effects, one would need to add an orthographic component. With this in mind, the same principles that were applied to effects of presentation modality apply here. The mapping from acoustics to semantics (and therefore words) is stronger than the mapping from orthography to semantics. The effect of concurrent articulation would be to add uninformative noise to the representation of a word rather than blocking the mapping from orthography to phonology. The mapping from acoustics to words is overlearned; thus, additional noise will have little effect on the phonological representations at the word level and therefore phonological similarity effects would remain. Orthographically presented material, especially when lacking semantic content (e.g., letters, nonwords), will have a less robust word-level representation and will be more susceptible to the uninformative noise provided by articulation. The noise should affect the orthographic and articulatory representation of words and the phonological one as well; hence, phonological similarity effects should be abolished.
This is a very speculative account of concurrent articulation that is in need of empirical testing. Manipulating the regularity of the mapping between orthography and phonology may be helpful, but it is probably impossible to cross this factor with phonological overlap, as lists of phonologically overlapping, irregular words are likely to be rare. Instead, researchers might manipulate the orthographic regularity of individual items within overlapping (e.g., tough, fluff, buff, rough, stuff) and nonoverlapping (e.g., right, cat, none, fall, tree) lists and observe memory for individual items. The mapping from orthography to word will be less robust for irregular relative to regular words; thus, memory for these items might be poor under conditions of concurrent articulation, an effect that might be exacerbated by phonological similarity. A further means of addressing this account of concurrent articulation is considered below in a discussion of parametric manipulations of lexical–semantic representations and concurrent articulation on the phonological similarity effect.
The effect of irrelevant sound is to decrease memory performance (Colle & Welsh, 1976). The effect was initially attributed to speech sounds specifically (Salame & Baddeley, 1982, 1986), but later research has demonstrated that the effect can be induced by having individuals listen to an irregularly changing acoustic stream (e.g., musical notes; Jones & Macken, 1993). Such a result seems readily accounted for by a language production system that maps acoustics, semantics, and articulation. One interpretation is that irrelevant sound adds uninformative noise to word-level representations, reducing their distinctiveness. Acoustic information that varies irregularly may be more noisy than is regularly varying sound, hence the finding that listening to a foreign language is more disruptive to serial recall than is listening to white noise (Colle & Welsh, 1976).
This account suggests that manipulating the distinctiveness of words could modulate the magnitude of the irrelevant sound effect. One simple manipulation would be to examine whether lexicality influences the effect. As words are represented in mappings to semantics, words may be more immune to irrelevant sound than is material that lacks semantic content (e.g., letters, nonwords). Furthermore, manipulations of lexical–semantic representations might also influence the magnitude of the irrelevant sound effect. In both instances, the relative strength of the word-level representation in a production-based model would directly affect the extent to which the noise added by irrelevant sound would impact memory performance.
One process that we have not discussed in the current review is the role of articulation in the serial ordering of verbal information. In production, it has been assumed that phonological encoding precedes articulation, but accounts differ about the degree of interaction in the system. If the production system is discrete and strictly feed-forward (e.g., Levelt et al., 1999), then articulation should not influence the serial ordering of phonological representations. Although processing systems that maintain the serial order of phonological information without invoking articulation can be devised (and many have been described here), this does not mean that articulation is not important. The ultimate purpose of language production is to produce a sequence of articulatory gestures that convey a message. Thus, researchers from both traditions have emphasized a need for articulation in the maintenance of acoustic/phonological information. For instance, subvocal articulation (Baddeley, 1986); perceptual–gestural interaction (Jones et al., 2006); and semantic, phonological, and articulatory interaction within the production system (Plaut & Kello, 1999) are all mechanisms that have been proposed for maintaining the serial order of phonological information.
We see two reasons why researchers should consider a role for articulation in serial recall. The first is that the motor system provides powerful constraints on serial ordering. Articulation, like all motor behavior, is inherently serial, in that an articulator can perform only one action at a time. The motor system seems particularly well suited to many of the issues in serial ordering discussed above. In support of this position, a recent study by Woodward, Macken, and Jones (2008) demonstrated that improvements in the coarticulation of novel nonwords were paralleled by improvements in serial recall. Furthermore, the superior digit span for material presented in English relative to Welsh in bilingual subjects has been attributed to higher articulatory difficulty for Welsh digits (A. Murray & Jones, 2002). The second reason to consider articulation in serial recall is that it may inform the study of how phonological representations emerge. The phoneme is the minimal processing unit in many of the models described above; however, other accounts argue against treating the phoneme as a basic unit of speech processing (e.g., Marslen-Wilson & Warren, 1994). Computational models by Guenther (1995) and Westermann and Miranda (2004) have illustrated how the interaction between acoustics and articulation leads to the emergence of phonology and phonological regularities in development. On this view, phonology is a mapping between acoustics and articulation rather than an independent level of representation.
There are a number of areas in which the investigation of articulatory processes should prove informative in understanding serial ordering in production and WM. One approach is to devise studies designed to dissociate serial ordering constraints that emerge from articulatory processes from those that precede it. For instance, improved performance in the Woodward et al. (2008) study was attributed to improvements in coarticulation; however, it is also possible that exposure to the novel material led to improvements earlier in the production process, such as in the phonological encoding or utterance-planning processes. Two different familiarization protocols, one with and one without articulation, might be used as a first step in pursuing these effects. Another approach would be to examine whether informative concurrent articulation can modulate serial recall performance. We have argued that a potential explanation of the detrimental impacts of concurrent articulation is that it adds uninformative noise to a system that maps semantics, acoustics, and articulation. This leaves open the question of effects of concurrent articulation that is consistent with the information to be maintained. We would predict that articulatory plans that are consistent with those to be maintained should either minimally impact or even enhance serial recall performance relative to those that are not consistent.
In sum, articulation, which has long been neglected by researchers modeling language production and by many WM researchers as well, should not be ignored. Motor planning in articulation is likely to constrain the process of serial ordering and the nature of the phonological representations that are being ordered.
One of the central motivations for positing specialized storage systems in verbal WM is that they should facilitate language acquisition (Gathercole & Baddeley, 1990). In theory, any new word is a nonword; studies have shown correlations between verbal WM span and nonword repetition ability (e.g., Gathercole, Willis, Emslie, & Baddeley, 1992) as well as correlations between speaking rate and memory span (Cowan et al., 1998; Jarrold, Hewes, & Baddeley, 2000). Furthermore, many studies have demonstrated a correlation between nonword repetition ability and vocabulary acquisition (Gathercole & Baddeley, 1990; Gathercole, Service, Hitch, Adams, & Martin, 1999; Service, 1992). Traditionally, these findings have been taken as evidence that vocabulary acquisition is causally influenced by phonological working memory. The production account, however, offers an alternative interpretation, in that the observed correlations could equally well be attributed to developmental increases in phonological encoding ability.
Partial support for this hypothesis comes in the form of a recent computational model of nonword repetition (Gupta & Tisdale, 2008). In one simulation, the authors varied the vocabulary size of the model (a proxy for the model’s experience) but did not vary the “memory” of the model (i.e., changes in decay rate, activation maintenance, etc.). They found that increases in the number of vocabulary items that the model could correctly produce were associated with increases in the novel nonwords that the model could correctly produce. Thus, variation in experience alone can modulate nonword repetition ability, again emphasizing an important role for long-term learning in WM tasks.
How might this same result be produced in an empirical study? One possibility is a longitudinal investigation of the relations among language production ability, vocabulary acquisition, and verbal WM performance. Rather than using reaction times (e.g., Cowan et al., 1998), we could operationalize children’s language production ability as the propensity to commit speech errors and the extent to which these errors abide by the phonotactics of the language in spontaneous speech. These measures of language production ability could then be used in examining the WM–vocabulary relation as a mediator. Our approach predicts complete mediation of the WM–vocabulary acquisition relation, or at the very least, a severe reduction in the correlation. Some support for this prediction comes from the comprehension domain. Metsala (1999) showed that the correlation between nonword repetition and vocabulary is reduced to nonsignificance when measures of phonological awareness (reflecting learning of the phonological patterns of the language) were included in a regression model. Longitudinal studies are of course extremely time-consuming, but they could prove valuable in examining the relation between lexical learning (vocabulary), sublexical learning (phonotactics), and WM performance. Training studies, described in the next section, could also prove useful in this regard.
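The mediation logic described above can be illustrated with a first-order partial correlation, used here as a simpler stand-in for a full mediation model. The sketch below is only an illustration: the function and variable names are hypothetical, and the scores are invented, centered toy values constructed so that WM span and vocabulary are related solely through a shared production-ability component.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Invented toy scores (not data): each measure is twice the production
# score plus its own independent component.
production = [-1, -1, -1, -1, 1, 1, 1, 1]
wm_span = [-3, -3, -1, -1, 1, 1, 3, 3]
vocabulary = [-3, -1, -3, -1, 1, 3, 1, 3]

raw = pearson(wm_span, vocabulary)                         # about 0.8
adjusted = partial_corr(wm_span, vocabulary, production)   # about 0
print(round(raw, 3), round(abs(adjusted), 3))
```

With these constructed values, the sizable raw WM–vocabulary correlation falls to essentially zero once production ability is partialed out, which is the pattern that complete mediation would yield. Real longitudinal data would call for a proper mediation or regression analysis with appropriate inferential tests.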
Language users learn about words in their language and about the phonotactic patterns embodied in these words. This long-term learning, acquired in the process of perceiving words and through repeated phonological encoding during speech production, clearly shapes verbal WM performance. Language production research has shown that individuals can learn novel phonotactic constraints very quickly and that their speech errors come to reflect this new knowledge (Dell, Reed, Adams, & Meyer, 2000; Taylor & Houghton, 2005; Warker & Dell, 2006).
One means of testing the production-related locus of serial ordering in verbal WM would be to examine the extent to which serial ordering errors in memory tasks abide by newly learned phonotactic constraints. Training of these constraints could occur by having participants repeatedly read sequences of nonword stimuli containing a novel, first-order constraint, as in Dell et al.’s (2000) studies (e.g., certain phonemes such as /f/ or /s/ are always syllable onsets, while /h/ and /g/ are always offsets). Testing of these constraints could occur throughout training in production conditions that involve memory (e.g., serial recall) and nonmemory (e.g., rapid reading). We have hypothesized that the same system is used for normal production and serial recall; thus, analyses of speech errors in the recall and reading tasks should reveal adherence to these novel phonotactic constraints, varying with the amount of experience. Support for these predictions comes from an incidental learning study in which children listened to nonwords that abided by a novel phonotactic grammar (Majerus, Van der Linden, Mulder, Meulemans, & Peters, 2004). Subsequent tests of nonword repetition revealed better performance for nonwords that abided by the novel grammar relative to those that violated it.
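The proposed error analysis can be sketched as a simple legality check on erroneous responses. The code below is a minimal illustration under stated assumptions: syllables are coded as (onset, vowel, coda) triples, the constraint is the example given in the text (/f/ and /s/ restricted to onsets, /h/ and /g/ restricted to offsets), and the function names and toy responses are hypothetical.

```python
# Constraint taken from the example in the text (an assumption of this sketch):
# /f/ and /s/ occur only as onsets; /h/ and /g/ occur only as offsets.
ONSET_ONLY = {"f", "s"}
OFFSET_ONLY = {"h", "g"}


def is_legal(syllable):
    """True if an (onset, vowel, coda) syllable obeys the trained constraint."""
    onset, _vowel, coda = syllable
    return onset not in OFFSET_ONLY and coda not in ONSET_ONLY


def legality_of_errors(targets, responses):
    """Proportion of erroneous syllables that still obey the constraint.

    Targets and responses are aligned position by position.
    Returns None when the participant made no errors.
    """
    errors = [r for t, r in zip(targets, responses) if r != t]
    if not errors:
        return None
    return sum(is_legal(r) for r in errors) / len(errors)


# Invented toy trial: three of the four syllables are produced incorrectly.
targets = [("f", "e", "g"), ("s", "a", "h"), ("m", "e", "n"), ("k", "a", "g")]
responses = [("s", "e", "g"), ("s", "a", "h"), ("m", "e", "f"), ("f", "a", "g")]

# Two of the three errors keep /f/ and /s/ in onset position; one moves /f/
# to the offset and therefore violates the constraint.
proportion_legal = legality_of_errors(targets, responses)
```

Computing this proportion separately for the serial recall and rapid reading conditions, and at several points during training, would show whether adherence to the novel constraint grows with experience in both tasks.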
A key challenge for the production-based account of serial order is to extend the account beyond the phonological encoding level that has been the focus of this article. In language production, serial ordering of phonological representation is determined at the level of phonological encoding and also as a result of the interaction between phonological and lexical–semantic representations (Dell, 1986; Plaut & Kello, 1999). Support for this interrelation comes from numerous sources, including studies examining mixed lists of words and nonwords (Jefferies et al., 2006b) and from examinations of phonological errors in patients suffering from semantic dementia (Knott, Patterson, & Hodges, 1997). This latter set of findings motivated the semantic binding hypothesis (Patterson et al., 1994), which proposes that semantic representations help bind phonological ones. This hypothesized interaction leads to the prediction that phonological factors will be only one of several levels of representation that are responsible for the maintenance of serial order in WM tasks.
One potential approach to addressing this interaction would be a parametric manipulation of phonological similarity and a semantic property such as word concreteness, both of which affect performance on verbal WM tasks. Concrete words provide a more easily imageable mental picture than do abstract words; thus, these words should yield a more concrete message that can be used to maintain information in memory. Furthermore, a concrete message should result in a robust semantic representation. The semantic binding hypothesis predicts that phonological information should be less susceptible to error when semantic representation is increased. If so, then a manipulation of word concreteness should affect the magnitude of the phonological similarity effect observed in verbal WM tasks. The nature of this interaction could take several forms. A manipulation of word concreteness could decrease the magnitude of phonological similarity effect for concrete words relative to abstract ones by more strongly binding phonologically overlapping items together, or it could have the opposite effect by having little impact on overlapping items while rendering nonoverlapping items even easier to remember.
Furthermore, one might predict that phonological similarity effects would continue to exist with visually presented items under conditions of concurrent articulation. The representation of a non-overlapping, concrete word might be robust enough to overcome concurrent articulation when presented visually, leaving at least partially intact effects of phonological similarity. In contrast, abstract words—such as letters, digits, and nonwords—might not have as robust a word-level representation. Hence, differences between overlapping and nonoverlapping phonology should be minimized. In either case, the manipulation would demonstrate that factors other than phonological ones influence the maintenance of phonological order in verbal WM, further supporting the claim that such processes are emergent from those responsible for normal language production.
One of the major criticisms of our language production hypothesis for serial ordering is that we argue by association that the maintenance of phonological information in verbal WM is achieved by the language production architecture. Our view gains some support from studies of brain activation indicating that maintenance processes may be subserved by regions of the posterior superior temporal cortex (Buchsbaum & D’Esposito, 2008), a region that has also been implicated in the process of phonological encoding in speech production (Indefrey & Levelt, 2004). However, the argument is still one of association, albeit a neurobiological one.
We propose, as the most direct test of the functional reliance of verbal WM performance on language production processes, an approach that would directly disrupt phonological encoding processes through use of repetitive transcranial magnetic stimulation (rTMS). rTMS has been used successfully to isolate WM (Postle et al., 2006) and production (Devlin & Watkins, 2007) processes independent of each other, but the technique has not been used to address the functional overlap between the two. If phonological encoding processes are responsible for verbal WM maintenance, then disruption of phonological encoding should induce serial order errors. Specifically, TMS stimulation to regions associated with phonological encoding during the delay period of the WM task should induce serial ordering errors similar to TMS stimulation during a production task. Such a study would go a long way in addressing some of the limitations of the present review and would also open new lines of research in examining language and memory processes in humans.
We began this review with a citation from Ellis (1980) in which he observed that serial ordering errors in language production are paralleled in verbal WM. This research suggests that using analysis techniques more typical of those in the language production tradition should prove informative in understanding verbal WM. Typical dependent measures in language production include speech initiation latency, word and pause durations, and the analysis of speech errors. Some researchers (e.g., Cowan et al., 1998; Haberlandt, Lawrence, et al., 2005; Woodward et al., 2008) have adopted speaking time measures; however, almost no memory research has utilized the detailed speech error analyses that are prevalent in the production literature (although see Acheson & MacDonald, 2008; Treiman & Danis, 1988). We believe that this is a serious omission on the part of WM researchers, who have tended to focus solely on whether an entire item is recalled or not. The item level is important, but our review has shown that much of people’s serial ordering performance may occur at a subitem level. Subitem speech error analyses, therefore, may provide a substantially richer dataset than that afforded by item recall accuracy alone. Furthermore, detailed taxonomies provided by speech error analyses could potentially change some of the interpretations that have been offered for the so-called core phenomena identified in verbal WM (e.g., the phonological similarity effect) while also providing important insights into processing changes that occur with development or brain damage. Detailed error coding is substantially more time-consuming than is coding item recall, but the ultimate payoff could be substantial.
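The contrast between item-level and subitem-level scoring can be made concrete with a short sketch. It assumes that each item is coded as a sequence of phoneme symbols and that responses are aligned with targets position by position; the function names and the toy list are hypothetical, and real error coding would also need to handle omissions, intrusions, and misaligned responses.

```python
def item_accuracy(target, response):
    """Proportion of list positions at which the whole item is correct."""
    correct = sum(t == r for t, r in zip(target, response))
    return correct / len(target)


def subitem_accuracy(target, response):
    """Proportion of target phonemes recalled in the correct item and slot."""
    total = sum(len(item) for item in target)
    correct = sum(
        t == r
        for t_item, r_item in zip(target, response)
        for t, r in zip(t_item, r_item)
    )
    return correct / total


# Invented toy trial: the final consonants of the first two items exchange,
# as in a speech error ("cat map" recalled as "cap mat").
target = [("k", "a", "t"), ("m", "a", "p"), ("d", "o", "g")]
response = [("k", "a", "p"), ("m", "a", "t"), ("d", "o", "g")]

item_score = item_accuracy(target, response)        # 1 of 3 items correct
subitem_score = subitem_accuracy(target, response)  # 7 of 9 phonemes correct
```

Item scoring treats both exchanged items as simply wrong, whereas the subitem score shows that most of the phonological content was retained and that the error was a movement of two phonemes between items.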
Our review of phenomena in verbal WM has revealed a number of parallels with language production research, namely serial ordering constraints, phonological similarity effects, and numerous effects of long-term linguistic knowledge. These behavioral similarities are paralleled in computational approaches to serial ordering, where similar mechanisms are posited. We believe that these findings merit reinterpreting the view that maintenance processes in verbal WM are achieved by dedicated short-term storage mechanisms. Instead, we have argued, serial ordering processes in language production are a likely candidate for maintenance in verbal WM, with particular emphasis placed on the process of phonological encoding. We see these theoretical and computational developments as progress toward realizing Cowan’s (1995) view that short-term storage functions are emergent from the temporary activation of domain-specific long-term representation under the guidance of attention. Major domain-specific processes for maintenance in verbal WM, in our view, are language production processes, with phonological encoding processes as particularly important. There are clearly challenges in pursuing this approach, but it offers opportunities to further extend the investigation of memory and language processes to areas of language acquisition, disordered language processing, and neuroscience.
This research was supported by National Institute of Mental Health Grant P50 MH644445, National Institute of Child Health and Human Development Grant R01 HD047425, and the Wisconsin Alumni Research Fund. We thank Brad Postle for helpful comments on an earlier version of this article.