Research on language has seen a remarkable rise in the use of artificial language learning experiments since their introduction nearly 90 years ago (Esper, 1925). The term Artificial Language Learning generally refers to an experimental paradigm in which participants learn a language, or language-like system, in a lab setting and are then tested on what they learned. The reasons for using artificial languages are diverse and include allowing for a controlled exploration of the principles of universal grammar (Culbertson, Smolensky, & Legendre, 2012; Ettlinger, Bradlow, & Wong, 2013; Finley & Badecker, 2009), of how domain-general cognitive mechanisms might support language and language learning (Saffran, Aslin, & Newport, 1996), of the principles of language change (Esper, 1925), of the relationship between first and second and adult and child language learning (Finn & Hudson Kam, 2008), and of the processes involved in second language learning (Friederici, Steinhauer, & Pfeifer, 2002; Morgan-Short, Faretta-Stutenberg, Brill-Schuetz, Carpenter, & Wong, 2014; Morgan-Short, Sanz, Steinhauer, & Ullman, 2010).
Despite the recent ubiquity of this research, there has been little work that has established a clear relationship between performance in lab-based artificial language learning experiments and learning a natural language in an ecologically valid environment, which is the primary question addressed by the current study. In considering this question, two related questions also arise: Given interest in a possible dissociation between language learning and general cognitive capabilities (Hauser, Chomsky, & Fitch, 2002; Pinker & Jackendoff, 2009), does a relationship between natural and artificial language learning still hold after controlling for general intelligence? Does this relationship differ for different measures of artificial language learning?
In the present study, we seek to bridge this gap between artificial and natural language learning research by examining the performance of adult learners in both second language (L2) and artificial language learning environments. For our artificial language learning study, we recruited a cohort of students enrolled in a Spanish language class and obtained a number of different measures of their classroom performance, their Spanish ability, and their general intelligence. The artificial language learning task included a number of different measures in a single task to represent some of the different types of artificial languages that have been used in other studies. Our analysis also examined the relationship between artificial and second language learning measures and a measure of general intelligence, IQ.
Researchers generally use the term Artificial Language Learning to refer to an experimental paradigm where participants learn a language, or language-like system, in a lab setting and are then tested on what they learned (e.g., Friederici et al., 2002; Gomez & Gerken, 2000). Artificial language learning studies, however, take many different forms and go by many different names. The original terminology of “artificial linguistic system” (Esper, 1925, p. 1) has been expanded to include artificial grammar learning, which generally focuses on the combinatoric aspects of language, often in the absence of meaning (Reber, 1967; Saffran et al., 1996); miniature language learning, which generally refers to learning aspects of an invented or part of a natural language, e.g., a determiner system with semantics (Hudson Kam & Newport, 2005) or the case and classifier system of Japanese (Mueller, Hahne, Fujii, & Friederici, 2005); miniature artificial language, which refers to made-up grammatical categories and word-order rules (Braine, 1963); and semi-artificial language learning, which generally refers to a portion of a natural language, often modified for experimental purposes (Williams, 2005).
The paradigm of the first artificial language learning experiment (Esper, 1925) would be quite familiar to experimenters today. The study examined biases in language learning by exposing participants to pairings of words with pictures of abstract shapes of different colors. After training, participants were presented with just the abstract pictures, which they then had to name. In this and many of the other early artificial language learning experiments, the experimenters treated the mistakes participants made as indicative of learning biases. For example, in one experiment in Esper (1925), the words presented were bi-morphemic, with the first consonant-vowel-consonant-consonant (CVCC) morpheme representing color and the second VC morpheme representing shape. The most common error was that participants would often re-segment the stimuli into two consonant-vowel-consonant (CVC) morphemes, suggesting that a principle of linguistic change involved the alignment of morpheme breaks with syllable breaks. This finding has also been interpreted to reflect a bias against morphemes with complex codas and without onsets.
The 1960s and 1970s saw a significant shift in the nature of artificial language learning experiments. Exemplified by Reber (1967), which is often cited as the first modern artificial language learning experiment, studies started to focus on the combinatoric elements of language, reflecting a shift in the focus of research on language from the study of language change to syntax and generative grammar (Chomsky, 1957, 1965). In Reber’s study, participants were trained on sequences of letters generated by a finite-state grammar without any corresponding meaning associated with the sequences. Participants successfully learned which novel letter sequences were valid with respect to the finite state grammar and crucially reported no explicit awareness of the rules, suggesting an implicit process enabling the extraction of regularities from an input. A similar method of artificial grammar learning has been used with syllables and transitional probabilities (Saffran et al., 1996), words and different types of grammars (Opitz & Friederici, 2002, 2003, 2004), musical notes and finite state grammars (Tillmann, Bharucha, & Bigand, 2000) and various other permutations of these paradigms and stimuli.
The subsequent decades saw a dramatic rise, not only in the frequency, but also in the variety of artificial language learning studies. One recent advance is the use of different participant populations, including children and infants (Gomez & Gerken, 2000; Newport & Aslin, 2004; Saffran et al., 1996), primates (Fitch & Hauser, 2004) and songbirds (Gentner, Fenn, Margoliash, & Nusbaum, 2006). Artificial language learning paradigms are also used in brain imaging studies (Friederici et al., 2002; McNealy, 2006; Morgan-Short, Steinhauer, Sanz, & Ullman, 2012) to explore the neural bases of language learning. In addition to continued research on the precise syntactic properties that facilitate language learning (Friederici, Bahlmann, Heim, Schubotz, & Anwander, 2006; Knowlton & Squire, 1996), artificial language learning studies have explored all levels of linguistic structure from phonetics (Wong et al., 2008) and phonology (Ettlinger et al., 2013; Finley & Badecker, 2009; Moreton, 2008; Wilson, 2006) to semantics (Mirkovic, Forrest, & Gaskell, 2011; Mori & Moeser, 1983; Petersson, Folia, & Hagoort, 2010) and pragmatics (Galantucci & Garrod, 2011; Nagata, 1987). Artificial language learning studies have also been used to explore the degree to which language learning may be explained by domain-general learning mechanisms (Aslin & Newport, 2008; Newport, Hauser, Spaepen, & Aslin, 2004; Saffran et al., 1996). Finally, artificial language learning experiments have recently been incorporated into iterated learning studies where findings come not from seeing whether the languages are learned, but rather, seeing what happens to the artificial languages after several cycles of learning and transmission as a method of exploring principles of language evolution and change (Galantucci, 2005; Kirby, Cornish, & Smith, 2008; Rafferty, Griffiths, & Ettlinger, 2013; Reali & Griffiths, 2009; Smith & Wonnacott, 2010). 
In addressing these and other issues, artificial languages can be viewed as serving as test-tube models of natural language that allow researchers to examine precise issues about language that are not readily testable with natural language (e.g., Morgan-Short et al., 2012).
Despite the use of the artificial language learning paradigm as a way to explore the human language faculty, very little research has explored the relationship between artificial and natural language learning, particularly with respect to second language learning. Indeed, many papers acknowledge the important caveats associated with their findings. For example, Braine (1963) acknowledges that, “[a]lthough experiments with artificial languages provide a vehicle for studying learning and generalization processes hypothetically involved in learning the natural language, they cannot, of course, yield any direct information about how the natural language is actually learned [emphasis added]” (p. 324). Similarly, Ferman, Olshtain, Schechtman, and Karni (2009) note that, “[…] one may argue that the simplified language and laboratory conditions afforded in artificial language paradigms may not express the complexity of natural language or of real-life learning conditions. These arguments, however, express a classic dilemma that inevitably arises as the price of experimental control in laboratory research” (p. 387).
A few studies have explored the relationship between artificial and natural language learning indirectly. Some have shown that people with language-related impairments including aphasia (Christiansen, Kelly, Shillcock, & Greenfield, 2010; Dominey, Hoen, Blanc, & Lelekov-Boissard, 2003; Goschke, Friederici, Kotz, & van Kampen, 2001), specific language impairment (Evans, Saffran, & Robe-Torres, 2009) and developmental dyslexia (Pothos & Kirk, 2004) perform worse on artificial language learning tasks than healthy controls. For example, Evans et al. (2009) showed that children with specific language impairment performed worse in an artificial language learning experiment involving the acquisition of transitional probabilities for both syllables and notes. Furthermore, for individuals in both groups, native language receptive vocabulary correlated with their performance in the artificial language learning task. Similarly, Misyak and Christiansen (2012) found a correlation between artificial language learning and some aspects of native language ability, including vocabulary and the comprehension of complex sentences and Misyak, Christiansen, and Tomblin (2010a) found a correlation between learning an artificial language with non-adjacent dependencies and the processing of long distance dependencies in natural language.
An indirect relationship between artificial and natural language learning may also be inferred by virtue of the fact that both artificial and second language learning correlate with a third variable, verbal working memory. Examples of the relationship between language learning and working memory are well documented (Michael & Gollan, 2005; Robinson, 2005a, 2002; Williams, 2012) and include both first and second language skill. Examples of research showing a relationship between artificial language learning and working memory include Misyak and Christiansen (2012), where a measure of verbal working memory correlated significantly with the statistical learning of adjacent (r = .46) and non-adjacent (r = .53) transitional probabilities. Thus, artificial and second language learning may be related to each other by virtue of being supported by working memory. Were this the case, it would raise the question of whether artificial language learning studies tap into language-specific learning abilities or whether they instead assess participants' general learning abilities or general intelligence, which in turn play a role in second language learning (Genesee, 1976).
Finally, Robinson (2005b, 2010) directly explored the relationship between artificial and second language learning by comparing performance on two artificial grammar learning tasks with a brief second language learning task. The artificial language component of the study included standard explicit and implicit artificial grammar tasks, which required participants to view and later judge strings of letters generated by an artificial grammar (implicit) or to view a series of letters and choose the letter that best completed the series (explicit). The second language learning task involved exposing the participants to sentences reflecting three different grammatical rules from a natural language (Samoan), and then testing learners on the grammatical rules of the language. In addition, Robinson assessed participants’ language learning aptitude, working memory and intelligence.
No relationship was found between either of the artificial language learning tasks and the natural language learning tasks. In addition, artificial and second language learning tasks correlated with different cognitive abilities: (a) the implicit artificial grammar learning task correlated negatively with IQ, (b) the explicit artificial language learning task correlated positively with aptitude, and (c) the natural language learning task correlated positively with working memory. Robinson suggests that the lack of a relationship between the two learning tasks may be attributed to the fact that the artificial language learning task relied primarily on implicit learning mechanisms and lacked the semantics that are crucially involved in natural second language learning. Similarly, Brooks and Kempe (2013) explored the relationship between learning a small portion of Russian grammar and learning an artificial syntactic grammar using pseudo-words presented auditorily. The results showed that there was no relationship between the auditory sequence learning and L2 learning after controlling for metalinguistic awareness as assessed in a post-hoc interview.
The results of these two studies may be limited to the particular artificial grammar learning paradigm, which did not include semantics, and to the limited nature of the L2 learning assessment. The findings may not necessarily generalize to the relationship between other types of artificial and second language grammar learning or between artificial grammar learning and other aspects of language, such as vocabulary, word segmentation, phonology and pronunciation, literacy or any other measures of second language aptitude. Crucially, in these previous studies, the second language learning took place over the course of less than a week, whereas typical second language learning generally occurs over the course of weeks, months and years.
Thus, previous research on the relationship between artificial and natural language learning has primarily focused on language disorders or on first language ability, whereas the studies that focused on the relationship between artificial and second language learning showed null results, perhaps due to the limited nature of the experiments used. Indeed, the majority of research in this area has focused on one specific type of artificial language learning, that of artificial grammar learning, where no meaning is assigned to the artificial structures being acquired. Limiting research to artificial grammar learning studies also fails to provide comparisons of different artificial language learning paradigms.
In the present study, we explored (a) the relationship between artificial and second language learning, (b) whether such a relationship still holds after controlling for general intelligence, and finally (c) whether this relationship differs for different measures of artificial language learning. We addressed these issues in the following manner: We used an artificial language learning paradigm that included semantics; we measured participants' IQ to examine the relationship between artificial language learning and L2 learning while controlling for general intelligence; we included a number of different artificial language learning measures, including recall, simple grammar learning, and complex grammar learning; and we included a more comprehensive assessment of L2 ability. This allowed us to consider which aspects of L2 learning are tapped into by artificial language learning experiments, rather than treating L2 learning as a monolithic cognitive function. Given the diversity of metrics that are used to quantify L2 ability, different measures of artificial language learning may reflect different facets of L2 learning.
Because artificial language learning studies are attempts at simplifying language learning for a lab setting, we predicted that the complex artificial language learning measure would most closely correlate with objective measures of L2 ability. By the same token, because classroom performance incorporates a number of different skills, including language learning, homework completion, memorization, test preparation, etc., we predicted that the composite measure of artificial language learning, which includes recall, simple grammar learning and complex grammar learning, would most closely correlate with overall measures of classroom performance. We also predicted that IQ would mediate the relationship between the composite measure of artificial language learning and classroom performance, as both incorporate general cognitive capabilities beyond language learning alone. Conversely, we predicted that IQ would not mediate the relationship between complex artificial language learning and L2 ability, as we hypothesize these measures index language learning more exclusively.
Participants were 44 adults (23 female) enrolled in a fourth-semester Spanish language class, focused on learning and using vocabulary, grammar and culture for communicative purposes, at a university in Chicago, Illinois. Participants were recruited over two separate semesters and received monetary compensation. Participants' average age was 21.7 years (SD = 2.9) and the mean age of initial exposure to Spanish was 13.5 years old (SD = 5). None of the participants had more than five years of classroom experience with Spanish, though 8 participants indicated ages of acquisition of less than 11 years of age based on general exposure to the language. Thirty-two of the 44 participants were monolingual native English speakers aside from their experience with Spanish, and none of them were heritage speakers of the language. The 12 bilingual participants had experience with languages other than Spanish (i.e., 6 Gujarati, 2 Tagalog, 1 ASL, 1 Haitian Creole, 1 Hindi, 1 Tamil). None of the languages participants knew shared the properties critical to the morphophonological system of the artificial language that participants learned in the study.
We evaluated participants using measures of artificial language learning skill, measures of Spanish learning skill and measures of general intelligence. The artificial language learning test made use of a morphophonological grammar learning paradigm that included a semantic component. Participants were tested on both recall of the artificial language and on generalization for two morphophonological processes – simple and complex – to assess artificial language learning ability. The evaluation of Spanish classroom learning incorporated separate measures of classroom performance, subjective teacher assessments, and objective measures of Spanish interpretation and production. The general intelligence assessment included standardized measures of both verbal and non-verbal IQ.
The artificial language in this study has previously been used to explore the relationship between language learning and domain-general cognitive abilities (Ettlinger et al., 2013; Wong, Ettlinger, & Zheng, 2013). In this paradigm, participants were trained on a morphophonological system for combining affixes with words to form new words. Participants were then tested on the words they had been trained on, as well as on their ability to extend the grammar to a withheld set of words.
The language consisted of 30 noun stems and two affixes: a prefix, [ka-], marking the diminutive (e.g., as in English doggy) and a suffix, [-il], marking the plural (e.g., dogs). The nouns represented 30 different animals and freely combined with the affixes to produce 120 different words.
The phonological inventory consisted of American English consonants and three American English vowels, [i, e, a] each used within a CVC structure to produce ten unique nouns for each vowel. No English words or Spanish words were used. Given the diversity of other languages known by bilingual participants, words from other languages were not overtly avoided.
The grammar of the language had two word formation rules, as depicted in Figure 1. The SIMPLE type, applicable to i-stems and a-stems, consisted of concatenating the stems with the suffix [-il] and/or prefix [ka-] without changing any vowels. The COMPLEX type, applicable to e-stems, consisted of concatenation plus changing vowels in the stem and affix. More specifically, the changes reflected two processes absent from English or Spanish. First, vowel harmony changed vowels in the suffix so they had the same (jaw) height as the stem vowel (e.g., the plural of [mez], 'cat', became [mez-el] 'cats'; compare [vab-il] 'cows'); second, vowel harmony was also triggered by the prefix [ka-], which changed stem vowels to low (e.g., [ka-maz], 'little cat'). When combined, they yielded complex e-stem words [ka-maz-el] as contrasted with simple i-stem words [ka-bis-il]. Vowel harmony is a relatively common phonological phenomenon and is estimated to occur in hundreds of languages (out of ~6,500) around the world (van der Hulst & van de Weijer, 1995). The particular vowel harmony grammatical system used in the current study was based on the language Shimakonde (Ettlinger, 2008).
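The two word-formation rules can be summarized in a short sketch, assuming (as in the examples above) that the stem vowel alone determines simple versus complex behavior; the function name and hyphenated output are illustrative conveniences, not part of the actual stimuli:

```python
def inflect(stem, diminutive=False, plural=False):
    """Sketch of the artificial language's two word-formation rules.

    SIMPLE (i-stems, a-stems): plain concatenation of ka- and/or -il.
    COMPLEX (e-stems): the plural suffix harmonizes in height with the
    stem vowel (-il -> -el), and the prefix ka- lowers the stem vowel
    (e -> a), as in [mez] 'cat' -> [ka-maz] 'little cat'.
    """
    stem_vowel = next(ch for ch in stem if ch in "iea")
    is_complex = stem_vowel == "e"
    surface = stem
    parts = []
    if diminutive:
        parts.append("ka")
        if is_complex:
            surface = surface.replace("e", "a")  # ka- lowers the stem vowel
    parts.append(surface)
    if plural:
        parts.append("el" if is_complex else "il")  # suffix height harmony
    return "-".join(parts)

print(inflect("mez", plural=True))                   # mez-el
print(inflect("vab", plural=True))                   # vab-il
print(inflect("mez", diminutive=True, plural=True))  # ka-maz-el
print(inflect("bis", diminutive=True, plural=True))  # ka-bis-il
```

The sketch reproduces the forms cited in the text: complex [ka-maz-el] versus simple [ka-bis-il] and [vab-il].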
A native English speaker was recorded, using Praat (Boersma & Weenink, 2005), saying each of the words at a normal rate with English prosody and phonology so as to sound natural and fluent. Each word had a corresponding picture of an easily recognizable animal or set of animals, with the small animal picture being a shrunken, diminutive version of the standard-sized picture. All stimulus and test items are shown in Appendix I.
Participants were only told that they would be exposed to a language and then tested on what they learned. They were given no instruction on the rules of the language, nor were they told that there were even rules to learn. Auditory stimuli were presented over headphones. Visual stimuli (pictures of the words' meanings) were presented on a computer monitor, and participants recorded responses on a button box.
Training consisted of passive exposure to word-picture pairings, with no feedback. During the 20 minutes of exposure, each participant was exposed to four repetitions of 12 nouns in all four forms for a total of 192 tokens in random order. Four nouns were complex (/e/), four were simple with /i/-stems and four were simple with /a/-stems. Each picture was displayed for three seconds with a one second ISI. Each audio clip was about one second long and started 500 ms after the picture appeared.
At the end of training, participants were tested on their recollection of the 48 words on which they had received training. During the testing of trained items, participants saw a picture and heard two words in a two-alternative forced-choice task. The foil reflected the incorrect form of the suffix (e.g., kagadel vs. kagadil) or stem (e.g., kagad vs. kaged); foils for each item are detailed in Appendix I. Each trained item was tested once, in random order. The first alternative was heard 500 milliseconds after the picture appeared, the second word 1500 milliseconds after the first (with an ISI of around 500 milliseconds, depending on word length). Order of presentation for the answer and foil was randomized and balanced across the study. Participants had 3 seconds after the beginning of the second word to respond.
After the testing of trained items, participants were tested on their ability to apply the grammar they learned to new words in a version of a wug-test (Berko, 1958). Wug-tests, which are used to assess grammatical knowledge, particularly in children, involve prompting a participant with a new word (e.g., wug) and then asking them to produce the word with a modified meaning (e.g., wugs), ensuring that participants are displaying knowledge of the grammar rather than recall of an inflected word. Here, participants saw a picture (from the group of 18 withheld nouns) for 1500 ms and heard the corresponding new word. After seeing this word-picture pairing, participants saw a blank screen (one second), then another picture of the same animal but either as a plural, diminutive or diminutive-plural (e.g., first a lion, then many small lions). After seeing the second picture, participants had to choose, via button-press, between two auditorily presented alternatives for naming the second picture. The trials, which included both simple and complex untrained test items, were presented in random order with no feedback provided. The foils for this two-alternative forced-choice wug-test are shown in Appendix I.
After the artificial language learning test, participants were administered the Kaufman Brief Intelligence Test, Second Edition (KBIT; A. S. Kaufman & Kaufman, 2004). This test measures IQ, including verbal and non-verbal subcomponents. The test takes approximately 20 minutes and was administered in English.
There were several components of the Spanish language assessment used to measure participants' success in learning Spanish in the classroom. One component was the participants' final classroom grades. Instructors reported students' overall final grade in the course, which comprised students' grades on (a) chapter exams (55%), which included assessment of vocabulary, grammatical concepts, cultural readings and videos from particular chapters in the instructional text (VanPatten, Leeser, Keating, & Roman-Mendoza, 2005); (b) pop quizzes (15%), comprising ten quizzes administered over the course of the semester that assessed any area that the class instructor wanted to test; (c) online homework (15%), which reflected participants' performance on weekly online homework assignments on vocabulary, grammar and culture throughout the semester (students could attempt each homework activity up to three times, with scores reflecting the best attempt only); and (d) participation in class (15%), which was the average of a daily score based on attendance, preparedness, and participation in the Spanish classroom. The instructors also reported the students' grades specific to each of the four graded elements included in the final classroom grade. In addition to reporting student grades, we also asked each instructor to rate participating students on a 1-5 scale based on their reading, writing, speaking and comprehension abilities.
We also used two other, more objective assessment instruments. The first was the Elicited Imitation Task (Vinther, 2002). Used for decades as a measure of implicit knowledge of a second language (Naiman, 1974), the Spanish version of the Elicited Imitation Task involves listening to sentences in Spanish, then repeating each sentence in Spanish within a limited time-span. Because people are limited in what they can repeat by what they can process (Gray & Ryan, 1973), the Elicited Imitation Task serves as a useful tool for rapid language assessment and correlates significantly with more time-intensive measures (Erlam, 2006). The specific Elicited Imitation Task used here was adopted from Ortega (2000) and was modified to reflect Mexican Spanish, the dialect reflected in the course textbook. The task comprises 30 sentences that vary in their grammatical complexity as well as syllable count (with a range of 7-17 syllables). Participants were instructed to listen to the recorded sentences in Spanish, which were presented one at a time, and to repeat each sentence out loud after a beep that sounded after each sentence. Participants' responses were digitally recorded. The digital recordings were transcribed by two independent raters and then scored following the protocol from Ortega (2000). Each sentence could earn a score of 0, 1, 2, 3, or 4 based on the accuracy of the repetition, yielding a maximum possible score of 120. Any discrepancies between the two raters were resolved by a third rater, who listened to the recordings independently and provided a final score.
The second objective measure was a brief test of a specific aspect of grammatical knowledge in Spanish, the subjunctive of doubt construction, which is taught explicitly in Spanish language classrooms. The construction had originally been taught during the previous semester of Spanish and was targeted for review in the present semester, ensuring adequate opportunity to use it. The test was adopted from Farley (2001) and comprised two sections: a comprehension portion and a production portion, with the order counterbalanced across participants. For the 24 interpretation items, participants were required to choose between two possible main clauses to begin a sentence whose ending was provided. The critical items were designed to assess knowledge of clauses that require the use of the subjunctive, such that participants had to interpret the subjunctive form of the verb in the sentence ending in order to decide which of the two possible main clauses could begin the sentence. Eleven of the test items were distractors and the thirteen critical items were scored as correct or incorrect, resulting in a percent correct for each participant.
The production portion of the language assessment was a fill-in-the-blank task where participants were required to complete a provided sentence with the correct form of a verb (provided in infinitive form). This portion of the task included 18 items with ten critical items that were designed to elicit a choice between the subjunctive and indicative moods.
Example questions for both tests are provided in Appendix II.
Participants came into the lab over a two-week span in the sixth and seventh weeks of the semester. This served to control for the amount of instruction in this class that they had received prior to the study and to minimize the differences in skill level between participants, though some differences remained. Participants began with the Elicited Imitation Task, then filled out the LEAP-Q questionnaire regarding their language background (Marian, Blumenfeld, & Kaushanskaya, 2007). Participants then took the artificial language learning test, followed by the KBIT and the Spanish test of the subjunctive construction. Information about the participants that was provided by the instructors (i.e., classroom grades, teacher assessments) was obtained the week after the last day of the semester.
The means, standard deviations and ranges for all measures obtained are provided in Table 1.
For the artificial language learning test, participants performed significantly above chance on the recall of the trained items (t(43) = 7.6, p < .001) (Figure 2). Participants also performed significantly above chance on the simple untrained items (t(43) = 4.3, p < .001), but were not above chance on complex untrained items (t(43) = .11, p = .91) (Figure 2). However, 14 of the 44 participants successfully learned the complex grammar and performed significantly above chance for the complex measure (at p < .05 for binomial probability, proportion correct > .66), reflecting a substantial range in learning success across participants for complex items. As highlighted in Ettlinger et al. (2013), below-chance performance on complex items may indicate an interesting aspect of learning and performance. Therefore, we conducted additional statistical analyses taking that into consideration and included these analyses in Appendix III.
With respect to second language learning, there were significant positive correlations amongst all measures of Spanish ability (Table 2), suggesting internal consistency amongst the different measures. The Elicited Imitation Task, in particular, correlated significantly with almost all of the other measures of Spanish ability. The two measures that did not correlate with the Elicited Imitation Task were homework and class participation grades, which arguably measure effort rather than acumen. All measures of teachers’ subjective evaluations of the students in reading, writing, speaking and comprehension were highly correlated with each other, suggesting minimal distinctiveness among those measures. Also, among the Spanish ability measures, only final exam score correlated with IQ, corroborating previous research suggesting that general intelligence or IQ explains only a small portion of the variance observed in language learning (Robinson, 2005a). There were also positive, but not significant, correlations between the three artificial language learning measures (recall vs. simple: r(43) = .14, p = .34; recall vs. complex: r(43) = .24, p = .11; simple vs. complex: r(43) = .03, p = .87).
Our primary interest is in the relationship between the artificial language learning and natural language learning measures. The three main questions that are being addressed are: Is there a relationship between artificial language learning and natural language learning? Does this relationship still hold after correcting for IQ? Does this relationship differ for different measures of artificial language learning?
As a preliminary exploratory analysis to address the first question, we consider the overall correlations shown in Table 2. There were a number of positive correlations between the three artificial language learning measures and the different measures of Spanish learning ability. These extend up to r = .49 for class grades, and r = .58 for teacher evaluated performance, which compares favorably to previous studies showing a relationship between natural language learning ability and measures of working memory and artificial grammar learning, which show correlations around r = .40 (Misyak & Christiansen, 2012; Robinson, 2005b).
To address the second question on the relationship between artificial language learning and second language learning, independent of the effects of general intelligence, we performed a first-level analysis using correlations amongst key measures controlling for IQ (Table 3). Importantly, there was still an overall significant correlation between overall final class grade and composite artificial language learning performance (r(41) = .44, uncorrected p = .001).
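A partial correlation of this kind can be computed by regressing IQ out of both variables and correlating the residuals. The following is a minimal Python sketch of that standard procedure (the reported analyses were run in R); the function and variable names are our own, not the study's:

```python
import numpy as np
from scipy import stats

def partial_corr(x, y, z):
    """Pearson correlation between x and y after partialling out the
    covariate z: regress each variable on z, then correlate the residuals."""
    x, y, z = map(np.asarray, (x, y, z))
    Z = np.column_stack([np.ones(len(z)), z])          # intercept + covariate
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]  # residuals of x given z
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]  # residuals of y given z
    r = stats.pearsonr(rx, ry)[0]
    n = len(x)
    t = r * np.sqrt((n - 3) / (1 - r ** 2))            # df = n - 3 for one covariate
    p = 2 * stats.t.sf(abs(t), n - 3)
    return r, p
```

With 44 participants, a first-order partial correlation has 41 degrees of freedom, matching the r(41) values reported throughout.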
A more conservative analysis utilizes Bonferroni correction, given the large number of comparisons involved in this study. For the 50 comparisons (10 classroom measures × [4 artificial language learning measures + IQ]), a very conservative threshold of p < .001 (.05/50) may be used. After this correction, and when controlling for IQ, Composite artificial language learning was still significantly correlated with Exam Score, Quiz Score, and the Reading, Writing and Comprehension assessment scores, and Complex artificial language learning was still significantly correlated with the Elicited Imitation Task.
The relatively low number of participants for this individual differences study motivates additional significance testing. First, a Monte Carlo simulation can be used to estimate the likelihood of obtaining the correlations reported in Table 3 simply by chance. For each of 10,000 simulation iterations, scores were generated for 44 participants on 17 performance measures using R (R Development Core Team, 2010). The scores were generated by randomly sampling, with replacement, a value from the actual scores for each performance measure. Thus, the Recall scores for each of the 44 simulated participants were generated by randomly selecting from one of the 44 actual Recall scores; then the Complex scores were randomly selected from the actual Complex scores, and so on for all 17 measures. This ensures the distributional properties of the scores are retained, even if they are not normal, and simulates what the results of our study would have been had there been no relationship between artificial language learning, natural language learning and IQ for each participant. The correlations between these randomly generated scores were calculated in the same manner as the results in Table 3. A histogram of the correlation coefficients is shown in Figure 3; it shows the correlation values one would obtain from conducting these analyses on random data.
As expected, most of the correlations are close to zero. Correlations above .27 appeared 5% of the time, correlations above .39 appeared in only 0.1% of the simulations, and no correlation as large as .45 appeared in any of the 10,000 simulations of 48 correlations. This suggests that the correlations observed in Table 3, particularly those above .39, are extremely unlikely to be due to chance.
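The resampling scheme described above can be sketched as follows. This is an illustrative Python reimplementation (the reported simulation used R), run here on random stand-in scores rather than the study's data:

```python
import numpy as np

rng = np.random.default_rng(0)

def null_correlations(data, n_iter=10_000):
    """Monte Carlo null distribution of correlation coefficients.

    `data` is an (n_participants, n_measures) array.  Each iteration
    resamples every column independently with replacement, preserving
    each measure's marginal distribution while destroying any
    relationship between measures, then records all pairwise
    correlations among the resampled columns.
    """
    n, m = data.shape
    iu = np.triu_indices(m, k=1)              # upper-triangle measure pairs
    out = np.empty((n_iter, len(iu[0])))
    for i in range(n_iter):
        fake = np.column_stack([
            rng.choice(data[:, j], size=n, replace=True) for j in range(m)
        ])
        out[i] = np.corrcoef(fake, rowvar=False)[iu]
    return out.ravel()

# Stand-in scores: 44 participants x 17 measures, as in the design above.
rs = null_correlations(rng.normal(size=(44, 17)), n_iter=200)
print(np.mean(np.abs(rs) > .39))              # tail area beyond r = .39
```

With 44 participants, sampling variability alone puts the standard deviation of a null correlation near 1/√43 ≈ .15, which is why null values beyond .39 are rare.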
Second, the possibility of any individual participant significantly biasing the findings can be mitigated using a leave-one-out analysis. The correlations between composite artificial language learning score and overall classroom grade and Elicited Imitation Task performance were calculated, controlling for IQ, 44 times, each time leaving out one participant. The correlation between composite artificial language learning score and overall classroom performance ranged from .40 to .51, with a mean of .44 and standard deviation of .020 and the correlation between composite artificial language learning score and Elicited Imitation Task performance ranged from .35 to .52, with a mean of .39 and standard deviation of .025. While leaving out certain participants changed the significance value for these correlations, the results were still significant and there is no evidence that a small number of participants drove the correlations found.
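The leave-one-out check can be sketched in the same way. Again this is an illustrative Python version with hypothetical variable names, not the study's code:

```python
import numpy as np
from scipy import stats

def partial_r(x, y, z):
    """Pearson r between x and y after regressing the covariate z
    (here, IQ) out of both variables."""
    Z = np.column_stack([np.ones(len(z)), z])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return stats.pearsonr(rx, ry)[0]

def leave_one_out(x, y, z):
    """Recompute the IQ-partialled correlation n times, dropping one
    participant each time, to check that no single case drives it."""
    n = len(x)
    return np.array([
        partial_r(np.delete(x, i), np.delete(y, i), np.delete(z, i))
        for i in range(n)
    ])
```

A narrow range of leave-one-out values, as reported above (.40 to .51 and .35 to .52), indicates that the correlations are not carried by any individual participant.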
Considering the third question, on the differing relationship between the different measures of artificial language learning and classroom learning, there were a number of positive correlations to consider among the different artificial language learning measures.
After controlling for IQ, Complex artificial language learning was found to be correlated with exam grade (r(41) = .42, p = .005), overall class score (r(41) = .32, p = .03), teacher-rating of comprehension ability (r(41) = .34, p = .025), production on the Spanish test of the subjunctive (r(41) = .30, p = .048), and the Elicited Imitation Task (r(41) = .49, p < .001).
Simple artificial language learning performance was correlated with exam grade (r(41) = .32, p = .041) and teacher ratings for reading and writing (r(41) = .56, p < .001, and r(41) = .58, p < .001, respectively).
Recall of trained items was related to homework grade (r(41) = .32, p = .04), quiz grade (r(41) = .33, p = .035) and teacher ratings of reading, speaking and comprehension ability (r(41) = .34, p = .030, r(41) = .40, p = .001, and r(41) = .52, p < .001, respectively).
This different set of relationships for complex and simple artificial language learning (e.g., both show a relationship with Exam grade but only complex artificial language learning shows a relationship with Final Grade, see Figure 4) suggests that second language learning is not a unitary process; it involves a number of different skills and abilities including understanding, speaking, written communication and explicit knowledge of the language (i.e., what is tested on exams).
Bivariate correlations do not take collinearity into consideration, and our data had a large number of collinearities, particularly since some measures are embedded in others by design (e.g., aggregate and component artificial language learning scores). Furthermore, there was variability in the linguistic background of the participants: Some were bilingual and they all had different amounts of prior exposure to Spanish. Therefore, we used a step-wise regression to look for unique variance explained. We also included number of years of Spanish and whether the participant was bilingual as co-variates. This also allowed us to address the question of how well artificial language learning tests predict second language learning and the reverse question of what aspects of second language learning are tapped into when conducting an artificial language learning test.
We conducted three step-wise multiple regressions with different dependent variables, incorporating both forward selection and backward elimination. The first regression had composite artificial language learning score as the dependent variable, and the initial model included all of the measures of Spanish ability plus IQ, number of years of exposure to Spanish and whether the participant was bilingual as independent measures. After the regression, the final model included Quiz score, Exam score and IQ (Table 4) and accounted for a significant amount of variance in artificial language learning performance (R2 = .41, p = .0001). Crucially, neither of the language experience measures – years of Spanish and bilingualism – factored into performance, possibly due to the narrow standard deviation of years of exposure to Spanish and the low number of bilinguals.
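A bidirectional step-wise procedure of this kind can be sketched as below. This Python version uses AIC as the selection criterion, which is an assumption on our part (the criterion is not stated here), and runs on synthetic predictors rather than the study's variables:

```python
import numpy as np

def aic(X, y, cols):
    """AIC of an ordinary least squares fit of y on an intercept
    plus the listed predictor columns of X."""
    n = len(y)
    M = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(M, y, rcond=None)
    rss = np.sum((y - M @ beta) ** 2)
    return n * np.log(rss / n) + 2 * (len(cols) + 1)

def stepwise(X, y):
    """Bidirectional step-wise selection: at each step, try every single
    addition (forward selection) and removal (backward elimination) and
    accept whichever move lowers AIC most; stop when no move helps."""
    selected, rest = [], list(range(X.shape[1]))
    best = aic(X, y, selected)
    while True:
        moves = [(aic(X, y, selected + [j]), "add", j) for j in rest]
        moves += [(aic(X, y, [k for k in selected if k != j]), "drop", j)
                  for j in selected]
        if not moves:
            break
        score, op, j = min(moves)
        if score >= best:
            break                   # no addition or removal improves AIC
        best = score
        if op == "add":
            selected.append(j); rest.remove(j)
        else:
            selected.remove(j); rest.append(j)
    return sorted(selected)
```

On synthetic data where only a couple of several candidate predictors carry signal, the procedure retains those predictors and prunes the rest, mirroring how the regressions here reduced the initial models to a few predictors.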
The second regression included Final Grade as the dependent variable, with the different artificial language learning scores, plus IQ, bilingualism and previous Spanish exposure, as the independent measures in the initial model. After the step-wise regression, the final model for this second regression (Table 5) accounted for a significant amount of variance in classroom performance (R2 = .23, p = .01). As shown, IQ does play a role in Final Grade, as expected, but artificial language learning explained variance beyond IQ: the Complex and Simple artificial language learning scores were retained in the final model as explaining this variance, while the Recall score did not explain any additional variance in performance. As above, years of experience with Spanish and bilingualism did not account for any additional variance.
Finally, the third step-wise multiple regression included Elicited Imitation Task score as the dependent variable, also with the different artificial language learning scores, plus IQ, bilingualism and previous Spanish exposure as the independent measures in the initial model. The final model for this third step-wise multiple regression (Table 6) accounted for significant variance in Spanish ability (R2 = .24, p < .001). The final model included only Complex AGL as predicting performance on the Elicited Imitation Task, suggesting that this Complex AGL measure is a useful innovation over previous artificial language learning experiments. The fact that IQ and the other artificial language learning measures did not explain additional variance in Elicited Imitation Task performance suggests that this language-learning ability (as contrasted with classroom ability) is distinct from other measures of intelligence such as IQ and recall ability.
Although bilingualism and previous Spanish exposure accounted for no additional variance in any of the multiple regression analyses, we compared performance between bilinguals and monolinguals and correlated performance to years of Spanish exposure to further ensure that these are not factors. There was no evidence of a difference between bilinguals and monolinguals for overall artificial language learning (unpaired t-test t(42) = .29, p = .78), classroom performance (t(42) = .62, p =.53) or Elicited Imitation Task (t(42) = 1.3, p = .20) and there was also no evidence for a relationship between previous Spanish exposure and overall artificial language learning (r(41)=.11, p = .47), overall classroom performance (r(41) = -.08, p = .60) or Elicited Imitation Task (r(41) = -.15, p = .31).
Thus, these results show a strong relationship between performance on an artificial language learning task and L2 learning. Artificial language learning performance correlated with a broad range of L2 measures, including classroom performance, teacher evaluation of language ability, and more objective measures of language ability. The Complex artificial language learning task showed the strongest relationship with more objective measures like the Elicited Imitation Task. This is in accordance with the idea that artificial language learning paradigms are simplified versions of language learning, and thus the most complex artificial language learning will most closely resemble L2 learning, though correlations were found for Simple and Recall aspects of artificial language learning, as well. Conversely, measures of overall classroom performance, which are based on a number of non-language learning related skills, correlated most closely with Composite measures of artificial language learning. Crucially, these relationships still hold when controlling for general intelligence, or IQ. Finally, we considered which different aspects of L2 learning are captured by artificial language learning overall. The results of a multiple regression with artificial language learning as the dependent variable suggested that artificial language learning taps primarily into IQ and classroom performance on quizzes and exams.
In the present study, we examined the relationship between artificial language learning and natural language learning, how it may differ for different measures of artificial language learning, and how the relationship may be mediated by IQ. Our primary finding—a positive correlation between performance on an artificial language learning task and second language learning in an ecologically valid environment—provides a key link between studies that use artificial language learning experiments and our understanding of second language learning in the classroom.
By virtue of using an artificial language learning task with several measures and a semantic component, we were also able to show that the more complex grammar tracked most closely with classroom performance and Spanish ability. This suggests that artificial language learning studies that incorporate a semantic component and involve more complicated grammatical systems may more closely resemble second language learning. On the other hand, the composite measure of artificial language learning, which included recall and simple and complex grammatical generalization, was most closely related to overall classroom performance, which includes study skills, motivation, and so on. Because different aspects of artificial language learning were related to different aspects of second language learning, we may conclude that not all artificial language learning paradigms would be expected to approximate natural language learning.
Further research can explore the generalizability of these results to other artificial language learning paradigms. A more comprehensive study would be longitudinal and follow students over the course of a number of semesters, to observe changes in proficiency, which is more reflective of learning, and would include a larger sample size, as is important in individual differences studies (e.g., Desmond & Glover, 2002).
Juxtaposing the differences between the present artificial language learning paradigm and other studies, which (1) found no relationship between artificial and second language learning (Robinson, 2005b), (2) found a relationship mediated by other factors (Brooks & Kempe, 2013), or (3) found an indirect relationship (Evans et al., 2009; Misyak et al., 2010a; Misyak, Christiansen, & Tomblin, 2010b) suggests that the specific methods used in an artificial language learning paradigm matter in terms of engaging natural language learning processes. The current paradigm differs from previous studies by having a semantic component, by being multimodal with auditory and visual-picture referents, and by focusing on morphophonology. Future research manipulating modality, semantics and the parts of artificial language acquired can provide further clarity on what is necessary to best approximate natural language learning.
Finally, the results address our question on whether a relationship between artificial language learning and classroom learning still holds when factoring out general intelligence. The correlations remain significant after controlling for IQ, suggesting that the ability tapped into by artificial language learning and L2 learning is distinct from general intelligence.
Further characterizing the nature of this language learning ability remains an interesting challenge. The results of the present study could mean that there is a distinct language-learning skill underpinning second language learning ability and that artificial language learning studies are a useful method of exploring and evaluating that ability. This is generally the assumption made by authors using artificial language learning studies to explore language function, including those showing an overlap between neural mechanisms associated with language processing and neural mechanisms associated with artificial language learning (e.g., Friederici et al., 2002).
Alternatively, the results could be interpreted to mean that there is some third skill or capability that is crucial for both artificial language learning and second language learning distinct from IQ. This skill may be related to perceptual learning in the auditory system (Maye, Werker, & Gerken, 2002), general pattern matching, or different memory subsystems.
Indeed, auditory working memory has been argued to be involved in both first and second language learning (Baddeley, 1992; Baddeley, Gathercole, & Papagno, 1998; Ellis & Sinclair, 1996) as well as in artificial language learning success (Amato & MacDonald, 2010; Misyak & Christiansen, 2012).
The procedural and declarative memory systems have also been suggested to play a role in both artificial and second language learning (Conway, Bauernschmidt, Huang, & Pisoni, 2010; Ettlinger et al., 2013; Morgan-Short et al., 2014; Ullman, 2004, 2005). In particular, previous research has suggested that L2 learning is supported by procedural memory (Ettlinger, 2008; Morgan-Short et al., 2014); that procedural memory is an important component of artificial language learning (Reber, 1967); and that procedural memory is distinct from general intelligence (Cohen & Squire, 1980). Thus, procedural memory may be the common substrate for artificial language learning and L2 learning that is distinct from IQ. However, the fact that the inclusion of semantics may be an important part of a predictive artificial language learning paradigm suggests that there may be more than procedural learning involved.
Implicit statistical learning may also play an important role (Conway et al., 2010; Misyak et al., 2010b) as it has also been shown to be distinct from IQ (S. B. Kaufman et al., 2010). This is further supported by evidence showing a role for dopamine in second language learning (Wong, Morgan-Short, Ettlinger, & Zheng, 2012) and in more general implicit learning processes (Jocham et al., 2009).
Ultimately, there may be some unique learning mechanism (Hauser et al., 2002) or unique combination of general mechanisms (Pinker & Jackendoff, 2009) that is specific to acquiring linguistic systems. The present study provides no evidence to distinguish these possibilities, but understanding the interaction between general cognitive capabilities underlying artificial language learning and second language learning will provide insight into understanding human language learning as a unique ability (Hauser et al., 2002) or as an ability primarily shaped by domain-general cognitive function (Elman, Bates, Johnson, & Karmiloff-Smith, 1996).
The results of the current study provide evidence for a relationship between artificial language learning and second language learning. This suggests that previous research using artificial language learning experiments may provide insight into naturalistic language learning, particularly when the experiments include a semantic component and grammatical complexity. However, the full theoretical implications of this finding remain unclear given that the nature of this relationship is still unknown. Future research and larger, longitudinal studies can provide more insight into the specific cognitive components involved in artificial and natural language learning. These future studies can then address the question of whether artificial language learning experiments provide insight into language-specific learning abilities or whether performance is more a function of motivation, different memory subsystems, perceptual abilities, some other cognitive ability or, as is likely, some combination of these abilities.
Statement of Support:
This work was supported by National Science Foundation Grant BCS-1125144, the Liu CheWoo Institute of Innovative Medicine at The Chinese University of Hong Kong, the US National Institutes of Health grants R01DC008333 and R01DC013315, the Research Grants Council of Hong Kong grants 477513 and 14117514, the Health and Medical Research Fund of Hong Kong grant 01120616, and the Global Parent Child Resource Centre Limited to PCMW.