Joint action as a general form of coordinated social action begins with the earliest relationship. Affective entrainment in social contexts seems to be a predisposition of the infant and mother (Trehub, 2003), as in the “intuitive parenting” that helps to regulate the infant’s emotional state and guide learning in his pre-verbal environment (Papoušek, 1996b). One of the highlighted aspects of the mother–infant interaction is its musical nature. Musical qualities of typical infant-directed speech include slow and regular tempo, repetition, exaggerated prosody, and the accompaniment of movement and gesture. These features may serve as coordination smoothers, as well as functioning to engage the infant’s attention (Fernald, 1991), to enable emotional communication (Trainor et al., 2000; Trainor, 2008), and to promote language acquisition (Papoušek, 1996a; Kuhl, 2004).
The infant is responsive to such expressive musical characteristics in the mother’s (or other caretaker’s) displays, and also participates in “communicative musicality” (Malloch and Trevarthen, 2009; the infant’s behavior has also been referred to as “protomusical,” e.g., Cross, 2003), for example, in predicting the timing of the mother’s expressions and producing responses that are to some degree temporally coordinated (e.g., Crown et al., 2002). Jaffe et al. (2001) define temporal coordination as a form of interpersonal contingency, and they describe interactional (non-periodic) rhythms that allow an infant and parent each to predict the timing pattern of the other’s behavior. This ability to predict timing is necessary in order to eliminate the time lag of a sensorimotor reaction (Merker et al., 2009), and is considered to be essential to bonding within the dyad (Jaffe et al., 2001). In a common game played by adults to engage infants, violation of expectation in timing is used to make the infants laugh, which is further evidence of infants’ capacity for predictive timing (Stern, 2007). Infants’ production of musical actions, as in sung tones or rhythmic movements, begins to emerge during the first year (perhaps especially with encouragement) and can be considered in the context of the communication of the pre-verbal infant (Trehub, in press). Yet before full musical actions emerge, infants’ responsiveness to musicality and the timing of their responses may build upon a repertoire of non-verbal vocal and gestural expressions (Eckerdal and Merker, 2009).
Infants thus engage in a kind of “sympathetic conversation” with their mothers, the timing of which enables them to anticipate and relish the mothers’ expressive displays, and causes them distress if timing is mismatched (see Trevarthen and Aitken, 2001). Temporally coordinated actions can be simultaneous, dovetailing, or alternating (Feldman, 2007), though these interactions are not typically strictly periodic, and are stochastic and bidirectional in organization (Cohn and Tronick, 1988). When mother–infant interactions take the form of chorusing, they have been referred to as “coaction” (Dissanayake, 2000). In this context, the behaviors that infants display with their caregivers include social gaze, facial expressions, and vocal behaviors (by 3 months of age), as well as gesture and shared attention to objects (after 6 months of age; Feldman, 2007). The roughly unison or matched forms of interaction between infants and their caregivers might be attributed to motor resonance (e.g., Meltzoff and Decety, 2003), and reflect emergent coordination behavior. The function of this emergent coordination is to establish affective entrainment, cooperation, and bonding between the infant and her caregiver (Feldman, 2007; Feldman et al., 2007).
Turn-taking in infancy has been studied primarily by observing infants as they watch spoken conversation, and this work indicates the importance of the development of social cognition and attention. Infants shift gaze or attention as they observe videos of adults engaged in conversational turn-taking (von Hofsten et al., 2005), and gaze shifts become increasingly predictive with age (Bakker et al., 2011). By 4–6 months infants are sensitive to cues of social cognition (such as selective attention to face-to-face interactions) in turn-taking (Augusti et al., 2010), which coincides with the development of their ability to use infant-directed speech cues to choose their preferred social partners (Schachner and Hannon, 2011). Before semantics and syntax are shared between conversational partners, cues to turn-taking are manifest in culturally dependent vocal prosody, eye gaze, and body movements (Wilson and Wilson, 2005), but with a universal target of timing in turn-taking (Stivers et al., 2009). A further cognitive skill in the interpretation of conversational turn-taking is anticipating transitions in conversation (i.e., when the next speaker will begin to speak), which does not develop until around 3 years of age and might be influenced by language development (von Hofsten et al., 2009). This anticipatory timing skill might also correspond to improvements in planned coordination of actions (Figure ).
Spatiotemporal imitation appears in the repertoire of young infants, for example in facial (see Meltzoff and Moore, 1997) and manual gestures (Bekkering et al., 2000), as well as in affective mirroring, beginning with the first social smiles at just 6 weeks of age (Rochat, 2007); these are all examples of emergent coordination (Figure ). The timing of imitation is limited by cognitive–motor maturity but is taught in part by caregivers to young children, often via imitation games aimed at encouraging joint action (Gergely et al., 2002; Sebanz et al., 2006; Papoušek, 2007). In early stages imitation is automatic and relies primarily on motor resonance (Paulus et al., 2011), even more strongly in practiced behaviors such as crawling than in novel behaviors such as walking (van Elk et al., 2008). While many actions of young infants may rely on resonance, according to von Hofsten (2004), even from birth such actions are not mere reflexes but can be motivated, informed, and goal-directed. Evidence for this claim includes infants’ interest in tracking and imitating the purpose and the outcomes of observed actions (e.g., von Hofsten and Siddiqui, 1993; Gergely et al., 2002; Gergely and Csibra, 2003). The goal-directed nature of infants’ actions, revealing planning, prediction, and motor representations, could enable the progression (with muscular and especially cognitive development) from resonant imitation to more complex forms of joint and complementary joint action.
In the earliest joint actions that are performed by infants with adults, the adult typically helps the infant to achieve her goal, in which case a precise motor representation is not required (Vesper et al., 2010). By around 1 year of age, several cognitive changes facilitate joint action. First, joint attention has emerged (see Rochat, 2007), in which individuals knowingly attend to the same object or event. This skill coincides with monitoring of relative shared attention between the infant and his social partner (Rochat, 2007). Second, the 1-year-old understands intentions, which has been argued (Tomasello et al., 2005) to be the basis for understanding beliefs (i.e., theory of mind), an ability that emerges around 15 months (Onishi and Baillargeon, 2005) and continues to develop with experience with language and shifting of perspective (Tomasello et al., 2005). Third, at 1 year action planning is evident, as infants show goal-directed eye movements that reflect motor representations (Falck-Ytter et al., 2006). These motor representations are thought to be supported by the mirror neuron system: a brain network that is recruited similarly during action perception and action execution (Falck-Ytter et al., 2006), and is thought to be involved in the interpretation of music and dance in adults (Stevens et al., 2001; Zatorre et al., 2007; Overy and Molnar-Szakacs, 2009).
Beyond the first year emerge the abilities to interpret goal-directed actions in a rational manner (e.g., Gergely et al., 2002), suppress imitative motor representations, and eventually perform complementary joint actions (Sebanz et al., 2006). Motivation for these changes stems in part from “shared intentionality,” or the sharing of psychological states in order to reach mutual collaboration (Tomasello and Carpenter, 2007). Planned coordination calls upon the more advanced cognitive–motor skills of precise (and shared) sensory and motor representations, action monitoring, and behavioral modification (cf. Keller, 2008). Presumably, concurrent with the maturation of the above-mentioned cognitive and motor skills (hence less reliance on scaffolding by adults) as well as language development (e.g., von Hofsten et al., 2009), the practice of musical activities during childhood may continue to engage attention and foster coordination skills. For example, prioritized integrative attending, in which musical ensemble performers monitor the aggregate sounds (which children’s choirs can do to an extent), presumably builds upon the earlier abilities for joint attention and dividing attention between actors, once a shared goal representation is also established. Between the ages of 2.5 and 3 years, the skills of coordination (timing) in joint action show substantial improvement even if individual performance on a task improves only marginally (Meyer et al., 2010). Complementary roles in joint action appear to be mastered from the age of 3 years (similar to linguistic turn-taking; see von Hofsten et al., 2009), as action planning and control become refined (Meyer et al., 2010). Improvisation in social contexts may have a role in the building of planned coordination capacities upon emergent coordination capacities. For example, turn-taking has been observed in children’s vocal play, even in improvised and complementary forms (Dissanayake, 2000), and such vocal play is expressed in virtual social contexts, as in imaginary dialogs and play-acting (see Rochat, 2007).
To achieve temporal entrainment in music, it may be necessary to practice the skill of sensorimotor synchronization, that is, the ability to move in time with perceived external events (Repp, 2005). For example, the clapping games, nursery rhymes, songs, and group dances of school-age children begin to demonstrate the kind of coordination that is required to keep time as a group (e.g., Provasi and Bobin-Begue, 2003), with an external periodic pulse (McAuley et al., 2006), and using multiple levels of metrical hierarchy (Drake et al., 2000). Such music and games, which represent collective social entrainment (Phillips-Silver et al., 2010), may also foster the development of ensemble playing, and play an important role in the improvement of automatic and deliberate adaptive timing skills, attention, and auditory–motor imagery (Keller, 2008).
The ability for precise synchronization seems to mature gradually, probably building on early perceptual abilities for processing the musical beat (Hannon and Trehub, 2005; Phillips-Silver and Trainor, 2005; Winkler et al., 2009). In studies of synchronization of body movement with music or with a musical partner (in children between the ages of 5 months and 5 years), the children’s motions, or the sounds produced by them, are not tightly synchronized (phase-locked) to the musical beat (Eerola et al., 2006; Kirschner and Tomasello, 2009; Merker et al., 2009; Zentner and Eerola, 2010). This suggests that sensorimotor synchronization in music is not typically developed until sometime later in childhood or near adolescence (Merker et al., 2009; although cases of exceptional musical performances by children can suggest otherwise, e.g., Merker et al., 2009; Sowinski et al., 2009).
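How tightly a series of movements is phase-locked to a beat can be expressed numerically. The following is a minimal sketch using circular statistics (the mean resultant length of the movement phases). It is an illustration only, not the analysis used in any of the studies cited above. It assumes an isochronous beat with a known period and onset, and the function name `phase_locking` and its parameters are hypothetical.

```python
import math


def phase_locking(tap_times, beat_period, beat_onset=0.0):
    """Return (r, mean_phase) for taps relative to an isochronous beat.

    r is the mean resultant length: close to 1 when taps fall at a
    consistent phase of the beat cycle, close to 0 when tap phases are
    spread evenly around the cycle. mean_phase is in radians, where 0
    means "on the beat". All names here are illustrative assumptions.
    """
    if not tap_times:
        raise ValueError("tap_times must not be empty")
    if beat_period <= 0:
        raise ValueError("beat_period must be positive")

    # Map each tap onto the unit circle: one beat cycle equals 2*pi.
    angles = [
        2.0 * math.pi * (((t - beat_onset) / beat_period) % 1.0)
        for t in tap_times
    ]
    n = len(angles)
    mean_cos = sum(math.cos(a) for a in angles) / n
    mean_sin = sum(math.sin(a) for a in angles) / n

    r = math.hypot(mean_cos, mean_sin)
    mean_phase = math.atan2(mean_sin, mean_cos)
    return r, mean_phase


if __name__ == "__main__":
    # Taps (in seconds) that lag a 0.5 s beat by a constant 50 ms.
    r, phase = phase_locking([0.05, 0.55, 1.05, 1.55], beat_period=0.5)
    print(f"r = {r:.3f}, mean phase = {phase:.3f} rad")
```

On this measure, a constant lag behind the beat still counts as tight phase-locking, because the phase is consistent from cycle to cycle; loose synchronization shows up as a low value of `r`, not as a nonzero mean phase.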
Practice of the affective component of entrainment in joint action is natural, as infants and young children show a predisposition to “groove” to music in social contexts, that is, they are compelled to move to the music and derive pleasure from it (Janata et al., 2012). Infants and toddlers display a variety of dance gestures in response to music (Eerola et al., 2006), and they produce spontaneous dance motions more to music than to speech (Zentner and Eerola, 2010). The social component is clear in that infants’ and toddlers’ dancing is associated with positive affect (Zentner and Eerola, 2010), and young children’s musical drumming is enhanced in a social context (Kirschner and Tomasello, 2009). The development of action simulation in childhood could further facilitate the understanding of others’ affective states (Decety and Grèzes, 2006) and complementary joint action as in musical exchange (Kirschner and Tomasello, 2009), especially when the roles are truly complementary.