Recent research has shown that humans, like many other animals, have a specialization for assessing fighting ability from visual cues. Because it is probable that the voice contains cues of strength and formidability that are not available visually, we predicted that selection has also equipped humans with the ability to estimate physical strength from the voice. We found that subjects accurately assessed upper-body strength in voices taken from eight samples across four distinct populations and language groups: the Tsimane of Bolivia, Andean herder-horticulturalists and United States and Romanian college students. Regardless of whether raters were told to assess height, weight, strength or fighting ability, they produced similar ratings that tracked upper-body strength independent of height and weight. Male voices were more accurately assessed than female voices, which is consistent with ethnographic data showing a greater tendency among males to engage in violent aggression. Raters extracted information about strength from the voice that was not supplied from visual cues, and were accurate with both familiar and unfamiliar languages. These results provide, to our knowledge, the first direct evidence that both men and women can accurately assess men's physical strength from the voice, and suggest that estimates of strength are used to assess fighting ability.
Multiple converging lines of evidence indicate that the ability to fight has been a powerful selective force acting on human males over evolutionary time. Males have been shown to set thresholds for acceptable resource division (i.e. welfare trade-off ratios; see Tooby et al. 2008) based partly on physical strength: stronger men are more prone to anger (Archer & Thanzami 2007; Sell et al. 2009b). Stronger men also prevail more in conflicts of interest, and have a higher sense of entitlement to contested resources (Sell et al. 2009b). Stronger males are also rated as more physically attractive, have more sexual partners and lose their virginity at earlier ages than weaker men (Sell 2006; Frederick & Haselton 2007; Gallup et al. 2007). Most significantly, cross-cultural evidence indicates that a man's fighting ability is a powerful determinant of his access to resources in most, if not all, cultures (Daly & Wilson 1988). Documented cases include the Yanomamo of Venezuela (Chagnon 1983), the Achuar of Ecuador (Patton 2000), the Tsimane of Bolivia (von Rueden et al. 2008), the Dani of highland New Guinea (Sargent 1974), the Samoan islanders (Freeman 1983), Mexican gangs (Lewis 1961), the Montenegrins of Eastern Europe (Boehm 1984), the Inuit (Balikci 1970) and American gangs (Toch 1969). Finally, a large range of anatomical and physiological sex differences testify to an evolutionary past in which males frequently engaged in interpersonal aggression. For example, men mature later, have larger bodies (Plavcan & Van Schaik 1997), have higher basal metabolic rates (Garn & Clark 1953), larger hearts, better heat dissipation, more haemoglobin, more muscle, less fat and denser bones (Archer 2009; Lassek & Gaulin 2009).
Given the importance of men's strength and fighting ability in resolving conflicts and in female mate preferences, it would be surprising if both men and women did not have adaptive specializations for detecting relevant cues of these attributes. Such abilities have been routinely documented across a broad array of non-human species (for review, see Huntingford & Turner 1987; Krebs & Davies 1993). In fact, recent research has shown that both men and women can judge physical strength from photographs of conspecific males (Sell et al. 2009a); approximately 40 per cent of the variance in adult male strength can be detected from static photographs of the body, and 25 per cent from static photographs of the face alone.
Ancestrally, there would have been many occasions when the assessment of fighting ability from visual cues would have been problematic: e.g. distance, darkness and intervening physical obstructions (as when chimpanzee male coalitions call from neighbouring territories). Such conditions, if sufficiently recurrent, would have selected for supplementary assessment systems that did not rely on sight. Moreover, it seems probable that the voice conveys an array of different kinds of information relevant to conflict that is largely absent from a visual inspection of the body. For example, the vocal cords are regulated by the vagus nerve, which receives information from the viscera, and is also involved in the parasympathetic–sympathetic regulation of heart rate, blood pressure and other aspects of the so-called fight-or-flight system. The voice is likely to reveal short run changes in the maintenance or loss of fine-motor control compared with gross motor control (relevant to rage); parasympathetic/sympathetic system state; and the current tension of the individual's muscular system—itself an indicator of fear. As muscular tension goes up, pitch rises (Titze 1994). The voice may reveal longer term, more stable characteristics as well, such as anatomical relationships related to resonance, and the degree of exposure to testosterone during development. The vocal cords have androgen receptors, and sex differences in vocal fold size are surprisingly large (Titze 1994; Newman et al. 2000).
If the voice contains adaptively relevant information not available visually, then it seems likely that specializations evolved for extracting this information. For these reasons, it becomes important to ask whether there are non-visual cues of fighting ability that humans may have been designed to detect and respond to. A number of lines of evidence suggest that normal adult male speech may contain cues of formidability.
Measuring fighting ability in humans is ethically difficult and complicated by an evolutionary history of weaponry. Upper-body strength, however, has been shown to be a good proxy for fighting ability: it correlates with a history of aggression, self-reported success in conflicts and objective ratings of fighting ability by third parties viewing photographs of the targets (Sell et al. 2009a). An engineering analysis also shows that it is primarily upper-body strength that determines the hitting power of the weaponry available to human ancestors (Brues 1959).
In the eight studies presented below, we tested the following predictions derived from the theory that humans possess perceptual adaptations for extracting vocal cues of strength and fighting ability.
In addition to these primary predictions, we analysed the voice samples acoustically to determine the relationship of fundamental frequency (F0) and formant dispersion (Df) to voice ratings and physical strength and body size (i.e. height and weight).
Voice samples, body measurements and strength measurements were taken from eight groups of subjects (samples 1–8). For each sample, strength measures were z-scored and then averaged to create a single strength score for each subject that weighted each strength measure equally (Sell et al. 2009a).
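The composite strength score described above can be sketched as follows. This is a minimal illustration, not the authors' code: the measure names (`grip_kg`, `chest_kg`) and all values are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which form was used.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize raw measurements to mean 0 and s.d. 1."""
    m = mean(values)
    sd = pstdev(values)  # assumption: population s.d.; the source does not specify
    return [(v - m) / sd for v in values]

def composite_strength(measures):
    """Average the z-scored measures for each subject.

    `measures` maps a (hypothetical) measure name to a list of raw values,
    one per subject, all in the same subject order.
    """
    standardized = [z_scores(vals) for vals in measures.values()]
    # zip(*...) regroups the standardized values by subject.
    return [mean(subject) for subject in zip(*standardized)]

# Made-up values for three subjects; not data from the study.
example = {
    "grip_kg": [40.0, 50.0, 60.0],
    "chest_kg": [30.0, 45.0, 60.0],
}
scores = composite_strength(example)
```

Because every measure is standardized before averaging, a measure recorded on a wider raw scale does not dominate the composite, which is what "weighted each strength measure equally" requires.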
Sixty-three male psychology students from the University of California, Santa Barbara (UCSB), (mean age = 18.7, s.d. = 0.88, range: 18–22) were given course credit for participating.
Subjects were instructed to say in their normal speaking voice, ‘This is an experiment, over and out’. Voices were recorded using the built-in microphone on a portable audio cassette recorder (Sony CFD-S350; frequency range = 80 Hz–10 kHz). Across the eight samples, acoustic analyses were performed on recordings acquired in a variety of ways, many using relatively low sampling rates, and three using compression (samples 3, 5 and 7). Research has shown that F0 measurements are highly comparable across formats in the range of sampling rates used in the current studies, including recordings using lossy compression (mp3; e.g. Gonzalez et al. 2003; Deliyski et al. 2005).
In sample 1 only, subjects were asked to report how many physical fights they had engaged in during the last 4 years (M = 4.0, s.d. = 7.4) and were given the Physical Aggression subscale of the Buss–Perry Aggression Questionnaire (M = 2.5, s.d. = 0.75; Buss & Perry 1992). This scale contains items designed to measure the tendency to react with overt physical attacks such as, ‘If somebody hits me, I hit back’ and ‘Once in a while I can't control the urge to strike another person’.
Forty-nine adult male Tsimane Indians provided voice samples in their native Tsimane language and had their strength and body size measurements taken as part of a larger project (mean age = 35.8, s.d. = 13.5, range: 19–68; von Rueden et al. 2008). The Tsimane language resembles Moseten (the language of a Bolivian group similar to the Tsimane), but otherwise these two are distantly related to other South American languages.
Subjects were instructed to say in a normal speaking voice, ‘Nobi cojiro tsun quin dyem' venchuban aca'yaty anic fer no'bacni tsun’, which translates as, ‘We will cross the river and then arrive home; it was a tough crossing for us’. Voices were recorded with a portable digital recorder (iPod with Belkin Voice Recorder; frequency range = 500 Hz–12 kHz).
Twenty male Andean herder-horticulturalists from the villages of Gobernador Solá and Ingeniero Maury in the province of Salta, Argentina, provided voices and were measured on strength, height and weight (mean age = 34.8; s.d. = 19.1, range 15–71). Their voice samples were given in Spanish.
Subjects were instructed to speak in a normal speaking voice, ‘Cuando llueve se inundan las chacras, y la gente junta el maíz y prende fuego’, which translates to, ‘When it rains the ranches get flooded, and the people gather the maize and light fires’. Voices were recorded using a built-in microphone on a portable digital recorder (Olympus WS-100; frequency response: 100 Hz–12 kHz).
Fifty male students (sample 4) and 50 female students (sample 6) were taken from the student population at UCSB (males: mean age = 20.2, s.d. = 2.24, range: 18–31; females: mean age = 18.8, s.d. = 0.95, range: 18–22) and were paid $10 for their participation.
Subjects were asked to read the first sentence of the Rainbow Passage (Fairbanks 1960), ‘When the sunlight strikes raindrops in the air, they act like a prism and form a rainbow’. The recording was taken with the built-in microphone on a portable cassette recorder (Sony Pressman TCM-400DV; frequency range: 250–6300 Hz).
Forty-four male students (sample 5) and 30 female students (sample 7) from the University of Timisoara in Romania (males: mean age = 21.7, s.d. = 3.48, range: 20–38; females: mean age = 21.1, s.d. = 1.89, range 20–29) participated as part of a course requirement.
Height and weight measures were not available for these subjects.
Subjects were instructed to say in a normal speaking voice, ‘Iesi si taci’, which translates to, ‘Get out and be quiet’. Voices were recorded with the built-in microphone on a portable digital recorder (Olympus WS-100; frequency response: 100 Hz–12 kHz).
Fifty-four male students from the student population at UCSB (mean age: 19.9, s.d. = 2.0, range: 18–23) were brought to a sound studio on campus to give voice samples and have strength measurements taken. High-quality recordings are needed to get accurate Df measures, so this sample served primarily to determine the relationship between Df and voice ratings and body measurements.
Subjects were instructed to produce five vowels (‘a’ as in bait, ‘e’ as in beet, ‘i’ as in bite, ‘o’ as in boat and ‘oo’ as in boot). Accurate measures of Df require averaging across a range of vocal tract configurations. Voices were recorded digitally (44.1 kHz, 16 bit) using AKG C 414B-XL II microphones with the polar pattern set to Cardioid (frequency range: 20 Hz–20 kHz).
Undergraduates from UCSB were instructed to rate the voices on physical strength, height and weight using a 7-point scale. Types of ratings were done separately and randomly so that a given rater would rate all voices in a given sample on one variable at random (e.g. strength), then the same voices on the next variable (e.g. height) and then on the final variable (e.g. weight). Voices were also randomized within each block. Because height and weight were not available for samples 5 and 7, these voices were rated on physical strength only. For sample 1 only, raters also assessed the voices on ‘how tough he would be in a physical fight’. Each sample was rated by a different group.
Raters for sample 1. Fifty-three undergraduates (22 female, mean age: 20.7, s.d.: 2.8) were paid $5 to rate the 63 United States (US) male voices.
Raters for sample 2. Thirty-one undergraduates (10 female, mean age: 19.7, s.d.: 1.5) were paid $5 to rate the 49 male Tsimane voices.
Raters for sample 3. Thirty undergraduates (17 female, mean age: 18.8, s.d.: 1.1) were paid $5 or given course credit to rate the 20 Andean voices.
Raters for sample 4. Fifty-four undergraduates (42 female, mean age: 20.5, s.d.: 5.7) were paid $5 to rate the 50 US male voices.
Raters for sample 5. Twenty undergraduates (12 female, mean age: 19.1, s.d.: 1.0) were given course credit to rate the 44 Romanian male voices.
Raters for sample 6. Forty-seven undergraduates (30 female, mean age: 20.2, s.d.: 1.9) were paid $5 to rate the 50 US female voices.
Raters for sample 7. Twenty-one undergraduates (14 female, mean age: 19.0, s.d.: 0.9) were given course credit to rate the 30 Romanian female voices.
Raters for sample 8. Thirty-six undergraduates (25 female, mean age: 19.4, s.d.: 1.4) were paid $5 to rate the recordings of vowel sounds from the 54 US males.
All analogue (i.e. cassette recorded) speech samples were digitized (44.1 kHz, 16 bit) and resampled to 11.025 kHz with an anti-aliasing filter. Acoustic analysis was performed using PRAAT (v. 4.6.03). Average F0 was calculated on entire utterances using PRAAT's autocorrelation algorithm with a search setting of 75–500 Hz, as recommended for adult males, or 100–600 Hz for female voices. In sample 8, Df was calculated as [(F4 – F3) + (F3 – F2) + (F2 – F1)]/3, with separate calculations of all formants for each vowel, and those values averaged for the final Df value. Overall F0 was calculated by averaging F0 values across the five vowels.
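The Df calculation for sample 8 (the mean of the three gaps between adjacent formants, computed per vowel and then averaged across vowels) can be sketched as below. The function names and the formant values are illustrative assumptions; in the study the formants were measured in PRAAT, which this sketch does not reproduce.

```python
def formant_dispersion(formants):
    """Mean spacing (Hz) between adjacent formants, e.g. [F1, F2, F3, F4]."""
    gaps = [hi - lo for lo, hi in zip(formants, formants[1:])]
    return sum(gaps) / len(gaps)

def speaker_df(vowel_formants):
    """Average the per-vowel dispersion to get one Df value per speaker."""
    per_vowel = [formant_dispersion(f) for f in vowel_formants.values()]
    return sum(per_vowel) / len(per_vowel)

# Hypothetical formant frequencies (Hz) for two vowels; not measured data.
vowels = {
    "a": [700.0, 1200.0, 2500.0, 3500.0],
    "i": [300.0, 2300.0, 3000.0, 3600.0],
}
df = speaker_df(vowels)
```

The adjacent gaps telescope, so with four formants the per-vowel value reduces to (F4 – F1)/3. The intermediate formants therefore do not change the result, which is one reason averaging over several vowel configurations matters.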
All samples were analysed independently. In order to compare our results with previous studies that averaged ratings together and correlated them with body variables, we computed Pearson correlations between averaged voice ratings and their actual counterparts. This is the traditional technique for assessing the accuracy of voice ratings, but it provides ambiguous information about the accuracy of individual raters because a few raters with large correlations can cause sizeable aggregate correlations even if the average rater is not accurate at all. A more informative measure of average individual accuracy is provided by hierarchical linear regression models (HLM), which we used to compute the average within-rater slope of the relationship between voice ratings (the dependent variable) and actual body characteristics (the level 1 predictors; Raudenbush & Bryk 2002). For example, a regression slope relating ratings of strength to actual strength can be computed for each of the 53 raters in study 1. Hierarchical regression then computes a regression coefficient (γ) that represents the average of those 53 slopes and tests the null hypothesis that this average slope is zero. All variables were standardized (z-scored) before entry into the HLM, making the gamma statistic interpretable as a standardized regression coefficient. All HLM analyses controlled for sex of rater effects (all non-significant). All HLM calculations were performed with HLM 6.0 software from Scientific Software International, Inc.
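The contrast between the aggregate correlation and the average within-rater slope can be illustrated with a simplified two-stage calculation. This is only a sketch under stated assumptions. It is not the estimation procedure used by the HLM 6.0 software: it fits one standardized slope per rater (equal to the Pearson r when there is a single z-scored predictor), averages those slopes, and forms a one-sample t statistic, with no control for sex of rater. All ratings are made up.

```python
from math import sqrt
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def average_within_rater_slope(ratings_by_rater, actual):
    """Two-stage stand-in for gamma: the mean of the per-rater standardized slopes."""
    slopes = [pearson_r(actual, ratings) for ratings in ratings_by_rater]
    avg = mean(slopes)
    # One-sample t statistic for H0: the average slope is zero.
    t = avg / (stdev(slopes) / sqrt(len(slopes)))
    return avg, t, slopes

def aggregate_correlation(ratings_by_rater, actual):
    """Traditional approach: correlate the rater-averaged ratings with the actual values."""
    avg_ratings = [mean(col) for col in zip(*ratings_by_rater)]
    return pearson_r(actual, avg_ratings)

# Hypothetical data: five targets, three raters of differing accuracy.
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
raters = [
    [1, 2, 3, 4, 5],
    [2, 1, 4, 3, 5],
    [3, 3, 3, 2, 4],
]
avg, t, slopes = average_within_rater_slope(raters, actual)
agg = aggregate_correlation(raters, actual)
```

In this toy example the aggregate correlation exceeds the average within-rater slope, because averaging ratings across raters cancels idiosyncratic noise. The aggregate figure therefore says little about how accurate a typical individual rater is.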
So that our data could be compared with previous studies, we have reported the Pearson correlations between actual body characteristics and the average ratings of those characteristics from the voice on the left side of table 1. HLMs with ratings of a trait as the dependent variable and the actual measured trait as the level 1 predictor variable were used to estimate the average individual accuracy of strength, height and weight assessment from the voice and are reported in the three middle columns of table 1. These gammas represent the estimated accuracy for a random individual rater.
The results confirmed predictions (i) through to (v).
Yes. In sample 1, raters assessed ‘how tough he would be in a physical fight’ in addition to strength, height and weight. Perceptions of the targets' fighting ability were virtually identical to the perceptions of physical strength (r = 0.98), showing that—regardless of their relationship in the real world—raters were treating physical strength and fighting ability as synonymous. This same relationship was found when assessing fighting ability from photographs (Sell et al. 2009a).
Yes. In sample 1, the only sample for which the data were available, the average perceived fighting ability scores for target males were positively correlated with their fighting history (how many fights they reported having been in during the last 4 years), r = 0.36, p = 0.002, and their score on the Physical Aggression subscale of the Buss–Perry Aggression Questionnaire, r = 0.49, p = 0.0001. In other words, raters are detecting a cue in the voice that tracks actual measures of fighting behaviour. Although this finding suggests that they are tracking actual fighting ability, it ought to be treated cautiously because we do not know how many of these fights the targets won. Nonetheless, research has shown that more formidable individuals are those more likely to engage in fights (Archer & Thanzami 2007; Sell et al. 2009b).
Yes. To test prediction (iv), our strength measures for samples 1 and 4 were recomputed without the photograph rating component. The new strength measure was then predicted in a simultaneous regression analysis by both the average voice rating of strength and the average photograph ratings of strength. The results of the model showed independent contributions from both the photo ratings (sample 1: β = 0.52, p = 0.00001; sample 4: β = 0.50, p = 0.0002) and voice ratings (sample 1: β = 0.25, p = 0.022; sample 4: β = 0.27, p = 0.032) on actual strength. These results demonstrate that there are cues in the voice which predict strength that are not available during visual inspection of static photographs, and suggest that the system evolved to take advantage of this additional information.
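The logic of the simultaneous regression can be made concrete with the closed-form solution for two standardized predictors, in which each β is that predictor's correlation with the outcome adjusted for its overlap with the other predictor. The sketch below is illustrative only: the variable names and data are hypothetical, and no significance test is included.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def standardized_betas(y, x1, x2):
    """Standardized coefficients for y regressed on x1 and x2 simultaneously."""
    r_y1 = pearson_r(y, x1)
    r_y2 = pearson_r(y, x2)
    r_12 = pearson_r(x1, x2)
    denom = 1.0 - r_12 ** 2  # shrinks towards 0 as the predictors become collinear
    beta1 = (r_y1 - r_y2 * r_12) / denom
    beta2 = (r_y2 - r_y1 * r_12) / denom
    return beta1, beta2

# Hypothetical, deliberately uncorrelated predictors (not study data).
photo_rating = [-1.0, 1.0, -1.0, 1.0]
voice_rating = [-1.0, -1.0, 1.0, 1.0]
strength = [2 * p + v for p, v in zip(photo_rating, voice_rating)]
beta_photo, beta_voice = standardized_betas(strength, photo_rating, voice_rating)
```

A non-zero β for the voice rating with the photograph rating in the model is what indicates that the voice carries information about strength beyond what the photograph supplies.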
Because height, weight and strength are positively related to each other, the ability to assess strength from the male voice could be a by-product of height and weight detection. Alternatively, the ability to accurately estimate height and weight from the voice could be a by-product of a mechanism for strength assessment. Our prediction is that a mechanism designed by natural selection to assess formidability would extract cues of physical strength rather than just height or weight. This prediction stems from previous research showing that physical strength, more than height or weight, is related to anger and aggression (Sell et al. 2009a,b). To distinguish between these hypotheses, we constructed three HLM models, each with actual strength, height and weight predicting one of the three ratings. The independent accuracy of each of the three body measurements across six studies is shown on the right side of table 1. The results indicate:
The ability to estimate strength, in men at least, is not a by-product of height and weight assessment; the data are more consistent with the proposal that height and weight assessment are by-products of strength assessment.
Furthermore, raters seemed to be giving the same ratings regardless of whether they were asked to rate height, weight or strength. The inter-correlations between the average ratings of height, weight and strength are extremely high, especially for samples with more raters: Cronbach's α (sample 1 = 0.97; sample 2 = 0.85; sample 3 = 0.90; sample 4 = 0.98; sample 5 = 0.91, sample 8 = 0.94; all n = 3). In summary, regardless of whether you ask individuals to rate height, weight, strength or fighting ability, they will produce the same ratings, which track cues of physical strength independent of height and weight.
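Cronbach's α for a set of rating types (here n = 3: height, weight and strength) can be computed as in the following sketch. The function name and the ratings are hypothetical; each inner list holds the average rating of every target voice on one variable.

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's alpha for k items, each a list of scores over the same targets."""
    k = len(items)
    sum_item_var = sum(variance(item) for item in items)
    totals = [sum(scores) for scores in zip(*items)]  # per-target sum across items
    return (k / (k - 1)) * (1.0 - sum_item_var / variance(totals))

# Hypothetical average ratings of four voices on three variables (not study data).
height_ratings = [1.0, 2.0, 3.0, 4.0]
weight_ratings = [1.0, 2.0, 3.0, 4.0]
strength_ratings = [1.0, 2.0, 3.0, 4.0]
alpha = cronbach_alpha([height_ratings, weight_ratings, strength_ratings])
```

When the three sets of ratings order the voices identically, as in this toy case, α reaches its maximum of 1. Values near that ceiling are what would be expected if raters give essentially the same rating whichever variable they are asked about.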
No. In four of the six male samples, the stimulus subjects were asked to report their own physical strength (samples 1, 3, 4 and 5). The speakers rated their physical strength relative to other males, e.g. ‘I am stronger than ____% of others of my sex’. This self-report measure has been shown to correlate with actual lifting strength in previous studies (Sell et al. 2009a; electronic supplementary material). For each of the four studies in which self-report was available, a simultaneous regression analysis was performed, predicting the voice rating of strength from both self-reported strength and actual strength. In all four samples, self-reported strength did not account for any significant variance in voice ratings when actual strength was in the model: sample 1: actual strength β = 0.47, p = 0.002, self-reported strength β = −0.083, p = 0.57; sample 3: actual strength β = 0.47, p = 0.047, self-reported strength β = −0.05, p = 0.82; sample 4: actual strength β = 0.51, p = 0.005, self-reported strength β = 0.004, p = 0.98; sample 5: actual strength β = 0.53, p = 0.0005, self-reported strength β = −0.17, p = 0.22. In other words, actual strength determined how strong the subject's voice sounded to others, and not how strong the subject believed himself to be. This is the opposite of what would be predicted if men who believed themselves strong were modulating their voices and if this modulation were responsible for the relationship between physical strength and ratings of strength from the voice. This is not to say that there are no vocal cues of confidence or lack of fear that may be produced during aggressive interactions (we believe there are), but those cues are not responsible for the ability to assess strength from normal speaking voices.
F0 is the acoustic correlate of perceptual pitch and is the result of vocal fold vibration during speech. Studies generally show no relationship between F0 and physical size in adult males (see Evans et al. (2006) for a notable exception), but experiments that manipulate F0 suggest that humans might use F0 as a cue to strength. For example, men show more deference to a voice with lower F0 (Puts et al. 2006). Rendall et al. (2007) suggested three possibilities for people's misguided use of F0 in body size judgments. One is that listeners might be generalizing the between-speaker differences that do exist (e.g. women and children, on average, have higher pitched voices than adolescent and adult males). Second, listeners might be generalizing from other negative pitch–size relationships (i.e. larger things produce lower frequencies on average in the world). Last, listeners might be letting other F0 correspondences (e.g. F0 is related to testosterone exposure) intrude on pitch–body size judgments (see also Collins (2000) for related discussion).
To address these questions, all voice samples were analysed for F0, which was then correlated with average voice ratings and actual body measurements. Results are reported in table 2. F0 failed to reliably correlate with measures of strength. While F0 did tend to affect raters' estimations of strength (i.e. lower pitch correlated with perceptions of greater strength), this percept was erroneous. Recent work has explored this ‘perceptual pull’ that F0 exerts on judgments of size, and the results are consistent with ours (Rendall et al. 2007).
Df is the most reliable known acoustic index of body size in humans, as well as other primates. Df is the averaged distance between adjacent resonance frequencies resulting from the filtering of the vocal source by the supralaryngeal vocal tract. Fitch (1997) proposed that because vocal tract length (VTL) positively correlates with height, the distance between adjacent frequencies should provide a reliable index of body size (i.e. Df should negatively correlate with VTL). There is, however, only limited evidence that listeners can detect body size from this cue (Rendall et al. 2007). Rendall and his collaborators found that judges could discriminate between speakers with different Df but only when the actual height difference between presented pairs exceeded 10 cm. These data are consistent with other work showing that VTL just noticeable differences are approximately 4–7% in synthesized speech (Ives et al. 2005; Smith et al. 2005). Detecting formant spatial relationships is clearly difficult even for a system that probably has related specializations for perceiving formant information in other domains (e.g. formant transitions in vowel sounds). Nevertheless, there is a distinct possibility that humans respond to Df as a cue of physical strength.
High-quality acoustic recordings of vowels representing a range of vocal tract configurations are needed for accurate measurement of Df. Sample 8 was designed primarily to assess the relationship between Df and formidability and ratings of strength. The voice samples consisted of vowels produced by subjects in a sound studio. Df did not correlate with physical strength or weight (strength: r = −0.08, p = 0.58; weight: r = 0.04, p = 0.77), but did show a significant negative relationship with height (r = −0.36, p = 0.01). Like F0, Df showed significant negative relationships with voice ratings as well (strength: r = −0.45, p = 0.001; height: r = −0.51, p = 0.0001; weight: r = −0.37, p = 0.01). However, Df cannot be responsible for the accuracy of strength assessment as it is not correlated with strength. Additionally, when controlling for Df, ratings of strength are still accurate (partial r = 0.30, p = 0.03). Ratings of height become only marginally significant when controlling for Df (partial r = 0.25, p = 0.07), suggesting raters may be using Df in their height assessments. In summary, Df is a genuine cue of height that raters might be using when assessing height, but Df is not related to physical strength and is not the cue raters are using to accurately assess strength.
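The partial correlations reported above ask whether ratings still track a body measure once Df is held constant. A first-order partial correlation can be computed from the three pairwise correlations, as sketched below. The data are hypothetical and chosen only to make the arithmetic transparent, and the source does not state which software produced the partial r values.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def partial_r(x, y, z):
    """Correlation between x and y after removing the linear effect of z from both."""
    r_xy = pearson_r(x, y)
    r_xz = pearson_r(x, z)
    r_yz = pearson_r(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))

# Hypothetical values for four speakers (not study data).
rated_strength = [1.0, 2.0, 3.0, 4.0]
actual_strength = [1.0, 3.0, 2.0, 4.0]
df_values = [1.0, -1.0, -1.0, 1.0]  # constructed to be uncorrelated with both
r_partial = partial_r(rated_strength, actual_strength, df_values)
```

In this constructed case the control variable is unrelated to both of the others, so the partial correlation equals the raw correlation. When the control variable does carry the relationship, the partial correlation falls towards zero, which is the pattern reported for height ratings but not for strength ratings.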
For the moment, the acoustic variables that people use to assess physical strength are unknown.
These findings support the hypothesis that the human voice—especially the male voice—contains cues of physical strength, and that natural selection built specializations into the human neurocognitive architecture designed for assessing fighting ability based on these cues. These specializations extract information not available from the visual channel alone and cues that are uniquely predictive of upper-body strength rather than height and weight. These results complement and support earlier work, indicating that fighting assessment from visual channels focuses on predicting upper-body strength in men (Sell et al. 2009a).
These results offer an explanation as to why a 30 year research programme examining the ability of raters to detect height and weight from the voice has returned such inconsistent findings. From an evolutionary point of view, determining a conspecific's height or weight had no clear function in itself. On the other hand, determining physical strength was vitally important. Consistent with this, ratings tracked physical strength independent of both height and weight across all samples. However, raters were not able to consistently extract unique cues of weight or height.
Both theoretical analyses and evidence from other species indicate that natural selection would favour cognitive mechanisms that accurately assess fighting ability. The studies presented here provide direct evidence that both men and women can accurately assess adult men's physical strength from cues in the speaking voice. Furthermore, the cues are not solely cues of size but instead appear to track correlates of upper-body strength more directly and in ways that do not require experience with a particular language. Taken together, the results strongly support the hypothesis that the human cognitive architecture contains specializations for formidability assessment through auditory channels.
The studies presented herein were approved by the Human Subjects Committee ethics board at the University of California, Santa Barbara.
We thank Christina Larson, June Betancourt, Mahsa and Golsa Afsharpour, Lauren Click, Andy Delton, Max Krasnow, Jim Roney, Roxana Toma, Silviu Nisu and Phillip Smith for their valuable insights and assistance. We would also like to thank Professor P. Petrescu of the University of Timisoara, Norma Naharro, Mirta Santoni, the head of the Museo de Antropología de Salta, and the villagers of Gobernador Solá and Ingeniero Maury as well as the Tsimane people. We thank Kevin Kelly for donating his time and expertise with the UCSB sound studio. Financial support for this project was provided by an NIH Director's Pioneer Award to L.C. and the UCSB Pre-Dissertation Field Site Visit Grant awarded to D.S.