Despite decades of interest in moral character, comparatively little is known about moral behavior in everyday life. This paper reports a novel method for assessing everyday moral behaviors using the Electronically Activated Recorder (EAR)—a digital audio-recorder that intermittently samples snippets of ambient sounds from people's environments—and examines the stability of these moral behaviors. In three samples (combined N = 186), participants wore an EAR over one or two weekends. Audio files were coded for everyday moral behaviors (e.g., showing sympathy, gratitude) and morally-neutral comparison language behaviors (e.g., use of prepositions, articles). Results indicate that stable individual differences in moral behavior can be systematically observed in daily life, and that their stability is comparable to the stability of neutral language behaviors.
“Living a moral, constructive life is defined by a weighted sum of countless individual, morally relevant behaviors enacted day in and day out (plus an occasional particularly self-defining moment).”
Morality has received a great deal of attention from psychologists in recent years. However, little of this work has examined moral behavior in naturalistic, “real-world” contexts. As such, the present study aims to establish a novel, reliable method for objectively and unobtrusively measuring moral behaviors that are observed in ordinary, everyday settings, and to use this method to examine the stability of individual differences in moral behaviors.
To place the current work into context, we highlight important gaps in the existing literature on morality. First, while classic social psychological research (e.g., Darley & Batson, 1973; Milgram, 1974) examined overt behavior, modern research has largely focused on moral cognition and emotion. Psychology has lately seen a surge of research on moral decision-making and the cognitive and emotional factors that influence moral judgments (Graham, Meindl, & Beall, 2012; cf. Aquino & Freeman, 2009; Schwitzgebel, 2009), but little contemporary work has examined overt moral behaviors, especially frequent, everyday moral acts (as opposed to exceptional moral acts).
To the extent that moral behavior has been studied, the research relies heavily on self-reported and laboratory-based measures. This is appropriate for research on moral identity, values, and judgments, but is problematic for studying moral behavior. People, on average, view themselves in a positive light (Alicke & Sedikides, 2009) and are especially likely to have distorted self-views for traits and behaviors that are highly evaluative (i.e., positively or negatively valenced; Vazire, 2010). Moral behaviors are arguably among the most evaluative behaviors (Goodwin, Piazza, & Rozin, 2013; Wojciszke, Bazinska, & Jaworski, 1998), which raises concerns about the accuracy of self-reports. Thus, although both self-views and behaviors are important to study and understand, self-reports of behavior are an inadequate substitute for measuring actual moral behavior (Graham, 2014).
At the same time, the studies that have directly assessed moral behavior have mostly taken place in staged laboratory environments (e.g., Batson, Kobrynowicz, Dinnerstein, Kampf, & Wilson, 1997; Zhong, Bohns, & Gino, 2010; cf. Bateson, Nettle, & Roberts, 2006; Schwitzgebel, 2009). This methodology is insufficient for examining individual differences in moral behavior because people's laboratory behavior may not adequately reflect everyday behavior (Graham, 2014). Recent work has begun to explore morality in more natural contexts (Hofmann, Wisneski, Brandt, & Skitka, 2014), but this work frequently relies on self-reports of behaviors. To begin developing a more complete understanding of everyday moral functioning, the present study seeks to establish a reliable method for objectively observing moral behaviors outside the laboratory.
The existing literature emphasizes the variability of morality (Graham et al., 2012; Hartshorne & May, 1928), and how even subtle situational manipulations influence moral actions (Blanken, van de Ven, & Zeelenberg, 2015; Darley & Batson, 1973; Doris, 2002). Although this emphasis has sparked important research, relatively little of this work directly addresses the stability of individual differences in moral behavior. Where individual differences in morality have been examined, the focus has usually been on differences in moral perceptions and values (e.g., moral foundations; Graham, Nosek, Haidt, Iyer, Koleva, & Ditto, 2011), rather than actual moral behavior. To fill this gap, we examine the temporal stability of individual differences in actual, naturalistically observed, moral behaviors.
We present a method for objectively measuring everyday moral behaviors and examine the degree to which individual differences in these behaviors are stable across context and time. We use repeated observations in natural contexts to examine the consistency of moral behaviors—that is, whether people who act in more morally desirable ways than others at one time are also likely to do so at another time. Our goal is to provide evidence for the viability of a new naturalistic method for studying actual, everyday moral behavior, as well as evidence about the degree to which individual differences in moral behavior are stable.
Our method employed the Electronically Activated Recorder, or EAR, a pocket-sized, wearable device that intermittently records short “sound-bites” of the wearer's audible environment, allowing researchers to unobtrusively capture ambient sounds from people's moment-to-moment lives (Mehl, Robbins, & große Deters, 2012). This method allows us to objectively assess actual behaviors, addressing calls to reemphasize the study of behavior in personality and social psychology (Baumeister, Vohs, & Funder, 2007; Furr, 2009). Although we are not able to assess all moral behaviors with this method, such as grand acts of heroism and self-sacrifice, this method does allow us to measure what is perhaps the most common form of morality (Hofmann et al., 2014): everyday, moral behaviors with a prosocial (or anti-social) focus. Additionally, the EAR enables the collection of representative samples from the full spectrum of participants’ daily lives over several days, maximizing the generalizability and ecological validity of research findings (Brunswik, 1956) and allowing us to capture patterns of behavior that are more likely than single instances to reflect individual differences in moral personality.
Following previous work on the stability of personality and self-reported moral constructs, we predicted that individual differences in moral behavior would be relatively stable over time (exhibiting moderate effect sizes of r = .30 to .50). This prediction is based on the test-retest stability of personality traits such as agreeableness and conscientiousness (Fleeson, 2001; Roberts & DelVecchio, 2000), which are related to behaving morally (Cohen, Panter, Turan, Morse, & Kim, 2014; Matsuba & Walker, 2004). Furthermore, studies examining explicitly moral constructs have shown that the rank-order stability of individual differences in moral judgments is relatively high (Bollich, Hill, Harms, & Jackson, 2015; Graham et al., 2011). Although most of this work relies on self- or other-reports of traits rather than observed behavior, it nevertheless provides grounds for predicting stable individual differences in moral behavior. The methodology of the present study enables us to directly test this prediction.
We report how we determined our sample sizes, all data exclusions, and all relevant measures in the study. We used data from three samples, for a total of 186 participants1. Sample 1 consisted of 11 rheumatoid arthritis patients (11 women; Mage = 56.38, SDage = 13.32; for more information on this sample, see Robbins, Mehl, Holleran, & Kasle, 2011). Sample 2 consisted of 73 adults participating in a randomized controlled trial of the effects of a meditation intervention on healthy adults (47 women, 26 men; Mage = 32.16, SDage = 7.99; Raison, 2014). Sample 3 included 102 adults consisting of 52 women with breast cancer undergoing adjuvant cancer treatment (Mage = 56.16, SDage = 13.95) and their co-habitating partners (7 women, 43 men; Mage = 59.41, SDage = 14.61; for more information on this sample, see Robbins, Lopez, Weihs, & Mehl, 2014). Sample sizes were determined by the availability of resources and preexisting data. These sample sizes provided 80% power to detect effect sizes of at least r = .69 for Sample 1, at least r = .32 for Sample 2, and at least r = .28 for Sample 3. (Data from the three samples can be accessed at https://osf.io/xpqhw/.) For some analyses, the three samples were combined into one dataset—we note below when this is the case.
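The minimal detectable effect sizes reported above can be approximated with a standard Fisher z power calculation. The sketch below is illustrative only: the paper does not state which power routine was used, and the Fisher approximation is rough at very small n (for Sample 1, n = 11, it yields roughly r = .76 rather than the reported .69, which suggests an exact small-sample method was used there).

```python
import math
from statistics import NormalDist


def min_detectable_r(n, alpha=0.05, power=0.80):
    """Smallest correlation detectable with the given power at a
    two-tailed alpha, using the Fisher z approximation:
    atanh(r) * sqrt(n - 3) = z_{alpha/2} + z_{power}."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_beta = z.inv_cdf(power)           # quantile for desired power
    return math.tanh((z_alpha + z_beta) / math.sqrt(n - 3))


# Samples 2 and 3 closely reproduce the reported values (~.32 and ~.28)
print(round(min_detectable_r(73), 2), round(min_detectable_r(102), 2))
```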
All participants wore the Electronically Activated Recorder (EAR; Mehl et al., 2012), a small electronic recording device that turns on intermittently and records sound-bites from participants’ daily lives over the course of the study. The EAR consists of a mobile device (HP iPAQ 110 or Apple iPod touch) and a recording application (custom software for the HP device and an iTunes app for the iPod touch). Participants in Sample 1 wore the EAR on two weekends four weeks apart, and it recorded 50 s every 18 min (average number of valid files with audible speech = 101, SD = 43). Participants in Sample 2 wore the EAR on two weekends about 10 weeks apart, and it recorded either 50 s every 9 min or 30 s every 12 min (average number of valid files with audible speech = 137, SD = 58). Participants in Sample 3 wore the EAR on one weekend, and it recorded 50 s every 9 min (average number of valid files with audible speech = 78, SD = 36). In all three samples, participants were informed their files would be coded for a broad set of daily behaviors, but moral behaviors were not specifically mentioned.
In total, 19,063 EAR files containing audible speech were coded and transcribed by trained research assistants. For Samples 1 and 2, each research assistant who coded the first weekend files for a given participant also coded that same participant's files from the second weekend. For all files in which participants were talking, coders coded each file for a set of positive and negative moral behaviors.
The first step in this research was determining what moral behaviors could be coded from the EAR data. We sought to identify behaviors that are both acoustically detectable and occur with some regularity in everyday life. This necessarily excluded rare behaviors (e.g., rescuing a drowning child) and emphasized interpersonal behaviors (because they are more common and more likely to be acoustically detectable), while excluding solitary behaviors without a readily audible component (e.g., cheating on a test). In addition, because we had only the auditory channel and not other channels of information (e.g., visual) available, we were limited to verbal behaviors (i.e., words) as opposed to physical acts (e.g., gestures). However, many everyday moral acts are expressed in words (e.g., apologizing, criticizing), and so although more channels of information would clearly have provided added value, we arguably had the most important single channel (auditory) for detecting a broad range of everyday moral behaviors.
We sought to cover a range of everyday moral behaviors whose presence (e.g., showing sympathy) or absence (e.g., acting condescending) is indicative of moral conduct. Although moral conduct includes more than just prosocial (or antisocial) actions, recent research shows that prosocial behaviors (i.e., behaviors related to harm or care) are the most commonly self-reported type of moral behavior in daily life (Hofmann et al., 2014), so moral behavior was defined as behavior with a prosocial (or antisocial) focus. Using these three criteria (audibly detectable, occurring regularly, and prosocial), we selected 14 categories of moral behavior (e.g., showing affection, showing gratitude, praising or complimenting, acting condescending or arrogant, criticizing others) that the authors determined would be sufficiently audible and common, and which pilot testing suggested could be reliably assessed. Table 1 provides a complete list of the 14 behavioral categories and examples from actual EAR recordings (more detailed information on considerations around the development of an EAR coding system and the implementation of EAR behavior coding are provided in the online Supplemental Material).
All sound files in which the participant was talking were coded by three independent coders for the presence or absence of all 14 moral behaviors. Any one file could contain more than one kind of moral behavior (e.g., a participant might apologize and show gratitude in one file). This behavior counting approach to EAR coding complements existing behavioral observation studies that are based on validated behavior rating systems (Funder, Furr, & Colvin, 2000), and yields data that are based on non-arbitrary, and intuitively interpretable metrics (i.e., the number of times a behavior was displayed or the percentage of interactions in which a behavior was present), thereby facilitating effect size calibration and interpretation (Blanton & Jaccard, 2006).
How much temporal stability is necessary to say that a behavior is stable? To provide a benchmark against which to compare the stability of individual differences in moral behaviors, we computed the stability of individual differences for a set of neutral behaviors. To select these neutral behaviors, “morally empty” language variables were chosen that matched the moral behaviors in terms of base rate. The categories we chose were: articles (e.g., a, the), prepositions (e.g., of, between), adverbs (e.g., around, here) and references to space (e.g., above, near), time (e.g., early, bye), and numbers (e.g., first, five).
These language variables provide a particularly strong test for several reasons. First, language can be coded objectively and reliably from EAR files (Mehl & Pennebaker, 2003), which means that stability estimates of language will not be attenuated due to unreliability. Second, the selected language variables are maximally evaluatively neutral—that is, they have minimal inherent positive or negative connotation (e.g., article and preposition use)—and thus are (largely) theoretically and empirically independent of the examined moral behaviors (average |r| = .11). Finally, these neutral language behaviors are not likely to be subject to strong self-presentational effects (i.e., participants are not likely to intentionally vary their use of prepositions from one context to another for self-presentational reasons). For these reasons, we expected individual differences in the neutral language behaviors to be reliable and stable, and thus a high benchmark for judging the stability of moral behaviors.
Trained coders transcribed all of participants’ utterances contained in their EAR files. Specifically, one coder transcribed the conversations while coding other behaviors, and subsequent coders proofread transcripts while coding for other behaviors.
In all analyses, we only used EAR files with audible speech by the participant, as these are the only files in which participants could be (audibly) performing the moral and language behaviors of interest to the present study.
We calculated the inter-rater reliability for each of the moral behaviors separately in each sample and with all three samples combined2. For each participant, we averaged each coder's ratings of each moral behavior across all of a participant's EAR files. This provided us three average ratings per participant (one from each coder) for each of the moral behaviors we examined. Table 2 (columns 4, 7, 10, and 13) shows the inter-coder reliability (ICC[1,3]) for each moral behavior.
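The reported reliability index, ICC(1,3), is the one-way random-effects intraclass correlation for the average of three coders. A minimal sketch of this computation (the function name and the toy ratings are illustrative, not from the study):

```python
import numpy as np


def icc_1k(ratings):
    """One-way random-effects ICC for the average of k raters, ICC(1,k).

    ratings: 2-D array of shape (n_participants, k_coders); each row holds
    one participant's mean rating of a behavior, one column per coder.
    """
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand_mean = ratings.mean()
    row_means = ratings.mean(axis=1)
    # Between-participants and within-participants mean squares
    ms_between = k * np.sum((row_means - grand_mean) ** 2) / (n - 1)
    ms_within = np.sum((ratings - row_means[:, None]) ** 2) / (n * (k - 1))
    return (ms_between - ms_within) / ms_between


# Three coders in perfect agreement yield an ICC of 1.0
print(icc_1k([[1, 1, 1], [2, 2, 2], [3, 3, 3]]))
```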
Verbatim EAR transcripts were analyzed using the Linguistic Inquiry and Word Count text analysis program (LIWC; Pennebaker, Francis, & Booth, 2007). This program analyzed participants’ word use and calculated, for each participant, the percentage of total words spoken that fell into particular word categories (e.g., prepositions). We selected word categories that (a) are evaluatively neutral (i.e., have no or minimal positive or negative connotation), and (b) had relatively similar base rates to the moral behaviors (using base rates from all samples combined; Table 2, column 11). These categories were: articles, prepositions, adverbs, space, time, and numbers.
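At its core, this word-count approach reduces to dictionary lookup: count how many of a speaker's words fall into each category and divide by the total word count. A simplified sketch (the mini-dictionary below is a hypothetical stand-in; the real LIWC dictionaries are proprietary and far larger, and LIWC also handles word stems and punctuation):

```python
# Hypothetical stand-in for LIWC category dictionaries
CATEGORIES = {
    "articles": {"a", "an", "the"},
    "prepositions": {"of", "in", "between", "with", "to"},
}


def category_percentages(transcript):
    """Percentage of total words spoken that fall into each category."""
    words = transcript.lower().split()
    total = len(words)
    return {
        cat: 100.0 * sum(w in vocab for w in words) / total
        for cat, vocab in CATEGORIES.items()
    }


print(category_percentages("the dog sat between a chair and the table"))
```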
We took two approaches to measuring the stability of moral behaviors and language use. First we assessed rank-order stability, which we were able to test using participants who wore the EAR on two separate weekends (i.e., Samples 1 and 2). We averaged each moral and language behavior for the first weekend and the second weekend separately. This gave each participant two composite scores for each of the behaviors, which were then correlated with each other to obtain a measure of rank-order stability.
We also considered an alternative form of temporal stability: momentary stability. Using a method similar to Epstein's early work on trait and behavior stability (1979, 1980), we grouped EAR files into odd-numbered and even-numbered files—that is, a person's first sound file was odd, her second sound file was even, and so on. We then averaged behaviors within the odd files and within the even files. This gave each participant two composite scores for each of the moral and language behaviors, which were then correlated with each other to measure momentary stability, which we calculated for Samples 1-3. Because Sample 3 consisted of romantic partners, we accounted for the dependency between couple members by using the Actor Partner Interdependence Model (APIM; Kenny, Kashy, & Cook, 2006). Parameter estimates from structural equation models were constrained across partners after establishing that there were no differences between breast cancer patients and their partners on the behaviors examined. Effect estimates were standardized for interpretability and comparison with estimates from Samples 1 and 2.
An important feature of these two methods for assessing stability is that we are aggregating across multiple observations (e.g., each of the odd vs. even and first weekend vs. second weekend aggregates contains numerous EAR codings per participant). By aggregating behaviors spread over several days, we are able to reduce measurement error and improve the reliability of the measures of behavior (Epstein, 1979). Including multiple instances of behaviors also increases the likelihood that we are capturing a representative sample of each participant's situations. As a result, these aggregated assessments capture patterns of behavior that are more likely to reflect individuals’ typical levels of behavior compared to a single instance of behavior, often observed in a laboratory.
The percentage of files in which participants displayed the moral behaviors can be found in Table 2, along with base rates for the matched neutral language behaviors. The moral behaviors were modestly correlated with each other (average within-sample |r| = .26, range = .03 - .71), suggesting that the moral behaviors we assessed were diverse and captured non-overlapping variance in participants’ moral acts. There were substantial individual differences in how often participants engaged in the moral behaviors we examined. For example, although on average people expressed gratitude in only 3.2% of their conversations, one person expressed gratitude during 17.5% of her conversations, whereas 16 people never expressed gratitude in any of their EAR recordings. In addition, although on average people only criticized others in 4.6% of their conversations, one person criticized others in 22.2% of her conversations, whereas 10 people never criticized others in any of their EAR recordings.
First, we examined whether some people regularly behave more morally than others. To test this, we examined the rank-order stability of behavior by correlating the aggregate of behavior over the first weekend with the aggregate of behavior over the second weekend for each behavior (using Samples 1 and 2 only). Individual differences in moral behavior were moderately stable (average r = .47 over 4 weeks in Sample 1 and .52 over 10 weeks in Sample 2; Table 3). This was comparable to the stability of individual differences in the neutral language behaviors (average r = .26 over 4 weeks in Sample 1 and .45 over 10 weeks in Sample 2; Table 3). Taken together, these results show that individual differences in moral behavior are relatively stable over time.
Next, we examined the momentary stability of moral behaviors and neutral language behaviors. In all three samples, we created an aggregate of odd files and another aggregate of even files and then correlated the two aggregates separately for each sample. For Sample 3, we used the Actor Partner Interdependence Model (APIM; Kenny et al., 2006) to control for the dependency between couple members.3 Moral behaviors evidenced moderate to strong momentary stability (Sample 1 average r = .42, Sample 2 average r = .71, Sample 3 average standardized b = .38; Table 4), and this was comparable to the momentary stability of neutral language behaviors (Sample 1 average r = .32, Sample 2 average r = .66, Sample 3 average standardized b = .29; Table 4). Together, these findings show that a person's typical level of engaging in moral behavior is a reliable, stable characteristic.
The present study establishes a novel method for naturalistically assessing everyday moral behaviors, and provides evidence that there are substantially stable individual differences in these moral behaviors. Indeed, individual differences in moral behavior were at least as stable as individual differences in neutral language behaviors. This is impressive—we expected neutral language behaviors to be highly reliable and stable (because they can be measured without coder error and because they are not subject to self-presentational concerns), and thus considered them a high benchmark for gauging the stability of moral behaviors.
These findings present important evidence that socially significant moral behaviors can be reliably observed in daily life using the Electronically Activated Recorder (EAR). Given the potential biases in both self- and peer-reports of morality, using the EAR provides a complementary way to assess morality that sidesteps these limitations, and could ultimately be used to examine the accuracy of self- and peer-reports of morality. By bringing the study of morality out of the lab and into the real world where moral behaviors naturally occur, this method opens up the study of moral character to a variety of questions that will deepen our scientific understanding of the complexities of morality. For instance, it is possible to conduct a study specifically designed to capture moral behaviors as they occur across different settings (e.g., home, work, and social contexts), allowing researchers to directly examine the cross-situational consistency of moral behaviors (Bleidorn & Denissen, 2015).
As with any study, there are limitations that deserve attention and provide ideas for future research. The measure of rank-order stability we use spans a short time period (4 or 10 weeks), and future work should examine the stability of moral behaviors over longer periods. Additionally, two of the samples include patient populations (i.e., people with rheumatoid arthritis [Sample 1] and breast cancer [Sample 3]), and thus it is important to conduct additional work exploring other ages, health groups, and cultures. Nevertheless, it is worthwhile to note that these samples increase the generalizability of these findings across diverse ages and groups compared to the typical sample of college students. Furthermore, given that individuals in these three samples were dealing with the stresses of cancer, coping with chronic illness, or involved in a meditation intervention, the substantial stability in moral behavior we observed may actually be less than what might be observed in other populations that are experiencing less extraordinary circumstances (i.e., not undergoing personal upheavals or interventions).
While providing important evidence of temporal stability, the present study does not directly address the consistency of moral behaviors across diverse situations. Our results suggest there is stability from day to day, and as Epstein (1980) points out, because no two situations can be exactly the same, these findings indirectly suggest the presence of some cross-situational consistency. However, momentary stability does not strictly test consistency across situations and future research that simultaneously measures situations and daily moral behaviors will provide a more formal test of the stability of moral behaviors across contexts. Moreover, we cannot rule out the possibility that the stability of individual differences in moral behavior is due to the stability of individual differences in situations. That is, if there are individual differences in the situations people consistently find themselves in, this could be driving individual differences in moral behavior. However, the stability of individual differences in situations could itself be driven by personality differences (i.e., situation selection effects), so disentangling these processes will require extensive repeated assessments of situations and behavior over time.
Although the EAR offers a unique look at everyday, naturally-occurring moral behaviors, it also limits the type of moral behaviors that can be assessed. For example, we could not assess inaudible behaviors, such as cheating or dishonesty, or uncommon behaviors, such as acts of bravery or heroism. Nor could we fully assess the context in which behaviors occur because the sound bites are brief (<1 min), making it difficult to assess more complex moral behaviors. Like other behavioral research, the EAR does not allow confident attributions of mental states: the data do not allow us to speak to whether a participant had morally desirable or undesirable motivations or intentions. Finally, as evidenced by some lower intercoder reliabilities (Table 2), some moral behaviors are more difficult to code from the EAR. It is worth noting that behaviors with lower intercoder reliabilities also had lower base rates. However, we do not know if these base rates are specific to our samples, and encourage future research to continue assessing these behaviors (see Supplemental Material for recommendations for coding in future samples).
Despite these limitations, the present findings make important contributions to our understanding of individual differences in moral behavior. In addition, the use of the EAR to study moral behavior is an important advance in the study and measurement of moral behavior. Future research should examine how individual differences in moral behavior are related to self- and other-perceptions of morality, as well as moral judgments, emotions, and intentions. Together, these approaches will help us capture a more complete picture of morality as it is manifested in everyday life.
1Samples 1 and 2 were used in previous work (Robbins, Focella, Kasle, López, Weihs, & Mehl, 2011; Robbins, Lopez, Weihs, & Mehl, 2014; Robbins, Mehl, Holleran, & Kasle, 2011). However, the present analyses do not overlap with those in previous publications. A broad overview of the project was summarized in Mehl, Bollich, Vazire, & Doris (2015).
2The combining of all three samples does not take into account some dependence among participants in Sample 3. Ignoring this non-independence (resulting from within-couple similarity) should leave our effect estimates unbiased but might result in biased standard errors (and confidence intervals) and an overestimation of the effective degrees of freedom (Kenny et al., 2006). Given that the combined sample size is large (N = 186) and we tested for zero-order effects, using slightly biased standard errors and significance tests seemed preferable over (randomly) excluding data from one member of each dyad in Sample 3.
3This analytic method excluded two participants who did not have partner data.