Cynthia Huang-Pollock, The Pennsylvania State University, Department of Psychology, 254 Moore Bldg, University Park, PA 16802
Measurement reliability is assumed when executive function (EF) tasks are used to compare between groups or to examine relationships between cognition and etiologic and maintaining factors for psychiatric disorders. However, the test-retest reliabilities of EF tasks have rarely been examined in young children. Further, measurement invariance between typically-developing and psychiatric populations has not been examined.
Test-retest reliability of a battery of commonly-used EF tasks was assessed in a group of children between the ages of 5–6 years old with (n=63) and without (n=44) ADHD.
Few individual tasks achieved adequate reliability. However, confirmatory factor analysis (CFA) models identified two factors, working memory and inhibition, with test-retest correlations approaching 1.0. Multiple indicator multiple causes (MIMIC) models confirmed configural measurement invariance between the groups.
Problems created by poor reliability, including reduced power to index change over time or to detect relationships with functional outcomes, may be mitigated using latent variable approaches.
There is increasing recognition that behaviorally-based diagnostic categories, such as those used in DSM-5, result in the creation of groups that are phenotypically and mechanistically heterogeneous (Insel et al., 2010; Sanislow et al., 2010). To resolve the issues created by diagnostic heterogeneity, researchers have increasingly turned to endophenotype measures and biomarkers (Kendler & Neale, 2010; Lenzenweger, 2013; Nolen-Hoeksema & Watkins, 2011). Neurocognitive processes, such as working memory, inhibition, and other executive functions (EF), have been specifically highlighted by the recent NIMH Research Domain Criteria Initiative (RDoC) as potential endophenotypes or biomarkers that may help elucidate mechanisms of psychiatric disorders, aid in treatment matching, and facilitate development of novel treatments (Insel et al., 2010; Nolen-Hoeksema & Watkins, 2011; Sanislow et al., 2010). However, the psychometric properties of these measures may limit their use for these purposes, especially when used with young children.
Attention Deficit Hyperactivity Disorder (ADHD) is emblematic of the problems created by etiologic heterogeneity within DSM diagnostic categories. Measures of EF, speed/variability of response, and response to reward contingencies feature prominently in theoretical models of ADHD (Barkley, 1997; Castellanos, Sonuga-Barke, Milham, & Tannock, 2006; Diamond, 2005). However, a central concern and substantial obstacle to using these measures as endophenotypes or biomarkers is that their test-retest reliability is not always known (Kuntsi, Neale, Chen, Faraone, & Asherson, 2006). Low test-retest reliability not only attenuates between-group differences but also reduces statistical power to detect associations with genes, disease symptoms, or other outcome measures (Green et al., 2004; Kendler & Neale, 2010). Thus, better characterization of the reliability of cognitive tasks is essential.
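The attenuating effect of low reliability can be illustrated with the standard classical test theory relations. The sketch below is not part of the study's analyses; the function names and the example values are hypothetical and serve only to show how an observed correlation or standardized group difference shrinks as reliability falls.

```python
import math


def attenuated_correlation(true_r: float, rel_x: float, rel_y: float) -> float:
    """Expected observed correlation when both measures contain error.

    Classical test theory (Spearman's attenuation formula):
    r_observed = r_true * sqrt(rel_x * rel_y)
    """
    for rel in (rel_x, rel_y):
        if not 0.0 < rel <= 1.0:
            raise ValueError("reliabilities must lie in (0, 1]")
    return true_r * math.sqrt(rel_x * rel_y)


def attenuated_effect_size(true_d: float, rel: float) -> float:
    """Expected observed standardized group difference (Cohen's d).

    Measurement error inflates the observed SD by a factor of 1 / sqrt(rel),
    so d_observed = d_true * sqrt(rel).
    """
    if not 0.0 < rel <= 1.0:
        raise ValueError("reliability must lie in (0, 1]")
    return true_d * math.sqrt(rel)


# Hypothetical values, chosen only for illustration.
print(round(attenuated_correlation(0.50, 0.55, 0.80), 3))
print(round(attenuated_effect_size(0.80, 0.55), 3))
```

Because required sample size grows as the observable effect shrinks, the same true association needs a larger sample to detect when task reliability is low.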
Studies that have directly assessed test-retest reliability of EF tasks in children have focused primarily on middle-childhood and adolescence. In this age range, there is at least some evidence of adequate reliability for the most commonly used neurocognitive measures, including working memory span tasks, reaction time measures, and computerized measures of inhibitory control (Archibald & Kerns, 1999; Bishop, Aamodt-Leeper, Creswell, McGurk, & Skuse, 2001; Kindlon, Mezzacappa, & Earls, 1995; Kuntsi, Andreou, Ma, Borger, & Van der Meere, 2005; Kuntsi, Stevenson, Oosterlaan, & Sonuga-Barke, 2001; Soreni, Crosbie, Ickowicz, & Schachar, 2009; Thorell, 2007). However, reliability estimates are population specific, and tasks that demonstrate adequate or better reliability in middle childhood and adolescence may not be adequate for younger children. In particular, the rapid pace of maturation and development in early childhood may result in lower reliability estimates if the rate of development is not consistent across individuals. Similarly, individual differences in learning effects (which occur when the initial task exposure results in improved performance on later administrations) may also lower test-retest reliability. While learning and maturational effects are both reflected in test-retest data as improvements in performance between testing sessions, shorter test-retest intervals (several weeks) are primarily influenced by learning effects whereas longer intervals (several months) also capture the effects of maturation.
Neither maturation nor learning effects prevent a measure from being reliable, as long as those effects are consistent across the entire sample (Rousson, Gasser, & Burkhardt, 2002); however, how each of these processes affects task reliability remains unclear because few studies have assessed the reliability of neurocognitive tasks in early childhood. Gnys & Willis (1991) found good test-retest reliability for the Tower of Hanoi and verbal fluency tests (rxx>.70) in a sample of 96 typically-developing preschool and kindergarten children. Beck et al. (2011) similarly found good test-retest reliabilities (ICCs>.69) for a series of tasks measuring inhibitory control for typically-developing children ranging from 2–5 years old. However, both studies used same-day test-retest intervals, potentially leading to inflated reliabilities, and both studies noted the need to examine longer retest intervals. Thorell et al. (2006) found adequate or better test-retest reliabilities for inhibitory control and working memory tasks in a group of 4–5 year-old children over a two-week retest interval; however, the sample was small (n=22). One of the largest studies of executive task reliability in young children to date comes from Willoughby & Blair (2011). Here, the authors found moderate reliabilities (rxx=.52 – .66) over a 2–4 week time period for a test battery of inhibitory control, working memory, and attention shifting tasks in an epidemiological sample of over 100 preschool-age children. Thus, while there is evidence of moderate reliability for at least some tasks in early childhood, the number of studies examining these effects is small and studies have used relatively short test-retest intervals, which emphasize learning effects, leaving the additional effects of maturation on task reliability unclear.
An additional issue is that traditional test-retest analyses rely on the correlations between individual tasks to assess reliability, which conflates true score and error variance and does not provide an adequate measure of the reliability of the underlying construct being measured. In contrast, latent variable models, such as confirmatory factor analysis, partition variance into common and specific (task specific and error) variance (Kline, 2013; Vandenberg & Lance, 2000). Thus, the test-retest reliability of the factor scores may better reflect the stability of the underlying EF abilities as compared to individual tests. Consistent with this hypothesis, Willoughby & Blair (2011) applied confirmatory factor analysis in their sample of preschool children to demonstrate that reliability of the underlying EF factor approached unity even when individual task reliabilities did not.
Although this suggests that latent variable approaches may be useful for improving psychometric properties of individual neurocognitive tasks, several questions remain. First, all of the studies described focus on the measurement of EF factor structure and reliability in a single, typically-developing population. If these methods are to be applied to studying psychiatric populations, it is necessary to establish that the tasks function similarly in each group. In other words, measurement invariance must be established (Muthén, 1989; Vandenberg & Lance, 2000; Woods, 2009). Measurement invariance refers to whether a measure, in this case tests of EF, is psychometrically equivalent in different groups. If it is not, then groups cannot be compared because the tests may be capturing fundamentally different processes in each of the groups, rendering comparisons meaningless.
A critical first step in assessing measurement invariance is the demonstration of configural invariance, which requires that the number of factors and the pattern of factor loadings be the same between groups (Muthén, 1989; Sass, 2011; Vandenberg & Lance, 2000). The question of configural invariance is particularly relevant for young children with and without ADHD. In typically-developing preschool-age children, a single factor is often adequate to capture EF abilities (Wiebe, Espy, & Charak, 2008; Wiebe et al., 2011; Willoughby & Blair, 2011; but see Schoemaker et al., 2012), but during middle childhood, EF becomes more differentiated and is better represented by multiple factors (Lee, Bull, & Ho, 2013; Miyake, Friedman, Emerson, Witzki, & Howerter, 2000; Shing, Lindenberger, Diamond, Li, & Davidson, 2010). It remains unclear at which age multiple as opposed to single factor models become appropriate and whether factors representing different executive abilities are equally reliable. In addition, neuroimaging studies have found that children with ADHD are characterized by protracted development of prefrontal areas of the brain supporting EF (Gilliam et al., 2011; Kofler et al., 2013; Lijffijt, Kenemans, Verbaten, & van Engeland, 2005; Mackie et al., 2007; Shaw et al., 2007), and the disorder is often conceptualized as a maturational lag, suggesting that the appropriate factor model may differ between children with and without ADHD of the same age. If this were the case, latent variable approaches would be inappropriate for between-group comparisons, and so establishing invariance is particularly important to inform additional studies of EF.
The current study examined the reliability of a battery of common neurocognitive tasks in a sample of kindergarten-age children with and without ADHD with assessments in the fall and spring of the kindergarten year. The study adds to a small literature examining individual task reliability in this age range and expands the range of test-retest intervals that have been examined. Further, we expand on prior research by directly comparing reliability in typically-developing and ADHD populations on individual tasks, as well as by using confirmatory factor analysis models to test for configural measurement invariance between children with and without ADHD and to establish the stability of latent EF factors.
All data were collected as a part of a larger study examining the impact of a social-emotional intervention program on self-regulatory skills in young childhood. For the larger study, children either participated in: 1) a 30-session (16–18 week) small group social skills training intervention condition directly aimed at building social-emotional competency, self-regulatory skills, and EF; or 2) a control condition with the same number of sessions that focused on tutoring in emergent literacy skills (e.g. letter identification, letter-sound correspondence). The control condition was not expected to affect EF, self-regulatory skills, or social-emotional and behavioral outcomes. Because children in the intervention condition were intentionally provided instruction meant to improve executive processes, they were excluded from the current study.
One hundred and seven children ages 5–6 were recruited in two successive cohorts from 48 kindergarten classrooms in six Pennsylvania school districts that included both urban and rural areas. Brochures describing the study were distributed to parents of all 5–6 year-old children in the participating classrooms. Interested parents provided their contact information, as well as informed consent for child participation in the study. All procedures received approval from the university Institutional Review Board.
Initially, teachers completed two behavioral rating scales: Conners’ ADHD Rating Scale, Short Form—Revised (CTRS-R) (Conners, 2003) and the DuPaul ADHD Rating Scale (ADHD-RS) (DuPaul, Power, Anastopoulos, & Reid, 1998). Children with elevated teacher ratings, as well as children without teacher-rated ADHD symptoms, were identified. Their parents then completed a structured diagnostic interview, the Diagnostic Interview Schedule for Children (DISC-IV) (Shaffer, Fisher, Lucas, Dulcan, & Schwab-Stone, 2000), and parent-report rating forms: the Conners’ Parent Rating Scale, Long Form—Revised (CPRS-R) (Conners, 2003) and the Behavioral Assessment Scale for Preschool Children, 2nd Edition (BASC-2) (Reynolds & Kamphaus, 2004).
Children were considered to have ADHD (n=48) if: (a) they met full clinical criteria for a diagnosis of ADHD on the DISC-IV, including criteria for impairment, chronicity, and cross-situational severity; and (b) both the parent and teacher reported age-inappropriate levels of inattention or hyperactivity, defined as at least one T-score ≥ 60 (84th percentile) on the Cognitive Problems/Inattention, Hyperactivity, ADHD Index, or DSM-IV Total Index of the Conners’ or the Hyperactivity or Attention Problems Indices of the BASC-2, or ≥ 3 inattentive symptoms or ≥ 3 hyperactive/impulsive symptoms or ≥ 4 total symptoms endorsed as “often” or “very often” on the ADHD-RS. Children were considered to have emerging ADHD (n=15) if they did not meet full diagnostic criteria on the DISC-IV, but did have elevated levels of inattention or hyperactivity based on at least one parent-report measure and at least one teacher-report measure (criterion b above). Finally, children were considered non-ADHD controls (n=44) if: (a) they did not meet diagnostic criteria for ADHD on the DISC-IV; and (b) teacher ratings of behavior on all relevant indices of the Conners’ and BASC-2 yielded T-scores ≤ 59; and (c) the total number of symptoms endorsed following the “or” algorithm yielded ≤ 2 inattentive symptoms, ≤ 2 hyperactive/impulsive symptoms, and ≤ 3 total symptoms. For all children, dimensional scores of inattention and hyperactivity symptom counts were determined following DSM-IV field trials (Lahey et al., 1994) using an “or” algorithm between parent report on the DISC-IV and teacher report on the ADHD-RS (where a rating of “often” or “almost always” would indicate that a symptom was present).
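The “or” algorithm and the symptom-count limits for the control group can be sketched as follows. This is an illustration rather than the study's scoring code: the function names are hypothetical, each informant's report is assumed to be already coded as present/absent per symptom, and the example assumes the nine DSM-IV symptoms per domain.

```python
from typing import Sequence


def or_symptom_count(parent: Sequence[bool], teacher: Sequence[bool]) -> int:
    """Count symptoms endorsed by the parent OR the teacher.

    Both sequences are assumed to list the same symptoms in the same order.
    """
    if len(parent) != len(teacher):
        raise ValueError("parent and teacher reports must cover the same symptoms")
    return sum(1 for p, t in zip(parent, teacher) if p or t)


def meets_control_symptom_limits(inattentive: int, hyperactive_impulsive: int) -> bool:
    """Symptom-count part of the control criteria (criterion c in the text):
    at most 2 inattentive, at most 2 hyperactive/impulsive, at most 3 total."""
    total = inattentive + hyperactive_impulsive
    return inattentive <= 2 and hyperactive_impulsive <= 2 and total <= 3


# Hypothetical child: nine symptoms per domain.
parent_inatt = [True, False, False, False, False, False, False, False, False]
teacher_inatt = [True, True, False, False, False, False, False, False, False]
parent_hyp = [False] * 9
teacher_hyp = [False, False, True, False, False, False, False, False, False]

inatt = or_symptom_count(parent_inatt, teacher_inatt)
hyp = or_symptom_count(parent_hyp, teacher_hyp)
print(inatt, hyp, meets_control_symptom_limits(inatt, hyp))
```

A symptom endorsed by both informants is counted once, so the combined count can never exceed the number of symptoms in the domain.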
Given that many symptoms, particularly inattention symptoms, have low endorsement at this age but are highly endorsed as children reach middle childhood (Curchack-Lichtin, Chacko, & Halperin, 2013) and that diagnostic stability is improved across early and middle childhood when sub-threshold symptoms are considered (Bauermeister et al., 2011), children with full and emerging ADHD were grouped together for between-group analyses. However, as noted in the Results section, primary results were all confirmed using a continuous ADHD symptom count in addition to the categorical diagnostic indicator to ensure that this grouping strategy did not account for results. Description of sample characteristics can be found in Table 1.
Exclusionary criteria for the larger study included: a) parent report of a sensorimotor disability, frank neurological disorder, or psychosis; b) estimated FSIQ < 70 as measured by a 2-subtest short form (Vocabulary and Matrices) of the Stanford-Binet, 5th Edition; c) low levels of English proficiency that precluded children from completing the assessment battery; or d) a temporary custody situation with uncertain outcome. Children taking psychotropic medications were not excluded from the study. One child was prescribed Focalin and asked to discontinue medication 24 hours prior to each testing session. A second child was prescribed Strattera, which could not be safely discontinued, and participated while taking their regular dose at both the test and retest visits.
Children were assessed at two time points at the school, in a quiet room outside of the classroom setting and away from peers. The time between assessments ranged from 15 to 26 weeks (mean=21.13, SD=1.69). All children were tested individually by trained examiners.
Inhibitory control was assessed with five tasks: the Walk-a-line Slowly task (Kochanska, Murray, Jacques, Koenig, & Vandegeest, 1996), the Peg Tapping Task (Diamond & Taylor, 1996), the Head-Toes-Knees-Shoulders task (HTKS) (Ponitz et al., 2008), a Choice Delay Task (Sonuga-Barke, Taylor, Sembi, & Smith, 1992), and a Go/No-Go Task (Berlin & Bohlin, 2002).
For the Walk-a-line Slowly task, children were asked to walk along a six-foot piece of string taped to the floor as the examiner timed them. Children were then asked to repeat the task twice, walking slower and then walking even slower, as slowly as they could. The total score represented the average percentage by which a child reduced his/her speed on successive trials. This task has demonstrated adequate inter-rater reliability with preschool children (intra-class correlation = .98), indicating that raters are able to accurately determine the timing difference between trials (Smith-Donald, Raver, & Hayes, 2007).
In the Peg Tapping Task, children were asked to tap their peg twice when the interviewer tapped once, and vice versa. After a short set of practice items, their final score was the number of correct trials out of 16 total trials.
The HTKS task is a more complex version of the Head-to-Toes task. In this task, children habituated to several oral commands (e.g., “touch your head” and “touch your toes”). They were then asked to play “a silly game” in which, in response to the command, “Touch your toes,” they were to touch their head. Then, they played another silly game in which they were to touch their knees when asked to touch their shoulders and vice versa. Children earned 2 points for a correct response, 0 points for an incorrect response, and 1 point if they made any motion to the incorrect response but then self-corrected. The outcome variable used was the total number of correct points, with a maximum of 40 points possible.
In the Choice-Delay Task (Sonuga-Barke et al., 1992), children chose between two rewards: 1) a one-point reward available after two seconds or 2) a two-point reward available after 30-seconds. Each trial began immediately after the reward was received from the preceding trial. Children had 20 trials in which they were instructed to earn as many points as possible. The variable used in analyses was the percentage of choices for the 2-point, delayed reward.
In the Go/No-go task children viewed four stimuli (blue triangle, blue square, red triangle, and red square) and were asked to make a key press every time they saw a blue shape (target, 75% of trials), but to withhold a response when they viewed a red shape (25% of trials). Each stimulus appeared for 1000 ms; children were allowed a total of 2000 ms to respond. Inhibitory control outcome measures included percent correct hits and commission errors. Reaction time on correct hits and standard deviation of reaction time for correct hits were also recorded as measures of processing speed, which is a separable EF factor, at least in the middle childhood age range (Rose, Feldman, & Jankowski, 2011).
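The four Go/No-go outcome measures can be derived from trial-level data along the following lines. This is a minimal sketch, not the task software used in the study: the `Trial` record and `summarize_go_no_go` function are hypothetical, commission errors are expressed as a percentage of no-go trials (an assumption), and the sample standard deviation is used for RT variability (also an assumption).

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import Dict, Optional, Sequence

RESPONSE_WINDOW_MS = 2000  # response window stated in the text


@dataclass
class Trial:
    is_go: bool              # True for blue (go) stimuli, False for red (no-go)
    rt_ms: Optional[float]   # reaction time in ms; None if no key press


def summarize_go_no_go(trials: Sequence[Trial]) -> Dict[str, Optional[float]]:
    """Compute hit percentage, commission percentage, and mean/SD of hit RTs."""

    def responded(t: Trial) -> bool:
        # A key press only counts if it falls inside the response window.
        return t.rt_ms is not None and t.rt_ms <= RESPONSE_WINDOW_MS

    go = [t for t in trials if t.is_go]
    no_go = [t for t in trials if not t.is_go]
    if not go or not no_go:
        raise ValueError("need at least one go and one no-go trial")

    hit_rts = [t.rt_ms for t in go if responded(t)]
    commissions = sum(1 for t in no_go if responded(t))

    return {
        "hit_percent": 100.0 * len(hit_rts) / len(go),
        "commission_percent": 100.0 * commissions / len(no_go),
        "mean_rt_ms": mean(hit_rts) if hit_rts else None,
        # The sample SD needs at least two correct hits.
        "sd_rt_ms": stdev(hit_rts) if len(hit_rts) >= 2 else None,
    }


# Hypothetical mini-session, for illustration only.
demo = [Trial(True, 400.0), Trial(True, 500.0), Trial(True, 600.0),
        Trial(True, None), Trial(False, None), Trial(False, 450.0)]
print(summarize_go_no_go(demo))
```

Mean RT and SD of RT are computed from the same set of correct hits, which is one reason the two measures tend to be strongly related within a child.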
Both verbal and visuospatial working memory were assessed. Verbal working memory was assessed using the Backward Word Span task (Carlson, 2005; Davis & Pratt, 1995). For this task, children listened to a list of words read out loud and then were asked to repeat the words in backwards order. The list started with one word, and increased by one additional word with successive trials. Children received a score equal to the highest number of words they were able to repeat correctly in reverse order.
Visuospatial working memory was assessed using the Finger Windows task from the Wide Range Assessment of Memory and Learning, Second Edition (WRAML-2) (Sheslow & Adams, 2003). In the forwards condition of this task, the child watched as the examiner put a pencil in a series of holes on a card. The child then recreated this series. The WRAML-2 Finger Windows task includes only a forwards condition; however, a backwards condition was also created for this study in which children needed to point to the holes the examiner identified in reverse order. In both conditions, children received one point for every correct sequence recalled.
The Dimensional Change Card Sort (DCCS) (Frye, Zelazo, & Palfai, 1995) was also included as a measure of working memory, consistent with prior literature indicating that performance in early childhood is related to the ability to use higher-order if-then rules (Zelazo, 2004; Zelazo & Frye, 1998) and to use working memory to overcome response conflict (Munakata, 2001). Children were shown picture cards that varied along the dimensions of color and shape (e.g. red and blue, rabbits and boats). After learning to sort the cards according to color, children were then asked to sort the cards according to shape instead. The score represented the number of trials (out of 6) in which the child correctly shifted sets after the sorting criteria changed.
Two children initially recruited into the study did not participate in either the pre- or post-test due to absence from school on the day of testing, and were excluded from all analysis. Seven additional children completed only part of the pre- or post-test battery and 13 children’s files were lost to file corruption on the computerized go/no-go task at either Time 1 or Time 2. Finally, two scores were identified as outliers (> 5 Standard Deviations from the sample mean): one on the Go/No-go Reaction Time measure and one on the Walk-a-line Slowly task. The outlying scores were treated as missing data for these tasks. In assessing reliability of individual tasks, children were only included if they completed the task at both time points (pre- and post-test). The final Ns for each task are reported in Table 2. In assessing factor structure and factor reliability, all available children were included in analyses using the full information maximum likelihood algorithms in MPLUS to handle missing data.
Learning/maturation effects were assessed with a multivariate repeated-measures ANOVA including the full battery of EF tasks. Time was the within-subjects factor and ADHD diagnosis was the between-subjects factor. A main effect of Time indicates the presence of learning/maturation effects and a Time*ADHD interaction indicates differences in these effects based on ADHD status.
Test-retest reliability for each group was calculated as the age-partial inter-class product-moment correlation coefficients (Pearson correlation) between the two test administrations using SPSS (Rousson et al., 2002). There are no firm criteria for what constitutes “good” reliability. Prior research examining reliability of neurocognitive tasks has adopted the criteria that reliabilities between .50–.70 are “adequate” and those above .70 are “good” (Kindlon et al., 1995; Kuntsi, Stevenson, et al., 2001). We adopt the same criteria here. The age effects were included to account for differences in age at the initial assessment time period (Kail, 2007; Williams, Ponesse, Schachar, Logan, & Tannock, 1999). Fisher’s r-to-z tests were used to compare the correlation coefficients and determine whether these differed significantly between the diagnostic groups.
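The reliability computations described above can be sketched in a few functions. The analyses in the study were run in SPSS; the code below is only an illustration using the textbook formulas for a first-order partial correlation and for Fisher's r-to-z comparison of two independent correlations. The function names and the `n_covariates` adjustment to the standard error are assumptions for this sketch.

```python
import math
from typing import Tuple


def partial_correlation(r_xy: float, r_xz: float, r_yz: float) -> float:
    """Correlation between x and y with z partialled out.

    Here x and y would be Time 1 and Time 2 scores and z would be age.
    """
    denom = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    if denom == 0.0:
        raise ValueError("covariate is perfectly correlated with x or y")
    return (r_xy - r_xz * r_yz) / denom


def fisher_z_difference(r1: float, n1: int, r2: float, n2: int,
                        n_covariates: int = 0) -> Tuple[float, float]:
    """Compare two independent correlations; returns (z, two-tailed p).

    For partial correlations, each partialled covariate costs one further
    degree of freedom in the standard error.
    """
    df1 = n1 - 3 - n_covariates
    df2 = n2 - 3 - n_covariates
    if df1 <= 0 or df2 <= 0:
        raise ValueError("sample sizes too small for the test")
    se = math.sqrt(1.0 / df1 + 1.0 / df2)
    z = (math.atanh(r1) - math.atanh(r2)) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-tailed normal p-value
    return z, p


def reliability_label(r: float) -> str:
    """Criteria adopted in the text: .50 to .70 adequate, above .70 good."""
    if r > 0.70:
        return "good"
    if r >= 0.50:
        return "adequate"
    return "below adequate"


# Hypothetical correlations; the group sizes match the study's n=63 and n=44.
z, p = fisher_z_difference(0.65, 63, 0.30, 44, n_covariates=1)
print(round(z, 2), round(p, 3), reliability_label(0.65))
```

With groups of this size, two correlations have to differ substantially before the Fisher test flags the difference as significant, so non-significant group comparisons should be read cautiously.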
Confirmatory factor analyses (CFA) were conducted in MPLUS v.7.2. First, a series of CFA models were tested in the two diagnostic groups separately. Then, a multiple-indicators multiple-causes (MIMIC) model (Muthén, 1989) in which ADHD diagnostic status was included as a covariate was used to test for measurement invariance (Muthén, 1989; Woods, 2009). MIMIC models can be used to test for configural invariance by including grouping variables as covariates in the factor model, rather than testing separate models for each group as is required for multiple-group CFA (Kim, Yoon, & Lee, 2012; Muthén, 1989). MIMIC models assume equivalent factor loadings across groups, rather than testing this directly (Muthén, 1989); thus, they cannot be used for testing metric and other types of invariance. However, MIMIC models are preferred for testing measurement invariance in small samples (Muthén, 1989; Woods, 2009), and configural invariance must be established for subsequent tests of metric and other types of measurement invariance to be meaningful (Vandenberg & Lance, 2000). Further, using a MIMIC model approach, both categorical and continuous measures of ADHD could be used in tests of measurement invariance, which is not possible with a multiple groups approach. Thus, using the MIMIC model approach to establish configural invariance is a first and critical step in determining the utility of latent variable models for comparing between typically-developing and psychiatric populations. MIMIC models were estimated using ADHD diagnostic status as a covariate. Each direct effect was estimated in a separate model with false discovery rate correction employed (Benjamini & Hochberg, 1995). Results were confirmed with total ADHD symptoms (continuous) as a covariate using the same procedures. As a final step in the analyses, pre- and post-test assessments were used in a single CFA model to determine the test-retest reliability (stability) of the identified factors across time.
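The false discovery rate correction applied to the set of direct-effect tests follows the Benjamini-Hochberg step-up procedure, which can be sketched as below. The function name and the example p-values are hypothetical; the study's own corrections were applied to MPLUS output and are not reproduced here.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], q: float = 0.05) -> List[bool]:
    """Return, for each p-value in the original order, whether the test is
    rejected under Benjamini-Hochberg FDR control at level q."""
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * q.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            largest_k = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for idx in order[:largest_k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for four direct effects, for illustration only.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```

Because the rule is step-up, a p-value that misses its own threshold can still be rejected when a larger p-value further down the ranking meets its threshold.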
MIMIC models estimate a much smaller number of parameters than multiple group CFA approaches by treating the grouping variable as a covariate, thus making them more appropriate in small samples. Monte Carlo simulation power analysis in MPLUS indicated adequate power (>.80) for all models, including the MIMIC models. Thus, models were adequately powered to detect configural invariance where it existed.
A multivariate repeated-measures ANOVA including the EF measures revealed significant main effects of Time (p < .001) and Diagnosis (p < .001), but no significant Time*Diagnosis interaction effect (p = .205), indicating the presence of learning and maturational effects (i.e., improvements over time) but no difference in these effects based on diagnostic status. Follow-up univariate tests for the main effect of Time indicated significant learning effects for eight of the twelve measures. In all cases performance improved at the second administration of the test. See Table 2 for means and standard deviations at each time point and summary of significance tests for learning/maturation effects. Results were also confirmed using the continuous ADHD symptom count rather than categorical diagnosis.
Table 3 shows the correlation matrix for all EF tasks in the full sample and Table 4 shows the age-partial test-retest correlations by group. In the typically-developing group, three tasks reached adequate or better levels of reliability: mean and standard deviation of RT from the Go/No-go task and HTKS. In the ADHD sample, six tasks reached adequate or better reliability: DCCS, Finger Windows Backwards, HTKS, as well as Go/No-go Hit Accuracy and commission errors, and Peg Tapping. Test-retest correlations for Word Span, Go/No-go Hit rate, and Peg Tapping were significantly higher in the ADHD than in the typically-developing group.
A three factor CFA with 1) Inhibitory Control (GNG Accuracy, GNG Commissions, Peg Tapping, Delay Aversion, Walk-a-Line Slowly), 2) Working Memory (Backward Word Span, DCCS, HTKS, Finger Windows Forward, Finger Windows Backward), and 3) Processing Speed (GNG RT, GNG SDRT) was fit in the full sample. The model did not converge and provided a poor fit to the data. Problems with model convergence were a result of the linear dependency between the two indicators of the Processing Speed factor: RT and SDRT. This factor was also problematic in that CFA factors defined by only two indicators are not identified and so a minimum of three indicators is recommended to include a factor in CFA (Muthen & Muthen, 2009). Thus, in remaining analyses we focus on a two-factor CFA model using only the Working Memory and Inhibition factors.
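The identification problem with a two-indicator factor can be seen from a simple parameter count. The sketch below is an illustration, not an analysis from the study: it considers a single factor in isolation, scaled by fixing one loading, and the function name is hypothetical. The counting rule is a necessary but not sufficient condition, and a two-indicator factor embedded in a larger model with correlated factors can draw on additional covariances, so this applies to the isolated case only.

```python
def single_factor_df(n_indicators: int) -> int:
    """Degrees of freedom for a one-factor CFA considered on its own.

    Available information: p * (p + 1) / 2 unique variances and covariances.
    Free parameters (one loading fixed to scale the factor):
    (p - 1) loadings + 1 factor variance + p residual variances = 2 * p.
    Negative df means the model is under-identified.
    """
    if n_indicators < 1:
        raise ValueError("need at least one indicator")
    moments = n_indicators * (n_indicators + 1) // 2
    free_parameters = 2 * n_indicators
    return moments - free_parameters


for p in (2, 3, 4):
    print(p, single_factor_df(p))
```

Two indicators leave fewer data points than parameters, three indicators leave the factor just-identified, and four or more leave degrees of freedom for testing fit.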
In the two-factor CFA model for the full sample, the Delay Aversion and Walk-a-line Slowly tasks did not load significantly on the Inhibitory Control factor at either the test or re-test time point. Further, a model excluding these tasks showed significantly better fit to the data at both times, and so these tasks were excluded from additional analyses. Thus, the final two-factor CFA model included an Inhibitory Control factor (GNG Accuracy, GNG Commissions, Peg Tapping) and a Working Memory factor (Backward Word Span, DCCS, HTKS, Finger Windows Forward, Finger Windows Backward). Based on modification indices provided in MPLUS, the residual covariances for Finger Windows Forward and Finger Windows Backward were allowed to correlate. The two-factor CFA model fit well in the full sample (Adjusted BIC= 2073.0; χ2=10.1, p=.928; RMSEA=0.0; CFI=1.00) and is shown in Figure 1. The 2-factor model fit significantly better than a 1-factor CFA model (Adjusted BIC= 2126.5; χ2=66.7, p<.001; RMSEA=.15; CFI=0.84).
The two-factor model fit well in both the typically-developing and ADHD groups separately at both Time 1 and Time 2 (all p> .05 for χ2 tests, all RMSEA< .03, all CFI> 0.98) and fit significantly better than the one factor model for both groups at both time points. We next combined the two groups into a single MIMIC analysis in which the ADHD diagnostic indicator was regressed onto the factors to assess measurement invariance between the diagnostic groups. A significant direct effect of ADHD diagnosis on the factor indicators (i.e. on observed task performance) would indicate a lack of measurement invariance. At Time 1, the model fit was good (Adjusted BIC= 2048.1; χ2=19.8, p=.711; RMSEA=0.0; CFI=1.00). ADHD diagnosis was significantly related to the factors, indicating that the well-documented ADHD-related deficits in working memory and inhibitory control are captured by the latent variables; however, there were no significant direct effects of ADHD diagnosis on any of the factor indicators (all p>.05). The lack of direct effects indicates measurement invariance for this time point. All results were replicated for Time 2, confirming measurement invariance across the diagnostic groups at both time points. All results were also confirmed using total ADHD symptoms as a continuous covariate instead of the categorical diagnostic indicator.
Finally, a two time point, two-factor CFA model was fit to establish the stability of the factor scores across time. Residual covariances between Time 1 and Time 2 tasks scores were allowed to correlate. The model was a good fit for the data (Adjusted BIC= 3927.0; χ2=93.0, p=.393; RMSEA=.018; CFI=.99). The full model is shown in Figure 2. Consistent with prior studies the stability of the latent variables approached unity. The correlation (stability coefficient) for the Inhibitory Control factor was .98 and for the Working Memory factor was .99.
Adequate test-retest reliability for neurocognitive tasks is not only critical to the accurate estimation of between group effect sizes (Huang-Pollock, Karalunas, Tam, & Moore, 2012), but also to the ability to detect associations between these cognitive processes, putative genetic mechanisms, symptom domains, and other outcome measures (Green et al., 2004; Kendler & Neale, 2010; Kuntsi et al., 2005; Kuntsi et al., 2006). In the current study, individual measures used to assess neurocognitive functioning in young children showed a wide range of reliabilities, with only a handful achieving adequate levels. Despite moderate or worse reliabilities for many individual tasks, latent variable modeling indicated test-retest correlations approaching 1.0 for factors measuring both Inhibition and Working Memory. High reliability was achieved over a longer test-retest interval that has previously been examined in this age range, suggesting that maturational effects did not limit the reliability of the underlying EF constructs as measured by the latent variable models.
In addition, MIMIC models confirmed configural invariance across typically-developing and ADHD samples, which is critical for establishing that latent variable approaches can be used for comparison of these populations. The question of configural invariance between diagnostic groups in this age range is not trivial. In particular, in preschool-age children EF appears to be best represented as a unidimensional construct captured by a single factor (Hughes, Ensor, Wilson, & Graham, 2009; Wiebe et al., 2011; Willoughby & Blair, 2011). However, in middle childhood a multidimensional factor structure with three to five factors is most often found, suggesting that EF abilities become more differentiated with age (Lee et al., 2013; Miyake et al., 2000; Shing et al., 2010). Children in our age range are on the cusp of these two time periods. Further, children with ADHD are often conceptualized as having maturational delays in prefrontal regions supporting EF (Gilliam et al., 2011; Kofler et al., 2013; Lijffijt et al., 2005; Mackie et al., 2007; Shaw et al., 2007), which implies that different factor structures may be needed to capture EF in typically-developing and ADHD children of the same age. However, this was not the case. In the current study, the same multi-dimensional factor solution fit the ADHD and typically-developing groups equally well. Although groups differed in factor means, capturing well-documented ADHD-related deficits on both working memory and inhibitory control, there were no direct effects of ADHD diagnosis on individual tasks. The lack of direct effects confirms that the tasks functioned similarly in both groups. Future studies with larger samples will be needed to establish strong measurement invariance, including metric and scalar invariance.
However, these results suggest that latent variable approaches are a viable solution for addressing problems created by low individual task reliabilities, which reduce power for detecting between-group effects, limit the ability to detect developmental change, and interfere with the use of neurocognitive measures in endophenotype studies.
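The cost of low reliability for between-group comparisons can be made concrete with a short sketch based on classical test theory. It assumes that measurement error is unrelated to group membership and that within-group reliability can be approximated by the test-retest coefficient. The function names and the example values are hypothetical and are not taken from the current data.

```python
import math


def attenuated_d(true_d, reliability):
    """Expected observed standardized mean difference for an unreliable measure.

    Measurement error inflates the within-group SD by 1 / sqrt(reliability)
    while leaving the mean difference unchanged, so d shrinks by sqrt(reliability).
    """
    if not 0 < reliability <= 1:
        raise ValueError("reliability must be in (0, 1]")
    return true_d * math.sqrt(reliability)


def sample_size_inflation(reliability):
    """Approximate factor by which the per-group n must grow to keep power constant.

    Required n scales with 1 / d**2, and d**2 shrinks in proportion to reliability.
    """
    if not 0 < reliability <= 1:
        raise ValueError("reliability must be in (0, 1]")
    return 1.0 / reliability


# Hypothetical example: a true group difference of d = 0.8 measured with reliability .49.
print(attenuated_d(0.8, 0.49))
print(sample_size_inflation(0.49))
```

In this hypothetical case the expected observed effect falls from 0.8 to about 0.56, and roughly twice as many participants per group would be needed to retain the same power.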
Although the clear recommendation from this study is to capitalize on latent variable approaches to maximize reliability, this may not always be possible, so several individual tasks that did not achieve adequate reliability are worth highlighting. Consider, first, the delay aversion task. Previous delay tasks for which adequate reliability has been reported in preschool-aged children either used test-retest intervals that were exceptionally short (~15 minutes), which would have artificially inflated their reliabilities, or used tangible rewards (cookies, pennies) (Beck et al., 2011; Thorell & Wahlstedt, 2006). Thus, it may be that young children require tangible rewards to elicit reliable reward choices.
Second, in middle childhood, mean RT and SDRT have each demonstrated adequate reliability and shown promising associations with behavioral symptoms and putative genetic mechanisms of ADHD (Kuntsi et al., 2005; Kuntsi, Oosterlaan, & Stevenson, 2001; Wood, Asherson, van der Meere, & Kuntsi, 2010). In contrast, in this study, reaction time measures failed to achieve adequate reliability in young children with ADHD, and other studies have found that they do not differentiate young children with and without ADHD (Kalff et al., 2005). The current results suggest that the inability to differentiate groups may be due to low reliability, rather than to an absence of deficits in processing speed and efficiency among young children with ADHD.
This difference in interpretation has important implications for the search for endophenotypes in particular. One requirement of cognitive endophenotypes is that they should be stable over time regardless of disease course (Gottesman & Gould, 2003). Thus, if speed and variability of information processing do not characterize young children with ADHD, then this would argue against their mediating between gene action and symptom domains. However, if measures of speed and variability of information processing are unreliable in young children with ADHD, then this suggests that reliable measures first need to be identified before developmental trends can be assessed.
This problem with interpretation is not limited to reaction time measures. There are a growing number of studies aimed at characterizing the relationships between neurocognitive processes and ADHD symptom domains. These include studies comparing the relative heritability of different cognitive processes (Stins et al., 2005), the strength of relations between endophenotype and symptom domains in different age groups (Brocki, Fan, & Fossella, 2008), and the strength of relations between different neurocognitive processes and symptom domains. In each case, specific attention to task reliability is critical, given that any differences in the strength of association may reflect differences in task reliability rather than the underlying construct being assessed.
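The point that differences in the strength of association may reflect differences in reliability follows from Spearman's correction for attenuation, sketched below. The function names and all numerical values are hypothetical and serve only to illustrate the argument.

```python
import math


def observed_correlation(true_r, rel_x, rel_y):
    """Expected observed correlation given the true correlation and both reliabilities."""
    return true_r * math.sqrt(rel_x * rel_y)


def disattenuated_correlation(obs_r, rel_x, rel_y):
    """Spearman's correction: estimate of the true correlation from the observed one."""
    return obs_r / math.sqrt(rel_x * rel_y)


# Hypothetical example: two tasks with the same true association (r = .50) with a
# symptom measure of reliability .90, but with task reliabilities of .80 and .40.
symptom_rel = 0.90
for task, task_rel in (("Task A", 0.80), ("Task B", 0.40)):
    r_obs = observed_correlation(0.50, task_rel, symptom_rel)
    print(f"{task}: reliability={task_rel:.2f}, expected observed r={r_obs:.2f}")
```

In this example the two tasks would show observed correlations of about .42 and .30 with symptoms, a difference produced entirely by reliability rather than by any difference in the underlying construct.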
The current study confirmed configural invariance only for the working memory and inhibitory control factors. A third processing speed factor could not be tested in the CFA models because there were too few and too highly correlated indicators for this factor. Future test batteries that incorporate a larger number of tasks assessing processing speed will be required to address this limitation. In addition, the Delay Aversion and Walk-a-Line Slowly tasks did not load on the Inhibitory Control factor. The inhibition of gross motor movement required for Walk-a-Line Slowly makes it substantially different from the other inhibitory control tasks, which emphasize inhibition of prepotent fine motor movements and require stimulus discrimination and choice between two response options. The low factor loading of the Delay Aversion task is consistent with prior work suggesting reward delay tasks tap a different type of inhibitory control than non-reward based tasks (Willoughby, Kupersmidt, Voegler-Lee, & Bryant, 2011; Zelazo & Carlson, 2012). In particular, whereas the majority of inhibitory tasks for the current study tapped “cool” executive processes (i.e., inhibitory control elicited in relatively emotion-free contexts), the delay aversion task taps a “hot” inhibitory process, elicited by the emotionally-salient rewards offered. Future studies in which more “hot” executive tasks are administered could test for the presence and reliability of an additional “hot” executive factor. Finally, additional studies with larger sample sizes will be required to confirm the current results and to test for other types of measurement invariance, including invariance of factor loadings between groups.
Current results provide evidence that many measures of cognitive function used with older children and adults do not achieve adequate levels of reliability in preschool-aged children. The use of measures with poor reliability hinders the field’s ability to identify associations with symptom domains and may lead to erroneous conclusions about the developmental stability of cognitive phenotypes in psychiatric disorders. Selection of reliable measures is increasingly important as greater emphasis is placed on the use of neurocognitive measures to identify genetic mechanisms with relatively small individual effects. The current study suggests that latent variable approaches are a viable solution to problems created by low reliability of individual tasks.