The overarching goal of this review is to examine the current best evidence for assessing bipolar disorder in children and adolescents and provide a comprehensive, evidence-based approach to diagnosis. Evidence-based assessment strategies are organized around the “3 Ps” of clinical assessment: Predict important criteria or developmental trajectories, Prescribe a change in treatment choice, and inform Process of treating the youth and his/her family. The review characterizes bipolar disorder in youths - specifically addressing bipolar diagnoses and clinical subtypes; then provides an actuarial approach to assessment - using prevalence of disorder, risk factors, and questionnaires; discusses treatment thresholds; and identifies practical measures of process and outcomes. The clinical tools and risk factors selected for inclusion in this review represent the best empirical evidence in the literature. By the end of the review, clinicians will have a framework and set of clinically useful tools with which to effectively make evidence-based decisions regarding the diagnosis of bipolar disorder in children and adolescents.
There have been radical changes in our scientific understanding and clinical practices around the diagnosis of bipolar disorder in children and adolescents. Whereas the condition used to seldom be diagnosed before puberty, there has been a recent surge in rates of diagnosis such that a large proportion of psychiatrically hospitalized youths now carry clinical diagnoses of bipolar disorder 1, and there has been a more than 40 fold increase in rates of diagnoses over a 10-year period 2. There has been debate about whether the increase in diagnosis is primarily due to a correction of previous under-diagnosis, versus concerns that it is now overdiagnosed or even a case of “diseasemongering” 3. Discussion has also focused on whether bipolar disorder in youth is the same illness as in adults, versus representing a different condition or perhaps a pediatric subtype 4-6. Although the topic is still portrayed as controversial in the popular media, at this point more than 350 peer-reviewed publications have investigated different aspects of pediatric bipolar illness 7. Growing evidence from clinical and epidemiological studies around the world indicates that bipolar disorder often first manifests in adolescence or earlier 8,9, and many apparent differences between adult and child presentations appear to be an artifact of definitional issues and not real variations in clinical presentation 7. Prospective longitudinal studies also are documenting moderate to high levels of developmental continuity with adult bipolar disorder 10-12. All lines of evidence strongly indicate that bipolar symptoms in youths are associated with considerable impairment and warrant clinical attention.
The goal of the present review is to provide a step-by-step, evidence-based approach to the assessment of bipolar disorder in children and adolescents. The review is organized around clinical decision-making and then monitoring progress over the course of treatment. The review does not discuss the potential merits of all the most commonly used instruments for psychological, educational, or psychiatric assessment; instead, it concentrates on those tools that have research supporting their validity with regard to pediatric bipolar disorder. The majority of the assessment tools routinely used for psychological 13 and psychiatric evaluation have not been validated for work with pediatric bipolar disorder 14. Rather than using standardized assessment batteries out of convention or habit, we believe that the assessment endeavor will perform best when each component is chosen based on its demonstrated validity and relevance to clinical intervention. Evaluation strategies should address one of the “3 Ps” of clinical assessment: 1) Predict important criteria or developmental trajectories, 2) Prescribe a change in treatment choice, or 3) Inform the Process of treating the patient or family 15. Narrowing assessment batteries down in this manner has many benefits that include creating a strong link between assessment and treatment, reducing time and expense by eliminating unnecessary testing, and improving decisions and treatment outcomes by reducing “information clutter” and providing more focused information that directly pertains to the individual patient. The 3 Ps provide a rubric to help navigate the assessment process from establishing risk of pediatric bipolar disorder, confirming the diagnosis, informing treatment selection, measuring progress and outcome, and monitoring for relapse prevention.
The diagnostic criteria for mood disorders are unusual in that they require a two-stage evaluation 16,17. First, the clinician must evaluate the lifetime history of mood episodes, not just characterizing the current presenting problem. Only after gathering data about the possible occurrence of each type of potential mood episode over the lifetime can the clinician proceed to establishing the formal diagnosis. Diagnosing bipolar disorder requires this complexity, because the presentation of the illness can change dramatically as it transitions into different episodes.
The diagnostic mood episodes include major depressive episodes, dysthymic episodes, hypomanic symptoms, hypomanic episodes, manic episodes, and mixed episodes. These categories are not exhaustive in terms of phenomenology. Additional mood presentations are possible and frequently encountered in clinical practice, including mild depressions, mixed hypomanias, and periods of mood dysregulation that are too brief or mild to meet current criteria for an index episode 18. However, the formal diagnosis of mood disorder is anchored to the index episodes, not the other clinical presentations. It is only after ascertaining both the present and past lifetime mood episodes that the clinician can diagnose mood disorders accurately. Table 1 shows how the combination of present and past episodes is often necessary to make a diagnosis on the bipolar spectrum. Unless the clinician inquires about past mood history, many cases of bipolar disorder will be misdiagnosed as unipolar depressive or dysthymic disorders – particularly given that people affected by bipolar disorders tend to spend more days depressed than manic, and are much more likely to seek services for depression than mania 19. The situation may be somewhat different with pediatric bipolar disorder, both because referrals are more often initiated by the parent rather than the youth in outpatient settings, and because mania and mixed episodes appear to be more common in younger cohorts and then decrease steadily with age 8,20.
Bipolar I, often considered the most serious form of bipolar illness, has received the greatest attention from the research community. As Table 1 makes evident, a bipolar I diagnosis only requires the presence of one manic or mixed lifetime episode 17. Bipolar II disorder, in contrast, requires two distinct mood episodes in order to assign the diagnosis: at least one major depressive episode and a hypomanic episode. Without systematic assessment for lifetime hypomanic episodes, bipolar II is very likely to be misdiagnosed as a unipolar depression 21.
The DSM-IV includes cyclothymic disorder as another condition to be considered in the bipolar family of disorders. The diagnostic criteria for cyclothymic disorder require long periods of moderate mood disturbance. The depressive symptoms cannot become too severe, or else the diagnosis would change to a major depressive episode or bipolar II disorder. Similarly, the hypomanic symptoms cannot become too extreme; otherwise, if they meet criteria for a full manic episode, then the diagnosis would change to bipolar I. Cyclothymic disorder is difficult to distinguish from temperament; indeed, much research in the area has used rating scales assessing cyclothymic temperament 22. The diagnosis of cyclothymic disorder is rarely used in clinical practice 23, nor is it tracked in most large epidemiological studies or clinical research samples 8,10,24,25. However, those studies that have investigated cyclothymic disorder have found that it is a highly impairing condition that warrants clinical attention 26-33.
Bipolar “not otherwise specified” (NOS) is a fourth diagnostic option in the bipolar section of DSM-IV. Bipolar NOS is a residual category, intended to be used when bipolar features are present, but the clinical presentation does not fit into any of the three above categories. DSM-IV provides some examples of presentations that would be appropriate to code as bipolar NOS. These include having recurrent hypomanias without any lifetime history of manic, mixed, or major depressive episodes 34; having a disturbance in mood but with an insufficient number of the possible seven B criteria symptoms (e.g., elated mood plus one or two other symptoms; or irritable mood plus fewer than four other symptoms); or cases where the duration of the index mood episode is not long enough to satisfy the thresholds specified for hypomania (four days), or mania or mixed episodes (one week, or else severe enough to necessitate psychiatric hospitalization). Both the “insufficient number of symptoms” 12,35,36 and the “insufficient duration” forms of bipolar NOS 26,37 have been documented in multiple studies in both children and adults. Although the definitions do not identify identical sets of cases, the cumulative evidence shows that either definition is associated with considerable chronicity and clinical impairment. If the core feature of episodic mood disturbance is present, then most evidence suggests that bipolar NOS falls on the bipolar spectrum. In short, bipolar NOS (a) appears to be at least as prevalent as bipolar I in epidemiological and clinical samples, (b) has become well established as an impairing mood disorder, and (c) deserves clinical attention.
There is an important consideration about the potential overlap between bipolar NOS and cyclothymic disorder. In practice, most practitioners and researchers tend to lump cyclothymic disorder into the bipolar NOS category. Technically this is a departure from the official nosologies 16,17, and it adds to the heterogeneity that is found under the rubric of bipolar NOS. The combination of short durations for mood states combined with long lengths of episode should trigger careful evaluation of the possibility of a cyclothymic disorder.
When making any of the bipolar diagnoses, the clinician must rule out the possibility that the mood symptoms are due to schizophrenia or a general medical condition, or are induced by a substance 16,17. The substance-induced exclusion criterion creates the most challenges. Street drugs that have a strong dopaminergic effect can mimic the symptoms of mania, and hallucinogens can create symptoms that appear psychotic. A more subtle point is that manic symptoms secondary to the use of prescription medications, including antidepressant or stimulant medications, technically lead to diagnoses of “substance-induced mania”. The literature on psychotropic medications inducing mania is complex 38, but experts agree that manic symptoms emerging during the course of treatment always justify thorough evaluation of the possibility of a bipolar diagnosis.
In the pediatric literature, there has been much discussion about changes to criteria for youths or alternate definitions of bipolar subtypes 6. Leibenluft and colleagues suggested the term “narrow phenotype” to indicate situations where the manic episode included symptoms of elated mood or grandiosity, consistent with the research operational definition of bipolar disorder used by Geller and colleagues 6,39. People often think that the term “narrow” connotes strict adherence to DSM criteria, but actually the “narrow” definition is more restrictive than DSM criteria (which would include hypomania or mania with predominantly irritable mood, so long as there were sufficient numbers of B-criteria symptoms co-occurring). In many samples, there is substantial overlap between the cases that meet DSM criteria and those that satisfy the narrow definition 37,40. There is also considerably less research available based on the narrow criteria than on the DSM criteria 41. At the other extreme, the term “broad phenotype” has been used so widely and to refer to so many different things that it has become imprecise to the point of losing clinical utility. Because the evidence base is much stronger for DSM definitions than for any alternate research definitions, and because DSM criteria guide clinical practice, the rest of this review will concentrate on DSM definitions (bipolar I, bipolar II, cyclothymic disorder, and bipolar NOS, clarifying whenever possible whether the NOS specification is due to insufficient symptoms or insufficient duration).
Course specifiers have considerable clinical value in the context of bipolar disorder. Notations such as “bipolar II disorder, current episode depressed” provide important information about the nature of the illness and change some of the treatment options (i.e., prescription of different interventions for unipolar versus bipolar depression). There has been inconsistency about the use of the terms “cycling” and “rapid cycling.” These are often used to connote polarity switches. However, the DSM definition of rapid cycling denotes the occurrence of four or more distinct mood episodes (not changes in mood state within an episode) within the same year. Thus, rapid cycling might be thought of as “rapid recurrence” or “rapid relapse.” Indeed, when defined as four or more annual episodes, “rapid cycling” portends a much more chronic course, higher rates of comorbidity and substance use, greater treatment refractoriness, and potentially greater risk of suicide. Thus the phenomenon of “rapid cycling/relapsing” satisfies both the predictive and prescriptive litmus tests for inclusion in an assessment of bipolar disorder. If the rapid switching between mood polarities that has been well described in children is construed as a mixed episode rather than as multiple episodes, then the terms are being used consistently across the lifespan, and clinicians can better identify when there is a higher risk of relapse.
The first question that must be answered is, “How common is pediatric bipolar disorder, anyway?” At the time most practicing clinicians were trained, the conventional wisdom was that bipolar disorder affected only adults and perhaps some adolescents; and the vast majority of training programs still do not provide formal didactics about the assessment or treatment of pediatric bipolar disorder 42. The prevalence of the disorder is an important starting point for clinical evaluation.
The traditional figure has been that bipolar disorder affects 1% of the adult population. This figure was often based on rates of bipolar I, and excluded all other DSM bipolar diagnoses. More recent epidemiological studies have found lifetime prevalences of bipolar I and II to be closer to 3 or 4% 24,25, and bipolar spectrum diagnoses appear to affect from 2.6% to 8.3% or more 36 of the general population (see Goodwin and Jamison for a review of twelve international studies)8. Unfortunately, epidemiological studies tend not to use strict DSM criteria for diagnoses, making it difficult to map findings directly onto clinical labels. Despite the varying definitions, it is clearly evident that (a) the bipolar spectrum is more common than generally thought, (b) the “soft spectrum” cases occur at least as frequently as does bipolar I in both community and clinical samples, and (c) the soft spectrum is associated with both immediate impairment and long term risk of poor outcomes on multiple measures 10,36,37,43,44.
A major caveat for the clinician is that epidemiological studies describe the incidence or prevalence of bipolar disorder in the general population. This is not the same thing as the frequency with which a practitioner will encounter bipolar disorder in clinical settings. Bipolar disorder is more common in outpatient settings than in nonreferred community samples, and bipolar is more frequent in settings providing more intensive services due to the acuity of the illness. Table 2 lists prevalence rates from multiple settings. The table also includes information about how the diagnoses were made, whether both parents and youths were interviewed, and other features that might influence the comparability of the estimates. Another limitation is that different groups and settings use somewhat different definitions of bipolar disorder, which also change the rates and their generalizability to other clinical settings. However, these rates still provide meaningful benchmarks against which clinicians can compare the rate of their bipolar disorder diagnoses. They also offer some indication of whether bipolar disorder is likely to be rare or common in a given setting.
Research has identified multiple risk factors that might pertain to bipolar disorder. Tsuchiya and colleagues recently reviewed more than 100 studies evaluating more than 30 different risk factors. They concluded that a family history of bipolar disorder is the only well-established risk factor for bipolar disorder that should warrant clinical attention 45. In studies of the offspring of bipolar parents, the risk of developing bipolar disorder appears to be at least five times higher than in the comparison groups 46, and estimates of the recurrence risk in adult samples indicate that the lifetime risk may be increased ten-fold 47. A recent review recommended that clinicians treat a history of bipolar illness in a first degree relative (biological mother, father, or full sibling) as increasing the risk of developing bipolar disorder by a factor of 5.0 48,49. Bipolar history in a grandparent, aunt, uncle, or half sibling would confer half as much risk (e.g., 2.5 times higher) based on the data suggesting that bipolar is a polygenic illness. These changes in likelihood are large enough to be informative in clinical assessment. At the same time, they are not so large as to make a diagnosis of bipolar disorder automatic; in fact, most people with an affected relative will not have bipolar disorder themselves. Table 3 lists other risk factors for pediatric bipolar disorder. These risk factors are less well established than family history, but are sufficient to prompt additional assessment of the possibility of a bipolar diagnosis.
There are several concerns that arise with regard to family history as a risk factor relevant to diagnosing pediatric bipolar disorder. These include: (a) the fact that the literature cannot yet disentangle genetic versus shared environmental familial factors, (b) the low diagnostic accuracy about bipolar diagnoses in general will undermine the sensitivity of family histories of bipolar disorder, and (c) bipolar disorder has historically been underdiagnosed in minority groups in the U.S.A., with it often misdiagnosed as schizophrenia or antisocial behavior 50,51. For the purposes of formulating a diagnostic impression, it is not necessary to tease apart genetic versus environmental contributions. The poor sensitivity of family history means that failure to report a bipolar history cannot be assumed to be accurate, whereas positive reports of family history may be given greater credibility. The historical inaccuracy of bipolar diagnoses in minority groups means that clinicians need to inquire about mood symptoms whenever they hear about other mental health issues in relatives. Learning about prior treatment history of family members will also provide valuable information about attitudes towards treatment and adherence, and potentially about treatment response as well 52.
How best can a clinician utilize information such as a positive family history of bipolar disorder, or test results? Clinical decision making is usually done based on expertise and impressionistic synthesis of different pieces of information about the individual patient. Case formulation and diagnosis are highly technical skills that integrate multiple variables and involve considerable amounts of training. Within a typical assessment framework, knowledge about the family history becomes one more piece of data to blend into the global diagnostic impressions, increasing concern about the likelihood of bipolar disorder, yet not guaranteeing the diagnosis. Family history is a “red flag,” ideally triggering other assessment procedures and helping build the case for a bipolar diagnosis when other confirming evidence emerges.
It also is possible to use information about family history in a more quantitative manner. Evidence Based Medicine (EBM) advocates the use of Bayesian approaches for assessing the probability of a patient having a diagnosis 53. Bayesian methods focus on combining new information with the prior probability of a diagnosis in order to estimate a revised, posterior probability. Bayesian approaches have been available for centuries, but have not gained much popularity in clinical settings prior to the EBM movement. There are now a range of options for practitioners who want to use Bayesian methods. In addition to doing the computations by hand, there are also applets available on the Web or for personal digital assistants, and there are also “nomograms” that function like a probability slide rule, facilitating estimation of probabilities without requiring any computation, as shown in Figure 1.
Youngstrom and Duax 48 provide a detailed description of how to use a nomogram to estimate the probability of a youth having bipolar disorder when there is a family history of the illness. One first determines a starting probability, before considering the other information that will be synthesized with it. In the absence of any other information, the base rate of the diagnosis, as contained in Table 2, provides a helpful starting point 54. The base rate anchors the clinical decision with an objective consideration of whether bipolar disorder is going to be uncommon or fairly frequent in a clinical setting. The clinician locates the starting probability on the left-hand scale of the nomogram.
The middle line of the nomogram represents the impact of the new piece of assessment data, quantified as a “diagnostic likelihood ratio” (DLR) 53. Conceptually, the DLR indexes the change in risk of a condition by comparing the rate at which the assessment event (such as a positive family history, or a high test score) occurs in cases with bipolar disorder to the rate of occurrence for the same assessment event in cases without bipolar disorder. In other words, the DLR is the sensitivity of the assessment to bipolar disorder (out of 100 cases with bipolar disorder, how many would obtain a positive assessment result) divided by the false alarm rate (out of 100 cases that do not have bipolar disorder, how many would also “falsely” obtain a positive assessment result; this rate is the complement of the tool’s specificity to the diagnosis, i.e., 1 minus specificity). The DLR is the factor by which the odds of having a bipolar diagnosis change. The nomogram avoids the need to perform calculations to combine the starting probability with the DLR. Instead, the clinician finds the DLR value on the middle column of the nomogram, and then connects the dots between the first line (the starting probability) and the second line (the DLR) and extends the line across to the third, right-hand scale of the nomogram, which provides the revised probability estimate.
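The relationships that the nomogram encodes graphically can be written out as a short derivation. This is a sketch of the standard odds form of Bayes' rule; the subscripted symbols are notation introduced here for illustration and do not appear in the review.

```latex
\mathrm{DLR} = \frac{\text{sensitivity}}{1 - \text{specificity}}, \qquad
\mathrm{odds}_{\mathrm{prior}} = \frac{p_{\mathrm{prior}}}{1 - p_{\mathrm{prior}}}, \qquad
\mathrm{odds}_{\mathrm{post}} = \mathrm{odds}_{\mathrm{prior}} \times \mathrm{DLR}, \qquad
p_{\mathrm{post}} = \frac{\mathrm{odds}_{\mathrm{post}}}{1 + \mathrm{odds}_{\mathrm{post}}}
```

Here the prior probability corresponds to the left-hand scale of the nomogram, the DLR to the middle scale, and the posterior probability to the right-hand scale.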
For example, any clinician evaluating a youth coming to an outpatient clinic whose mother has been diagnosed with bipolar II could use the nomogram in the following manner. First, the clinician would select the base rate of bipolar disorder, either using local historical information about the rate of diagnosis at his/her clinic, or by finding a published estimate from a similar setting. The estimates listed in Table 2 suggest a base rate of 6% for bipolar spectrum disorders in outpatient clinics. Thus the clinician would put a dot at 6% on the left-hand line of the nomogram. Diagnosis of a bipolar disorder in a first degree relative increases the risk of a bipolar diagnosis in the youth by a factor of five to 10. The clinician opts to use the more conservative estimate, and marks the five on the middle line of the nomogram. Connecting the two dots and extending the line across to the right-hand side of the nomogram yields a probability estimate in the vicinity of 24%, indicating that the youth has approximately a one in four chance of having bipolar disorder. Alternately, this value can be interpreted as meaning that roughly 24 out of 100 youths presenting to an outpatient clinic with a family history of bipolar disorder will themselves meet criteria for a bipolar spectrum diagnosis. If the clinician had picked the more liberal estimate of a tenfold increase due to family history, then the resulting risk estimate would have been roughly 39%.
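The worked example above can also be checked by direct computation instead of a nomogram. The following is a minimal sketch, assuming the odds form of Bayes' rule; the function name `posterior_probability` and the variable names are hypothetical and chosen only for illustration.

```python
def posterior_probability(prior: float, dlr: float) -> float:
    """Combine a prior probability with a diagnostic likelihood ratio (DLR).

    Uses the odds form of Bayes' rule: posterior odds = prior odds * DLR.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    if dlr <= 0.0:
        raise ValueError("dlr must be positive")
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * dlr
    return posterior_odds / (1.0 + posterior_odds)


# Values taken from the example in the text: a 6% outpatient base rate and
# a family-history DLR of 5 (conservative) or 10 (liberal).
base_rate = 0.06
conservative = posterior_probability(base_rate, 5.0)
liberal = posterior_probability(base_rate, 10.0)

print(f"DLR 5:  {conservative:.0%}")   # about 24%
print(f"DLR 10: {liberal:.0%}")        # about 39%
```

Running the two estimates side by side is a simple form of sensitivity analysis: the starting assumptions change, and the resulting range of probabilities shows how much the conclusion depends on them.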
Comparing these two estimates illustrates several advantages of using the nomogram (or other Bayesian methods): (1) combining probabilities and risk factors is not an intuitive or linear process; (2) it is easy for clinicians to play “what if” scenarios by changing their starting assumptions or their choice of weight to assign to risk factors (referred to as “sensitivity analysis” in the EBM literature 53); and (3) the results from the nomogram fall along a continuum and communicate more accurate information about the degree of diagnostic certainty. One of the major pitfalls of diagnostic testing is that results are prone to misinterpretation, especially when test findings are treated as black and white statements about the patient’s status. The nomogram approach keeps the shades of gray. In this example, a “black and white” approach to testing would either focus on the test positive result (family history), or on the posterior probability being below 50%. Focusing only on the family history, or treating it as if it were synonymous with a bipolar diagnosis in the child, would be inaccurate in as many as 3 out of 4 cases. The alternative would be to conclude that family history is not diagnostically useful in outpatient settings, because even when present, most youths will still not have bipolar disorder. Even when conducting a sensitivity analysis using two different estimates of risk, the results are consistent in showing that this particular combination of factors puts the youth at moderate risk (24-39%) of having a bipolar spectrum illness. These numbers quantify the earlier statement that positive family history is a “red flag” that should initiate more comprehensive evaluation of a possible mood disorder.
The actuarial/statistical approach to interpreting assessment information is unfamiliar to most clinicians, and also contrasts sharply with more intuitive approaches to interpretation. However, the literature is unambiguous that simple statistical approaches, such as the nomogram method, consistently outperform typical clinical judgment 54,55. The superiority of statistical approaches has been demonstrated more than 130 times, in disciplines spanning economics and education as well as clinical decision-making 56. Cognitive science is beginning to elucidate reasons why even simple statistical approaches perform better. The culprits are “heuristics,” cognitive shortcuts that facilitate the rapid identification and interpretation of information 57,58. These heuristics help the brain process large volumes of complex information swiftly, but they also lead to systematic and predictable biases. The human brain pays attention to cues of risk, for example, and tends to err on the side of high sensitivity at the expense of false alarms. Though highly adaptive in situations where a failure to detect risk could result in death, the high sensitivity to risk may lead to overestimates of rare but risky events in clinical settings 59. A variety of other heuristics beset clinical judgment, including availability heuristics (such as noticing more bipolar symptoms in patients after repeatedly hearing about the rise in diagnosis in the popular press).
Many of these heuristics are likely to be relevant to the clinical diagnosis of bipolar disorder, suggesting that typical decision-making would be vulnerable to at least as much bias and error as described in the larger decision-making literature. In fact, emerging evidence indicates that clinical diagnoses often have low accuracy with regard to bipolar disorder, including long delays between the emergence and recognition of symptoms 60-62, cyclical trends where the diagnosis goes in and out of “fashion” compared to schizophrenia 63, low agreement with systematic research diagnoses of bipolar disorders 64, and large regional differences in the tendency to diagnose mania or in ratings of severity 65-67. Coding videotaped interviews revealed a strong tendency for American clinicians to rate manic symptoms as more severe than British or Asian Indian clinicians 66, and ratings of clinical vignettes showed that American psychiatrists were more likely to classify ambiguous clinical presentations as “bipolar” versus the rates identified by British clinicians 65. More encouragingly, another vignette study has found that people can learn the nomogram approach quickly, and that applying the nomogram to the same clinical vignette results in significantly more accurate estimates of bipolar risk, greater consistency and agreement about the degree of risk (i.e., much smaller range of opinion, and smaller standard deviations), and a marked reduction in overdiagnosis of bipolar disorder 67. Similar improvements in decision-making have been documented in numerous other areas of medicine 53.
Questionnaires and behavior checklists are an important tool in the kit of pediatric healthcare professionals. They offer an inexpensive, systematic way of gathering information, potentially from multiple sources (e.g., teachers as well as parents or youths). The instruments can cover a broad range of areas of functioning and impairment, or they can drill deeper into more narrowly defined areas, helping to clarify diagnosis or establish the severity of problems. No single instrument will be equally suited to all of these diverse applications. What follows is a brief overview of the evidence pertaining to questionnaires and checklists with regard to pediatric bipolar disorders.
“Broadband” checklists cover a wide range of behavior problems. Both empirically derived versions (e.g., the Achenbach System of Empirically Based Assessment, including the Child Behavior Checklist)68 and DSM-oriented versions (e.g., CSI)69 include subscales dealing with aggressive behavior, depression, anxiety, attention problems, social problems, and thought disorder or psychotic symptoms. The empirically derived versions also include superordinate scales that measure more global levels of externalizing and internalizing problems. Some versions provide scoring algorithms to map onto potential DSM diagnoses 69, and others include age and sex-based norms, comparing the level of behavior problems to typical levels of functioning for peers. Few of these instruments include a mania scale, reflecting the historical fact that most item pools were generated before there was concern about the possibility of pediatric bipolar disorder. The exceptions still tend to only include the mania items in the adolescent version 69 or only on the self-report (and not parent or teacher report) versions (BASC)70.
There are at least three major roles that a broad-band instrument can play in the context of evaluating PBD: (1) high externalizing scores can trigger further assessment substituting other procedures for evaluating mania; (2) low scores can substantially reduce the probability that a case has PBD; and (3) broadband measures provide an inexpensive method of gauging the range of associated problems and comorbidities frequently seen with PBD. The CBCL is the most thoroughly investigated measure with regard to PBD, and evidence consistently shows that youths with PBD show elevated scores on multiple scales, including the Externalizing Problems broadband score 71,72. However, in spite of PBD elevating average scores on several scales, from a diagnostic perspective it is the Externalizing Problems that convey the most information. After controlling for Externalizing scores, no other scale or combination of scales provides incremental validity 73. Most cases with PBD will show high Externalizing scores (i.e., they are sensitive to PBD), but high scores are also associated with many other conditions (i.e., they are not specific to PBD). This sets up an asymmetry, where low Externalizing scores are often decisive at ruling bipolar disorder out, but high scores are ambiguous 74. Because of this, high scores should be treated as another warning sign, leading to deeper investigation of potential bipolar disorder. On the other hand, low scores will often decrease the risk enough to effectively rule bipolar disorder out, unless there are several countervailing risk factors and clinical signs. Table 4 provides the DLRs associated with low, moderate, and high scores on the CBCL as well as the Achenbach TRF and YSR. More diagnostic information can be wrung from tests by estimating DLRs for multiple segments corresponding to low, medium, and high scores (as opposed to the common practice of setting a single threshold)53. 
Low scores on the CBCL are more powerful at reducing the probability of bipolar disorder (DLR = .04, a 25-fold reduction in the odds) than extremely high scores are at increasing it (DLR = 4, a four-fold increase in the odds)53. Finally, regardless of whether the behavior problems represent true comorbid diagnoses, versus elements of a “core phenotype” of pediatric bipolar disorder or secondary consequences of the illness, the other clinical syndrome scales on broadband measures provide valuable information about functioning and other potential targets for treatment. For example, severe attention problems and chronic hyperactivity often require adjunctive treatment with stimulants even after mood stabilization has occurred 75-77; and the social problems associated with PBD also respond well to targeted interventions 78,79.
Measures of mania for youths have proliferated over the last decade (see Table 4). They vary widely in terms of item content, reading level, and degree of validation. Although many are brief and most are in the public domain, the rarity of PBD and the false positive rates produced by all of the tests preclude recommending the use of any mania checklist as a core component of outpatient assessment batteries. However, the best available measures are markedly more specific to PBD than the broadband instruments, suggesting a cost-effective, two stage approach to assessment (see Figure 2)49. First, the clinician would gather general developmental history and family history, which would include an assessment of several risk factors for PBD. Second, they would give a broadband measure as a way of getting a rapid scouting report about a wide array of clinical domains. If the family history and externalizing scores both were low risk, then bipolar disorder is effectively ruled out (Figure 2). If either the family history is significant for bipolar disorder or the Externalizing score is high, then the clinician would supplement the assessment battery with a mania-specific measure. At present, the best validated and most discriminating instruments are the PGBI 33 and its 10 item mania form 80, the Parent MDQ 81,82, and the CMRS 83 and its 10 item form 84. Other instruments have either not performed as well, or they have not been validated under similarly generalizable clinical circumstances 85. These three instruments produce functionally interchangeable results in terms of diagnostic assessment. As new papers are published, test users should compare the instruments not only on their area under the curve (AUC) in receiver operating characteristic (ROC) analyses (which combines the diagnostic sensitivity and specificity into a single summary score), but also on the quality of the study, sample, and reporting 86.
The clinician can choose to interpret test scores from rating scales in a number of ways. The most common method is a categorical, impressionistic interpretation, where scores are grouped into “high” and “low” ranges. It is also possible to be more formal about the quantification of risk information conveyed by test results, using the same array of options as described earlier when interpreting family history. Current thinking in EBM is that the use of diagnostic likelihood ratios is a preferred strategy 53. The DLR for a test result is the percentage of cases with bipolar disorder scoring in a given range divided by the percentage of nonbipolar cases scoring in the same range. If a publication provides the sensitivity and specificity, then it is straightforward to convert these values into a pair of DLRs for scores above and below the threshold 53. Table 4 includes the DLRs for all available tests with regard to PBD at the time of writing.
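The conversion from sensitivity and specificity to a pair of DLRs can be sketched in a few lines of Python. This is a minimal illustration rather than a clinical tool: the function name is hypothetical, and the sensitivity and specificity values in the example are invented for demonstration and are not taken from Table 4.

```python
def dlrs_from_sensitivity_specificity(sensitivity, specificity):
    """Return (DLR for scores above the threshold, DLR for scores below it).

    sensitivity: proportion of bipolar cases scoring above the threshold.
    specificity: proportion of nonbipolar cases scoring below the threshold.
    """
    if not (0.0 < sensitivity < 1.0 and 0.0 < specificity < 1.0):
        raise ValueError("sensitivity and specificity must lie strictly between 0 and 1")
    # Above threshold: share of bipolar cases / share of nonbipolar cases in that range.
    dlr_above = sensitivity / (1.0 - specificity)
    # Below threshold: the same ratio for the complementary score range.
    dlr_below = (1.0 - sensitivity) / specificity
    return dlr_above, dlr_below


# Hypothetical test with 80% sensitivity and 90% specificity (illustrative only).
above, below = dlrs_from_sensitivity_specificity(0.80, 0.90)
print(f"DLR above threshold: {above:.1f}")
print(f"DLR below threshold: {below:.2f}")
```

With these illustrative inputs, a score above the threshold multiplies the odds of a bipolar diagnosis by about 8, and a score below it multiplies the odds by about 0.22.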
Perusal of the DLR values leads to several observations. Many of the DLRs associated with low test scores are lower than 1.0. A DLR of 1.0 indicates that the test result or risk factor did not change the probability of a bipolar diagnosis, because the score is equally likely to occur in both bipolar and nonbipolar reference groups. DLRs smaller than 1.0 reflect that the score is more likely to occur in nonbipolar cases, thus reducing the likelihood that the current client has bipolar disorder. Whereas a value of 2.0 would signify a doubling of the odds of a bipolar diagnosis, a value of 0.5 would convey a similar change in odds in the opposite direction. DLRs greater than 10 or smaller than 0.1 are often decisive pieces of information 53: They can change a prior probability of 50% (even odds) to more than 90% or less than 10% posterior probability. These benchmarks lead to the additional observation that available instruments are more powerful at decreasing the likelihood of bipolar disorder than at increasing it (i.e., there are many test results yielding DLRs less than 0.1, but few with DLRs greater than 10, and none that have been validated in samples with a high degree of clinical generalizability). Small DLRs can actually play a valuable role by reducing the tendency to overdiagnose PBD 1,2,67.
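These benchmarks follow from the odds form of Bayes' theorem: convert the prior probability to odds, multiply by the DLR, and convert back to a probability. The short Python sketch below (the function name is illustrative) reproduces the benchmarks starting from a prior probability of 50%.

```python
def posterior_probability(prior, dlr):
    """Update a prior probability with a single diagnostic likelihood ratio."""
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    prior_odds = prior / (1.0 - prior)         # probability -> odds
    posterior_odds = prior_odds * dlr          # the DLR multiplies the odds
    return posterior_odds / (1.0 + posterior_odds)  # odds -> probability


# Even odds (50%) combined with the benchmark DLRs discussed in the text.
for dlr in (0.1, 0.5, 1.0, 2.0, 10.0):
    print(f"DLR {dlr:>4}: posterior = {posterior_probability(0.50, dlr):.1%}")
```

A DLR of 10 moves even odds to roughly 91%, a DLR of 0.1 moves them to roughly 9%, and a DLR of 1.0 leaves the probability unchanged at 50%.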
To use these DLRs with a nomogram, one follows the same procedure as described with the interpretation of family history, only using the test score’s DLR as the estimate on the middle line of the nomogram 87. When multiple DLRs are available, such as when both family history information and a CBCL Externalizing score are available, then all pieces of information can be combined within the nomogram framework. The sequence does not matter. Family history could be considered first, or the test score; or the DLRs could be multiplied together and the product used instead during a single pass through the nomogram. Algebraically, these are all equivalent scenarios. This degree of flexibility is extremely valuable clinically, though, as often it will not be possible to obtain some pieces of assessment information for a specific case, and the order with which clinical data become available often varies across cases. In contrast, other actuarial methods such as lookup tables, logistic regression, or decision trees require that all the component variables be measured for each case, and that they be applied in a specified combination or sequence. Building on the earlier example, if a very high CBCL Externalizing score (T = 83) was added to the case with a positive family history of bipolar disorder, then the current probability (24%, based on a first degree relative increasing the risk of a bipolar diagnosis by a factor of 5) would be entered on the left-hand line of the nomogram. Table 4 provides the DLR associated with the T-score (DLR = 4.3 for an adolescent scoring this high on the CBCL). Combining the prior probability and the DLR algebraically yields an estimate of 58%.
Using the nomogram adds some imprecision, both because reference points need to be visually interpolated, and because of error connecting the dots; but clinical estimates using the nomogram still wind up being centered around the best estimate and are dramatically more precise than when clinicians interpret the same information impressionistically 67.
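The algebra behind the nomogram can be written out directly, which also makes the order-invariance of the DLRs easy to verify. The Python sketch below is illustrative only. The base rate of 6% is an assumption, back-calculated so that a five-fold increase in odds yields the 24% figure used in the example; the DLRs of 5 and 4.3 come from the text.

```python
from math import prod


def update_probability(prior, *dlrs):
    """Combine a prior probability with any number of DLRs, in any order."""
    odds = prior / (1.0 - prior)    # probability -> odds
    odds *= prod(dlrs)              # multiplication is commutative, so order is irrelevant
    return odds / (1.0 + odds)      # odds -> probability


BASE_RATE = 0.06      # assumed outpatient base rate (back-calculated, see lead-in)
FAMILY_DLR = 5.0      # first-degree relative with bipolar disorder
CBCL_DLR = 4.3        # very high CBCL Externalizing score (T = 83)

after_family = update_probability(BASE_RATE, FAMILY_DLR)
family_then_cbcl = update_probability(after_family, CBCL_DLR)
cbcl_then_family = update_probability(update_probability(BASE_RATE, CBCL_DLR), FAMILY_DLR)
single_pass = update_probability(BASE_RATE, FAMILY_DLR * CBCL_DLR)

print(f"After family history:     {after_family:.0%}")
print(f"Family history, then CBCL: {family_then_cbcl:.0%}")
print(f"CBCL, then family history: {cbcl_then_family:.0%}")
print(f"Single pass with product:  {single_pass:.0%}")
```

All three routes give the same probability, because each one multiplies the same prior odds by the same set of DLRs.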
Either a high score on a broadband measure or a positive family history of bipolar disorder would justify the addition of one of the more specific mania scales to the assessment process. However, when the same person fills out two rating scales, only one of them should be incorporated into the formal assessment process, whether it be the nomogram or another method of combining risks 14. The scores on the questionnaires will be highly correlated with each other by virtue of coming from the same source, and thus will yield redundant information. Treating multiple questionnaire scores as if each was introducing new information will create bias in the probability estimates. The bias can be substantial when the scores are highly correlated (r > .5), which will often be the case when the same person fills out multiple instruments, even when they measure different constructs 88. As a result, the clinician should take the most valid piece of information available from the informant and substitute it into the nomogram cycle, ignoring other scores from the same source 14. In our case example, the high CBCL score would cause the clinician to ask the caregiver to complete the Parent GBI. A very high score on this tool (e.g., a raw score of 51; see Table 4) has a DLR of 9.2. This would replace the DLR of 4.3 from the CBCL completed by the same caregiver. Combining the DLR of 9.2 with a prior probability of 24% (based on the family history and the base rate of PBD in outpatient settings) generates an estimate of 74% risk.
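The substitution rule can be illustrated with a short Python sketch that keeps one score per informant and then contrasts the resulting estimate with what would happen if both caregiver scores were naively multiplied together. The data structure and the `validity_rank` field are hypothetical conveniences for the illustration (1 marks the score judged most valid for PBD); the DLRs and the 24% prior come from the case example.

```python
def posterior_probability(prior, dlr):
    """Update a prior probability with a single diagnostic likelihood ratio."""
    odds = prior / (1.0 - prior) * dlr
    return odds / (1.0 + odds)


def best_score_per_informant(scores):
    """Keep only the most valid score (lowest validity_rank) from each informant."""
    best = {}
    for score in scores:
        current = best.get(score["informant"])
        if current is None or score["validity_rank"] < current["validity_rank"]:
            best[score["informant"]] = score
    return best


# Both questionnaires were completed by the same caregiver.
scores = [
    {"informant": "caregiver", "measure": "CBCL Externalizing", "dlr": 4.3, "validity_rank": 2},
    {"informant": "caregiver", "measure": "Parent GBI", "dlr": 9.2, "validity_rank": 1},
]

PRIOR = 0.24  # probability after family history, from the case example

chosen = best_score_per_informant(scores)["caregiver"]
recommended = posterior_probability(PRIOR, chosen["dlr"])

# Treating the correlated scores as independent overstates the risk.
naive = posterior_probability(PRIOR, 4.3 * 9.2)

print(f"Using {chosen['measure']} only: {recommended:.0%}")
print(f"Naively multiplying both DLRs: {naive:.0%}")
```

Using only the Parent GBI gives an estimate of about 74%, whereas multiplying both caregiver DLRs would push the estimate above 90%, which shows the size of the bias that redundant information can introduce.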
Permuting the possible combinations of DLRs from the combination of family histories and test scores yields a range of 8 to 24 probability estimates per test. When also accounting for the differences in base rate across clinical settings, the number of distinct probabilities will exceed 100 for each test × family history × setting combination. This reveals another advantage of the nomogram approach compared to generating tables of estimates for each configuration, or versus trying to weight the information sources intuitively. The tandem of family history and rating scales is powerful enough to move the probability estimate of a PBD diagnosis to less than 1% (no family history plus a low score on a broadband) or as high as 85% (using a more aggressive estimate of 10-fold increase in risk due to family history, plus a very high score on a PGBI or comparable tool). Thus, an evidence-based approach to assessment can rule PBD out in many cases, and reduce the tendency to overdiagnose bipolar disorder; but even high risk combinations are not accurate enough to replace a careful symptom-level assessment.
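The spread of estimates can be illustrated by enumerating a small grid of family history and test score combinations. The Python sketch below rests on two assumptions that should be checked against Table 4 and the local setting: a base rate of 6% (back-calculated from the 24% figure in the case example) and a DLR of 1.0 for the absence of family history (that is, treating a negative family history as uninformative). The remaining DLRs are those quoted in the text.

```python
from itertools import product


def posterior_probability(prior, *dlrs):
    """Combine a prior probability with one or more DLRs."""
    odds = prior / (1.0 - prior)
    for dlr in dlrs:
        odds *= dlr
    return odds / (1.0 + odds)


BASE_RATE = 0.06  # assumed outpatient base rate

FAMILY_DLRS = {
    "no family history (assumed DLR 1.0)": 1.0,
    "first-degree relative": 5.0,
    "first-degree relative, aggressive estimate": 10.0,
}
TEST_DLRS = {
    "CBCL Externalizing, low": 0.04,
    "CBCL Externalizing, very high": 4.3,
    "PGBI, very high": 9.2,
}

# One posterior estimate for every family history x test score pairing.
estimates = {
    (family, test): posterior_probability(BASE_RATE, family_dlr, test_dlr)
    for (family, family_dlr), (test, test_dlr) in product(
        FAMILY_DLRS.items(), TEST_DLRS.items()
    )
}

for (family, test), probability in sorted(estimates.items(), key=lambda item: item[1]):
    print(f"{probability:6.1%}  {family} + {test}")
```

Under these assumptions the lowest estimate falls well below 1% and the highest reaches about 85%, matching the range described above. Changing `BASE_RATE` shows how quickly the number of distinct estimates grows across settings.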
There has been a substantial amount of research on the validity of youth self report and teacher report as well as caregiver report (almost always mothers) about bipolar disorder. Findings consistently show the greatest validity for parent report, which shows significantly larger effect sizes than youth or teacher report in all published studies where the same instrument is available from multiple informants 73,82,89,90. Examining Table 4 reveals that the DLRs for parent report are consistently larger than the DLRs for youth or teacher report on the same instruments. The greater validity of parent report persists even when the parent has a diagnosed mood disorder 91.
That parent report outperforms self-report contradicts conventional wisdom that self-report is a better source of information about mood disorders 92. The lower validity of youth report appears to be due to a combination of mania compromising insight into one’s own behaviors 93 and manic symptoms tending to disturb others before the affected individual perceives them as problematic 94. The low validity of teacher report persists even when teachers complete mania-specific rating scales 95. Of note, the agreement between parents and youths or parents and teachers is actually significantly higher than typical for cross-informant agreement 91,96, and youths and teachers report significantly more behavior problems in PBD cases than would be predicted based on the parent’s level of concern alone 91. The challenge is the difficulty of intuitively appreciating what a cross-situational correlation of .2 or .3 might look like at a case level, so instances where dyadic agreement is actually good are often misinterpreted as one person having exaggerated concerns. Clinicians will often encounter families where the parent reports more mood issues than the teacher or youth. The evidence-based approach to these discrepancies is not to automatically discount the parent report, but rather to systematically gather additional information to evaluate the possibility of PBD 14. At a statistical level, youth and teacher report provide only modest (and often nonsignificant) incremental validity after controlling for parent report 73,94. However, the correlation between parents and youths or teachers is sufficiently low that these can be treated as functionally separate sources of information and combined within the nomogram framework. Evidence also suggests that cases where mood symptoms are noticeable across informants and settings may have greater impairment 97, also justifying the effort required to collect multiple perspectives.
Teacher report can add useful information about the degree of problems in the school setting 95, and youth report adds data about the degree of insight into problems and motivation for treatment, both providing helpful prescriptive information to guide intervention.
The clinical process has three different options with regard to diagnosis: Ruled out, sufficiently well-established that treatment for the condition should begin, or else possible but not established. In EBM, the probability of diagnosis theoretically ranges from 0 to 100%, and there are two thresholds that separate the three different clinical options (see Figure 2) 53. The Test-Wait threshold separates the low risk zone - diagnosis is effectively ruled out - from the indeterminate middle range. The Test-Treat threshold demarcates the zone where probability is high enough to initiate treatment. Figure 2 does not specify probability levels for these thresholds. In practice, the location of the threshold should take into consideration the risks and benefits associated with treatment, as well as patient preferences. There is a formal framework for collecting and incorporating these utilities into adjusted thresholds 98, with perhaps the easiest approach described in an EBM handbook 53. As diagrammed in Figure 2, the combination of family history and questionnaire data will be sufficient to rule bipolar disorder out when both indicate low risk. High risk cases, with positive family history and high scores on mania measures, will still fall below the Test-Treat threshold – especially when considering the stigma, treatment burden, and potential side effects attendant to recommended treatments for PBD 99. Within an EBM framework, probabilities falling between the Treat and Wait thresholds indicate the need for continued assessment. New assessment data then gets combined with the current probability until the revised probability crosses the Wait threshold (ruling the diagnosis out) or the Treat threshold (ruling the diagnosis in).
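The two-threshold logic amounts to a simple three-way classification of the current probability. The Python sketch below illustrates it. Because Figure 2 does not specify probability levels, the threshold values of 5% and 90% used here are purely hypothetical placeholders; in practice they would be set from the risks and benefits of treatment and from patient preferences.

```python
def assessment_zone(probability, wait_threshold, treat_threshold):
    """Classify a probability as 'wait', 'test', or 'treat'.

    'wait'  : below the Test-Wait threshold, diagnosis effectively ruled out.
    'test'  : between the thresholds, continue assessment.
    'treat' : at or above the Test-Treat threshold, initiate treatment.
    """
    if not 0.0 <= wait_threshold < treat_threshold <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= wait < treat <= 1")
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    if probability < wait_threshold:
        return "wait"
    if probability >= treat_threshold:
        return "treat"
    return "test"


# Hypothetical thresholds, chosen only to illustrate the mechanics.
WAIT_THRESHOLD = 0.05
TREAT_THRESHOLD = 0.90

for probability in (0.003, 0.24, 0.74, 0.95):
    zone = assessment_zone(probability, WAIT_THRESHOLD, TREAT_THRESHOLD)
    print(f"{probability:6.1%} -> {zone}")
```

Each new piece of assessment data updates the probability, and the classification is simply rerun until the estimate leaves the middle zone.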
Psychology has long had a model of “levels of intervention,” where primary preventions might be offered to everyone in order to avoid onset of an illness, secondary interventions might be offered only to targeted high-risk groups, and tertiary interventions would be deployed for cases already manifesting a disorder 100. Preventive measures need to be low-risk and low cost if they are going to be applied widely regardless of risk. Tertiary interventions can be higher risk and expense because they have been reserved for cases with established diagnoses. This “levels of intervention” model can be mapped onto the EBM diagnostic threshold model 15, as shown in Figure 2, right-hand side. Synthesizing these two models creates a set of assessment and treatment recommendations for each of the three ranges of probability for a bipolar diagnosis. Instead of labeling cases having mid-range probabilities as “indeterminate,” they can be called “moderate risk” and treatment using low-risk methods can start. Techniques such as psychotherapy, dietary supplementation, and improved sleep hygiene all might be tried for cases in this range. So long as the burden, risks, and costs are low, then treatments that are nonspecific or potentially preventative can be used, even while assessment continues and clarifies diagnostic impressions.
Even when multiple risk factors are present, a clinician cannot assume that a PBD diagnosis has been established. How best should they proceed? The following suggestions provide an overview of strategies that help to confirm the diagnosis in high risk cases, or to rule a bipolar diagnosis in or out for cases that fall in the intermediate range of risk.
Diagnostic interviews remain the standard of practice for determining clinical diagnosis. The typical unstructured diagnostic interview is prone to a variety of heuristics that render its reliability quite low 101. For PBD, circumstances are likely to worsen the already typically poor degree of inter-rater reliability, due to issues such as the lack of formal training in recognition of PBD, the usage of different operational definitions, and the controversy around the diagnosis. Structured diagnostic interviews avoid some of the shortcomings of informal interviews by providing systematic coverage of relevant symptoms and formal algorithms to make DSM diagnoses 101. However, most structured interviews were designed before PBD was considered a serious possibility in youths, with the consequence that many pediatric structured interviews do not include a mania module, and those that do might include few if any modifications to the probes or anchors to facilitate recognition in pediatric cases 14.
For this reason, semi-structured diagnostic interviews such as the KSADS have become the accepted standard for PBD research 102. Different versions of the KSADS have demonstrated good reliability and validity with regard to PBD 103,104. There are some obstacles hindering the widespread clinical adoption of semi-structured interviews. These include the necessity for extensive training, or else the semi-structured aspect opens the door for differences in clinical judgment to undermine the reliability 66, as well as the substantial amount of time required to administer and score the interview. A full KSADS can take anywhere from 2 to 8 hours to complete with a family, with administrations by experienced clinicians often averaging around 3 hours for typical cases. However, more streamlined versions of semi-structured interviews deserve serious consideration as a potential component of clinical assessment 105. The emerging literature around clinical diagnosis of PBD suggests that there may be great need for semi-structured approaches despite the increased expense and burden involved. Medicaid and other providers will often reimburse for the diagnostic assessment time if medical necessity has been demonstrated. The framework described here – starting with a combination of rating scales and family history - provides strong documentation of medical necessity for such additional evaluation. Clinicians who want to adopt semi-structured interviews as part of their assessment portfolio should pick an instrument that covers manic and depressive symptoms thoroughly, includes developmentally appropriate anchors, and supports the diagnosis of “spectrum” conditions (such as bipolar II, cyclothymic disorder, and bipolar NOS), as these will be more common than bipolar I, yet impairing enough to represent major clinical concerns.
Not all symptoms carry equal weight towards making a diagnosis of mania. The DSM-IV and ICD criteria give greater emphasis to elated mood than to irritable mood. Elated mood requires only three additional symptoms to support a diagnosis of mania, versus four additional symptoms for irritable mood 17. This policy acknowledges that elated mood has greater diagnostic specificity to mania, whereas irritable mood is diagnostically nonspecific. Research suggests that decreased need for sleep, unstable self esteem and grandiosity, hypersexuality, racing thoughts, and psychotic symptoms are all relatively specific to PBD 7. The sensitivity of each of these symptoms is low enough that none should be required for making a diagnosis of bipolar disorder, or else somewhere between a quarter and two thirds of bipolar cases might be excluded 7. Each of these symptoms is also liable to occur in at least one other condition likely to be encountered in many clinical settings. For example, hypersexuality can be a sign of sexual abuse, and inflated self-esteem is frequently seen in conduct disorder 7. However, a clinician can learn how these symptoms often manifest in the context of PBD 106, and careful probing around these symptoms is an important component of refining diagnostic impressions. Evidence that any of the possible manic symptoms occur episodically, as opposed to chronically, or that they fluctuate with changes in mood and energy, heightens the suspicion that they are due to a mood disorder rather than a more chronic condition such as ADHD 107. If the symptom occurs with an unusual Frequency, if the Intensity is excessive, if the Number of occurrences within an episode is extreme, or if the Duration is exceptional compared to age appropriate behavior, that also helps build the case in favor of a mood diagnosis 99,108.
Another crucial strategy to improve the detection of PBD is to extend the window of assessment beyond the conventional single session of intake assessment 49. Relying on a single panel of information focused on the presenting problem will rarely be enough to allow a firm diagnosis of PBD.
Gathering a developmental history is a routine component of pediatric assessment. It is especially helpful when evaluating the possibility of PBD. In addition to gathering data about the family psychiatric history, pre- and perinatal risk factors and complications and temperamental characteristics all deserve consideration 108. Developmental trajectories can help distinguish between chronic conditions such as ADHD (which is formally required to have onset before age seven) versus episodic mood presentations 107. Even though not all authorities concur that PBD will always have an episodic presentation, identifying episodic presentations still carries treatment utility by suggesting different intervention strategies 52,109. The most intensive form of retrospective information gathering would be to complete a retrospective life chart 110. The retrospective life chart is a tool that asks the family to reconstruct a week-by-week summary of the youth’s past mood and energy levels, using a variety of anchors and techniques to facilitate accurate recall. The retrospective life chart can yield valuable information about the chronicity versus episodicity of mood presentation, and it can help to identify triggering events. However, the time and effort involved is substantial, and retrospective memory is subject to several sources of bias. Clinically, the costs and possible benefits need to be balanced on an individual case basis before adding a retrospective life chart.
The other way of moving beyond a single-session intake is by extending the window of assessment forward in time. There are many means of accomplishing this. They include starting with a diagnosis of “rule out bipolar disorder” based on the initial intake, and following up with additional assessment to clarify the diagnosis. The EBM threshold model operationalizes this concept by indicating continued assessment for as long as the probability of PBD falls between the Wait and Treat thresholds 53.
Another approach is to shift to a “dental model” of assessment, where ongoing “checkups” are scheduled to gauge mood and energy over the course of treatment 49. At a minimum, these could consist of asking the patient at each visit about their mood and energy since the last visit. Alternately, the family could complete brief rating scales every few weeks over the course of treatment 76,77. The clinician could also quantify impressions using ratings such as the Children’s Global Assessment Scale (CGAS)111. At the most intensive, the clinician could suggest doing prospective life charting 110. In a prospective format, the patient records changes in mood and energy on a daily basis, and also notes any coinciding events. There are free prospective life charts for use with youths available on the web (Google “bipolar life chart” to find numerous examples), and sophisticated online versions are now available. If the family is willing and able to complete prospective life charts, then the information is highly useful for refining diagnosis and guiding treatment; but the demands of life charting exceed the resources and motivation of many families. The informal “mood and energy checkup” at each visit represents the minimum level of prospective information gathering that should be routine when working with mood disorders.
In addition to guiding diagnosis, a second crucial role of assessment tools is quantifying the severity of mood problems. More acute mood disturbance will require a different level of intervention services, with inpatient hospitalization providing the most intensive treatment for the most severe mood disturbance. Besides navigating the selection of treatment setting, the severity of the mood problems will help prescribe different treatment options. More severe mood problems will suggest the use of pharmacotherapy as a first line treatment, and combination treatments with both psychotherapy and one or more pharmacological agents may be needed to stabilize mood 99. Moderate levels of mood symptoms may result in lower dose interventions, including outpatient psychotherapy with longer intervals between sessions, or less aggressive dosing of medications. Assessments of severity are also crucial for establishing benchmarks against which to measure treatment response.
Clinical ratings of the severity of mood problems steer the treatment of PBD. Typically the assessment is informal, with all of the attendant limitations described in the discussion of informal diagnosis 101. There are global rating scales, such as the Global Assessment of Functioning 17, the CGAS 111 and the Clinician Global Impressions scale (CGI) 112, that assign a number to the clinician’s overall impression of functioning. There also is a bipolar version of the CGI, where the clinician rates manic and depressive symptoms separately 76.
The next level of sophistication would be to use a semi-structured clinical rating scale. The two most widely used in research with PBD are the Young Mania Rating Scale (YMRS)113 and the Children’s Depression Rating Scale-Revised (CDRS-R) 114. Both have shown evidence of good reliability and acceptable validity in pediatric samples 115-117. This is reassuring for the YMRS, which was not originally designed for use with children or as an interview 113. The YMRS and CDRS-R both omit symptoms that are DSM-IV criteria for mania or depression, and they omit other associated features that can be important in assessing the severity of mood disturbance 14. The YMRS does not include a grandiosity item, for example; nor does it measure threat of harm to self. There are newer rating scales designed specifically for use with children and adolescents, that provide more developmentally appropriate anchors, use consistent ratings across all items, and include all DSM symptoms of mood episodes. The KSADS-Mania Rating Scale and Depression Rating Scales have all of these refinements and show good psychometrics 118. Clinicians contemplating the use of mood rating scales should be aware that the interviews require a moderate amount of time (typically 15 to 45 minutes), and the semi-structured format is both a blessing and a bane in terms of often leading to sizeable differences in scoring of the same interview by different clinicians 66.
Checklists can complement clinician ratings in measuring severity. Checklists are inexpensive, require virtually no training to use, and incorporate little or no clinical judgment in scoring; they are the mirror-image of clinician ratings in each of these regards. Checklists also afford the assessment of a broader range of symptom domains than can generally be accomplished via clinician ratings. Parent checklists, in particular, have demonstrated a strong correlation with clinician-rated measures of severity, and they also show good sensitivity to treatment effects 76,77.
There is growing emphasis on quality of life as a vital aspect of the burden of illness and successful outcome of treatment. Several rating scales have been used to measure quality of life in the context of PBD 119,120. The KINDL is especially attractive for clinical use because it has two parent and three youth report versions that are developmentally staged, and because it can be used free of charge (http://www.kindl.org/indexE.html).
Many of the assessment tools discussed above can contribute to process measurement during treatment. Prospective life charts, in-session mood and energy checkups, or repeated administrations of brief rating scales can chart response over the course of intervention. Of the various checklists available, the 10-item versions are generally preferable for repeated administration due to the reduced burden. An exception is the MDQ 81: Although it is a good diagnostic aid, it does not capture information about the severity of current mood problems, so it is not useful as a process or outcome measure.
Prospective life charts also can provide similar information to the three-column and five-column charts used in cognitive behavioral therapy 121. All of these assessment devices ask the patient to chart fluctuations in mood as well as associated events. The three-column chart asks the patient to then also write down what they were thinking at the time, linking the cognition to the emotional response; and the five-column chart goes further by adding an alternative cognition and the emotional response it generates 121,122. For those families that are able to do prospective life charting, it is possible to seamlessly weave the components of three- and five-column charting into the recording and in-session discussions of the data.
Other aspects of treatment process that are important to assess include adherence, risk of harm to self and others, and side effects. Measures of adherence can include regularity of kept appointments, completion rates of homework assignments, and compliance with prescribed dosing regimens. Because mood disorder is a major risk factor for suicide and self harm, it is crucial to regularly assess potential suicidal ideation, as well as the presence of a plan and means 8. Similarly, the aggression and irritability that often manifest with PBD require regular, documented assessment of risk to others. Finally, the potential side effects for pharmacological treatments are both numerous and potentially quite serious, so they need careful patient education and ongoing monitoring 123. Although there are published rating scales for each of these domains, in general they are sufficiently cumbersome that it will usually be more practical to accomplish these goals via direct assessment by the clinician. At the same time, the clinician must follow through on assessing each of these domains and not fall into the trap of irregular assessment and poor documentation.
Many of the same questionnaires, checklists, and clinician rating scales used to evaluate severity also can provide good measures of outcome. The tools best suited for outcome assessment have good reliability, good content coverage of the relevant symptoms and key aspects of functioning, and are sensitive to treatment effects. Content that is useful for diagnosis may not be identical to the content most useful for outcome measures. For example, irritable mood is ambiguous when used for diagnostic purposes, but irritability is one of the most distressing and impairing features of PBD, and so it definitely merits a central role in outcome assessment. Conversely, two of the eleven items on the YMRS show weak validity when applied to pediatric cases (lack of insight, and bizarre appearance)117, and their inclusion probably dilutes the sensitivity of the YMRS to treatment effects. From a psychometric perspective, outcome measures can and should be longer than the process measures administered more frequently during treatment. The greater length improves reliability and thus the potential validity of the outcome measure 124, and the length is more likely to be tolerable if administered infrequently. Counterintuitively, statistical power to detect change can be increased by shifting more items to the post-test instead of the pre-test 125. Clinicians could take advantage of this by using lengthier outcome measures as part of the termination assessment, where families would not also be spending substantial amounts of time on diagnostic evaluations or administrative paperwork (such as insurance forms). Of the available instruments, the CBCL and the Parent and Adolescent GBI have the most established track record as outcome measures. The CMRS also appears promising.
Of the clinician rating scales, the YMRS and CDRS have by far the largest data base in the literature, but the KSADS MRS and DRS deserve consideration due to their advantages in terms of content coverage and developmental appropriateness.
In psychiatry outcome studies, treatment response has most commonly been defined using thresholds for percentage reductions in the severity of mood symptoms. For example, a 33% or 50% reduction in YMRS scores from baseline might define “response” to treatment 126,127. Such definitions of response are convenient, but they also have some major shortcomings. These limitations include (a) the fact that different patients will need to show varying amounts of change to qualify as a responder, depending upon their initial level of severity; (b) the fact that mania and depression often show different responses to treatment, and sometimes one might worsen at the same time that the other mood symptoms are improving; (c) percentage reductions in symptoms do not necessarily translate into syndromal remission or improvements compared to normative benchmarks; and (d) percentage reductions ignore the degree of precision of an instrument, potentially penalizing more accurate instruments because they are more reliable. The vulnerability of definitions of “treatment response” to unreliability could be a factor contributing to the high rates of placebo response observed in clinical trials.
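Shortcoming (a) can be made concrete with a little arithmetic. The following is a minimal sketch in Python that computes the percentage reduction from baseline and applies a 50% response rule. The function names and the two example patients' scores are hypothetical illustrations, not data from any study.

```python
def percent_reduction(baseline, current):
    """Percentage reduction in a symptom score relative to baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - current) / baseline


def is_responder(baseline, current, threshold_pct=50.0):
    """True when the reduction from baseline meets the threshold."""
    return percent_reduction(baseline, current) >= threshold_pct


# Two hypothetical patients who both improve by 10 raw points.
severe = is_responder(baseline=40, current=30)  # 25% reduction
mild = is_responder(baseline=20, current=10)    # 50% reduction
print(severe, mild)  # False True
```

The same 10-point improvement counts as a response for the milder case but not for the more severe one, which is the dependence on initial severity described above.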
Multiple refinements have been developed to address these shortcomings of the percentage change approach. These include comparisons to “rules of thumb” about thresholds for mania, hypomania, or depression; the empirical definition of thresholds that distinguish responders from nonresponders 128; the use of compound definitions of remission that integrate improvement on mania and depression simultaneously 76; and the articulation of formal clinical definitions of remission and recurrence 129. None of these has achieved clear dominance in the research arena, and few have permeated into clinical practice yet.
Perhaps the most fully articulated framework for evaluating outcomes is the “clinically significant change” model proposed by Jacobson and colleagues 130. Their definition of clinically significant change requires achieving two goals: demonstrating reliable improvement given the psychometric precision of the outcome measure, and also passing one of three normative benchmarks defined by the range of scores observed in clinical and nonclinical reference samples. Reliable change is tied to the standard error of the difference score, which is a direct measure of the instrument’s precision at measuring change. Jacobson advocated dividing the patient’s raw change score by the standard error of the difference 130. This would standardize the change score, converting it to a z-score (with a mean of zero and standard deviation of one, the same metric as the familiar Cohen’s d effect size). Jacobson called this standardized change score the “Reliable Change Index” (RCI). Two advantages of calculating the RCI are (a) that it facilitates comparison of the magnitude of treatment response across different measures (e.g., a 5 point reduction in the YMRS does not mean the same thing as a 5 point reduction in a CDRS-R score; but converting both to RCIs would make clear whether the patient’s mania or depression was responding more to treatment); and (b) that RCIs can be compared to established benchmarks drawn from the normal distribution. RCIs larger than 1.65 are big enough to be 90% certain that the patient is responding, and RCIs larger than 1.96 are enough to be 95% confident. Three drawbacks of the RCI are (a) that it is unfamiliar to most clinicians, (b) that it involves computation, and (c) that it requires knowledge of the standard error of the difference for the test, which is rarely available. However, all of these problems are tractable. A recent chapter presents the standard error of the difference for several commonly used outcome measures relevant to PBD 14.
Table 5 also includes “critical values” expressed in the raw or T-score metric that clinicians normally use, as a way of bypassing any computation. For example, if a patient shows at least 10 points of improvement on the YMRS, or 6 points on the CBCL Externalizing, then the clinician can be 90% sure that they are changing in this domain (or 95% sure that the patient is improving, one-tailed).
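The computation behind the RCI and the tabled critical values is short. The sketch below, in Python, assumes the conventional formula for the standard error of the difference (the square root of two times the standard error of measurement) and a scale on which lower scores mean fewer symptoms. The standard deviation and reliability used in the example are illustrative placeholders, not published values for the YMRS or any other instrument, and the function names are hypothetical.

```python
import math

Z_90 = 1.65  # benchmark cited in the text for 90% confidence
Z_95 = 1.96  # benchmark cited in the text for 95% confidence


def se_difference(sd, reliability):
    """Standard error of the difference score.

    Assumes SEM = sd * sqrt(1 - reliability) and
    S_diff = sqrt(2) * SEM.
    """
    sem = sd * math.sqrt(1.0 - reliability)
    return math.sqrt(2.0) * sem


def reliable_change_index(pre, post, s_diff):
    """Raw improvement (pre minus post) divided by S_diff."""
    return (pre - post) / s_diff


def critical_change(s_diff, z=Z_90):
    """Smallest raw change that reaches the chosen z benchmark."""
    return z * s_diff


# Illustrative placeholder values only (not taken from Table 5).
s_diff = se_difference(sd=10.0, reliability=0.82)
rci = reliable_change_index(pre=30, post=18, s_diff=s_diff)
print(round(s_diff, 2), round(rci, 2))        # 6.0 2.0
print(round(critical_change(s_diff), 1))      # 9.9
print(rci > Z_95)                             # True
```

A critical value is simply the z benchmark multiplied by the standard error of the difference, which is why a table of critical values lets the clinician skip the division entirely.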
The second part of the definition of clinically significant change entails demarcating benchmarks based on reference clinical and nonclinical samples. Jacobson defined three benchmarks, called simply A, B, and C. Youngstrom has suggested referring to them as Away from the clinical range, Back into the normal range, and Closer to the nonclinical than clinical mean 14. The Away threshold is set at two standard deviations below the mean for a clinical sample on the measure 130. For PBD, estimating the Away threshold would involve finding a sample of youths with PBD and then locating the score on the instrument that fell two standard deviations below the sample mean (roughly corresponding to the 2.5th percentile for cases with PBD). Similarly, the Back threshold is established by finding the score that falls two standard deviations above the mean for a nonclinical comparison group. The Closer threshold is found by estimating the weighted mean of the clinical and comparison means, weighting by the sample standard deviations. The technical manuals for many tests include the needed information to calculate the three thresholds, and Cooperberg 131 calculated the thresholds for several other measures pertinent to PBD. Table 5 includes the thresholds for several commonly used outcome measures.
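The three thresholds can be computed directly from the means and standard deviations of the reference samples. The Python sketch below assumes a scale on which higher scores indicate more symptoms, and it assumes the usual form of the Closer cutoff, in which each mean is weighted by the other sample's standard deviation. The sample statistics in the example are hypothetical and are not drawn from Table 5 or from any published sample.

```python
def away_threshold(clin_mean, clin_sd):
    """A: two standard deviations below the clinical mean."""
    return clin_mean - 2.0 * clin_sd


def back_threshold(norm_mean, norm_sd):
    """B: two standard deviations above the nonclinical mean."""
    return norm_mean + 2.0 * norm_sd


def closer_threshold(clin_mean, clin_sd, norm_mean, norm_sd):
    """C: weighted midpoint between the two means.

    Each mean is weighted by the other sample's standard deviation,
    so the cutoff sits nearer the mean of the less variable sample.
    """
    return (norm_sd * clin_mean + clin_sd * norm_mean) / (clin_sd + norm_sd)


# Hypothetical reference statistics for illustration only.
clin_mean, clin_sd = 28.0, 10.0
norm_mean, norm_sd = 5.0, 5.0

print(away_threshold(clin_mean, clin_sd))    # 8.0
print(back_threshold(norm_mean, norm_sd))    # 15.0
print(round(closer_threshold(clin_mean, clin_sd, norm_mean, norm_sd), 2))  # 12.67
```

A post-treatment score at or below a given threshold would pass that benchmark under these assumptions.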
Clinically significant change is frequently used to evaluate individual response in psychotherapy trials. It is a stringent definition of response, and it tends to produce lower estimates of response rates than the simple percentage symptom reduction approach 132. Because the definition is conservative, meeting any of the clinically significant change benchmarks is cause for celebration. Studying Table 5 also reveals that some definitions are impossible given the distributional characteristics of a measure. The Away threshold frequently would require obtaining negative scores, because the clinical mean falls within two standard deviations of the lowest possible score on the measure. Similarly, the Back definition would accept high scores (e.g., T-scores of 70 on Externalizing or Internalizing problems) as potentially reflecting clinically significant change. This might make sense when coupled with Reliable Change: a seven point reduction in Internalizing Problems from a 76 to a 69 could constitute a substantial improvement, for example. Even so, the thresholds identified by the Closer definition will often provide the most meaningful benchmarks for outcome evaluation 14.
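The Internalizing example can be worked through by combining the two parts of the definition. The Python sketch below uses the standard T-score norms (mean 50, standard deviation 10) to obtain the Back cutoff of 70. The critical value of 6 points is an assumption borrowed from the figure quoted earlier for CBCL Externalizing, the rule that a score exactly at the cutoff passes is also an assumption, and the clinical mean and standard deviation in the attainability check are hypothetical.

```python
def back_threshold(norm_mean, norm_sd):
    """Two standard deviations above the nonclinical mean."""
    return norm_mean + 2.0 * norm_sd


def is_clinically_significant(pre, post, critical_value, cutoff):
    """Both criteria, for scales where lower scores are better.

    reliable: improvement is at least the critical value.
    crossed:  post-treatment score is at or below the cutoff
              (treating a score exactly at the cutoff as passing
              is an assumption of this sketch).
    """
    reliable = (pre - post) >= critical_value
    crossed = post <= cutoff
    return reliable and crossed


def is_attainable(cutoff, min_possible_score):
    """False when the cutoff lies below the scale's lowest score."""
    return cutoff >= min_possible_score


back = back_threshold(norm_mean=50, norm_sd=10)  # T-score norms
print(back)  # 70.0
print(is_clinically_significant(76, 69, critical_value=6, cutoff=back))  # True

# Hypothetical clinical sample: mean 20, SD 12, on a scale starting at 0.
away = 20 - 2 * 12
print(away, is_attainable(away, min_possible_score=0))  # -4 False
```

The last two lines illustrate how an Away threshold can fall below the lowest possible score, making that benchmark unreachable on the measure.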
In summary, outcome evaluation is usually done informally, if at all 101. However, a variety of assessment tools and definitions of outcome are now available for clinicians working with PBD. Using formal outcome assessment will help better gauge treatment response and enable comparisons between clinical practice and the outcomes described in the literature. The clinically significant change model addresses many of the technical shortcomings of other outcome definitions, and it can now be used with many, but not all, outcome measures relevant to PBD. The necessary benchmarks are not available for the CMRS or KSADS-MRS, for example. On the other hand, all of the different definitions could be applied to commonly used broadband and public domain measures of mania and depression using the information contained in Table 5.
Longitudinal data indicate that PBD tends to show a recurrent course 10,11, similar to the high rates of relapse observed in adult samples 8,20. As treatment progresses, an important component will be planning strategies for monitoring against relapse. Reviewing the life charts and three- or five-column charts would expedite making a list of events likely to trigger exacerbation of mood. There are also normative developmental transitions that are likely to elicit major changes in mood, including the onset of puberty, leaving home, and major role changes such as graduating or failing out of school. Although not all events can be anticipated, many can; and a simple assessment strategy would be to list the likely events and then plan for ways of evaluating mood status when the events arise.
The later phases of treatment can also be a good time to identify warning signs of relapse, or “roughening” of mood 133. It is helpful to have manic symptoms under control in order to increase insight into the illness. It also is desirable to have some experience working together with the patient, as many of the cues of relapse are not diagnostic symptoms per se, but rather idiosyncratic aspects of the person’s functioning 134,135. A major goal would be to develop warning signs that the patient would use and trust even when getting hypomanic, with the attendant feelings of wellness and loss of insight. An example would be how psychoeducation interventions often help the client learn to use the family or a trusted roommate as a “watchdog” that helps monitor mood 136.
Given the geographic mobility of both patients and clinicians, a third assessment strategy worth considering is preparing a “care package” for the next clinician as part of the maintenance planning. This care package could be based on a review of the components of treatment, with the patient and practitioner candidly evaluating which were helpful and which were not. Preparing a list of medications tried and responses or adverse events, and a list of therapeutic or lifestyle manipulations and their perceived effectiveness would avoid a lot of guesswork and missteps when resuming active treatment in the aftermath of relapse. Patients and practitioners could keep copies of the documentation to increase the chances that it would be available when needed.
A great deal of research is investigating methods that could soon contribute to the diagnostic assessment of PBD. Some of the most exciting research includes neurocognitive testing, functional imaging techniques, and genetic testing. In spite of the promise, expert consensus is that none of these methods are currently ready for clinical use. Recent reviews of neurocognitive tests 137 and neuroimaging 138,139 have found that the replicated patterns of functioning in PBD tend not to be specific to bipolar disorder, but instead overlap with patterns of functioning seen in ADHD, schizophrenia, and other conditions.
Similarly, the present evidence indicates that bipolar disorder is a polygenic condition, with multiple genes each contributing small increases in the risk of developing the disorder 140. However, several companies now are marketing direct-to-consumer genetic testing, purportedly including tests for bipolar disorder (e.g., https://psynomics.com/). This has provoked strong criticism from the academic community, arguing that it is premature to market tests for bipolar disorder and that the results are highly prone to misinterpretation 141. The opportunity for misunderstanding is greatest when the results of any of these new methods are treated as yes/no, positive or negative tests for bipolar disorder. This review has demonstrated that although family history, questionnaires, and rating scales are also imperfect measures, once their biases and limitations become known, it is still possible to assimilate them into an evidence-based framework for assessment.
The diagnosis of pediatric bipolar disorder remains controversial in clinical practice. Evidence suggests that it is often diagnosed when not present, and yet many cases of bipolar disorder are also missed. Despite the aura of controversy, there is considerable consensus among experts about the validity of DSM-IV based diagnoses in youth. Marked progress has also been made in validating and honing assessment strategies for PBD. As this review reveals, it is possible to use the literature to inform choices of assessment techniques that contribute to diagnosis, treatment selection, monitoring progress, evaluating outcomes, and long term monitoring and relapse prevention. The review also synthesizes the available literature with the clinical decision-making framework advocated in Evidence Based Medicine and in the clinically significant change literature in psychotherapy research. Both of these frameworks emphasize decision-making about individuals, rather than groups of patients. In consequence, these models speak much more directly to individual clinical care than past research typically has done.
There is no aspect of the assessment of PBD that has been “perfected.” Every assessment tool could be bettered, and each technique offers room for improvement. At the same time, the number of tools and ideas now available for clinicians to apply immediately in their clinical practice is impressively large and diverse. Outside the arena of diagnosis, where recent studies suggest that the contributions could be huge, the potential gains from employing evidence-based strategies have so far only been hinted at. The goal of clinical research is to answer gaps and uncertainties in clinical practice, and research in PBD is poised to support rapid advances in clinical assessment.